This study describes the creation of a synthetic dataset designed to emulate real-world retail scenarios for machine learning (ML) evaluation. The dataset was generated in Python with 39,500 unique product identities and approximately 70,000 samples in total, distributed across the product identities according to a chi-square distribution. Each product was assigned a set of attributes: weight, aisle and shelf numbers, and a restocking threshold. The dataset also records the time elapsed since the last restocking of each product, giving a more complete view of the retail environment.
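As an illustration of the generation step, the following is a minimal sketch rather than the study's actual script. It assumes NumPy and pandas, and every column name, value range, and the chi-square degrees of freedom (`dof`) is a hypothetical choice made for the example. Each product receives one sample, and the remaining samples are allocated with probabilities proportional to chi-square draws.

```python
import numpy as np
import pandas as pd


def generate_dataset(n_products=39_500, n_samples=70_000, dof=4.0, seed=0):
    """Build a synthetic retail dataset (illustrative assumptions throughout)."""
    rng = np.random.default_rng(seed)

    # Product-level attributes; the ranges are arbitrary placeholders.
    products = pd.DataFrame({
        "product_id": np.arange(n_products),
        "aisle": rng.integers(1, 51, size=n_products),
        "shelf": rng.integers(1, 11, size=n_products),
        "restock_threshold": rng.uniform(0.5, 5.0, size=n_products),
    })

    # Chi-square draws, normalised into per-product sampling probabilities.
    raw = rng.chisquare(dof, size=n_products)
    probs = raw / raw.sum()

    # One guaranteed sample per product, the rest allocated by probability.
    extra = rng.multinomial(n_samples - n_products, probs)
    counts = 1 + extra
    ids = np.repeat(np.arange(n_products), counts)

    samples = products.iloc[ids].reset_index(drop=True)
    # Sample-level observations (again, placeholder distributions).
    samples["weight"] = rng.uniform(0.0, 10.0, size=len(samples))
    samples["hours_since_restock"] = rng.exponential(24.0, size=len(samples))
    return samples
```

Calling `generate_dataset()` with the defaults yields 70,000 rows covering all 39,500 product identifiers; smaller arguments are convenient for quick checks.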
Preprocessing was a critical part of dataset preparation. It included feature engineering, in which new variables were added to the dataset: a binary indicator of whether a product's weight on the shelf is below a certain threshold, and the time elapsed since the last restock. These features were intended to improve ML model performance by providing additional, relevant information.
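The two engineered features could be derived along the following lines. This is a sketch under stated assumptions: the column names (`weight`, `restock_threshold`, `observed_at`, `last_restock_at`) are hypothetical, and the threshold is assumed to be the product's restocking threshold, which the text does not state explicitly.

```python
import pandas as pd


def add_features(df):
    """Return a copy of df with the two engineered columns added."""
    out = df.copy()

    # Binary indicator: 1 if the shelf weight is below the (assumed) threshold.
    out["below_threshold"] = (out["weight"] < out["restock_threshold"]).astype(int)

    # Elapsed time since the last restock, in hours (assumed timestamp columns).
    elapsed = pd.to_datetime(out["observed_at"]) - pd.to_datetime(out["last_restock_at"])
    out["hours_since_restock"] = elapsed.dt.total_seconds() / 3600.0
    return out
```

Working on a copy keeps the raw frame unchanged, so the engineered columns can be regenerated if the threshold definition changes.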
The initial dataset was imbalanced with respect to the 'need_restock' label. To address this, the sklearn resample utility was used to undersample the majority class down to the minority class count, yielding a balanced dataset with 35,040 samples per class. The dataset was then shuffled so that sample order could not bias ML model evaluation.
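The balancing and shuffling steps might look like the sketch below. It uses `sklearn.utils.resample` as named in the text; the function name, the assumption of a binary label, and the seed are illustrative choices, not details taken from the study.

```python
import pandas as pd
from sklearn.utils import resample


def balance_by_undersampling(df, label="need_restock", seed=42):
    """Undersample the majority class to the minority count, then shuffle."""
    counts = df[label].value_counts()
    minority_value = counts.idxmin()
    n_minority = int(counts.min())

    minority = df[df[label] == minority_value]
    majority = df[df[label] != minority_value]  # assumes a binary label

    # Draw without replacement so no majority row is duplicated.
    majority_down = resample(
        majority, replace=False, n_samples=n_minority, random_state=seed
    )

    balanced = pd.concat([minority, majority_down])
    # Shuffle the rows so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Fixing `random_state` makes the undersampled subset and the row order reproducible across runs.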