The dataset for a vision-based unmanned kiosk

Last updated: Wed, 10/20/2021 - 08:53


In recent years, many unmanned retail stores have been introduced with the development
of computer vision, sensors and wireless technology. However, the trend has slowed due to the costs
associated with implementation. A vision-based kiosk, which is only equipped with a vision sensor, can
be a compact and affordable option. The vision-based kiosk detects a product via CNN-based object
detectors and performs instance-level classification by aggregating detections from multiple cameras.
Many public datasets exist for retail object recognition. However, the products in those datasets are captured under specific conditions, such as against a white background, on a shelf, or as a pack shot. We provide a dataset constructed solely for the vision-based unmanned kiosk. The dataset consists of 142,420 images collected with the prototype unmanned kiosk system. In the system, multiple cameras are installed to capture an object from various viewpoints. To reflect the actual purchasing process, customers in the scene pick up and return retail products.


The proposed kiosk system consists of three levels of racks, and each level has two cameras, for a total of six cameras. We assume ‘instance-level’ inference, in which six frames, one from each camera, are treated as a single instance. Each image therefore has a camera id and an instance id.
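
To make the idea of instance-level inference concrete, the following is a minimal sketch of how per-camera detections might be combined into one decision. The description above does not specify the aggregation rule, so summing confidences per class is an assumption made purely for illustration, as are the function name `aggregate_instance`, the 0 to 5 camera numbering, and the sample values.

```python
from collections import defaultdict

NUM_CAMERAS = 6  # three rack levels x two cameras, as described above


def aggregate_instance(detections):
    """Combine per-camera detections of one instance into a single class.

    `detections` is a list of (camera_id, class_id, confidence) tuples.
    Summing confidences per class is an assumed rule, not the authors' method.
    """
    scores = defaultdict(float)
    for camera_id, class_id, confidence in detections:
        # Camera ids are assumed to run from 0 to 5.
        if not 0 <= camera_id < NUM_CAMERAS:
            raise ValueError(f"unexpected camera id: {camera_id}")
        scores[class_id] += confidence
    if not scores:
        return None  # no camera detected anything for this instance
    # Pick the class with the largest accumulated confidence.
    return max(scores, key=scores.get)


# Hypothetical detections: three cameras vote for class 3, one for class 7.
example = [(0, 3, 0.9), (1, 3, 0.6), (2, 7, 0.8), (5, 3, 0.4)]
print(aggregate_instance(example))
```

Other rules, such as a plain majority vote, would fit the same interface; the choice here is only for demonstration.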

For ground truth, we followed the ‘darknet’[1] file format. For each .jpg file, there is a .txt file with the same name in the same directory.

The .txt file lists the object class and object coordinates for the image, with one object per line:

<object-class> <x_center> <y_center> <width> <height>


<object-class> - integer object number from 0 to (classes-1)

<x_center> <y_center> <width> <height> - float values relative to the width and height of the image, in the range (0.0, 1.0]
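
The annotation format above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the names `Box`, `label_path_for`, `parse_labels` and `to_pixel_corners` are hypothetical, and the 640x480 image size in the example is an assumed value chosen only to show the conversion from relative to pixel coordinates.

```python
from pathlib import Path
from typing import List, NamedTuple


class Box(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def label_path_for(image_path):
    """Return the .txt path that sits next to a .jpg with the same name."""
    return Path(image_path).with_suffix(".txt")


def parse_labels(text: str) -> List[Box]:
    """Parse the contents of one label file, one object per line."""
    boxes = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        boxes.append(Box(int(fields[0]), *map(float, fields[1:])))
    return boxes


def to_pixel_corners(box, image_width, image_height):
    """Convert a relative box to (left, top, right, bottom) in pixels."""
    half_w = box.width * image_width / 2
    half_h = box.height * image_height / 2
    cx = box.x_center * image_width
    cy = box.y_center * image_height
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


# Made-up label content with two objects; the image size is assumed.
sample = "2 0.5 0.5 0.25 0.5\n0 0.1 0.2 0.2 0.4\n"
boxes = parse_labels(sample)
print(to_pixel_corners(boxes[0], 640, 480))
```

For a real file, `parse_labels(label_path_for(image_path).read_text())` would load the annotations belonging to `image_path`.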


The zip file “train” contains two training datasets. The first is a ‘view-based’ dataset to which the proposed view-based annotation is applied. The second is a ‘conventional’ dataset, which is annotated in the typical manner with only superclasses. The two datasets share identical images but have different label sets produced by the two annotation schemes. Each dataset has 12,323 images with ground truth *.txt files.


The zip file “valid” contains two validation datasets, organized in the same way as the training datasets. Each dataset has 11,859 images with ground truth *.txt files.


The zip file “test” contains 10,306 instances. There are no ground truth labels; only the superclass of an instance is given, as the name of its folder. There are 61,836 images in total.
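
Because the superclass is available only as a folder name, it has to be recovered from the file path. The sketch below groups images by the name of their immediate parent folder. The exact directory layout inside the zip file is not described here, so treating the parent folder as the superclass folder is an assumption, and the folder names `cola` and `chips` are made up for the demonstration.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def images_by_superclass(root):
    """Map each folder name (assumed to be a superclass) to its .jpg files."""
    groups = defaultdict(list)
    for image_path in sorted(Path(root).rglob("*.jpg")):
        groups[image_path.parent.name].append(image_path)
    return dict(groups)


# Build a tiny stand-in directory tree so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("cola", "a.jpg"), ("cola", "b.jpg"), ("chips", "c.jpg")]:
        path = Path(tmp) / folder / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    counts = {k: len(v) for k, v in images_by_superclass(tmp).items()}
    print(counts)
```

If the archive nests instances inside the superclass folders, the grouping key would need to be taken from a higher path component instead.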


The zip file “search_data” contains 5,370 instances. The dataset can be used to collect subclass detection results. There are no ground truth labels; only the superclass of an instance is given, as the name of its folder. There are 32,220 images in total.


[1] :


Edit: The total number of images in the dataset is 142,420, as there are 32,220 images in the search dataset.
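
The corrected total can be checked from the per-file counts given above. The short calculation below assumes that the training and validation images are counted once per label set (that is, twice each); this is an interpretation, but it is the one under which the figures add up to the stated total.

```python
IMAGES_PER_INSTANCE = 6  # six cameras capture each instance

train_images = 12_323   # per training dataset (there are two)
valid_images = 11_859   # per validation dataset (there are two)
test_instances = 10_306
search_instances = 5_370

test_images = test_instances * IMAGES_PER_INSTANCE
search_images = search_instances * IMAGES_PER_INSTANCE
# Assumed: train and valid images are counted once per label set.
total = 2 * train_images + 2 * valid_images + test_images + search_images

print(test_images, search_images, total)
```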

Submitted by JI YEA CHON on Wed, 06/22/2022 - 08:53