EvIs-Kitchen

Citation Author(s):
Yuzhe Hao, Tokyo Institute of Technology
Koichi Shinoda, Tokyo Institute of Technology
Submitted by:
Yuzhe Hao
Last updated:
Mon, 07/08/2024 - 15:58
DOI:
10.21227/7nkz-9p74

Abstract 

The Egocentric Video and Inertial Sensor data Kitchen activity dataset (EvIs-Kitchen) is the first V-S-S interaction-focused dataset for the ego-HAR (egocentric human activity recognition) task.

It consists of sequences of everyday kitchen activities involving rich interactions among the subject's body, objects, and the environment.

Besides the egocentric videos recorded by a GoPro camera, our dataset also includes inertial sensor data recorded by Fitbit watches worn on the subject's wrists. The sensor data are synchronized and correlated with the video data stream.

In total, our dataset contains 4,527 action samples from 12 subjects and 7 recipes, labeled with 35 verb classes and 56 noun classes.

Instructions: 

 

This documentation is also available on the Introduction page: https://yuzhehao.github.io/EvIs-Kitchen-Introduction/

 

Our dataset contains 4 major folders: /Annotation, /Video, /RGB-frames, and /Sensor

/Annotation:

The annotations for all action segments are in one CSV file. Each line in this file is the annotation for one sample:

  • narration_id ("S01R01_011"): "S01" means this action is from subject-1. "R01" means it is from recipe-1. The following "011" is the index of this action in the entire cooking process.
  • verb ("crack"): The Verb label of this action segment.
  • noun ("egg"): The Noun label of this action segment.
  • start_frame (4215): The frame index (in both the RGB-frames sequence and the Sensor sequence) at which this action starts.
  • stop_frame (4394): The frame index (in both the RGB-frames sequence and the Sensor sequence) at which this action ends.
  • start_time (02:20.5): The time in the Video at which this action starts.
  • stop_time (02:26.4): The time in the Video at which this action ends.
  • temporal_length (5976): The duration of this action, in milliseconds.
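
The fields above can be read with Python's standard csv module. The following is a minimal sketch, not the authors' loader. It assumes that the CSV file has a header row whose column names match the field names listed above; the sample row is built from the example values in that list, and the names Action, parse_row, and load_actions are hypothetical.

```python
import csv
import io
import re
from dataclasses import dataclass

# Hypothetical sample built from the example values in the field list.
# Assumption: the real CSV has a header row with these column names.
SAMPLE_CSV = """narration_id,verb,noun,start_frame,stop_frame,start_time,stop_time,temporal_length
S01R01_011,crack,egg,4215,4394,02:20.5,02:26.4,5976
"""

# "S01R01_011" -> subject 1, recipe 1, action index 11
NARRATION_ID = re.compile(r"^S(\d+)R(\d+)_(\d+)$")


@dataclass
class Action:
    subject: int
    recipe: int
    index: int
    verb: str
    noun: str
    start_frame: int
    stop_frame: int
    duration_ms: int


def parse_row(row):
    """Convert one CSV row (a dict of strings) into an Action."""
    match = NARRATION_ID.match(row["narration_id"])
    if match is None:
        raise ValueError(f"unexpected narration_id: {row['narration_id']!r}")
    subject, recipe, index = (int(group) for group in match.groups())
    return Action(
        subject=subject,
        recipe=recipe,
        index=index,
        verb=row["verb"],
        noun=row["noun"],
        start_frame=int(row["start_frame"]),
        stop_frame=int(row["stop_frame"]),
        duration_ms=int(row["temporal_length"]),
    )


def load_actions(fileobj):
    """Read all annotation rows from an open text file object."""
    return [parse_row(row) for row in csv.DictReader(fileobj)]


# For a real file, pass open(path, newline="") instead of io.StringIO.
actions = load_actions(io.StringIO(SAMPLE_CSV))
print(actions[0])
```

Splitting narration_id into numeric fields makes it straightforward to group samples by subject or recipe, for example when building subject-wise train/test splits.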

/Video:

The original raw videos recorded by the GoPro camera, at 1920x1080 resolution and 60 fps. Each MP4 file covers the complete process of one subject cooking one of the recipes, and contains many action segments.

/RGB-frames:

The 30 fps frame sequences extracted from the long videos in the /Video directory. Each folder contains the frame sequence for the corresponding long cooking video.

All frame images are resized to 228x128 to reduce redundancy and lower GPU memory usage during training.
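
Because the annotated frame indices refer to the 30 fps sequence while the raw videos run at 60 fps, it can help to convert between the two. The sketch below is illustrative only. It assumes that frame index 0 corresponds to time 0, that timestamps are truncated (not rounded) to tenths of a second, and that the RGB frames were taken uniformly from the raw video; none of these are stated in the description. The function names are hypothetical.

```python
RGB_FPS = 30    # frame rate of /RGB-frames and /Sensor, per the description
VIDEO_FPS = 60  # frame rate of the raw videos in /Video, per the description


def frame_to_timestamp(frame_index, fps=RGB_FPS):
    """Format a frame index as MM:SS.s, truncating to tenths of a second.

    Assumption: frame index 0 corresponds to time 0.
    """
    tenths = frame_index * 10 // fps          # integer tenths of a second
    minutes, remainder = divmod(tenths, 600)  # 600 tenths per minute
    return f"{minutes:02d}:{remainder // 10:02d}.{remainder % 10}"


def rgb_to_video_frame(frame_index):
    """Map a 30 fps frame index to the matching 60 fps raw-video frame index.

    Assumption: the RGB frames are a uniform subsample of the raw video.
    """
    return frame_index * VIDEO_FPS // RGB_FPS


print(frame_to_timestamp(4215))   # start_frame of the annotation example
print(frame_to_timestamp(4394))   # stop_frame of the annotation example
print(rgb_to_video_frame(4215))
```

Under these assumptions, the annotation example's frame indices 4215 and 4394 map to 02:20.5 and 02:26.4, which matches the start_time and stop_time values given for that sample.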

/Sensor:

The 30 fps inertial sensor data recorded by the Fitbit watches, stored in npy format. Each npy file contains the complete sensor data sequence for the corresponding long cooking video.

For the sensor data sequence, the shape of each frame is (2, 10). The first dimension indexes the hand, in the order [left, right]. The second dimension holds the 10 inertial sensor values: 3-axis accelerometer, 3-axis gyroscope, and 4-value orientation. Their order is: [acc-x, acc-y, acc-z, gyro-x, gyro-y, gyro-z, ori-a, ori-b, ori-c, ori-d]
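
As an illustration of this layout, the sketch below cuts one action segment out of a sensor sequence and separates the channels for one hand. It uses a synthetic array in place of a real file. It assumes that a loaded npy file has shape (num_frames, 2, 10), that start_frame and stop_frame index the first axis directly, and that stop_frame is inclusive; these details are not confirmed by the description. The helper names are hypothetical.

```python
import numpy as np

LEFT, RIGHT = 0, 1  # order of the first per-frame dimension: [left, right]

# Synthetic stand-in for one sensor file; a real file would be read with
# np.load(path). Assumption: the loaded array has shape (num_frames, 2, 10).
num_frames = 5000
rng = np.random.default_rng(0)
sensor = rng.normal(size=(num_frames, 2, 10)).astype(np.float32)


def slice_action(sequence, start_frame, stop_frame):
    """Return the frames of one action segment.

    Assumption: frame indices index the first axis directly and
    stop_frame is inclusive.
    """
    return sequence[start_frame:stop_frame + 1]


def split_channels(segment, hand):
    """Split one hand's 10 values into accelerometer, gyroscope, orientation."""
    acc = segment[:, hand, 0:3]    # acc-x, acc-y, acc-z
    gyro = segment[:, hand, 3:6]   # gyro-x, gyro-y, gyro-z
    ori = segment[:, hand, 6:10]   # ori-a, ori-b, ori-c, ori-d
    return acc, gyro, ori


# Frame range taken from the annotation example (start 4215, stop 4394).
segment = slice_action(sensor, 4215, 4394)
acc, gyro, ori = split_channels(segment, LEFT)
print(segment.shape, acc.shape, gyro.shape, ori.shape)
```

Since the sensor sequence and the RGB-frames sequence are both at 30 fps, the same start_frame and stop_frame values can be used to select the matching image frames for a segment.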