BEVTrack Benchmark

- Submitted by: Zekun Qian
- Last updated: Sat, 02/08/2025 - 00:19
- DOI: 10.21227/kx99-0x86
Abstract
The BEVTrack benchmark consists of two complementary datasets: CVMHAT-BEVT and BEVT-S, designed for multi-view human tracking with bird's-eye view (BEV) capabilities. CVMHAT-BEVT is adapted from the public CVMHAT dataset, featuring 21 synchronized multi-view videos across five scenes, with 7-12 individuals per scene and video durations ranging from 200 to 1,500 frames. Each scene contains 2-4 side views and includes synchronized BEV footage captured by drones, along with annotated bounding boxes and unified ID numbers for all subjects across views.
To address the limitations of real-world datasets, we introduce BEVT-S, a large-scale synthetic dataset created using the Unity 3D engine and the PersonX model library. BEVT-S comprises five diverse scenes, each spanning 25m × 25m and containing 15-20 individuals with varied walking trajectories. The dataset includes 75 pairs of first-person view (FPV) videos (50 for training, 25 for testing), with each scene running for 1,000 frames. BEVT-S provides comprehensive annotations including subject positions (in meters), body orientations in BEV, camera poses, and unified bounding box IDs across all views. This synthetic dataset enables robust training and evaluation of multi-view human tracking systems with BEV integration.
BEVTrack Benchmark Dataset Documentation
Dataset Overview
The BEVTrack Benchmark is a comprehensive dataset designed for multi-person tracking in Bird's-Eye-View (BEV) scenarios. It consists of two main parts:
- BEVT-S: Synthetic dataset with 5 scenes
- CVMHAT-BEVT: Real-world dataset with different FPV camera configurations
Dataset Structure
1. BEVT-S (Synthetic Dataset)
Contains 5 scenes (scene1-5) with identical structure. Each scene includes:
1.1 Annotation Directory
Contains ground truth and processed data:
3D Ground Truth Files:
- camera*.txt: 3D positions of camera wearers
- camerabias*.txt: 3D positions of cameras
- person*.txt: 3D positions of subjects
Tracking Data Files (a loading sketch follows this list):
- f_top_bbox_pid.pth: Top-view bounding boxes with person IDs
  Format: f_top_bbox_id_dict[frame].append([top_bbox, pid])
- fp.pth: Frame-person position mapping
  Format: f_pid_dict[f"{frame_id}_{p_id}"] = [x, y, r]
- fps.pth: Frame-person state dictionary
  Format: f_pids_dict[int(frame_id)][int(p_id)] = [x, y, r]
- fv.pth: Frame-view person detection data
  Format: fv_dict[f"{frame}_{view_id}"].append([pid, bbox])
- fvp.pth: Frame-view-person bounding box mapping
  Format: fvp_dict[f"{frame}_{view_id}_{pid}"] = bbox
- fvskwh.pth: Frame-view skeleton and dimension data
  Format: fv_sk_wh[f"{frame_id}_{view_id}"] = [keypoints ([17 * 3], from PifPaf), wh]
- fv_sk_box.pth: Frame-view skeleton and box data
  Format: fv_sk_box[f"{frame_id}_{view_id}"] = [keypoints, boxes]
- init_inf.txt: Frame-wise object count information
  Note: Object counts start from the third line
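The .pth files above are most naturally read as PyTorch-serialized dictionaries with the key layouts documented in this list; that serialization format is an assumption (suggested by the .pth extension, not stated explicitly), and the scene path below is hypothetical.

```python
import torch

# Minimal sketch, assuming the .pth annotation files are PyTorch-serialized
# dictionaries with the key layouts documented above. The scene path is
# hypothetical; on recent PyTorch, torch.load(..., weights_only=False) may be
# needed if the files contain non-tensor objects such as NumPy arrays.
anno_dir = "BEVT-S/scene1/annotation"

f_top_bbox_pid = torch.load(f"{anno_dir}/f_top_bbox_pid.pth")  # frame -> [[top_bbox, pid], ...]
fps = torch.load(f"{anno_dir}/fps.pth")                        # frame -> {pid: [x, y, r]}
fvp = torch.load(f"{anno_dir}/fvp.pth")                        # "frame_view_pid" -> bbox

# Example: print every subject's BEV position and orientation at one frame.
frame_id = 10
for pid, (x, y, r) in fps[frame_id].items():
    print(f"frame {frame_id}, person {pid}: position ({x:.2f}, {y:.2f}) m, orientation {r:.2f}")
```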
1.2 Video and Image Data
- combine/: Visualization videos
- hor*_video/: Raw first-person view (FPV) videos
- top_video/: Raw top-view videos
1.3 Segmentation Annotations
- hor*_bbox/: Subject segmentation masks for FPV
- top_bbox/: Subject segmentation masks for the top view
- Masks are organized by frame number (e.g., 0001/), with an individual mask per subject (e.g., 01.png)
2. CVMHAT-BEVT (Real-world Dataset)
2.1 Ground Truth Directory (GT_txt/)
Contains view-specific ground truth files:
- Format: V{num_fpv}_G{group}_h{camera}.txt
  - num_fpv: Number of FPV cameras in the setup (2, 3, or 4)
  - group: Group/sequence number
  - camera: Camera identifier (h1, h2, h3, h4 for FPV cameras)
2.2 Images Directory
Organized by number of FPV cameras and groups (a directory-walk sketch follows the listing):
Two-Camera Setup (V2):
- V2_G1/
  - h1/, h2/: First-person view images
  - t/: Top-view images
- V2_G3/: Same structure as V2_G1
- V2_G5/: Same structure as V2_G1
Three-Camera Setup (V3):
- V3_G1/
  - h1/, h2/, h3/: First-person view images
  - t/: Top-view images
- V3_G2/: Same structure as V3_G1
- V3_G3/: Same structure as V3_G1
- V3_G5/: Same structure as V3_G1
Four-Camera Setup (V4):
- V4_G1/
  - h1/, h2/, h3/, h4/: First-person view images
  - t/: Top-view images
- V4_G2/: Same structure as V4_G1
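A small directory-walk sketch for the image tree described above; the root path and exact nesting are assumptions, so adjust them to wherever the archive is extracted.

```python
import glob
import os

# Minimal sketch, assuming the images are extracted under CVMHAT-BEVT/images
# with one V{num_fpv}_G{group}/ directory per group (hypothetical root path).
root = "CVMHAT-BEVT/images"
for group_dir in sorted(glob.glob(os.path.join(root, "V*_G*"))):
    fpv_dirs = sorted(glob.glob(os.path.join(group_dir, "h*")))  # h1/, h2/, ...
    top_dir = os.path.join(group_dir, "t")                       # top-view frames
    print(os.path.basename(group_dir),
          f"{len(fpv_dirs)} FPV views,",
          "top view present" if os.path.isdir(top_dir) else "top view missing")
```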
2.3 File Naming Convention
Images follow the naming pattern: V{num_fpv}_G{group}_{view}_{frame}.jpg
Examples (a filename-parsing sketch follows):
- V2_G1_h1_0001.jpg: Two-camera setup, Group 1, Camera 1, Frame 1
- V3_G1_h3_0001.jpg: Three-camera setup, Group 1, Camera 3, Frame 1
- V4_G1_h4_0001.jpg: Four-camera setup, Group 1, Camera 4, Frame 1
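A minimal parsing sketch for this naming pattern. Allowing "t" as a view code for top-view frames is an assumption based on the t/ directories above; only the h* examples are given explicitly.

```python
import re

# Minimal sketch: parse the image naming pattern V{num_fpv}_G{group}_{view}_{frame}.jpg.
# Accepting "t" as a view code for top-view frames is an assumption.
PATTERN = re.compile(r"V(?P<num_fpv>\d+)_G(?P<group>\d+)_(?P<view>h\d+|t)_(?P<frame>\d+)\.jpg")

def parse_image_name(filename: str) -> dict:
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return {
        "num_fpv": int(match.group("num_fpv")),
        "group": int(match.group("group")),
        "view": match.group("view"),          # "h1".."h4" for FPV cameras
        "frame": int(match.group("frame")),
    }

print(parse_image_name("V3_G1_h3_0001.jpg"))
# {'num_fpv': 3, 'group': 1, 'view': 'h3', 'frame': 1}
```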
File Formats
Ground Truth Files
Person/Camera Position Files (*.txt):
- Each line represents: [frame_id, track_id, x, y, w, h] (see the parsing sketch below)
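A minimal parsing sketch for these files, assuming comma- or whitespace-separated values per line (the exact delimiter is not specified above, so verify against the files).

```python
# Minimal sketch: read a ground-truth file where each line holds
# frame_id, track_id, x, y, w, h. Comma- or whitespace-separated values
# (and optional surrounding brackets) are assumptions.
def load_gt(path):
    tracks = {}  # frame_id -> list of (track_id, x, y, w, h)
    with open(path) as fh:
        for line in fh:
            line = line.strip().strip("[]")
            if not line:
                continue
            frame_id, track_id, x, y, w, h = [float(v) for v in line.replace(",", " ").split()]
            tracks.setdefault(int(frame_id), []).append((int(track_id), x, y, w, h))
    return tracks

gt = load_gt("GT_txt/V2_G1_h1.txt")
print(len(gt), "annotated frames")
```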
Data Files (*.pth):
- Bounding Box Data: Contains detection and tracking information
- Skeleton Data: 17 keypoints per subject, generated using OpenPifPaf (a keypoint-reshaping sketch follows)
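Given the 17-keypoint layout noted above, the flat keypoint vectors in fvskwh.pth can be reshaped to rows of (x, y, confidence), which is the form OpenPifPaf produces for COCO-style keypoints; the file path and the "frame_view" key below are hypothetical.

```python
import numpy as np
import torch

# Minimal sketch: reshape a flat 17*3 keypoint vector into (17, 3) rows of
# (x, y, confidence), as produced by OpenPifPaf for COCO keypoints.
# The scene path and the "frame_view" key are hypothetical examples.
fv_sk_wh = torch.load("BEVT-S/scene1/annotation/fvskwh.pth")
keypoints, wh = fv_sk_wh["1_1"]
kp = np.asarray(keypoints).reshape(17, 3)
print("keypoint 0 (x, y, confidence):", kp[0], "| box width/height:", wh)
```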
Image Files
- Video Frames (*.png, *.jpg):
  - Raw video frames from both top and FPV views
  - Sequential naming (0001.png, etc.)
- Segmentation Masks (*.png):
  - Binary masks for each subject
  - Organized by frame and subject ID (see the mask-loading sketch below)
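A mask-loading sketch for the per-subject segmentation files, assuming four-digit zero-padded frame directories (e.g., 0001/) and a hypothetical scene path.

```python
import os

import numpy as np
from PIL import Image

# Minimal sketch: load all per-subject binary masks for one frame of one view.
# The layout <view_dir>/<frame>/<subject_id>.png follows the description above;
# the scene path and four-digit zero padding are assumptions.
def load_masks(view_dir: str, frame_id: int) -> dict:
    frame_dir = os.path.join(view_dir, f"{frame_id:04d}")
    masks = {}
    for name in sorted(os.listdir(frame_dir)):
        if not name.endswith(".png"):
            continue
        subject_id = int(os.path.splitext(name)[0])                    # "01.png" -> 1
        masks[subject_id] = np.array(Image.open(os.path.join(frame_dir, name))) > 0
    return masks

masks = load_masks("BEVT-S/scene1/top_bbox", frame_id=1)
print({pid: int(m.sum()) for pid, m in masks.items()})                 # mask area per subject
```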