BEVTrack Benchmark

Citation Author(s):
Zekun Qian
Submitted by:
Zekun Qian
Last updated:
Sat, 02/08/2025 - 00:19
DOI:
10.21227/kx99-0x86

Abstract 

The BEVTrack benchmark consists of two complementary datasets, CVMHAT-BEVT and BEVT-S, designed for multi-view human tracking with bird's-eye view (BEV) capabilities. CVMHAT-BEVT is adapted from the public CVMHAT dataset, featuring 21 synchronized multi-view videos across five scenes, with 7-12 individuals per scene and video durations ranging from 200 to 1,500 frames. Each scene contains 2-4 side views and includes synchronized BEV footage captured by drones, along with annotated bounding boxes and unified ID numbers for all subjects across views.

To address the limitations of real-world datasets, we introduce BEVT-S, a large-scale synthetic dataset created using the Unity 3D engine and the PersonX model library. BEVT-S comprises five diverse scenes, each spanning 25 m × 25 m and containing 15-20 individuals with varied walking trajectories. The dataset includes 75 pairs of first-person view (FPV) videos (50 for training, 25 for testing), with each scene running for 1,000 frames. BEVT-S provides comprehensive annotations, including subject positions (in meters), body orientations in BEV, camera poses, and unified bounding box IDs across all views. This synthetic dataset enables robust training and evaluation of multi-view human tracking systems with BEV integration.

Instructions: 

BEVTrack Benchmark Dataset Documentation

Dataset Overview

The BEVTrack Benchmark is a comprehensive dataset designed for multi-person tracking in Bird's-Eye-View (BEV) scenarios. It consists of two main parts:

  1. BEVT-S: Synthetic dataset with 5 scenes

  2. CVMHAT-BEVT: Real-world dataset with different FPV camera configurations

Dataset Structure

1. BEVT-S (Synthetic Dataset)

Contains 5 scenes (scene1-5) with identical structure. Each scene includes:

1.1 Annotation Directory

Contains ground truth and processed data:

3D Ground Truth Files:

  • camera*.txt: 3D positions of camera wearers

  • camerabias*.txt: 3D positions of cameras

  • person*.txt: 3D positions of subjects

Tracking Data Files:

  • f_top_bbox_pid.pth: Top-view bounding boxes with person IDs

    • Format: f_top_bbox_id_dict[frame].append([top_bbox, pid])
  • fp.pth: Frame-person position mapping

    • Format: f_pid_dict[f"{frame_id}_{p_id}"] = [x, y, r]
  • fps.pth: Frame-person state dictionary

    • Format: f_pids_dict[int(frame_id)][int(p_id)] = [x, y, r]
  • fv.pth: Frame-view person detection data

    • Format: fv_dict[f"{frame}_{view_id}"].append([pid, bbox])
  • fvp.pth: Frame-view-person bounding box mapping

    • Format: fvp_dict[f"{frame}_{view_id}_{pid}"] = bbox
  • fvskwh.pth: Frame-view skeleton and dimension data

    • Format: fv_sk_wh[f"{frame_id}_{view_id}"] = [keypoints, wh], where keypoints are 17 × 3 values from OpenPifPaf
  • fv_sk_box.pth: Frame-view skeleton and box data

    • Format: fv_sk_box[f"{frame_id}_{view_id}"] = [keypoints, boxes]
  • init_inf.txt: Frame-wise object count information

    • Note: object counts start from the third line
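The dictionary layouts above can be sketched in Python. The snippet below builds two of the structures (fvp and fv) in memory to show how keys are formed; in practice the dictionaries would be deserialized with torch.load, and the file path shown in the comment is hypothetical:

```python
# In-memory sketch of the fvp.pth and fv.pth key/value layouts.
# A real annotation file would instead be loaded with:
#   import torch
#   fvp_dict = torch.load("scene1/annotation/fvp.pth")  # hypothetical path
fvp_dict = {}   # frame-view-person -> bounding box
fv_dict = {}    # frame-view -> list of [pid, bbox]

frame, view_id, pid = 1, 0, 3
bbox = [100, 120, 40, 80]   # illustrative [x, y, w, h] values

# Keys follow the f"{frame}_{view_id}_{pid}" and f"{frame}_{view_id}" formats.
fvp_dict[f"{frame}_{view_id}_{pid}"] = bbox
fv_dict.setdefault(f"{frame}_{view_id}", []).append([pid, bbox])

print(fvp_dict["1_0_3"])   # [100, 120, 40, 80]
```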


1.2 Video and Image Data

  • combine/: Visualization videos

  • hor*_video/: Raw First-Person View (FPV) videos

  • top_video/: Raw top-view videos

1.3 Segmentation Annotations

  • hor*_bbox/: Subject segmentation masks for FPV

  • top_bbox/: Subject segmentation masks for top view

    • Organized by frame numbers (e.g., 0001/)
    • Individual masks for each subject (e.g., 01.png)

2. CVMHAT-BEVT (Real-world Dataset)

2.1 Ground Truth Directory (GT_txt/)

Contains view-specific ground truth files:

  • Format: V{num_fpv}_G{group}_h{camera}.txt

    • num_fpv: Number of FPV cameras in the setup (2, 3, or 4 FPV cameras)
    • group: Group/sequence number
    • camera: Camera identifier (h1, h2, h3, h4 for FPV cameras)
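Following the convention above, a ground-truth file name can be split into its components with a small regular expression. This helper is a hypothetical illustration, not part of the dataset toolkit:

```python
import re

# Hypothetical helper: split a GT file name such as "V3_G2_h1.txt"
# into (num_fpv, group, camera) per the V{num_fpv}_G{group}_h{camera}.txt pattern.
GT_NAME_RE = re.compile(r"V(\d+)_G(\d+)_h(\d+)\.txt")

def parse_gt_name(name):
    m = GT_NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected GT file name: {name}")
    num_fpv, group, camera = (int(g) for g in m.groups())
    return num_fpv, group, camera

print(parse_gt_name("V3_G2_h1.txt"))  # (3, 2, 1)
```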

2.2 Images Directory

Organized by number of FPV cameras and groups:

Two-Camera Setup (V2):

  • V2_G1/:

    • h1/, h2/: First-person view images
    • t/: Top-view images
  • V2_G3/, V2_G5/: Same structure as V2_G1

Three-Camera Setup (V3):

  • V3_G1/:

    • h1/, h2/, h3/: First-person view images
    • t/: Top-view images
  • V3_G2/, V3_G3/, V3_G5/: Same structure as V3_G1

Four-Camera Setup (V4):

  • V4_G1/:

    • h1/, h2/, h3/, h4/: First-person view images
    • t/: Top-view images
  • V4_G2/: Same structure as V4_G1

2.3 File Naming Convention

Images follow the naming pattern: V{num_fpv}_G{group}_{view}_{frame}.jpg

  • Examples:

    • V2_G1_h1_0001.jpg: Two-camera setup, Group 1, Camera 1, Frame 1
    • V3_G1_h3_0001.jpg: Three-camera setup, Group 1, Camera 3, Frame 1
    • V4_G1_h4_0001.jpg: Four-camera setup, Group 1, Camera 4, Frame 1
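The image naming pattern can likewise be parsed programmatically. The regex below is a hypothetical sketch; it assumes top-view frames use "t" as the view token, which is not stated explicitly above:

```python
import re

# Hypothetical parser for V{num_fpv}_G{group}_{view}_{frame}.jpg image names.
# Assumption: view is "h<n>" for FPV cameras or "t" for the top view.
NAME_RE = re.compile(r"V(?P<num_fpv>\d+)_G(?P<group>\d+)_(?P<view>h\d+|t)_(?P<frame>\d+)\.jpg")

def parse_image_name(name):
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected image file name: {name}")
    d = m.groupdict()
    return int(d["num_fpv"]), int(d["group"]), d["view"], int(d["frame"])

print(parse_image_name("V3_G1_h3_0001.jpg"))  # (3, 1, 'h3', 1)
```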

File Formats

Ground Truth Files

Person/Camera Position Files (*.txt):

  • Each line represents: [frame_id, track_id, x, y, w, h]
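A line in this layout can be parsed as follows. The delimiter is not specified above, so this hypothetical helper accepts either commas or whitespace:

```python
def parse_gt_line(line):
    """Parse one ground-truth line of the form: frame_id, track_id, x, y, w, h.
    Assumption: fields are comma- or whitespace-separated; frame_id and
    track_id are integers, box coordinates are kept as floats."""
    frame_id, track_id, x, y, w, h = line.replace(",", " ").split()[:6]
    return int(frame_id), int(track_id), float(x), float(y), float(w), float(h)

print(parse_gt_line("1,3,100.5,200,40,80"))  # (1, 3, 100.5, 200.0, 40.0, 80.0)
```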

Data Files (*.pth)

  1. Bounding Box Data:

    • Contains detection and tracking information
  2. Skeleton Data:

    • 17 keypoints
    • Generated using OpenPifPaf

Image Files

  1. Video Frames (*.png, *.jpg):

    • Raw video frames from both top and FPV views
    • Sequential naming (0001.png, etc.)
  2. Segmentation Masks (*.png):

    • Binary masks for each subject
    • Organized by frame and subject ID
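Since the masks are binary and stored per subject, a bounding box can be recovered from a mask when needed. This is a minimal sketch: the mask here is synthesized in memory, and in practice it would be read from a file such as top_bbox/0001/01.png (path shown for illustration) with PIL:

```python
import numpy as np

def mask_to_bbox(mask):
    """Derive an [x, y, w, h] bounding box from a binary subject mask.
    A real mask would be loaded with, e.g.:
      mask = np.array(Image.open("top_bbox/0001/01.png")) > 0  # illustrative path
    """
    ys, xs = np.nonzero(mask)        # row/column indices of foreground pixels
    if len(xs) == 0:
        return None                  # empty mask: subject not visible
    x, y = xs.min(), ys.min()
    return [int(x), int(y), int(xs.max() - x + 1), int(ys.max() - y + 1)]

# Synthetic 8x8 mask with a 3-row by 4-column foreground region.
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True
print(mask_to_bbox(mask))  # [3, 2, 4, 3]
```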