BEVTrack Benchmark

Citation Author(s):
Zekun Qian
Submitted by:
Zekun Qian
Last updated:
Sat, 02/08/2025 - 00:19
DOI:
10.21227/kx99-0x86

Abstract 

The BEVTrack benchmark consists of two complementary datasets, CVMHAT-BEVT and BEVT-S, designed for multi-view human tracking with bird's-eye view (BEV) capabilities. CVMHAT-BEVT is adapted from the public CVMHAT dataset, featuring 21 synchronized multi-view videos across five scenes, with 7-12 individuals per scene and video durations ranging from 200 to 1,500 frames. Each scene contains 2-4 side views and includes synchronized BEV footage captured by drones, along with annotated bounding boxes and unified ID numbers for all subjects across views.

To address the limitations of real-world datasets, we introduce BEVT-S, a large-scale synthetic dataset created using the Unity 3D engine and the PersonX model library. BEVT-S comprises five diverse scenes, each spanning 25 m × 25 m and containing 15-20 individuals with varied walking trajectories. The dataset includes 75 pairs of first-person view (FPV) videos (50 for training, 25 for testing), with each scene running for 1,000 frames. BEVT-S provides comprehensive annotations, including subject positions (in meters), body orientations in BEV, camera poses, and unified bounding box IDs across all views. This synthetic dataset enables robust training and evaluation of multi-view human tracking systems with BEV integration.

Instructions: 

BEVTrack Benchmark Dataset Documentation

Dataset Overview

The BEVTrack Benchmark is a comprehensive dataset designed for multi-person tracking in Bird's-Eye-View (BEV) scenarios. It consists of two main parts:

  1. BEVT-S: Synthetic dataset with 5 scenes

  2. CVMHAT-BEVT: Real-world dataset with different FPV camera configurations

Dataset Structure

1. BEVT-S (Synthetic Dataset)

Contains 5 scenes (scene1-5) with identical structure. Each scene includes:

1.1 Annotation Directory

Contains ground truth and processed data:

3D Ground Truth Files:

  • camera*.txt: 3D positions of camera wearers

  • camerabias*.txt: 3D positions of cameras

  • person*.txt: 3D positions of subjects

Tracking Data Files:

  • f_top_bbox_pid.pth: Top-view bounding boxes with person IDs

    • Format: f_top_bbox_id_dict[frame].append([top_bbox, pid])
  • fp.pth: Frame-person position mapping

    • Format: f_pid_dict[f"{frame_id}_{p_id}"] = [x, y, r]
  • fps.pth: Frame-person state dictionary

    • Format: f_pids_dict[int(frame_id)][int(p_id)] = [x, y, r]
  • fv.pth: Frame-view person detection data

    • Format: fv_dict[f"{frame}_{view_id}"].append([pid, bbox])
  • fvp.pth: Frame-view-person bounding box mapping

    • Format: fvp_dict[f"{frame}_{view_id}_{pid}"] = bbox
  • fvskwh.pth: Frame-view skeleton and dimension data

    • Format: fv_sk_wh[f"{frame_id}_{view_id}"] = [keypoints, wh], where keypoints are 17 × 3 values from OpenPifPaf
  • fv_sk_box.pth: Frame-view skeleton and box data

    • Format: fv_sk_box[f"{frame_id}_{view_id}"] = [keypoints, boxes]
  • init_inf.txt: Frame-wise object count information

    • Note: object counts start from the third line
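The dictionary layouts above can be sketched in Python. The snippet below builds two of the structures (fvp and fv) in memory to show how keys are formed; in practice the dictionaries would be deserialized with torch.load, and the file path shown in the comment is hypothetical:

```python
# In-memory sketch of the fvp.pth and fv.pth key/value layouts.
# A real annotation file would instead be loaded with:
#   import torch
#   fvp_dict = torch.load("scene1/annotation/fvp.pth")  # hypothetical path
fvp_dict = {}   # frame-view-person -> bounding box
fv_dict = {}    # frame-view -> list of [pid, bbox]

frame, view_id, pid = 1, 0, 3
bbox = [100, 120, 40, 80]   # illustrative [x, y, w, h] values

# Keys follow the f"{frame}_{view_id}_{pid}" and f"{frame}_{view_id}" formats.
fvp_dict[f"{frame}_{view_id}_{pid}"] = bbox
fv_dict.setdefault(f"{frame}_{view_id}", []).append([pid, bbox])

print(fvp_dict["1_0_3"])   # [100, 120, 40, 80]
```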


1.2 Video and Image Data

  • combine/: Visualization videos

  • hor*_video/: Raw First-Person View (FPV) videos

  • top_video/: Raw top-view videos

1.3 Segmentation Annotations

  • hor*_bbox/: Subject segmentation masks for FPV

  • top_bbox/: Subject segmentation masks for top view

    • Organized by frame numbers (e.g., 0001/)
    • Individual masks for each subject (e.g., 01.png)

2. CVMHAT-BEVT (Real-world Dataset)

2.1 Ground Truth Directory (GT_txt/)

Contains view-specific ground truth files:

  • Format: V{num_fpv}_G{group}_h{camera}.txt

    • num_fpv: Number of FPV cameras in the setup (2, 3, or 4 FPV cameras)
    • group: Group/sequence number
    • camera: Camera identifier (h1, h2, h3, h4 for FPV cameras)
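Following the convention above, a ground-truth file name can be split into its components with a small regular expression. This helper is a hypothetical illustration, not part of the dataset toolkit:

```python
import re

# Hypothetical helper: split a GT file name such as "V3_G2_h1.txt"
# into (num_fpv, group, camera) per the V{num_fpv}_G{group}_h{camera}.txt pattern.
GT_NAME_RE = re.compile(r"V(\d+)_G(\d+)_h(\d+)\.txt")

def parse_gt_name(name):
    m = GT_NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected GT file name: {name}")
    num_fpv, group, camera = (int(g) for g in m.groups())
    return num_fpv, group, camera

print(parse_gt_name("V3_G2_h1.txt"))  # (3, 2, 1)
```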

2.2 Images Directory

Organized by number of FPV cameras and groups:

Two-Camera Setup (V2):

  • V2_G1/:

    • h1/, h2/: First-person view images
    • t/: Top-view images
  • V2_G3/, V2_G5/: Same structure as V2_G1

Three-Camera Setup (V3):

  • V3_G1/:

    • h1/, h2/, h3/: First-person view images
    • t/: Top-view images
  • V3_G2/, V3_G3/, V3_G5/: Same structure as V3_G1

Four-Camera Setup (V4):

  • V4_G1/:

    • h1/, h2/, h3/, h4/: First-person view images
    • t/: Top-view images
  • V4_G2/: Same structure as V4_G1

2.3 File Naming Convention

Images follow the naming pattern: V{num_fpv}_G{group}_{view}_{frame}.jpg

  • Examples:

    • V2_G1_h1_0001.jpg: Two-camera setup, Group 1, Camera 1, Frame 1
    • V3_G1_h3_0001.jpg: Three-camera setup, Group 1, Camera 3, Frame 1
    • V4_G1_h4_0001.jpg: Four-camera setup, Group 1, Camera 4, Frame 1
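The image naming pattern can likewise be parsed programmatically. The regex below is a hypothetical sketch; it assumes top-view frames use "t" as the view token, which is not stated explicitly above:

```python
import re

# Hypothetical parser for V{num_fpv}_G{group}_{view}_{frame}.jpg image names.
# Assumption: view is "h<n>" for FPV cameras or "t" for the top view.
NAME_RE = re.compile(r"V(?P<num_fpv>\d+)_G(?P<group>\d+)_(?P<view>h\d+|t)_(?P<frame>\d+)\.jpg")

def parse_image_name(name):
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected image file name: {name}")
    d = m.groupdict()
    return int(d["num_fpv"]), int(d["group"]), d["view"], int(d["frame"])

print(parse_image_name("V3_G1_h3_0001.jpg"))  # (3, 1, 'h3', 1)
```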

File Formats

Ground Truth Files

Person/Camera Position Files (*.txt):

  • Each line represents: [frame_id, track_id, x, y, w, h]
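A line in this layout can be parsed as follows. The delimiter is not specified above, so this hypothetical helper accepts either commas or whitespace:

```python
def parse_gt_line(line):
    """Parse one ground-truth line of the form: frame_id, track_id, x, y, w, h.
    Assumption: fields are comma- or whitespace-separated; frame_id and
    track_id are integers, box coordinates are kept as floats."""
    frame_id, track_id, x, y, w, h = line.replace(",", " ").split()[:6]
    return int(frame_id), int(track_id), float(x), float(y), float(w), float(h)

print(parse_gt_line("1,3,100.5,200,40,80"))  # (1, 3, 100.5, 200.0, 40.0, 80.0)
```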

Data Files (*.pth)

  1. Bounding Box Data:

    • Contains detection and tracking information
  2. Skeleton Data:

    • 17 keypoints
    • Generated using OpenPifPaf

Image Files

  1. Video Frames (*.png, *.jpg):

    • Raw video frames from both top and FPV views
    • Sequential naming (0001.png, etc.)
  2. Segmentation Masks (*.png):

    • Binary masks for each subject
    • Organized by frame and subject ID
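Since the masks are binary and stored per subject, a bounding box can be recovered from a mask when needed. This is a minimal sketch: the mask here is synthesized in memory, and in practice it would be read from a file such as top_bbox/0001/01.png (path shown for illustration) with PIL:

```python
import numpy as np

def mask_to_bbox(mask):
    """Derive an [x, y, w, h] bounding box from a binary subject mask.
    A real mask would be loaded with, e.g.:
      mask = np.array(Image.open("top_bbox/0001/01.png")) > 0  # illustrative path
    """
    ys, xs = np.nonzero(mask)        # row/column indices of foreground pixels
    if len(xs) == 0:
        return None                  # empty mask: subject not visible
    x, y = xs.min(), ys.min()
    return [int(x), int(y), int(xs.max() - x + 1), int(ys.max() - y + 1)]

# Synthetic 8x8 mask with a 3-row by 4-column foreground region.
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True
print(mask_to_bbox(mask))  # [3, 2, 4, 3]
```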