Most existing video text spotting benchmarks evaluate only a single language and a single scenario, with limited data.
In this work, we introduce BOVText V2, a large-scale, Bilingual, Open World Video text benchmark dataset with four
main features. Firstly, we provide 2,000+ videos with more than 1,750,000 frames, 25 times larger than the largest existing
dataset with incidental text in videos. Secondly, our dataset covers 30+ open scenarios, including many virtual scenarios, e.g.,
Life Vlog, Driving, Movie, and Game. Thirdly, abundant text type annotations (i.e., title, caption, or scene text) are provided for
the different representational meanings in the video. Fourthly, BOVText V2 provides bilingual text annotations to promote
communication across cultures.