Detecting and Localizing Text-Image Synchronization Forgery

Name: Detecting and Localizing Text-Image Synchronization Forgery
Creator: Zhigeng Han
License: https://creativecommons.org/licenses/by/4.0/

Citation Author(s): Jian Chen
Submitted by: Zhigeng Han
Last updated: Sun, 02/16/2025 - 07:23
DOI: 10.21227/0pap-1m14

Keywords:

Deepfake detection

Text-image synchronization forgery

Multimodal hierarchical fusion reasoning

Audio-video as context

Abstract

DLSF is the first dedicated dataset for Text-Image Synchronization Forgery (TISF) in multimodal media. The source data were scraped from Toutiao, a Chinese news aggregation platform. The dataset contains text, image, and audio-video data from news articles about politicians and celebrities, with samples of both entity-level and attribute-level TISF. Its annotations cover text-image authenticity labels, TISF types, forged image regions, and forged text tokens. The current DLSF dataset consists of 2,200 image-text-audio-video samples, including 179 attribute-level TISF samples (FA+TA, face attribute editing paired with text attribute editing) and 279 entity-level TISF samples (FS+TS, face swapping paired with entity name replacement). It is designed to evaluate how well models detect and localize TISF.

Instructions:

The DLSF dataset includes the files train_v1.3.json and test_v1.3.json, with the data organized as follows:

{
  "title": "房产过户遵从遗嘱保障权益",
  "video_path": "./Data/videos/o8DEmpgEh7zAIDkfmdBxdzxJEujAeBQvIxUPtg.mp4",
  "image_path": "./Data/images/7369231429441765899.jpg",
  "fake_text_pos": [
    10
  ],
  "bbox": [
    157,
    128,
    355,
    392
  ],
  "fake_cls": "face_attribute&text_attribute",
  "con_label": 0
}

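title represents the news headline text, in Chinese (the example headline reads roughly "Property transfer follows the will to safeguard rights and interests").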

video_path represents the storage path for the video.

image_path represents the storage path for the news images.

fake_text_pos marks the positions of the words that were altered in the text.

bbox indicates the areas in the image that were tampered with.

fake_cls represents the type of text-image synchronization forgery (face_attribute: image attribute editing, face_swap: face swapping, text_attribute: text attribute editing, text_swap: entity name replacement). In the example, the image-side and text-side types are joined with "&".

con_label indicates whether the text-image pair is synchronously forged (0 for forged, 1 for not forged).
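A minimal sketch of reading one record in Python, assuming the example record is a complete JSON object once its closing brackets are restored. The names `DLSFRecord` and `parse_record` are hypothetical and introduced here only for illustration. Splitting fake_cls on "&" is inferred from the single example value, not from documented behavior. The source does not state the order of the four bbox numbers, so the sketch keeps them as a plain list.

```python
import json
from dataclasses import dataclass
from typing import List

# Example record from the dataset description, with closing brackets
# restored (an assumption about the original JSON layout).
EXAMPLE_RECORD = """
{
  "title": "房产过户遵从遗嘱保障权益",
  "video_path": "./Data/videos/o8DEmpgEh7zAIDkfmdBxdzxJEujAeBQvIxUPtg.mp4",
  "image_path": "./Data/images/7369231429441765899.jpg",
  "fake_text_pos": [10],
  "bbox": [157, 128, 355, 392],
  "fake_cls": "face_attribute&text_attribute",
  "con_label": 0
}
"""


@dataclass
class DLSFRecord:  # hypothetical container, not part of the dataset
    title: str
    video_path: str
    image_path: str
    fake_text_pos: List[int]   # positions of altered words in the title
    bbox: List[int]            # tampered image region; coordinate order unspecified
    forgery_types: List[str]   # fake_cls split on "&" (inferred convention)
    is_forged: bool            # True when con_label == 0


def parse_record(raw: dict) -> DLSFRecord:
    """Convert one raw JSON object into a typed record."""
    fake_cls = raw.get("fake_cls", "")
    return DLSFRecord(
        title=raw["title"],
        video_path=raw["video_path"],
        image_path=raw["image_path"],
        fake_text_pos=list(raw.get("fake_text_pos", [])),
        bbox=list(raw.get("bbox", [])),
        forgery_types=[t for t in fake_cls.split("&") if t],
        # Per the field description: 0 means forged, 1 means not forged.
        is_forged=(raw["con_label"] == 0),
    )


record = parse_record(json.loads(EXAMPLE_RECORD))
print(record.forgery_types, record.is_forged)
```

Mapping con_label to a boolean named `is_forged` is a design choice of this sketch: because 0 means forged, a name that states the meaning is less error-prone than comparing against the raw integer throughout downstream code.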

In addition, the DLSF dataset includes the following folders:

The videos folder contains the original news videos.

The images folder contains both the original and tampered news images.

The audio folder contains encoded audio data, stored in .npy format.
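The paths in each record begin with "./Data/", which suggests they are relative to a dataset root directory. Under that assumption, the following sketch resolves a record's media paths against a root and reports which files are absent. The helper names `resolve_media_paths` and `missing_media` and the root `/path/to/DLSF` are hypothetical. The audio folder is left out because the source does not say how the .npy files are named or linked to records; such files are typically read with `numpy.load`.

```python
from pathlib import Path
from typing import Dict, List

# Record keys that hold media paths, as listed in the field descriptions.
MEDIA_KEYS = ("video_path", "image_path")


def resolve_media_paths(record: dict, dataset_root: str) -> Dict[str, Path]:
    """Join each relative media path onto the dataset root (assumed layout)."""
    root = Path(dataset_root)
    return {key: root / record[key] for key in MEDIA_KEYS if key in record}


def missing_media(record: dict, dataset_root: str) -> List[str]:
    """Return the keys whose resolved file does not exist on disk."""
    paths = resolve_media_paths(record, dataset_root)
    return [key for key, path in paths.items() if not path.is_file()]


example = {
    "video_path": "./Data/videos/o8DEmpgEh7zAIDkfmdBxdzxJEujAeBQvIxUPtg.mp4",
    "image_path": "./Data/images/7369231429441765899.jpg",
}

# "/path/to/DLSF" is a placeholder; replace it with the extracted dataset folder.
paths = resolve_media_paths(example, "/path/to/DLSF")
print(paths["image_path"])
print(missing_media(example, "/path/to/DLSF"))
```

Running `missing_media` over every record before training is a cheap way to catch an incomplete download or a wrong root directory early.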