DeepGuardDB: Real and Text-to-Image Synthetic Images Dataset
- Submitted by: Gueltoum Bendiab
- Last updated: Mon, 10/21/2024 - 18:43
- DOI: 10.21227/10ap-pk52
Abstract
"Recent advancements in deep learning and generative models have significantly enhanced text-to-image (T2I) synthesis, allowing for the creation of highly realistic images based on textual inputs. While this progress has expanded the creative and practical applications of AI, it also presents new challenges in distinguishing between authentic and AI-generated images. This challenge raises serious concerns in areas such as security, privacy, and digital forensics. In response, there has been growing attention on the development of advanced AI-based detectors designed to reliably differentiate between synthetic and real images, ensuring data authenticity and protection against potential misuse. Using reliable and diverse datasets of fake and real data is crucial for training and evaluating the learning models effectively. For that, the research community has made significant efforts to develop dedicated datasets for this specific purpose. As the T2I generation tools continue to evolve rapidly, there is an ongoing need to update and refine existing datasets to keep pace with the latest advancements. This constant evolution drives us to continuously improve our resources, ensuring that they reflect the state-of-the-art in image generation. In this context, we have constructed the DeepGuardDB dataset, which plays a pivotal role in evaluating and enhancing models designed to differentiate between AI-generated images and real ones. To ensure a comprehensive and representative evaluation, the DeepGuardDB dataset has been meticulously curated, addressing the limitations of existing datasets by incorporating a diverse array of visual content. DeepGuardDB dataset leverages Stable Diffusion3, which produces higher-quality images in addition to Imagen and DALL-E 3. DeepGuardDB contains 13,000 images, evenly split between real and generated images, with 6500 (50%) representing each category. 
The real images in DeepGuardDB are collected from two well-established datasets, each recognized for its richness and diversity: MS-COCO (Microsoft Common Objects in Context) and Flickr30k. For the AI-generated images, DeepGuardDB uses three of the most advanced T2I generation platforms available today: Stable Diffusion 3, Imagen, and DALL-E 3. The synthetic images were created using the same prompts as those associated with the real images. Because the textual descriptions are identical, the generated images closely resemble the authentic ones and share similar themes, subjects, and visual cues, which highlights the challenge of distinguishing real from AI-generated content."
"We distribute DeepGuardDB using a modular file structure to ensure ease of use and organization. The dataset consists of 13,000 images, which are divided into two separate folders: one containing 6,500 AI-generated (fake) images and the other containing 6,500 authentic (real) images. In addition to the images, we provide a comprehensive JSON file that maps each pair of images (fake and real) to its corresponding prompts and associated hyperparameters. This JSON file serves as a valuable resource, allowing users to trace the generation process and understand the input parameters used for both the real and synthetic images, making the dataset more transparent and easier to analyse.
For instance, here is a description of an image pair along with its key-value entries in the JSON file.
{
  "id": 2,
  "real image file name": "116482438.jpg",
  "fake image file name": "8ac2a4ac-29f3-4bbe-897b-f19cd6a25c51.jpg",
  "prompts": "An elderly woman sits in a chair with a book and smiles at the pretty flowers in her room",
  "platform": "sd"
}
This JSON provides metadata for the pair of images (116482438.jpg, 8ac2a4ac-29f3-4bbe-897b-f19cd6a25c51.jpg), one real and one AI-generated. It includes the following details:
- id: A unique identifier for the image pair (ID: 2).
- real_image_file_name: The file name of the real image ("116482438.jpg").
- fake_image_file_name: The file name of the AI-generated (fake) image ("8ac2a4ac-29f3-4bbe-897b-f19cd6a25c51.jpg").
- prompts: The text prompt used to generate the synthetic image ("An elderly woman sits in a chair with a book and smiles at the pretty flowers in her room"). The fake images were generated using the same prompts associated with the corresponding real images.
- platform: The platform used to generate the fake image from the prompt associated with the real one (here "sd", which refers to Stable Diffusion). Possible values of this field are:
  - sd
  - dall_E3
  - imagen
  - glide
This structure links real and synthetic images along with their prompts and generation details."
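As a minimal sketch of reading such an entry in Python: the sample JSON above spells its keys with spaces ("real image file name"), while the field list uses underscores (real_image_file_name). The helper below, normalize_entry, is a hypothetical name introduced for illustration; it maps either spelling to the underscore form so downstream code can use one convention. The key spelling in the distributed files may differ from what is assumed here.

```python
import json

def normalize_entry(entry: dict) -> dict:
    """Return a copy of a metadata entry with spaces in key names replaced by underscores."""
    return {key.strip().replace(" ", "_"): value for key, value in entry.items()}

# The sample entry shown above, copied verbatim.
raw = """
{
  "id": 2,
  "real image file name": "116482438.jpg",
  "fake image file name": "8ac2a4ac-29f3-4bbe-897b-f19cd6a25c51.jpg",
  "prompts": "An elderly woman sits in a chair with a book and smiles at the pretty flowers in her room",
  "platform": "sd"
}
"""

entry = normalize_entry(json.loads(raw))
print(entry["real_image_file_name"], entry["fake_image_file_name"], entry["platform"])
```

Entries that already use underscore keys pass through unchanged, so the same helper can be applied regardless of which spelling a given file uses.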
|_ DeepGuardDB_v1.0
    |_ DALLE_dataset
        |_ fake
        |_ real
    |_ SD_dataset
        |_ fake
        |_ real
    |_ IMAGEN_dataset
        |_ fake
        |_ real
    |_ GLIDE_dataset
        |_ fake
        |_ real
    |_ json files
        |_ dalle_json.json
        |_ sd_json.json
        |_ glide_json.json
        |_ imagen_json.json