Chinese Social Media Autism Children Dataset (CSMACD)

- Citation Author(s):
- Wondimagegn Bekele Munto (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China)
- Mengying Zhou (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China)
- Wei Li (Longgang District Maternity & Child Healthcare Hospital, Shenzhen 518172, China)
- Zhihai Lv (Longgang District Maternity & Child Healthcare Hospital, Shenzhen 518172, China)
- Na Li (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China)
- Yuanjie Cao (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China)
- Yi Pan (Shenzhen University of Advanced Technology, Shenzhen, China)
- Sufen Hu (Longgang District Maternity & Child Healthcare Hospital, Shenzhen 518172, China)
- Yanjie Wei (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China)
- Wenhui Xi (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China)
- Submitted by:
- Wondimagegn Munto
- DOI:
- 10.21227/7hqx-me42
Abstract
This paper introduces the Chinese Social Media Autism Children Dataset (CSMACD), a novel resource for autism spectrum disorder (ASD) research. CSMACD compiles high-definition, unobstructed frontal facial images of Chinese children (aged 6 months to 15 years) with ASD, sourced from mainstream social media platforms (e.g., Bilibili, Douyin, and Tencent Video). Videos were identified using ASD-related keywords (e.g., "autism," "Star Baby") and platform recommendation algorithms. A total of 182 ASD facial images (140 males, 42 females) were curated by verifying uploader claims and analyzing video content. To establish a typically developing (TD) control group, 182 pediatric facial images with a matched gender ratio were manually selected from the East Asian subset of the Flickr-Faces-HQ Dataset (FFHQ). All data, including video links and facial landmarks, are publicly available via GitHub and Gitee repositories. While CSMACD is designed to expand with future social media contributions, its open-source nature calls for caution because label accuracy and data quality may vary. This dataset aims to support research in ASD facial analysis, machine learning, and cross-cultural behavioral studies.
Instructions:
CSMACD
Description
The Chinese Social Media Autism Children Dataset (CSMACD) is a multimodal dataset of facial images and videos of autistic and typically developing children, collected from mainstream Chinese social media platforms. It aims to support research on early autism screening tools, behavioral pattern analysis, and facial feature recognition, and offers high-quality data for researchers in psychology, medicine, and AI. The dataset strictly follows privacy protection standards to ensure legality and compliance.
Data Sources
Platform Selection: Bilibili, Douyin, Quick Worker (Kuaishou), Watermelon Video (Xigua Video), Tencent Video, and Youku Video;
Reasons: These platforms have large user bases and broad content coverage, including numerous child-related videos, so they can effectively reflect real-world behavioral characteristics.
Search Keywords
Related to Autistic Children
Chinese Keywords: ASD, autism, autistic, spectrum, autism spectrum, autism spectrum disorder, Asperger's, Biwa, Star Baby, children from the stars, special children;
English Abbreviation: ASD (Autism Spectrum Disorder);
Related to Typically-Developing Children
Human child, Young human boy, Slicked-back hairstyle boy, No-bangs boy;
Note: Keyword selection is based on a literature review and common platform tags; some data were obtained via recommendation algorithms (e.g., homepage recommendations).
Child Inclusion Criteria
General Conditions
- Age Range: 6 months-15 years. Accounts with early childhood images (e.g., 5-6 months) are included even if the current age exceeds the limit;
- Twin/Triplet Handling: Treat twins or triplets as one entity;
- Single-Platform Deduplication: For multiple accounts of the same child on one platform, only include the main account (most videos, highest quality);
- Cross-Platform Deduplication: For multi-platform accounts of the same child, select the most complete-data platform account;
- Child Gender Labeling: Infer from multiple dimensions of the video content:
  - Observable Features: Child's hairstyle, clothing, voice characteristics (if voiced);
  - Contextual Information: Video title tags (e.g., "Boy's Rehabilitation Diary"), poster's descriptive text;
  - Caregiver's Terms: For infants, also refer to the terms used by the caregiver (e.g., "son", "daughter").
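The two deduplication rules above can be sketched roughly as follows. This is an illustration only: the record keys (`child`, `platform`, `account`, `n_videos`, `quality`, `completeness`) are hypothetical, the scores are assumed to be assigned by a human curator, and the dataset's actual selection was presumably done manually.

```python
def pick_one_account_per_child(accounts):
    """Return {child: account} after applying both deduplication rules.

    `accounts` is a list of dicts with hypothetical keys:
    child, platform, account, n_videos, quality, completeness.
    """
    # Single-platform rule: keep the main account per (child, platform),
    # ranked by number of videos first and quality second.
    main = {}
    for acc in accounts:
        key = (acc["child"], acc["platform"])
        rank = (acc["n_videos"], acc["quality"])
        if key not in main or rank > (main[key]["n_videos"], main[key]["quality"]):
            main[key] = acc

    # Cross-platform rule: among the main accounts of one child,
    # keep the one judged to have the most complete data.
    chosen = {}
    for acc in main.values():
        child = acc["child"]
        if child not in chosen or acc["completeness"] > chosen[child]["completeness"]:
            chosen[child] = acc
    return chosen
```

Ranking by video count before quality is one possible reading of "most videos, highest quality"; a different tie-breaking order would only require changing the `rank` tuple.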
Autistic Children
- User profile or video content clearly states diagnosis (e.g., "diagnosed with autism", "ASD diagnosis certificate");
- Profile explicitly states the child has autism, or account content includes medical diagnosis proof or ongoing rehabilitation records (text/video format); exclude accounts without clear diagnosis evidence.
Typically-Developing Children
- Account has no disease statement, and video content doesn't involve special medical or intervention training.
Child Video Selection Criteria
Quantity
1-3 videos per child (prioritizing high-quality segments).
Duration
10 seconds to 3 minutes. Longer videos may be accepted depending on platform characteristics.
Exclusion Rules
- Promotional/advertising videos (e.g., organizational promotion, fundraising);
- Multi-person videos (autism group should avoid other children on camera);
- Content with strong filters/beauty effects.
Quality Requirements
- Resolution ≥ 720p. Lower-resolution videos (< 720p) are included only if the facial images are clear;
- Child's face unobstructed, in close-up, with a neutral expression or small movements, for easy facial feature extraction;
- Early childhood (e.g., 6 months-3 years) videos preferred.
Note: Some social platforms support photo-and-text content in addition to videos. We include this content in video selection, using the same criteria.
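As a rough illustration, the duration, exclusion, and resolution rules above could be encoded as a screening predicate. The `VideoMeta` record and its field names are hypothetical, and judgments such as "promotional" or "face clear" are assumed to come from a human reviewer rather than from automatic detection.

```python
from dataclasses import dataclass


@dataclass
class VideoMeta:
    """Hypothetical per-video record; field names are illustrative only."""
    duration_s: float
    height_px: int                      # vertical resolution, e.g. 720 for 720p
    face_clear: bool                    # reviewer judgment
    promotional: bool = False
    other_children_visible: bool = False
    strong_filter: bool = False


def passes_screening(v: VideoMeta, allow_long: bool = False) -> bool:
    """Apply the exclusion, duration, and resolution rules to one video."""
    # Exclusion rules: promotional content, other children on camera, heavy filters.
    if v.promotional or v.other_children_visible or v.strong_filter:
        return False
    # Duration: at least 10 seconds; above 3 minutes only when explicitly allowed.
    if v.duration_s < 10:
        return False
    if v.duration_s > 180 and not allow_long:
        return False
    # Resolution: 720p or better, otherwise the face must still be clear.
    return v.height_px >= 720 or v.face_clear
```

The `allow_long` flag stands in for the platform-dependent exception to the 3-minute limit, which the criteria leave to the curator's discretion.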
Facial Image Cropping Standards
Quantity
Only 1 optimal facial image per child.
Technical Requirements
- Full face (including ears), no shadows/obstructions;
- Facial yaw ≤ 40° (OpenFace tool's optimal recognition range);
- Neutral expression preferred, avoid exaggerated actions (e.g., laughing, crying), maintain original facial structure.
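The yaw limit above can be checked against head-pose estimates. The sketch below assumes rows read from an OpenFace `FeatureExtraction` CSV, where yaw is taken to be the `pose_Ry` column in radians and `success` marks frames with a tracked face; these column names, and the leading spaces some versions put in the header, are assumptions to verify against your own OpenFace output.

```python
import math
from typing import Iterable, Mapping

MAX_YAW_DEG = 40.0  # limit stated in the cropping standards


def frames_within_yaw(rows: Iterable[Mapping[str, str]],
                      max_yaw_deg: float = MAX_YAW_DEG) -> list:
    """Return the frame numbers whose absolute yaw is within the limit."""
    kept = []
    for raw in rows:
        # Header names may carry leading spaces, so normalise the keys first.
        row = {k.strip(): v for k, v in raw.items()}
        # Skip frames where face tracking failed (assumed 0/1 flag).
        if float(row.get("success", "1")) < 1:
            continue
        yaw_deg = abs(math.degrees(float(row["pose_Ry"])))
        if yaw_deg <= max_yaw_deg:
            kept.append(int(row["frame"]))
    return kept
```

With a real file, the rows could be supplied by `csv.DictReader(open(path, newline=""))`; choosing the single best image among the surviving frames remains a manual step.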
Video Formats
Bilibili, Douyin, Quick Worker, Watermelon Video: MP4 format;
Tencent Video: QLV format;
Youku Video: KUX format.
Dataset Structure and Usage
Facial Image Naming Rule
Example: 2-1-3-4 (pattern X-Y-Z-W), i.e., the fourth facial image from the third video of the first user on Douyin.
First Digit X: Platform Number (1 = Bilibili, 2 = Douyin, 3 = Quick Worker, 4 = Tencent Video, 5 = Watermelon Video, 6 = Youku Video, 7 = Rednote);
Second Digit Y: User Sequence Number within Platform;
Third Digit Z: Video Sequence Number of the User;
Fourth Digit W: Facial Image Sequence Number within the Video.
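A minimal sketch of how a name following this rule might be decoded. The platform table is copied from the rule above; the `ImageId` type, the function name, and the handling of a trailing file extension are illustrative assumptions, not part of the dataset's tooling.

```python
from typing import NamedTuple

# Platform numbers as listed in the naming rule.
PLATFORMS = {
    1: "Bilibili",
    2: "Douyin",
    3: "Quick Worker",
    4: "Tencent Video",
    5: "Watermelon Video",
    6: "Youku Video",
    7: "Rednote",
}


class ImageId(NamedTuple):
    platform: str
    user: int
    video: int
    image: int


def parse_image_name(name: str) -> ImageId:
    """Split a name such as '2-1-3-4' into its four numeric fields."""
    stem = name.rsplit(".", 1)[0]  # drop a file extension if one is present
    parts = stem.split("-")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dash-separated numbers, got {name!r}")
    platform_no, user, video, image = (int(p) for p in parts)
    if platform_no not in PLATFORMS:
        raise ValueError(f"unknown platform number {platform_no}")
    return ImageId(PLATFORMS[platform_no], user, video, image)
```

For instance, `parse_image_name("2-1-3-4")` yields the Douyin platform with user 1, video 3, and image 4. Fields are parsed as whole numbers rather than single digits, since user sequence numbers may exceed 9.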
Application Scenarios
Behavioral Analysis: Analyze autistic children's social interaction patterns by combining video timestamps.
AI Model Training: Train classification models using facial features (e.g., AU action units) extracted by OpenFace.
Cross-modal Research: Link facial features with voice and limb movement data.
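For the AI model training scenario, one possible first step is to turn OpenFace output into a numeric feature vector. The sketch below assumes action-unit intensity columns named like `AU01_r` and `AU12_r`, as in OpenFace CSV output; the function name and the choice to use intensities only (ignoring presence columns such as `AU12_c`) are illustrative assumptions.

```python
import re
from typing import Mapping

# Matches intensity columns such as "AU01_r"; presence columns end in "_c".
AU_INTENSITY = re.compile(r"^AU\d{2}_r$")


def au_feature_vector(row: Mapping[str, str]):
    """Return (column names, values) for the AU intensity columns of one row."""
    # Header names may carry leading spaces, so normalise the keys first.
    cleaned = {k.strip(): v for k, v in row.items()}
    names = sorted(k for k in cleaned if AU_INTENSITY.match(k))
    return names, [float(cleaned[k]) for k in names]
```

Sorting the column names keeps the feature order stable across files, which matters when the vectors are later stacked into a training matrix for a classifier.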