Historical Information-Event Emotion Dataset

Citation Author(s):
Yaowen Liu
Submitted by:
Yaowen Liu
DOI:
10.21227/qbgv-4703

Abstract

This dataset, constructed around the Jilin Baishan Incident, aims to enhance the emotion prediction capabilities of large language models. Approximately 3.5 million raw comments were collected via the Weibo API, covering key information such as user identifiers, text content, timestamps, and interaction metrics. The data underwent preprocessing steps including normalization, Chinese tokenization, stopword removal, deduplication, and anomalous sample exclusion.

The dataset's main contribution is a batch processing approach that makes fuller use of each user's historical data. A user's historical Weibo posts are ordered chronologically and split into training samples of 50 posts each, and every sample from the same user carries the same emotion label, reflecting that user's stance toward the target event. This approach is intended to work around the context length limits of large language models while increasing the number of training samples and helping the model capture temporal patterns.

All training samples exclude users' direct comments on the target event, ensuring the model performs genuine prediction tasks rather than simple recognition tasks. This dataset is particularly suitable for studying emotional evolution and prediction in social events, providing rich resources for sentiment analysis and social computing research.

Instructions:

Historical Information-Event Emotion (HIEE) Dataset Description
1. Dataset Overview
The HIEE Dataset is a large-scale Chinese dataset designed for predicting users' emotional stances toward specific events based on their historical social media behavior. It takes the Jilin Baishan Incident as the target event and pairs each user's historical Weibo content with that user's emotional stance toward the event, providing data support for sentiment analysis, social computing, and user behavior prediction research.
2. Data Source and Scale
Data Source: Public data from the Weibo platform, collected via API
Raw Data Volume: Approximately 3.5 million Weibo comments
Final Data Volume: Training samples formed after preprocessing and batch processing (specific quantity depends on processing results)
Collection Time Range: Relevant time periods before and after the Jilin Baishan Incident
3. Data Structure
Each training sample contains the following main components:
User Historical Records: 50 Weibo texts arranged in chronological order
Temporal Features: Publication timestamps for each Weibo post
Social Interaction Metrics: Like count, comment count, repost count, and similar metrics
Emotion Labels: User's emotional stance toward the Jilin Baishan Incident (e.g., anger, sadness, surprise)
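
The components listed above could be represented in code roughly as follows. This is a minimal sketch, not the dataset's published schema: the class names (`Post`, `Sample`), the field names, and the `validate` helper are all assumptions made for illustration. Only the batch size of 50 and the chronological ordering come from the description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

BATCH_SIZE = 50  # posts per training sample, as stated in the description


@dataclass
class Post:
    """One historical Weibo post (field names are hypothetical)."""
    text: str
    posted_at: datetime
    likes: int = 0
    comments: int = 0
    reposts: int = 0


@dataclass
class Sample:
    """One training sample: 50 historical posts plus one emotion label."""
    user_id: str       # anonymized identifier (hypothetical field)
    posts: List[Post]  # expected to be in chronological order
    emotion: str       # e.g. "anger", "sadness", "surprise"

    def validate(self) -> None:
        """Raise ValueError if the sample breaks the documented structure."""
        if len(self.posts) != BATCH_SIZE:
            raise ValueError(f"expected {BATCH_SIZE} posts, got {len(self.posts)}")
        times = [p.posted_at for p in self.posts]
        if times != sorted(times):
            raise ValueError("posts must be in chronological order")
```

Calling `validate()` on each sample when loading the data is one way to catch truncated or shuffled batches early.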
4. Data Preprocessing
Text normalization (standardization of emoticons, URL processing, etc.)
Chinese word segmentation and stopword filtering (based on the jieba toolkit)
Duplicate content removal using an edit distance algorithm
Anomalous sample filtering (excessively long texts, advertising content, etc.)
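
The sketch below illustrates three of these steps (URL normalization, edit-distance deduplication, and length-based filtering) using only the Python standard library. It is an illustration under stated assumptions, not the authors' pipeline: the `<URL>` placeholder, the 500-character limit, and the 0.1 similarity threshold are hypothetical values, and the description does not say which thresholds were used. Word segmentation and stopword filtering are omitted to keep the example self-contained; in practice they would be applied with jieba as noted above.

```python
import re
from typing import Iterable, List

# ASCII-only URL pattern so that Chinese text directly after a link is kept.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_#~:+-]+")
WS_RE = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Replace URLs with a placeholder and collapse whitespace."""
    text = URL_RE.sub("<URL>", text)
    return WS_RE.sub(" ", text).strip()


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_near_duplicate(a: str, b: str, max_ratio: float = 0.1) -> bool:
    """True if the edit distance is at most max_ratio of the longer text."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= max_ratio


def clean(texts: Iterable[str], max_len: int = 500,
          max_ratio: float = 0.1) -> List[str]:
    """Normalize texts, drop empty or overlong ones, and drop near-duplicates."""
    kept: List[str] = []
    for raw in texts:
        text = normalize(raw)
        if not text or len(text) > max_len:
            continue  # anomalous sample: empty or excessively long
        if any(is_near_duplicate(text, k, max_ratio) for k in kept):
            continue  # near-duplicate of a text that was already kept
        kept.append(text)
    return kept
```

The pairwise comparison in `clean` is quadratic in the number of texts, so at the scale of millions of comments it would need blocking (for example, comparing only within one user's posts) or an approximate method.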
5. Dataset Characteristics
Historical Information Aggregation: Integrating user historical behavior through batch processing
Genuine Prediction Task: Excluding users' direct comments on the target event to ensure the model performs true prediction tasks
Temporal Pattern Preservation: Maintaining the chronological order of each user's Weibo posts to capture emotional change trends
Diverse Emotion Labels: Covering various emotional responses to social events
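
The aggregation, exclusion, and ordering characteristics above can be sketched as a single batching function. This is a hedged illustration rather than the authors' implementation: the function name `build_samples`, the keyword-based test for direct comments on the target event, and the choice to drop a trailing batch with fewer than 50 posts are all assumptions, since the description does not specify how these cases are handled.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

BATCH_SIZE = 50  # posts per training sample, as stated in the description


def build_samples(
    posts: Iterable[Tuple[Any, str]],
    label: str,
    event_keywords: Sequence[str],
    batch_size: int = BATCH_SIZE,
) -> List[Dict[str, Any]]:
    """Turn one user's (timestamp, text) posts into fixed-size samples.

    Posts mentioning any event keyword are excluded (an assumed heuristic
    for "direct comments on the target event"), the rest are sorted by
    timestamp, and every full batch receives the same user-level label.
    """
    history = [
        (ts, text)
        for ts, text in posts
        if not any(kw in text for kw in event_keywords)
    ]
    history.sort(key=lambda p: p[0])  # chronological order

    samples = []
    # Assumption: an incomplete trailing batch is dropped.
    for start in range(0, len(history) - batch_size + 1, batch_size):
        batch = history[start:start + batch_size]
        samples.append({"posts": batch, "label": label})
    return samples
```

Under these assumptions, a user with 115 eligible historical posts would yield two samples carrying the same label, with the 15 most recent posts left unused.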
6. Usage Recommendations
Suitable for user behavior prediction, sentiment analysis, social event research, and related fields
Can be used to train and evaluate the emotion prediction capabilities of large language models
Can support research on public opinion evolution and emotional propagation mechanisms in social events
7. Data Ethics Statement
Data collection complies with platform regulations. The data has been anonymized to remove information that could involve personal privacy and is intended for academic research purposes only.

Dataset Files

Files have not been uploaded for this dataset