Datasets
Standard Dataset
InnoStock
- Submitted by: chang zong
- Last updated: Sun, 12/01/2024 - 02:11
- DOI: 10.21227/04k3-4667
Abstract
InnoStock targets stock price movement prediction for newly formed technology companies listed on China's Sci-Tech Innovation Board, and aggregates their financial news from various online platforms. Its stock prices were originally collected from CSMAR (https://cn.gtadata.com). To support multimodal input for each stock, we further collect the industrial sector relationships among stocks and build knowledge graphs from them. We label each stock according to the rate of increase of its adjusted closing price to support a fine-grained prediction task, in which each stock receives one of three movement labels (up, flat, or down). We chronologically partitioned each dataset into training, validation, and testing subsets for machine learning purposes.
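The labeling and chronological partitioning described above can be sketched as follows. This is only an illustration, not the procedure used to build InnoStock: the threshold that separates "flat" from "up" and "down", the split fractions, and the function names `movement_labels` and `chronological_split` are all assumptions, since the abstract does not state them. Only the label codes (0 as up, 1 as flat, 2 as down) follow the dataset description.

```python
from typing import List, Sequence, Tuple

# Label codes as described for trend_label.pkl.
UP, FLAT, DOWN = 0, 1, 2


def movement_labels(adj_close: Sequence[float], threshold: float = 0.01) -> List[int]:
    """Label each step by the rate of increase of the adjusted closing price.

    `threshold` is a hypothetical cut-off; the dataset's real value is not documented here.
    """
    labels = []
    for prev, curr in zip(adj_close, adj_close[1:]):
        rate = (curr - prev) / prev
        if rate > threshold:
            labels.append(UP)
        elif rate < -threshold:
            labels.append(DOWN)
        else:
            labels.append(FLAT)
    return labels


def chronological_split(
    items: Sequence, train_frac: float = 0.7, val_frac: float = 0.15
) -> Tuple[Sequence, Sequence, Sequence]:
    """Split a time-ordered sequence without shuffling (fractions are assumed)."""
    n = len(items)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return items[:train_end], items[train_end:val_end], items[val_end:]
```

Keeping the split strictly ordered in time means the validation and test subsets contain only observations that come after the training period, which avoids leaking future prices into training.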
- stock_price.pkl, the pickle file of stock prices: a list of lists, where each inner list gives the price of a particular stock (stock ID) at a specific timestamp (date).
- trend_label.pkl, the pickle file of stock movement trend labels: a list of lists, where each inner list is the label sequence of one stock, with 0 as up, 1 as flat, and 2 as down.
- node_init_emb.pkl, the pickle file of stock features for learning with the stock knowledge graph: a dictionary, where each item is a 768-dimensional embedding representing the semantic meaning of all indicators related to a stock company.
- doc_input.pkl, the pickle file of stock documents for text modality learning: a list of lists, where each inner list contains the title of a news item about the stock and its publication date.
- graph_input.pkl, the pickle file of stock knowledge graphs for graph modality learning: a list of NetworkX objects, where each object contains all nodes and edges of a particular relationship among stocks.
- indicator_input.pkl, the pickle file of stock indicator sequences for time-series modality learning: a list of lists, where each inner list contains the key indicator values of a particular stock over a time span.
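A minimal sketch for loading the files listed above with Python's standard `pickle` module is given below. The helper names `load_innostock` and `describe` and the directory layout are assumptions for illustration, not part of the dataset. Unpickling graph_input.pkl presumably requires the `networkx` package to be installed, and pickle files should only be loaded from a trusted source.

```python
import pickle
from pathlib import Path

# File names as listed in the dataset description.
FILES = [
    "stock_price.pkl",
    "trend_label.pkl",
    "node_init_emb.pkl",
    "doc_input.pkl",
    "graph_input.pkl",
    "indicator_input.pkl",
]


def load_innostock(data_dir):
    """Load every listed pickle file found in `data_dir` into a dict keyed by file stem."""
    data = {}
    for name in FILES:
        path = Path(data_dir) / name
        if not path.exists():
            continue  # skip files that were not downloaded
        with path.open("rb") as f:
            data[path.stem] = pickle.load(f)
    return data


def describe(obj):
    """Return a one-line summary of a loaded container (list or dict)."""
    return f"{type(obj).__name__} with {len(obj)} items"
```

For example, `data = load_innostock("path/to/InnoStock")` followed by `describe(data["stock_price"])` gives a quick check of the top-level structure before inspecting individual entries; the path here is a placeholder.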