ALIN Open Dataset for Math Adaptive Learning
- Submitted by: Chengwei Huang
- Last updated: Thu, 03/09/2023 - 04:33
# Student Test Results Prediction based on Learning Behavior: Learning Beyond Tests
Dataset Part A: The goal is to predict test results, expressed as the averaged correctness and the averaged time spent in the test, based only on the learning history (learning behavior records).
Dataset Part B: The goal is to predict the results of the last test, namely its points and scores, based on the learning behavior records and the first test results.
# About the dataset
The raw data is provided by ALIN.ai, where a large number of students took part in online math learning and tests.
Features are constructed from the raw data by applying statistical functionals to the backend data sheets in which learning behavior is recorded, such as the 'points earned' in a 'learning session'.
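As a rough illustration of what applying statistical functionals to session records can look like, the sketch below collapses a variable number of per-session values into a fixed-length feature set per student. The record layout, the feature names, and the choice of functionals (mean, standard deviation, min, max, count) are assumptions made for this example, not the actual ALIN feature definitions.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical session records: (student_id, points_earned) per learning session.
sessions = [
    ("s1", 10.0),
    ("s1", 20.0),
    ("s1", 30.0),
    ("s2", 5.0),
]


def session_functionals(records):
    """Aggregate per-session values into fixed-length per-student features."""
    by_student = defaultdict(list)
    for student_id, points in records:
        by_student[student_id].append(points)

    features = {}
    for student_id, values in by_student.items():
        features[student_id] = {
            "points_mean": mean(values),
            "points_std": pstdev(values),  # population std; 0.0 for one session
            "points_min": min(values),
            "points_max": max(values),
            "n_sessions": len(values),
        }
    return features


features = session_functionals(sessions)
```

Each student ends up with the same number of features regardless of how many sessions were recorded, which is what makes the result usable as input to a tabular regressor.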
For Part A, the final cohort built from this learning scenario consists of the prediction targets (the averaged correctness and the averaged time spent) and the input features (43 dimensions).
The input features consist of two parts: 5 dimensions describe the test itself, e.g. its difficulty, and the remaining 38 dimensions cover the other features.
For Part B, the dataset contains the results of the first and the last tests, as well as the learning behavior records between the two tests.
The grains of the dataset are, from coarse to fine: test, sequence, topic, and problem.
The features extracted from the dataset are based on the sequence grain, such as the number of problems in each sequence of the first test.
The target is to predict the points of each sequence and the total score of the last test.
## github link:
Model_A (based on Dataset_A): The baseline predictor is built with an XGBoost regressor. An accuracy above 80% was observed on the averaged correctness prediction, while an accuracy of only 10% was observed on the averaged time spent prediction.
LightGBM, RandomForest, and DNN models are also implemented for comparison.
Model_B (based on Dataset_B): The baseline assumes that the points and scores of the last test are equal to those of the first test.
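The Model_B baseline is simple enough to sketch in a few lines: the prediction for the last test is a copy of the first test result. The student IDs and scores below are made up, and mean absolute error is used here only as an assumed stand-in metric, since the metric computed by the actual evaluation script is not described on this page.

```python
def baseline_predict(first_scores):
    """Predict each student's last-test score as their first-test score."""
    return dict(first_scores)


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual scores per student."""
    errors = [abs(predicted[student] - actual[student]) for student in actual]
    return sum(errors) / len(errors)


# Hypothetical scores keyed by student ID (not taken from the dataset).
first_test = {"s1": 60.0, "s2": 80.0, "s3": 70.0}
last_test = {"s1": 70.0, "s2": 75.0, "s3": 70.0}

predictions = baseline_predict(first_test)
baseline_mae = mean_absolute_error(predictions, last_test)
```

Any learned model for Part B should be compared against this kind of no-change baseline, because it shows how much of the last test result is already explained by the first one.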
# Files Description
contains the data cohort used for modeling
contains Python machine learning scripts for modeling
contains the scores of the first and last tests for each student, along with the data split the student belongs to (train/validation/test)
contains the points of each sequence of the first and last tests for each student
contains some sequence-related features of the first test for each student
contains some sequence-related features extracted from behavioral records between the first and last tests for each student
extracts normalized features from the dataset
yields the results of baseline model
yields the results of GBRT model
yields the results of regression model
evaluates the performance of the models by comparing their predictions with the ground truth
sequentially executes the process
- Student test data: student_sequence_first_test.csv (4.64 MB)
- Student behavior data: student_sequence_period_behavior.csv (16.78 MB)
- Student test data: student_sequence_test_points.csv (664.36 kB)
- Student test data: student_test_data.csv (95.98 kB)
- Student data for Model A: student_data_processed.csv (1.31 MB)
- Model A baseline machine learning models: predictExperiment.py (3.18 kB)
- Model B baseline: baseline.py (330 bytes)
- Model B script: boosting.py (3.41 kB)
- Model B script: evaluation.py (2.94 kB)
- Model B script: feature_extraction.py (7.56 kB)
- Model B script: regression.py (4.07 kB)
- Model B script: run.py (1.78 kB)
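The column names of the CSV files are not listed on this page, so a reasonable first step is to look at the header and a few rows before writing any modeling code. The helper below is a minimal sketch using only the standard library; `peek_csv` is a hypothetical name, and the in-memory sample with its column names is invented to keep the example self-contained.

```python
import csv
import io


def peek_csv(handle, n_rows=3):
    """Return the header and the first n_rows data rows from an open CSV handle."""
    reader = csv.reader(handle)
    header = next(reader, [])  # empty list if the file has no lines
    rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Stand-in for an opened file; these column names are hypothetical.
sample = io.StringIO("student_id,first_score,last_score\ns1,60,70\ns2,80,75\n")
header, rows = peek_csv(sample, n_rows=1)
```

With a downloaded file, the same helper can be called as `with open("student_test_data.csv", newline="") as f: header, rows = peek_csv(f)`.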