ALIN Open Dataset for Math Adaptive Learning

Citation Author(s):
Submitted by:
Chengwei Huang
Last updated:
Sat, 10/29/2022 - 12:42
Data Format:


# Student Test Results Prediction based on Learning Behavior: Learning Beyond Tests

Dataset Part A: The goal is to predict test results, in the form of averaged correctness and averaged time spent in the test, based only on the learning history (learning behavior records).

Dataset Part B: The goal is to predict the results of the last test (points and scores), based on the learning behavior records and the first test results.

# About the dataset

The raw data comes from a large number of students who participated in online math learning and tests.


Features are constructed from the raw data by applying statistical functionals to the backend data sheets where learning behavior is recorded, such as 'points earned' in a 'learning session'.
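As a rough illustration of what applying statistical functionals to behavior records can look like, the sketch below aggregates per-session values into one feature vector per student. The record layout, the `functional_features` helper, and the chosen functionals (mean, standard deviation, min, max, count) are assumptions for illustration only; the actual 43 features are defined by the dataset files.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical behavior records: (student_id, session_id, points_earned).
# The real backend sheets and column names are not specified here.
records = [
    ("s1", "a", 10.0),
    ("s1", "b", 20.0),
    ("s1", "c", 30.0),
    ("s2", "a", 5.0),
    ("s2", "b", 15.0),
]


def functional_features(rows):
    """Aggregate per-session values into one feature dict per student."""
    by_student = defaultdict(list)
    for student_id, _session_id, points in rows:
        by_student[student_id].append(points)

    features = {}
    for student_id, values in by_student.items():
        features[student_id] = {
            "points_mean": mean(values),
            "points_std": pstdev(values),  # population standard deviation
            "points_min": min(values),
            "points_max": max(values),
            "n_sessions": len(values),
        }
    return features


features = functional_features(records)
```

Each functional turns a variable-length list of sessions into a fixed-size value, which is what lets students with different amounts of activity share one feature layout.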

The final cohort we build from this learning scenario consists of the prediction targets (averaged correctness and averaged time spent) and the input features (43 dimensions).

The input features consist of two parts: one contains information on the test itself (5 dimensions), e.g. the difficulty; the other contains the remaining 38 dimensions.


The dataset contains the results of the first and the last tests, as well as the learning behavior records between the two tests.

The dataset has four levels of granularity, from coarse to fine: test, sequence, topic, and problem.

The features are extracted at the sequence level, such as the number of problems in each sequence of the first test.
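A sequence-level feature such as "number of problems in each sequence" could be computed along the following lines. The row layout (test, sequence, topic, problem) mirrors the granularity levels described above, but the identifiers and the `problems_per_sequence` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical problem-level rows for one student's first test:
# (test, sequence, topic, problem)
first_test_rows = [
    ("first", "seq_1", "topic_a", "p1"),
    ("first", "seq_1", "topic_a", "p2"),
    ("first", "seq_1", "topic_b", "p3"),
    ("first", "seq_2", "topic_c", "p4"),
    ("first", "seq_2", "topic_c", "p5"),
]


def problems_per_sequence(rows):
    """Count distinct problems in each sequence (sequence-level feature)."""
    problems = defaultdict(set)
    for _test, sequence, _topic, problem in rows:
        problems[sequence].add(problem)
    return {sequence: len(ids) for sequence, ids in problems.items()}


counts = problems_per_sequence(first_test_rows)
```

Collecting problems in a set keeps the count correct even if the same problem appears in more than one row.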

The target is to predict the points of each sequence and the total score of the last test.

## github link:


Model_A (based on Dataset_A): The baseline predictor is built with XGBoostRegressor. An accuracy above 80% was observed on the averaged correctness prediction, while an accuracy of only 10% was observed on the averaged timespent prediction.

LightGBM, RandomForest, and DNN models are also implemented for comparison.


Model_B (based on Dataset_B): The baseline assumes that the points and scores of the last test are equal to those of the first test.
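The Model_B baseline amounts to copying the first-test values forward as the prediction. A minimal sketch follows; the dictionary keys, the sample values, and the use of mean absolute error as the comparison metric are assumptions for illustration, not the repository's evaluation procedure.

```python
def baseline_predict(first_test):
    """Predict the last test by copying the first-test values unchanged."""
    return dict(first_test)


def mean_absolute_error(predicted, actual):
    """Average absolute difference over the keys present in `actual`."""
    keys = sorted(actual)
    return sum(abs(predicted[k] - actual[k]) for k in keys) / len(keys)


# Hypothetical per-sequence points and total score for one student.
first = {"seq_1": 3.0, "seq_2": 5.0, "total": 8.0}
last = {"seq_1": 4.0, "seq_2": 5.0, "total": 9.0}

pred = baseline_predict(first)
mae = mean_absolute_error(pred, last)
```

Any learned model for Dataset_B should beat this copy-forward error to show that the behavior records carry information beyond the first test.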


# Files Description

## student_data_processed.csv

contains data cohort used for modeling


contains python machine learning scripts for modeling


contains the scores of the first and last tests for each student, along with its data type (train/validation/test)


contains the points of each sequence of the first and last tests for each student


contains some sequence-related features of the first test for each student


contains some sequence-related features extracted from behavioral records between the first and the last tests for each student

extracts normalized features from the dataset

yields the results of baseline model

yields the results of GBRT model

yields the results of regression model

evaluates the performance of the models by comparing their predictions with the ground truth

sequentially executes the process