Original and Imputed data for Decoding contextual factors differentiating adolescents’ high, average and low digital reading performance through machine learning methods

Citation Author(s):
Yi Peng
Zhejiang University
Submitted by:
Yi Peng
Last updated:
Fri, 10/07/2022 - 03:04
DOI:
10.21227/pfv5-gm79

Abstract 

This dataset accompanies the article “Decoding contextual factors differentiating adolescents’ high, average and low digital reading performance through machine learning methods”. The article investigated the key contextual factors that jointly differentiate high and low performers, high and average performers, and low and average performers in digital reading, using two machine learning methods: support vector machine (SVM) and SVM recursive feature elimination (SVM-RFE). The Shapley additive explanations (SHAP) method was then used to interpret the model and to identify the features that have a positive or negative impact on the final prediction. The analysis used the latest release of the Programme for International Student Assessment (PISA) data, PISA 2018, and the sample comprised 276,269 15-year-old students from 38 Organisation for Economic Co-operation and Development (OECD) countries. Students were classified according to the OECD’s definition of performance groups: high performers have reading scores at Levels 5 and 6 (i.e., at or above 625.61 score points), low performers have scores at Levels 1a and 1b (i.e., below 407.47 score points), and average performers fall in between. PV1READ (the first plausible value in reading) was randomly selected to represent each student’s reading score (Hu et al., 2022; Gorostiaga & Rojo-Álvarez, 2016). The 150 independent variables were collected from the student questionnaire, the school questionnaire and the ICT familiarity questionnaire. In addition, one country-level factor, GDP per capita, was taken from the World Bank dataset (URL: https://data.worldbank.org/).
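The score cut-offs described in the abstract can be illustrated with a short Python sketch. This is only an illustration of the thresholds quoted above, not the authors' code; the function name `classify_performer` and the constant names are assumptions introduced here.

```python
# Cut-offs quoted in the abstract (PISA 2018 reading scale).
HIGH_CUTOFF = 625.61  # at or above: high performer (Levels 5 and 6)
LOW_CUTOFF = 407.47   # below: low performer


def classify_performer(pv1read: float) -> str:
    """Map a PV1READ score to 'high', 'average' or 'low' (hypothetical helper)."""
    if pv1read >= HIGH_CUTOFF:
        return "high"
    if pv1read < LOW_CUTOFF:
        return "low"
    return "average"


# Example scores, including both boundary values.
scores = [700.0, 625.61, 500.0, 407.47, 300.0]
labels = [classify_performer(s) for s in scores]
print(labels)  # ['high', 'high', 'average', 'average', 'low']
```

Note that a score exactly at 625.61 counts as high, while a score exactly at 407.47 counts as average, following the "at or above" and "less than" wording of the abstract.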

Instructions: 

The dataset includes three files, which contain the sample features of 1) the high performers and the low performers; 2) the high performers and the average performers; 3) the low performers and the average performers.
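As a rough illustration of how three pairwise groupings of this kind relate to a single labelled sample, the following Python sketch splits a list of records into the three comparisons listed above. The record layout, the `group` key and the function name `pairwise_subsets` are assumptions made for this example; the actual file names and column names in the dataset are not specified here.

```python
from typing import Dict, List, Tuple

# The three pairwise comparisons named in the instructions.
PAIRS: List[Tuple[str, str]] = [
    ("high", "low"),
    ("high", "average"),
    ("low", "average"),
]


def pairwise_subsets(
    records: List[dict], label_key: str = "group"
) -> Dict[Tuple[str, str], List[dict]]:
    """Return, for each pair, the records whose label belongs to that pair."""
    return {
        pair: [r for r in records if r[label_key] in pair]
        for pair in PAIRS
    }


# Hypothetical toy records; real rows would carry the questionnaire features.
toy = [
    {"id": 1, "group": "high"},
    {"id": 2, "group": "average"},
    {"id": 3, "group": "low"},
    {"id": 4, "group": "average"},
]
subsets = pairwise_subsets(toy)
for pair, rows in subsets.items():
    print(pair, [r["id"] for r in rows])
```

Each subset then forms a binary classification problem of the kind described in the abstract.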