Open-Ended Scrum Framework Questions in a Software Engineering Course


Abstract 

This dataset is used for the automated assessment of open-ended exam questions in the online course Introduction to Software Engineering at Constantine the Philosopher University in Nitra. The dataset originates from the Moodle Learning Management System (LMS) and includes responses to eight open-ended questions centered on fundamental terminology related to the Scrum framework, a key methodology in agile software development.

The dataset comprises 528 responses from 66 university students, alongside 800 responses generated by ten Large Language Models (LLMs) under controlled conditions. AI-generated responses were compared against student responses and against reference answers extracted from the course study materials, which were designed to serve as benchmarks for evaluating the quality and accuracy of both student and AI-generated responses.

The open-ended exam questions included:

a) Explain the term sprint.

b) Explain the term product backlog.

c) Explain the term sprint backlog.

d) Explain the term user story.

e) What phases does a software product go through in development?

f) How are these phases implemented in Scrum?

g) What do you see as the advantages of using Scrum?

h) What do you see as the disadvantages of using Scrum?

To collect AI-generated data, each of the eight questions was combined with five distinct introduction prompts and submitted to each of the ten LLMs in two iterations per session. The prompts simulated different student input styles, such as "Provide the most accurate answers" and "Write correct answers." This process yielded 800 AI-generated entries for analysis.
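
For illustration, a minimal sketch of this collection loop is shown below (8 questions × 5 prompts × 2 iterations × 10 models = 800 responses). The model names, the query_llm helper, and the record fields are assumptions for illustration only, not part of the released dataset:

    from itertools import product

    QUESTIONS = [
        "Explain the term sprint.",
        "Explain the term product backlog.",
        # ... the remaining six questions listed above
    ]
    INTRO_PROMPTS = [
        "Provide the most accurate answers.",
        "Write correct answers.",
        # ... the remaining three introduction prompts
    ]
    MODELS = ["model_a", "model_b"]  # placeholder names for the ten LLMs

    def query_llm(model: str, prompt: str) -> str:
        """Placeholder for the actual call to a given LLM's API."""
        raise NotImplementedError

    records = []
    for model, intro, question in product(MODELS, INTRO_PROMPTS, QUESTIONS):
        for iteration in (1, 2):  # two iterations per session
            answer = query_llm(model, f"{intro} {question}")
            records.append({
                "model": model,
                "prompt": intro,
                "question": question,
                "iteration": iteration,
                "answer": answer,
            })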

The dataset, gathered in autumn 2024, offers a comprehensive resource for studying the accuracy, consistency, and application of LLMs in educational settings. Analyses based on it contribute to improving the automated assessment of open-ended student responses in blended learning environments.

Instructions: 

The pre-processing of this dataset was crucial to ensure its integrity and suitability for analyzing the differences and similarities between student responses and AI-generated answers. A series of steps was implemented to refine the data and prepare it for detailed analysis.

The first step involved cleaning and normalizing the text to ensure uniformity. All symbols, unnecessary punctuation, and non-alphanumeric characters were removed from the dataset. To maintain consistency in the text format, all responses were converted to lowercase. Additionally, stop words were excluded from the analysis to focus on the most relevant terms and linguistic patterns.
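A minimal sketch of this normalization step, assuming Python with spaCy (the same en_core_web_lg model used later for feature extraction); the function name and the exact regular expression are illustrative choices, not the project's published script:

    import re
    import spacy

    nlp = spacy.load("en_core_web_lg")

    def normalize_answer(text: str) -> str:
        # Lowercase and replace non-alphanumeric characters with spaces.
        text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
        # Drop stop words and whitespace tokens, keep the remaining terms.
        tokens = [tok.text for tok in nlp(text)
                  if not tok.is_stop and not tok.is_space]
        return " ".join(tokens)

    normalize_answer("A Sprint is a short, time-boxed period of work!")
    # -> e.g. "sprint short time boxed period work"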

To provide a benchmark for evaluating the quality and accuracy of responses, the dataset was enriched with the variable ref_answer, which contained reference answers extracted from the course study materials. These reference answers were designed to align closely with the course content and served as the gold standard against which student and AI-generated responses could be measured. Ensuring the privacy of student data was a critical aspect of the pre-processing phase. To comply with data protection regulations, all identifiable information was anonymized.
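A hedged sketch of how the enrichment and anonymization could look in pandas; the column names (student_id, question_id, ref_answer), the hashing scheme, and the sample texts are assumptions, since the source does not document the exact schema or anonymization method:

    import hashlib
    import pandas as pd

    # Reference answers from the study materials (placeholder text, not the actual answers).
    reference_answers = {
        "a": "A sprint is a short, time-boxed iteration ...",
        # ... entries for questions b) to h)
    }

    def anonymize(student_id: str, salt: str = "se-course-2024") -> str:
        # Replace the identifier with an irreversible salted hash.
        return hashlib.sha256((salt + str(student_id)).encode()).hexdigest()[:12]

    responses = pd.DataFrame({
        "student_id": ["s001"],
        "question_id": ["a"],
        "answer": ["sprint short time boxed period work"],
    })
    responses["ref_answer"] = responses["question_id"].map(reference_answers)
    responses["student_id"] = responses["student_id"].apply(anonymize)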

Linguistic and structural features of the responses were extracted using spaCy's en_core_web_lg model. This allowed for an analysis of basic text characteristics, such as the frequency of different parts of speech (e.g., nouns, verbs, and adjectives) and the average length of answers. These characteristics provided valuable insights into the patterns and content of both student and GenAI responses. For instance, student answers were found to consist primarily of nouns (40%), followed by verbs (12%) and adjectives (6%), reflecting a focus on keywords and definitions. GenAI-generated answers likewise relied predominantly on nouns (36%), with a comparable distribution of the other parts of speech, indicating similar stylistic trends.
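The part-of-speech profile of a single answer could be computed roughly as follows; this is an illustrative sketch using spaCy's universal POS tags, not the authors' exact feature-extraction code, and the field names are assumptions:

    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_lg")

    def pos_profile(answer: str) -> dict:
        doc = nlp(answer)
        words = [tok for tok in doc if tok.is_alpha]
        counts = Counter(tok.pos_ for tok in words)
        total = len(words) or 1
        return {
            "n_words": len(words),                 # answer length in words
            "noun_share": counts["NOUN"] / total,  # share of nouns
            "verb_share": counts["VERB"] / total,  # share of verbs
            "adj_share": counts["ADJ"] / total,    # share of adjectives
        }

    pos_profile("A sprint is a short, time-boxed iteration in Scrum.")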

The readability of the responses was evaluated using the Coleman-Liau Index, a metric that estimates the education level required to comprehend a text. This analysis revealed that AI-generated answers were generally more complex, with readability scores ranging from 15.4 to 17.6, compared to 14.8 for the best-performing students. This observation highlighted a significant difference in the depth and complexity of responses between students and AI.
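The Coleman-Liau Index follows the standard published formula based on letters and sentences per 100 words; a small self-contained implementation is sketched below, with simple regular-expression tokenization as an assumption rather than the authors' exact procedure:

    import re

    def coleman_liau_index(text: str) -> float:
        words = re.findall(r"[A-Za-z]+", text)
        letters = sum(len(w) for w in words)
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        n_words = max(1, len(words))
        L = letters / n_words * 100    # average letters per 100 words
        S = sentences / n_words * 100  # average sentences per 100 words
        return 0.0588 * L - 0.296 * S - 15.8

    coleman_liau_index("A sprint is a fixed-length, time-boxed iteration in Scrum.")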