Labeled Dataset with extracted features

Citation Author(s):
Maliha Noushin
Islamic University of Technology
Zannatun Naim
Islamic University of Technology
Moonawara Anjum
Islamic University of Technology
Jubair Ibna
Islamic University of Technology
Submitted by:
Maliha Noushin Raida
Last updated:
Mon, 06/20/2022 - 07:00


Technical question-answering sites like Stack Overflow are gaining enormous attention from practitioners of specialized fields as a place to exchange programming knowledge. Practitioners ask questions on different topics with varying levels of difficulty and complexity, and not all practitioners have the same level of expertise to answer them. However, Stack Overflow's existing approach does not consider difficulty and filters questions by topic only. As a result, a large percentage of questions fail to attract the attention of appropriate users, leaving questions unanswered or significantly delaying response times. To address these limitations, we incorporate three models, namely TF-IDF, LDA, and Doc2Vec, to extract semantic and context-dependent features that can measure question difficulty. Each model is combined with different classifiers and additional features to classify questions by difficulty. Extensive experiments on different datasets demonstrate the effectiveness of our models, with Doc2Vec outperforming the others. We also found that the contextual features are correlated with question difficulty, and that one subset of features outperforms the others. The proposed approach can be beneficial for building an automatic tagger based on question difficulty.
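As a minimal illustration of the TF-IDF step described above, the sketch below computes TF-IDF weights in pure Python. The question titles are invented examples, not samples from this dataset, and the paper's actual pipeline may differ in tokenization and weighting details:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors for a list of tokenized documents (minimal sketch)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency normalized by document length, scaled by IDF
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors

# Hypothetical Java question titles, whitespace-tokenized
docs = [
    "how to read a file in java".split(),
    "java stream api groupingby example".split(),
    "how to sort a list in java".split(),
]
vecs = tf_idf(docs)
# "java" appears in every title, so its IDF (and TF-IDF weight) is 0
```

Terms shared by every question (like the tag word itself) get zero weight, which is why such features tend to carry topic- and difficulty-discriminating terms rather than boilerplate.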


The Stack Overflow posts/questions were extracted from the official data dump. The 2017 data dump was used because of its stability, and the study considered only Java-related posts/questions. The dataset is divided into two parts: filtered and generalized. The filtered dataset was collected from the SOQDE paper [10.1109/APSEC.2018.00059], and the generalized dataset was extracted for the comparative study and labeled using the same process.

Both datasets contain the following features:

- Id: Post/question identification number

- Title: Post/question title

- Body: Post/question full body

- Tags: Post/question tags, e.g. <java>

- ProcessedBody: Post/question textual body, i.e. only the parts enclosed in <p></p> tags

- CodeOnly: List of code snippets added to the question

- LOC: Lines of code added to the question

- QuestionLength: Word count of the processed body

- Url+ImageCount: Number of hyperlinks (href) and images in the question body

- Reputation: Questioner's reputation

- user_badge_bronze_counts: Questioner's bronze badge count

- user_badge_gold_counts: Questioner's gold badge count

- user_badge_silver_counts: Questioner's silver badge count

- view_count: Question view count

- answer_count: Number of answers to the question

- favorite_count: Number of users who marked the question as a favorite

- accept_rate: Percentage of the questioner's asked questions that have an accepted answer

- question_score: Total upvotes minus total downvotes on the question

- up_vote_count: Number of upvotes the question received for being useful and appropriate

- creation_date: Question creation date

- First_answer_date: First answer creation date

- Accepted_answer_date: Accepted answer creation date

- PostownerID: Questioner's user id

- Label: Question difficulty label (Basic, Intermediate, Advanced)
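Several of the listed features (ProcessedBody, CodeOnly, LOC, QuestionLength, Url+ImageCount) are derived from the raw Body. The dataset's exact extraction rules are not documented; the sketch below assumes code sits in <code> blocks and prose in <p> tags, and uses simple regex matching for illustration only:

```python
import re

def extract_features(body_html):
    """Derive a few of the listed features from a raw question Body (assumed rules)."""
    codes = re.findall(r"<code>(.*?)</code>", body_html, flags=re.S)
    paragraphs = re.findall(r"<p>(.*?)</p>", body_html, flags=re.S)
    processed_body = " ".join(paragraphs)
    return {
        "CodeOnly": codes,
        "LOC": sum(c.count("\n") + 1 for c in codes),
        "ProcessedBody": processed_body,
        "QuestionLength": len(processed_body.split()),
        "Url+ImageCount": body_html.count("href") + body_html.count("<img"),
    }

# Invented example body, not a sample from the dataset
body = "<p>Why does this loop hang?</p><pre><code>while(true){}\n</code></pre>"
feats = extract_features(body)
```

A production extractor would use a real HTML parser rather than regexes, but the sketch shows how each textual feature maps back to a region of the original post body.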

The dataset is provided as two Excel files:

- Filtered Dataset: 507 total samples
  - Basic labeled samples: 375
  - Intermediate labeled samples: 104
  - Advanced labeled samples: 28

- Generalized Dataset: 738 total samples
  - Basic labeled samples: 360
  - Intermediate labeled samples: 305
  - Advanced labeled samples: 73
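Both splits are heavily skewed toward Basic questions, which matters for classifier training. A quick sanity check on the counts above (the percentages are derived here, not part of the dataset):

```python
# Label counts as given in the dataset description
filtered = {"Basic": 375, "Intermediate": 104, "Advanced": 28}
generalized = {"Basic": 360, "Intermediate": 305, "Advanced": 73}

def class_percentages(counts):
    """Return each label's share of the split, as a percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# The Filtered set is ~74% Basic, so evaluation should report per-class
# metrics (or use resampling) rather than plain accuracy alone.
```

The Generalized set is more balanced between Basic and Intermediate, but Advanced remains a small minority class in both files.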