MageCode

Citation Author(s):
Hung Pham
Submitted by:
Van Tong
Last updated:
Tue, 10/01/2024 - 10:40
DOI:
10.21227/nzcz-p164
License:
0

Abstract 

We collected programming problems and their solutions from previous studies. After pre-processing, we prompted advanced LLMs, such as GPT-4, with the collected problems to produce machine-generated code, and labeled the original solutions as human-written code. Finally, we divided the dataset into training, validation, and test sets with no overlap between them: no two solutions in different sets solve the same programming problem.
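
The problem-level split described above can be illustrated with a minimal sketch. It is not the authors' actual procedure: the `problem_id` and `label` field names, the 80/10/10 fractions, and the fixed seed are all assumptions made for illustration. The idea is to shuffle and partition the problems rather than the individual solutions, so every solution to a given problem lands in exactly one set.

```python
import random
from collections import defaultdict


def split_by_problem(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Split records so that all solutions to one problem share a set.

    `records` is a list of dicts with a hypothetical "problem_id" key.
    """
    # Group solutions (human-written and machine-generated) by problem.
    by_problem = defaultdict(list)
    for rec in records:
        by_problem[rec["problem_id"]].append(rec)

    # Shuffle the problem ids, not the solutions, for a reproducible split.
    problem_ids = sorted(by_problem)
    random.Random(seed).shuffle(problem_ids)

    n_train = int(len(problem_ids) * train_frac)
    n_val = int(len(problem_ids) * val_frac)
    groups = {
        "train": problem_ids[:n_train],
        "validation": problem_ids[n_train:n_train + n_val],
        "test": problem_ids[n_train + n_val:],
    }
    # Expand each group of problem ids back into its solutions.
    return {
        name: [rec for pid in ids for rec in by_problem[pid]]
        for name, ids in groups.items()
    }


# Toy data: 10 problems, each with one human and one machine solution.
records = [
    {"problem_id": p, "label": label}
    for p in range(10)
    for label in ("human", "machine")
]
splits = split_by_problem(records)
```

Because the partition is made over problem ids, the sets are disjoint at the problem level by construction, which is the property the abstract states.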

Instructions: 
configs:
  - config_name: python
    data_files:
      - split: train
        path: python_train_set.csv
      - split: validation
        path: python_val_set.csv
      - split: test
        path: python_test_set.csv
  - config_name: java
    data_files:
      - split: train
        path: java_train_set.csv
      - split: validation
        path: java_val_set.csv
      - split: test
        path: java_test_set.csv
  - config_name: cpp
    data_files:
      - split: train
        path: cpp_train_set.csv
      - split: validation
        path: cpp_val_set.csv
      - split: test
        path: cpp_test_set.csv
task_categories:
  - text-classification
size_categories:
  - 10K<n<100K
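
The configuration above maps each language (`python`, `java`, `cpp`) and split (`train`, `validation`, `test`) to one CSV file. As a rough sketch of how the files could be read locally, the helper below rebuilds those file names and loads a split with Python's standard `csv` module. The function names `split_path` and `load_split`, the `root` directory argument, and the UTF-8 encoding are assumptions. The column names are not documented here, so rows are returned as dictionaries keyed by whatever header the file contains.

```python
import csv
from pathlib import Path

# Config names and split-to-file-name pieces, taken from the listing above.
CONFIGS = ("python", "java", "cpp")
SPLIT_SUFFIX = {"train": "train", "validation": "val", "test": "test"}


def split_path(config, split, root="."):
    """Return the CSV path for a config/split pair, e.g. python_val_set.csv."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    if split not in SPLIT_SUFFIX:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / f"{config}_{SPLIT_SUFFIX[split]}_set.csv"


def load_split(config, split, root="."):
    """Read one split into a list of dicts keyed by the CSV header row."""
    # Assumes UTF-8 encoded files located in `root`.
    with open(split_path(config, split, root), newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, `load_split("java", "test", root="data")` would read `data/java_test_set.csv`, assuming the files were downloaded into a directory named `data`.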