Abstract

We collected programming problems and their solutions from previous studies. After pre-processing, we queried advanced LLMs, such as GPT-4, with the collected problems to produce machine-generated code, while the original solutions were labeled as human-written code. Finally, the entire dataset was divided into training, validation, and test sets with no overlap among them: no two solutions in different sets solve the same programming problem.
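The no-overlap guarantee described above can be illustrated with a problem-level split. The following is a minimal sketch, not the procedure actually used to build the dataset: the `problem_id`, `code`, and `label` field names, the 80/10/10 ratios, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_problem(rows, key="problem_id", ratios=(0.8, 0.1, 0.1), seed=0):
    """Split rows so that each problem appears in exactly one split.

    `key` is a hypothetical column identifying the programming problem.
    """
    # Group all solutions (human or machine) under the problem they solve.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Shuffle the problems, not the individual solutions.
    problem_ids = sorted(groups)
    random.Random(seed).shuffle(problem_ids)

    n_train = int(len(problem_ids) * ratios[0])
    n_val = int(len(problem_ids) * ratios[1])
    parts = {
        "train": problem_ids[:n_train],
        "validation": problem_ids[n_train:n_train + n_val],
        "test": problem_ids[n_train + n_val:],
    }
    return {
        name: [row for pid in ids for row in groups[pid]]
        for name, ids in parts.items()
    }


# Toy data: 10 problems with 4 solutions each (label values are illustrative).
rows = [
    {"problem_id": p, "code": f"solution {s}", "label": s % 2}
    for p in range(10)
    for s in range(4)
]
splits = split_by_problem(rows)
```

Splitting at the problem level rather than the solution level is what keeps a model from seeing, during training, a different solution to a problem it is later evaluated on.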

Instructions:

configs:
  - config_name: python
    data_files:
      - split: train
        path: python_train_set.csv
      - split: validation
        path: python_val_set.csv
      - split: test
        path: python_test_set.csv
  - config_name: java
    data_files:
      - split: train
        path: java_train_set.csv
      - split: validation
        path: java_val_set.csv
      - split: test
        path: java_test_set.csv
  - config_name: cpp
    data_files:
      - split: train
        path: cpp_train_set.csv
      - split: validation
        path: cpp_val_set.csv
      - split: test
        path: cpp_test_set.csv
task_categories:
  - text-classification
size_categories:
  - 10K<n<100K
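
The configuration above maps each language (`python`, `java`, `cpp`) to three CSV files. As a rough sketch of how those files might be read locally, the helpers below rebuild the file names from the config and load one split with the standard library. The function names `split_paths` and `load_split` are hypothetical, and the code assumes the CSV files sit in a local directory (for example, after extracting the archive) and that each has a header row; the column names are not stated here, so none are assumed.

```python
import csv
from pathlib import Path

# Config names and file-name suffixes taken from the configuration above.
CONFIGS = ("python", "java", "cpp")
SPLIT_SUFFIXES = {"train": "train_set", "validation": "val_set", "test": "test_set"}


def split_paths(config_name, root="."):
    """Return {split: path} for one language config (hypothetical helper)."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name!r}")
    return {
        split: Path(root) / f"{config_name}_{suffix}.csv"
        for split, suffix in SPLIT_SUFFIXES.items()
    }


def load_split(config_name, split, root="."):
    """Read one split into a list of dicts keyed by the CSV header row."""
    path = split_paths(config_name, root)[split]
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Resolving paths does not touch the file system.
paths = split_paths("python")
```

With the files in place, `load_split("python", "train")` would return the training rows. If a row holds a very long source file, the `csv` module's default field size limit may need to be raised with `csv.field_size_limit`.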

Dataset Files

magecode-dataset.zip (83.56 MB)

MageCode