MageCode

Citation Author(s): Hung Pham
Submitted by: Van Tong
Last updated: Tue, 10/01/2024 - 14:40
DOI: 10.21227/nzcz-p164


Categories:

Keywords:

Machine-generated code

large language model


Abstract

We collected programming problems and their solutions from previous studies. After pre-processing, we queried advanced LLMs, such as GPT-4, with the collected problems to produce machine-generated code, and labeled the original solutions as human-written code. Finally, we divided the entire dataset into training, validation, and test sets with no overlap among them: no two solutions in different sets solve the same programming problem.
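The no-overlap condition amounts to splitting by problem rather than by individual solution. The following is a minimal sketch of that idea, not the authors' actual procedure: the `problem_id`, `code`, and `label` keys, the 80/10/10 ratios, and the function name `split_by_problem` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_problem(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split samples so that every solution to a given problem lands in one set.

    `samples` is a list of dicts carrying a hypothetical 'problem_id' key.
    """
    # Group all solutions (human-written or machine-generated) by problem.
    by_problem = defaultdict(list)
    for sample in samples:
        by_problem[sample["problem_id"]].append(sample)

    # Shuffle the problems, not the solutions, with a fixed seed.
    problem_ids = sorted(by_problem)
    random.Random(seed).shuffle(problem_ids)

    n_train = int(len(problem_ids) * ratios[0])
    n_val = int(len(problem_ids) * ratios[1])
    groups = {
        "train": problem_ids[:n_train],
        "validation": problem_ids[n_train:n_train + n_val],
        "test": problem_ids[n_train + n_val:],
    }
    # Expand each group of problems back into its solutions.
    return {
        name: [s for pid in ids for s in by_problem[pid]]
        for name, ids in groups.items()
    }


# Toy data: 10 problems, each with one human and one machine solution.
samples = [
    {"problem_id": p, "code": f"solution {i}", "label": label}
    for p in range(10)
    for i, label in enumerate(("human", "machine"))
]
splits = split_by_problem(samples)
```

Because whole problems are assigned to a set, a classifier evaluated on the test set never sees another solution to a test problem during training.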

Instructions:

configs:
- config_name: python
  data_files:
  - split: train
    path: python_train_set.csv
  - split: validation
    path: python_val_set.csv
  - split: test
    path: python_test_set.csv
- config_name: java
  data_files:
  - split: train
    path: java_train_set.csv
  - split: validation
    path: java_val_set.csv
  - split: test
    path: java_test_set.csv
- config_name: cpp
  data_files:
  - split: train
    path: cpp_train_set.csv
  - split: validation
    path: cpp_val_set.csv
  - split: test
    path: cpp_test_set.csv
task_categories:
- text-classification
size_categories:
- 10K<n<100K
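The file names in the configuration follow the pattern `<language>_<split>_set.csv`, with the validation split abbreviated to `val`. A minimal sketch of reading one split with the Python standard library is shown below. The helper names `split_path` and `load_split` and the `data_dir` argument are hypothetical, and the column names of the CSV files are not stated here, so the rows are returned as dictionaries keyed by whatever header each file contains.

```python
import csv
import os

# Split names as used in the configuration, mapped to the file-name infix.
SPLIT_INFIX = {"train": "train", "validation": "val", "test": "test"}
LANGUAGES = ("python", "java", "cpp")


def split_path(language, split):
    """Return the CSV file name for a language config and split."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language config: {language!r}")
    if split not in SPLIT_INFIX:
        raise ValueError(f"unknown split: {split!r}")
    return f"{language}_{SPLIT_INFIX[split]}_set.csv"


def load_split(data_dir, language, split):
    """Read one split into a list of dicts keyed by the CSV header row."""
    path = os.path.join(data_dir, split_path(language, split))
    # newline="" lets the csv module handle line breaks inside quoted code.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `load_split("data", "java", "validation")` would read `data/java_val_set.csv`, assuming the files were downloaded into a local `data` directory.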