Standard Dataset
MageCode
- Submitted by: Van Tong
- Last updated: Tue, 10/01/2024 - 10:40
- DOI: 10.21227/nzcz-p164
Abstract
We collected programming problems and their solutions from previous studies. After pre-processing, we prompted advanced LLMs, such as GPT-4, with the collected problems to produce machine-generated code, and labeled the original solutions as human-written code. Finally, we divided the full dataset into training, validation, and test sets with no overlap among them: all solutions to a given programming problem fall in the same set, so no problem appears in more than one set.
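The problem-level split described above can be illustrated with a minimal sketch. It is not the authors' procedure, which the abstract does not specify: the hash-based assignment, the 10% validation and test fractions, and the record field `problem_id` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def assign_split(problem_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a problem ID to a split deterministically via a hash bucket.

    Hypothetical scheme: the fractions and the hashing are illustrative only.
    """
    digest = hashlib.sha256(problem_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


def split_by_problem(records):
    """Group records so every solution to one problem lands in the same split.

    Each record is a dict with a hypothetical 'problem_id' key.
    """
    splits = defaultdict(list)
    for rec in records:
        splits[assign_split(rec["problem_id"])].append(rec)
    return splits


def problems_overlap(splits) -> bool:
    """Return True if any problem ID occurs in more than one split."""
    owner_of = {}
    for name, recs in splits.items():
        for rec in recs:
            owner = owner_of.setdefault(rec["problem_id"], name)
            if owner != name:
                return True
    return False
```

Because the split is decided per problem rather than per solution, a human-written and a machine-generated solution to the same problem always end up together, which is the property the abstract requires.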
Instructions:
configs:
  - config_name: python
    data_files:
      - split: train
        path: python_train_set.csv
      - split: validation
        path: python_val_set.csv
      - split: test
        path: python_test_set.csv
  - config_name: java
    data_files:
      - split: train
        path: java_train_set.csv
      - split: validation
        path: java_val_set.csv
      - split: test
        path: java_test_set.csv
  - config_name: cpp
    data_files:
      - split: train
        path: cpp_train_set.csv
      - split: validation
        path: cpp_val_set.csv
      - split: test
        path: cpp_test_set.csv
task_categories:
  - text-classification
size_categories:
  - 10K<n<100K
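As a rough sketch of how the configuration above maps to files on disk, the helpers below resolve the CSV file name for a language and split, then read it with the standard library. The file names come from the configuration. The function names, the `data_dir` parameter, and the assumption that each CSV has a header row are hypothetical, since the column layout is not documented here.

```python
import csv
from pathlib import Path

# Languages and split-to-suffix mapping taken from the configuration above.
LANGUAGES = ("python", "java", "cpp")
SPLIT_SUFFIX = {"train": "train", "validation": "val", "test": "test"}


def split_paths(language: str) -> dict:
    """Return {split name: CSV file name} for one language config."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language config: {language!r}")
    return {
        split: f"{language}_{suffix}_set.csv"
        for split, suffix in SPLIT_SUFFIX.items()
    }


def load_split(language: str, split: str, data_dir: str = ".") -> list:
    """Read one split into a list of dicts keyed by the CSV header row.

    Assumes the files were downloaded into data_dir and have a header line.
    """
    path = Path(data_dir) / split_paths(language)[split]
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `load_split("java", "validation")` would read `java_val_set.csv` from the current directory. Note that the split is named `validation` while the file name uses the shorter `val`.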