Testing Results from Manuscript "Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis"

Citation Author(s):
Anastasios Nikolakopoulos
Submitted by:
Tasos Nikolakopoulos
Last updated:
Thu, 11/28/2024 - 12:02
DOI:
10.21227/7xh6-p071
Data Format:
License:

Abstract 

This dataset contains the testing results presented in the manuscript "Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis", and it aims to assess offline LLMs' capabilities in code generation for data analytics tasks. The dataset is best used after a thorough reading of the manuscript. A total of 250 testing results were generated and then merged to create this dataset.

Instructions: 

Each test generated a set of information that was organized into a single testing result object. The 250 resulting objects were combined into a dataset of 250 rows, with each row corresponding to an individual test. Each row contains detailed information about the associated test, including attributes related to the correctness, readability, and execution performance of the generated code, as well as system resource monitoring metrics. Below is a detailed explanation of each attribute in the dataset:

  • Correctness: Indicates whether the generated code correctly produced the intended result based on the natural language query. Values are either `True` or `False`.  
  • Readability: Represents the human readability of the generated code, scored on a scale of 1 to 3, where 3 denotes highly readable code. These scores are calculated using a custom readability function.  
  • Code Execution Errors: Contains information about errors that occurred during code execution. If no errors occurred, the value is `None`. Otherwise, the column includes the error message explaining the malfunction.  
  • Executed Command: Contains the full code executed for each test, which may include Python comments.  
  • Code Repetition ID: Indicates the iteration number of each test, as each query was tested ten times. Values range from 1 to 10.  
  • Dataset: Specifies the dataset used for each test. Possible values are `supermarket`, `netflix`, `shared-cars-locations`, `covid19-twitter`, and `madrid-daily-weather`. Each entry refers to a test conducted on the corresponding dataset.  
  • User Query: Contains the exact query submitted by the user to the offline LLM for code generation. The corresponding generated code can be found in the "Executed Command" column.  
  • LLM Response CPU: Includes the CPU usage percentage recorded during the LLM's code generation process.  
  • LLM Response Memory: Provides the memory utilization percentage of the LLM server during code generation.  
  • LLM Response GPU: Lists the GPU usage percentages recorded on the LLM server during code generation.  
  • LLM Response GPU Memory: Contains the GPU memory utilization percentages during the LLM's code generation process.  
  • LLM Response Time: Records the response time of the LLM server, measured from the receipt of the query to the delivery of the generated response. Values are in seconds.  
  • Automated: Indicates whether the code was executed automatically (`True`) or semi-automatically with minimal human intervention (`False`).  
  • Query Number: Identifies which query each test corresponds to. Since five queries were created for each dataset, possible values are `q1`, `q2`, `q3`, `q4`, and `q5`.  
  • Query Level: Indicates the contextual complexity of each query. Possible values are `basic`, `intermediate`, or `advanced`.  

 This dataset provides a comprehensive overview of the data collected during the testing process, enabling a thorough evaluation of the system.
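For readers who want to explore the results programmatically, the short sketch below shows one way to load and summarize the dataset with pandas, assuming the data is distributed as a CSV file. The file name `testing_results.csv` and the exact column spellings are assumptions based on the attribute descriptions above; adjust them to match the downloaded file.

```python
import pandas as pd

# Load the dataset. The file name "testing_results.csv" and the CSV format
# are assumptions; adjust to match the actual downloaded file.
df = pd.read_csv("testing_results.csv")

# The dataset should contain 250 rows, one per testing result object.
print(df.shape)
print(df.columns.tolist())

# Correctness is recorded as True/False; normalize it to booleans in case
# the values are read back from the file as strings.
df["Correctness"] = df["Correctness"].astype(str).str.strip().eq("True")

# Success rate per source dataset and per query level (column names follow
# the attribute descriptions above; the spellings in the file are assumed).
print(df.groupby("Dataset")["Correctness"].mean())
print(df.groupby("Query Level")["Correctness"].mean())

# Average LLM response time, in seconds, for each of the five source datasets.
print(df.groupby("Dataset")["LLM Response Time"].mean())
```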
