Testing Results of the Manuscript Submission "Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis"
- Submitted by: Tasos Nikolakopoulos
- Last updated: Thu, 11/28/2024 - 02:38
- DOI: 10.21227/7xh6-p071
Abstract
Large Language Models (LLMs) have recently attracted considerable attention from the scientific community due to their advanced capabilities and their potential to serve as vital tools across various industries and academic fields. An important application domain for LLMs is Data Science, where they could enhance the efficiency of Data Analysis and Profiling tasks. With LLMs integrated into Data Analytics tools, end-users could issue data analysis queries directly in natural language, bypassing the need for specialized user interfaces. However, due to the sensitive nature of certain data in some organizations, relying on established, cloud-based LLMs may not be advisable. This article explores the feasibility and effectiveness of a standalone, offline LLM in generating code for performing data analytics, given a set of natural language queries. A methodology tailored to a code-specific LLM is presented, evaluating its performance in generating Python Spark code that successfully produces the desired results. The model is assessed on its efficiency and its ability to handle natural language queries of varying complexity, exploring the potential for wider adoption of offline LLMs in future data analysis frameworks and software solutions.
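For illustration, a natural language query such as "show the total sales per product line" against the `supermarket` dataset would be expected to yield PySpark code along the following lines. This is only a sketch of the kind of code the study evaluates; the file path and column names (`product_line`, `total`) are hypothetical and are not taken from the actual test suite.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example: the file path and column names are illustrative,
# not the exact ones used in the study's test datasets.
spark = SparkSession.builder.appName("nl-query-example").getOrCreate()

# Load the supermarket sales dataset.
df = spark.read.csv("supermarket_sales.csv", header=True, inferSchema=True)

# "Show the total sales per product line" translated into PySpark.
result = df.groupBy("product_line").agg(F.sum("total").alias("total_sales"))
result.show()
```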
Each test generated a set of information, which was organized into a single testing result object. A total of 250 such objects were combined into a dataset of 250 rows, with each row corresponding to an individual test. Each row contains detailed information about the associated test, including attributes related to the correctness, readability, and execution performance of the generated code, as well as system resource monitoring metrics. Below is a detailed explanation of each attribute in the dataset:

- **Correctness**: Indicates whether the generated code correctly produced the intended result based on the natural language query. Values are either `True` or `False`.
- **Readability**: Represents the human readability of the generated code, scored on a scale of 1 to 3, where 3 denotes highly readable code. These scores are calculated using a custom readability function.
- **Code Execution Errors**: Contains information about errors that occurred during code execution. If no errors occurred, the value is `None`; otherwise, the column contains the error message explaining the malfunction.
- **Executed Command**: Contains the full code executed for each test, which may include Python comments.
- **Code Repetition ID**: Indicates the iteration number of each test, as each query was tested ten times. Values range from 1 to 10.
- **Dataset**: Specifies the dataset used for each test. Possible values are `supermarket`, `netflix`, `shared-cars-locations`, `covid19-twitter`, and `madrid-daily-weather`.
- **User Query**: Contains the exact query submitted by the user to the offline LLM for code generation. The corresponding generated code can be found in the "Executed Command" column.
- **LLM Response CPU**: The CPU usage percentage recorded on the LLM server during code generation.
- **LLM Response Memory**: The memory utilization percentage of the LLM server during code generation.
- **LLM Response GPU**: The GPU usage percentages recorded on the LLM server during code generation.
- **LLM Response GPU Memory**: The GPU memory utilization percentages recorded during code generation.
- **LLM Response Time**: The response time of the LLM server, measured from receipt of the query to delivery of the generated response, in seconds.
- **Automated**: Indicates whether the code was executed automatically (`True`) or semi-automatically with minimal human intervention (`False`).
- **Query Number**: Identifies which query each test corresponds to. Since five queries were created for each dataset, possible values are `q1`, `q2`, `q3`, `q4`, and `q5`.
- **Query Level**: Indicates the contextual complexity of each query. Possible values are `basic`, `intermediate`, and `advanced`.

This dataset provides a comprehensive overview of the data collected during the testing process, enabling a thorough evaluation of the system.
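As a quick-start sketch, the snippet below shows one way to load the testing results with pandas and compute a few summary figures, assuming the dataset is available as a CSV export. The file name and the exact column spellings are assumptions based on the attribute descriptions above and may need to be adjusted to the actual file.

```python
import pandas as pd

# Assumed file name and column names; adjust to match the actual export.
df = pd.read_csv("testing_results.csv")

# Correctness may be stored as the strings "True"/"False"; normalize to booleans.
df["Correctness"] = df["Correctness"].astype(str).str.lower().eq("true")

# Share of correct generations per dataset and query level.
correct_rate = df.groupby(["Dataset", "Query Level"])["Correctness"].mean()

# Average LLM response time (seconds) per query level.
avg_response_time = df.groupby("Query Level")["LLM Response Time"].mean()

print(correct_rate)
print(avg_response_time)
```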