Standard Dataset
LLM Empowering Urban Science: The Exploration of Constructing a New Instruction Dataset
- Submitted by: Yufei Lin
- Last updated: Sat, 01/18/2025 - 21:41
- DOI: 10.21227/dcsz-9r08
Abstract
The application of large language models (LLMs) in urban planning has gained momentum, with prior research demonstrating their value in participatory planning, process streamlining, and event forecasting. This study focuses on further enhancing urban planning through the integration of more comprehensive datasets. We introduce a newly developed instruction dataset that amalgamates crucial information from several prominent urban datasets, including highD, NGSIM, the Road Networks dataset, TLC Trip data, and the Urban Flow Prediction Survey dataset. The dataset is structured with inputs, outputs, and explanations, and utilizes a combination of multiple-choice and short-answer tasks to facilitate LLM learning. For model training, LLMs such as LLaMA3, BERT, and T5 are fine-tuned and evaluated. The aim is to improve the models' performance in handling urban planning tasks and to offer practical and contextually relevant insights, thereby contributing to more effective and informed urban planning decisions.
To fine-tune language models for urban planning applications, we developed a specialized dataset of actionable, context-specific instructions. The dataset covers a wide variety of urban challenges, including zoning regulations, traffic management, sustainable development, and resource optimization. It is organized into two primary categories, multiple selection and question answering, each tailored to improve the model's ability to handle that task type.
For each category, we provide a task definition that establishes the scope and expectations. Each definition is complemented by illustrative positive and negative examples, which guide the model toward accurate, contextually relevant responses in the desired format. Each dataset entry consists of three key components: an input, an output, and an explanation. The input is structured as a context and a question; the context provides a scenario or problem description, while the question specifies the task to be addressed. The output depends on the task type: for multiple selection tasks, it is a choice among options (e.g., A, B, C, or D), whereas for question answering, it is a text segment extracted from the provided context. The explanation lays out the reasoning behind the output, enabling the model to generate answers that are not only precise but also supported by detailed logic.
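The entry structure described above can be sketched as a small Python data model. This is only an illustration, not the dataset's actual schema: the field names, the task-type labels, and the `validate` helper are assumptions made for this example, as are the fixed option set A to D and the requirement that a question-answering output appear verbatim in the context.

```python
from dataclasses import dataclass

# Hypothetical task-type labels; the published files may name these differently.
MULTIPLE_SELECTION = "multiple_selection"
QUESTION_ANSWERING = "question_answering"


@dataclass
class InstructionEntry:
    """One dataset entry: input (context + question), output, explanation."""
    task_type: str
    context: str
    question: str
    output: str
    explanation: str


def validate(entry: InstructionEntry) -> list[str]:
    """Return the problems found in an entry (empty list if none)."""
    problems = []
    if entry.task_type == MULTIPLE_SELECTION:
        # Assumption: options are always labelled A through D.
        if entry.output not in {"A", "B", "C", "D"}:
            problems.append("output must be one of A, B, C, D")
    elif entry.task_type == QUESTION_ANSWERING:
        # Assumption: the answer is a non-empty verbatim span of the context.
        if not entry.output or entry.output not in entry.context:
            problems.append("output must appear verbatim in the context")
    else:
        problems.append(f"unknown task type: {entry.task_type}")
    if not entry.explanation.strip():
        problems.append("explanation must not be empty")
    return problems
```

A check of this kind could be run over every entry before fine-tuning to catch outputs that do not match their task type.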
The dataset was seeded with 10 positive and negative examples covering general information about the targeted source dataset. A GPT model then augmented these seeds by rephrasing their outputs and explanations, yielding 1,500 positive and 1,500 negative examples for each source dataset. These structured examples support the development of language models that provide actionable insights and transparent reasoning for real-world urban planning scenarios.
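The seed-and-augment procedure can be sketched roughly as follows. The source does not specify the GPT prompt or API, so the `rephrase` callable is a placeholder for that model call; the function name `augment`, the dictionary keys, and the round-robin strategy over seeds are likewise assumptions for illustration.

```python
from itertools import cycle
from typing import Callable


def augment(seeds: list[dict], target: int,
            rephrase: Callable[[str], str]) -> list[dict]:
    """Grow a list of seed examples to `target` examples by rephrasing.

    Each generated example keeps the seed's other fields and rewrites only
    the output and explanation. `rephrase` stands in for a language-model
    call (hypothetical; not the authors' actual pipeline).
    """
    if not seeds:
        raise ValueError("at least one seed example is required")
    examples = list(seeds[:target])  # keep the original seeds first
    for seed in cycle(seeds):        # revisit seeds in round-robin order
        if len(examples) >= target:
            break
        examples.append({
            **seed,
            "output": rephrase(seed["output"]),
            "explanation": rephrase(seed["explanation"]),
        })
    return examples
```

Under these assumptions, calling `augment` once with the positive seeds and once with the negative seeds, each with `target=1500`, would reproduce the per-dataset counts reported above.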
Dataset Files
- highD_instructions.json (1.12 MB): for fine-tuning models on traffic analysis and autonomous vehicle insights.
- ngsim_instructions.json (1.23 MB): for training models to analyze highway traffic and optimize driving strategies.
- road_networks_instructions.json (1.15 MB): for optimizing routing and infrastructure using road network data.
- tlc_trip_instructions.json (1.40 MB): for urban mobility analysis with NYC trip records.
- urban_flow_prediction_survey_instructions.json (1.07 MB): for urban flow forecasting and planning.
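One way to inspect a downloaded file is with Python's standard `json` module. The sketch below assumes each file holds a top-level JSON array of entry objects, which the listing does not confirm; the `load_entries` helper is hypothetical and would need adjusting if the actual layout differs.

```python
import json
from pathlib import Path


def load_entries(path: str) -> list[dict]:
    """Load instruction entries from a JSON file.

    Assumes the file contains a top-level JSON array of objects
    (an assumption; the real layout may differ).
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(
            f"expected a JSON array in {path}, got {type(data).__name__}"
        )
    return data
```

For example, `load_entries("highD_instructions.json")` would return the entries of the highD file as a list of dictionaries, provided the file has been downloaded to the working directory.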