LLM | IEEE DataPort

HLS-CMDS: Heart and Lung Sounds Dataset Recorded from a Clinical Manikin using Digital Stethoscope

This dataset contains 535 recordings of heart and lung sounds captured with a digital stethoscope from a clinical manikin. It comprises 50 individual heart sounds, 50 individual lung sounds, and 145 mixed sounds; for each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. The recordings cover different anatomical chest locations and include both normal and abnormal sounds.
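The stated total can be checked against the per-category counts. This is a minimal sketch: the dictionary keys are illustrative labels chosen here, not the dataset's actual file or folder names.

```python
# Recording counts as stated in the dataset description.
# Key names are illustrative labels, not actual file or folder names.
counts = {
    "heart_only": 50,
    "lung_only": 50,
    "mixed": 145,
    "mixed_source_heart": 145,  # one source heart recording per mixed sound
    "mixed_source_lung": 145,   # one source lung recording per mixed sound
}

total = sum(counts.values())
print(total)  # 535
```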

Urban Mobility Research Dataset (Generated with the Quanser Interactive Lab)

Fair Use for Academic Research: If you use this dataset, please cite the following paper to ensure proper attribution:

M. A. Onsu, P. Lohan, B. Kantarci, A. Syed, M. Andrews, S. Kennedy, "Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring," 30th IEEE Symposium on Computers and Communications (ISCC), July 2025, Bologna, Italy.

Preprint available here: https://arxiv.org/pdf/2502.11304

Forbes Billionaire dataset

The Forbes 2022 Billionaires List dataset contains information about the world's wealthiest individuals, including their net worth, industry, country, and key business ventures. The dataset provides structured details such as rankings, company associations, and financial status, making it useful for various NLP tasks like table-to-text generation, entity recognition, and financial analysis.

LLM Empowering Urban Science: The Exploration of Constructing a New Instruction Dataset

The application of large language models (LLMs) in urban planning has gained momentum, with prior research demonstrating their value in participatory planning, process streamlining, and event forecasting. This study focuses on further enhancing urban planning through the integration of more comprehensive datasets. We introduce a newly developed instruction dataset that amalgamates crucial information from several prominent urban datasets, including highD, NGSIM, the Road Networks dataset, TLC Trip data, and the Urban Flow Prediction Survey dataset.

DragonVerseQA

DragonVerseQA is an open-domain, long-form Over-The-Top (OTT) question-answering (QA) dataset focused on the fantasy universe of the TV series "House of the Dragon" and "Game of Thrones". The curated dataset combines full episode summaries sourced from HBO and fandom wiki websites, user reviews from sources such as IMDb and Rotten Tomatoes, other high-quality, open-domain, legally admissible sources, and structured data from repositories such as Wikidata.

Categories:

Artificial Intelligence

CHVM-1K-A1

A partial sample of the CHVM-1K dataset, provided for illustration purposes.

{

"question": "What stages can be divided into in the development history of ancient Chinese bronzes? Why?",

Categories:

Artificial Intelligence

Human Thinking Data

Recent research indicates that fine-tuning smaller language models on reasoning samples generated by large language models (LLMs) can effectively improve their performance on complex reasoning tasks. However, small models fine-tuned with the existing Zero-shot-CoT method still show shortcomings in problem understanding, mathematical calculation, and logical reasoning, and they tend to skip steps when solving problems.

Categories:

Artificial Intelligence

knowledge conflict test dataset

A dataset to detect knowledge conflict.
The dataset contains 90 groups of natural-language sentences with contradictions and 10 groups without contradictions. Each group contains 5 sentences, usually 3 identical questions and 2 declarative sentences. The agent should be able to accurately detect the contradictory statements.
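One way to picture the group structure and score a detector against it is sketched below. The `SentenceGroup` class, the `detection_accuracy` helper, and the placeholder sentences are assumptions made for illustration; the dataset's actual file format is not described here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SentenceGroup:
    sentences: List[str]  # 5 sentences per group, per the description
    has_conflict: bool    # True for groups that contain a contradiction


def detection_accuracy(
    groups: List[SentenceGroup],
    detector: Callable[[List[str]], bool],
) -> float:
    """Fraction of groups where the detector's verdict matches the label."""
    if not groups:
        raise ValueError("no groups to score")
    correct = sum(detector(g.sentences) == g.has_conflict for g in groups)
    return correct / len(groups)


# Placeholder groups mirroring the stated 90/10 split; the text is dummy filler.
groups = (
    [SentenceGroup(["placeholder"] * 5, True) for _ in range(90)]
    + [SentenceGroup(["placeholder"] * 5, False) for _ in range(10)]
)


def always_conflict(sentences: List[str]) -> bool:
    """Trivial baseline that reports a conflict for every group."""
    return True


print(detection_accuracy(groups, always_conflict))  # 0.9
```

Because of the 90/10 split, a detector that always reports a conflict already scores 0.9 on plain accuracy, so it may be worth also checking results on the 10 conflict-free groups separately.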

Categories:

Artificial Intelligence

Python Program Resource Usage from Fuzzing Corpora

Resource usage fuzzing samples and related data. Contains samples from Pythoin, random data, GPT-3.5, GPT-4, Gemini-1.0, Claude Instant, and Claude Opus. These samples are generated for 50 Python functions. Also included are resource measures for CPU time, instruction count, function calls, peak RAM usage, final RAM allocated, and coverage. These values were collected on an isolated system and account for interference from other processes.
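As a rough illustration of how two of the listed measures could be collected for a single call, the sketch below uses Python's standard `time.process_time` and `tracemalloc`. This is not the dataset authors' harness: `measure` and `target` are hypothetical names, `tracemalloc` only tracks memory allocated by Python (which may differ from the RAM figures in the dataset), and instruction counts, call counts, and coverage need other tools and are omitted.

```python
import time
import tracemalloc


def measure(func, *args):
    """Return (result, cpu_seconds, peak_bytes, final_bytes) for one call."""
    tracemalloc.start()
    start = time.process_time()
    try:
        result = func(*args)
        cpu_seconds = time.process_time() - start
        # get_traced_memory() returns (current, peak) in bytes.
        final_bytes, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, cpu_seconds, peak_bytes, final_bytes


def target(data: bytes) -> int:
    """Hypothetical function under test: sorts the input bytes."""
    return len(sorted(data))


result, cpu, peak, final = measure(target, bytes(range(256)) * 100)
print(result, cpu, peak, final)
```

Tracing allocations adds overhead, so CPU time measured this way is inflated; a more careful setup would likely time and trace in separate runs.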

Dataset on RAG Pipeline Evaluation for Retrieval and Generative Response Accuracy Testing

This dataset is curated to evaluate Retrieval-Augmented Generation (RAG) pipelines on both retrieval accuracy and generative accuracy, with a particular focus on scenarios involving overlapping contexts. It comprises two primary components: Motor data and Employee data. The Motor dataset includes master data for various motor models along with their corresponding manuals, linked by the motor's model name. Similarly, the Employee dataset includes employee master data and associated policy documents, linked by department.
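The two linkages described above can be sketched as a simple key-based join. Everything concrete here is an assumption for illustration: the `link_by_key` helper, the field names (`model_name`, `department`, `text`, `name`), and the placeholder values are not taken from the dataset.

```python
from collections import defaultdict


def link_by_key(master_rows, documents, key):
    """Attach to each master row the documents that share its key value."""
    docs_by_key = defaultdict(list)
    for doc in documents:
        docs_by_key[doc[key]].append(doc["text"])
    return [
        {**row, "documents": docs_by_key.get(row[key], [])}
        for row in master_rows
    ]


# Hypothetical motor master data and manuals, linked by model name.
motors = [{"model_name": "MODEL-A"}, {"model_name": "MODEL-B"}]
manuals = [
    {"model_name": "MODEL-A", "text": "placeholder manual for MODEL-A"},
    {"model_name": "MODEL-B", "text": "placeholder manual for MODEL-B"},
]

# Hypothetical employee master data and policy documents, linked by department.
employees = [
    {"name": "EMP-1", "department": "DEPT-X"},
    {"name": "EMP-2", "department": "DEPT-X"},
]
policies = [{"department": "DEPT-X", "text": "placeholder policy for DEPT-X"}]

linked_motors = link_by_key(motors, manuals, "model_name")
linked_employees = link_by_key(employees, policies, "department")
print(linked_motors[0]["documents"])
print(linked_employees[1]["documents"])
```

With department as the key, several employee rows can resolve to the same policy document, which is one situation a retrieval step would need to handle.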

Categories:

Artificial Intelligence