
LLM

This dataset contains 535 recordings of heart and lung sounds captured with a digital stethoscope from a clinical manikin. It comprises 50 individual heart sounds, 50 individual lung sounds, and 145 mixed sounds; for each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. The recordings cover different anatomical chest locations and include both normal and abnormal sounds.
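
As a quick sanity check on the counts in this description, the per-group figures can be summed to confirm the stated total. This is only a sketch: the dictionary keys are hypothetical labels chosen for illustration and are not file or folder names from the dataset.

```python
# Recording counts as stated in the dataset description.
# The key names are illustrative labels, not names used by the dataset itself.
recording_counts = {
    "heart_only": 50,
    "lung_only": 50,
    "mixed": 145,
    "mixed_source_heart": 145,  # one source heart sound per mixed recording
    "mixed_source_lung": 145,   # one source lung sound per mixed recording
}

total_recordings = sum(recording_counts.values())
print(total_recordings)  # 535
```

The three groups of 145 account for 435 recordings, and the two groups of 50 individual sounds bring the total to 535.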


Fair Use for Academic Research: If you use this dataset, please cite the following paper to ensure proper attribution:

M. A. Onsu, P. Lohan, B. Kantarci, A. Syed, M. Andrews, S. Kennedy, "Leveraging Multimodal-LLMs Assisted by Instance Segmentation for Intelligent Traffic Monitoring," 30th IEEE Symposium on Computers and Communications (ISCC), July 2025, Bologna, Italy.

Preprint available here: https://arxiv.org/pdf/2502.11304


The Forbes 2022 Billionaires List dataset contains information about the world's wealthiest individuals, including their net worth, industry, country, and key business ventures. The dataset provides structured details such as rankings, company associations, and financial status, making it useful for various NLP tasks like table-to-text generation, entity recognition, and financial analysis.


The application of large language models (LLMs) in urban planning has gained momentum, with prior research demonstrating their value in participatory planning, process streamlining, and event forecasting. This study focuses on further enhancing urban planning through the integration of more comprehensive datasets. We introduce a newly developed instruction dataset that amalgamates crucial information from several prominent urban datasets, including highD, NGSIM, the Road Networks dataset, TLC Trip data, and the Urban Flow Prediction Survey dataset.


DragonVerseQA is an open-domain, long-form Over-The-Top (OTT) Question-Answering (QA) dataset specifically oriented to the fantasy universe of the "The House of the Dragon" and "Game of Thrones" TV series. The curated dataset combines, in a single dataset, full episode summaries sourced from HBO and fandom wiki websites, user reviews from sources like IMDb and Rotten Tomatoes, other high-quality, open-domain, legally admissible sources, and structured data from repositories like WikiData.


A partial sample of the CHVM-1K dataset, for illustration purposes.

    {

        "question": "What stages can be divided into in the development history of ancient Chinese bronzes? Why?",


Recent research indicates that fine-tuning smaller-parameter language models on reasoning samples generated by large language models (LLMs) can effectively enhance the performance of small models on complex reasoning tasks. However, after a small model is fine-tuned with the existing Zero-shot-CoT method, it still shows shortcomings in problem understanding, mathematical calculation, and logical reasoning, and it still omits steps when handling problems.


A dataset to detect knowledge conflict.

The dataset contains 90 groups of natural language sentences with contradictions and 10 groups without contradictions, each group containing 5 sentences, usually 3 identical questions and 2 declarative sentences. The Agent should be able to accurately detect the contradictory statements.


Resource usage fuzzing samples and related data. Contains samples from Pythoin, random data, GPT-3.5, GPT-4, Gemini-1.0, Claude Instant, and Claude Opus. These samples are generated for 50 Python functions. Also included are resource measures for CPU time, instruction count, function calls, peak RAM usage, final RAM allocated, and coverage. These values were collected on an isolated system and account for interference from other processes.


This dataset was curated to evaluate Retrieval-Augmented Generation (RAG) pipelines on both retrieval and generative accuracy, with a particular focus on scenarios involving overlapping contexts. The dataset comprises two primary components: Motor data and Employee data. The Motor dataset includes master data for various motor models along with their corresponding manuals, linked by the motor's model name. Similarly, the Employee dataset comprises employee master data and associated policy documents, linked by department.
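
The linking described here (master records joined to documents through a shared key) can be sketched as below. This is a minimal illustration under stated assumptions, not the dataset's actual schema or loader: the field names `model_name`, `power_kw`, and `text`, the helper `link_by_key`, and the sample rows are all hypothetical placeholders.

```python
from collections import defaultdict


def link_by_key(master_rows, documents, key):
    """Attach to each master row the texts of all documents sharing its `key` value."""
    docs_by_key = defaultdict(list)
    for doc in documents:
        docs_by_key[doc[key]].append(doc["text"])
    # Rows with no matching document get an empty list rather than being dropped.
    return [{**row, "documents": docs_by_key.get(row[key], [])} for row in master_rows]


# Placeholder rows for illustration only; these values are not taken from the dataset.
motors = [
    {"model_name": "M-100", "power_kw": 1.5},
    {"model_name": "M-200", "power_kw": 3.0},
]
manuals = [
    {"model_name": "M-100", "text": "Manual for M-100"},
    {"model_name": "M-200", "text": "Manual for M-200"},
]

linked_motors = link_by_key(motors, manuals, "model_name")
print(linked_motors[0]["documents"])
```

The same helper would apply to the Employee component by passing a department field as the key, in which case several employees would share the same policy documents. That sharing is one way overlapping contexts can arise for a retriever.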
