Kangning Dataset of Clinical Interview for Depression

Citation Author(s):
Kaining Mao, Deborah Baofeng Wang, Tiansheng Zheng, Rongqi Jiao, Yanhui Zhu, Bin Wu, Lei Qian, Wei Lyu, Jie Chen (University of Alberta), Minjie Ye
Submitted by:
Kaining Mao
Last updated:
Mon, 04/08/2024 - 19:24
DOI:
10.21227/b8rw-gb61
Research Article Link:
License:

Abstract 

 


We're excited to present a unique challenge aimed at advancing automated depression diagnosis. Traditional methods that rely on written speech or self-reported measures often fall short in real-world scenarios. To address this, we have curated a dataset of authentic clinical interviews for depression, collected at a psychiatric hospital.

The dataset includes 113 recordings (89 for training and 24 for testing), featuring 52 healthy individuals and 61 individuals diagnosed with depression. Each participant was assessed with the Chinese version of the Montgomery-Åsberg Depression Rating Scale (MADRS), and diagnoses were confirmed by psychiatry specialists.

The interviews were audio-recorded, transcribed, and annotated by experienced physicians to ensure data quality. Challenge participants are tasked with developing machine learning models that detect the presence of depression and predict its severity using audio and text features extracted from the interviews.

Join us in leveraging this groundbreaking dataset to revolutionize depression diagnosis and advance mental health care. Let's make a difference together!

Instructions: 

Dataset Description


Data Files

  • train.zip - Contains 89 clinical interview audio recordings in MP3 format.
  • train_json.zip - Provides transcriptions for the corresponding audio recordings.
  • test.zip - Contains 24 clinical interview audio recordings in MP3 format.
  • test_json.zip - Provides transcriptions for the corresponding audio recordings.
  • train.csv - Metadata for the training set, including audio filenames, participant gender, and age.
  • test.csv - Metadata for the test set, including participant IDs, gender, and age.
  • sample_submission.csv - A template for participants to submit their predictions in the correct format.

Columns

  • Participant - Participant ID
  • File_name - File name of the interview recording
  • Gender - Participant gender
  • Age - Participant age
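
As a quick start, the metadata can be read with any CSV reader; the sketch below (assuming pandas is installed, and using the column names listed above) prints a summary of the training set:

import pandas as pd

# Load the training-set metadata (one row per recording).
train_meta = pd.read_csv("train.csv")
print(train_meta.head())

# Class balance by gender and basic age statistics.
print(train_meta["Gender"].value_counts())
print(train_meta["Age"].describe())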

Transcripts

Transcripts are provided in JSON format, containing start and end times, speaker identification, and word-level details. Each transcript entry includes the segment's begin and end times (bg, ed), the one-best transcription (onebest), an si field, a speaker ID (speaker), and a list of word-level results (wordsResultList) with per-word timing, confidence, and type information.
Example transcript entry:

{
 "data": [
   {
     "bg": "240",
     "ed": "1160",
     "onebest": "都去找,",
     "si": "0",
     "speaker": "1",
     "wordsResultList": [
       {
         "alternativeList": [],
         "wc": "1.0000",
         "wordBg": "4",
         "wordEd": "38",
         "wordsName": "都",
         "wp": "n"
       },
       {
         "alternativeList": [],
         "wc": "1.0000",
         "wordBg": "39",
         "wordEd": "55",
         "wordsName": "去",
         "wp": "n"
       },
       {
         "alternativeList": [],
         "wc": "1.0000",
         "wordBg": "56",
         "wordEd": "83",
         "wordsName": "找",
         "wp": "n"
       },
       {
         "alternativeList": [],
         "wc": "0.0000",
         "wordBg": "83",
         "wordEd": "83",
         "wordsName": ",",
         "wp": "p"
       }
     ]
   },
   {
     "bg": "1600",
     "ed": "3450",
     "onebest": "嗯没有紧张。",
     "si": "0",
     "speaker": "1",
     "wordsResultList": [
       {
         "alternativeList": [],
         "wc": "1.0000",
         "wordBg": "32",
         "wordEd": "44",
         "wordsName": "嗯",
         "wp": "s"
       },
       {
         "alternativeList": [],
         "wc": "1.0000",
         "wordBg": "45",
         "wordEd": "109",
         "wordsName": "没有",
         "wp": "n"
       }
     ]
   }
 ]
}
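
To illustrate how these fields can be used, the sketch below (a hypothetical helper; bg and ed are assumed to be millisecond offsets into the recording) tallies the number of utterances and the total speaking time for each speaker code in one transcript:

import json
from collections import defaultdict

def speaking_time(json_path):
    # Load the transcript and iterate over its segments.
    with open(json_path, encoding="utf-8") as f:
        segments = json.load(f)["data"]
    totals_ms = defaultdict(int)   # total speaking time per speaker code (assumed ms)
    counts = defaultdict(int)      # number of utterances per speaker code
    for seg in segments:
        counts[seg["speaker"]] += 1
        totals_ms[seg["speaker"]] += int(seg["ed"]) - int(seg["bg"])
    return dict(counts), dict(totals_ms)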

 

The following code snippet shows one way to parse the JSON transcript files and export them to CSV:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat May 23 21:36:02 2020
@author: Kaining
"""
import csv
import json
import logging
import os


def load_json(file_path):
    """Load an iFlyTek *.json transcript and return its segment list and the subject ID."""
    with open(file_path, "r", encoding="utf-8") as file:
        data_dict = json.load(file)
    segments = data_dict["data"]
    # The subject ID is the transcript file name without its extension.
    subject_id = os.path.splitext(os.path.basename(file_path))[0]
    return segments, subject_id


# Speaker codes assigned by the recogniser; the mapping is flipped below
# depending on which speaker opens the interview.
mapping = {'1': 'Doctor', '2': 'Patient'}


def write_to_csv(dest_path, data_dict, subject_id):
    """Write the transcript segments to a *.csv file, dropping the 'si' field."""
    data_to_write = []
    # Header row: every key of the first segment except 'si'.
    data_to_write.append([key for key in data_dict[0].keys() if key != "si"])
    for i, segment in enumerate(data_dict):
        row = []
        for key in segment.keys():
            if key == "si":
                continue
            if key == "speaker":
                # The first segment decides which code is the doctor and which is the patient.
                if i == 0:
                    if segment[key] == "1":
                        mapping['1'], mapping['2'] = "Patient", "Doctor"
                    else:
                        mapping['1'], mapping['2'] = "Doctor", "Patient"
                row.append(mapping.get(segment[key], segment[key]))
            elif key == "wordsResultList":
                # Keep only the recognised words, stripped of line breaks.
                row.append([item["wordsName"].replace('\n', '').replace('\r', '')
                            for item in segment["wordsResultList"]])
            else:
                row.append(segment[key])
        data_to_write.append(row)
    with open(dest_path, 'w', encoding='utf-8', newline='') as file:
        writer = csv.writer(file, delimiter=',')
        writer.writerows(data_to_write)
    logging.info("Finished writing %s", dest_path)
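
A possible driver for the two functions above (the directory names are hypothetical) converts every extracted transcript to CSV:

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    src_dir = "train_json"  # hypothetical: folder holding the extracted *.json transcripts
    dst_dir = "train_csv"   # hypothetical: output folder for the converted *.csv files
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".json"):
            segments, subject_id = load_json(os.path.join(src_dir, name))
            write_to_csv(os.path.join(dst_dir, subject_id + ".csv"), segments, subject_id)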