Automatically extracting valuable, structured evidence from the exponentially growing clinical trial literature can help physicians practice evidence-based medicine quickly and accurately. However, current research on evidence extraction is limited by poor generalization across clinical topics and by the high cost of manual annotation. In this work, we address these challenges by constructing PICO-DS, a PICO-based evidence dataset covering five clinical topics. The dataset was labeled automatically by distant supervision, based on our proposed textual similarity algorithm, ROUGE-Hybrid. PICO-DS includes 24,909 samples, each with a corresponding PICO label. Following the PICO framework, we define four tags: P for Patient/Population/Problem, I for Intervention/Comparison, O for Outcome, and N for NA, which marks samples that belong to none of the other three categories.
The PICO-DS dataset contains three folders: meta, test, and train. Each folder contains a collection of samples in CSV format for the four categories (P, I, O, N). Samples in meta and test were annotated manually, while samples in train were generated by the distant supervision method.
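
The folder layout above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names `load_split` and `count_rows` are hypothetical, and because the individual file names and column layout are not specified here, the code simply reads every `.csv` file found in a folder and returns the raw rows (including a header row, if the files have one).

```python
import csv
from pathlib import Path

# Folder names as given in the dataset description.
SPLITS = ("meta", "test", "train")


def load_split(root, split):
    """Read every CSV file in root/split and return {file stem: list of rows}.

    Assumption: files are UTF-8 encoded, comma-separated CSV. Rows are
    returned as raw lists of strings; no header handling is attempted.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    rows_by_file = {}
    for csv_path in sorted(Path(root, split).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            rows_by_file[csv_path.stem] = list(csv.reader(handle))
    return rows_by_file


def count_rows(root):
    """Count rows per file in each split, as a quick sanity check."""
    return {
        split: {name: len(rows) for name, rows in load_split(root, split).items()}
        for split in SPLITS
    }
```

Calling `count_rows` on the directory where the dataset was unpacked gives a per-folder, per-file row count, which can be compared against the reported total of 24,909 samples once it is known whether the files contain header rows.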