Hackathon: AI for Knowledge Discovery (NLP)
Tens of thousands of research papers on the SARS-CoV-2 virus and the COVID-19 illness have flooded journals and preprint servers, with many more added every day. This publication rate leaves researchers unable to keep up with new findings and insights, even as they rely on them to inform their own work in understanding and combating the disease. In particular, research groups investigating novel drug candidates through in-silico drug discovery are in essence searching for a proverbial needle in a haystack, and so depend on scientific insights from the literature to steer their efforts in the most fruitful directions; for this primary guidance, they must navigate the current publication deluge. To keep pace with the vast and growing body of relevant literature, researchers need more capable tools to locate and filter the information they seek in these publications. To meet this pressing need, we aim to develop and apply the latest Natural Language Processing (NLP) techniques for efficiently locating and filtering relevant information from a growing dataset of COVID-related literature, helping to accelerate drug discovery research. The proposed NLP challenge task is framed around designing and building a Question-Answering (QA) system that finds answers to COVID-related questions in the scientific literature.
A practical use of NLP models on COVID-relevant papers is automated information extraction from the literature to facilitate drug discovery efforts. One of the crucial elements that can inform these efforts is knowledge about viral proteins. The goal of this data challenge is to build an NLP model that identifies answers to protein-related questions in scientific papers.
Please register at https://easychair.org/account/signin?l=pStWPAd56eImr92BxqJrMt#
Researchers at the Brookhaven National Laboratory (BNL) Computational Science Initiative (CSI) and the Oak Ridge National Laboratory (ORNL) Biophysics group have collected a question-answer dataset, annotated by a biomedical domain expert, to evaluate systems on specialized information acquisition for protein-related questions. Because the collected QA data is too small to train a full-fledged NLP model on its own, we recommend that participants leverage external resources to pre-train and fine-tune their models. The most significant of these is the COVID-19 Open Research Dataset (CORD-19), provided by AllenAI's Semantic Scholar team. Other potential resources include publicly available QA datasets and Natural Language Inference (NLI) datasets (NLI tasks determine the semantic relation between two sentences, a premise and a hypothesis):
Question answering (QA) datasets:
- The Stanford Question Answering Dataset (SQuAD): a large collection of question-answer pairs created by crowd workers on Wikipedia articles; SQuAD 2.0 contains more than 150k question-answer pairs.
- COVID-QA: a SQuAD-like dataset consisting of 2,019 COVID-related questions and answers, built for a COVID-specific QA system.
- SearchQA: more than 140k general question-answer pairs from the popular television show Jeopardy!.
- BioASQ: domain-specific data consisting of 1,504 question-answer pairs created by biomedical domain experts.
Natural Language Inference (NLI) datasets:
- Stanford NLI (SNLI) 
- Multi-Genre NLI (MultiNLI) 
- Medical NLI (MedNLI) 
- Science text entailment (SciTail) 
The BNL-curated QA validation/test datasets provided to participants include 113 question/answer pairs for the following four questions:
- What are the oligomeric states of coronavirus structural proteins?
- What are the oligomeric states of non-structural coronavirus proteins?
- What are the catalytic domains (active sites) of coronavirus proteins?
- Are there antivirals that target structural viral proteins?
The answers are sentences from COVID-related papers, labeled with one of three categories (relevant, partially relevant, or irrelevant). The QA datasets listed above may not be directly applicable to the BNL-curated datasets because the formats differ: the QA datasets pair questions with passages (contexts), and the goal is to find an answer, mostly a very short span such as a single word, within the passage. The BNL-curated datasets, on the other hand, pair questions with sentences, and the goal is to determine the relevance between them. The QA datasets can still be used to impart general QA knowledge to a model, and the NLI datasets are useful for analyzing semantic relations between sentences. This list is not exhaustive, and participants may use any other resources for model training.
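For transfer learning, span-based QA data can be recast into the query-sentence relevance format used here. Below is a minimal sketch, assuming SQuAD-style character offsets for answers and using a naive regex sentence splitter: the sentence containing the gold answer span is labeled relevant, and all other context sentences irrelevant.

```python
import re

def squad_to_relevance_pairs(question, context, answer_start):
    """Convert one span-based QA example (question, context, answer
    character offset) into (question, sentence, label) pairs: the
    sentence containing the answer span is labeled 'relevant', all
    other sentences 'irrelevant'."""
    pairs = []
    offset = 0
    # Naive sentence splitting on ./?/! followed by whitespace.
    for sent in re.split(r"(?<=[.?!])\s+", context):
        start = context.find(sent, offset)
        end = start + len(sent)
        label = "relevant" if start <= answer_start < end else "irrelevant"
        pairs.append((question, sent, label))
        offset = end
    return pairs
```

Note that this conversion yields only binary labels; the "partially relevant" class has no direct analogue in span-based QA data.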
Test references: The list of DOIs of the articles used for the QA dataset generation as well as pre-processed versions of these articles (in JSON format that is more structured and convenient to use compared to the raw PDF document) will be provided.
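The exact schema of the pre-processed JSON is documented in the repository; as a hedged sketch assuming a CORD-19-style layout (a `metadata.title` field and a `body_text` list of paragraph objects with a `text` field), an article could be loaded and split into candidate sentences like this:

```python
import json
import re

def load_article_sentences(path):
    """Load one pre-processed article and return (title, sentences).

    Assumes a CORD-19-style layout: a 'metadata' object with a 'title'
    and a 'body_text' list of paragraph objects, each carrying a 'text'
    field. The actual schema in the hackathon repo may differ, so adjust
    the field names accordingly."""
    with open(path) as f:
        doc = json.load(f)
    title = doc.get("metadata", {}).get("title", "")
    sentences = []
    for para in doc.get("body_text", []):
        # Naive sentence splitting; a tool such as spaCy or NLTK handles
        # abbreviations and inline citations more robustly.
        sentences.extend(
            s for s in re.split(r"(?<=[.?!])\s+", para["text"]) if s
        )
    return title, sentences
```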
QA data: Question-answer pairs will be provided for validating the developed NLP model (by the participant) and for assessing its performance (by the organizer).
- Validation set: the validation set consists of 54 queries and 54 sentences identified from the test references mentioned above that may provide answers to the queries. Labels (relevant, partially relevant, irrelevant) are provided for all 54 pairs. This validation set can be used by the participant to validate and optimize the performance of their NLP model during development.
- Test set: the test set consists of 59 queries and 59 sentences identified from the test references that may (or may not) provide answers to the queries. Labels are not provided for this test set, and participants are expected to predict the labels for all query-sentence pairs.
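Since labels are available for the validation set, participants can score their classifier locally before submission. One reasonable choice of metric for the three-way labels is macro-averaged F1, which weights each class equally regardless of its frequency; a self-contained sketch:

```python
def macro_f1(gold, pred,
             labels=("relevant", "partially relevant", "irrelevant")):
    """Macro-averaged F1 over the three relevance classes: per-class F1
    from true/false positives and false negatives, then a plain mean."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)
```

Plain accuracy can be misleading when the label distribution is skewed, which is why a macro average is suggested here; the organizers' official scoring metric may of course differ.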
The test references and the BNL-curated QA datasets can be obtained from the following GitHub repo: https://github.com/BNLNLP/IHS-NLP-Data-Hackathon
Evaluation for this task falls into two parts: (1) language modeling on the CORD-19 dataset, and (2) QA performance on the curated dataset.
First part (language modeling): due to the small dataset size, effective modeling of the language found in COVID-related scientific articles is essential to performing well on the QA task. Participants may choose to use a pretrained base language model (e.g., BERT or BioBERT), in which case they will receive a baseline score on the language modeling task. However, they may instead choose to domain-tune their language model of choice on the CORD-19 dataset, which, if done well, will give them an edge both by increasing their language modeling score and potentially by improving the quality of their QA model. Language modeling quality will be determined using the perplexity metric on the "Test references" described above. All participants should report the perplexity measured on each of the test references, with a detailed description of the evaluation procedure.
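For reference, perplexity is the exponentiated average negative log-likelihood per token. Given per-token (natural) log-probabilities produced by a language model, it can be computed as in the sketch below; note that for masked models such as BERT, a pseudo-perplexity over masked positions is the usual stand-in.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum(log p(token_i))): the exponentiated
    average negative log-likelihood per token. `token_log_probs` holds
    natural-log probabilities, one per token, from a language model."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

A useful sanity check: a model that assigns uniform probability over a vocabulary of size V yields perplexity exactly V.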
Second part (QA performance): an ideal QA system first ranks and retrieves articles related to a given question from the literature database, and then extracts answers from the selected articles. To evaluate their system in this setting, participants may build an end-to-end model (document retrieval plus answer extraction) and check whether the list of retrieved answers includes the answers in the test sets. However, because the test data was not generated from the entire CORD-19 collection, the articles retrieved by the participant's QA system may not contain the test articles. We therefore ask participants to perform the following tasks:
- Relevant sentence prediction: from the Test references, identify the top 3 sentences that may contain answers to the queries. Provide the top 3 sentences for each of the 59 queries in the test set, and indicate the article from which each sentence was extracted.
- QA pair label prediction: for each of the 59 question-sentence pairs in the test set, label each pair as “relevant”, “partially relevant”, or “irrelevant”. For this task, the participant will need to build a classifier that identifies whether a given sentence is relevant to a given query or not.
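A simple lexical baseline for the relevant-sentence prediction task is TF-IDF-weighted cosine similarity between the query and each candidate sentence. The following is a self-contained sketch; neural rankers built on the domain-tuned language model described above would be expected to outperform it.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def top_k_sentences(query, sentences, k=3):
    """Rank candidate sentences for a query by TF-IDF-weighted cosine
    similarity and return the top k; a purely lexical baseline."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(t for d in docs for t in d)          # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF

    def vec(counts):
        # Terms unseen in the corpus get a neutral weight of 1.0.
        return {t: c * idf.get(t, 1.0) for t, c in counts.items()}

    def cosine(a, b):
        num = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return num / (na * nb) if na and nb else 0.0

    q = vec(Counter(tokenize(query)))
    ranked = sorted(range(n), key=lambda i: cosine(q, vec(docs[i])),
                    reverse=True)
    return [sentences[i] for i in ranked[:k]]
```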
For participants who intend to build an end-to-end model, the TREC (Text REtrieval Conference) workshops can provide useful resources for finding the documents most relevant to a given query. TREC aims to support research in information retrieval and provide materials for large-scale evaluation of text retrieval methodologies. A TREC task has recently been organized around COVID-19 literature to serve researchers, clinicians, and policy makers; please refer to TREC-COVID.
 Wang, L. L. et al. (2020). CORD-19: The COVID-19 Open Research Dataset. arXiv preprint arXiv:2004.10706.
 Rajpurkar, P. et al. (2016). SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
 Möller, T. et al. (2020). COVID-QA: A question & answering dataset for COVID-19.
 Dunn, M. et al. (2017). SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
 Tsatsaronis, G. et al. (2012). BioASQ: A challenge on large-scale biomedical semantic indexing and question answering. In AAAI Fall Symposium: Information Retrieval and Knowledge Discovery in Biomedical Text.
 Bowman, S. R. et al. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
 Williams, A. et al. (2017). A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
 Shivade, C. (2019). MedNLI – a natural language inference dataset for the clinical domain (version 1.0.0). PhysioNet.
 Khot, T. et al. (2018). SciTail: A textual entailment dataset from science question answering. In AAAI, volume 17, pages 41–42.
 Devlin, J. et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT.
 Lee, J. et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4): 1234–1240.
Other relevant resources:
 Soto, C. et al. (2020). Applying Natural Language Processing (NLP) techniques on the scientific literature to accelerate drug discovery for COVID-19. F1000Research, 9.
 Acharya, A. et al. (2020). Supercomputer-based ensemble docking drug discovery pipeline with application to COVID-19. Journal of Chemical Information and Modeling, 60(12): 5832–5852.