Dataset for Disentangled Representation Learning for Interpretable Molecule Generation

Citation Author(s): Yuanqi Du (George Mason University), Xiaojie Guo (George Mason University), Amarda Shehu (George Mason University), Liang Zhao (Emory University)
Submitted by: Yuanqi Du
Last updated: Sun, 04/18/2021 - 13:18
DOI: 10.21227/vteb-j724
Data Format: SMILES string

Keywords:

Molecule Generation

Drug-like Molecules

Abstract

Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance drug discovery, biotechnology, and materials science. Computing novel small molecules with specific structural and functional properties is non-trivial, primarily due to the size, dimensionality, and multi-modality of the corresponding search space. Deep generative models that learn directly from data, without the need for domain insight, have recently provided a way forward. In particular, graph generative frameworks, which capture detailed atomic interactions through a graph-based representation of a small molecule, are showing promising results. However, these frameworks remain opaque and offer no insight into their generation process. In this paper we present a first step towards addressing this limitation by leveraging the concept of disentanglement in the graph variational autoencoder framework. We propose various disentanglement learning techniques within this framework, resulting in novel disentangled deep graph generative models, which we compare against the state of the art in graph generative deep learning for small molecule generation. We demonstrate that the models efficiently achieve the learning objective for inference and generation on variable-size graphs. Extensive qualitative and quantitative experimental evaluation demonstrates the superiority of our disentanglement framework for small molecule generation along various critical measures, such as accuracy, novelty, and disentanglement learning.

Instructions:

The training and validation files for both the QM9 and ZINC datasets are provided in JSON format and contain atoms, bonds, and SMILES strings. The best generated molecules for the QM9 and ZINC datasets are provided as SMILES strings.
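
The files described above could be read along the following lines. This is a minimal Python sketch, not the authors' loading code, and it rests on assumptions the page does not confirm: that each JSON file holds a top-level list of records, that the SMILES string is stored under a key named `smiles`, and that the generated molecules are stored one SMILES string per line. The function names are hypothetical; inspect the downloaded files and adjust the key and layout as needed.

```python
import json
from pathlib import Path


def load_molecules(path):
    """Load a JSON file and return its top-level list of records.

    Assumption: the file is a JSON list, one record per molecule.
    """
    with Path(path).open() as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(
            f"expected a JSON list at top level, got {type(data).__name__}"
        )
    return data


def collect_smiles(records, key="smiles"):
    """Return the SMILES strings stored under `key`.

    Assumption: the key name is "smiles"; records without it are skipped.
    """
    return [rec[key] for rec in records if isinstance(rec, dict) and key in rec]


def load_generated_smiles(path):
    """Read a text file of generated molecules.

    Assumption: one SMILES string per line; blank lines are ignored.
    """
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Printing the keys of the first record returned by `load_molecules` is a quick way to check the actual field names before relying on the defaults used here.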