Introducing nach0: A one-stop LLM for chemical and biomedical tasks

  • Published May 20, 2024
  • Presenting nach0, a new transformer-based large language model from researchers at Insilico Medicine and NVIDIA for solving chemical and biological tasks. The findings were published in the journal Chemical Science.
    ◾ What is nach0?
    nach0 is a multi-domain, multi-task LLM trained on natural language understanding, synthetic route prediction, and molecular generation; it works across domains to answer biomedical questions and to generate new molecules.
    ◾ How is it different?
    While other LLMs have been designed for biomedical discovery, including BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) and SciFive, they were not trained on chemical structure descriptions. Models that combine text with chemical structures, such as Galactica, have not yet been trained on a diverse set of chemical tasks.
    ◾ Where does nach0 get its data?
    nach0's dataset includes abstracts from PubMed, chemistry-related patent descriptions from the U.S. Patent and Trademark Office, and molecular structures encoded in the simplified molecular-input line-entry system (SMILES), all of which were converted into hundreds of millions of tokens.
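    For readers unfamiliar with SMILES, the sketch below shows how a SMILES string encodes a molecule. It assumes the RDKit toolkit, which is not part of nach0 but is a common way to parse such strings.

```python
# A minimal sketch, assuming the RDKit cheminformatics toolkit
# is installed (pip install rdkit). SMILES encodes a molecular
# graph as a line of text.
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin as a SMILES string
mol = Chem.MolFromSmiles(smiles)  # parse the string into a molecule
print(mol.GetNumAtoms())          # 13 heavy atoms
print(Chem.MolToSmiles(mol))      # canonical form of the same string
```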
    ◾ What are nach0's primary tasks?
    Researchers trained nach0 to perform three key tasks (see the usage sketch after this list):
    1) natural language processing, such as document classification and question answering;
    2) chemistry-related tasks, such as molecular property prediction, molecular generation, and reagent prediction;
    3) cross-domain tasks, including description-guided molecule design and molecular description generation.
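    One way such a multi-task, text-to-text model can be queried is with a plain-language prompt. Below is a minimal sketch assuming a seq2seq checkpoint is available through the Hugging Face transformers library; the model id is a hypothetical placeholder, not something stated in this announcement.

```python
# A minimal sketch of prompting a multi-task seq2seq LLM with
# Hugging Face transformers. The model id below is an assumed
# placeholder, not confirmed by the announcement.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "insilicomedicine/nach0_base"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# A cross-domain task phrased as a natural-language instruction:
# description-guided molecule design producing a SMILES string.
prompt = "Generate a SMILES string for a small-molecule analgesic."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```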
    ◾ How does nach0 utilize NVIDIA's technology?
    The training was performed with NVIDIA NeMo, an end-to-end platform for developing custom generative AI. The research team leveraged NeMo's NLP capabilities to train and evaluate the language model, and NVIDIA's memory-mapped data loader modules allowed them to manage large datasets with a small memory footprint and optimal reading speed.
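    As a concept illustration only (not NeMo's actual API), a memory-mapped dataset maps the token file into the process's address space, so each batch touches only the disk pages it needs. The file name and dtype below are illustrative assumptions.

```python
# Concept sketch of memory-mapped data loading with NumPy; NVIDIA
# NeMo's own memory-mapped dataset modules differ, but the idea is
# the same: map the file instead of reading it all into memory.
import numpy as np

# Assumed layout: a pre-tokenized corpus stored as a flat uint16 array.
tokens = np.memmap("corpus_tokens.bin", dtype=np.uint16, mode="r")

SEQ_LEN = 512

def get_sample(i: int) -> np.ndarray:
    """Slice one training sequence; only these pages are paged into RAM."""
    start = i * SEQ_LEN
    return np.asarray(tokens[start:start + SEQ_LEN])

print(get_sample(0).shape)  # (512,)
```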
    Congrats to the research team!
    Maksim Kuznetsov, Alex Zhavoronkov, Alex Aliper, Daniil Polykovskiy, Alán Aspuru-Guzik, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa
    See the paper here: pubs.rsc.org/en/content/artic...
  • Science & Technology
