Chem ML/AI/Datasets

AI-driven protein design

https://www.nature.com/articles/s44222-025-00349-8

Central to this Review is a comprehensive and actionable roadmap for designers, providing step-by-step guidance on how to integrate state-of-the-art AI tools into protein design workflows, including tools for structural and functional prediction as well as generative models for de novo design.

To illustrate this roadmap in practice, we present case studies showcasing AI-driven protein design, from engineering therapeutic proteins to designing novel proteins that unlock enzyme functions and reprogramme biomolecular systems.

Looking ahead, we outline future directions highlighting the vast potential of AI to revolutionize synthetic biology, expedite drug development and drive sustainable biotechnology, positioning it as a transformative force at the forefront of protein design.

📕Nature Reviews Bioengineering (IF=37.6)
#review

Chem ML/AI/Datasets

Intelligent understanding of spectra: from structural elucidation to property design

https://doi.org/10.1039/D4CS01293C

This review presents representative advances at the AI–spectroscopy intersection, highlighting how these approaches address challenges in spectroscopic analysis: automated spectral interpretation, efficient spectral prediction, and accurate property determination from spectroscopic fingerprints.

Beyond individual applications, we demonstrate how AI enables the development of unified spectrum–structure–property frameworks capable of predicting functional properties directly from spectral data. This integrated approach opens pathways for spectrum-guided, AI-driven inverse design of functional matter.

In addition, we emphasize the importance of model interpretability, which can illuminate the fundamental physics underlying spectrum–structure–property relationships.

Looking forward, we propose that integrating large-scale AI architectures with spectroscopic descriptors could establish universal spectrum–structure–property relationships, potentially revolutionizing chemical theory.

📕Chemical Society Reviews (IF=39.0)
#review

Chem ML/AI/Datasets

Chemprop v2: An Efficient, Modular Machine Learning Package for Chemical Property Prediction

https://doi.org/10.26434/chemrxiv-2025-4p1nr

The original chemprop release was intended for use primarily via a command line interface, rather than programmatic use via a Python API. As the field has evolved, the need for increased modularity and usability in Python-based workflows has become clear.

We have completed a ground-up rewrite of chemprop that addresses this need, providing improvements in speed, extensibility, and overall user experience. We have conducted extensive benchmarking to demonstrate algorithmic parity with the original implementation, while seeing improvements of about a factor of two in execution time and a factor of three in memory usage. chemprop v2 effectively scales to multiple GPUs, which enables training more and larger models. chemprop v2 also includes some new features.

Extensive Jupyter notebook tutorials and new documentation for all major functionality were also added. chemprop v2 preserves the predictive accuracy of its predecessor and enhances modularity, speed, and usability, empowering researchers to pursue computational molecular design more effectively.

🖥

https://github.com/chemprop/chemprop

ChemRxiv
#method

Chem ML/AI/Datasets

Colleagues, we are pleased to share our new preprint, which came out yesterday:

Machine Learning for Anticancer Activity Prediction of Transition Metal Complexes

https://doi.org/10.26434/chemrxiv-2025-1nqvm-v2

In this work we present MetalCytoToxDB, the largest database to date of cytotoxicity data for transition metal complexes:

1. Database size: 26,500 IC₅₀ values for 7,050 complexes of Ru, Ir, Rh, Re and Os against 754 cell lines, collected from 1,921 papers.

2. ML models were trained on the database. For ruthenium complexes LightGBM reached ROC-AUC = 0.81, and for iridium complexes 0.73 (cross-validation with splits grouped by DOI to avoid data leakage).

3. Temporal validation was performed: models trained on papers published before 2024 achieved ROC-AUC = 0.74 and a hit rate of 90% on new data from 2025, twice the rate of random selection.

4. A SHAP analysis was carried out for the Ru and Ir complexes.

5. A multi-metal model makes predictions possible even for metals with few examples (Rh, Re, Os), reaching ROC-AUC ~0.73–0.79.

6. A screening pipeline was developed and demonstrated on ruthenium complexes, varying ligands taken from PubChem.

7. A web application for searching complexes and exploring the database is available: https://biometaldb.streamlit.app/. Complexes can also be searched by DOI and by author.

🖥

Dataset on Zenodo
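
The DOI-grouped split mentioned in item 2 can be sketched in a few lines. This is only an illustration under stated assumptions, not the preprint's code: the toy records, the `doi` field name and the `grouped_folds` helper are hypothetical, and folds are filled greedily rather than with any particular library.

```python
from collections import defaultdict


def grouped_folds(records, n_folds, key="doi"):
    """Yield (train, test) index lists so that no group spans both sides.

    All records sharing the same `key` value (here, the DOI of the source
    paper) land in the same fold, so measurements from one paper never
    appear in both the training and the test data.
    """
    by_group = defaultdict(list)
    for idx, rec in enumerate(records):
        by_group[rec[key]].append(idx)

    # Place the largest groups first, each into the currently smallest fold,
    # so that fold sizes stay roughly balanced.
    folds = [[] for _ in range(n_folds)]
    for group in sorted(by_group, key=lambda g: (-len(by_group[g]), g)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[group])

    for f in range(n_folds):
        test = sorted(folds[f])
        train = sorted(i for g in range(n_folds) if g != f for i in folds[g])
        yield train, test


# Hypothetical toy records; DOIs and labels are made up for illustration.
records = [
    {"complex": "Ru-1", "doi": "10.0000/a", "active": 1},
    {"complex": "Ru-2", "doi": "10.0000/a", "active": 0},
    {"complex": "Ru-3", "doi": "10.0000/b", "active": 1},
    {"complex": "Ru-4", "doi": "10.0000/c", "active": 0},
    {"complex": "Ru-5", "doi": "10.0000/c", "active": 1},
    {"complex": "Ru-6", "doi": "10.0000/d", "active": 0},
]

splits = list(grouped_folds(records, n_folds=3))
for train, test in splits:
    train_dois = {records[i]["doi"] for i in train}
    test_dois = {records[i]["doi"] for i in test}
    print(sorted(test_dois), "held out; shared DOIs:", train_dois & test_dois)
```

A plain random split would let two IC₅₀ values from the same paper fall on opposite sides of the split, which tends to inflate the reported ROC-AUC.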

Chem ML/AI/Datasets

MolAI: A Deep Learning Framework for Data-Driven Molecular Descriptor Generation and Advanced Drug Discovery Applications

🔥

https://pubs.acs.org/doi/full/10.1021/acs.jcim.5c00491

This study introduces MolAI, a robust deep learning model designed for data-driven molecular descriptor generation. Utilizing a vast training data set of 221 million unique compounds, MolAI employs an autoencoder neural machine translation model to generate latent space representations of molecules.

The model demonstrated exceptional performance through extensive validation, achieving an accuracy of >99.8% in regenerating input molecules from their corresponding latent space. This study showcases the effectiveness of MolAI-driven molecular descriptors by developing an ML-based model (iLP) that accurately predicts the predominant protonation state of molecules at neutral pH.

These descriptors also significantly enhance ligand-based virtual screening and are successfully applied in a framework (iADMET) for predicting ADMET features with high accuracy. This capability of encoding and decoding molecules to and from latent space opens unique opportunities in drug discovery, structure–activity relationship analysis, hit optimization, de novo molecular generation, and training infinite machine learning models.

🖥

https://github.com/i-TripleD/MolAI-Publication

📕Journal of Chemical Information and Modeling (IF=5.3)
#method

Chem ML/AI/Datasets

Predicting Reaction Feasibility and Selectivity of Aromatic C–H Thianthrenation with a QM–ML Hybrid Approach

https://doi.org/10.1002/anie.202510533

The direct thianthrenation of aromatic C–H bonds is a valuable late-stage functionalization strategy that can assist, for example, the development of new drugs.

We herein present a predictive computational model for this reaction, denoted PATTCH, which is based on semiempirical quantum mechanics and machine learning. It classifies each C(aromatic)–H unit as either reactive or unreactive with an accuracy above 90%. It can address both the site-selectivity and the reaction-feasibility questions associated with the thianthrenation protocol.

First, this was achieved by selecting carefully engineered features, which take into account the electronic and steric influence on the site-selectivity. Second, parallel experimentation was used to supplement the available literature data with 54 new negative reactions (unsuccessful thianthrenation), which we show was instrumental for developing the PATTCH tool. Ultimately, we successfully applied the model to a challenging test set encompassing the differentiation between carbocycle versus heterocycle functionalization, the identification of substrates that were reported to result in a mixture of isomeric products, and to molecules that could not be thianthrenated. The computational predictions were experimentally validated.

The PATTCH tool can be obtained free of charge from 🖥 https://github.com/MolecularAI/thianthrenation_prediction.

📕Angewandte Chemie (IF=16.9)
#method

Chem ML/AI/Datasets

SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models🔥

https://pubs.acs.org/doi/full/10.1021/acscentsci.5c01285

In this work, we present a novel approach by fine-tuning Meta’s Llama3 Large Language Models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data and offers strong performance in both forward and bottom-up synthesis planning compared to other state-of-the-art methods.

We find that SynLlama, even without training on external building blocks, can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than those of the training data.

We also demonstrate the use of SynLlama in a pharmaceutical context for synthesis planning of analog molecules and hit expansion leads for proposed inhibitors of target proteins, offering medicinal chemists a valuable tool for discovery.

🖥

https://github.com/THGLab/SynLlama

📕ACS Central Science (IF=10.4)
#method

Chem ML/AI/Datasets

A review of machine learning methods for imbalanced data challenges in chemistry

🔥

https://doi.org/10.1039/D5SC00270B

In this review, we examine the prominent ML approaches used to tackle the imbalanced data challenge in different areas of chemistry, including resampling techniques, data augmentation techniques, algorithmic approaches, and feature engineering strategies.

Each of these methods is evaluated in the context of its application across various aspects of chemistry, such as drug discovery, materials science, cheminformatics, and catalysis. We also explore future directions for overcoming the imbalanced data challenge and emphasize data augmentation via physical models, large language models (LLMs), and advanced mathematics.

The benefit of balanced data in new material design and production and the persistent challenges are discussed. Overall, this review aims to elucidate the prevalent ML techniques applied to mitigate the impacts of imbalanced data within the field of chemistry and offer insights into future directions for research and application.
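
Of the resampling techniques the review lists, random oversampling is the simplest to show in code. The sketch below is a hedged illustration, not taken from the review: the `random_oversample` helper and the toy compound labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Hypothetical toy set: 8 inactive compounds and 2 active ones.
X = [f"mol_{i}" for i in range(10)]
y = ["inactive"] * 8 + ["active"] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Resampling of this kind is normally applied to the training split only; oversampling before splitting would copy the same molecule into both training and test data.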

📕Chemical Science (IF=7.4)
#review

Chem ML/AI/Datasets

Anomeric Selectivity of Glycosylations through a Machine Learning Lens

https://doi.org/10.1021/jacs.5c07561

Predicting the stereoselectivity of glycosylations is a major challenge in carbohydrate chemistry. Herein we show that it is possible to build machine learning models that can predict the major anomer of a glycosylation, whether the other anomer is observed as the minor product, and the anomeric ratio of the two anomers. The three models are integrated into a publicly available tool, GlycoPredictor.

From a statistical analysis of literature data, we analyze glycosylation trends and compare them to known trends in the field of carbohydrate chemistry, making it possible to elucidate a hierarchy of rules governing the stereoselectivity of glycosylations and discover promising new trends that complement expert intuition, which are tested in novel glycosylation methods.

📕Journal of the American Chemical Society (IF=15.6)
#method

Chem ML/AI/Datasets

Subgrapher: visual fingerprinting of chemical structures🔥

https://doi.org/10.1186/s13321-025-01091-4

In this work, we introduce SubGrapher, a method for the visual fingerprinting of molecule and Markush structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting fingerprints directly from images.

Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables the retrieval of molecules and Markush structures.

Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecule and Markush structure depictions. The benchmark datasets, models, and inference code are publicly available.
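
How a substructure-based fingerprint supports retrieval can be illustrated with a toy example. This is a sketch under loud assumptions: the substructure counts are written by hand (in SubGrapher they would come from image segmentation), the fingerprint is a plain count dictionary rather than the paper's representation, and `count_tanimoto` and `retrieve` are hypothetical helpers.

```python
def count_tanimoto(fp_a, fp_b):
    """Tanimoto similarity generalized to count fingerprints:
    sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    total = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


def retrieve(query_fp, library, top_k=2):
    """Rank library entries by similarity to the query fingerprint."""
    scored = [(count_tanimoto(query_fp, fp), name) for name, fp in library.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hand-written substructure counts for three well-known molecules.
library = {
    "aspirin": {"benzene": 1, "carboxylic_acid": 1, "ester": 1},
    "salicylic_acid": {"benzene": 1, "carboxylic_acid": 1, "hydroxyl": 1},
    "ethyl_acetate": {"ester": 1},
}

# Fingerprint that a (hypothetical) detector extracted from an image.
query = {"benzene": 1, "carboxylic_acid": 1, "ester": 1}
hits = retrieve(query, library)
print(hits)
```

Because retrieval only compares detected substructures, it does not require a fully reconstructed molecular graph, which is the contrast with OCSR drawn in the abstract.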

📕Journal of Cheminformatics (IF=5.7)
#method

Chem ML/AI/Datasets

Synthesis-Aware Materials Redesign via Large Language Models

https://doi.org/10.1021/jacs.5c07743

We propose a novel framework that leverages large language models (LLMs) to transform synthetically infeasible inorganic crystal structures into synthetically feasible ones. Unlike previous studies on synthesis predictions, which focus primarily on estimating synthesizability, our method provides actionable solutions for redesigning unsynthesizable materials into synthesizable ones. By integrating an invertible structural representation and an iterative fine-tuning strategy, our framework not only predicts synthetic feasibility but also modifies unsynthesizable materials into viable candidates.

As a result, we demonstrate that LLMs can effectively modify materials of various types, enhancing their synthesizability and increasing the likelihood of successful synthesis. As an indirect experimental validation, we demonstrate that 34 materials among the top 100 redesigned (but originally unsynthesizable) structures have indeed been experimentally reported in the literature. This approach addresses a critical gap between design and synthesis in materials science, and enables the discovery of experimentally realizable compounds by employing the “learn-and-regenerate” strategy in LLMs.

📕Journal of the American Chemical Society (IF=15.6)
#method

Chem ML/AI/Datasets

Machine Learning-Assisted Prediction of Ground- and Excited-State Redox Potentials in Iridium(III) Photocatalysts

https://doi.org/10.1002/anie.202517393

This study introduces a data-driven framework that combines DFT calculations with machine learning to facilitate accurate and scalable predictions of ground- and excited-state redox potentials for iridium(III) photocatalysts.

We first constructed independent models to identify key geometric and electronic descriptors governing redox behavior. Shapley additive explanations-based analyses revealed clear structure–activity relationships, offering mechanistic insights and rational guidance for tuning redox potentials. Based on these insights, we developed unified multi-output models—Model G for ground-state and Model E for excited-state redox potentials—to enable rapid, cost-effective, and high-throughput predictions. By modeling oxidation and reduction processes within a shared descriptor space, we can reduce computational overhead while maintaining high predictive accuracy.

To assess cross-metal generalizability, residual transfer learning was applied to osmium (Os) photocatalysts. Using feature-similar complexes, the resulting transfer models (G-T, E-T) achieved performance comparable to Os-only baselines, demonstrating efficient few-shot cross-metal transfer. Collectively, this study establishes an interpretable and transferable machine-learning framework for photocatalyst discovery. This framework provides a foundation for large-scale screening and rational design across diverse transition-metal platforms, accelerating advancements in photoredox catalysis, solar fuel production, and broader sustainable energy technologies.
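
One common reading of residual transfer learning is to keep the source-metal model fixed and fit a small correction model on the residuals it leaves on a few target-metal samples. The sketch below illustrates that idea only; the single descriptor, the synthetic numbers and the linear models are assumptions and do not reproduce the paper's Models G, E, G-T or E-T.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single descriptor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x


def predict(model, x):
    a, b = model
    return a * x + b


# Synthetic "source" data standing in for a well-sampled metal (e.g. Ir).
src_x = [0.0, 1.0, 2.0, 3.0, 4.0]
src_y = [1.0 + 0.5 * x for x in src_x]
source_model = fit_line(src_x, src_y)

# A few synthetic "target" samples (e.g. Os) with a shifted relationship.
tgt_x = [0.5, 2.0, 3.5]
tgt_y = [1.3 + 0.6 * x for x in tgt_x]

# Fit the correction on what the frozen source model gets wrong.
residuals = [y - predict(source_model, x) for x, y in zip(tgt_x, tgt_y)]
residual_model = fit_line(tgt_x, residuals)


def transfer_predict(x):
    """Source prediction plus the learned target-specific correction."""
    return predict(source_model, x) + predict(residual_model, x)


print(transfer_predict(1.0))
```

The correction model has only a few parameters, which is why this style of transfer can work with a handful of target examples.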

📕Angewandte Chemie (IF=16.9)
#method

Chem ML/AI/Datasets

Mapping Boryl Radical Properties and Reactivity Using Machine Learning: The B-Rad and React-B-Rad Maps

https://doi.org/10.1002/anie.202511509

Boryl radicals have become indispensable in organic synthesis, yet translating their complex steric and electronic properties into actionable reactivity insights remains challenging. Herein, we present a comprehensive classification of boryl radicals, including a publicly accessible database of 141 neutral 7e-4c boryl radicals, each parametrized by a set of electronic and steric features derived from DFT calculations.

Unsupervised machine learning (k-means clustering) and dimensionality reduction (PCA/UMAP) condense this high dimensional descriptor space into the “B-rad map”, capturing trends in sterics and electronics among the resulting five clusters. Global electrophilicity (ω) and nucleophilicity (N) indices are overlaid to create a polarity‑annotated guide, while DFT‑computed activation free energies for six benchmark reactions (HAT, radical addition, and XAT for two different substrates) yield the React‑B‑rad maps that directly link intrinsic properties to specific reaction performance. To demonstrate predictive power, supervised machine learning models (random forest) are trained on the descriptors and successfully predict radical reactivity regimes across all reaction types.

Overall, this integrated, machine-learning-driven platform can serve as both a practical guide for experimental decision-making and a foundation for data-driven discovery, paving the way towards rational design and virtual screening of boryl-radical reagents for diverse synthetic applications.
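
The clustering step behind a map of this kind can be sketched with plain k-means (Lloyd's algorithm). This is a minimal illustration, not the authors' workflow: the two descriptors and their values are invented, only two clusters are used instead of five, and the PCA/UMAP projection is omitted.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])),
            )
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Invented (steric, electronic) descriptor pairs for six hypothetical radicals.
descriptors = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]

centroids, labels = kmeans(descriptors, k=2)
print(labels)
```

In practice the descriptors would be scaled before clustering, since k-means is sensitive to the units of each feature.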

📕Angewandte Chemie (IF=16.9)
#method

Chem ML/AI/Datasets

ECloudGen: leveraging electron clouds as a latent variable to scale up structure-based molecular design

https://doi.org/10.1038/s43588-025-00886-7

Here we propose a latent variable approach that bridges the gap between ligand-only data and protein–ligand complexes, enabling target-aware generative models to explore a broader chemical space, thereby enhancing the quality of molecular generation. Inspired by quantum molecular simulations, we introduce ECloudGen, a generative model that leverages electron clouds as meaningful latent variables.

ECloudGen incorporates techniques such as latent diffusion models, Llama architectures and a contrastive learning task, which organizes the chemical space into a structured and highly interpretable latent representation.

Benchmark studies demonstrate that ECloudGen outperforms state-of-the-art methods by generating more potent binders with superior physicochemical properties and by covering a broader chemical space.

🖥

https://github.com/HaotianZhangAI4Science/ECloudGen

📕Nature Computational Science (IF=18.3)
#method

Chem ML/AI/Datasets

Nanostructured Material Design via a Retrieval-Augmented Generation (RAG) Approach: Bridging Laboratory Practice and Scientific Literature

https://doi.org/10.1021/acs.jcim.5c01897

The increasing complexity in designing nanostructured materials for electronics, biomedicine, and energy applications requires advanced computational methods to enhance research efficiency and minimize experimental costs. This study proposes an innovative agent-based retrieval-augmented generation (RAG) system integrated with large language models (LLMs) to automate the extraction and analysis of scientific information from extensive literature databases, specifically targeting nanostructured materials developed via two-photon polymerization (2PP). In addition to extracting and analyzing scientific data, our approach emphasizes understanding how these nanostructured materials interact with cells, which is crucial for controlling their application in biomedicine.

The developed platform demonstrates robust semantic accuracy (cosine similarity: 0.82) and high overall task precision (0.81), significantly reducing the likelihood of misinformation by incorporating dynamic query refinement mechanisms. The intuitive, user-friendly interface facilitates quick access to relevant scientific data, thereby improving researchers’ productivity and enabling more accurate experimental planning. Although the system exhibits certain limitations regarding domain-specific terminology coverage, further fine-tuning and specialized training are anticipated to enhance its performance and reliability for advanced scientific applications.
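
Cosine similarity, the metric quoted above, is also the usual way a RAG system ranks passages against a query. The sketch below shows the computation on made-up three-dimensional vectors; the passage names and embeddings are hypothetical, and how the paper obtained its 0.82 figure is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up embeddings for three hypothetical literature passages.
passages = {
    "2pp_resolution": [0.9, 0.1, 0.0],
    "cell_adhesion": [0.1, 0.9, 0.2],
    "battery_anodes": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.3, 0.0]

ranked = sorted(
    passages,
    key=lambda name: cosine_similarity(query_embedding, passages[name]),
    reverse=True,
)
print(ranked)
```

The top-ranked passages would then be passed to the language model as context for answering the query.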

📕Journal of Chemical Information and Modeling (IF=5.3)
#article

Chem ML/AI/Datasets

Data-Driven Discovery of Polar Organic Cocrystals: Integration of Machine Learning and Automated Screening

https://doi.org/10.1021/jacs.5c16276

Polar organic cocrystals hold significant promise for various advanced technological applications. However, their relatively low occurrence emphasizes the difficulties in achieving the desired polar packing arrangements, making their discovery complex and challenging.

Here, we introduce a data-driven method that combines machine learning (ML) with high-throughput (HT) automation to speed up the discovery of polar organic cocrystals. Using ML techniques, we identified key factors that influence polar cocrystal formation, allowing for targeted selection of molecular candidates. We examined 13 cocrystal combinations with chloranilic acid (CA), screening 20 solvent systems for each, which enabled a highly efficient search across a broad chemical space. HT automation further enhanced the synthesis and characterization by enabling rapid screening and precise structural validation, while thoroughly exploring the chemical landscape. Experimental results confirmed 13 pairs of CA cocrystals, with 6 crystallizing in polar space groups, resulting in a polar discovery rate of 46%, nearly three times higher than the average in the Cambridge Structural Database (CSD) (∼13.2%). This integrated approach offers a new strategy in polar organic cocrystal research. The findings demonstrate the potential of this method to advance functional molecular materials and pave the way for next-generation applications using polar organic cocrystals.
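
The discovery-rate figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones given in the abstract.

```python
confirmed_cocrystals = 13   # CA cocrystal pairs confirmed experimentally
polar_cocrystals = 6        # of those, crystallizing in polar space groups
csd_polar_rate = 0.132      # ~13.2% average quoted for the CSD

discovery_rate = polar_cocrystals / confirmed_cocrystals
ratio_to_csd = discovery_rate / csd_polar_rate

print(f"polar discovery rate: {discovery_rate:.1%}")   # 46.2%
print(f"ratio to CSD average: {ratio_to_csd:.1f}x")    # 3.5x
```

With only 13 confirmed cocrystals, a single additional polar or non-polar result would shift the rate by roughly eight percentage points, so the comparison with the CSD average is indicative rather than precise.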

📕Journal of the American Chemical Society (IF=15.6)
#method

Chem ML/AI/Datasets

MOF-ChemUnity: Literature-Informed Large Language Models for Metal–Organic Framework Research

https://doi.org/10.1021/jacs.5c11789

Artificial intelligence (AI) is transforming research in metal–organic frameworks (MOFs), where models trained on structured computational data routinely predict new materials and optimize their properties. This raises a central question: What if we could leverage the full breadth of MOF knowledge, not just structured data sets, but also the scientific literature? For researchers, the literature remains the primary source of knowledge, yet much of its content, including experimental data and expert insight, remains underutilized by AI systems.

We introduce MOF-ChemUnity, a structured, extensible, and scalable knowledge graph that unifies MOF data by linking literature-derived insights to crystal structures and computational data sets. By disambiguating MOF names in the literature and connecting them to crystal structures in the Cambridge Structural Database, MOF-ChemUnity unifies experimental and computational sources and enables cross-document knowledge extraction and linking. We showcase how this enables multiproperty machine learning across simulated and experimental data, compilation of complete synthesis records for individual compounds by aggregating information across multiple publications, and expert-guided materials recommendations via structure-based machine learning descriptors for pore geometry and chemistry. When used as a knowledge source to augment large language models (LLMs), MOF-ChemUnity enables a literature-informed AI assistant that operates over the full scope of MOF knowledge. Expert evaluations show improved accuracy, interpretability, and trustworthiness across tasks such as retrieval, inference of structure–property relationships, and materials recommendation, outperforming standard LLMs. This work lays the foundation for literature-informed materials discovery, enabling both scientists and AI systems to reason over the full existing knowledge in a new way.

📕Journal of the American Chemical Society (IF=15.6)
#method

Chem ML/AI/Datasets

PersADE: a database of personalized adverse drug events and their underlying molecular mechanisms

https://doi.org/10.1093/nar/gkaf1095

As a major burden on global healthcare systems, adverse drug events (ADEs) result in significant morbidity, mortality, and healthcare resource consumption. With the rapid advances in precision medicine, personalized ADEs and their molecular mechanisms are important components of drug repurposing and drug safety improvement. Thus, extensive studies have been conducted to collect valuable information on personalized ADEs, but no database has yet been available to provide such data.

In this work, PersADE, a database aiming to provide personalized drug adverse events and their molecular mechanisms, was constructed. It integrated 4 061 772 personalized drug-ADE associations, 31 756 protein-ADE associations, and 108 677 drug-protein interactions, with a particular emphasis on off-target effects.

The uniqueness of these data lies in (a) providing demographic characteristics, disease context and drug administration parameters associated with ADEs, enabling stratification of drug-ADE associations; (b) systematically integrating interactions among drugs, human proteins and ADEs, describing the mechanistic insights. Given the growing global focus on precision medicine, PersADE is highly anticipated to significantly impact studies on personalized ADEs and mechanistic explorations by providing researchers and clinicians with evidence-based tools. It is now freely accessible at: https://idrblab.org/PersADE

📕Nucleic Acids Research (IF=13.1)
#dataset


Chem ML/AI/Datasets

Domain-Trained Language Model for Inverse Design and Synthesis of High-Performance Hydrogen Storage MOFs

https://doi.org/10.1002/anie.202513366

A domain-specific large language model, MOFs-LLM, is developed to accelerate the inverse design and synthesis of metal–organic frameworks (MOFs) for hydrogen storage. Trained on 210 million tokens derived from over 6 000 MOF-related publications and 15 000 crystal structures, the model integrates chemical knowledge with structural features to improve structure–property reasoning. Compared to baseline methods, MOFs-LLM achieves a 46.7% enhancement in capturing structure–property relationships. It enables the inverse design of 60 candidate frameworks optimized for both hydrogen storage performance and synthetic accessibility.

Guided by the model, a novel MOF (Cu-LLMs-1) was synthesized in three experimental iterations, exhibiting a hydrogen uptake of 1.33 wt% at room temperature, ranking among the top five pure MOFs under comparable conditions. These findings highlight the potential of domain-trained language models to bridge virtual screening and experimental realization in materials discovery.

📕Angewandte Chemie (IF=16.9)
#method


Chem ML/AI/Datasets

MGDB: a curated database for molecular glues🔥

https://doi.org/10.1093/nar/gkaf1131

We developed MGDB, a specialized open-access repository integrating rigorously curated multidimensional data for molecular glues (MGs). MGDB contains 7396 curated MGs sourced from 162 peer-reviewed publications and 156 patents. It consolidates structural data, 9728 experimental bioactivity data points (covering degradation efficiency, binding affinity, cellular/animal activity) across 201 targets and 108 effectors, 115 296 computed physicochemical properties, and 270 785 ADMET profiles.

The database supports text-based and chemical structure-based queries and interoperability with external resources (e.g. PubChem, ChEMBL, DrugBank, UniProt, and WIPO) via hyperlinks.

By centralizing and standardizing specialized MG information, MGDB empowers researchers to rapidly explore MG research landscapes and provides high-quality datasets for artificial intelligence-driven rational therapeutic design. MGDB is freely available at http://mgdb.idruglab.cn/.

📕Nucleic Acids Research (IF=13.1)
#dataset


Chem ML/AI/Datasets

molSimplify 2.0: Improved Structure Generation for Automating Discovery in Inorganic Molecular and Reticular Chemistry🔥

https://doi.org/10.26434/chemrxiv-2025-h8gff-v2

We provide an overview of core molSimplify functionality and recent updates that enhance its capabilities for automated molecular and materials modeling. We describe the mol3D and atom3D classes, which store atomic and bonding information for a wide range of functions, including reading, modifying, and characterizing molecular geometries from common file formats. Enhancements to decoration and substructure addition functions enable systematic derivatization of template molecules.

We introduce a new mol2D class that enables graph-based uniqueness checks and substructure identification. Most importantly, we introduce improvements to transition metal complex (TMC) generation that eliminate steric clashes and enable structure building with ligands of higher denticity. Integration with machine learning models that predict coordinating atom identities enables truly high-throughput, de novo TMC generation.
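
A graph-based uniqueness check can be sketched without any chemistry toolkit: represent each molecule as a labelled graph, refine every atom label from its neighbours, and compare the resulting multisets of labels. The block below uses Weisfeiler-Lehman refinement, a generic technique chosen here for illustration. It is not molSimplify's mol2D implementation, and `wl_signature`, `atoms` and `bonds` are hypothetical names. Equal signatures are a necessary but not sufficient condition for two graphs to be identical.

```python
import hashlib
from collections import Counter


def wl_signature(atoms, bonds, rounds=3):
    """Weisfeiler-Lehman signature of a labelled molecular graph.

    atoms: list of element symbols, indexed by atom number.
    bonds: iterable of (i, j) index pairs (undirected).
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    labels = list(atoms)
    for _ in range(rounds):
        new_labels = []
        for i, label in enumerate(labels):
            # Combine the atom's own label with its neighbours' sorted labels.
            env = label + "|" + ",".join(sorted(labels[j] for j in neighbours[i]))
            new_labels.append(hashlib.sha256(env.encode()).hexdigest()[:16])
        labels = new_labels

    # Order-independent summary: the multiset of final atom labels.
    return tuple(sorted(Counter(labels).items()))


# Heavy-atom toy graphs: the same C-C-O chain in two atom orderings, and C-O-C.
chain_a = (["C", "C", "O"], [(0, 1), (1, 2)])
chain_b = (["O", "C", "C"], [(0, 1), (1, 2)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])

same = wl_signature(*chain_a) == wl_signature(*chain_b)
different = wl_signature(*chain_a) != wl_signature(*ether)
print(same, different)
```

Because the signature ignores atom ordering, deduplicating a set of generated structures reduces to keeping one structure per distinct signature, with an exact isomorphism test reserved for the rare cases where signatures coincide.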

We describe applications of molSimplify outside of isolated TMCs, including extensions to periodic systems (in particular, metal–organic frameworks) and to metalloenzymes through the protein3D class. We demonstrate our improved combined structure prediction and generation workflow by generating structures of a database of experimentally characterized Ir complexes from only the SMILES strings of their respective ligands.

We envision that recent enhancements will make the code easily extendible to other periodic materials such as covalent organic frameworks and zeolites or to multimetallic transition metal complexes.

https://molsimplify.mit.edu/

ChemRxiv
#method

