Chem ML/AI/Datasets – Telegram

Chem ML/AI/Datasets

794 subscribers

35 photos

1 video

2 files

174 links

Daily articles and news from the field of machine learning in chemistry from the researchers of IGIC RAS @chemrussia

For contact: @levkrasnov @st613laboratory @StasBezzubov

Chem ML/AI/Datasets

🎊 Our paper is finally out today:

Towards Accelerating the Discovery of Efficient Iridium(III) Emitters Using Novel Database and Machine Learning Based Only on Structural Formula

https://doi.org/10.1039/D5TC00305A

1. In this paper we compiled IrLumDB, a database of experimental data on 1287 bis-cyclometalated iridium(III) complexes and their photophysical properties: emission wavelength (λmax), photoluminescence quantum yield (PLQY), and lifetime.

2. Using IrLumDB, we trained XGBoost, LightGBM, and CatBoost models to predict λmax and PLQY, reaching MAEs of 18.26 nm and 0.13, respectively, in 10-fold cross-validation.

3. We tested the trained models on 33 complexes synthesized in our laboratory, 12 of which were prepared for this paper. The complexes were characterized by NMR, single-crystal X-ray diffraction, high-resolution mass spectrometry, and, in part, powder X-ray diffraction. Nine new structures were deposited in the CCDC.

4. For the compounds we studied, we compared the accuracy of emission-wavelength predictions from the machine learning algorithms with that of DFT calculations, and showed that the machine learning algorithms perform better.

5. Because our goal is to find new complexes with potentially high quantum yields, we split all complexes into three classes: low (0-0.1), medium (0.1-0.5), and high (0.5-1) PLQY. We then trained classification models and obtained an accuracy of 72.4% in 10-fold cross-validation.

6. We built a mini-app, IrLumDB App, so that any researcher can predict properties for their own complexes. The ligand SMILES strings are all that is needed for a prediction.

Dataset on Zenodo | IrLumDB App

📕Journal of Materials Chemistry C (IF=5.7)
#dataset #application
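
To make points 2 and 5 concrete, here is a minimal Python sketch, not the code used in the paper. It bins PLQY values into the three classes listed above and computes a 10-fold cross-validated MAE. The function names, the handling of the shared boundaries 0.1 and 0.5 (assigned to the higher class here), the interleaved fold assignment, and the mean-value baseline standing in for the gradient-boosting models are all assumptions made for illustration.

```python
from statistics import fmean
from typing import Callable, Sequence


def plqy_class(plqy: float) -> str:
    """Map a quantum yield in [0, 1] to 'low', 'medium', or 'high'."""
    if not 0.0 <= plqy <= 1.0:
        raise ValueError("PLQY must lie between 0 and 1")
    # Assumption: the boundary values 0.1 and 0.5 go to the higher class.
    if plqy < 0.1:
        return "low"
    if plqy < 0.5:
        return "medium"
    return "high"


def kfold_mae(xs: Sequence, ys: Sequence[float],
              fit: Callable[[list, list], Callable], k: int = 10) -> float:
    """Mean absolute error over k folds; every sample is tested exactly once."""
    n = len(ys)
    if n < k:
        raise ValueError("need at least k samples")
    total_abs_error = 0.0
    for fold in range(k):
        # Assumption: interleaved folds (samples fold, fold + k, ...) for brevity;
        # a real run would normally shuffle the data first.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        total_abs_error += sum(abs(predict(xs[i]) - ys[i]) for i in test_idx)
    return total_abs_error / n


def mean_baseline(train_x: list, train_y: list) -> Callable:
    """Hypothetical stand-in model: always predicts the training-set mean."""
    avg = fmean(train_y)
    return lambda x: avg
```

Swapping mean_baseline for a function that fits a real regressor and returns its prediction function would give the kind of cross-validated MAE quoted in point 2.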

🔥17❤5👍5🎉2

1.03K views · 08:41

Chem ML/AI/Datasets

Will we ever be able to accurately predict solubility?

https://doi.org/10.1038/s41597-024-03105-6

Accurate prediction of thermodynamic solubility by machine learning remains a challenge.

We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets.

We benchmarked recently published models on a novel curated solubility dataset and report poor performances. We also propose a workflow to curate aqueous solubility data, aiming to produce useful models for bench chemists.

Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources.

We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute and data sources. The herein obtained models, and quality-assessed datasets are publicly available.

📕Scientific Data (IF=5.9)
#dataset

👍7🔥5❤3

666 views · 11:05

Chem ML/AI/Datasets

A robust crystal structure prediction method to support small molecule drug development with large scale validation and blind study

https://doi.org/10.1038/s41467-025-57479-1

A very recent paper, published on March 5:

In this paper, we report a crystal structure prediction (CSP) method with state of the art accuracy and efficiency, validated on a large and diverse dataset including 66 molecules with 137 experimentally known polymorphic forms. The method combines a novel systematic crystal packing search algorithm and the use of machine learning force fields in a hierarchical crystal energy ranking.

Our method not only reproduces all the experimentally known polymorphs, but also suggests new low energy polymorphs yet to be discovered by experiment that might pose potential risks to development of the currently known forms of these compounds.

In addition, we report the prediction results of a blinded study, results for Target XXXI from the seventh CSP blind test, and demonstrate how the method can be used to accelerate clinical formulation design and derisk downstream processing.

📕Nature Communications (IF=14.7)
#method

A robust crystal structure prediction method to support small molecule drug development with large scale validation and blind study

Nature Communications - Crystal polymorphism plays an important role in pharmaceuticals, agrisciences and other industries. Here the authors present an efficient and accurate crystal structure...

🔥3❤2👍2

778 views · 08:11

Chem ML/AI/Datasets

Open-source Raman spectra of chemical compounds for active pharmaceutical ingredient development

https://www.nature.com/articles/s41597-025-04848-6

In this work, we introduce a new open-source Raman dataset consisting of pure chemical compounds commonly employed in the development of APIs. By curating and publishing this dataset, we aim to provide the scientific community with access to high-quality, reusable data.

Containing 3,510 samples spanning 32 compounds, this data can be utilised for referencing and can potentially facilitate the development of more accurate and generalisable calibration models when access to reference data is limited.

📕Scientific Data (IF=5.9)
#dataset

❤2👍1🔥1

641 views · 08:49

Chem ML/AI/Datasets

Toward a unified benchmark and framework for deep learning-based prediction of nuclear magnetic resonance chemical shifts

https://www.nature.com/articles/s43588-025-00783-z

Here we introduce NMRNet, a deep learning framework using the SE(3) Transformer for atomic environment modeling, following a pretraining and fine-tuning paradigm. To support the evaluation of nuclear magnetic resonance chemical shift prediction models, we have established a comprehensive benchmark based on previous research and databases, covering diverse chemical systems. Applying NMRNet to these benchmark datasets, we achieve competitive performance in both liquid-state and solid-state nuclear magnetic resonance datasets, demonstrating its robustness and practical utility in real-world scenarios.

📕 Nature Computational Science (IF=12.0)

#method

Toward a unified benchmark and framework for deep learning-based prediction of nuclear magnetic resonance chemical shifts

Nature Computational Science - A deep learning framework (NMRNet) is developed to model atomic environments for predicting NMR chemical shifts. A benchmark dataset, nmrshiftdb2-2024, is also...

👍5❤3🔥3

719 views · 16:00

Chem ML/AI/Datasets

🎊 We are excited to share our new and very important release

BigSolDB 2.0: a dataset of solubility values for organic compounds in organic solvents and water at various temperatures

https://doi.org/10.26434/chemrxiv-2025-nq0gr

We have prepared a new version of the largest database (to our knowledge) of the solubility of organic molecules in organic solvents: BigSolDB 2.0. It contains almost twice as much data as the first version, and the data are cleaner.

The new version contains:
- 103944 experimentally measured solubility values
- 1448 unique compounds
- 213 solvents
- data from 1595 peer-reviewed articles

The new version can be used both to train machine learning models and to look up the solubility of specific compounds directly.

For convenient browsing, we set up a website, https://bigsoldb.streamlit.app/, where molecules can be searched by molecular formula or by name (Aspirin, Paracetamol, etc.).

BigSolDB 2.0 can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.15094979

🔥14👍10❤9

805 views · edited 15:39

Chem ML/AI/Datasets

A short demonstration of the solubility data viewing options available on the website:

https://bigsoldb.streamlit.app/

👍7❤4🔥4

601 views · 17:47

Chem ML/AI/Datasets

SynCoTrain: a dual classifier PU-learning framework for synthesizability prediction

https://doi.org/10.1039/D4DD00394B

We present SynCoTrain, a semi-supervised machine learning model designed to predict the synthesizability of materials. SynCoTrain employs a co-training framework leveraging two complementary graph convolutional neural networks: SchNet and ALIGNN. By iteratively exchanging predictions between classifiers, SynCoTrain mitigates model bias and enhances generalizability.

Our approach uses Positive and Unlabeled (PU) learning to address the absence of explicit negative data, iteratively refining predictions through collaborative learning. The model demonstrates robust performance, achieving high recall on internal and leave-out test sets.

By focusing on oxide crystals, a well-characterized material family with extensive experimental data, we establish SynCoTrain as a reliable tool for predicting synthesizability while balancing dataset variability and computational efficiency.

📕Digital Discovery (IF=6.2)
#method

SynCoTrain: a dual classifier PU-learning framework for synthesizability prediction

Material discovery is a cornerstone of modern science, driving advancements in diverse disciplines from biomedical technology to climate solutions. Predicting synthesizability, a critical factor in realizing novel materials, remains a complex challenge due…

🔥4❤3👍3

672 views · 10:07

Chem ML/AI/Datasets

Forwarded from Зоопарк из слоновой кости

#пост_по_регламенту

Well, Зоопарк keeps assembling themed folders of scientific and science-adjacent channels.

Meet today's selection: chemistry channels!

https://t.me/addlist/gJqD8Wjr2eE4Zjli

P.S. A reminder that we are in the process of assembling other folders: biology, physics, and much more (see the details here)

🔥2❤1👍1

489 views · 18:58

Chem ML/AI/Datasets

Leveraging Prompt Engineering in Large Language Models for Accelerating Chemical Research🔥

https://pubs.acs.org/doi/full/10.1021/acscentsci.4c01935

In this Outlook, we delve into various prompt engineering techniques and illustrate relevant examples for extensive research from metal–organic frameworks and fast-charging batteries to autonomous experiments.

We also elucidate the current limitations of prompt engineering with LLMs such as incomplete or biased outcomes and constraints imposed by closed-source limitations.

Although LLM-assisted chemical research is still in its early stages, the application of prompt engineering will significantly enhance accuracy and reliability, thereby accelerating chemical research.

📕ACS Central Science (IF=13.1)

ACS Publications

Leveraging Prompt Engineering in Large Language Models for Accelerating Chemical Research

Artificial intelligence (AI) using large language models (LLMs) such as GPTs has revolutionized various fields. Recently, LLMs have also made inroads in chemical research even for users without expertise in coding. However, applying LLMs directly may lead…

🔥4❤3👍3

745 views · 08:36

Chem ML/AI/Datasets

Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years

https://pubs.acs.org/doi/abs/10.1021/acs.jcim.4c00747

In this review, we aim to distill insights from current research on employing transformer models for Molecular Property Prediction (MPP). We analyze the currently available models and explore key questions that arise when training and fine-tuning a transformer model for MPP. These questions encompass the choice and scale of the pretraining data, optimal architecture selections, and promising pretraining objectives.

Our analysis highlights areas not yet covered in current research, inviting further exploration to enhance the field’s understanding. Additionally, we address the challenges in comparing different models, emphasizing the need for standardized data splitting and robust statistical analysis.

🔥 Open-access version on arXiv: https://arxiv.org/abs/2404.03969

📕Journal of Chemical Information and Modeling (IF=5.6)
#review

ACS Publications

Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years

Molecular Property Prediction (MPP) is vital for drug discovery, crop protection, and environmental science. Over the last decades, diverse computational techniques have been developed, from using simple physical and chemical properties and molecular fingerprints…

❤3👍2🔥2

714 views · 16:22

Chem ML/AI/Datasets

Understanding Conformation Importance in Data-Driven Property Prediction Models🔥

https://pubs.acs.org/doi/10.1021/acs.jcim.5c00018

This study investigates the influence of using multiple conformers in machine learning-based property prediction, comparing two- and three-dimensional descriptors using three independent data sets: a large-scale quantum mechanical property, a medium-scale melting point, and small-scale enantioselective chemical reaction data sets.

One unique aspect of this study is creating these carefully controlled data sets for models’ performance evaluation in conformational diversity and the target property’s dependence on conformation.

Our findings show that using all available conformers as simple data augmentation consistently achieves high prediction accuracy among aggregation approaches, followed by mean aggregation.

📕Journal of Chemical Information and Modeling (IF=5.6)
#method
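
The two best-performing strategies named above can be illustrated with a small Python sketch. It is an illustration under assumptions, not the authors' code: the function names and the representation of each conformer as a plain list of descriptor values are hypothetical.

```python
from statistics import fmean


def mean_aggregate(conformer_descriptors: list[list[float]]) -> list[float]:
    """Average each descriptor over all conformers: one vector per molecule."""
    if not conformer_descriptors:
        raise ValueError("at least one conformer is required")
    return [fmean(column) for column in zip(*conformer_descriptors)]


def augment(conformer_descriptors: list[list[float]],
            label: float) -> list[tuple[list[float], float]]:
    """Keep every conformer as its own training row, sharing the molecule's label."""
    return [(list(vector), label) for vector in conformer_descriptors]
```

With mean aggregation a molecule contributes one training row; with augmentation it contributes one row per conformer, so the training set grows with the number of conformers.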

ACS Publications

Understanding Conformation Importance in Data-Driven Property Prediction Models

The prediction of molecular properties is essential in chemoinformatics and has many applications in drug discovery and materials design. Molecular representations play a key role in the prediction models to achieve high prediction accuracy. Nevertheless…

👍4❤3🔥3

707 views · 05:56

Chem ML/AI/Datasets

Transfer learning across different photocatalytic organic reactions 🔥

https://doi.org/10.1038/s41467-025-58687-5

Herein, we apply a domain-adaptation-based transfer-learning (TL) approach to photocatalysis. Despite being different reaction types, the knowledge of the catalytic behavior of organic photosensitizers (OPSs) from photocatalytic cross-coupling reactions is successfully transferred to ML for a [2+2] cycloaddition reaction, improving the prediction of the photocatalytic activity compared with conventional ML approaches. Furthermore, a satisfactory predictive performance is achieved by using only ten training data points.

📕Nature Communications (IF=14.7)
#method

Transfer learning across different photocatalytic organic reactions

Nature Communications - The potential of transfer learning as an effective tool for predicting photosensitizer catalytic activity remains underexplored in organic chemistry. Here, the authors apply...

❤2👍2🔥2

551 views · edited 11:43

Chem ML/AI/Datasets

Computational Discovery of Transition-metal Complexes: From High-throughput Screening to Machine Learning🔥

https://pubs.acs.org/doi/10.1021/acs.chemrev.1c00347

The review will cover the development, promise, and limitations of “traditional” computational chemistry as it pertains to data generation for inorganic molecular discovery. The review will also discuss the opportunities and limitations in leveraging experimental data sources. We will focus on how advances in statistical modeling, artificial intelligence, multiobjective optimization, and automation accelerate discovery of lead compounds and design rules. The overall objective of this review is to showcase how bringing together advances from diverse areas of computational chemistry and computer science has enabled the rapid uncovering of structure–property relationships in transition-metal chemistry.

We aim to highlight how unique considerations in motifs of metal–organic bonding (e.g., variable spin and oxidation state, and bonding strength/nature) set them and their discovery apart from more commonly considered organic molecules. We will also highlight how uncertainty and relative data scarcity in transition-metal chemistry motivate specific developments in machine learning representations, model training, and in computational chemistry. Finally, we will conclude with an outlook of areas of opportunity for the accelerated discovery of transition-metal complexes.

📕Chemical Reviews (IF=51.4)
#review

👍4❤3🔥3

909 views · 09:17

Chem ML/AI/Datasets

A Perspective on Foundation Models in Chemistry 🔥

https://pubs.acs.org/doi/10.1021/jacsau.4c01160

Foundation models are an emerging paradigm in artificial intelligence (AI), with successful examples like ChatGPT transforming daily workflows. Generally, foundation models are large-scale, pretrained models capable of adapting to various downstream tasks by leveraging extensive data and model scaling.

Their success has inspired researchers to develop foundation models for a wide range of chemical challenges, from materials discovery to understanding structure–property relationships, areas where conventional machine learning (ML) models often face limitations.

In addition, foundation models hold promise for addressing persistent ML challenges in chemistry, such as data scarcity and poor generalization. In this perspective, we review recent progress in the development of foundation models in chemistry across applications of varying scope.

📕JACS Au (IF=8.6)
#review

❤3👍3🔥2

625 views · 06:41

Chem ML/AI/Datasets

Explicit relation between thin film chromatography and column chromatography conditions from statistics and machine learning 🔥

https://doi.org/10.1038/s41467-025-56136-x

This study explicitly elucidates how chemists use thin-layer chromatography (TLC) to determine column chromatography (CC) conditions, employing statistical analysis and machine learning techniques. An experimental dataset of the CC is generated from the automatic platform developed in this study. On this basis, an “artificial intelligence (AI) experience” is generated through a knowledge discovery framework, where the relationship between the retardation factor (RF) value from TLC and retention volume from CC is unveiled in the form of explicit equations. These equations demonstrate satisfactory accuracy and generalizability, providing a scientific basis for the selection of the experimental conditions, and contributing to a better understanding of chromatography.

📕Nature Communications (IF=14.7)
#method

Explicit relation between thin film chromatography and column chromatography conditions from statistics and machine learning

Nature Communications - The selection of experimental conditions for column chromatography is usually determined by experience. Here, authors have discovered explicit relation between thin layer...

❤5👍5🔥4

901 views · 08:44

Chem ML/AI/Datasets

Pre-trained molecular representations enable antimicrobial discovery🔥

https://www.nature.com/articles/s41467-025-58804-4

Here, we introduce a lightweight computational strategy for antimicrobial discovery that builds on MolE (Molecular representation through redundancy reduced Embedding), a self-supervised deep learning framework that leverages unlabeled chemical structures to learn task-independent molecular representations.

By combining MolE representation learning with available, experimentally validated compound-bacteria activity data, we design a general predictive model that enables assessing compounds with respect to their antimicrobial potential.

Our model correctly identifies recent growth-inhibitory compounds that are structurally distinct from current antibiotics. Using this approach, we discover de novo, and experimentally confirm, three human-targeted drugs as growth inhibitors of Staphylococcus aureus.

📕Nature Communications (IF=14.7)
#method

Pre-trained molecular representations enable antimicrobial discovery

Nature Communications - Here, the authors introduce a computational strategy for antimicrobial discovery that addresses the scarcity of large datasets. Based on data-driven representations of...

👍3❤2🔥2

655 views · 14:36

Chem ML/AI/Datasets

The QDπ dataset, training data for drug-like molecules and biopolymer fragments and their interactions

https://www.nature.com/articles/s41597-025-04972-3

In this study, we introduce the QDπ dataset which incorporates data taken from several datasets. We use a query-by-committee active learning strategy to extract data from large datasets to maximize the diversity and avoid redundancy as relevant for neural network training to construct the QDπ dataset.

The QDπ dataset requires only 1.6 million structures to express the chemical diversity of 13 elements from the various source datasets at the ωB97M-D3(BJ)/def2-TZVPPD level of theory.

The QDπ dataset enables creation of flexible target loss functions for neural network training relevant to drug discovery, including information-dense data sets of relative conformational energies and barriers, intermolecular interactions, tautomers and relative protonation energies of drug-like compounds and biomolecular fragments.

📕Scientific Data (IF=5.9)
#dataset
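
As a rough illustration of query-by-committee selection in general, and not the pipeline used to build QDπ, the Python sketch below keeps only the candidate structures on which a committee of models disagrees. The function name, the use of the standard deviation of the predictions as the disagreement measure, and the threshold are assumptions made for illustration.

```python
from statistics import pstdev


def select_by_committee(predictions: dict[str, list[float]],
                        threshold: float) -> list[str]:
    """Return ids of candidates whose committee predictions spread more than threshold."""
    selected = []
    for candidate_id, member_predictions in predictions.items():
        # A large spread means the committee is uncertain, so the structure is informative.
        if pstdev(member_predictions) > threshold:
            selected.append(candidate_id)
    return selected
```

Structures the committee already agrees on add little new information, which is how this kind of selection avoids redundancy.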

👍4❤3🔥3

663 views · 08:07

Chem ML/AI/Datasets

Machine learning prediction of enzyme optimum pH

https://www.nature.com/articles/s42256-025-01026-6

Here we proposed and evaluated various machine learning methods for predicting pHopt, conducting extensive hyperparameter optimization and training over 11,000 model instances.

Our results demonstrate that models utilizing language model embeddings markedly outperform other methods in predicting pHopt. We present EpHod, the best-performing model, to predict pHopt, making it publicly available to researchers. From sequence data, EpHod directly learns structural and biophysical features that relate to pHopt, including proximity of residues to the catalytic centre and the accessibility of solvent molecules.

📕Nature Machine Intelligence (IF=23.8)
#method

Machine learning prediction of enzyme optimum pH

Nature Machine Intelligence - Accurately predicting the optimal pH level for enzyme activity is challenging due to the complex relationship between enzyme structure and function. Gado and...

👍5❤3🔥3🐳1

835 views · 09:44

Chem ML/AI/Datasets

Predictive modeling of visible-light azo-photoswitches’ properties using structural features 🔥

https://doi.org/10.1186/s13321-025-00993-7

In this manuscript we present the strategy for modeling photoswitch properties (maximum absorption wavelength and thermal half-life of photoisomers) of visible-light azo-photoswitches using structural data. We compile a comprehensive data set from literature sources and perform a rigorous benchmark to select the best feature type and modeling approach.

📕 Journal of Cheminformatics (IF=7.1)
#method

👍5❤3🔥3

672 views · 17:50