Chem ML/AI/Datasets

Using Classifiers To Predict Catalyst Design for Polyketone Microstructure

We applied a classifier method to predict palladium catalysts for the formation of nonalternating polyketones via the copolymerization of CO and ethylene; current examples are limited to using phosphine sulfonate and diphosphazane monoxide supporting ligands.…

👍4🔥3❤2

729 views, 10:26

🔥The most rewarding part of assembling large open datasets is seeing other researchers use them for applied work rather than leaving them to gather dust.

Back in 2022 we published BigSolDB on Zenodo: the largest solubility dataset we know of, containing 54,273 solubility values measured at temperatures from 243.15 to 403.15 K and covering 138 solvents and 830 compounds.

📕Recently we came across a paper from GlaxoSmithKline (GSK), one of the world's largest pharmaceutical companies. They used an ML model built on BigSolDB to predict compound solubility and then integrated it into their laboratory workflows.

According to the authors, our data helped by:
🔹 Supplementing GSK's internal database, which is limited to their own research compounds.
🔹 Covering a range of temperature regimes, which improved predictions at high temperatures.
🔹 Adding rare solvents for which the model previously made high-error predictions.

Moments like these are a strong motivation to keep building open datasets that systematize experimental data more completely across different areas of chemistry.

👍14🔥8❤6🦄1

916 views, 12:14

🔥

A review of large language models and autonomous agents in chemistry

https://doi.org/10.1039/D4SC03921A

This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning.

As agents are an emerging topic, we extend the scope of our review of agents beyond chemistry and discuss across any scientific domains.

This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry.

📕Chemical Science (IF=7.6)
#review

A review of large language models and autonomous agents in chemistry

Large language models (LLMs) have emerged as powerful tools in chemistry, significantly impacting molecule design, property prediction, and synthesis optimization. This review highlights LLM capabilities in these domains and their potential to accelerate…

👍6❤3🔥3

1.31K views, 10:40

Artificial Intelligence in Retrosynthesis Prediction and its Applications in Medicinal Chemistry

https://doi.org/10.1021/acs.jmedchem.4c02749

Herein, we review the recent advancements in AI applications for retrosynthesis prediction by summarizing related techniques and the landscape of current representative retrosynthesis models and propose feasible solutions to tackle existing problems and outline future directions in this field.

📕Journal of Medicinal Chemistry (IF = 6.8)
#review

Artificial Intelligence in Retrosynthesis Prediction and its Applications in Medicinal Chemistry

Retrosynthesis is a strategy to analyze the synthetic routes for target molecules in medicinal chemistry. However, traditional retrosynthesis predictions performed by chemists and rule-based expert systems struggle to adapt to the vast chemical space of real…

🔥3❤2👍2

747 views, 12:06

🔥

Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

https://pubs.acs.org/doi/10.1021/jacs.4c15902

The development of machine learning models to predict the regioselectivity of C(sp3)–H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C–H oxidation.

To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions were developed to select the most informative molecules for the specific target.

Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C–H radical borylation.

📕Journal of the American Chemical Society (IF=14.4)
#dataset

Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

The development of machine learning models to predict the regioselectivity of C(sp3)–H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity…

❤3👍2🔥2

729 views, 09:30

🔥

A simple similarity metric for comparing synthetic routes

https://doi.org/10.1039/D4DD00292J

AI-predicted routes are typically compared to experimental syntheses to check for an exact match among the top-ranked predictions (top-N accuracy).

This method is ideal for the evaluation of retrosynthetic algorithms on large datasets (>10^6 routes), but it cannot assess a degree of similarity between routes, which would be desirable for small datasets (<10^2 routes).

Here, we present a simple method to calculate a similarity score between any two synthetic routes to a given molecule.

The score is based on two concepts: which bonds are formed during the synthesis; and how the atoms of the final compound are grouped together throughout the synthesis. As a result, the similarity score overlaps well with chemists' intuition and provides a finer assessment of prediction accuracy.
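To make the contrast concrete, here is a minimal Python sketch of the two ideas. `top_n_accuracy` is the standard exact-match metric. `bond_overlap` is only an illustrative stand-in (a Jaccard overlap of the bonds formed), not the score defined in the paper. The route identifiers and atom indices are made up.

```python
from typing import FrozenSet, Sequence, Set

# A bond is an unordered pair of atom indices in the target molecule.
Bond = FrozenSet[int]


def top_n_accuracy(predictions: Sequence[Sequence[str]],
                   references: Sequence[str], n: int) -> float:
    """Fraction of targets whose reference route is among the top-n predictions."""
    hits = sum(ref in preds[:n] for preds, ref in zip(predictions, references))
    return hits / len(references)


def bond_overlap(route_a: Set[Bond], route_b: Set[Bond]) -> float:
    """Jaccard overlap of the bonds formed in two routes (illustrative only)."""
    if not route_a and not route_b:
        return 1.0
    return len(route_a & route_b) / len(route_a | route_b)


# Hypothetical data: each route is an opaque string identifier.
predicted = [["r1", "r2", "r3"], ["r9", "r4", "r5"]]
experimental = ["r2", "r5"]
print(top_n_accuracy(predicted, experimental, n=1))  # 0.0
print(top_n_accuracy(predicted, experimental, n=3))  # 1.0

# Two hypothetical routes that share two of their three bond-forming steps.
route_a = {frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})}
route_b = {frozenset({1, 2}), frozenset({3, 4}), frozenset({7, 8})}
print(bond_overlap(route_a, route_b))  # 0.5
```

Top-N accuracy only records whether a route matched exactly, while a graded score such as the overlap above can still tell a near miss from a completely different route.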

📕Digital Discovery (IF=6.2)
#method

Experimentally validated routes to synthetic compounds can be compared to each other by quantitative metrics (step count, yield, atom economy), or by qualitative assessments (strategy, novelty). AI-predicted routes are typically compared to experimental syntheses…

👍3🔥3❤2

722 views, 10:21

Balancing molecular information and empirical data in the prediction of physico-chemical properties

https://doi.org/10.1039/D4DD00154K

In this work, we propose a general method for combining molecular descriptors with representation learning using the so-called expectation maximization algorithm from the probabilistic machine-learning literature, which uses uncertainty estimates to trade off between the two approaches.

The proposed hybrid model exploits chemical structure information using graph neural networks, but it automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions that can better specialize to unusual cases. The effectiveness of the proposed method is demonstrated using the prediction of activity coefficients in binary mixtures as an example.

The results are compelling, as the method significantly improves predictive accuracy over the current state of the art, showcasing its potential to advance the prediction of physico-chemical properties in general.

📕Digital Discovery (IF=6.2)
#method

Balancing molecular information and empirical data in the prediction of physico-chemical properties

Predicting the physico-chemical properties of pure substances and mixtures is a central task in thermodynamics. Established prediction methods range from fully physics-based ab initio calculations, which are only feasible for very simple systems, over descriptor…

👍4❤2🔥1

673 views, edited 13:31

🎊Our paper is finally out today:

Towards Accelerating the Discovery of Efficient Iridium(III) Emitters Using Novel Database and Machine Learning Based Only on Structural Formula

https://doi.org/10.1039/D5TC00305A

1. In this paper we assembled IrLumDB, a database of experimental data on 1287 bis-cyclometalated iridium(III) complexes and their photophysical properties: emission wavelength (λmax), quantum yield (PLQY), and lifetime.

2. Using IrLumDB, we trained XGBoost, LightGBM, and CatBoost to predict λmax and PLQY, reaching MAEs of 18.26 nm and 0.13, respectively, in 10-fold cross-validation.

3. We tested the trained models on 33 complexes synthesized in our laboratory, 12 of which were prepared for this paper. The complexes were characterized by NMR, single-crystal X-ray diffraction, high-resolution mass spectrometry, and, in part, powder X-ray diffraction. Nine new structures were deposited in the CCDC.

4. On the compounds we studied, we compared the accuracy of emission wavelength predictions from machine learning algorithms and from DFT calculations, and showed that the machine learning algorithms perform better.

5. Because we want to find new complexes with potentially high quantum yields, we split all complexes into three classes, with low (0-0.1), medium (0.1-0.5), and high (0.5-1) PLQY, then trained classification models and obtained 72.4% accuracy in 10-fold cross-validation.

6. We built a small application, IrLumDB App, so that any researcher can predict properties for their own complexes. Only the SMILES of the ligands are needed for a prediction.

Dataset on Zenodo | IrLumDB App

📕Journal of Materials Chemistry C (IF=5.7)
#dataset #application
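For readers less familiar with the evaluation setup in items 2 and 5, here is a minimal pure-Python sketch of PLQY binning and of MAE under k-fold cross-validation. It is not the code from the paper. The mean-value "model" is a placeholder for a real regressor such as XGBoost, the data are made up, and assigning the boundary values 0.1 and 0.5 to the upper class is our assumption.

```python
from typing import List, Sequence


def plqy_class(plqy: float) -> str:
    """Map a PLQY value in [0, 1] to 'low', 'medium' or 'high'.

    Assumption: the boundary values 0.1 and 0.5 go to the upper class.
    """
    if not 0.0 <= plqy <= 1.0:
        raise ValueError("PLQY must lie in [0, 1]")
    if plqy < 0.1:
        return "low"
    if plqy < 0.5:
        return "medium"
    return "high"


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples: int, k: int) -> List[List[int]]:
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_mae(y: Sequence[float], k: int = 10) -> float:
    """Average per-fold MAE of a mean-value baseline (placeholder model)."""
    errors = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train = [v for i, v in enumerate(y) if i not in held_out]
        prediction = sum(train) / len(train)  # "model" = training-set mean
        errors.append(mae([y[i] for i in test_idx], [prediction] * len(test_idx)))
    return sum(errors) / len(errors)


# Made-up emission wavelengths in nm, for illustration only.
wavelengths = [500.0, 520.0] * 10
print(cross_validated_mae(wavelengths, k=10))  # 10.0
print([plqy_class(v) for v in (0.05, 0.3, 0.72)])  # ['low', 'medium', 'high']
```

In practice the samples would be shuffled before splitting. The contiguous folds here only keep the sketch short.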

🔥17❤5👍5🎉2

1.03K views, 08:41

Will we ever be able to accurately predict solubility?

https://doi.org/10.1038/s41597-024-03105-6

Accurate prediction of thermodynamic solubility by machine learning remains a challenge.

We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets.

We benchmarked recently published models on a novel curated solubility dataset and report poor performances. We also propose a workflow to cure aqueous solubility data aiming at producing useful models for bench chemist.

Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources.

We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute and data sources. The herein obtained models, and quality-assessed datasets are publicly available.

📕Scientific Data (IF=5.9)
#dataset

Will we ever be able to accurately predict solubility?

Scientific Data - Will we ever be able to accurately predict solubility?

👍7🔥5❤3

667 views, 11:05

A robust crystal structure prediction method to support small molecule drug development with large scale validation and blind study

https://doi.org/10.1038/s41467-025-57479-1

A brand-new paper from March 5:

In this paper, we report a crystal structure prediction (CSP) method with state of the art accuracy and efficiency, validated on a large and diverse dataset including 66 molecules with 137 experimentally known polymorphic forms. The method combines a novel systematic crystal packing search algorithm and the use of machine learning force fields in a hierarchical crystal energy ranking.

Our method not only reproduces all the experimentally known polymorphs, but also suggests new low energy polymorphs yet to be discovered by experiment that might pose potential risks to development of the currently known forms of these compounds.

In addition, we report the prediction results of a blinded study, results for Target XXXI from the seventh CSP blind test, and demonstrate how the method can be used to accelerate clinical formulation design and derisk downstream processing.

📕Nature Communications (IF=14.7)
#method

A robust crystal structure prediction method to support small molecule drug development with large scale validation and blind study

Nature Communications - Crystal polymorphism plays an important role in pharmaceuticals, agrisciences and other industries. Here the authors present an efficient and accurate crystal structure...

🔥3❤2👍2

779 views, 08:11

Open-source Raman spectra of chemical compounds for active pharmaceutical ingredient development

https://www.nature.com/articles/s41597-025-04848-6

In this work, we introduce a new open-source Raman dataset consisting of pure chemical compounds commonly employed in the development of APIs. By curating and publishing this dataset, we aim to provide the scientific community with access to high-quality, reusable data.

Containing 3,510 samples spanning 32 compounds, this data can be utilised for referencing and can potentially facilitate in the development of more accurate and generalisable calibration models when access to reference data is limited.

📕Scientific Data (IF=5.9)
#dataset

Open-source Raman spectra of chemical compounds for active pharmaceutical ingredient development

Scientific Data - Open-source Raman spectra of chemical compounds for active pharmaceutical ingredient development

❤2👍1🔥1

642 views, 08:49

Toward a unified benchmark and framework for deep learning-based prediction of nuclear magnetic resonance chemical shifts

https://www.nature.com/articles/s43588-025-00783-z

Here we introduce NMRNet, a deep learning framework using the SE(3) Transformer for atomic environment modeling, following a pretraining and fine-tuning paradigm. To support the evaluation of nuclear magnetic resonance chemical shift prediction models, we have established a comprehensive benchmark based on previous research and databases, covering diverse chemical systems. Applying NMRNet to these benchmark datasets, we achieve competitive performance in both liquid-state and solid-state nuclear magnetic resonance datasets, demonstrating its robustness and practical utility in real-world scenarios.

📕 Nature Computational Science (IF=12.0)

#method

Toward a unified benchmark and framework for deep learning-based prediction of nuclear magnetic resonance chemical shifts

Nature Computational Science - A deep learning framework (NMRNet) is developed to model atomic environments for predicting NMR chemical shifts. A benchmark dataset, nmrshiftdb2-2024, is also...

👍5❤3🔥3

720 views, 16:00

🎊 We are excited to share our new and very important release

BigSolDB 2.0: a dataset of solubility values for organic compounds in organic solvents and water at various temperatures

https://doi.org/10.26434/chemrxiv-2025-nq0gr

We have prepared a new version of the largest database (to our knowledge) of the solubility of organic molecules in organic solvents: BigSolDB 2.0. It contains almost twice as much data as the first version, and the data are cleaner.

The new version contains:
🔹 103,944 experimentally measured solubility values
🔹 1448 unique compounds
🔹 213 solvents
🔹 data from 1595 peer-reviewed papers

The new version can be used both to train machine learning models and to look up the solubility of specific compounds directly.

For convenient browsing we built a website, https://bigsoldb.streamlit.app/, where molecules can be searched by molecular formula or by name (Aspirin, Paracetamol, etc.).

BigSolDB 2.0 can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.15094979

🔥14👍10❤9

806 views, edited 15:39

[Video, 0:31]

A short demonstration of how solubility data can be browsed on the website:

https://bigsoldb.streamlit.app/

👍7❤4🔥4

602 views, 17:47

SynCoTrain: a dual classifier PU-learning framework for synthesizability prediction

https://doi.org/10.1039/D4DD00394B

We present SynCoTrain, a semi-supervised machine learning model designed to predict the synthesizability of materials. SynCoTrain employs a co-training framework leveraging two complementary graph convolutional neural networks: SchNet and ALIGNN. By iteratively exchanging predictions between classifiers, SynCoTrain mitigates model bias and enhances generalizability.

Our approach uses Positive and Unlabeled (PU) learning to address the absence of explicit negative data, iteratively refining predictions through collaborative learning. The model demonstrates robust performance, achieving high recall on internal and leave-out test sets.
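A note on why recall is the headline metric here: in a PU setting there are no confirmed negatives, so precision and accuracy cannot be measured directly, whereas recall on held-out known positives can. The sketch below shows that calculation for two classifiers and for their averaged scores. The scores are made up, and both the averaging step and the 0.5 threshold are our assumptions for illustration, not necessarily what SynCoTrain does.

```python
from typing import List, Sequence


def average_scores(scores_a: Sequence[float],
                   scores_b: Sequence[float]) -> List[float]:
    """Element-wise mean of two classifiers' positive-class scores."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


def recall_on_positives(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Share of known-positive samples that are predicted positive."""
    predicted_positive = sum(s >= threshold for s in scores)
    return predicted_positive / len(scores)


# Made-up scores from two hypothetical classifiers on five held-out
# samples that are all known to be positive (i.e. synthesized materials).
clf_a = [0.9, 0.8, 0.3, 0.7, 0.6]
clf_b = [0.7, 0.9, 0.8, 0.4, 0.8]
combined = average_scores(clf_a, clf_b)

print(recall_on_positives(clf_a))     # 0.8
print(recall_on_positives(clf_b))     # 0.8
print(recall_on_positives(combined))  # 1.0
```

High recall alone can be reached by labelling everything positive, so it is usually read together with the fraction of unlabeled samples that the model predicts as positive.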

By focusing on oxide crystals, a well-characterized material family with extensive experimental data, we establish SynCoTrain as a reliable tool for predicting synthesizability while balancing dataset variability and computational efficiency.

📕Digital Discovery (IF=6.2)
#method

SynCoTrain: a dual classifier PU-learning framework for synthesizability prediction

Material discovery is a cornerstone of modern science, driving advancements in diverse disciplines from biomedical technology to climate solutions. Predicting synthesizability, a critical factor in realizing novel materials, remains a complex challenge due…

🔥4❤3👍3

673 views, 10:07

Forwarded from Зоопарк из слоновой кости

#пост_по_регламенту

Well then, Зоопарк continues to assemble themed folders of science and science-adjacent channels.

Today's pick is a collection of chemistry channels!

https://t.me/addlist/gJqD8Wjr2eE4Zjli

P.S. As a reminder, we are putting together other folders as well: biology, physics, and much more (see details here)

🔥2❤1👍1

490 views, 18:58

Leveraging Prompt Engineering in Large Language Models for Accelerating Chemical Research🔥

https://pubs.acs.org/doi/full/10.1021/acscentsci.4c01935

In this Outlook, we delve into various prompt engineering techniques and illustrate relevant examples for extensive research from metal–organic frameworks and fast-charging batteries to autonomous experiments.

We also elucidate the current limitations of prompt engineering with LLMs such as incomplete or biased outcomes and constraints imposed by closed-source limitations.

Although LLM-assisted chemical research is still in its early stages, the application of prompt engineering will significantly enhance accuracy and reliability, thereby accelerating chemical research.

📕ACS Central Science (IF=13.1)

Leveraging Prompt Engineering in Large Language Models for Accelerating Chemical Research

Artificial intelligence (AI) using large language models (LLMs) such as GPTs has revolutionized various fields. Recently, LLMs have also made inroads in chemical research even for users without expertise in coding. However, applying LLMs directly may lead…

🔥4❤3👍3

747 views, 08:36

Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years

https://pubs.acs.org/doi/abs/10.1021/acs.jcim.4c00747

In this review, we aim to distill insights from current research on employing transformer models for Molecular Property Prediction (MPP). We analyze the currently available models and explore key questions that arise when training and fine-tuning a transformer model for MPP. These questions encompass the choice and scale of the pretraining data, optimal architecture selections, and promising pretraining objectives.

Our analysis highlights areas not yet covered in current research, inviting further exploration to enhance the field’s understanding. Additionally, we address the challenges in comparing different models, emphasizing the need for standardized data splitting and robust statistical analysis.

🔥Open-access version on arXiv: https://arxiv.org/abs/2404.03969

📕Journal of Chemical Information and Modeling (IF=5.6)
#review

Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years

Molecular Property Prediction (MPP) is vital for drug discovery, crop protection, and environmental science. Over the last decades, diverse computational techniques have been developed, from using simple physical and chemical properties and molecular fingerprints…

❤3👍2🔥2

715 views, 16:22

Understanding Conformation Importance in Data-Driven Property Prediction Models🔥

https://pubs.acs.org/doi/10.1021/acs.jcim.5c00018

This study investigates the influence of using multiple conformers in machine learning-based property prediction, comparing two- and three-dimensional descriptors using three independent data sets: a large-scale quantum mechanical property, a medium-scale melting point, and small-scale enantioselective chemical reaction data sets.

One unique aspect of this study is creating these carefully controlled data sets for models’ performance evaluation in conformational diversity and the target property’s dependence on conformation.

Our findings show that using all available conformers as simple data augmentation consistently achieves high prediction accuracy among aggregation approaches, followed by mean aggregation.
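The two strategies compared in that last sentence can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the authors' code: descriptors are plain lists of floats, and the molecule names, vectors and labels are made up.

```python
from typing import Dict, List, Tuple

Vector = List[float]
# molecule name -> (per-conformer descriptor vectors, property label)
Dataset = Dict[str, Tuple[List[Vector], float]]


def mean_aggregate(conformers: List[Vector]) -> Vector:
    """Average per-conformer descriptors into one vector per molecule."""
    n = len(conformers)
    return [sum(column) / n for column in zip(*conformers)]


def augment(dataset: Dataset) -> List[Tuple[Vector, float]]:
    """Emit one training row per conformer; each inherits the molecule's label."""
    return [(vec, label) for confs, label in dataset.values() for vec in confs]


# Hypothetical data: two molecules with two and one conformers.
data: Dataset = {
    "mol_A": ([[1.0, 2.0], [3.0, 4.0]], 0.5),
    "mol_B": ([[0.0, 1.0]], 1.5),
}

print(mean_aggregate(data["mol_A"][0]))  # [2.0, 3.0]
print(len(augment(data)))  # 3
```

With augmentation the model sees every conformer as a separate sample, so molecule-level predictions then need a pooling step at inference time, for example averaging the per-conformer outputs.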

📕Journal of Chemical Information and Modeling (IF=5.6)
#method
