Chem ML/AI/Datasets
Daily articles and news from the field of machine learning in chemistry from the researchers of IGIC RAS @chemrussia

For contact: @levkrasnov @st613laboratory @StasBezzubov
Chemprop: A Machine Learning Package for Chemical Property Prediction

https://doi.org/10.1021/acs.jcim.3c01250

The software package Chemprop implements the directed message-passing neural networks (D-MPNN) architecture and offers simple, easy, and fast access to machine-learned molecular properties. Compared to its initial version, we present a multitude of new Chemprop functionalities such as the support of multimolecule properties, reactions, atom/bond-level properties, and spectra.

Further, we incorporate various uncertainty quantification and calibration methods along with related metrics as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features.

We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction data sets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra.


🖥Github link, 🌟1.8k: https://github.com/chemprop/chemprop

📕Journal of Chemical Information and Modeling (IF=5.6)
#method
Acquisition of absorption and fluorescence spectral data using chatbots

https://doi.org/10.1039/D4DD00255E

A guide to quickly writing new papers for scientific journals:

1) Take ChatGPT or any other LLM
2) Ask it about the properties of molecule X
3) Record the answers in a table

Profit: you get a paper in a journal with IF=6.2
ChEMBL 35 is out!

https://chembl.blogspot.com/2024/12/heres-nice-christmas-gift-chembl-35-is.html

A new version of the ChEMBL database has been released:

This fresh release comes with a wealth of new data sets and some new data sources as well. Examples include a total of 14 datasets deposited by the ASAP (AI-driven Structure-enabled Antiviral Platform) project, a new NTD data set by Aberystwyth University on anti-schistosome activity, nine new chemical probe data sets, and seven new data sets for the Chemogenomic library of the EUbOPEN project.

This version of the database, prepared on 01/12/2024 contains:

2,496,335 compounds (of which 2,474,590 have mol files)
3,185,505 compound records (non-unique compounds)
21,123,501 activities
1,740,546 assays
16,003 targets
92,121 documents

#dataset
SMILES All Around: Structure to SMILES conversion for Transition Metal Complexes

https://doi.org/10.26434/chemrxiv-2024-c660p

This should be very useful for anyone working in organometallic chemistry:
We present a method for creating RDKit parsable SMILES for transition metal complexes (TMCs) based on xyz-coordinates and overall charge of the complex. This can be viewed as an extension to the program xyz2mol that does the same for organic molecules. The only dependency is RDKit, which makes it widely applicable. One thing that has been lacking when it comes to generating SMILES from structure for TMCs is an existing SMILES dataset to compare with.


🖥Github link: https://github.com/jensengroup/xyz2mol_tm

#method
Simulation-Assisted Deep Learning Techniques for Commercially Applicable OLED Phosphorescent Materials

https://doi.org/10.1021/acs.chemmater.4c02754

In this work, phosphorescent materials are represented as strings, molecular graphs, and point clouds, which are employed by language models, two-dimensional graph, and three-dimensional graph neural networks. In addition, more than 200 000 molecules with simulated properties highly relevant to experimental properties are used for pretraining the DL models.

Our work shows high performance in the prediction of five experimental properties that are importantly considered when commercializing OLED devices. This means that faster material discovery for OLEDs can be achieved through DL models that are trained with simulation information that is highly correlated with experimental properties.


📕Chemistry of Materials (IF=7.2)
#method
Benchmark of Density Functional Theory in the Prediction of 13C Chemical Shielding Anisotropies for Anisotropic Nuclear Magnetic Resonance-Based Structural Elucidation

https://pubs.acs.org/doi/10.1021/acs.jctc.4c01407

In this study, we present a comprehensive benchmark of carbon shielding anisotropies based on coupled cluster reference tensors taken from the NS372 benchmark data set.

Additionally, we investigate the representation of the DFT-predicted shielding tensors, such as the eigenvalues and eigenvectors. Moreover, we evaluated how various DFT methods influence the discrimination of possible relative configurations using recently published ΔΔRCSA data for a set of structurally diverse natural products.

Our findings demonstrate that accurate interpretation of RCSAs for configurational and conformational analysis is possible with semilocal DFT methods, which also reduce computational demands compared to hybrid functionals such as the commonly used B3LYP.


📕Journal of Chemical Theory and Computation (IF=5.7)
#benchmark
A generative model for inorganic materials design

https://www.nature.com/articles/s41586-025-08628-5

A very interesting paper came out today in Nature.

Microsoft has introduced MatterGen, a new paradigm in materials design based on generative AI. MatterGen speeds up materials development by automatically generating and evaluating candidate structures with target properties.

The model can be tuned to generate materials with a specific chemical composition, symmetry, or physical characteristics such as magnetic density, band gap, and mechanical strength. It draws on a training set of more than 608,000 stable compounds from established materials databases.

Experimental validation confirmed the successful synthesis of TaCr2O6, which matched the model's predictions exactly.

🖥The code is freely available on GitHub: https://github.com/microsoft/mattergen
Real-World Applications and Experiences of AI/ML Deployment for Drug Discovery

🔥https://doi.org/10.1021/acs.jmedchem.4c03044

Briefly summarized are our and others’ experiences with the AI/ML applications that currently have the greatest impact on our work.


📕Journal of Medicinal Chemistry has published an Editorial on the ML/AI methods used in drug discovery.
Hybrid nanophotonic-microfluidic sensor integrated with machine learning for operando state-of-charge monitoring in vanadium flow batteries

https://doi.org/10.1016/j.est.2025.115349

A paper we made a modest contribution to came out yesterday. It presents an improved method for measuring the state of charge (SoC) of vanadium redox flow batteries (VRFB) using refractive index and machine learning.

The main focus is on using changes in the refractive index (RI) of the electrolytes to estimate the concentration of vanadium ions.

The sensor is built on photonic integrated circuits (PIC) and microfluidic channels, which gives it high sensitivity. The system was tested under battery operating conditions and showed a stable correlation between spectral characteristics and charge data.

Using the experimental data, an ML model was trained to accurately predict the state of charge of a vanadium flow battery from its spectral characteristics.

🔗The article will be freely available via this link for the first 50 days: https://authors.elsevier.com/c/1kSYB,rUrFxfAl

📕Journal of Energy Storage (IF=8.9)
#application
Harnessing Large Language Models to Collect and Analyze Metal–Organic Framework Property Data Set

https://pubs.acs.org/doi/10.1021/jacs.4c11085

Utilizing a chain of advanced large language models (LLMs), we developed a systematic approach to extract and organize MOF data into a structured format.

Our methodology successfully compiled information from more than 40,000 research articles, creating a comprehensive and ready-to-use data set. Specifically, data regarding MOF synthesis conditions and properties were extracted from both tables and text and then analyzed. Subsequently, we utilized the curated database to analyze the relationships between synthesis conditions, properties, and structure.


📕Journal of the American Chemical Society (IF=14.4)
#dataset #method
Using Classifiers To Predict Catalyst Design for Polyketone Microstructure

https://pubs.acs.org/doi/10.1021/jacs.4c11666

We applied a classifier method to predict palladium catalysts for the formation of nonalternating polyketones via the copolymerization of CO and ethylene; current examples are limited to using phosphine sulfonate and diphosphazane monoxide supporting ligands.

With the reported workflow, we discovered two new classes of palladium complexes capable of achieving the synthesis of nonalternating polyketones with a lower CO content than those made by known palladium catalysts.

Our results show that we doubled the number of classes of palladium compounds that can catalyze the formation of this type of polymer. We envision that this methodology can be applied to accelerate catalyst discovery when selectivity is an important outcome.


📕Journal of the American Chemical Society (IF=14.4)
#method
🔥The most rewarding part of building large open datasets is seeing other researchers put them to practical use rather than leaving them to gather dust.

Back in 2022 we published BigSolDB on Zenodo. It is the largest solubility dataset we know of, containing 54,273 solubility values at temperatures from 243.15 to 403.15 K and covering 138 solvents and 830 compounds.

📕Quite recently we came across a paper from one of the world's largest pharmaceutical corporations, GlaxoSmithKline (GSK). They used BigSolDB to build an ML model for predicting compound solubility and then integrated it into their laboratory workflows.

According to the authors, our data were useful in the following ways:
🔹 They supplemented GSK's internal database, which is limited to the company's own research compounds.
🔹 They gave access to a range of temperature regimes, which improved predictions at high temperatures.
🔹 They added rare solvents, for which the model previously had high prediction error.

Moments like these are a strong motivation to keep producing open datasets that systematize experimental data more fully across different areas of chemistry.
🔥A review of large language models and autonomous agents in chemistry

https://doi.org/10.1039/D4SC03921A

This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning.

As agents are an emerging topic, we extend the scope of our review of agents beyond chemistry and discuss across any scientific domains.

This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry.


📕Chemical Science (IF=7.6)
#review
Artificial Intelligence in Retrosynthesis Prediction and its Applications in Medicinal Chemistry

https://doi.org/10.1021/acs.jmedchem.4c02749

Herein, we review the recent advancements in AI applications for retrosynthesis prediction by summarizing related techniques and the landscape of current representative retrosynthesis models and propose feasible solutions to tackle existing problems and outline future directions in this field.


📕Journal of Medicinal Chemistry (IF=6.8)
#review
🔥Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

https://pubs.acs.org/doi/10.1021/jacs.4c15902

The development of machine learning models to predict the regioselectivity of C(sp3)–H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C–H oxidation.

To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions were developed to select the most informative molecules for the specific target.
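
To make the idea of a target-specific acquisition function concrete, here is a minimal sketch that ranks candidate molecules by Tanimoto similarity to the target and keeps the top k. This is an assumed, simplified criterion for illustration only, not one of the acquisition functions developed in the paper; the feature sets, molecule names, and function names are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_for_target(target, candidates, k):
    """Return the names of the k candidates most similar to the target.

    candidates: dict mapping a molecule name to its feature set.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(
        candidates,
        key=lambda name: (-tanimoto(target, candidates[name]), name),
    )
    return ranked[:k]


# Hypothetical feature sets (stand-ins for substructure keys), not real fingerprints.
target = {"tertiary_CH", "ether", "ring6", "ester"}
pool = {
    "mol_A": {"tertiary_CH", "ether", "ring6"},
    "mol_B": {"ring5", "amide"},
    "mol_C": {"tertiary_CH", "ester", "ring6", "ketone"},
}
chosen = select_for_target(target, pool, k=2)
print(chosen)
```

A similarity-only criterion tends to pick redundant molecules; a practical acquisition function would likely also account for diversity or model uncertainty.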

Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C–H radical borylation.


📕Journal of the American Chemical Society (IF=14.4)
#dataset
🔥A simple similarity metric for comparing synthetic routes

https://doi.org/10.1039/D4DD00292J

AI-predicted routes are typically compared to experimental syntheses to check for an exact match among the top-ranked predictions (top-N accuracy).

This method is ideal for the evaluation of retrosynthetic algorithms on large datasets (>10^6 routes), but it cannot assess a degree of similarity between routes, which would be desirable for small datasets (<10^2 routes).

Here, we present a simple method to calculate a similarity score between any two synthetic routes to a given molecule.

The score is based on two concepts: which bonds are formed during the synthesis; and how the atoms of the final compound are grouped together throughout the synthesis. As a result, the similarity score overlaps well with chemists' intuition and provides a finer assessment of prediction accuracy.
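
As a rough illustration of the two concepts, the sketch below scores a pair of routes by the overlap of the bonds they form and of the atom groupings they pass through. The Jaccard overlap and the geometric-mean combination are assumptions chosen for brevity, not the formula from the paper, and the route encoding (sets of atom-index pairs and sets of atom-index groups) is hypothetical.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def route_similarity(bonds_a, bonds_b, groups_a, groups_b):
    """Combine bond overlap and atom-grouping overlap into one score in [0, 1].

    bonds_*  : set of frozenset({i, j}), the target bonds formed in the route
    groups_* : set of frozenset(atom indices), one per intermediate, recording
               which target atoms sit together in that intermediate
    The geometric mean is an illustrative choice, not the published formula.
    """
    return sqrt(jaccard(bonds_a, bonds_b) * jaccard(groups_a, groups_b))


# Two hypothetical routes to the same 4-atom target (atoms numbered 0..3).
route_1_bonds = {frozenset({0, 1}), frozenset({2, 3})}
route_2_bonds = {frozenset({0, 1}), frozenset({1, 2})}
route_1_groups = {frozenset({0, 1}), frozenset({2, 3}), frozenset({0, 1, 2, 3})}
route_2_groups = {frozenset({0, 1}), frozenset({0, 1, 2}), frozenset({0, 1, 2, 3})}

score = route_similarity(route_1_bonds, route_2_bonds, route_1_groups, route_2_groups)
print(round(score, 3))
```

Unlike an exact-match check, a score of this kind gives partial credit when two routes share some bond-forming steps but differ in others.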


📕Digital Discovery (IF=6.2)
#method
Balancing molecular information and empirical data in the prediction of physico-chemical properties

https://doi.org/10.1039/D4DD00154K

In this work, we propose a general method for combining molecular descriptors with representation learning using the so-called expectation maximization algorithm from the probabilistic machine-learning literature, which uses uncertainty estimates to trade off between the two approaches.

The proposed hybrid model exploits chemical structure information using graph neural networks, but it automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions that can better specialize to unusual cases. The effectiveness of the proposed method is demonstrated using the prediction of activity coefficients in binary mixtures as an example.
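
One simple way to see how uncertainty estimates can arbitrate between two predictors is inverse-variance weighting, sketched below. This is a stand-in for illustration, not the expectation-maximization procedure used in the paper; the function name and the example numbers are hypothetical.

```python
def combine_predictions(mu_a, var_a, mu_b, var_b):
    """Fuse two Gaussian predictions by inverse-variance (precision) weighting.

    The prediction with the smaller variance dominates the result, so an
    unreliable estimate is automatically down-weighted.
    Returns the fused mean and the fused variance.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    mean = (w_a * mu_a + w_b * mu_b) / (w_a + w_b)
    variance = 1.0 / (w_a + w_b)
    return mean, variance


# Hypothetical numbers: a confident prediction (variance 0.01) and an
# uncertain one (variance 1.00) for the same quantity.
mean, variance = combine_predictions(0.80, 0.01, 0.20, 1.00)
print(mean, variance)
```

With these inputs the fused mean stays close to the confident prediction, which is the qualitative behavior the abstract describes: the less reliable source only takes over when its uncertainty is the smaller one.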

The results are compelling, as the method significantly improves predictive accuracy over the current state of the art, showcasing its potential to advance the prediction of physico-chemical properties in general.

📕Digital Discovery (IF=6.2)
#method
🎊Our paper is finally out today:

Towards Accelerating the Discovery of Efficient Iridium(III) Emitters Using Novel Database and Machine Learning Based Only on Structural Formula


https://doi.org/10.1039/D5TC00305A

1. In this paper we assembled the IrLumDB database, which contains experimental data on 1287 bis-cyclometalated iridium(III) complexes and their photophysical properties: emission wavelength (λmax), quantum yield (PLQY), and lifetime.

2. Using IrLumDB, we trained XGBoost, LightGBM, and CatBoost to predict λmax and PLQY, with MAEs of 18.26 nm and 0.13 respectively under ten-fold cross-validation.

3. We tested the trained models on 33 complexes synthesized in our laboratory, 12 of which were prepared for this paper. The complexes were characterized by NMR, single-crystal X-ray diffraction, high-resolution mass spectrometry, and, in part, powder X-ray diffraction. Nine new structures were deposited in the CCDC.

4. For the compounds we studied, we compared the accuracy of emission-wavelength predictions from machine learning algorithms against DFT calculations, and showed that the machine learning algorithms perform better.

5. Because we want to find new complexes with potentially high quantum yields, we split all complexes into three classes, low (0-0.1), medium (0.1-0.5), and high (0.5-1) PLQY, then trained classification models and obtained 72.4% accuracy under ten-fold cross-validation.

6. We built a mini-app, IrLumDB App, so that any researcher can predict properties for their own complexes. Only the SMILES of the ligands are needed for a prediction.
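
The PLQY binning from step 5 and the MAE metric from step 2 can be sketched as follows. The boundary handling (which class receives exactly 0.1 or 0.5) is an assumption, since the post gives only the ranges, and the example values are made up purely to exercise the functions.

```python
def plqy_class(plqy):
    """Map a PLQY value in [0, 1] to 'low', 'medium' or 'high'.

    Ranges follow the post: 0-0.1, 0.1-0.5, 0.5-1. Assigning the boundary
    values 0.1 and 0.5 to the upper class is an assumption.
    """
    if not 0.0 <= plqy <= 1.0:
        raise ValueError("PLQY must lie between 0 and 1")
    if plqy < 0.1:
        return "low"
    if plqy < 0.5:
        return "medium"
    return "high"


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, not taken from IrLumDB.
measured_plqy = [0.05, 0.32, 0.78]
predicted_plqy = [0.10, 0.30, 0.60]
classes = [plqy_class(v) for v in measured_plqy]
mae = mean_absolute_error(measured_plqy, predicted_plqy)
print(classes, round(mae, 3))
```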

Dataset on Zenodo | IrLumDB App

📕Journal of Materials Chemistry C (IF=5.7)
#dataset #application
Will we ever be able to accurately predict solubility?

https://doi.org/10.1038/s41597-024-03105-6

Accurate prediction of thermodynamic solubility by machine learning remains a challenge.

We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets.

We benchmarked recently published models on a novel curated solubility dataset and report poor performances. We also propose a workflow to curate aqueous solubility data aiming at producing useful models for bench chemists.

Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources.

We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute and data sources. The herein obtained models, and quality-assessed datasets are publicly available.
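
To show what an interlaboratory standard deviation measures, the sketch below groups reported values by compound and computes the spread across sources. The record layout and the logS values are made up for illustration and are not taken from the paper's datasets.

```python
from collections import defaultdict
from statistics import stdev


def interlab_std(records):
    """Standard deviation of reported values per compound across sources.

    records: iterable of (compound, source, value) tuples.
    Compounds with fewer than two measurements are skipped, because a
    sample standard deviation is undefined for a single value.
    """
    by_compound = defaultdict(list)
    for compound, _source, value in records:
        by_compound[compound].append(value)
    return {c: stdev(v) for c, v in by_compound.items() if len(v) >= 2}


# Made-up logS values from three hypothetical laboratories.
records = [
    ("cmpd_1", "lab_A", -3.0),
    ("cmpd_1", "lab_B", -3.4),
    ("cmpd_1", "lab_C", -3.2),
    ("cmpd_2", "lab_A", -1.5),
]
spread = interlab_std(records)
print(spread)
```

A spread of this kind sets a floor on the error any model can be expected to reach, since the experimental labels themselves disagree by that much.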


📕Scientific Data (IF=5.9)
#dataset