Chem ML/AI/Datasets – Telegram

Chem ML/AI/Datasets

794 subscribers

35 photos

1 video

2 files

174 links

Daily articles and news from the field of machine learning in chemistry from the researchers of IGIC RAS @chemrussia

For contact: @levkrasnov @st613laboratory @StasBezzubov

Chem ML/AI/Datasets

Acquisition of absorption and fluorescence spectral data using chatbots

https://doi.org/10.1039/D4DD00255E

A guide to quickly writing new papers for scientific journals:

1) Take ChatGPT or any other LLM
2) Ask it about the properties of molecule X
3) Write the answers into a table

✅ Profit: you get a paper in a journal with IF=6.2

😁12🔥3👍2

865 views · 06:58

Chem ML/AI/Datasets

ChEMBL 35 is out!

https://chembl.blogspot.com/2024/12/heres-nice-christmas-gift-chembl-35-is.html

A new version of the ChEMBL database has been released:

This fresh release comes with a wealth of new data sets and some new data sources as well. Examples include a total of 14 datasets deposited by the ASAP (AI-driven Structure-enabled Antiviral Platform) project, a new NTD data set by Aberystwyth University on anti-schistosome activity, nine new chemical probe data sets, and seven new data sets for the Chemogenomic library of the EUbOPEN project.

This version of the database, prepared on 01/12/2024 contains:

2,496,335 compounds (of which 2,474,590 have mol files)
3,185,505 compound records (non-unique compounds)
21,123,501 activities
1,740,546 assays
16,003 targets
92,121 documents
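
As a quick sanity check on these figures, the Python sketch below derives a few ratios from the counts quoted above. The variable names are illustrative only and are not part of ChEMBL or its tooling.

```python
# Release counts as quoted in the ChEMBL 35 announcement above.
compounds = 2_496_335
compounds_with_molfile = 2_474_590
activities = 21_123_501
assays = 1_740_546

# Share of compounds that come with a mol file.
molfile_fraction = compounds_with_molfile / compounds
# Number of compounds without a mol file.
missing_molfiles = compounds - compounds_with_molfile
# Average number of activity records per assay.
activities_per_assay = activities / assays

print(f"{molfile_fraction:.2%} of compounds have mol files")
print(f"{missing_molfiles:,} compounds lack a mol file")
print(f"{activities_per_assay:.1f} activities per assay on average")
```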

👍7🔥4❤2

792 views · edited 10:10

Chem ML/AI/Datasets

SMILES All Around: Structure to SMILES conversion for Transition Metal Complexes

https://doi.org/10.26434/chemrxiv-2024-c660p

This should be very useful for anyone working in organometallic chemistry:

We present a method for creating RDKit parsable SMILES for transition metal complexes (TMCs) based on xyz-coordinates and overall charge of the complex. This can be viewed as an extension to the program xyz2mol that does the same for organic molecules. The only dependency is RDKit, which makes it widely applicable. One thing that has been lacking when it comes to generating SMILES from structure for TMCs is an existing SMILES dataset to compare with.

🖥Github link: https://github.com/jensengroup/xyz2mol_tm

#method

👍3❤2🔥2

837 views · 14:30

Chem ML/AI/Datasets

Simulation-Assisted Deep Learning Techniques for Commercially Applicable OLED Phosphorescent Materials

https://doi.org/10.1021/acs.chemmater.4c02754

In this work, phosphorescent materials are represented as strings, molecular graphs, and point clouds, which are employed by language models, two-dimensional graph, and three-dimensional graph neural networks. In addition, more than 200 000 molecules with simulated properties highly relevant to experimental properties are used for pretraining the DL models.

Our work shows high performance in the prediction of five experimental properties that are importantly considered when commercializing OLED devices. This means that faster material discovery for OLEDs can be achieved through DL models that are trained with simulation information that is highly correlated with experimental properties.

📕Chemistry of Materials (IF=7.2)
#method

👍6❤2🔥2

780 views · 16:46

Chem ML/AI/Datasets

Benchmark of Density Functional Theory in the Prediction of 13C Chemical Shielding Anisotropies for Anisotropic Nuclear Magnetic Resonance-Based Structural Elucidation

https://pubs.acs.org/doi/10.1021/acs.jctc.4c01407

In this study, we present a comprehensive benchmark of carbon shielding anisotropies based on coupled cluster reference tensors taken from the NS372 benchmark data set.

Additionally, we investigate the representation of the DFT-predicted shielding tensors, such as the eigenvalues and eigenvectors. Moreover, we evaluated how various DFT methods influence the discrimination of possible relative configurations using recently published ΔΔRCSA data for a set of structurally diverse natural products.

Our findings demonstrate that accurate interpretation of RCSAs for configurational and conformational analysis is possible with semilocal DFT methods, which also reduce computational demands compared to hybrid functionals such as the commonly used B3LYP.

📕Journal of Chemical Theory and Computation (IF=5.7)
#benchmark

❤3👍3🔥2

686 views · 14:56

Chem ML/AI/Datasets

A generative model for inorganic materials design

https://www.nature.com/articles/s41586-025-08628-5

A very interesting paper came out in Nature today.

Microsoft has introduced MatterGen, a new paradigm in materials design based on generative artificial intelligence. MatterGen speeds up materials development by automatically generating and evaluating candidate structures with target properties.

The model can be tuned to generate materials with a specific chemical composition, symmetry, or physical characteristics such as magnetic density, band gap, and mechanical strength, using a training set of more than 608,000 stable compounds from known materials databases.

Experimental validation confirmed the successful synthesis of the material TaCr2O6, whose measured property value the paper reports as being within 20% of the design target.

🖥 The code is freely available on GitHub: https://github.com/microsoft/mattergen

🔥11👍7❤5

3.39K views · 17:34

Chem ML/AI/Datasets

Real-World Applications and Experiences of AI/ML Deployment for Drug Discovery

🔥

https://doi.org/10.1021/acs.jmedchem.4c03044

Briefly summarized are our and others’ experiences with the AI/ML applications that currently have the greatest impact on our work.

📕 The Journal of Medicinal Chemistry has published an Editorial on the ML/AI methods used in drug discovery.

🔥5❤4👍2

639 views · 07:56

Chem ML/AI/Datasets

Hybrid nanophotonic-microfluidic sensor integrated with machine learning for operando state-of-charge monitoring in vanadium flow batteries

https://doi.org/10.1016/j.est.2025.115349

With our modest participation, a paper came out yesterday presenting an improved method for measuring the state of charge (SoC) of vanadium redox flow batteries (VRFB) using the refractive index and machine learning.

The main focus is on using changes in the refractive index (RI) of the electrolytes to estimate the concentration of vanadium ions.

The sensor is based on photonic integrated circuits (PIC) and microfluidic channels, which gives it high sensitivity. The system was tested under battery operating conditions and showed a stable correlation between the spectral characteristics and the charge data.

Using the experimental data, an ML model was trained to accurately predict the state of charge of the vanadium flow battery by analyzing spectral characteristics.

🔗 The article will be freely available via this link for the first 50 days: https://authors.elsevier.com/c/1kSYB,rUrFxfAl

📕Journal of Energy Storage (IF=8.9)
#application

👍7🔥6❤4

1.8K views · 11:21

Chem ML/AI/Datasets

Harnessing Large Language Models to Collect and Analyze Metal–Organic Framework Property Data Set

https://pubs.acs.org/doi/10.1021/jacs.4c11085

Utilizing a chain of advanced large language models (LLMs), we developed a systematic approach to extract and organize MOF data into a structured format.

Our methodology successfully compiled information from more than 40,000 research articles, creating a comprehensive and ready-to-use data set. Specifically, data regarding MOF synthesis conditions and properties were extracted from both tables and text and then analyzed. Subsequently, we utilized the curated database to analyze the relationships between synthesis conditions, properties, and structure.

📕Journal of the American Chemical Society (IF=14.4)
#dataset #method

❤4🔥3👍2

845 views · 11:33

Chem ML/AI/Datasets

Using Classifiers To Predict Catalyst Design for Polyketone Microstructure

https://pubs.acs.org/doi/10.1021/jacs.4c11666

We applied a classifier method to predict palladium catalysts for the formation of nonalternating polyketones via the copolymerization of CO and ethylene; current examples are limited to using phosphine sulfonate and diphosphazane monoxide supporting ligands.

With the reported workflow, we discovered two new classes of palladium complexes capable of achieving the synthesis of nonalternating polyketones with a lower CO content than those made by known palladium catalysts.

Our results show that we doubled the number of classes of palladium compounds that can catalyze the formation of this type of polymer. We envision that this methodology can be applied to accelerate catalyst discovery when selectivity is an important outcome.

📕Journal of the American Chemical Society (IF=14.4)
#method

👍4🔥3❤2

729 views · 10:26

Chem ML/AI/Datasets

🔥 The most rewarding part of assembling large open datasets is when other researchers go on to use them for applied purposes instead of leaving them in a drawer.

Back in 2022 we published BigSolDB on Zenodo, the largest solubility dataset known to us: 54,273 solubility values at temperatures from 243.15 to 403.15 K, covering 138 solvents and 830 compounds.

📕 Quite recently we came across a paper from one of the world's largest pharmaceutical corporations, GlaxoSmithKline (GSK). They used an ML model for predicting compound solubility built on BigSolDB and then integrated it into their laboratory workflows.

According to the authors, our data helped by:
🔹 Supplementing GSK's internal database, which is limited to their own research compounds.
🔹 Providing access to a range of temperature regimes, which improved predictions at high temperatures.
🔹 Adding rare solvents, for which the model previously gave high-error predictions.

Moments like these are a strong motivation to keep building open datasets so that experimental data across different areas of chemistry can be systematized more completely.

👍14🔥8❤6🦄1

916 views · 12:14

Chem ML/AI/Datasets

🔥

A review of large language models and autonomous agents in chemistry

https://doi.org/10.1039/D4SC03921A

This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning.

As agents are an emerging topic, we extend the scope of our review of agents beyond chemistry and discuss across any scientific domains.

This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry.

📕Chemical Science (IF=7.6)
#review

👍6❤3🔥3

1.31K views · 10:40

Chem ML/AI/Datasets

Artificial Intelligence in Retrosynthesis Prediction and its Applications in Medicinal Chemistry

https://doi.org/10.1021/acs.jmedchem.4c02749

Herein, we review the recent advancements in AI applications for retrosynthesis prediction by summarizing related techniques and the landscape of current representative retrosynthesis models and propose feasible solutions to tackle existing problems and outline future directions in this field.

📕Journal of Medicinal Chemistry (IF=6.8)
#review

🔥3❤2👍2

747 views · 12:06

Chem ML/AI/Datasets

🔥

Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates

https://pubs.acs.org/doi/10.1021/jacs.4c15902

The development of machine learning models to predict the regioselectivity of C(sp3)–H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C–H oxidation.

To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions were developed to select the most informative molecules for the specific target.
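
To make the idea of an acquisition function concrete, here is a minimal Python sketch that ranks candidate molecules by Tanimoto similarity to the target and keeps the top few. This is only one plausible acquisition criterion and is an assumption of this sketch; the paper's own acquisition functions are not described in the post. The function names, the feature sets, and the molecule labels are all hypothetical.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def select_for_target(
    target: FrozenSet[str],
    candidates: Dict[str, FrozenSet[str]],
    budget: int,
) -> List[str]:
    """Return the names of the `budget` candidates most similar to the target."""
    ranked = sorted(
        candidates,
        # Highest similarity first; ties are broken alphabetically by name.
        key=lambda name: (-tanimoto(target, candidates[name]), name),
    )
    return ranked[:budget]


# Hypothetical feature sets; in practice these could be fingerprint bits.
target_features = frozenset({"a", "b", "c", "d"})
pool = {
    "mol_1": frozenset({"a", "b", "c"}),
    "mol_2": frozenset({"a", "x"}),
    "mol_3": frozenset({"a", "b", "c", "d", "e"}),
}
selected = select_for_target(target_features, pool, budget=2)
print(selected)
```

Similarity to the target is the simplest criterion to illustrate; uncertainty-based or diversity-based criteria would slot into the same `sorted` key.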

Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C–H radical borylation.

📕Journal of the American Chemical Society (IF=14.4)
#dataset

❤3👍2🔥2

729 views · 09:30

Chem ML/AI/Datasets

🔥

A simple similarity metric for comparing synthetic routes

https://doi.org/10.1039/D4DD00292J

AI-predicted routes are typically compared to experimental syntheses to check for an exact match among the top-ranked predictions (top-N accuracy).

This method is ideal for the evaluation of retrosynthetic algorithms on large datasets (>10^6 routes), but it cannot assess a degree of similarity between routes, which would be desirable for small datasets (<10^2 routes).
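
For reference, top-N accuracy as described above can be sketched in a few lines of Python. Routes are represented here by hypothetical string identifiers, and the function name is an assumption of this sketch, not something taken from the paper.

```python
from typing import Sequence


def top_n_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    n: int,
) -> float:
    """Fraction of targets whose reference route is among the top-n predictions."""
    if len(ranked_predictions) != len(references):
        raise ValueError("one ranked prediction list is needed per reference")
    if not references:
        return 0.0
    hits = sum(
        reference in predictions[:n]
        for predictions, reference in zip(ranked_predictions, references)
    )
    return hits / len(references)


# Hypothetical route identifiers for three targets.
predictions = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]
experimental = ["B", "F", "X"]
for n in (1, 2, 3):
    print(n, top_n_accuracy(predictions, experimental, n))
```

The metric only registers exact matches, which is why it says nothing about how close a near-miss route is to the experimental one.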

Here, we present a simple method to calculate a similarity score between any two synthetic routes to a given molecule.

The score is based on two concepts: which bonds are formed during the synthesis; and how the atoms of the final compound are grouped together throughout the synthesis. As a result, the similarity score overlaps well with chemists' intuition and provides a finer assessment of prediction accuracy.
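
A rough Python sketch of a score built from these two ingredients is shown below. It is not the paper's formula: the Jaccard overlap for each ingredient and the geometric mean used to combine them are assumptions made for illustration, as are the data layout and all names.

```python
from math import sqrt
from typing import FrozenSet, Set, Tuple

Bond = Tuple[int, int]   # pair of atom indices in the final compound, sorted
Group = FrozenSet[int]   # atoms that sit together in one intermediate


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def route_similarity(
    bonds_a: Set[Bond],
    bonds_b: Set[Bond],
    groups_a: Set[Group],
    groups_b: Set[Group],
) -> float:
    """Combine bond overlap and atom-grouping overlap into one score in [0, 1]."""
    bond_score = jaccard(bonds_a, bonds_b)
    group_score = jaccard(groups_a, groups_b)
    return sqrt(bond_score * group_score)


# Two hypothetical routes to the same five-atom target.
route_1_bonds = {(1, 2), (3, 4)}
route_2_bonds = {(1, 2), (4, 5)}
route_1_groups = {frozenset({1, 2}), frozenset({3, 4, 5})}
route_2_groups = {frozenset({1, 2}), frozenset({3, 4}), frozenset({5})}

score = route_similarity(route_1_bonds, route_2_bonds, route_1_groups, route_2_groups)
print(round(score, 3))
```

Unlike an exact-match check, a score of this kind gives partial credit to routes that form some of the same bonds or pass through similar fragments.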

📕Digital Discovery (IF=6.2)
#method

👍3🔥3❤2

722 views · 10:21

Chem ML/AI/Datasets

Balancing molecular information and empirical data in the prediction of physico-chemical properties

https://doi.org/10.1039/D4DD00154K

In this work, we propose a general method for combining molecular descriptors with representation learning using the so-called expectation maximization algorithm from the probabilistic machine-learning literature, which uses uncertainty estimates to trade off between the two approaches.

The proposed hybrid model exploits chemical structure information using graph neural networks, but it automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions that can better specialize to unusual cases. The effectiveness of the proposed method is demonstrated using the prediction of activity coefficients in binary mixtures as an example.
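
The general idea of letting uncertainty estimates decide how much to trust each predictor can be illustrated with a precision-weighted average of two Gaussian estimates. The Python sketch below is a generic stand-in under that assumption, not the expectation maximization procedure from the paper; the function name and the numbers are hypothetical.

```python
from typing import Tuple


def combine_predictions(
    mean_a: float, var_a: float, mean_b: float, var_b: float
) -> Tuple[float, float]:
    """Precision-weighted combination of two independent Gaussian estimates."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    combined_mean = (weight_a * mean_a + weight_b * mean_b) / (weight_a + weight_b)
    combined_var = 1.0 / (weight_a + weight_b)
    return combined_mean, combined_var


# Hypothetical case: the structure-based prediction is uncertain (variance 4.0),
# while the representation-learning prediction is more confident (variance 1.0).
mean, var = combine_predictions(mean_a=1.0, var_a=4.0, mean_b=2.0, var_b=1.0)
print(mean, var)
```

The combined estimate sits closer to the more confident predictor, which mirrors the behaviour described above: an unreliable structure-based prediction is largely overridden.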

The results are compelling, as the method significantly improves predictive accuracy over the current state of the art, showcasing its potential to advance the prediction of physico-chemical properties in general.

📕Digital Discovery (IF=6.2)
#method

👍4❤2🔥1

673 views · edited 13:31

Chem ML/AI/Datasets

🎊 Our paper is finally out today:

Towards Accelerating the Discovery of Efficient Iridium(III) Emitters Using Novel Database and Machine Learning Based Only on Structural Formula

https://doi.org/10.1039/D5TC00305A

1. In this paper we assembled the IrLumDB database, which contains experimental data on 1287 bis-cyclometalated iridium(III) complexes and their photophysical properties: emission wavelength (λmax), quantum yield (PLQY), and lifetime.

2. Using IrLumDB, we trained XGBoost, LightGBM, and CatBoost to predict λmax and PLQY with MAEs of 18.26 nm and 0.13, respectively, in 10-fold cross-validation.

3. We tested the trained models on 33 complexes synthesized in our laboratory, 12 of which were prepared for this paper. The complexes were characterized by NMR, single-crystal X-ray diffraction, high-resolution mass spectrometry, and, in part, powder X-ray diffraction. Nine new structures were deposited in the CCDC.

4. On the compounds we studied, we compared the accuracy of emission wavelength prediction by machine learning algorithms and by DFT calculations, and showed that the machine learning algorithms perform better.

5. Because we care about finding new complexes with potentially high quantum yields, we split all complexes into three classes: low (0-0.1), medium (0.1-0.5), and high (0.5-1) PLQY. We then trained classification models and obtained 72.4% accuracy in 10-fold cross-validation.

6. We prepared a mini-app, IrLumDB App, so that any researcher can predict properties for their own complexes. Only the SMILES of the ligands are needed for a prediction.
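
The PLQY binning from point 5 and the MAE metric from point 2 can be sketched in Python as follows. The post does not say which class the boundary values 0.1 and 0.5 belong to, so assigning them to the upper class is an assumption here; the function names and the sample wavelengths are hypothetical and are not taken from IrLumDB.

```python
from typing import Sequence


def plqy_class(plqy: float) -> str:
    """Map a quantum yield in [0, 1] to the low / medium / high class."""
    if not 0.0 <= plqy <= 1.0:
        raise ValueError("PLQY must lie between 0 and 1")
    if plqy < 0.1:
        return "low"
    if plqy < 0.5:  # assumption: 0.1 counts as medium, 0.5 as high
        return "medium"
    return "high"


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical emission wavelengths in nm (not data from IrLumDB).
measured = [520.0, 610.0, 480.0]
predicted = [530.0, 600.0, 495.0]
mae_nm = mean_absolute_error(measured, predicted)

print([plqy_class(v) for v in (0.05, 0.3, 0.8)])
print(f"MAE = {mae_nm:.2f} nm")
```

In a 10-fold cross-validation the same MAE would be computed on each held-out fold and then averaged across the ten folds.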

Dataset on Zenodo | IrLumDB App

📕Journal of Materials Chemistry C (IF=5.7)
#dataset #application

🔥17❤5👍5🎉2

1.03K views · 08:41

Chem ML/AI/Datasets

Will we ever be able to accurately predict solubility?

https://doi.org/10.1038/s41597-024-03105-6

Accurate prediction of thermodynamic solubility by machine learning remains a challenge.

We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets.

We benchmarked recently published models on a novel curated solubility dataset and report poor performances. We also propose a workflow to curate aqueous solubility data aiming at producing useful models for bench chemists.

Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources.

We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute and data sources. The herein obtained models, and quality-assessed datasets are publicly available.

📕Scientific Data (IF=5.9)
#dataset

👍7🔥5❤3

667 views · 11:05

Chem ML/AI/Datasets

A robust crystal structure prediction method to support small molecule drug development with large scale validation and blind study

https://doi.org/10.1038/s41467-025-57479-1

A very recent paper, published on March 5:

In this paper, we report a crystal structure prediction (CSP) method with state of the art accuracy and efficiency, validated on a large and diverse dataset including 66 molecules with 137 experimentally known polymorphic forms. The method combines a novel systematic crystal packing search algorithm and the use of machine learning force fields in a hierarchical crystal energy ranking.

Our method not only reproduces all the experimentally known polymorphs, but also suggests new low energy polymorphs yet to be discovered by experiment that might pose potential risks to development of the currently known forms of these compounds.

In addition, we report the prediction results of a blinded study, results for Target XXXI from the seventh CSP blind test, and demonstrate how the method can be used to accelerate clinical formulation design and derisk downstream processing.

📕Nature Communications (IF=14.7)
#method

🔥3❤2👍2

779 views · 08:11