Chem ML/AI/Datasets
Daily articles and news from the field of machine learning in chemistry from the researchers of IGIC RAS @chemrussia

For contact: @levkrasnov @st613laboratory @StasBezzubov
Systematic, computational discovery of multicomponent and one-pot reactions

https://www.nature.com/articles/s41467-024-54611-5

This work demonstrates that computers taught the essential knowledge of reaction mechanisms and rules of physical-organic chemistry can design – completely autonomously and in large numbers – mechanistically distinct multicomponent reactions (MCRs).

Moreover, when supplemented by models to approximate kinetic rates, the algorithm can predict reaction yields and identify reactions that have potential for organocatalysis. These predictions are validated by experiments spanning different modes of reactivity and diverse product scaffolds.


📕Nature Communications (IF=14.7)
#method
On-demand reverse design of polymers with PolyTAO

https://www.nature.com/articles/s41524-024-01466-5

In this work, we curate an immense polymer dataset containing nearly one million polymeric structure-property pairs based on expert knowledge. Leveraging this dataset, we propose a Transformer-Assisted Oriented pretrained model for on-demand polymer generation (PolyTAO).

This model generates polymers with 99.27% chemical validity in top-1 generation mode (approximately 200k generated polymers), representing the highest reported success rate among polymer generative models, and this was achieved on the largest test set. Importantly, the average R2 between the properties of the generated polymers and their expected values across 15 predefined properties is 0.96, which underscores PolyTAO’s powerful on-demand polymer generation capabilities.
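
For readers who want to reproduce this kind of figure on their own generated set, here is a minimal sketch of a per-property R2 averaged over properties. The function names and the demo numbers are hypothetical, and the post does not say exactly how the paper computes its R2, so the standard coefficient of determination is assumed.

```python
def r_squared(target, achieved):
    """Coefficient of determination of achieved values against their targets."""
    if len(target) != len(achieved) or len(target) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_target = sum(target) / len(target)
    # Residual sum of squares: how far each achieved value is from its target.
    ss_res = sum((t - a) ** 2 for t, a in zip(target, achieved))
    # Total sum of squares: spread of the targets around their mean.
    ss_tot = sum((t - mean_target) ** 2 for t in target)
    if ss_tot == 0:
        raise ValueError("targets are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


def mean_r_squared(per_property):
    """Average R2 over a dict {property_name: (targets, achieved_values)}."""
    scores = [r_squared(t, a) for t, a in per_property.values()]
    return sum(scores) / len(scores)


# Hypothetical numbers for two made-up properties, purely for illustration.
demo = {
    "property_a": ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]),
    "property_b": ([10.0, 20.0, 30.0], [10.0, 20.0, 30.0]),
}
average = mean_r_squared(demo)
```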


📕npj Computational Materials (IF=9.4)
#dataset #application
Large Language Models for Inorganic Synthesis Predictions

https://pubs.acs.org/doi/10.1021/jacs.4c05840

We evaluate the effectiveness of pretrained and fine-tuned large language models (LLMs) for predicting the synthesizability of inorganic compounds and the selection of precursors needed to perform inorganic synthesis. The predictions of fine-tuned LLMs are comparable to, and sometimes better than, recent bespoke machine learning models for these tasks but require only minimal user expertise, cost, and time to develop. Therefore, this strategy can serve both as an effective and strong baseline for future machine learning studies of various chemical applications and as a practical tool for experimental chemists.


🖥Github link: https://github.com/jschrier/SynthGPT/

📕Journal of the American Chemical Society (IF=14.4)
#method
An open-access database and analysis tool for perovskite solar cells based on the FAIR data principles

https://www.nature.com/articles/s41560-021-00941-3

We collect data from over 42,400 photovoltaic devices with up to 100 parameters per device. We then develop open-source and accessible procedures to analyse the data, providing examples of insights that can be gleaned from the analysis of a large dataset.

The database, graphics and analysis tools are made available to the community and will continue to evolve as an open-source initiative. This approach of extensively capturing the progress of an entire field, including sorting, interactive exploration and graphical representation of the data, will be applicable to many fields in materials science, engineering and biosciences.


Database: http://www.perovskitedatabase.com/

📕 Nature Energy (IF=49.8)
#dataset
High-throughput computational stacking reveals emergent properties in natural van der Waals bilayers

https://www.nature.com/articles/s41467-024-45003-w

Here we employ a density functional theory (DFT) workflow to calculate interlayer binding energies of 8451 homobilayers created by stacking 1052 different monolayers in various configurations. Analysis of the stacking orders in 247 experimentally known van der Waals crystals is used to validate the workflow and determine the criteria for realisable bilayers.
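
For orientation, a common way to write the interlayer binding energy of a homobilayer per unit area is shown below. The symbols are assumptions chosen here for illustration; the post does not give the paper's exact expression or sign convention.

```latex
E_b = \frac{E_{\text{bilayer}} - 2\,E_{\text{monolayer}}}{A}
```

Here $E_{\text{bilayer}}$ and $E_{\text{monolayer}}$ are total energies per unit cell and $A$ is the in-plane cell area; with this sign convention a bound bilayer has $E_b < 0$.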

For the 2586 most stable bilayer systems, we calculate a range of electronic, magnetic, and vibrational properties, and explore general trends and anomalies. We identify an abundance of bistable bilayers with stacking order-dependent magnetic or electrical polarisation states making them candidates for slidetronics applications.


van der Waals Bilayer Database (BiDB): https://cmr.fysik.dtu.dk/bidb/bidb.html

📕Nature Communications (IF=14.7)
#dataset
Advancing material property prediction: using physics-informed machine learning models for viscosity

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-024-00820-5

In this work, we curated a comprehensive dataset of over 4000 small organic molecules’ viscosities from scientific literature, publications, and online databases. This dataset enabled us to develop quantitative structure–property relationships (QSPR) consisting of descriptor-based and graph neural network models to predict temperature-dependent viscosities for a wide range of viscosities.

The QSPR models reveal that including MD descriptors improves the prediction of experimental viscosities, particularly at the small data set scale of fewer than a thousand data points.

Furthermore, feature importance tools reveal that intermolecular interactions captured by MD descriptors are most important for viscosity predictions.

Finally, the QSPR models can accurately capture the inverse relationship between viscosity and temperature for six battery-relevant solvents, some of which were not included in the original data set.
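
The inverse viscosity-temperature trend is often summarized with the empirical Andrade (Arrhenius-type) form ln(eta) = A + B/T. The sketch below fits that form by ordinary least squares on synthetic points; it is not the paper's QSPR model, and the coefficients, temperatures, and function names are invented for illustration.

```python
import math


def fit_andrade(temps_k, viscosities):
    """Least-squares fit of ln(eta) = a + b / T; returns (a, b)."""
    xs = [1.0 / t for t in temps_k]
    ys = [math.log(v) for v in viscosities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def andrade_viscosity(a, b, temp_k):
    """Viscosity predicted by the fitted Andrade form at temperature temp_k (K)."""
    return math.exp(a + b / temp_k)


# Synthetic data generated from assumed coefficients (not experimental values).
A_TRUE, B_TRUE = -4.0, 1500.0
temps = [283.15, 298.15, 313.15, 333.15]
etas = [math.exp(A_TRUE + B_TRUE / t) for t in temps]

a_fit, b_fit = fit_andrade(temps, etas)
```

A positive fitted b means the predicted viscosity falls as temperature rises, which is the behaviour the post describes for the battery-relevant solvents.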


📕 Journal of Cheminformatics (IF=7.1)
#dataset
Practical guidelines for the use of gradient boosting for molecular property prediction

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00743-7

Our study provides the first comprehensive comparison of XGBoost, LightGBM, and CatBoost for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total.

Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets.
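
As a reminder of what all these libraries have in common, here is a minimal sketch of squared-error gradient boosting with depth-1 trees (stumps) on a single feature. It is a teaching toy under stated assumptions: the function names are made up, the data are synthetic, and none of the optimizations or APIs of XGBoost or LightGBM (or the paper's setup) are reproduced.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs under squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps to the current residuals, one round at a time."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((thr, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= thr else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic 1-D regression problem, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x * x for x in xs]
model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```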


📕 Journal of Cheminformatics (IF=7.1)
#method
Machine Learning Methods for Small Data Challenges in Molecular Science

https://pubs.acs.org/doi/10.1021/acs.chemrev.3c00189

A 600-reference review devoted to small datasets and to the various ML algorithms used to work with them.

In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences.

We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), Generative Adversarial Network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation.

Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.


📕Chemical Reviews (IF=51.4)
#review
Deep-PK: deep learning for small molecule pharmacokinetic and toxicity prediction

https://doi.org/10.1093/nar/gkae254

We came across a new web service that predicts 64 ADMET properties and 9 general molecular properties for free. According to the authors, it does so more accurately than previously known models.

🔥Link to the service: https://biosig.lab.uq.edu.au/deeppk/

📕Nucleic Acids Research (IF=16.6)
#method
Chemprop: A Machine Learning Package for Chemical Property Prediction

https://doi.org/10.1021/acs.jcim.3c01250

The software package Chemprop implements the directed message-passing neural networks (D-MPNN) architecture and offers simple, easy, and fast access to machine-learned molecular properties. Compared to its initial version, we present a multitude of new Chemprop functionalities such as the support of multimolecule properties, reactions, atom/bond-level properties, and spectra.

Further, we incorporate various uncertainty quantification and calibration methods along with related metrics as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features.

We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction data sets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra.


🖥Github link, 🌟1.8k: https://github.com/chemprop/chemprop

📕Journal of Chemical Information and Modeling (IF=5.6)
#method
Acquisition of absorption and fluorescence spectral data using chatbots

https://doi.org/10.1039/D4DD00255E

A guide to quickly writing new papers for scientific journals:

1) Take ChatGPT or any other LLM
2) Ask it about the properties of molecule X
3) Write the answers down in a table

Profit: you get a paper in a journal with IF=6.2
ChEMBL 35 is out!

https://chembl.blogspot.com/2024/12/heres-nice-christmas-gift-chembl-35-is.html

A new version of the ChEMBL database has been released:

This fresh release comes with a wealth of new data sets and some new data sources as well. Examples include a total of 14 datasets deposited by the ASAP (AI-driven Structure-enabled Antiviral Platform) project, a new NTD data set by Aberystwyth University on anti-schistosome activity, nine new chemical probe data sets, and seven new data sets for the Chemogenomic library of the EUbOPEN project.

This version of the database, prepared on 01/12/2024, contains:

2,496,335 compounds (of which 2,474,590 have mol files)
3,185,505 compound records (non-unique compounds)
21,123,501 activities
1,740,546 assays
16,003 targets
92,121 documents
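
A few ratios make these counts easier to read. The snippet below only does arithmetic on the numbers quoted above; the dictionary keys are names chosen here, not ChEMBL schema fields.

```python
# Release counts copied from the announcement above.
chembl_35 = {
    "compounds": 2_496_335,
    "compounds_with_mol_files": 2_474_590,
    "compound_records": 3_185_505,
    "activities": 21_123_501,
    "assays": 1_740_546,
    "targets": 16_003,
    "documents": 92_121,
}

# Share of compounds that ship with a mol file.
mol_file_share = chembl_35["compounds_with_mol_files"] / chembl_35["compounds"]
# Average number of (non-unique) records per unique compound.
records_per_compound = chembl_35["compound_records"] / chembl_35["compounds"]
# Average number of activity values per assay.
activities_per_assay = chembl_35["activities"] / chembl_35["assays"]

print(f"compounds with mol files: {mol_file_share:.2%}")
print(f"records per compound:     {records_per_compound:.2f}")
print(f"activities per assay:     {activities_per_assay:.2f}")
```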

#dataset
SMILES All Around: Structure to SMILES conversion for Transition Metal Complexes

https://doi.org/10.26434/chemrxiv-2024-c660p

This should be very useful for anyone working in organometallic chemistry:
We present a method for creating RDKit parsable SMILES for transition metal complexes (TMCs) based on xyz-coordinates and overall charge of the complex. This can be viewed as an extension to the program xyz2mol that does the same for organic molecules. The only dependency is RDKit, which makes it widely applicable. One thing that has been lacking when it comes to generating SMILES from structure for TMCs is an existing SMILES dataset to compare with.
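
The inputs mentioned here are xyz coordinates plus an overall charge. As a minimal sketch of the first half, the code below reads the standard XYZ text layout (atom count, comment line, then one `symbol x y z` line per atom). It does not call xyz2mol_tm, whose interface is not shown in this post; the function name and the water example are illustrative assumptions only.

```python
def parse_xyz(text):
    """Parse XYZ-format text into (comment, [(symbol, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].split()[0])          # line 1: number of atoms
    comment = lines[1].strip() if len(lines) > 1 else ""  # line 2: free-text comment
    atoms = []
    for line in lines[2:2 + n_atoms]:           # remaining lines: symbol x y z
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


# Small illustrative molecule (water); a real use case would be a metal complex.
WATER_XYZ = """3
water, illustrative geometry
O   0.000   0.000   0.117
H   0.000   0.757  -0.467
H   0.000  -0.757  -0.467
"""

comment, atoms = parse_xyz(WATER_XYZ)
overall_charge = 0  # supplied separately, since the XYZ format does not store it
```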


🖥Github link: https://github.com/jensengroup/xyz2mol_tm

#method
Simulation-Assisted Deep Learning Techniques for Commercially Applicable OLED Phosphorescent Materials

https://doi.org/10.1021/acs.chemmater.4c02754

In this work, phosphorescent materials are represented as strings, molecular graphs, and point clouds, which are employed by language models, two-dimensional graph, and three-dimensional graph neural networks. In addition, more than 200 000 molecules with simulated properties highly relevant to experimental properties are used for pretraining the DL models.

Our work shows high performance in the prediction of five experimental properties that are importantly considered when commercializing OLED devices. This means that faster material discovery for OLEDs can be achieved through DL models that are trained with simulation information that is highly correlated with experimental properties.


📕Chemistry of Materials (IF=7.2)
#method
Benchmark of Density Functional Theory in the Prediction of 13C Chemical Shielding Anisotropies for Anisotropic Nuclear Magnetic Resonance-Based Structural Elucidation

https://pubs.acs.org/doi/10.1021/acs.jctc.4c01407

In this study, we present a comprehensive benchmark of carbon shielding anisotropies based on coupled cluster reference tensors taken from the NS372 benchmark data set.

Additionally, we investigate the representation of the DFT-predicted shielding tensors, such as the eigenvalues and eigenvectors. Moreover, we evaluated how various DFT methods influence the discrimination of possible relative configurations using recently published ΔΔRCSA data for a set of structurally diverse natural products.

Our findings demonstrate that accurate interpretation of RCSAs for configurational and conformational analysis is possible with semilocal DFT methods, which also reduce computational demands compared to hybrid functionals such as the commonly used B3LYP.


📕Journal of Chemical Theory and Computation (IF=5.7)
#benchmark
A generative model for inorganic materials design

https://www.nature.com/articles/s41586-025-08628-5

A very interesting paper came out in Nature today.

Microsoft has introduced MatterGen, a new paradigm in materials design based on generative AI. MatterGen speeds up materials development by automatically generating and evaluating candidate structures with target properties.

The model can be tuned to produce materials with specific chemical compositions, symmetry, or physical characteristics such as magnetic density, band gap, and mechanical strength, using a training set of more than 608,000 stable compounds from known materials databases.

Experimental validation confirmed the successful synthesis of the material TaCr2O6, matching the model's predictions.

🖥The code is freely available on GitHub: https://github.com/microsoft/mattergen
Real-World Applications and Experiences of AI/ML Deployment for Drug Discovery

🔥https://doi.org/10.1021/acs.jmedchem.4c03044

Briefly summarized are our and others’ experiences with the AI/ML applications that currently have the greatest impact on our work.


📕Journal of Medicinal Chemistry has published an Editorial on the ML/AI methods used in drug discovery.
Hybrid nanophotonic-microfluidic sensor integrated with machine learning for operando state-of-charge monitoring in vanadium flow batteries

https://doi.org/10.1016/j.est.2025.115349

With our modest participation, a paper came out yesterday presenting an improved method for measuring the state of charge (SoC) of vanadium redox flow batteries (VRFB) using the refractive index and machine learning.

The main focus is on using changes in the refractive index (RI) of the electrolytes to estimate the concentration of vanadium ions.

The sensor is based on photonic integrated circuits (PIC) and microfluidic channels, which gives it high sensitivity. The system was tested under battery operating conditions and showed a stable correlation between the spectral characteristics and the charge data.

Using the experimental data, an ML model was trained to accurately predict the state of charge of a vanadium flow battery by analyzing the spectral characteristics.

🔗The article is available for free at this link for the first 50 days: https://authors.elsevier.com/c/1kSYB,rUrFxfAl

📕Journal of Energy Storage (IF=8.9)
#application
Harnessing Large Language Models to Collect and Analyze Metal–Organic Framework Property Data Set

https://pubs.acs.org/doi/10.1021/jacs.4c11085

Utilizing a chain of advanced large language models (LLMs), we developed a systematic approach to extract and organize MOF data into a structured format.

Our methodology successfully compiled information from more than 40,000 research articles, creating a comprehensive and ready-to-use data set. Specifically, data regarding MOF synthesis conditions and properties were extracted from both tables and text and then analyzed. Subsequently, we utilized the curated database to analyze the relationships between synthesis conditions, properties, and structure.


📕Journal of the American Chemical Society (IF=14.4)
#dataset #method
Using Classifiers To Predict Catalyst Design for Polyketone Microstructure

https://pubs.acs.org/doi/10.1021/jacs.4c11666

We applied a classifier method to predict palladium catalysts for the formation of nonalternating polyketones via the copolymerization of CO and ethylene; current examples are limited to using phosphine sulfonate and diphosphazane monoxide supporting ligands.

With the reported workflow, we discovered two new classes of palladium complexes capable of achieving the synthesis of nonalternating polyketones with a lower CO content than those made by known palladium catalysts.

Our results show that we doubled the number of classes of palladium compounds that can catalyze the formation of this type of polymer. We envision that this methodology can be applied to accelerate catalyst discovery when selectivity is an important outcome.


📕Journal of the American Chemical Society (IF=14.4)
#method
🔥The most rewarding part of assembling large open datasets is when other researchers later put them to practical use instead of leaving them in a drawer.

Back in 2022 we published BigSolDB on Zenodo: the largest solubility dataset known to us, containing 54,273 solubility values at temperatures from 243.15 to 403.15 K and covering 138 solvents and 830 compounds.

📕Quite recently we came across a paper from one of the world's largest pharmaceutical corporations, GlaxoSmithKline (GSK). They built an ML model for predicting compound solubility on top of BigSolDB and then integrated it into their laboratory workflows.

How our data helped, according to the authors:
🔹 It supplemented GSK's internal database, which is limited to their own research compounds.
🔹 It gave access to a range of temperature regimes, which improved predictions at high temperatures.
🔹 It added rare solvents that the model had previously predicted with high error.

Moments like these are a strong motivation to keep building open datasets for a more complete systematization of experimental data across different areas of chemistry.