Chem ML/AI/Datasets

Systematic, computational discovery of multicomponent and one-pot reactions

Nature Communications - Multi component reactions (MCRs) can build complex scaffolds from multiple starting materials in just one step without purification of intermediates but until now MCRs have...

❤6👍3🔥3

462 views 07:21

On-demand reverse design of polymers with PolyTAO

https://www.nature.com/articles/s41524-024-01466-5

In this work, we curate an immense polymer dataset containing nearly one million polymeric structure-property pairs based on expert knowledge. Leveraging this dataset, we propose a Transformer-Assisted Oriented pretrained model for on-demand polymer generation (PolyTAO).

This model generates polymers with 99.27% chemical validity in top-1 generation mode (approximately 200k generated polymers), representing the highest reported success rate among polymer generative models, and this was achieved on the largest test set. Importantly, the average R2 between the properties of the generated polymers and their expected values across 15 predefined properties is 0.96, which underscores PolyTAO’s powerful on-demand polymer generation capabilities.
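As a rough illustration of the averaged R2 figure quoted above, the sketch below computes a coefficient of determination (R²) per property and then averages the results. It assumes the usual definition, 1 - SS_res/SS_tot, with the requested target values treated as the predictions; the post does not say how the paper computes the score, and the function names and toy numbers here are hypothetical.

```python
from statistics import mean


def r_squared(expected, observed):
    """Coefficient of determination, treating `expected` as the prediction."""
    if len(expected) != len(observed) or len(observed) < 2:
        raise ValueError("need two equally sized sequences with at least 2 values")
    observed_mean = mean(observed)
    ss_res = sum((o - e) ** 2 for e, o in zip(expected, observed))
    ss_tot = sum((o - observed_mean) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def average_r_squared(per_property):
    """Mean R^2 over a dict of {property_name: (expected, observed)}."""
    return mean(r_squared(exp, obs) for exp, obs in per_property.values())


# Toy numbers, invented for illustration only (not taken from the paper).
toy = {
    "property_a": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    "property_b": ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),
}
print(f"average R^2 = {average_r_squared(toy):.3f}")
```

Averaging per-property scores gives every property equal weight regardless of its units or scale, which is one plausible reading of "average R2 across 15 predefined properties".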

📕npj Computational Materials (IF=9.4)
#dataset #application

👍5❤4🔥4

487 views 07:27

Large Language Models for Inorganic Synthesis Predictions

https://pubs.acs.org/doi/10.1021/jacs.4c05840

We evaluate the effectiveness of pretrained and fine-tuned large language models (LLMs) for predicting the synthesizability of inorganic compounds and the selection of precursors needed to perform inorganic synthesis. The predictions of fine-tuned LLMs are comparable to, and sometimes better than, recent bespoke machine learning models for these tasks but require only minimal user expertise, cost, and time to develop. Therefore, this strategy can serve both as an effective and strong baseline for future machine learning studies of various chemical applications and as a practical tool for experimental chemists.

🖥Github link: https://github.com/jschrier/SynthGPT/

📕Journal of the American Chemical Society (IF=14.4)
#method

❤4🔥4👍3

785 views, edited 07:21

An open-access database and analysis tool for perovskite solar cells based on the FAIR data principles

https://www.nature.com/articles/s41560-021-00941-3

We collect data from over 42,400 photovoltaic devices with up to 100 parameters per device. We then develop open-source and accessible procedures to analyse the data, providing examples of insights that can be gleaned from the analysis of a large dataset.

The database, graphics and analysis tools are made available to the community and will continue to evolve as an open-source initiative. This approach of extensively capturing the progress of an entire field, including sorting, interactive exploration and graphical representation of the data, will be applicable to many fields in materials science, engineering and biosciences.

Database: http://www.perovskitedatabase.com/

📕 Nature Energy (IF=49.8)
#dataset

Nature Energy - Making large datasets findable, accessible, interoperable and reusable could accelerate technology development. Now, Jacobsson et al. present an approach to build an open-access...

❤5🔥4👍3

521 views 17:06

High-throughput computational stacking reveals emergent properties in natural van der Waals bilayers

https://www.nature.com/articles/s41467-024-45003-w

Here we employ a density functional theory (DFT) workflow to calculate interlayer binding energies of 8451 homobilayers created by stacking 1052 different monolayers in various configurations. Analysis of the stacking orders in 247 experimentally known van der Waals crystals is used to validate the workflow and determine the criteria for realisable bilayers.

For the 2586 most stable bilayer systems, we calculate a range of electronic, magnetic, and vibrational properties, and explore general trends and anomalies. We identify an abundance of bistable bilayers with stacking order-dependent magnetic or electrical polarisation states making them candidates for slidetronics applications.

van der Waals Bilayer Database (BiDB): https://cmr.fysik.dtu.dk/bidb/bidb.html

📕Nature Communications (IF=14.7)
#dataset

Nature Communications - 2D bilayers have recently attracted significant attention due to fundamental properties like interlayer excitons and interfacial ferroelectricity. Here, the authors report a...

❤4👍4🔥4

479 views, edited 18:37

Advancing material property prediction: using physics-informed machine learning models for viscosity - Journal of Cheminformatics

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-024-00820-5

In this work, we curated a comprehensive dataset of over 4000 small organic molecules’ viscosities from scientific literature, publications, and online databases. This dataset enabled us to develop quantitative structure–property relationships (QSPR) consisting of descriptor-based and graph neural network models to predict temperature-dependent viscosities for a wide range of viscosities.

The QSPR models reveal that including MD descriptors improves the prediction of experimental viscosities, particularly at the small data set scale of fewer than a thousand data points.

Furthermore, feature importance tools reveal that intermolecular interactions captured by MD descriptors are most important for viscosity predictions.

Finally, the QSPR models can accurately capture the inverse relationship between viscosity and temperature for six battery-relevant solvents, some of which were not included in the original data set.
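The inverse viscosity-temperature trend mentioned above can be illustrated with the classic Andrade-type relation, ln(η) = A + B/T. This is a textbook empirical form swapped in for illustration, not the QSPR model from the paper, and the coefficients below are hypothetical values chosen only to show the shape of the curve.

```python
import math


def andrade_viscosity(temperature_k, a, b):
    """Andrade-type relation: ln(eta) = a + b / T, with T in kelvin."""
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return math.exp(a + b / temperature_k)


# Hypothetical coefficients, not fitted to any real solvent.
A, B = -4.0, 1200.0

temperatures = [280.0, 300.0, 320.0, 340.0]
viscosities = [andrade_viscosity(t, A, B) for t in temperatures]

for t, eta in zip(temperatures, viscosities):
    print(f"{t:.0f} K -> {eta:.3f} (arbitrary units)")
```

With a positive B, the computed viscosity falls monotonically as temperature rises, which is the qualitative behaviour the models are reported to capture.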

📕 Journal of Cheminformatics (IF=7.1)
#dataset

In materials science, accurately computing properties like viscosity, melting point, and glass transition temperatures solely through physics-based models is challenging. Data-driven machine learning (ML) also poses challenges in constructing ML models, especially…

❤3🔥3👍2

506 views 10:38

Practical guidelines for the use of gradient boosting for molecular property prediction - Journal of Cheminformatics

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-023-00743-7

Our study provides the first comprehensive comparison of gradient boosting implementations (including XGBoost and LightGBM) for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total.

Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets.

📕 Journal of Cheminformatics (IF=7.1)
#method

Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure–activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention…

🔥6❤4👍3

556 views 07:26

Machine Learning Methods for Small Data Challenges in Molecular Science

https://pubs.acs.org/doi/10.1021/acs.chemrev.3c00189

A review with 600 references on small datasets and the various ML algorithms used to work with them.

In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences.

We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), Generative Adversarial Network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation.

Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.

📕Chemical Reviews (IF=51.4)
#review

Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, big data have been the focus for the past decade…

👍6❤3🔥3❤‍🔥1

606 views 07:12

Deep-PK: deep learning for small molecule pharmacokinetic and toxicity prediction

https://doi.org/10.1093/nar/gkae254

We came across a new web service that predicts 64 ADMET properties and 9 general molecular properties for free. According to the authors, it does so more accurately than previously known models.

🔥Link to the service: https://biosig.lab.uq.edu.au/deeppk/

📕Nucleic Acids Research (IF=16.6)
#method

Abstract. Evaluating pharmacokinetic properties of small molecules is considered a key feature in most drug development and high-throughput screening proce

❤5👍5🔥5

687 views 08:40

Chemprop: A Machine Learning Package for Chemical Property Prediction

https://doi.org/10.1021/acs.jcim.3c01250

The software package Chemprop implements the directed message-passing neural networks (D-MPNN) architecture and offers simple, easy, and fast access to machine-learned molecular properties. Compared to its initial version, we present a multitude of new Chemprop functionalities such as the support of multimolecule properties, reactions, atom/bond-level properties, and spectra.

Further, we incorporate various uncertainty quantification and calibration methods along with related metrics as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features.

We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction data sets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra.

🖥Github link, 🌟1.8k: https://github.com/chemprop/chemprop

📕Journal of Chemical Information and Modeling (IF=5.6)
#method

Deep learning has become a powerful and frequently employed tool for the prediction of molecular properties, thus creating a need for open-source and versatile software solutions that can be operated by nonexperts. Among the current approaches, directed message…

🔥5❤2👍2

786 views 16:24

Acquisition of absorption and fluorescence spectral data using chatbots

https://doi.org/10.1039/D4DD00255E

A guide to quickly writing new papers for scientific journals:

1) Take ChatGPT or any other LLM
2) Ask it about the properties of molecule X
3) Record the answers in a table

✅Profit: you get a paper in a journal with IF=6.2

😁12🔥3👍2

865 views 06:58

Here's a nice Christmas gift - ChEMBL 35 is out!

ChEMBL 35 is out!

https://chembl.blogspot.com/2024/12/heres-nice-christmas-gift-chembl-35-is.html

A new version of the ChEMBL database has been released:

This fresh release comes with a wealth of new data sets and some new data sources as well. Examples include a total of 14 datasets deposited by the ASAP (AI-driven Structure-enabled Antiviral Platform) project, a new NTD data set by Aberystwyth University on anti-schistosome activity, nine new chemical probe data sets, and seven new data sets for the Chemogenomic library of the EUbOPEN project.

This version of the database, prepared on 01/12/2024, contains:

2,496,335 compounds (of which 2,474,590 have mol files)
3,185,505 compound records (non-unique compounds)
21,123,501 activities
1,740,546 assays
16,003 targets
92,121 documents
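To put the release figures above in proportion, the short sketch below derives a few ratios from them. The counts are copied from the list above; the dictionary keys and the helper function are hypothetical names used only for this illustration.

```python
# Release statistics copied from the ChEMBL 35 announcement above.
chembl_35 = {
    "compounds": 2_496_335,
    "compounds_with_mol_files": 2_474_590,
    "compound_records": 3_185_505,
    "activities": 21_123_501,
    "assays": 1_740_546,
    "targets": 16_003,
    "documents": 92_121,
}


def ratio(numerator_key, denominator_key, stats=chembl_35):
    """Return stats[numerator_key] / stats[denominator_key]."""
    return stats[numerator_key] / stats[denominator_key]


mol_file_fraction = ratio("compounds_with_mol_files", "compounds")
records_per_compound = ratio("compound_records", "compounds")
activities_per_compound = ratio("activities", "compounds")
activities_per_assay = ratio("activities", "assays")

print(f"compounds with mol files: {mol_file_fraction:.1%}")
print(f"records per compound:     {records_per_compound:.2f}")
print(f"activities per compound:  {activities_per_compound:.2f}")
print(f"activities per assay:     {activities_per_assay:.2f}")
```

These are plain averages over the whole release, so they say nothing about how unevenly activities are distributed across compounds or assays.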

#dataset

Use your well-deserved Christmas holidays to spend time with your loved ones and explore the new release of ChEMBL 35! This fresh...

👍7🔥4❤2

792 views, edited 10:10

SMILES All Around: Structure to SMILES conversion for Transition Metal Complexes

https://doi.org/10.26434/chemrxiv-2024-c660p

This should be very useful for anyone working in organometallic chemistry:

We present a method for creating RDKit parsable SMILES for transition metal complexes (TMCs) based on xyz-coordinates and overall charge of the complex. This can be viewed as an extension to the program xyz2mol that does the same for organic molecules. The only dependency is RDKit, which makes it widely applicable. One thing that has been lacking when it comes to generating SMILES from structure for TMCs is an existing SMILES dataset to compare with.
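The two inputs named in the abstract, xyz coordinates and the overall charge, can be sketched as a small data structure. The snippet below only reads the plain XYZ text layout (atom count, comment line, then one `element x y z` line per atom); it does not call xyz2mol_tm or RDKit, and the class name, function name, and toy coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ComplexGeometry:
    symbols: List[str]
    coords: List[Tuple[float, float, float]]
    charge: int  # overall charge of the complex, supplied by the user


def parse_xyz(text: str, charge: int) -> ComplexGeometry:
    """Parse XYZ-formatted text: atom count, comment line, then atom lines."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].strip())
    atom_lines = lines[2 : 2 + n_atoms]
    if len(atom_lines) != n_atoms:
        raise ValueError(f"expected {n_atoms} atom lines, found {len(atom_lines)}")

    symbols, coords = [], []
    for line in atom_lines:
        parts = line.split()
        if len(parts) < 4:
            raise ValueError(f"malformed atom line: {line!r}")
        symbols.append(parts[0])
        x, y, z = (float(value) for value in parts[1:4])
        coords.append((x, y, z))
    return ComplexGeometry(symbols, coords, charge)


# Toy fragment; the coordinates are invented and not a real geometry.
TOY_XYZ = """3
toy fragment, coordinates invented for illustration
Fe 0.000 0.000 0.000
Cl 2.200 0.000 0.000
Cl -2.200 0.000 0.000
"""

geometry = parse_xyz(TOY_XYZ, charge=0)
print(geometry.symbols, geometry.charge)
```

The charge is kept alongside the coordinates because an XYZ file carries no charge or bond information, which is why the method described above needs it as a separate input.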

🖥Github link: https://github.com/jensengroup/xyz2mol_tm

#method

👍3❤2🔥2

837 views 14:30

Simulation-Assisted Deep Learning Techniques for Commercially Applicable OLED Phosphorescent Materials

https://doi.org/10.1021/acs.chemmater.4c02754

In this work, phosphorescent materials are represented as strings, molecular graphs, and point clouds, which are employed by language models, two-dimensional graph, and three-dimensional graph neural networks. In addition, more than 200 000 molecules with simulated properties highly relevant to experimental properties are used for pretraining the DL models.

Our work shows high performance in the prediction of five experimental properties that are importantly considered when commercializing OLED devices. This means that faster material discovery for OLEDs can be achieved through DL models that are trained with simulation information that is highly correlated with experimental properties.

📕Chemistry of Materials (IF=7.2)
#method

Phosphorescent light-emitting materials play a central role in organic light-emitting diode (OLED) devices. Due to their synthesis difficulties, unsystematic trial-and-error synthesis is prohibitively challenging. For this reason, deep learning (DL), which…

👍6❤2🔥2

780 views 16:46

Benchmark of Density Functional Theory in the Prediction of 13C Chemical Shielding Anisotropies for Anisotropic Nuclear Magnetic Resonance-Based Structural Elucidation

https://pubs.acs.org/doi/10.1021/acs.jctc.4c01407

In this study, we present a comprehensive benchmark of carbon shielding anisotropies based on coupled cluster reference tensors taken from the NS372 benchmark data set.

Additionally, we investigate the representation of the DFT-predicted shielding tensors, such as the eigenvalues and eigenvectors. Moreover, we evaluated how various DFT methods influence the discrimination of possible relative configurations using recently published ΔΔRCSA data for a set of structurally diverse natural products.

Our findings demonstrate that accurate interpretation of RCSAs for configurational and conformational analysis is possible with semilocal DFT methods, which also reduce computational demands compared to hybrid functionals such as the commonly used B3LYP.

📕Journal of Chemical Theory and Computation (IF=5.7)
#benchmark

Density functional theory (DFT) calculations have emerged as a powerful theoretical toolbox for interpreting and analyzing the experimental nuclear magnetic resonance (NMR) spectra of chemical compounds. While DFT has been extensively used and benchmarked…

❤3👍3🔥2

686 views 14:56

A generative model for inorganic materials design

https://www.nature.com/articles/s41586-025-08628-5

A very interesting paper came out in Nature today.

Microsoft has introduced MatterGen, a new paradigm in materials design based on generative AI. MatterGen speeds up materials development by automatically generating and evaluating candidate structures with target properties.

The model can be tuned to generate materials with specific chemical compositions, symmetry, or physical characteristics such as magnetic density, band gap, and mechanical strength, using a training set of more than 608,000 stable compounds from known materials databases.

Experimental validation confirmed the successful synthesis of the material TaCr2O6, consistent with the model's predictions.

🖥The code is freely available on GitHub: https://github.com/microsoft/mattergen

🔥11👍7❤5

3.39K views 17:34

https://doi.org/10.1021/acs.jmedchem.4c03044

Real-World Applications and Experiences of AI/ML Deployment for Drug Discovery

🔥

Briefly summarized are our and others’ experiences with the AI/ML applications that currently have the greatest impact on our work.

The 📕Journal of Medicinal Chemistry has published an Editorial on the ML/AI methods used in drug discovery.

🔥5❤4👍2

639 views 07:56

Hybrid nanophotonic-microfluidic sensor integrated with machine learning for operando state-of-charge monitoring in vanadium flow batteries

https://doi.org/10.1016/j.est.2025.115349

With our modest participation, a paper came out yesterday presenting an improved method for measuring the state of charge (SoC) of vanadium redox flow batteries (VRFB) using the refractive index and machine learning.

The main focus is on using changes in the refractive index (RI) of the electrolytes to estimate the concentration of vanadium ions.

The sensor is based on photonic integrated circuits (PIC) and microfluidic channels, which gives it high sensitivity. The system was tested under battery operating conditions and showed a stable correlation between the spectral characteristics and the charge data.

Using the experimental data, an ML model was trained to accurately predict the state of charge of the vanadium flow battery by analyzing the spectral characteristics.

🔗The paper will be freely available via this link for the first 50 days: https://authors.elsevier.com/c/1kSYB,rUrFxfAl

📕Journal of Energy Storage (IF=8.9)
#application

👍7🔥6❤4

1.8K views 11:21

Harnessing Large Language Models to Collect and Analyze Metal–Organic Framework Property Data Set

https://pubs.acs.org/doi/10.1021/jacs.4c11085

Utilizing a chain of advanced large language models (LLMs), we developed a systematic approach to extract and organize MOF data into a structured format.

Our methodology successfully compiled information from more than 40,000 research articles, creating a comprehensive and ready-to-use data set. Specifically, data regarding MOF synthesis conditions and properties were extracted from both tables and text and then analyzed. Subsequently, we utilized the curated database to analyze the relationships between synthesis conditions, properties, and structure.

📕Journal of the American Chemical Society (IF=14.4)
#dataset #method

This research focused on the efficient collection of experimental metal–organic framework (MOF) data from scientific literature to address the challenges of accessing hard-to-find data and improving the quality of information available for machine learning…

❤4🔥3👍2

845 views 11:33

Using Classifiers To Predict Catalyst Design for Polyketone Microstructure

https://pubs.acs.org/doi/10.1021/jacs.4c11666

We applied a classifier method to predict palladium catalysts for the formation of nonalternating polyketones via the copolymerization of CO and ethylene; current examples are limited to using phosphine sulfonate and diphosphazane monoxide supporting ligands.

With the reported workflow, we discovered two new classes of palladium complexes capable of achieving the synthesis of nonalternating polyketones with a lower CO content than those made by known palladium catalysts.

Our results show that we doubled the number of classes of palladium compounds that can catalyze the formation of this type of polymer. We envision that this methodology can be applied to accelerate catalyst discovery when selectivity is an important outcome.

📕Journal of the American Chemical Society (IF=14.4)
#method

👍4🔥3❤2

729 views 10:26