Identifying Dynamic Metal–Ligand Coordination Modes with Ensemble Learning
https://pubs.acs.org/doi/10.1021/jacs.5c17169
👉🏻Web interface enabling no-code prediction of ligand coordination modes: https://molsimplify.mit.edu/pydentate.html
📕 Journal of the American Chemical Society (IF=15.6)
#method
In this work, we curate data sets of hemilabile and nonhemilabile ligands from experimentally characterized structures in the Cambridge Structural Database, analyze trends in observed coordination modes, and introduce four exhaustive and mutually exclusive types of hemilability.
Using these labeled data sets, we train graph neural networks to carry out classification of hemilabile ligands with high accuracy, precision, and recall and develop an ensemble algorithm that predicts primary and alternative chemically plausible coordination modes from SMILES strings in an end-to-end fashion. We demonstrate the utility of our algorithm by generating novel TMCs in predicted coordination modes and calculating the corresponding energy difference due to changes in coordination (i.e., ΔEc) with density functional theory.
Comparing our novel TMCs in multiple poses against an energetic criterion from experimentally observed TMCs confirms the plausibility of our alternative poses. We anticipate that our open-source workflows will accelerate organometallic discovery in experimental and virtual screening campaigns by proposing realistic metal–ligand coordination.
Knowledge of how a ligand coordinates a metal is essential for mechanistic and data-driven studies of transition metal complexes (TMCs), but most analyses assume a single binding interaction for a given metal–ligand pair. In catalysis, many ligands engage…
❤6👍4🔥4🤡1
2501.09223v2.pdf
2.6 MB
Foundations of Large Language Models
https://arxiv.org/abs/2501.09223
Today I'd like to step away from chemistry and share a recent 250+ page book on LLMs:
This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into five main chapters, each exploring a key area: pre-training, generative models, prompting, alignment, and inference. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.
🔥10❤3👍3
oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning 🔥
https://arxiv.org/abs/2510.07731
#benchmark
We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings.
Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity.
We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model.
👍7🔥5❤4
Computer vision for high-throughput materials synthesis: a tutorial for experimentalists🔥
https://doi.org/10.1039/D5DD00384A
📕 Digital Discovery (IF=6.2)
#method
Here, we aim to fill that identified gap and present a structured tutorial for experimentalists to integrate computer vision into high-throughput materials research, providing a detailed roadmap from data collection to model validation.
Specifically, we describe the hardware and software stack required for deploying CV in materials characterization, including image acquisition, annotation strategies, model training, and performance evaluation.
As a case study, we demonstrate the implementation of a CV workflow within a high-throughput materials synthesis and characterization platform to investigate the crystallization of metal–organic frameworks (MOFs). By outlining key challenges and best practices, this tutorial aims to equip chemists and materials scientists with the necessary tools to harness CV for accelerating materials discovery.
Advances in high-throughput instrumentation and laboratory automation are revolutionizing materials synthesis by enabling the rapid generation of large libraries of novel materials. However, efficient characterization of these synthetic libraries remains…
🔥3❤2👍2
Machine Learning for Green Solvents: Assessment, Selection and Substitution 🔥
https://doi.org/10.1002/advs.202516851
Download GreenSolventDB: https://github.com/Ramprasad-Group/green_solvents/tree/main
📕 Advanced Science (IF = 14.1)
#dataset
A data-driven pipeline is presented for assessing the sustainability of solvents and identifying greener substitutes. Three models are trained and evaluated on the GlaxoSmithKline Solvent Sustainability Guide (GSK SSG) to predict “greenness” metrics: a traditional Gaussian Process Regression (GPR) model, a fine-tuned GPT model (FT GPT), and a GPT model using in-context learning (ICL). GPR slightly outperforms the language-based GPT models and is therefore used to evaluate 10,189 solvents, forming GreenSolventDB, the largest public database of green solvent metrics.
These predictions are combined with Hansen solubility parameter-based metrics to identify greener solvents with solubility behavior similar to hazardous solvents. This approach is validated through case studies on benzene and diethyl ether, with predicted alternatives aligning well with known greener substitutes.
Building on this success, novel alternatives are proposed for the hazardous solvents listed in the GSK SSG. This framework for quantifying solvent sustainability and identifying greener substitutes is expected to significantly accelerate the discovery and adoption of environmentally-friendly solvents.
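For readers unfamiliar with Hansen solubility parameters, the similarity check mentioned above is commonly expressed as the Hansen distance Ra between two solvents. The sketch below is not the paper's pipeline: it shows only the standard Ra formula plus a hypothetical `rank_substitutes` helper. The parameter triples are approximate literature values and the greenness scores are invented for illustration; none of them are GreenSolventDB entries.

```python
import math

def hansen_distance(a, b):
    """Standard Hansen distance Ra between two (dD, dP, dH) triples in MPa^0.5."""
    d_d, d_p, d_h = (x - y for x, y in zip(a, b))
    return math.sqrt(4 * d_d**2 + d_p**2 + d_h**2)

def rank_substitutes(target, candidates, min_greenness):
    """Hypothetical helper: drop candidates below a greenness cutoff,
    then sort the rest by Hansen distance to the target solvent."""
    kept = [
        (name, hansen_distance(target, hsp))
        for name, (hsp, greenness) in candidates.items()
        if greenness >= min_greenness
    ]
    return sorted(kept, key=lambda item: item[1])

# Approximate literature HSP values; greenness scores are made up.
benzene = (18.4, 0.0, 2.0)
candidates = {
    "toluene": ((18.0, 1.4, 2.0), 5.0),
    "ethanol": ((15.8, 8.8, 19.4), 8.0),
    "chloroform": ((17.8, 3.1, 5.7), 2.0),
}

ranking = rank_substitutes(benzene, candidates, min_greenness=4.0)
for name, ra in ranking:
    print(f"{name}: Ra = {ra:.2f}")
```

A small Ra means similar solubility behavior, so the first entry that passes the greenness cutoff is the natural substitute candidate in this toy setup.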
👍4🔥3❤2
QSAR Prediction of BBB Permeability Based on Machine Learning upon PETBD: A Novel Data Set of PET Tracers
https://pubs.acs.org/doi/10.1021/acs.jmedchem.5c01791
Download PETBD: https://github.com/GDUT-Computer-Medical-Science-Team/PETBD-QSAR/tree/main/dataset_PETBD
📕 Journal of Medicinal Chemistry (IF = 6.8)
#dataset
Assessing small-molecule blood–brain barrier permeability is laborious, yet critical in drug development. Quantitative prediction models are hindered by a lack of high-quality data sets.
To address this, we curated PETBD, a novel data set of drug concentrations for 1056 positron emission tomography tracers across 14 organs at 60 min post injection, as well as in vivo metadata. We developed machine learning models to predict the brain-to-blood concentration ratio (log BB), and for the first time, drug concentration in the brain.
An extreme gradient boosting model reached the best performance in predicting Cbrain (R² = 0.700) and also achieved state-of-the-art log BB prediction (R² = 0.770). Feature importance analysis was employed to explain the contributions of physicochemical features. The model’s superior generalizability was validated against the B3DB benchmark and with unpublished PET tracers.
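As a reminder of what the prediction target means, log BB is the base-10 logarithm of the brain-to-blood concentration ratio. The snippet below is a minimal sketch of that definition only. The record layout and the concentration values are hypothetical and do not reflect the actual PETBD schema.

```python
import math

def log_bb(c_brain, c_blood):
    """log BB = log10(C_brain / C_blood); both concentrations in the same units."""
    if c_brain <= 0 or c_blood <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_brain / c_blood)

# Hypothetical tracer record: concentrations in arbitrary but identical units.
record = {"brain": 2.0, "blood": 0.5}
value = log_bb(record["brain"], record["blood"])
print(f"log BB = {value:.3f}")
```

Positive values mean the tracer is more concentrated in the brain than in the blood, and negative values mean the opposite.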
🔥4❤3👍3
Explainable artificial intelligence for molecular design in pharmaceutical research🔥
https://doi.org/10.1039/D5SC08461J
📕 Chemical Science (IF=7.5)
#review
In this Perspective, we examine current challenges and opportunities for explainable AI (XAI) in molecular design and evaluate the benefits of incorporating domain-specific knowledge into XAI approaches for model refinement, experimental design, and hypothesis testing. In this context, we also discuss the current limitations in evaluating results from chemical language models that are increasingly used in molecular design and drug discovery.
The rise of artificial intelligence (AI) has taken machine learning (ML) in molecular design to a new level. As ML increasingly relies on complex deep learning frameworks, the inability to understand predictions of black-box models has become a topical issue.…
🔥4❤3👍3
Collective intelligence for AI-assisted chemical synthesis
https://www.nature.com/articles/s41586-026-10131-4
📕 Nature (IF = 48.5)
#method
Here we introduce MOSAIC (Multiple Optimized Specialists for AI-assisted Chemical Prediction), a computational framework that enables chemists to harness the collective knowledge of millions of reaction protocols. MOSAIC is built upon the Llama-3.1-8B-instruct architecture, training 2,498 specialized chemical experts within Voronoi-clustered spaces.
This approach delivers reproducible and executable experimental protocols with confidence metrics for complex syntheses. With an overall 71% success rate, experimental validation demonstrates the realization of over 35 novel compounds, spanning pharmaceuticals, materials, agrochemicals, and cosmetics. Notably, MOSAIC also enables the discovery of new reaction methodologies that are absent from the experts’ training, a cornerstone for advancing chemical synthesis.
❤5👍5🔥5
Synthetic Applicability Domain (SynAD): Navigating Chemical Space for Reliable AI-Driven Reaction Prediction
https://doi.org/10.1002/anie.202523874
📕 Angewandte Chemie (IF=17.0)
#method
Organic synthetic chemistry has undergone a paradigm shift driven by breakthroughs in artificial intelligence (AI). Data-driven methods help accelerate hypothesis evaluation and reduce experimental trial-and-error efforts. However, their practical utility is constrained by the out-of-distribution (OOD) issue, where predictions usually fail when extrapolating to unseen reactions with new catalysts, substrates, or conditions.
Here, we introduce SynAD (synthetic applicability domain), a machine learning framework for assessing the predictive capability of AI models trained with existing data. SynAD combines descriptors with model-adaptive distance metrics to automatically demarcate reliable and unreliable reactions. Validated on the Ullmann Ligand Dataset (ULD, >5000 reactions), SynAD a priori distinguishes predictable chemical space, resulting in a prediction accuracy of R² = 0.90 (at 12.3% coverage) from a baseline of R² = −0.21. This capacity to target reliable chemical space is consistently observed across 6 additional datasets. We also provide a SynAD score to quantify reaction class predictability, guiding experimental focus on OOD spaces. By defining model limits, SynAD provides a critical guardrail for chemists to trust AI, allocate resources strategically, and accelerate de novo discovery.
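The trade-off between coverage and accuracy quoted above can be illustrated with a much simpler distance-based applicability domain. The sketch below is not SynAD. It assumes plain Euclidean distance to the nearest training point and a hypothetical fixed cutoff, then reports coverage (the fraction of test points kept) and R² on the retained subset. All descriptors and yields are toy numbers.

```python
import math

def nearest_distance(x, training):
    """Euclidean distance from x to its closest training point."""
    return min(math.dist(x, t) for t in training)

def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def in_domain_indices(test_x, training, cutoff):
    """Indices of test points whose nearest training neighbor is within cutoff."""
    return [i for i, x in enumerate(test_x)
            if nearest_distance(x, training) <= cutoff]

# Toy descriptors and yields: the last two test points lie far from the training data.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
test_x = [(0.1, 0.1), (0.9, 0.1), (0.1, 1.1), (5.0, 5.0), (6.0, 4.0)]
y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_pred = [11.0, 19.0, 31.0, 5.0, 90.0]

kept = in_domain_indices(test_x, train_x, cutoff=0.5)
coverage = len(kept) / len(test_x)
r2_all = r2_score(y_true, y_pred)
r2_kept = r2_score([y_true[i] for i in kept], [y_pred[i] for i in kept])
print(f"coverage = {coverage:.0%}, R2 all = {r2_all:.2f}, R2 in-domain = {r2_kept:.2f}")
```

In this toy case the baseline R² is negative because of the two far-away points, while the in-domain subset scores close to 1. The paper reports the same qualitative pattern, though with a learned, model-adaptive metric rather than a fixed cutoff.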
🔥5❤4👍3
Augmenting Large Language Models for Automated Discovery of F-Element Extractants
https://pubs.acs.org/doi/10.1021/jacs.5c19738
📕 Journal of the American Chemical Society (IF=15.6)
#method
Here, we present a quasi-autonomous AI-enabled workflow for the design and computational screening of selective extractant ligands. Molecular design is guided by SAFE-MolGen, a large language model-based agentic system that leverages curated extraction data to propose new ligands and preliminarily rank their performance using a supervised machine learning model trained on experimental data sets to consider the impact of realistic experimental conditions.
Promising human-approved ligands are then passed to a second automated pipeline that constructs three-dimensional metal–ligand complexes and performs quantum mechanical free energy calculations to directly assess the metal selectivity.
We demonstrate this approach for Am(III)/Eu(III) separations and report several newly designed ligands predicted to exhibit higher Am(III)/Eu(III) selectivity than the benchmark extractant CyMe4BTBP.
Efficient separation of f-elements is a critical challenge for a wide range of emerging technologies. The chemical similarity among these elements makes the development of selective solvent extraction reagents both slow and difficult. Here, we present a quasi…
👍4🔥4❤3
Inverse Molecular Design for the Discovery of Organic Energy Transfer Photocatalysts: Bridging Global and Local Chemical Space Exploration
https://doi.org/10.1021/jacs.5c20087
📕 Journal of the American Chemical Society (IF=15.6)
#method
The discovery of new organic photocatalysts (PCs) for energy transfer (EnT) catalysis remains a significant challenge, largely due to the vast and underexplored chemical space and the delicate balance of the photocatalytic properties. While transition-metal catalysts are effective, their high cost and environmental impact necessitate the development of metal-free alternatives.
In this work, we present a hybrid inverse molecular design strategy that combines global exploration with targeted local optimization to discover highly efficient organic PCs. Our approach leverages a generative model, guided by machine learning predictions and semiempirical simulations, to efficiently navigate chemical space and identify promising molecular scaffolds. We demonstrate the utility of this strategy by rediscovering known PCs and, more importantly, exploring uncharted structural regions, leading to the identification of novel candidates with favorable photophysical properties. A subsequent local exploration stage, using quantum mechanical calculations, allows refinement of the properties as well as control of the synthetic complexity.
The practical applicability of the approach is demonstrated by performing a local exploration of one of the identified scaffolds and successfully synthesizing four candidate PCs. We showcase their catalytic aptitude in three different EnT-mediated reactions, including a challenging aza-photocycloaddition, where one of our designed PCs achieved 90% yield, a performance comparable to a state-of-the-art iridium-based catalyst. This study highlights the power of a data-driven inverse design framework to bridge computational discovery and experimental validation, accelerating the identification of novel PCs and expanding the scope of EnT catalysis.
🔥4❤3👍3
Chem ML/AI/Datasets
🎊 Our solubility dataset has been published in Scientific Data (IF=6.9)📕 BigSolDB 2.0, dataset of solubility values for organic compounds in different solvents at various temperatures https://www.nature.com/articles/s41597-025-05559-8 After the preprint came out…
We have released BigSolDB v2.1, a new update of our open solubility dataset.
What's new in v2.1:
➕ 8521 new solubility measurements
➕ 77 new solutes
➕ 92 new literature sources
🛠 Fixed several errors in previously released data (incorrect SMILES, missing CAS numbers, etc.)
This update further expands the chemical space covered and improves data quality.
The dataset is available on Zenodo: https://doi.org/10.5281/zenodo.18552681
👍16🔥9❤7
Digitized dataset of aqueous dissociation constants🔥
https://chemrxiv.org/doi/full/10.26434/chemrxiv-2026-6khcw
Download dataset: https://doi.org/10.5281/zenodo.7236452
ChemRxiv
#dataset
In this work, we release the IUPAC Digitized pKa Dataset, a digital version of a critically-assessed collection of data compiled up to 1970. The dataset includes metadata such as temperature, measurement method, assessed reliability of data, and chemical identifiers such as SMILES and InChI strings.
The dataset spans 24,222 entries across 10,564 unique molecules, making it the largest FAIR open-source dataset publicly available for aqueous pKa data. Herein, we detail the data digitization and checking process, and assess the informational space spanned by the data.
The acid dissociation constant (pKa) quantifies the acidity of a compound, which is crucial for applications including drug design, environmental fate studies, and chemical synthesis. However, high-quality open-source digital pKa datasets are scarce, ...
❤3👍3🔥3
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry🔥
https://arxiv.org/abs/2512.01274
The dataset of the benchmark is available at this link: https://huggingface.co/datasets/ZehuaZhao/SUPERChem
#benchmark
We introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy.
Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence.
Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with...
❤5🔥5👍4
Automated DFT-Machine Learning Integration Enables Data-Efficient and Generalizable Feasibility Predictions in Metallaphotoredox sp2–sp3 Cross-Coupling Reactions
https://doi.org/10.1021/acscatal.5c07857
📕 ACS Catalysis (IF=13.1)
#method
Nickel/photoredox catalysis in cross-coupling reactions offers mild operating conditions for efficient C–C bond formation, expanding synthetic access to pharmaceutically relevant molecules. However, routine implementation of such reactions remains constrained by the intricate reaction mechanism and limited availability of experimental data, which complicate optimization tasks and the development of predictive models. The integration of quantum-mechanical (QM) calculations with machine learning (ML) has proven to be effective for developing predictive models of complex reactions with sparse experimental data.
Here, we present a combined approach that integrates automated density functional theory calculations, ML, and parallel synthesis to develop quantum mechanics-machine learning (QM-ML) models for the nickel metallaphotoredox cross-coupling reaction feasibility prediction. Random-Forest classification models are trained to predict the outcome of a given reaction using DFT-computed descriptors from automatically generated 3D structures of catalytic cycle intermediates. We demonstrate the broad applicability of this approach, applying it to a diverse data set encompassing four reaction subtypes, namely, bromide cross-electrophile couplings, chloride cross-electrophile couplings, deoxygenative couplings, and amino radical transfer (ART) couplings, augmented with additional experiments curated by a systematic cheminformatics method to broaden the alkyl halide scope. We show on a blind literature data set that such a QM-ML approach can successfully predict the feasibility of complex reactions from heterogeneous data sets with minimal data requirements and can generalize to unseen reaction subtypes with a few-shot learning approach, affording a computational model for ART coupling. Together, these capabilities provide a data-efficient solution for rapidly predicting the outcome of cross-coupling reactions and facilitate the adoption of nickel photocatalysis in the MAKE stage of the DMTA cycle.
❤4👍4🔥4
Hybrid Computational Strategy for Predicting Complex Ligand–Metal Architectures
https://doi.org/10.1002/anie.202524655
📕 Angewandte Chemie (IF=17.0)
#method
Understanding how metals coordinate to organic ligands is a precondition for the rational design of metal complexes and catalysts. Whereas certain types of ligands are capable of just one easy-to-predict coordination modality, others may present tens and sometimes even hundreds of coordination options (mono-, bi-, or polydentate), and predicting the correct one may be a challenge even to seasoned chemists.
The current paper describes a “hybrid” computational approach in which a machine learning (ML) algorithm learns to predict complex coordination patterns using knowledge-based “rules” derived from the Cambridge Structural Database (CSD). This model is applicable to a broad scope of ligands (including hemilabile and haptic ones as well as those with denticity > 6) and different metals at different oxidation states. The algorithm's code is disclosed and can be readily deployed in RDKit via our RDMetallics Python wrapper. It is also deployed as a publicly accessible web portal for demonstration and use.
An accurate “hybrid” model—combining elements of knowledge-based and Machine Learning approaches—is designed to predict metal-ligand coordination modes in a metal-aware fashion and for structurally c...
👍5❤4🔥3🤡1
Molecular LEGION: incalculably large coverage of chemical space around the NLRP3 target🔥
https://www.nature.com/articles/s41597-026-06850-y
📕 Scientific Data (IF=6.9)
#dataset
Here, we present a unique dataset containing approximately 110 M molecular structures of potential NLRP3 inhibitors enabled by the LEGION (Latent Enumeration, Generation, Integration, Optimization, and Navigation) workflow, which integrates generative AI, AI-guided screening within the Chemistry42 platform and auxiliary cheminformatics tools to enable large-scale exploration of chemical space around specific drug targets.
Using the structural data of NLRP3 co-crystals, a clinically relevant target, LEGION combined ligand- and structure-based design strategies, in-house algorithms for 3D pharmacophore-aware scaffold extraction, and distinct library enumeration methods to identify over 34,000 unique scaffolds, which can be multiplied into a dataset of 123B molecular structures within the provided code.
The resulting dataset of unprecedented size proved effective for scaffold hopping, chemical space navigation, and supporting intellectual property applications by generating structurally diverse and synthetically accessible structures.
🔥8❤4👍4
Today I was surprised to discover that Claude lets you connect a ChEMBL MCP server and search the ChEMBL database directly from the Claude interface.
To do this, just go to Settings → Connectors and add the server.
After that, Claude can see tools for searching compounds, targets, bioactivity, and mechanisms of action.
https://claude.ai/settings/connectors
👍10🔥10❤7
Chem ML/AI/Datasets
Today I was surprised to discover that Claude lets you connect a ChEMBL MCP server and search the ChEMBL database directly from the Claude interface. To do this, just go to Settings → Connectors and add the server. After that, Claude can see…
The most interesting part is that you can connect your own custom database or dataset. The first thing I tried was connecting BigSolDB v2.1.
Naturally, Claude wrote the MCP code itself for me :) A couple of minutes of deployment and everything works.
After that, Claude sees all the tools and can search by name, SMILES, or CAS number, filter by solvent, and plot graphs right in the chat based on real data.
You can connect it in Settings → Connectors by pasting the URL: https://mcp-bigsoldb-production.up.railway.app/sse
🔥12👍8❤6
Chem ML/AI/Datasets
The most interesting part is that you can connect your own custom database or dataset. The first thing I tried was connecting BigSolDB v2.1. Naturally, Claude wrote the MCP code itself for me :) A couple of minutes of deployment and everything works. After that, Claude sees all the tools and can…
Also today, our solubility benchmark for LLMs, which we have been building for the past 2 months, came out on ChemRxiv:
Can LLMs Reason About Solubility? The SoluBench Benchmark for Pure and Mixed Solvent Systems: https://doi.org/10.26434/chemrxiv.15000632/v1
It contains 4 tasks, with 9806 questions in total, evaluated on 20+ different LLMs:
Task 1: in which of two solvents does a compound dissolve better?
Task 2: choose the best solvent from several options
Task 3: will solubility increase or decrease after adding a second solvent?
Task 4: which of two compounds dissolves better in a given solvent?
What we found:
1) Frontier models handle the solvent tasks fairly well: Gemini 3 Flash scores 90.6% / 66.2% / 86.4% on tasks 1–3
2) Among open models, Qwen3.5 397B and GLM-5/4.7 were a pleasant surprise, and Kimi K2.5 is not bad either
3) Task 4 turned out to be genuinely hard without reasoning
4) Large molecules (>500 Da) consistently drag accuracy down for all models
5) DMSO and DMF are the most troublesome solvents: models tend to assume they dissolve everything
🤗 Dataset: huggingface.co/datasets/levakrasnov/SoluBench
💻 Code: github.com/levakrasnovs/SoluBench
The full text will be in the comments, along with the figures as separate files.
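To make the task format concrete, here is a rough sketch of how a Task 1 style question (which of two solvents dissolves a compound better) could be built from solubility records and scored. This is not the SoluBench code: the record layout, the `make_pairwise_question` and `accuracy` helpers, and the solubility numbers are all hypothetical.

```python
def make_pairwise_question(compound, solubility_by_solvent, solvent_a, solvent_b):
    """Build one two-option question; the answer is the solvent with higher solubility."""
    s_a = solubility_by_solvent[solvent_a]
    s_b = solubility_by_solvent[solvent_b]
    if s_a == s_b:
        raise ValueError("tied solubilities give no unambiguous answer")
    return {
        "compound": compound,
        "options": (solvent_a, solvent_b),
        "answer": solvent_a if s_a > s_b else solvent_b,
    }

def accuracy(questions, predictions):
    """Fraction of questions where the predicted solvent matches the answer."""
    correct = sum(q["answer"] == p for q, p in zip(questions, predictions))
    return correct / len(questions)

# Made-up mole-fraction solubilities for a single hypothetical compound.
solubility = {"ethanol": 0.012, "water": 0.0004, "DMSO": 0.15, "hexane": 0.0001}

questions = [
    make_pairwise_question("compound_X", solubility, "ethanol", "water"),
    make_pairwise_question("compound_X", solubility, "hexane", "DMSO"),
    make_pairwise_question("compound_X", solubility, "water", "hexane"),
]
model_answers = ["ethanol", "DMSO", "hexane"]  # pretend LLM output
print(f"accuracy = {accuracy(questions, model_answers):.1%}")
```

Under the same assumptions, Task 4 would be built the same way with the roles swapped: fix the solvent and compare two compounds.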
🔥12👍7❤5🏆2