Publication:
Assessment of fine-tuned large language models for real-world chemistry and material science applications

dc.contributor.coauthor Van Herck, Joren
dc.contributor.coauthor Gil, Maria Victoria
dc.contributor.coauthor Jablonka, Kevin Maik
dc.contributor.coauthor Abrudan, Alex
dc.contributor.coauthor Anker, Andy S.
dc.contributor.coauthor Asgari, Mehrdad
dc.contributor.coauthor Blaiszik, Ben
dc.contributor.coauthor Buffo, Antonio
dc.contributor.coauthor Choudhury, Leander
dc.contributor.coauthor Corminboeuf, Clemence
dc.contributor.coauthor Daglar, Hilal
dc.contributor.coauthor Elahi, Amir Mohammad
dc.contributor.coauthor Foster, Ian T.
dc.contributor.coauthor Garcia, Susana
dc.contributor.coauthor Garvin, Matthew
dc.contributor.coauthor Godin, Guillaume
dc.contributor.coauthor Good, Lydia L.
dc.contributor.coauthor Gu, Jianan
dc.contributor.coauthor Xiao Hu, Noemie
dc.contributor.coauthor Jin, Xin
dc.contributor.coauthor Junkers, Tanja
dc.contributor.coauthor Keskin, Seda
dc.contributor.coauthor Knowles, Tuomas P. J.
dc.contributor.coauthor Laplaza, Ruben
dc.contributor.coauthor Lessona, Michele
dc.contributor.coauthor Majumdar, Sauradeep
dc.contributor.coauthor Mashhadimoslem, Hossein
dc.contributor.coauthor Mcintosh, Ruaraidh D.
dc.contributor.coauthor Moosavi, Seyed Mohamad
dc.contributor.coauthor Mourino, Beatriz
dc.contributor.coauthor Nerli, Francesca
dc.contributor.coauthor Pevida, Covadonga
dc.contributor.coauthor Poudineh, Neda
dc.contributor.coauthor Rajabi-Kochi, Mahyar
dc.contributor.coauthor Saar, Kadi L.
dc.contributor.coauthor Hooriabad Saboor, Fahimeh
dc.contributor.coauthor Sagharichiha, Morteza
dc.contributor.coauthor Schmidt, K. J.
dc.contributor.coauthor Shi, Jiale
dc.contributor.coauthor Simone, Elena
dc.contributor.coauthor Svatunek, Dennis
dc.contributor.coauthor Taddei, Marco
dc.contributor.coauthor Tetko, Igor
dc.contributor.coauthor Tolnai, Domonkos
dc.contributor.coauthor Vahdatifar, Sahar
dc.contributor.coauthor Whitmer, Jonathan
dc.contributor.coauthor Wieland, D. C. Florian
dc.contributor.coauthor Willumeit-Roemer, Regine
dc.contributor.coauthor Zuttel, Andreas
dc.contributor.coauthor Smit, Berend
dc.contributor.department Department of Chemical and Biological Engineering
dc.contributor.department Graduate School of Sciences and Engineering
dc.contributor.kuauthor Keskin, Seda
dc.contributor.kuauthor Harman, Hilal Dağlar
dc.contributor.schoolcollegeinstitute College of Engineering
dc.contributor.schoolcollegeinstitute GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.date.accessioned 2025-03-06T20:57:45Z
dc.date.issued 2024
dc.description.abstract The current generation of large language models (LLMs) has limited chemical knowledge. Recently, it has been shown that these LLMs can learn and predict chemical properties through fine-tuning. Using natural language to train machine learning models opens doors to a wider chemical audience, as field-specific featurization techniques can be omitted. In this work, we explore the potential and limitations of this approach. We studied the performance of fine-tuning three open-source LLMs (GPT-J-6B, Llama-3.1-8B, and Mistral-7B) for a range of different chemical questions. We benchmark their performances against "traditional" machine learning models and find that, in most cases, the fine-tuning approach is superior for a simple classification problem. Depending on the size of the dataset and the type of questions, we also successfully address more sophisticated problems. The most important conclusions of this work are that, for all datasets considered, their conversion into an LLM fine-tuning training set is straightforward and that fine-tuning with even relatively small datasets leads to predictive models. These results suggest that the systematic use of LLMs to guide experiments and simulations will be a powerful technique in any research study, significantly reducing unnecessary experiments or computations.
dc.description.indexedby WOS
dc.description.indexedby Scopus
dc.description.indexedby PubMed
dc.description.publisherscope International
dc.description.sponsoredbyTubitakEu EU
dc.description.sponsorship The research of J. V. H. and B. S. is supported by the Swiss Science Foundation through a Project Funding (214872) and Advanced Grant (216165). M. V. G. and C. P. gratefully acknowledge financial support from the Spanish Agencia Estatal de Investigacion (AEI) through Grants TED2021-131693B-I00 (M. V. G. and C. P.) and CNS2022-135474 (M. V. G.), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. M. V. G. acknowledges support from the Spanish National Research Council (CSIC) through Programme for internationalization i-LINK 2023 (Project ILINK23047). M. V. G. acknowledges the access granted by the Galician Supercomputing Center (CESGA) to the FinisTerrae III supercomputer, funded by the Spanish Ministry of Science and Innovation, the Galician Government, and the European Regional Development Fund (ERDF), and the access granted by the CSIC to the Drago supercomputer. Parts of the work of K. M. J. were supported by the Carl Zeiss Foundation. S. G., M. G., N. P., B. S., and J. V. H. are partly supported by the USorb-DAC Project through a grant from The Grantham Foundation for the Protection of the Environment to RMI's climate tech accelerator program, Third Derivative. The work of A. S. A. is supported by Novo Nordisk Foundation grant NNF23OC0081359. The work of S. M. M. and M. R. K. is partly supported by grant number DSI-CGY3R1P16 from the Data Sciences Institute at the University of Toronto. M. A. expresses gratitude to the European Research Council (ERC) for evaluating the project with the reference number 101106377 titled "CLARIFIER" and accepting it for funding under the HORIZON TMA MSCA Postdoctoral Fellowships - European Fellowships. Furthermore, M. A. acknowledges the funding by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee (EP/Y023447/1; organization reference: 101106377). L. L. G., A. A., and T. P. J. K. gratefully acknowledge funding from the European Research Council under the European Union's Horizon 2020 research and innovation program through the ERC grant DiProPhys (agreement ID 101001615) (L. L. G., A. A., T. P. J. K.); the National Institutes of Health Oxford-Cambridge Scholars Program (L. L. G.); the Cambridge Trust's Cambridge International Scholarship (L. L. G.); the Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases at the National Institutes of Health (L. L. G.); the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013; T. P. J. K.); and the Frances and Augustus Newman Foundation (T. P. J. K.). K. L. S. acknowledges funding from the Schmidt Science Fellowship in partnership with the Rhodes Trust and from St. John's College Research Fellowship programme. F. N. and M. T. thank the Italian MUR for provision of funding through the PRIN 2020 Project doMino (ref 2020P9KBKZ). The research of N. X. H. was supported by the NCCR MARVEL, a National Centre of Competence in Research, funded by the Swiss National Science Foundation (grant number 205602). B. M. acknowledges the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 945363. E. S. acknowledges the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 949229, CryForm).
dc.identifier.doi 10.1039/d4sc04401k
dc.identifier.eissn 2041-6539
dc.identifier.grantno Swiss Science Foundation;Spanish Agencia Estatal de Investigacion (AEI) - MICIU/AEI;European Union NextGenerationEU/PRTR;Spanish National Research Council (CSIC) [ILINK23047];Spanish Ministry of Science and Innovation;Galician Government;European Regional Development Fund (ERDF);Carl Zeiss Foundation;USorb-DAC Project through Grantham Foundation for the Protection of the Environment;Novo Nordisk Foundation [NNF23OC0081359];Data Sciences Institute at the University of Toronto [DSI-CGY3R1P16];European Research Council (ERC) [101106377];UK Research and Innovation (UKRI) under the UK government's Horizon Europe [101106377, EP/Y023447/1];European Research Council under the European Union's Horizon 2020 research and innovation program through the ERC grant DiProPhys [101001615];National Institutes of Health Oxford-Cambridge Scholars Program;Cambridge Trust's Cambridge International Scholarship;Intramural Research Program of the National Institute of Diabetes and Digestive and Kidney Diseases at the National Institutes of Health;European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013);Frances and Augustus Newman Foundation;Schmidt Science Fellowship;Rhodes Trust;St. John's College Research Fellowship programme;Italian MUR [2020P9KBKZ];NCCR MARVEL, a National Centre of Competence in Research - Swiss National Science Foundation [205602];European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant [945363];European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme [949229];[214872];[216165];[TED2021-131693B-I00];[CNS2022-135474]
dc.identifier.issn 2041-6520
dc.identifier.issue 2
dc.identifier.quartile Q1
dc.identifier.scopus 2-s2.0-85212108442
dc.identifier.uri https://doi.org/10.1039/d4sc04401k
dc.identifier.uri https://hdl.handle.net/20.500.14288/27303
dc.identifier.volume 16
dc.identifier.wos 1373013200001
dc.keywords Large language models
dc.keywords Chemical properties
dc.keywords Fine-tuning
dc.keywords Machine learning
dc.keywords Chemical questions
dc.keywords GPT-J-6B
dc.keywords Llama-3.1-8B
dc.keywords Mistral-7B
dc.keywords Traditional machine learning
dc.keywords Classification problem
dc.keywords Dataset size
dc.keywords Predictive models
dc.keywords Research study
dc.keywords Experiments
dc.keywords Simulations
dc.language.iso eng
dc.publisher The Royal Society of Chemistry
dc.relation.ispartof CHEMICAL SCIENCE
dc.subject Chemistry
dc.title Assessment of fine-tuned large language models for real-world chemistry and material science applications
dc.type Journal Article
dspace.entity.type Publication
local.publication.orgunit1 College of Engineering
local.publication.orgunit1 GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
local.publication.orgunit2 Department of Chemical and Biological Engineering
local.publication.orgunit2 Graduate School of Sciences and Engineering
relation.isOrgUnitOfPublication c747a256-6e0c-4969-b1bf-3b9f2f674289
relation.isOrgUnitOfPublication 3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication.latestForDiscovery c747a256-6e0c-4969-b1bf-3b9f2f674289
relation.isParentOrgUnitOfPublication 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication 434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication.latestForDiscovery 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
