Publication:
PARADISE: Evaluating implicit planning skills of language models with procedural warnings and tips dataset

dc.contributor.coauthor: Arda Uzunoglu
dc.contributor.department: Department of Computer Engineering
dc.contributor.department: KUIS AI (Koç University & İş Bank Artificial Intelligence Center)
dc.contributor.department: Graduate School of Sciences and Engineering
dc.contributor.kuauthor: Safa, Abdalfatah Rashid
dc.contributor.kuauthor: Şahin, Gözde Gül
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.schoolcollegeinstitute: GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.schoolcollegeinstitute: Research Center
dc.date.accessioned: 2025-03-06T21:00:08Z
dc.date.issued: 2024
dc.description.abstract: Recently, there has been growing interest within the community regarding whether large language models are capable of planning or executing plans. However, most prior studies use LLMs to generate high-level plans for simplified scenarios lacking linguistic complexity and domain diversity, limiting analysis of their planning abilities. These setups constrain evaluation methods (e.g., predefined action space), architectural choices (e.g., only generative models), and overlook the linguistic nuances essential for realistic analysis. To tackle this, we present PARADISE, an abductive reasoning task using Q&A format on practical procedural text sourced from wikiHow. It involves warning and tip inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal. Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios. Despite advancements, all models fall short of human performance. Notably, our analysis uncovers intriguing insights, such as variations in model behavior with dropped keywords, struggles of BERT-family and GPT-4 with physical and abstract goals, and the proposed tasks offering valuable prior knowledge for other unseen procedural tasks.
dc.description.indexedby: Scopus
dc.description.publisherscope: International
dc.description.sponsoredbyTubitakEu: TÜBİTAK
dc.description.sponsorship: This work has been supported by the Scientific and Technological Research Council of Türkiye (TÜBITAK) as part of the project “Automatic Learning of Procedural Language from Natural Language Instructions for Intelligent Assistance” with the number 121C132. We also gratefully acknowledge KUIS AI Lab for providing computational support. We thank our anonymous reviewers and the members of GGLab who helped us improve this paper. We especially thank Aysha Gurbanova, Sebnem Demirtas, and Mahmut Ibrahim Deniz for their contributions to evaluating human performance on warning and tip inference tasks.
dc.identifier.grantno: Türkiye Bilimsel ve Teknolojik Araştırma Kurumu, TÜBİTAK; 121C132
dc.identifier.isbn: 9798891760998
dc.identifier.issn: 0736-587X
dc.identifier.quartile: N/A
dc.identifier.scopus: 2-s2.0-85205316972
dc.identifier.uri: https://hdl.handle.net/20.500.14288/27851
dc.keywords: Language models
dc.keywords: Implicit planning
dc.keywords: Procedural warnings
dc.keywords: Tips dataset
dc.keywords: Machine learning
dc.keywords: Natural language processing
dc.keywords: AI decision-making
dc.keywords: Task planning
dc.keywords: Large language models
dc.keywords: Algorithmic reasoning
dc.keywords: Model evaluation
dc.keywords: AI safety
dc.language.iso: eng
dc.publisher: Association for Computational Linguistics (ACL)
dc.relation.ispartof: Proceedings of the Annual Meeting of the Association for Computational Linguistics
dc.subject: Computer science, information systems
dc.subject: Computer science, theory and methods
dc.title: PARADISE: Evaluating implicit planning skills of language models with procedural warnings and tips dataset
dc.type: Conference Proceeding
dspace.entity.type: Publication
local.contributor.kuauthor: Safa, Abdalfatah Rashid
local.contributor.kuauthor: Şahin, Gözde Gül
local.publication.orgunit1: GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
local.publication.orgunit1: College of Engineering
local.publication.orgunit1: Research Center
local.publication.orgunit2: Department of Computer Engineering
local.publication.orgunit2: KUIS AI (Koç University & İş Bank Artificial Intelligence Center)
local.publication.orgunit2: Graduate School of Sciences and Engineering
relation.isOrgUnitOfPublication: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication: 77d67233-829b-4c3a-a28f-bd97ab5c12c7
relation.isOrgUnitOfPublication: 3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication.latestForDiscovery: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication: 434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication: d437580f-9309-4ecb-864a-4af58309d287
relation.isParentOrgUnitOfPublication.latestForDiscovery: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164

Files

Original bundle

Name: IR05719.pdf
Size: 5.11 MB
Format: Adobe Portable Document Format