PARADISE: Evaluating implicit planning skills of language models with procedural warnings and tips dataset

dc.contributor.coauthor	Arda Uzunoglu
dc.contributor.department	Department of Computer Engineering
dc.contributor.department	KUIS AI (Koç University & İş Bank Artificial Intelligence Center)
dc.contributor.department	Graduate School of Sciences and Engineering
dc.contributor.kuauthor	Safa, Abdalfatah Rashid
dc.contributor.kuauthor	Şahin, Gözde Gül
dc.contributor.schoolcollegeinstitute	College of Engineering
dc.contributor.schoolcollegeinstitute	GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.schoolcollegeinstitute	Research Center
dc.date.accessioned	2025-03-06T21:00:08Z
dc.date.issued	2024
dc.description.abstract	Recently, there has been growing interest within the community regarding whether large language models are capable of planning or executing plans. However, most prior studies use LLMs to generate high-level plans for simplified scenarios lacking linguistic complexity and domain diversity, limiting analysis of their planning abilities. These setups constrain evaluation methods (e.g., predefined action space), architectural choices (e.g., only generative models), and overlook the linguistic nuances essential for realistic analysis. To tackle this, we present PARADISE, an abductive reasoning task using Q&A format on practical procedural text sourced from wikiHow. It involves warning and tip inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal. Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios. Despite advancements, all models fall short of human performance. Notably, our analysis uncovers intriguing insights, such as variations in model behavior with dropped keywords, struggles of BERT-family and GPT-4 with physical and abstract goals, and the proposed tasks offering valuable prior knowledge for other unseen procedural tasks.
dc.description.indexedby	Scopus
dc.description.publisherscope	International
dc.description.sponsoredbyTubitakEu	TÜBİTAK
dc.description.sponsorship	This work has been supported by the Scientific and Technological Research Council of Türkiye (TÜBITAK) as part of the project “Automatic Learning of Procedural Language from Natural Language Instructions for Intelligent Assistance” with the number 121C132. We also gratefully acknowledge KUIS AI Lab for providing computational support. We thank our anonymous reviewers and the members of GGLab who helped us improve this paper. We especially thank Aysha Gurbanova, Sebnem Demirtas, and Mahmut Ibrahim Deniz for their contributions to evaluating human performance on warning and tip inference tasks.
dc.identifier.grantno	Türkiye Bilimsel ve Teknolojik Araştırma Kurumu, TÜBİTAK; 121C132
dc.identifier.isbn	9798891760998
dc.identifier.issn	0736-587X
dc.identifier.quartile	N/A
dc.identifier.scopus	2-s2.0-85205316972
dc.identifier.uri	https://hdl.handle.net/20.500.14288/27851
dc.keywords	Language models
dc.keywords	Implicit planning
dc.keywords	Procedural warnings
dc.keywords	Tips dataset
dc.keywords	Machine learning
dc.keywords	Natural language processing
dc.keywords	AI decision-making
dc.keywords	Task planning
dc.keywords	Large language models
dc.keywords	Algorithmic reasoning
dc.keywords	Model evaluation
dc.keywords	AI safety
dc.language.iso	eng
dc.publisher	Association for Computational Linguistics (ACL)
dc.relation.ispartof	Proceedings of the Annual Meeting of the Association for Computational Linguistics
dc.subject	Computer science, information systems
dc.subject	Computer science, theory and methods
dc.title	PARADISE: Evaluating implicit planning skills of language models with procedural warnings and tips dataset
dc.type	Conference Proceeding
dspace.entity.type	Publication
local.contributor.kuauthor	Safa, Abdalfatah Rashid
local.contributor.kuauthor	Şahin, Gözde Gül
local.publication.orgunit1	GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
local.publication.orgunit1	College of Engineering
local.publication.orgunit1	Research Center
local.publication.orgunit2	Department of Computer Engineering
local.publication.orgunit2	KUIS AI (Koç University & İş Bank Artificial Intelligence Center)
local.publication.orgunit2	Graduate School of Sciences and Engineering
person.familyName	Safa
person.familyName	Şahin
person.givenName	Abdalfatah Rashid
person.givenName	Gözde Gül
relation.isOrgUnitOfPublication	89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication	77d67233-829b-4c3a-a28f-bd97ab5c12c7
relation.isOrgUnitOfPublication	3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication.latestForDiscovery	89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication	8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication	434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication	d437580f-9309-4ecb-864a-4af58309d287
relation.isParentOrgUnitOfPublication.latestForDiscovery	8e756b23-2d4a-4ce8-b1b3-62c794a8c164

Files

Original bundle

Name: IR05719.pdf
Size: 5.11 MB
Format: Adobe Portable Document Format

Collections

Publications with Fulltext
