GECTurk: grammatical error correction and detection dataset for Turkish

dc.contributor.department	Department of Computer Engineering
dc.contributor.department	Graduate School of Sciences and Engineering
dc.contributor.facultymember	Yes
dc.contributor.kuauthor	Kara, Atakan
dc.contributor.kuauthor	Şahin, Gözde Gül
dc.contributor.kuauthor	Sofian, Farrin Marouf
dc.contributor.kuauthor	Yong-Xern Bond, Andrew
dc.contributor.schoolcollegeinstitute	College of Engineering
dc.contributor.schoolcollegeinstitute	Graduate School of Sciences and Engineering
dc.date.accessioned	2024-12-29T09:36:00Z
dc.date.issued	2023
dc.description.abstract	Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
dc.description.indexedby	WOS
dc.description.indexedby	Scopus
dc.description.publisherscope	International
dc.description.sponsoredbyTubitakEu	TÜBİTAK
dc.description.sponsorship	This work has been supported by the Scientific and Technological Research Council of Turkiye (TUBITAK) as part of the project "Automatic Learning of Procedural Language from Natural Language Instructions for Intelligent Assistance" with the number 121C132. We also gratefully acknowledge KUIS AI Lab for providing computational support. We thank our anonymous reviewers and the members of GGLab who helped us improve this paper.
dc.description.studentonlypublication	No
dc.description.studentpublication	Yes
dc.identifier.isbn	979-8-89176-018-9
dc.identifier.quartile	N/A
dc.identifier.scopus	2-s2.0-85188536949
dc.identifier.uri	https://hdl.handle.net/20.500.14288/21887
dc.identifier.wos	1221037500024
dc.keywords	Computational linguistics
dc.keywords	Error correction
dc.keywords	Error detection
dc.keywords	Knowledge management
dc.keywords	Metadata
dc.keywords	Natural language processing systems
dc.language.iso	eng
dc.publisher	Association for Computational Linguistics
dc.relation.ispartof	13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP-AACL 2023
dc.subject	Computer science
dc.subject	Artificial intelligence
dc.subject	Linguistics
dc.title	GECTurk: grammatical error correction and detection dataset for Turkish
dc.type	Conference Proceeding
dspace.entity.type	Publication
local.contributor.kuauthor	Kara, Atakan
local.contributor.kuauthor	Sofian, Farrin Marouf
local.contributor.kuauthor	Yong-Xern Bond, Andrew
local.contributor.kuauthor	Şahin, Gözde Gül
relation.isOrgUnitOfPublication	89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication	3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication.latestForDiscovery	89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication	8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication	434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication.latestForDiscovery	8e756b23-2d4a-4ce8-b1b3-62c794a8c164

Files

Original bundle

Name: IR04599.pdf
Size: 1.12 MB
Format: Adobe Portable Document Format

Collections

Publications with Fulltext