Publication:
GECTurk: grammatical error correction and detection dataset for Turkish

dc.contributor.department: Department of Computer Engineering
dc.contributor.kuauthor: Kara, Atakan
dc.contributor.kuauthor: Sofian, Farrin Marouf
dc.contributor.kuauthor: Yong-Xern Bond, Andrew
dc.contributor.kuauthor: Şahin, Gözde Gül
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.schoolcollegeinstitute: Graduate School of Sciences and Engineering
dc.date.accessioned: 2024-12-29T09:36:00Z
dc.date.issued: 2023
dc.description.abstract: Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
dc.description.indexedby: WoS
dc.description.indexedby: Scopus
dc.description.publisherscope: International
dc.description.sponsoredbyTubitakEu: TÜBİTAK
dc.description.sponsors: This work has been supported by the Scientific and Technological Research Council of Turkiye (TUBITAK) as part of the project "Automatic Learning of Procedural Language from Natural Language Instructions for Intelligent Assistance" with the number 121C132. We also gratefully acknowledge KUIS AI Lab for providing computational support. We thank our anonymous reviewers and the members of GGLab who helped us improve this paper.
dc.identifier.isbn: 979-8-89176-018-9
dc.identifier.quartile: N/A
dc.identifier.scopus: 2-s2.0-85188536949
dc.identifier.uri: https://hdl.handle.net/20.500.14288/21887
dc.identifier.wos: 1221037500024
dc.keywords: Computational linguistics
dc.keywords: Error correction
dc.keywords: Error detection
dc.keywords: Knowledge management
dc.keywords: Metadata
dc.keywords: Natural language processing systems
dc.language: en
dc.publisher: Association for Computational Linguistics
dc.source: 13TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING AND THE 3RD CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, IJCNLP-AACL 2023
dc.subject: Computer science
dc.subject: Artificial intelligence
dc.subject: Linguistics
dc.title: GECTurk: grammatical error correction and detection dataset for Turkish
dc.type: Conference proceeding
dspace.entity.type: Publication
local.contributor.kuauthor: Kara, Atakan
local.contributor.kuauthor: Sofian, Farrin Marouf
local.contributor.kuauthor: Yong-Xern Bond, Andrew
local.contributor.kuauthor: Şahin, Gözde Gül
relation.isOrgUnitOfPublication: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
