GECTurk: grammatical error correction and detection dataset for Turkish

Publication:
GECTurk: grammatical error correction and detection dataset for Turkish

Files

Primary IR04599.pdf (1.12 MB)

Departments

Organizational Unit

Department of Computer Engineering

Organizational Unit

Graduate School of Sciences and Engineering

School / College / Institute

Organizational Unit

College of Engineering

Organizational Unit

GRADUATE SCHOOL OF SCIENCES AND ENGINEERING

Upper Org Unit

KU-Authors

Kara, Atakan

Şahin, Gözde Gül

Sofian, Farrin Marouf

Yong-Xern Bond, Andrew

Date

2023

Type

Conference Proceeding

Abstract

Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.

Publisher

Association for Computational Linguistics

Subject

Computer science, Artificial intelligence, Linguistics

Source

13TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING AND THE 3RD CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, IJCNLP-AACL 2023

URI

https://hdl.handle.net/20.500.14288/21887

Collections

Publications with Fulltext

Full item page

Publication: GECTurk: grammatical error correction and detection dataset for Turkish

Files

Departments

School / College / Institute

Program

KU-Authors

KU Authors

Co-Authors

Editor & Affiliation

Compiler & Affiliation

Translator

Other Contributor

Date

Language

Type

Embargo Status

Journal Title

Journal ISSN

Volume Title

Alternative Title

Abstract

Source

Publisher

Subject

Citation

Has Part

Source

Book Series Title

Edition

DOI

URI

item.page.datauri

Link

Rights

Copyrights Note

Collections

Endorsement

Review

Supplemented By

Referenced By

Related Goal

14

Views

16

Downloads

Publication:
GECTurk: grammatical error correction and detection dataset for Turkish