Publication:
To augment or not to augment? a comparative study on text augmentation techniques for low-resource NLP

dc.contributor.department: N/A
dc.contributor.kuauthor: Şahin, Gözde Gül
dc.contributor.kuprofile: Faculty Member
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.contributor.yokid: 366984
dc.date.accessioned: 2024-11-09T13:11:12Z
dc.date.issued: 2022
dc.description.abstract: Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counterattack this problem is text augmentation, that is, generating new synthetic training data points from existing data. Although NLP has recently witnessed several new textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families using various models, including the architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general rather than analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT, especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and model type (e.g., token-level augmentation provides significant improvements for BPE, while character-level ones give generally higher scores for char and mBERT based models).
dc.description.fulltext: YES
dc.description.indexedby: Scopus
dc.description.issue: 1
dc.description.openaccess: YES
dc.description.publisherscope: International
dc.description.sponsoredbyTubitakEu: N/A
dc.description.sponsorship: N/A
dc.description.version: Publisher version
dc.description.volume: 48
dc.format: pdf
dc.identifier.doi: 10.1162/COLI_a_00425
dc.identifier.embargo: NO
dc.identifier.filenameinventoryno: IR03691
dc.identifier.issn: 0891-2017
dc.identifier.link: https://doi.org/10.1162/COLI_a_00425
dc.identifier.quartile: Q1
dc.identifier.scopus: 2-s2.0-85128214382
dc.identifier.uri: https://hdl.handle.net/20.500.14288/2849
dc.keywords: Computational linguistics
dc.keywords: Deep neural networks
dc.keywords: Natural language processing systems
dc.keywords: Semantics
dc.language: English
dc.publisher: MIT Press
dc.relation.grantno: NA
dc.relation.uri: http://cdm21054.contentdm.oclc.org/cdm/ref/collection/IR/id/10549
dc.source: Computational Linguistics
dc.subject: Engineering
dc.title: To augment or not to augment? a comparative study on text augmentation techniques for low-resource NLP
dc.type: Journal Article
dspace.entity.type: Publication
local.contributor.authorid: 0000-0002-0332-1657
local.contributor.kuauthor: Şahin, Gözde Gül
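
Illustrative note: the abstract names character swapping as one of the character-level augmentations. The following is a minimal sketch of that idea for a tagged sentence, not the article's implementation. The function names, the choice to keep each token's first and last character, the minimum token length of 4, and the `prob` and `seed` parameters are all assumptions made for illustration. Because the token count does not change, the tag sequence can be reused as-is.

```python
import random


def swap_adjacent_chars(token, rng):
    """Swap one random pair of adjacent interior characters in a token."""
    # Assumption: the first and last characters stay in place, and tokens
    # shorter than 4 characters are returned unchanged.
    if len(token) < 4:
        return token
    # Pick i so that both i and i + 1 are interior positions.
    i = rng.randrange(1, len(token) - 2)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment_sentence(tokens, tags, prob=0.1, seed=0):
    """Return a noisy copy of `tokens` paired with an unchanged copy of `tags`."""
    # `prob` (hypothetical parameter) is the per-token chance of applying a swap.
    rng = random.Random(seed)
    new_tokens = [
        swap_adjacent_chars(tok, rng) if rng.random() < prob else tok
        for tok in tokens
    ]
    # One tag per token still holds, so the labels carry over directly.
    return new_tokens, list(tags)


if __name__ == "__main__":
    sentence = ["augmentation", "helps", "parsing"]
    labels = ["NOUN", "VERB", "NOUN"]
    print(augment_sentence(sentence, labels, prob=1.0, seed=0))
```

Token-level insertion or syntactic cropping would change the number of tokens, so a sketch of those would also need to realign the labels; that is left out here.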

Files

Original bundle

Name: 10549.pdf
Size: 716.66 KB
Format: Adobe Portable Document Format