To augment or not to augment? a comparative study on text augmentation techniques for low-resource NLP

dc.contributor.facultymember	Yes
dc.contributor.kuauthor	Şahin, Gözde Gül
dc.contributor.schoolcollegeinstitute	College of Engineering
dc.date.accessioned	2024-11-09T13:11:12Z
dc.date.issued	2022
dc.description.abstract	Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counterattack this problem is text augmentation, that is, generating new synthetic training data points from existing data. Although NLP has recently witnessed several new textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methodologies that perform changes on the syntax (e.g., cropping sub-sentences), token (e.g., random word insertion), and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families using various models, including the architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general rather than analytic languages such as Vietnamese. Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT, especially for dependency parsing. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair (e.g., syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages), and model type (e.g., token-level augmentation provides significant improvements for BPE, while character-level ones give generally higher scores for char and mBERT based models).
dc.description.fulltext	YES
dc.description.indexedby	Scopus
dc.description.issue	1
dc.description.openaccess	YES
dc.description.publisherscope	International
dc.description.sponsoredbyTubitakEu	N/A
dc.description.sponsorship	N/A
dc.description.version	Publisher version
dc.description.volume	48
dc.identifier.doi	10.1162/COLI_a_00425
dc.identifier.embargo	NO
dc.identifier.filenameinventoryno	IR03691
dc.identifier.issn	0891-2017
dc.identifier.quartile	Q1
dc.identifier.scopus	2-s2.0-85128214382
dc.identifier.uri	https://doi.org/10.1162/COLI_a_00425
dc.keywords	Computational linguistics
dc.keywords	Deep neural networks
dc.keywords	Natural language processing systems
dc.keywords	Semantics
dc.language.iso	eng
dc.publisher	MIT Press
dc.relation.grantno	NA
dc.relation.ispartof	Computational Linguistics
dc.relation.uri	http://cdm21054.contentdm.oclc.org/cdm/ref/collection/IR/id/10549
dc.subject	Engineering
dc.title	To augment or not to augment? a comparative study on text augmentation techniques for low-resource NLP
dc.type	Journal Article
dspace.entity.type	Publication
local.contributor.kuauthor	Şahin, Gözde Gül
relation.isParentOrgUnitOfPublication	8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery	8e756b23-2d4a-4ce8-b1b3-62c794a8c164

Files

Original bundle

Name: 10549.pdf
Size: 716.66 KB
Format: Adobe Portable Document Format

Collections

Publications with Fulltext
