Publication:
Optimizing instance selection for statistical machine translation with feature decay algorithms

dc.contributor.departmentDepartment of Computer Engineering
dc.contributor.departmentGraduate School of Sciences and Engineering
dc.contributor.kuauthorYüret, Deniz
dc.contributor.schoolcollegeinstituteCollege of Engineering
dc.contributor.schoolcollegeinstituteGRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.date.accessioned2024-11-10T00:06:41Z
dc.date.issued2015
dc.description.abstractWe introduce FDA5 for efficient parameterization, optimization, and implementation of feature decay algorithms (FDA), a class of instance selection algorithms that use feature decay. FDA increase the diversity of the selected training set by devaluing features (i.e., n-grams) that have already been included. FDA5 decides which instances to select based on three functions, controlled by five parameters, that initialize feature values, decay feature values, and scale sentence scores. We present optimization techniques that allow FDA5 to adapt these functions to in-domain and out-of-domain translation tasks for different language pairs. In a transductive learning setting, selection of training instances relevant to the test set can improve the final translation quality. In machine translation experiments performed on the 2 million sentence English-German section of the Europarl corpus, we show that a subset of the training set selected by FDA5 can gain up to 3.22 BLEU points compared to a randomly selected subset of the same size, can gain up to 0.41 BLEU points over using all of the available training data while using only 15% of it, and can come within 0.5 BLEU points of the full training set result by using only 2.7% of the full training data. FDA5 peaks at around 8M words or 15% of the full training set. In an active learning setting, FDA5 minimizes the human effort by identifying the most informative sentences for translation, and FDA gains up to 0.45 BLEU points using 3/5 of the available training data compared to using all of it and 1.12 BLEU points compared to a random training set. In translation tasks involving English and Turkish, a morphologically rich language, FDA5 can gain up to 11.52 BLEU points compared to a randomly selected subset of the same size, can achieve the same BLEU score using as little as 4% of the data compared to random instance selection, and can exceed the full dataset result by 0.78 BLEU points.
With 1M words, FDA5 can cut the time to build a statistical machine translation system to about half, using only 3% of the phrase table space and 8% of the overall space of a baseline system trained on all of the available training data, while staying within 0.58 BLEU points of that baseline in out-of-domain translation.
dc.description.indexedbyWOS
dc.description.indexedbyScopus
dc.description.issue2
dc.description.openaccessNO
dc.description.publisherscopeInternational
dc.description.sponsoredbyTubitakEuN/A
dc.description.volume23
dc.identifier.doi10.1109/TASLP.2014.2381882
dc.identifier.eissn2329-9304
dc.identifier.issn2329-9290
dc.identifier.quartileN/A
dc.identifier.scopus2-s2.0-84921646370
dc.identifier.urihttps://doi.org/10.1109/TASLP.2014.2381882
dc.identifier.urihttps://hdl.handle.net/20.500.14288/16655
dc.identifier.wos348210300010
dc.keywordsDomain adaptation
dc.keywordsInformation retrieval
dc.keywordsInstance selection
dc.keywordsMachine translation
dc.keywordsTransductive learning
dc.language.isoeng
dc.publisherIEEE-Inst Electrical Electronics Engineers Inc
dc.relation.ispartofIEEE-ACM Transactions on Audio Speech and Language Processing
dc.subjectAcoustics
dc.subjectElectrical electronics engineering
dc.titleOptimizing instance selection for statistical machine translation with feature decay algorithms
dc.typeJournal Article
dspace.entity.typePublication
local.contributor.kuauthorBiçici, Ergun
local.contributor.kuauthorYüret, Deniz
local.publication.orgunit1GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
local.publication.orgunit1College of Engineering
local.publication.orgunit2Department of Computer Engineering
local.publication.orgunit2Graduate School of Sciences and Engineering
relation.isOrgUnitOfPublication89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication.latestForDiscovery89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication.latestForDiscovery8e756b23-2d4a-4ce8-b1b3-62c794a8c164
