Publication:
Smoothing a tera-word language model

dc.contributor.department Department of Computer Engineering
dc.contributor.kuauthor Yüret, Deniz
dc.contributor.schoolcollegeinstitute College of Engineering
dc.date.accessioned 2024-11-09T23:46:40Z
dc.date.issued 2008
dc.description.abstract Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of (Mackay and Peto, 1995) with the modified back-off estimates of (Kneser and Ney, 1995) that leads to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.
dc.description.indexedby Scopus
dc.description.openaccess YES
dc.description.publisherscope International
dc.description.sponsoredbyTubitakEu N/A
dc.description.sponsorship Association for Computational Linguistics (ACL)
dc.description.sponsorship North Am. Chapter of the Assoc. Comput. Linguistics
dc.identifier.isbn 978-1-932432-04-6
dc.identifier.link https://www.scopus.com/inward/record.uri?eid=2-s2.0-84859927912&partnerID=40&md5=628cb5eae44eb8de2f5cede0f65c89f6
dc.identifier.quartile N/A
dc.identifier.scopus 2-s2.0-84859927912
dc.identifier.uri https://hdl.handle.net/20.500.14288/13977
dc.keywords Data sets
dc.keywords Dirichlet prior
dc.keywords Frequency counts
dc.keywords Language model
dc.keywords Language modeling
dc.keywords Large datasets
dc.keywords Low frequency
dc.keywords Smoothing algorithms
dc.keywords Smoothing methods
dc.keywords Software agents
dc.keywords Computational linguistics
dc.language.iso eng
dc.publisher Association for Computational Linguistics
dc.relation.ispartof ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
dc.subject Computer engineering
dc.title Smoothing a tera-word language model
dc.type Conference Proceeding
dspace.entity.type Publication
local.contributor.kuauthor Yüret, Deniz
local.publication.orgunit1 College of Engineering
local.publication.orgunit2 Department of Computer Engineering
relation.isOrgUnitOfPublication 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery 8e756b23-2d4a-4ce8-b1b3-62c794a8c164

Files