Publication:
Smoothing a tera-word language model

dc.contributor.department: Department of Computer Engineering
dc.contributor.facultymember: Yes
dc.contributor.kuauthor: Yüret, Deniz
dc.contributor.schoolcollegeinstitute: College of Engineering
dc.date.accessioned: 2024-11-09T23:46:40Z
dc.date.issued: 2008
dc.description.abstract: Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low-frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), leading to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.
dc.description.fulltext: No
dc.description.harvestedfrom: Manual
dc.description.indexedby: Scopus
dc.description.openaccess: YES
dc.description.peerreviewstatus: N/A
dc.description.publisherscope: International
dc.description.readpublish: N/A
dc.description.sponsoredbyTubitakEu: N/A
dc.description.sponsorship: Association for Computational Linguistics (ACL)
dc.description.sponsorship: North Am. Chapter of the Assoc. Comput. Linguistics
dc.description.version: N/A
dc.identifier.embargo: N/A
dc.identifier.isbn: 978-1-932432-04-6
dc.identifier.link: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84859927912&partnerID=40&md5=628cb5eae44eb8de2f5cede0f65c89f6
dc.identifier.quartile: To be checked
dc.identifier.scopus: 2-s2.0-84859927912
dc.identifier.uri: https://hdl.handle.net/20.500.14288/13977
dc.keywords: Data sets
dc.keywords: Dirichlet prior
dc.keywords: Frequency counts
dc.keywords: Language model
dc.keywords: Language modeling
dc.keywords: Large datasets
dc.keywords: Low frequency
dc.keywords: Smoothing algorithms
dc.keywords: Smoothing methods
dc.keywords: Software agents
dc.keywords: Computational linguistics
dc.language.iso: eng
dc.publisher: Association for Computational Linguistics
dc.relation.affiliation: Koç University
dc.relation.collection: Koç University Institutional Repository
dc.relation.ispartof: ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
dc.relation.openaccess: N/A
dc.rights: N/A
dc.subject: Computer engineering
dc.title: Smoothing a tera-word language model
dc.type: Conference Proceeding
dspace.entity.type: Publication
local.contributor.kuauthor: Yüret, Deniz
relation.isOrgUnitOfPublication: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery: 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery: 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
