Smoothing a tera-word language model


dc.contributor.department	Department of Computer Engineering
dc.contributor.facultymember	Yes
dc.contributor.kuauthor	Yüret, Deniz
dc.contributor.schoolcollegeinstitute	College of Engineering
dc.date.accessioned	2024-11-09T23:46:40Z
dc.date.issued	2008
dc.description.abstract	Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low-frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), leading to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.
dc.description.fulltext	No
dc.description.harvestedfrom	Manual
dc.description.indexedby	Scopus
dc.description.openaccess	YES
dc.description.peerreviewstatus	N/A
dc.description.publisherscope	International
dc.description.readpublish	N/A
dc.description.sponsoredbyTubitakEu	N/A
dc.description.sponsorship	Association for Computational Linguistics (ACL)
dc.description.sponsorship	North American Chapter of the Association for Computational Linguistics
dc.description.studentonlypublication	No
dc.description.studentpublication	No
dc.description.version	N/A
dc.identifier.WoSQuartile	To be checked
dc.identifier.embargo	N/A
dc.identifier.isbn	9781-9324-3204-6
dc.identifier.link	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84859927912&partnerID=40&md5=628cb5eae44eb8de2f5cede0f65c89f6
dc.identifier.scopus	2-s2.0-84859927912
dc.identifier.uri	https://hdl.handle.net/20.500.14288/13977
dc.keywords	Data sets
dc.keywords	Dirichlet prior
dc.keywords	Frequency counts
dc.keywords	Language model
dc.keywords	Language modeling
dc.keywords	Large datasets
dc.keywords	Low frequency
dc.keywords	Smoothing algorithms
dc.keywords	Smoothing methods
dc.keywords	Software agents
dc.keywords	Computational linguistics
dc.language.iso	eng
dc.publisher	Association for Computational Linguistics
dc.relation.affiliation	Koç University
dc.relation.collection	Koç University Institutional Repository
dc.relation.ispartof	ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference
dc.relation.openaccess	N/A
dc.rights	N/A
dc.subject	Computer engineering
dc.title	Smoothing a tera-word language model
dc.type	Conference Proceeding
dspace.entity.type	Publication
local.contributor.kuauthor	Yüret, Deniz
relation.isOrgUnitOfPublication	89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication.latestForDiscovery	89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication	8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication.latestForDiscovery	8e756b23-2d4a-4ce8-b1b3-62c794a8c164

Collections

Publications without Fulltext
