Publication: Smoothing a tera-word language model
dc.contributor.department | Department of Computer Engineering | |
dc.contributor.kuauthor | Yüret, Deniz | |
dc.contributor.schoolcollegeinstitute | College of Engineering | |
dc.date.accessioned | 2024-11-09T23:46:40Z | |
dc.date.issued | 2008 | |
dc.description.abstract | Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), leading to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting. | |
dc.description.indexedby | Scopus | |
dc.description.openaccess | YES | |
dc.description.publisherscope | International | |
dc.description.sponsoredbyTubitakEu | N/A | |
dc.description.sponsorship | Association for Computational Linguistics (ACL) | |
dc.description.sponsorship | North Am. Chapter of the Assoc. Comput. Linguistics | |
dc.identifier.isbn | 978-1-932432-04-6 | |
dc.identifier.link | https://www.scopus.com/inward/record.uri?eid=2-s2.0-84859927912&partnerID=40&md5=628cb5eae44eb8de2f5cede0f65c89f6 | |
dc.identifier.quartile | N/A | |
dc.identifier.scopus | 2-s2.0-84859927912 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14288/13977 | |
dc.keywords | Data sets | |
dc.keywords | Dirichlet prior | |
dc.keywords | Frequency counts | |
dc.keywords | Language model | |
dc.keywords | Language modeling | |
dc.keywords | Large datasets | |
dc.keywords | Low frequency | |
dc.keywords | Smoothing algorithms | |
dc.keywords | Smoothing methods | |
dc.keywords | Software agents | |
dc.keywords | Computational linguistics | |
dc.language.iso | eng | |
dc.publisher | Association for Computational Linguistics | |
dc.relation.ispartof | ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference | |
dc.subject | Computer engineering | |
dc.title | Smoothing a tera-word language model | |
dc.type | Conference Proceeding | |
dspace.entity.type | Publication | |
local.contributor.kuauthor | Yüret, Deniz | |
local.publication.orgunit1 | College of Engineering | |
local.publication.orgunit2 | Department of Computer Engineering | |
relation.isOrgUnitOfPublication | 89352e43-bf09-4ef4-82f6-6f9d0174ebae | |
relation.isOrgUnitOfPublication.latestForDiscovery | 89352e43-bf09-4ef4-82f6-6f9d0174ebae | |
relation.isParentOrgUnitOfPublication | 8e756b23-2d4a-4ce8-b1b3-62c794a8c164 | |
relation.isParentOrgUnitOfPublication.latestForDiscovery | 8e756b23-2d4a-4ce8-b1b3-62c794a8c164 |