Publication: Smoothing a tera-word language model
Program
KU-Authors
Co-Authors
Publication Date
Language
Embargo Status
Journal Title
Journal ISSN
Volume Title
Alternative Title
Abstract
Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omission of low-frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), leading to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.
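For reference, the two smoothing forms named in the abstract can be written in their standard textbook notation. This is a sketch of the generic formulas from the cited papers, not the specific combination proposed in this work; here c(h,w) denotes an n-gram count, h' the shortened history, \alpha the Dirichlet concentration parameter, D the absolute discount, and N_{1+}(h,\cdot) the number of distinct words observed after history h.

% Dirichlet prior smoothing (MacKay and Peto, 1995): the lower-order
% distribution serves as a prior with concentration parameter \alpha.
\[
  P_{\mathrm{DP}}(w \mid h) = \frac{c(h,w) + \alpha\, P(w \mid h')}{c(h) + \alpha}
\]

% Interpolated Kneser-Ney (Kneser and Ney, 1995): counts are discounted by D
% and the reserved mass backs off to the lower-order estimate.
\[
  P_{\mathrm{KN}}(w \mid h) = \frac{\max\bigl(c(h,w) - D,\, 0\bigr)}{c(h)}
  + \frac{D \, N_{1+}(h,\cdot)}{c(h)}\, P_{\mathrm{KN}}(w \mid h')
\]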
Publisher
Association for Computational Linguistics
Subject
Computer engineering
Citation
Has Part
Source
ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference