Publication:
Smoothing a tera-word language model

Publication Date

2008

Language

English

Abstract

Frequency counts from very large corpora, such as the Web 1T dataset, have recently become available for language modeling. Omitting low-frequency n-gram counts is a practical necessity for datasets of this size. Naive implementations of standard smoothing methods do not realize the full potential of such large datasets with missing counts. In this paper I present a new smoothing algorithm that combines the Dirichlet prior form of MacKay and Peto (1995) with the modified back-off estimates of Kneser and Ney (1995), leading to a 31% perplexity reduction on the Brown corpus compared to a baseline implementation of Kneser-Ney discounting.
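
The combination described in the abstract, interpolating observed counts with a Kneser-Ney style back-off distribution under a Dirichlet-prior form, can be illustrated with a toy bigram model. The Python sketch below is an assumption-laden simplification rather than the paper's estimator: alpha is a hypothetical hyperparameter, and the back-off distribution is the plain Kneser-Ney continuation probability rather than the paper's modified estimates for data with missing counts.

```python
from collections import defaultdict

def dirichlet_kneser_ney(bigram_counts, alpha=100.0):
    """Toy bigram model mixing a Dirichlet-prior interpolation
    (MacKay and Peto, 1995) with a Kneser-Ney continuation distribution
    (Kneser and Ney, 1995). A simplified sketch, not the paper's estimator;
    alpha is a hypothetical hyperparameter chosen for illustration."""
    context_totals = defaultdict(int)   # c(h): total count of history h
    continuation = defaultdict(int)     # N1+(. w): distinct histories preceding w
    for (h, w), c in bigram_counts.items():
        if c > 0:
            context_totals[h] += c
            continuation[w] += 1
    total_continuations = sum(continuation.values())

    def prob(w, h):
        # Kneser-Ney continuation probability used as the back-off distribution.
        p_cont = continuation[w] / total_continuations if total_continuations else 0.0
        c_hw = bigram_counts.get((h, w), 0)
        c_h = context_totals.get(h, 0)
        # Dirichlet-prior form: add alpha pseudo-counts spread according to p_cont.
        return (c_hw + alpha * p_cont) / (c_h + alpha)

    return prob

# Hypothetical usage with toy counts:
counts = {("the", "cat"): 3, ("the", "dog"): 1, ("a", "dog"): 2}
p = dirichlet_kneser_ney(counts, alpha=2.0)
print(p("dog", "the"))   # smoothed P(dog | the)
print(p("cat", "a"))     # unseen bigram still receives back-off mass
```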

Source:

ACL-08: HLT - 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

Publisher:

Association for Computational Linguistics

Subject

Computer engineering
