Publication:
Unsupervised instance-based part of speech induction using probable substitutes

Language

English

Abstract

We develop an instance (token) based extension of the state-of-the-art word (type) based part-of-speech induction system introduced by Yatbaz et al. (2012). Each word instance is represented by a feature vector that combines information from the target word and probable substitutes sampled from an n-gram model representing its context. Modeling ambiguity using an instance-based model does not lead to significant gains in overall accuracy in part-of-speech tagging because most words in running text are used in their most frequent class (e.g. 93.69% in the Penn Treebank). However, it is important to model ambiguity because most frequent words are ambiguous, and not modeling them correctly may negatively affect upstream tasks. Our main contribution is to show that an instance-based model can achieve significantly higher accuracy on ambiguous words at the cost of a slight degradation on unambiguous ones, maintaining a comparable overall accuracy. On the Penn Treebank, the overall many-to-one accuracy of the system is within 1% of the state of the art (80%), while on highly ambiguous words it is up to 70% better. In multilingual experiments, our results are significantly better than or comparable to the best published word-based or instance-based systems on 15 out of 19 corpora in 15 languages. The vector representations for words used in our system are available for download for further experiments.
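
The many-to-one accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definition of the metric: each induced cluster is mapped to the gold tag it co-occurs with most often, and accuracy is the fraction of tokens labeled correctly under that mapping. The function name and the toy data are hypothetical and are not taken from the paper or its released code.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(induced, gold):
    """Many-to-one accuracy of induced cluster ids against gold tags.

    `induced` and `gold` are parallel sequences with one entry per token.
    Hypothetical helper for illustration; not the authors' evaluation script.
    """
    if len(induced) != len(gold):
        raise ValueError("induced and gold must have the same length")
    if len(gold) == 0:
        raise ValueError("at least one token is required")

    # Count how often each gold tag occurs inside each induced cluster.
    tag_counts_by_cluster = defaultdict(Counter)
    for cluster, tag in zip(induced, gold):
        tag_counts_by_cluster[cluster][tag] += 1

    # Mapping each cluster to its most frequent gold tag makes exactly
    # that many tokens in the cluster count as correct.
    correct = sum(max(counts.values()) for counts in tag_counts_by_cluster.values())
    return correct / len(gold)


# Toy example (invented data): three clusters over six tokens.
toy_induced = [0, 0, 0, 1, 1, 2]
toy_gold = ["N", "N", "V", "V", "V", "N"]
toy_score = many_to_one_accuracy(toy_induced, toy_gold)
```

Because several clusters may map to the same gold tag, this score tends to rise as the number of induced clusters grows, so comparisons are only meaningful between systems that induce the same number of clusters.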

Source:

COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers

Publisher:

Association for Computational Linguistics (ACL)

Subject

Computer Science, Artificial intelligence
