Publication: Unsupervised instance-based part of speech induction using probable substitutes
Language
English
Abstract
We develop an instance (token) based extension of the state-of-the-art word (type) based part-of-speech induction system introduced by Yatbaz et al. (2012). Each word instance is represented by a feature vector that combines information from the target word and probable substitutes sampled from an n-gram model representing its context. Modeling ambiguity using an instance-based model does not lead to significant gains in overall accuracy in part-of-speech tagging because most words in running text are used in their most frequent class (e.g. 93.69% in the Penn Treebank). However, it is important to model ambiguity because most frequent words are ambiguous and not modeling them correctly may negatively affect upstream tasks. Our main contribution is to show that an instance-based model can achieve significantly higher accuracy on ambiguous words at the cost of a slight degradation on unambiguous ones, maintaining a comparable overall accuracy. On the Penn Treebank, the overall many-to-one accuracy of the system is within 1% of the state of the art (80%), while on highly ambiguous words it is up to 70% better. In multilingual experiments our results are significantly better than or comparable to the best published word-based or instance-based systems on 15 out of 19 corpora in 15 languages. The vector representations for words used in our system are available for download for further experiments.
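The many-to-one accuracy reported in the abstract is conventionally computed by mapping each induced cluster to the gold tag it co-occurs with most often, then scoring tokens against that mapping. The following is a minimal sketch of that standard metric, not the authors' evaluation code; the function name `many_to_one_accuracy` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(cluster_ids, gold_tags):
    """Fraction of tokens whose cluster's majority gold tag matches their own gold tag.

    Illustrative helper (not from the paper): cluster_ids and gold_tags are
    parallel sequences with one entry per token.
    """
    if len(cluster_ids) != len(gold_tags):
        raise ValueError("cluster_ids and gold_tags must have the same length")
    if not gold_tags:
        return 0.0

    # Count how often each gold tag appears inside each induced cluster.
    tag_counts_by_cluster = defaultdict(Counter)
    for cluster, tag in zip(cluster_ids, gold_tags):
        tag_counts_by_cluster[cluster][tag] += 1

    # Mapping a cluster to its majority tag makes exactly that many tokens correct.
    correct = sum(max(counts.values()) for counts in tag_counts_by_cluster.values())
    return correct / len(gold_tags)


# Toy example: six tokens assigned to three induced clusters.
clusters = [0, 0, 0, 1, 1, 2]
gold = ["NN", "NN", "VB", "VB", "VB", "DT"]
score = many_to_one_accuracy(clusters, gold)
```

Because several clusters may map to the same gold tag, this score tends to rise as the number of induced clusters grows, so comparisons are usually made at a fixed cluster count.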
Source:
COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers
Publisher:
Association for Computational Linguistics (ACL)
Subject
Computer Science, Artificial intelligence