Publication:
Unsupervised instance-based part of speech induction using probable substitutes

dc.contributor.department Department of Computer Engineering
dc.contributor.department Graduate School of Sciences and Engineering
dc.contributor.department KUIS AI (Koç University & İş Bank Artificial Intelligence Center)
dc.contributor.kuauthor Yatbaz, Mehmet Ali
dc.contributor.kuauthor Yüret, Deniz
dc.contributor.schoolcollegeinstitute College of Engineering
dc.contributor.schoolcollegeinstitute GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
dc.contributor.schoolcollegeinstitute Research Center
dc.date.accessioned 2024-11-09T23:21:45Z
dc.date.issued 2014
dc.description.abstract We develop an instance (token) based extension of the state-of-the-art word (type) based part-of-speech induction system introduced in (Yatbaz et al., 2012). Each word instance is represented by a feature vector that combines information from the target word and probable substitutes sampled from an n-gram model representing its context. Modeling ambiguity using an instance-based model does not lead to significant gains in overall accuracy in part-of-speech tagging because most words in running text are used in their most frequent class (e.g. 93.69% in the Penn Treebank). However, it is important to model ambiguity because most frequent words are ambiguous and not modeling them correctly may negatively affect upstream tasks. Our main contribution is to show that an instance-based model can achieve significantly higher accuracy on ambiguous words at the cost of a slight degradation on unambiguous ones, maintaining a comparable overall accuracy. On the Penn Treebank, the overall many-to-one accuracy of the system is within 1% of the state-of-the-art (80%), while on highly ambiguous words it is up to 70% better. On multilingual experiments our results are significantly better than or comparable to the best published word or instance-based systems on 15 out of 19 corpora in 15 languages. The vector representations for words used in our system are available for download for further experiments.
dc.description.indexedby Scopus
dc.description.openaccess YES
dc.description.publisherscope International
dc.description.sponsoredbyTubitakEu N/A
dc.description.sponsorship Baidu
dc.description.sponsorship eBay
dc.description.sponsorship Google
dc.description.sponsorship Microsoft
dc.description.sponsorship Symantec
dc.identifier.isbn 9781-9416-4326-6
dc.identifier.link https://www.scopus.com/inward/record.uri?eid=2-s2.0-84959884159&partnerID=40&md5=56b6c16081b66947f8c5bdfe9dc3d7ca
dc.identifier.scopus 2-s2.0-84959884159
dc.identifier.uri https://hdl.handle.net/20.500.14288/10949
dc.keywords Forestry
dc.keywords Linguistics
dc.keywords Feature vectors
dc.keywords Induction system
dc.keywords N-gram modeling
dc.keywords Overall accuracies
dc.keywords Part of speech tagging
dc.keywords Part-of-speech inductions
dc.keywords State of the art
dc.keywords Vector representations
dc.keywords Computational linguistics
dc.language.iso eng
dc.publisher Association for Computational Linguistics (ACL)
dc.relation.ispartof COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers
dc.subject Computer Science
dc.subject Artificial intelligence
dc.title Unsupervised instance-based part of speech induction using probable substitutes
dc.type Conference Proceeding
dspace.entity.type Publication
local.contributor.kuauthor Yatbaz, Mehmet Ali
local.contributor.kuauthor Sert, Enis Rıfat
local.contributor.kuauthor Yüret, Deniz
local.publication.orgunit1 GRADUATE SCHOOL OF SCIENCES AND ENGINEERING
local.publication.orgunit1 College of Engineering
local.publication.orgunit1 Research Center
local.publication.orgunit2 Department of Computer Engineering
local.publication.orgunit2 KUIS AI (Koç University & İş Bank Artificial Intelligence Center)
local.publication.orgunit2 Graduate School of Sciences and Engineering
relation.isOrgUnitOfPublication 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isOrgUnitOfPublication 3fc31c89-e803-4eb1-af6b-6258bc42c3d8
relation.isOrgUnitOfPublication 77d67233-829b-4c3a-a28f-bd97ab5c12c7
relation.isOrgUnitOfPublication.latestForDiscovery 89352e43-bf09-4ef4-82f6-6f9d0174ebae
relation.isParentOrgUnitOfPublication 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
relation.isParentOrgUnitOfPublication 434c9663-2b11-4e66-9399-c863e2ebae43
relation.isParentOrgUnitOfPublication d437580f-9309-4ecb-864a-4af58309d287
relation.isParentOrgUnitOfPublication.latestForDiscovery 8e756b23-2d4a-4ce8-b1b3-62c794a8c164
