Publication:
Random sampling in corpus design: cross-context generalizability in automated multicountry protest event collection

dc.contributor.department: Department of Sociology
dc.contributor.kuauthor: Yörük, Erdem
dc.contributor.kuauthor: Hürriyetoğlu, Ali
dc.contributor.kuauthor: Duruşan, Fırat
dc.contributor.kuauthor: Yoltar, Çağrı
dc.contributor.kuprofile: Faculty Member
dc.contributor.kuprofile: Teaching Faculty
dc.contributor.kuprofile: Researcher
dc.contributor.other: Department of Sociology
dc.contributor.schoolcollegeinstitute: College of Social Sciences and Humanities
dc.contributor.yokid: 28982
dc.contributor.yokid: N/A
dc.contributor.yokid: N/A
dc.contributor.yokid: N/A
dc.date.accessioned: 2024-11-09T12:26:39Z
dc.date.issued: 2021
dc.description.abstract: What is the most optimal way of creating a gold standard corpus for training a machine learning system that is designed for automatically collecting protest information in a cross-country context? We show that creating a gold standard corpus for training and testing machine learning models on the basis of randomly chosen news articles from news archives yields better performance than selecting news articles on the basis of keyword filtering, which is the most prevalent method currently used in automated event coding. We advance this new bottom-up approach to ensure generalizability and reliability in cross-country comparative protest event collection from international and local news in different countries, languages, sources and time periods, which entails a large variety of event types, actors, and targets. We present the results of comparing our random-sample approach with keyword filtering. We show that the machine learning algorithms, and particularly state-of-the-art deep learning tools, perform much better when they are trained with the gold standard corpus from a randomly selected set of news articles from China, India, and South Africa. Finally, we also present our approach to overcome the major ethical issues that are intrinsic to protest event coding.
dc.description.fulltext: YES
dc.description.indexedby: WoS
dc.description.indexedby: Scopus
dc.description.issue: 5
dc.description.openaccess: YES
dc.description.publisherscope: International
dc.description.sponsoredbyTubitakEu: EU
dc.description.sponsorship: European Union (EU)
dc.description.sponsorship: Horizon 2020
dc.description.sponsorship: European Research Council (ERC)
dc.description.sponsorship: Starting Grant
dc.description.sponsorship: Emerging Welfare
dc.description.version: Author's final manuscript
dc.description.volume: 66
dc.format: pdf
dc.identifier.doi: 10.1177/00027642211021630
dc.identifier.eissn: 1552-3381
dc.identifier.embargo: NO
dc.identifier.filenameinventoryno: IR02858
dc.identifier.issn: 0002-7642
dc.identifier.link: https://doi.org/10.1177/00027642211021630
dc.identifier.quartile: Q1
dc.identifier.scopus: 2-s2.0-85107397581
dc.identifier.uri: https://hdl.handle.net/20.500.14288/1699
dc.identifier.wos: 663403100001
dc.keywords: Natural language processing
dc.keywords: Machine learning
dc.keywords: Protests
dc.keywords: Contentious politics
dc.keywords: Event data extraction
dc.keywords: Language resources
dc.language: English
dc.publisher: Sage
dc.relation.grantno: 714868
dc.relation.uri: http://cdm21054.contentdm.oclc.org/cdm/ref/collection/IR/id/9422
dc.source: American Behavioral Scientist
dc.subject: Psychology
dc.subject: Social science
dc.title: Random sampling in corpus design: cross-context generalizability in automated multicountry protest event collection
dc.type: Journal Article
dspace.entity.type: Publication
local.contributor.authorid: 0000-0002-4882-0812
local.contributor.authorid: N/A
local.contributor.authorid: N/A
local.contributor.authorid: N/A
local.contributor.kuauthor: Yörük, Erdem
local.contributor.kuauthor: Hürriyetoğlu, Ali
local.contributor.kuauthor: Duruşan, Fırat
local.contributor.kuauthor: Yoltar, Çağrı
relation.isOrgUnitOfPublication: 10f5be47-fab1-42a1-af66-1642ba4aff8e
relation.isOrgUnitOfPublication.latestForDiscovery: 10f5be47-fab1-42a1-af66-1642ba4aff8e

Files

Original bundle

Name: 9422.pdf
Size: 398.37 KB
Format: Adobe Portable Document Format