Random sampling in corpus design: cross-context generalizability in automated multicountry protest event collection

Files

9422.pdf (398.37 KB)

Departments

Department of Sociology

School / College / Institute

College of Social Sciences and Humanities

Publication Date

2021

Type

Journal Article

Embargo Status

NO

Abstract

What is the best way to create a gold standard corpus for training a machine learning system designed to collect protest information automatically in a cross-country context? We show that building a gold standard corpus for training and testing machine learning models from randomly chosen news articles in news archives yields better performance than selecting articles by keyword filtering, the most prevalent method currently used in automated event coding. We advance this new bottom-up approach to ensure generalizability and reliability in cross-country comparative protest event collection from international and local news across different countries, languages, sources, and time periods, which entails a large variety of event types, actors, and targets. We present the results of comparing our random-sample approach with keyword filtering. We show that machine learning algorithms, particularly state-of-the-art deep learning tools, perform much better when trained on a gold standard corpus built from a randomly selected set of news articles from China, India, and South Africa. Finally, we present our approach to overcoming the major ethical issues intrinsic to protest event coding.

Publisher

Sage

Subject

Psychology, Social science

Source

American Behavioral Scientist

DOI

10.1177/00027642211021630

URI

https://doi.org/10.1177/00027642211021630

Collections

Publications with Fulltext

