Establishing standards for human-annotated samples applied in supervised machine learning: Evidence from a Monte Carlo simulation


Automated content analyses have become a popular tool in communication science. While standard procedures for manual content analysis were established decades ago, it remains an open question whether these standards suffice when human-annotated data are used to train supervised machine learning models. Scholars typically follow a two-stage procedure to obtain high prediction accuracy: manual content analysis, followed by model training with the human-annotated samples. We argue that losses in prediction accuracy accumulate across these two stages. In a Monte Carlo simulation, we tested (1) human coder errors (random, individual systematic, joint systematic) and (2) curation strategies for human-annotated datasets (one coder per document, majority rule, full agreement) as two sequential sources of accuracy loss in automated content analysis. Coder agreement established before the manual content analysis remains an important quality criterion for automated content analyses. A Krippendorff’s alpha of at least 0.8 is desirable to achieve satisfactory prediction results after machine learning. Systematic errors (individual and joint) must be avoided at all costs. The best training samples were obtained with one coder per document or with the majority rule curation strategy. Ultimately, this paper can help researchers produce trustworthy predictions when combining manual coding and machine learning.
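
The three curation strategies named above can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the function names (simulate_coders, one_coder, majority_rule, full_agreement) are hypothetical, and the setup is an assumption made for illustration, namely binary labels, three coders, and purely random errors in which each coder independently flips the true label with a fixed probability. Systematic errors are not modeled here.

```python
import random
from collections import Counter


def simulate_coders(true_labels, n_coders=3, error_rate=0.1, seed=0):
    """Return one list of codes per document (assumed random-error model).

    Each coder independently flips the binary true label with
    probability `error_rate`.
    """
    rng = random.Random(seed)
    return [
        [1 - y if rng.random() < error_rate else y for _ in range(n_coders)]
        for y in true_labels
    ]


def one_coder(codes, seed=0):
    """One coder per document: keep a single randomly chosen code."""
    rng = random.Random(seed)
    return [rng.choice(doc) for doc in codes]


def majority_rule(codes):
    """Majority rule: keep the most frequent code per document.

    With binary labels and an odd number of coders there are no ties.
    """
    return [Counter(doc).most_common(1)[0][0] for doc in codes]


def full_agreement(codes):
    """Full agreement: keep a document only if all coders agree.

    Documents without full agreement are marked None (dropped from training).
    """
    return [doc[0] if len(set(doc)) == 1 else None for doc in codes]


if __name__ == "__main__":
    truth = [1, 0, 1, 1, 0, 0, 1, 0]
    codes = simulate_coders(truth, n_coders=3, error_rate=0.2, seed=42)
    print("one coder:     ", one_coder(codes))
    print("majority rule: ", majority_rule(codes))
    print("full agreement:", full_agreement(codes))
```

In this sketch, full agreement shrinks the training sample because disputed documents are dropped, whereas the other two strategies keep every document. Repeating such runs over many seeds and error rates is the general idea behind a Monte Carlo comparison.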

Studies in Communication and Media