According to researchers at the University of Southern California, male characters are more prevalent in literature than female characters.
The researchers used artificial intelligence tools to reach their findings, which were published in a new study in the journal Data in Brief.
“With the advent of advanced tools in Artificial Intelligence (AI) and Natural Language Processing (NLP), there is an opportunity to use computational and digital tools to analyze corpora, such as copyright-expired literature in the pre-modern period (defined herein as books published approximately between 1800 and 1950) from the Project Gutenberg corpus,” the authors explained in their study.
“We present a dataset and materials that illustrate how modern processes in NLP can be used on the raw text of more than 3,000 literary texts in Project Gutenberg to (i) extract characters and pronouns from the text with high quality, (ii) disambiguate characters so that they are not overcounted, (iii) detect the gender of each character,” they further explained.
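To give a rough sense of what counting gender in raw text involves, here is a minimal sketch that tallies gendered pronouns in a passage. It is an illustration only and not the authors' pipeline: it skips character extraction and disambiguation, and the pronoun lists and the function name `count_gendered_pronouns` are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately small pronoun lists (an assumption for this sketch).
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def count_gendered_pronouns(text):
    """Return a Counter with 'male' and 'female' pronoun tallies for the text."""
    counts = Counter()
    # Lowercase the text and split it into runs of letters.
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in MALE:
            counts["male"] += 1
        elif token in FEMALE:
            counts["female"] += 1
    return counts

sample = "He gave her his book. She thanked him, and he smiled."
counts = count_gendered_pronouns(sample)
print(counts["male"], counts["female"])  # prints "4 2"
```

A pronoun tally like this overcounts characters who are mentioned often, which is why the study describes disambiguating characters before assigning gender.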
The researchers examined more than 3,000 English-language books from the Project Gutenberg corpus. The analysis uncovered a gender bias: according to the authors, female characters are four times less represented in the literature than their male counterparts.
“Our study shows us that the real world is complex but there are benefits to all different groups in our society participating in the cultural discourse,” emphasized one co-author of the study.
“When we do that, there tends to be a more realistic view of society.”