
Catalysts Cradle


I am a biomedical scientist who enjoys reading and writing about ponies in my spare time.



Defining the Readers of each Genre · 3:53am May 9th, 2016

A couple of weeks ago, Bad Horse (part 1, part 2, and part 3) and bookplayer posted their analyses of the mystery and fantasy genres. One of the important points that bookplayer brought up is that genres are essentially defined by their readers. I will use this principle as the starting point for my analysis, but instead of drawing from a literary analysis of the books composing each genre, I will instead take a data-driven approach by analyzing the demographics of each genre's readership to try to draw some conclusions about which genres might be most similar to each other and why.

For my source data, I used a 2015 Harris Interactive survey that polled roughly 2,000 people in the US. Respondents were asked whether they had read at least one book in each of twenty-one listed genres during the past year [1]. The data were split by age into four categories (18-35, 36-50, 51-69, 70+), by gender, and by highest attained educational level (high school or less, some college, college grad, or post graduate). The raw data are available in Table 2 from the link above (and if you want any of my data or analysis code, please let me know).

The first thing to note is that preference for genre is fairly similar across all of the demographic groups studied. For example, if I compare the preference of readers aged 18-35 versus age 51-69 (see the figure below), you can see that most of the data points lie near the y = x equality line (all of the data points would lie exactly on this line if reading tastes did not differ between the groups):

There are a few outliers, however, especially a few genres which are popular with younger readers but much less popular with older readers (e.g. fantasy and graphic novels). Overall, I can quantify how much the tastes of these two groups align by computing the correlation coefficient. This value can range from 1 (the tastes of the two demographic groups align perfectly), through 0 (the tastes of the two demographic groups are completely unrelated), to –1 (one group liking a genre predicts that the other group will dislike that genre). For the age 18-35 vs age 51-69 data set, this correlation coefficient is 0.72. I can compute a correlation coefficient for each pairwise combination of the ten demographic groups and get the following matrix:

As you can see, the correlation coefficients are all positive and fairly high. The only two groups showing a relatively low correlation in their tastes (correlation coefficient = 0.35) are men and women:

Here the outlier genres preferred much more by women than men include romance and mystery whereas history seems to be one of the genres showing the greatest discrepancy in the other direction.
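For anyone who wants to try this kind of comparison themselves, here is a minimal sketch of the correlation calculation described above. It is not the code used for this post (the figures here were made in MATLAB and Excel); it is plain Python, the function names `pearson` and `correlation_matrix` are made up for illustration, and the percentages are hypothetical placeholders rather than the Harris numbers.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

def correlation_matrix(table):
    """Correlate every pair of groups; table maps group name -> per-genre values."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}

# Hypothetical percentages (NOT the survey data): one value per genre,
# listed in the same genre order for every demographic group.
readership = {
    "age 18-35": [45, 40, 30, 25, 20],
    "age 51-69": [50, 20, 35, 10, 30],
    "men":       [40, 35, 15, 20, 35],
}

matrix = correlation_matrix(readership)
for group in matrix:
    print(group, [round(matrix[group][other], 2) for other in matrix])
```

Each row of the printed output is one row of the pairwise matrix; the diagonal is always 1 because every group correlates perfectly with itself.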

Next, to compare different genres to each other, I performed the same correlation analysis but looked at all pairs of genres, computing how the interest in the two genres varied among the different demographic groups. In addition, I clustered the data [2], placing genres that show highly correlated demographic patterns close to each other (indicated by the tree structure on the left). This procedure produces the following matrix of correlation coefficients (positive correlations are colored red, negative correlations are colored blue, and uncorrelated pairs are colored white):

The clustering data and correlation matrix indicate three main groups of genres (indicated by the boxes in the above figure) [3]. Cluster 1 contains most of the fiction genres mixed among a subset of non-fiction genres, cluster 2 contains predominantly non-fiction, and cluster 3 is composed of solely fiction.
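The clustering step can be sketched roughly as follows. This is an illustration under stated assumptions, not the original analysis: it takes the distance between two genres to be 1 minus their correlation coefficient (one common way to turn a correlation into a distance; footnote [2] does not spell out the exact formula), it implements average linkage (UPGMA) naively in plain Python, and the genre profiles are hypothetical numbers.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def upgma(profiles):
    """Average-linkage (UPGMA) clustering; returns the merge tree as nested tuples.

    profiles maps a genre name to its values across the demographic groups.
    ASSUMPTION: distance between two genres is 1 - r.
    """
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    # Each cluster is (tree, members); every genre starts as its own cluster.
    clusters = [(name, (name,)) for name in names]

    def cluster_distance(c1, c2):
        # UPGMA: unweighted mean of all member-to-member distances.
        pairs = [dist[frozenset((a, b))] for a in c1[1] for b in c2[1]]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical readership (%) across four age groups, youngest to oldest.
genre_profiles = {
    "fantasy":         [40, 30, 15, 10],
    "science fiction": [38, 28, 18, 12],
    "western":         [5, 8, 15, 25],
    "mystery":         [30, 35, 45, 55],
}

tree = upgma(genre_profiles)
print(tree)
```

With these made-up profiles, the two genres that decline with age end up in one branch and the two that rise with age in the other, which is the kind of structure the tree on the left of the figure represents.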

To get a sense of why these genres cluster together, I plotted out the genres along with their preferences among the different demographic groups. In the figure below, red boxes indicate that the demographic group reads the genre more than the population average whereas blue boxes indicate that the demographic group reads the genre less than the population average [4]:

Staring at the matrix provides a few observations:
1) Cluster 1 appears to contain books read predominantly by younger readers (age 18-50) but not older readers (ages 51+). It can be further subdivided into two groups: those with predominantly female readership (cooking through true crime), and those very strongly preferred by younger readers (science fiction through health and wellness).
2) Cluster 2 does not show as strong patterns by age, but instead shows the strongest pattern by education level. Genres in this cluster are preferred by college grads and above and are less widely read among those with a high school education or below. Many also show some bias toward male readership rather than female readership.
3) Cluster 3 appears to consist of genres read predominantly by the oldest readers (age 70+).
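The red and blue values in the matrix described above come from the log ratio in footnote [4]. Here is a small sketch of that calculation. The helper names `log_ratio` and `heat_map_row` are invented for illustration, the natural logarithm is an assumption (the footnote does not say which base was used), and the shares are hypothetical.

```python
from math import log

def log_ratio(group_share, population_share):
    """log(p_x / p): positive when the group reads the genre more than average."""
    if group_share <= 0 or population_share <= 0:
        raise ValueError("shares must be positive to take a logarithm")
    return log(group_share / population_share)

def heat_map_row(shares_by_group, population_share):
    """One heat-map row: the log ratio for every demographic group."""
    return {
        group: log_ratio(share, population_share)
        for group, share in shares_by_group.items()
    }

# Hypothetical shares of each age group that read fantasy (NOT the survey data).
fantasy_by_group = {
    "age 18-35": 0.40,
    "age 36-50": 0.30,
    "age 51-69": 0.18,
    "age 70+": 0.12,
}
fantasy_overall = 0.25  # hypothetical population-wide share

row = heat_map_row(fantasy_by_group, fantasy_overall)
for group, value in row.items():
    colour = "red" if value > 0 else "blue" if value < 0 else "white"
    print(f"{group}: {value:+.2f} ({colour})")
```

Taking the logarithm makes the scale symmetric: a group that reads a genre twice as often as average sits as far above zero as a group that reads it half as often sits below.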

Intuitively, these categorizations make sense (and conform to many stereotypes we have :twilightoops:). For example, the non-fiction genres in cluster 1 (e.g. food, religion, self-help) mainly appear to be those conducive to pleasure reading versus the non-fiction genres in cluster 2 (e.g. history, current affairs, business), which are more educational in nature. Similarly, that classics/literature groups with mainly non-fiction rather than fiction genres also makes intuitive sense. These data confirm that those trying to write literary fiction are writing toward a fundamentally different audience than those writing “genre fiction.” These data also indicate that westerns and mysteries have a fundamentally different audience than other forms of genre fiction.

Do these data provide additional insight into the analyses done by Bad Horse and bookplayer? In his blog post, Bad Horse noted a connection between mysteries and westerns, a connection that can also be seen in this data set. That both are preferred by older readers (age 70+) but not age 36-50 readers suggests that some of the themes discussed by Bad Horse may appeal more to specific generations. Fantasy clusters most closely with science fiction, graphic novels and poetry, which are all characterized by appealing to younger readers (age 18-35) but not older readers (age 51+). While this could also reflect different generational tastes, it may instead reflect an appeal to people in a specific period of their life rather than to people who grew up during a specific era of history. That fantasy and sci-fi group with poetry is interesting and could reflect more of an interest in poetic language in fantasy versus other categories of genre fiction (such as romance, mystery or westerns). (Of course, just because these genres are all popular among younger readers does not mean that the same young readers are reading each of these genres.)

Finally, it’s important to note the limitations of the data set. The survey includes only Americans and it was conducted over the course of a week in 2015, so the data may reflect tastes in only one country and could be skewed by the types of books popular in 2014-15. Because the survey asked which categories of books each individual read, demographic groups that read many more books than others will on average show higher numbers across the genres [5]. In addition, the demographics themselves are not necessarily independent of each other. For example, the age 18-35 demographic is probably depleted of college grads and individuals with post graduate degrees relative to the other demographic groups (which does show up in the data). These interdependencies in the data probably skew the clustering to some extent. For example, whereas the biggest differences in tastes are between men and women, major clusters are not separated by gender preferences. Principal component analysis would likely help in disentangling the interdependencies between the demographic groups (which may be the subject of a part II for this blog post).
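As a rough illustration of what principal component analysis would do with data like these, the sketch below finds the single direction across demographic columns along which genres differ the most. None of this was run for the post: the function name `first_principal_component` is hypothetical, the power-iteration method is just one simple way to get the leading component without extra libraries, and the numbers are made up.

```python
from math import sqrt

def first_principal_component(rows, iterations=500):
    """Return (unit direction, share of total variance) for the leading component.

    rows: one list per genre, holding that genre's value in each demographic group.
    Uses power iteration on the covariance matrix (a simple sketch, not optimized).
    """
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows")
    # Center each column (demographic group) on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance between every pair of columns.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    total_variance = sum(cov[j][j] for j in range(d))
    if total_variance == 0:
        raise ValueError("all rows are identical")
    # Arbitrary non-uniform start vector; repeatedly multiply and renormalize.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("start vector carries no variance; try another start")
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    return v, eigenvalue / total_variance

# Hypothetical readership (%) per genre; columns: age 18-35, age 70+, post graduate.
columns = ["age 18-35", "age 70+", "post graduate"]
genres = {
    "fantasy":        [40, 10, 22],
    "graphic novels": [25, 3, 10],
    "western":        [5, 25, 8],
    "mystery":        [30, 55, 40],
    "history":        [20, 35, 45],
}

direction, share = first_principal_component(list(genres.values()))
for name, weight in zip(columns, direction):
    print(f"{name}: {weight:+.2f}")
print(f"share of variance explained: {share:.2f}")
```

The weights show how much each demographic column contributes to the main axis of variation; columns that move together (for example, if young and highly educated readers overlap) load onto the same component instead of being counted as independent signals.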


[1] The data include only respondents who read at least one book in the average year, so the numbers will seem higher than one might expect because they exclude those who don't read many books.
[2] I used the UPGMA algorithm with the correlation coefficient as the distance metric.
[3] Clusters 1 and 2 do seem to show some additional structure within them, so you could make the case for five clusters.
[4] Specifically, I am plotting the value log(p_x/p), where p_x is the proportion of demographic group x that reads the genre and p is the average readership of that genre among the population surveyed.
[5] This fact can be seen in the first scatterplot; more genres fall below the equality line than above, likely reflecting that those aged 51-69 read more books than those aged 18-35.

Comments ( 9 )

I love the heat maps! This is a great way of looking at this data. I especially love that you didn't stop with just one correlation matrix or one heat map, because some things that look solid in one heat map get evaporated by another, like the apparent clustering in the first heat map of "fantasy" with "things men read", which turns out to be more an artifact of age group.

Some real surprises:

- Literature/Classics clusters with Business, not Poetry or other fiction--but this seems to be due to clustering with high education
- Poetry is read by the young, not by the old
- Women don't read many books on business. This has serious implications for government policies which assume that women are equally interested in starting businesses.
- The male - female correlation is the single lowest correlation of all, by far
- I may get lynched for pointing this out, but look at the correlations between level of education, and male / female.

3939075

I may get lynched for pointing this out, but look at the correlations between level of education, and male / female.

I found this observation surprising as well. If low-education women read more books than low-education men, this difference could skew the data to make female readership appear more correlated with low education readership. Do more men pursue advanced degrees in the US? That could also affect the results.

It's too bad that the pollsters didn't release any cross-tabs, as they would probably be useful in interpreting some of these correlations.

3939107 Women are more likely to go to undergraduate and graduate school than men in the US, so no. But the sample of 2000 might have been biased, or just not big enough. Judging from their list of most-popular authors, I'd guess there weren't many people in their sample with graduate degrees. It could also be that the age distribution in the sample is different for men and women.

3939107 3941482
I'm going to toss out an untestable theory (at least, using this data) that would line up with my observations: women who are out of college don't have as much time to read as men do. I mean, if I recall, studies have shown that women tend to end up with more household and childcare duties than men do, so it would make sense that their husbands would have time to read more books.

(Of course, by that logic, busier women who do have time to read might not have time to take polls about reading.)

3941548
That's a good hypothesis. There was a 2008 poll on the number of books read by each demographic group (women read more overall), but unfortunately, they didn't collect data on educational attainment. Hard to draw any conclusions without more data.

The longer I stare at this, the more fascinated I am. I'm particularly surprised by some of the differences between genders: like, current affairs, really? And poetry is popular with young people (would have thought the opposite), but apparently only those with college education. It falls off after that. Did most of the lit majors just not pursue graduate education? Assuming they would be the most likely group to read poetry in college.
Very interesting stuff.

these categorizations make sense (and conform to many stereotypes we have :twilightoops:)

Almost to an unsettling degree.

The strength of these correlations is pretty unsettling. And since I don't think you explicitly stated it anywhere, preference is defined in the first & third graphs as the percentage of a group that has read at least one book of a given genre within the last year, right?

I love the heatmaps and correlation tables, by the way - what software did you use to plot those?

4117279
Yes, the axes on the first and third graphs show the percentage of a group that has read at least one book of the given genre in the past year (note that the data include only those who have read at least one book in the past year).

I plotted the heat maps in MATLAB (adding the labels in PowerPoint). The correlation table (Fig 2) is just an Excel table using conditional formatting to color the values.

Loved your best pony analysis post, by the way.

Never mind; answered in your previous comment.
