
Catalysts Cradle

I am a graduate student in the sciences who enjoys reading and writing about ponies in my spare time.



Defining the Readers of each Genre: Part II · 1:21am May 16th, 2016

In my previous post, I used data from a 2015 Harris survey to examine how reader demographics vary across different genres of books. From this data, I was able to group the genres into three rough clusters: cluster 1 seems to consist of pleasure reading favored by younger readers, cluster 2 consists of educational reading favored by highly educated readers, and cluster 3 consists of books read more by older readers. In this post, I’ll use a different analytical technique called principal component analysis to sort through the data and see if it yields any different conclusions.

In my dataset, each genre is associated with ten different variables, describing the readership of that genre among all ten demographic groups (four age groups, two genders, and four levels of education). Plotting such data is difficult as humans aren’t used to visualizing a ten-dimensional space. Principal component analysis (PCA) seeks to simplify a multi-dimensional data set by defining a new set of coordinate axes to capture most of the differences between the data points using a smaller number of dimensions (here is a site with a nice, conceptual explanation of the process). So, are all ten demographic categories required to describe the data or can the data be summarized with a smaller number of variables? Let’s science the shit out of this dataset:

As the graph shows, applying PCA to the dataset [1] reveals that 2-3 dimensions are sufficient to capture most of the features of the dataset. This result makes intuitive sense as the ten demographic categories reflect only three independent types of information (age, gender, and education). Each of the dimensions, called principal components (PCs), is made up of some combination of the original variables, and the breakdown of the PCs is shown below:

PC1 appears to reflect the tastes of older males, PC2 appears to reflect the tastes of older females, and PC3 appears to reflect the tastes of more educated readers. Plotting the data along the first two principal components gives the graph on the left:

The graph identifies a large cluster of genres near the middle along with a few outliers. These outliers (romance, chick-lit, poetry, graphic novels, and business) likely represent genres with niche audiences, whereas the genres that fall within the larger cluster are aimed at a broader, general audience. Fantasy falls right on the border. The graph on the right shows a zoomed-in view of the larger cluster of genres.
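For anyone who wants to try the steps so far on their own numbers, here is a minimal sketch of PCA using a singular value decomposition in NumPy. It is not the exact code behind the graphs in this post: the function name `pca`, the mean-centering step, and the randomly generated stand-in data (20 made-up "genres" driven by three hidden factors) are all assumptions for illustration, not the Harris survey data.

```python
import numpy as np

def pca(scores):
    """Return (explained variance ratios, components, projected data).

    `scores` is a (genres x demographics) array. Centering each column
    before the decomposition is an assumption of this sketch.
    """
    X = np.asarray(scores, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each demographic column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = S**2 / (X.shape[0] - 1)          # variance along each PC
    explained_ratio = variances / variances.sum()
    projected = Xc @ Vt.T                        # coordinates of each genre on the PCs
    return explained_ratio, Vt, projected

# Synthetic stand-in data: 20 hypothetical genres, 10 demographic scores,
# generated from 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 3))
mixing = rng.normal(size=(3, 10))
scores = latent @ mixing + 0.05 * rng.normal(size=(20, 10))

explained_ratio, components, projected = pca(scores)
print(np.round(explained_ratio, 3))   # share of variance captured by each PC
print(np.round(components[:3], 2))    # how each demographic contributes to PC1-PC3
pc1_pc2 = projected[:, :2]            # the coordinates you would scatter-plot
```

The first printout corresponds to the "how many dimensions do I need" graph, the rows of `components` correspond to the breakdown of each PC into the original demographic variables, and `pc1_pc2` holds the two columns you would plot against each other.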

How does the PCA compare with the cluster analysis in the previous post? I’ve re-plotted the data along the first two principal components coloring cluster 1 blue, cluster 2 red, and cluster 3 green:

The clusters seem to separate fairly well along the principal components, though the plot points to a few genres where the classification is a bit ambiguous (e.g. the genres along the blue/red border: science fiction, other non-fiction, and classics/literature). Business also looks to be somewhat misclassified, though it clusters well with the other red points when you consider PC3:

As the above graph shows, the clustering done previously fits well with the principal component analysis. The principal component analysis suggests that cluster 1 should probably be split into two sub-clusters based on PC2. Thus, food, religion, chick-lit, self-help, romance and true crime should be clustered together based on their high score in PC2 while science fiction, fantasy, graphic novels, and poetry should be classed separately based on their low score in PC2 (essentially reflecting the male/female readership divide in cluster 1). Health and wellness, while grouped with cluster 1B in the automated analysis, probably fits better with cluster 1A.

Well, this has been a fun exercise in taking a dataset and running a bunch of different analyses on it. I hope you guys have found it interesting as well.

[1] Because I’m concerned with how much the tastes of each demographic differ from average, I’m first normalizing the data by calculating a demographic score. For each genre, the demographic score for demographic x is given by the formula log(p_x/p), where p_x is the readership of the genre among demographic x, and p is the readership of the genre among the entire population.
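As a rough illustration of the demographic score in [1], here is a short NumPy sketch. The function name `demographic_scores` and the example numbers are made up for illustration, and I'm assuming the natural logarithm here; a different base would only rescale every score by the same constant.

```python
import numpy as np

def demographic_scores(readership, overall):
    """Compute log(p_x / p) for every genre and demographic.

    readership: (genres x demographics) array of p_x values
    overall:    (genres,) array of p values for the whole population
    Natural log is an assumption of this sketch.
    """
    readership = np.asarray(readership, dtype=float)
    overall = np.asarray(overall, dtype=float)
    return np.log(readership / overall[:, np.newaxis])

# Hypothetical numbers, not survey data: two genres, two demographics.
readership = np.array([[0.30, 0.15],
                       [0.10, 0.10]])
overall = np.array([0.20, 0.10])

scores = demographic_scores(readership, overall)
print(scores)
```

A positive score means the demographic reads the genre more than the population as a whole, a negative score means less, and zero means the demographic matches the overall average.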
