Ponkadoodle


Fimfiction Word Frequency Data · 4:45am Aug 13th, 2016

Here's a JSON file that maps each word to the frequency with which it occurs across all fimfiction stories. It contains the top 155,727 words from JockeTF's 2016/05/25 snapshot.

Why?
In trying to solve the codes on the covers of iisaw's books, I needed a way to determine how likely it was that any given decoded sentence candidate contained actual words, in order to rank them. I couldn't find a suitable, modern source of English word frequency data, and it's very possible that the code contains 'pony' words or names, so I threw this together.
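For illustration, here is a minimal Python sketch of that kind of ranking. It is not the exact code I used: the function names, the toy frequency table, and the `floor` value for unknown words are all assumptions made for the example. In practice the `frequencies` dict would come from loading the JSON file with `json.load`.

```python
import math

def score_candidate(sentence, frequencies, floor=1e-9):
    """Average log-frequency of the words in a candidate sentence.

    Words missing from the table get `floor`, an arbitrary small value
    (an assumption for this sketch), so gibberish scores very low.
    """
    words = sentence.lower().split()
    if not words:
        return float("-inf")
    return sum(math.log(frequencies.get(w, floor)) for w in words) / len(words)

def rank_candidates(candidates, frequencies):
    """Sort candidate decodings from most to least word-like."""
    return sorted(
        candidates,
        key=lambda s: score_candidate(s, frequencies),
        reverse=True,
    )

# Toy table standing in for the real data; these values are made up.
toy_frequencies = {"the": 0.05, "pony": 0.001, "ran": 0.0002}

ranked = rank_candidates(["xqz vbn kkt", "the pony ran"], toy_frequencies)
print(ranked[0])  # the candidate made of real words comes first
```

Averaging the log-frequencies rather than summing them keeps longer candidates from being penalized just for having more words.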

I already had the counts of each word across all of fimfiction on disk as a byproduct of my earlier analysis, so I just converted them to frequencies by dividing by the total number of words on fimfiction (about 205,000,000). I then discarded everything with frequency < 5e-8 (roughly 10 occurrences), as those tended to be nonsensical typos that just ate up space in the data and added noise. As a result, though, the frequencies only sum to 98.6%. Unfiltered counts can be found inside aggregated.json (data["associations"]["text"]) in the same repository.
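The conversion can be sketched in Python roughly as follows. This is a simplified illustration, not the actual script: the function name and the toy counts are made up, and it assumes the total word count is just the sum of all the per-word counts.

```python
def counts_to_frequencies(counts, min_frequency=5e-8):
    """Turn raw word counts into frequencies and drop the rare words.

    Assumes the corpus total is the sum of all counts (an assumption
    for this sketch). Frequencies are computed against the unfiltered
    total, so the surviving values sum to slightly less than 1.
    """
    total = sum(counts.values())
    frequencies = {word: count / total for word, count in counts.items()}
    return {w: f for w, f in frequencies.items() if f >= min_frequency}

# Toy counts with a made-up typo; a much higher cutoff than 5e-8 is used
# here only because the toy corpus is tiny.
toy_counts = {"the": 13, "pony": 6, "teh": 1}
toy_freqs = counts_to_frequencies(toy_counts, min_frequency=0.1)

print(sorted(toy_freqs))        # the rare typo has been dropped
print(sum(toy_freqs.values()))  # less than 1 because of the filtering
```

With the real numbers, a cutoff of 5e-8 against about 205,000,000 words works out to roughly 10 occurrences, which matches the threshold above.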

Ponkadoodle · 289 views · #statistics
Comments ( 1 )

Hmm. I can use this to normalize word counts from pony stories when comparing them to non-pony stories with stylo.
