
Bad Horse


Beneath the microscope, you contain galaxies.



The Principal Dimensions of English · 2:10am Sep 16th, 2015

I want to talk about the principal dimensions of theories of art, but to do that, I must explain what I mean by "principal dimensions". Besides, you should learn this stuff anyway.

How to Recommend Stories if you're a computer

Suppose you want to decide, after reading the first thousand words of a story on fimfiction, whether to recommend that Bob read it. Also suppose you’re a computer, so you must summarize the story in numbers. You can list some numbers for each story: its number of words, whether it’s complete (1) or incomplete (0), whether it has the Sex tag, the Mature tag, the Romance tag, etc., how many grammatical errors it has per 1000 words, how good the style is on a scale from 0 to 10, and how often the words “kiss”, “snuggle”, “sesquipedalian”, and “bloody” appear.
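In code, that list of numbers might look roughly like this (a toy Python sketch; the feature names and tiny word list are made up for illustration, and a real system would track tens of thousands of word counts):

```python
import re
from collections import Counter

def story_features(text, tags, complete):
    """A toy feature vector for one story. The specific features and
    vocabulary here are illustrative, not what a real system would use."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    features = {
        "word_count": len(words),
        "complete": 1 if complete else 0,
        "tag_sex": 1 if "Sex" in tags else 0,
        "tag_mature": 1 if "Mature" in tags else 0,
        "tag_romance": 1 if "Romance" in tags else 0,
    }
    # One dimension per vocabulary word -- this is where the tens of
    # thousands of numbers come from.
    for w in ["kiss", "snuggle", "sesquipedalian", "bloody"]:
        features["count_" + w] = counts[w]
    return features
```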

You end up with tens of thousands of numbers about the story. What do you do with them?

First, you make up terminology. Let N be the number of numbers (say, 23724). Call each of the things being counted a “dimension”, and the whole set of N numbers a “point” in “N-dimensional space”. (Don’t worry that you can’t picture N-dimensional space. Nobody can.)

Next, you need to get similar sets of N numbers for a bunch of other stories. Then, you need to know which of those stories Bob likes. Then, you recommend Bob read the story if its values for those numbers are similar to those for the stories Bob likes.

But you might only know 100 stories that Bob likes. Hardly enough to determine exactly how he feels about the word “sesquipedalian”.

What you could do is look at the other stories and group together counted things that tend to go together most of the time. For instance, you might notice that stories that use the word “snuggle” a lot also use “kiss” more than the others, but use “wart” and “angst” less often. So you make a single new, fake dimension, which for the benefit of any humans reading this I will call Kissiness, like this:

Kissiness = 30*Romance - 10*Dark + 3*count(kiss) + 3*count(cuddle) + count(close) + count(smooth) … - count(angst) - 2*count(wart)

You make some other fake dimensions that group together words that seem more or less likely to be found together:

Sexiness = 30*Sex + 10*Mature - 300*Everyone + count(grope) + count(hard) + count(soft) + 3*count(erect) + ...

Violence = 10*Gore + 5*Mature + 3*count(bloody) + 5*count(battle) + count(hard) - count(soft) - count(fuzzy)

Second_Person = 10*count(you) + 5*count(your) - 10*count(I)

Superheroics = 5*count(power) + 2*count(mighty) + 3*count(evil) + count(cape) + 3*count(costume) + 2*count(mask)

Obscenity = count(@$!) + 5*count(@#$@#$) - count(fudge) - count(hay)

Bigwordiness = count(assiduous) + count(voracious) + count(punctilious) + ...

(I’ve listed several words each, but more realistically there would be hundreds making up each fake dimension.)
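Each fake dimension is just a weighted sum of real dimensions, i.e. a row of a projection matrix. A toy sketch using a few of the invented Kissiness weights from above:

```python
import numpy as np

# Columns are real dimensions (tags and word counts); each row of W is one
# fake dimension. The weights are the made-up ones from the formula above;
# a real system would have hundreds of nonzero weights per row.
real_dims = ["Romance", "Dark", "kiss", "cuddle", "close", "angst", "wart"]
W = np.array([
    [30, -10, 3, 3, 1, -1, -2],   # Kissiness
])

story = np.array([1, 0, 12, 5, 7, 2, 0])   # one story's real-dimension values
fake = W @ story                            # its value on each fake dimension
print(dict(zip(["Kissiness"], fake)))
```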

Then you can build categories using just the fake dimensions. You might notice that a bunch of stories have the Human and Incomplete tags, and high scores in Second_Person and Superheroics, and call them “Displaced.” Or you might notice a bunch of stories with the Romance, Gore, and OC tag have high Kissiness, Violence, and grammatical errors per 1000 words, and low Bigwordiness and style, and call them “Alicorn OC Tales”.

In fact, I think this is very similar to what your brain does for you.

Principal Component Analysis (PCA)

There’s a signal processing technique called Principal Component Analysis (PCA) which is one of the Deep Insights into Everything that philosophy students should study instead of Plato’s forms. It does all this for you automatically, optimally, for any category of things described by points in an N-dimensional space. It looks at a whole bunch of such things, then figures out the one single best fake dimension that gives the data the widest spread [0]. Then it removes that dimension from the data, and does the same thing again, figuring out the second-best summary dimension. Do that 10 times, and you get 10 summary dimensions. Compute the values along those 10 dimensions for all the points in your N-dimensional space, throw away the original points, and you’ll still have most of the information that was in the original N dimensions. [1]
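Here's a minimal numpy sketch of that procedure, plain linear PCA on made-up data (real systems use fancier variants, but the steps are the same):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 200))      # 5000 stories, 200 real dimensions (toy data)

Xc = X - X.mean(axis=0)               # center each dimension
cov = Xc.T @ Xc / (len(Xc) - 1)       # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]     # largest variance first
components = eigvecs[:, order[:10]]   # the 10 best "fake dimensions"

X10 = Xc @ components                 # every story summarized by 10 numbers
```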

Then, predict that the probability that Bob will like a story is the probability that he likes other stories near it in 10-space.
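A crude way to turn "near it in 10-space" into a probability, for illustration only (real recommenders weight the neighbors and regularize):

```python
import numpy as np

def p_bob_likes(new_point, known_points, bob_liked, k=20):
    """Fraction of the k nearest known stories (in the 10-D summary space)
    that Bob liked. new_point: shape (10,); known_points: shape (n, 10);
    bob_liked: boolean array of shape (n,)."""
    dists = np.linalg.norm(known_points - new_point, axis=1)
    nearest = np.argsort(dists)[:k]
    return bob_liked[nearest].mean()
```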

This is the technique that won over all other approaches in the $1,000,000 Netflix contest, which was probably the biggest experiment in predicting ratings ever. The key innovation in the contest was using a fast way to approximate PCA [2]--a way which, incidentally, can be done by neurons.

Plus, once you’ve got the things you’re dealing with down to 10 dimensions or so, you can use logic or computation on them. You can have complex rules like, “If each line of the story has a similar repeated pattern of stresses, it’s probably poetry” if your analysis has discovered dimensions corresponding to "trochee" and "spondee" [3]. (Which it might, if your training stories had a lot of poetry in them.)

A lot of what cultures do, and what your brain does, is basically PCA followed by categorization and then thinking with those categories. All this crazy high-dimensional stuff happens, and people try to come up with concepts to simplify and explain it. The intermediate-level concepts produced, like “pretty”, “harmonious”, and “cruel”, are not real things. They are fake summary dimensions, each a sum (or function) of lots of real dimensions, that capture a lot of the differences between real things.

Then people build more concepts out of that smaller number of intermediate concepts. Because there are fewer of them, they can use more-powerful ways to combine them, like logical rules or lambda functions, to say whether something is “just”, “virtuous”, “beautiful”, or “sublime”. [4]

Finding Data Points in the Real World

If you don’t have the N-dimensional data for all your objects, don't worry. You don’t need it. If you can take any 2 objects and say how similar or different they are, or even just whether they’re similar or different, you can jump straight to the lower-dimensional space that PCA would produce. Call it M-dimensional space, M << N. Here’s how:

Compare a bunch of object pairs. Say for instance difference(kind, compassionate) = 1, difference(kind, hurtful) = 6. Then use the differences between them as distances in a low-dimensional space. Start each of the objects at a random point in M-dimensional space (a popular choice is to distribute them on the surface of an M-1 dimensional sphere around the origin), then repeatedly push pairs apart if they’re too close, and pull them nearer each other if they’re too far (keeping them on the surface of that sphere if you’re doing it that way), until most pairs are about as far apart in your M-dimensional space as the distance says they should be.
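That push-pull procedure is essentially multidimensional scaling by gradient descent. A rough sketch, skipping the optional sphere constraint:

```python
import numpy as np

def embed(pairs, n_objects, M=10, steps=20000, lr=0.01, seed=0):
    """pairs: list of (i, j, target_distance). Returns an (n_objects, M)
    array of positions whose pairwise distances roughly match the targets."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(size=(n_objects, M))
    for _ in range(steps):
        i, j, d = pairs[rng.integers(len(pairs))]
        diff = pos[i] - pos[j]
        dist = np.linalg.norm(diff) + 1e-9
        # Pull together if too far apart, push apart if too close.
        step = lr * (dist - d) * diff / dist
        pos[i] -= step
        pos[j] += step
    return pos

# e.g. difference(kind, compassionate) = 1, difference(kind, hurtful) = 6
pairs = [(0, 1, 1.0), (0, 2, 6.0), (1, 2, 6.0)]
points = embed(pairs, n_objects=3, M=2)
```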

(How do you choose M? You make it up. Everything left over gets mashed together in the Mth dimension, so if you want 10 meaningful dimensions, set M = 11.)

The principal dimensions of English

We can do exactly this with the English language, using lists of synonyms and antonyms, and see what our M summary dimensions are. In fact, somebody already did. Given some reasonable assumptions and one particular thesaurus, the 10 most-important dimensions of the English language are, roughly:

1. Good/increase vs. bad/diminish
2. Easy vs. hard
3. Start/open/free vs. finish/close/constrain
4. Peripheral vs. central
5. Essential vs. extra
6. Pull vs. push (sort of)
7. Above vs. below
8. Old vs. young
9. Normal vs. strange
10. Specific vs. general

I call these the principal dimensions. If we were doing PCA, we’d call them the principal components. Same thing. [5]

By contrast, if you do the same thing for French with a French thesaurus, these are the first 3 dimensions:

1. Good/increase vs. bad/diminish
2. Easy vs. hard
3. Start/open vs. finish/close

Whoops! Did I say by contrast? They’re the same. Because the dimensions that fall out of this analysis aren’t accidents of language. Languages develop to express how humans think. And that’s how humans think, at least Western Europeans. [6, 7, 8]

...but you said this had something to do with art

Here’s how all this is relevant to art: I want to claim I’ve discovered the first principal dimension of theories of art. I’m going to show (hopefully) that the position of different cultures on this dimension predicts something important about what type of art they value. But you need to understand what I mean by their position on this dimension, and what I mean by a type of art.

A type of art is like a mental disease. You diagnose it by noting that it contains, say, any 5 out of a list of 12 symptoms. The art type, or disease type, is a category. Its “symptoms” are measurements on summary (principal) dimensions. The actual data for a culture are going to be things like the degree to which power and wealth are centralized, the level of external threats, the heterogeneity of social roles, and the education level. [9] The principal dimension I’m going to talk about is not a real thing-in-the-world, though it is real. It’s determined by a statistical correlation between actual things in the world.


[0] Technically, the largest variance.

[1] There are many ways of doing PCA, and many related dimension-reduction techniques like "non-linear PCA" and factor analysis. Backpropagation neural networks are doing non-orthogonal PCA, though this wasn't realized for many years after their invention.

[2] Except that they didn't technically do PCA because they didn't have the N-dimensional points. They assumed that each movie was described by an N-dimensional point, and that each user had an N-dimensional preference vector saying how much he liked high values on each dimension, and that their ratings were the dot products of these two vectors. Then they used singular value decomposition (SVD) to construct low-dimensional approximations to both kinds of vectors. So they ended up with the low-dimensional points without ever knowing the "real" original N-dimensional points. If anyone understands how to do this with PCA, please tell me.
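For illustration, here's a toy version of that assumed model, with plain stochastic gradient descent standing in for the SVD machinery (this is a sketch, not the contest code):

```python
import numpy as np

def factorize(ratings, n_users, n_movies, k=10, lr=0.01, epochs=30, seed=0):
    """ratings: list of (user_index, movie_index, rating) triples.
    Learns k-dimensional user preference vectors U and movie feature
    vectors V so that rating ~= U[user] . V[movie]."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_movies, k))
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - U[u] @ V[m]          # how wrong the dot-product model is
            u_old = U[u].copy()
            U[u] += lr * err * V[m]        # nudge both vectors to shrink the error
            V[m] += lr * err * u_old
    return U, V

# Hypothetical usage: three users, two movies, a handful of ratings.
U, V = factorize([(0, 0, 5), (0, 1, 1), (1, 0, 4), (2, 1, 2)],
                 n_users=3, n_movies=2, k=2)
print(U[0] @ V[0])   # should drift toward 5
```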

[3] If all you want to do is recommend stories to Bob, it turns out it isn’t helpful for a computer program to construct the final genre categories. It’s already got the point in N-space for a story; saying which genre that point lies within just throws away information. Just do your PCA and predict whether Bob likes that point in N-space. (Reference: The Netflix contest winners and losers.) But if you want to use logic to reason about genres (say, what themes are common in which genres), then you’ll have to categorize them.

[4] Many of the supposed proofs that meaning cannot be compositional (compositional: a term can be defined without reference to the entire dictionary) stem from the fact that philosophers don't understand that first-order logic is strictly weaker than a Turing machine (lambda functions). "Logic" is a weak form of reason compared to computation.

[5] The fact that you can reconstruct these dimensions, and will get the same answer every time even with significant changes in the data, refutes the cornerstone of post-modern philosophy, which is that scientific theories, social structures, and especially language, are underdetermined by the world. That is, they claim that any one of an approximately infinite number of other ways of doing things, or categorizing things, or thinking about things, would work equally well, and the real world underlying the things we say can never be known. But in fact, casual experimentation proves that language is astronomically overdetermined. (The number of constraints we get from how linguistic terms relate to each other and to sense data is much larger than the number of degrees of freedom in the system.)

[6] Contrary to what Ferdinand de Saussure said, and post-modern philosophers after him assented to, thought came first, language second. We can excuse him for making this mistake, because he was writing before Darwin's theories were well-known, except oops no he wasn't.

[7] Some “synonyms” are words that are opposite on one dimension, and the same on all the others, allowing people to invert a particular dimension. Examples: challenge / obstacle, abundant / gratuitous (differ on good/bad), tug / yank (on easy/hard), funny / peculiar (on normal/strange).

[8] If you feed the algorithm radically different data, you’ll come up with different dimensions after the first few dimensions, as I suppose they did for French in that study.

So what happens if two people had different life experiences, and their brains came up with different principal dimensions?

It turns out we have a word for this, an old word that predates the math needed to understand it this way: we say they have different paradigms. They classify things in the world using a different set of dimensions. When they think about things, they come up with different answers. When they talk to each other, they each think the other is stupid. This is why political debates rarely change anyone’s mind; the people on opposite sides literally cannot understand each other. Their brains automatically compress their opponents’ statements into dimensions in which the distinctions they’re making are lost.

This is, I think, the correct interpretation of Thomas Kuhn’s observation that scientists using different paradigms can’t seem to communicate with each other. It doesn’t mean that the choice of paradigm is arbitrary. Different paradigms are better at making distinctions in different data sets. Someone who’s grown up with one data set can’t easily switch to a different one; she would have to re-learn everything. But, given agreement on what the data to explain is, paradigms can be compared objectively.

[9] Yes, it turns out I’m a literary Marxist. Sorry.

Comments ( 55 )

If all this is just a computational model of how humans think, can't we get pretty much the same results by collating what stories humans 'like'? I.e.:

- Bob liked OC Alicorn Deathlord 4: The Gorening and Dark Stains on my Soul: The Black and Red Mane of Revenge
- Alice liked OC Alicorn Deathlord 4: The Gorening
- Alice will probably also like Dark Stains on my Soul: The Black and Red Mane of Revenge[1]

Which is, if I'm correct, pretty much how most of the algorithms work today?


[1] All these titles are copyright Cold in Gardez, 2015. Look for them soon!

3396173 That's how most algorithms on the web appear to work today, but it gives slightly poorer performance. Partly because it runs into data sparsity, partly because there are users who say they like everything, partly because it's more likely to just predict popular stuff.

But the main thing I want to get across is that what I mean by a "principal dimension" isn't a real, directly observable thing, but a combination of lots of things that often go together and other things that seldom go with them.

"for any category of things described by points in an N-dimensional space." - In other words, for any thing that actually exists, period. And probably most things that are only hypothetical as well.

If PCA is the result of a contest held by Netflix, I'm half surprised that the resulting algorithm was made public. As opposed to being hoarded covetously like it was the formula for Coca-Cola or the secret to extracting aluminum (that one's a fascinating story, look it up). Not that that would actually work for very long, since others would arrive at it independently, but I'd expect them to at least want to have a better ratings system than their competition for a few years.

In any case, there's no way philosophy majors would study PCA because it sounds too much like math. In that it uses an actual method to produce an actual end result that does something.

By the way, your constant resizing between paragraphs in this post is giving me flashbacks to Timecube. Maybe you should back away from things that bring you closer to the crazies on that dimension? Then again, the one paragraph that you rendered in smaller text (the one about language being overdetermined/underdetermined) didn't make sense until a few paragraphs later in the post, so at least there's a method to your madness.

This is the technique that won over all other approaches in the $1,000,000 Netflix contest, which was probably the biggest experiment in predicting ratings ever. The key innovation in the contest was using a fast way to approximate PCA [2]--a way which, incidentally, can be done by neurons.

"Well, we originally figured out how the human visual cortex worked when we were trying to determine which videos to recommend to users on Pornhub. It had a big sample size and we got many samples per user."

"So how did you figure out how Broces' area worked?"

"A random user wrote an algorithm for recommending stories on a pony fanfiction website."

"Why don't you start out with that story, and not with the porn?"

"Because the porn is less embarrassing."

Or, alternatively:

"A random user wrote an algorithm for recommending stories on a pony fanfiction website, but never generalized it."

"Wait, is that why all our custom-made entertainment features unicorns?"

"No, son, that's because we're perverts."[1]

[1] Most American families prefer pegasi, finding their flighty yet competitive nature much easier to associate with.

3396348 PCA is much older than Netflix. I think the contest was the first use of it in a large-scale recommendation system. I've heard they still don't use it in their online recommendation system. A Netflix employee said something (in Wired, I think) to the effect that it turns out that what users want isn't to predict which movies they'd rate highly, but he didn't say what it was they did want.

Isn't this similar to what Pandora does with music? They hired a bunch of musicians to categorize music along a bunch of musical dimensions, then they probably applied something like PCA to classify the songs so they could make recommendations.

This is exciting and I want to see where you go with it! (And I don't have a whole lot of other comments to make at the moment, because this stuff is kind of my bread and butter, and the explanation here didn't really tell me anything methodologically that I didn't already know, though the language stuff was cool.)


3396348

In any case, there's no way philosophy majors would study PCA because it sounds too much like math. In that it uses an actual method to produce an actual end result that does something.

Actually, the Logic and Philosophy of Science program where I go is very much not math-phobic. I think there are only a handful of dedicated LPS programs around the US, but of the two I know, they freaking rock at this stuff. We had a guy come in and give a seminar about solving the Sleeping Beauty Problem last year, and it was epic. And basically the takeaway was that nobody arguing about it had figured out the math, and that they were dealing with "probability" measures that integrated to more than 1, which was causing all the trouble.


3396456
Oh God, Pandora. We had somebody come in from Pandora a couple years ago to give a stats seminar, and she didn't seem to have any clue how anything worked. I'd like to think they know what they're doing, but she didn't give me much confidence that they're working within telescope range of the cutting edge.

3396456 One would hope so. But the recommendations I've gotten from Pandora were clearly based on picking particular bands that were associated with the bands I was playing, not on musical content, and I wasn't able to get any hybrid recommendations. (If I played 70s rock and baroque music on one channel, for example, all I got was recommendations for baroque and recommendations for 70s rock.)

Actually... So bookplayer and I have been chatting about this blog a bit on Skype, and I'm not sure I get what you're saying about language as well as I thought I did. I can certainly see how some concepts can get sorted along those ten dimensions, but I'm getting hung up on the word "feather-duster". Is there just a whole gigantic chunk of things that hang out at the origin of the reduced space? What exactly is this decomposition of English supposed to be? It's easy for me to understand how adjectives slide into that framework. Are they the only concepts that are being considered important here?

3396600 Sorry if I gave the impression that it's just those 10 dimensions. I think sense data is reduced to a smaller number of dimensions--maybe it's 200, who knows? -- and then those dimensions are used to build another layer of concepts on top of them, and then those are used to build another layer on top of that. At least some of the later layers are built using different ways of combining things. And each sensory modality has its own specialized representations as well: vision starts with detecting center/surround spots and edges at different orientations, and moves up to detecting regions of motion, and might build pretty sophisticated representations before it rises near the level of language. Sound starts with frequencies. Smell is controversial.

Certain categories of information are handled non-linguistically. The knowledge of how to grab and swing a hammer, or what things you use a hammer on, is supposedly in a different part of the brain than knowledge about the word "hammer". This was believed from studies of old folks with focal brain lesions, and some has been confirmed by fMRI studies.

Ah, I love this area of study, and how, algorithms aside, we have only a very vague idea of the gritty details of how a lot of the decisions made by software with massive amounts of data to mine actually get produced. But hey, it works...

Sadly my knowledge stops at the basics of image recognition and sampling, but maybe that may change in the coming years.

the mentions of Alicorn OCs made me try to actually look for some stories about those.
from casually browsing the search results, it seems that ironic Alicorn OCs are a huge infestation on the site. It took much more work to find some that weren't parodies. HRMMM :trixieshiftright:

I've always wondered what it would be like to study philosophy.

This is the technique that won over all other approaches in the $1,000,000 Netflix contest, which was probably the biggest experiment in predicting ratings ever.

I think you can claim that PCA is a good chunk of any modern machine learning technique since it's so general a term, but the winning technique of the Netflix competition used more than just PCA variants. KNN, for example. Also, I think you have the wrong entry since the one you linked wasn't the winning one. Here's the (last, 2009) winning entry (http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf).

Their brains automatically compress their opponents’ statements into dimensions in which the distinctions they’re making are lost.

I've actually noticed this as well. There's a guy in my office that I cannot have an extended conversation with because both of us so horribly misinterpret what the other is trying to say. It's actually faster for us to talk through a third guy that can understand both of us. It also happens between two of my friends that have strangely known each other for longer than I've known them. It happens frequently enough that I've learned to translate in realtime as one is saying something the other will misinterpret.

[2] Except that they didn't technically do PCA because they didn't have the N-dimensional points. They assumed that each movie was described by an N-dimensional point, and that each user had an N-dimensional preference vector saying how much he liked high values on each dimension, and that their ratings were the dot products of these two vectors. Then they used singular value decomposition (SVD) to construct low-dimensional approximations to both kinds of vectors. So they ended up with the low-dimensional points without ever knowing the "real" original N-dimensional points. If anyone understands how to do this with PCA, please tell me.

Typically when machine learning people say things that they don't use, they're just stating the assumptions under which their algorithm is justifiable. To me, that summary sounds like "Ratings are linearly separable into a user preference matrix and a movie feature matrix. SVD can give us a factorization, but it's too large, so we compute an approximate SVD." The "real N-dimensional points" refer to vectors in the full SVD, which don't need to be computed if you just want the lower rank SVD.

Is this the referenced paper: http://aclweb.org/anthology/E/E06/E06-1013.pdf ? I'll be honest, I don't want to read it if I'm not missing anything.


3396530
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Hey! I have this friend in Dnepropetrovsk who's working on this exact problem. Looks like you published first.

Some coincidence, huh? :moustache:

3396415 Well obviously they can't hide PCA. I meant hoarding whatever specific variant/set of assumptions/shortcut let them speed up the process enough for Netflix's purposes while sacrificing the least effectiveness.

Well, this is fascinating, and a lot more engaging than most philosophy I've dealt with. (Of course, that was metaphysics, which meant that some people were arguing that they themselves didn't exist.) I look forward to the DSM-style art analysis.

It seems like the interesting stuff, like say "number of unique characters" or even more "number of main characters", would be inaccessible using this method without a high level of abstraction that I don't see coming out of it without some "agent recognizing" heuristic that one might assume people have, and that relies on more than just counting the subjects of a sentence.
It is possible that I'm wrong and that you can just predict these things using lower-level concepts, but it seems that if someone wanted to read a story where the protagonist dies in the end, you wouldn't be able to find it with this method unless you generated a perfect node by accident.

PresentPerfect
Author Interviewer

You got math in my English. ;_;

3396173
You'd better not be joking. :V

3396546 Is Netflix better at giving hybrid recommendations? (I don't use Netflix, so I wouldn't know.)

I'm going to compress my reaction to this post into as few words as I can. In fact, I'll do it in a glyph: :heart:

Oh, and your description of PCA was particularly cogent and elegant. I wish you wrote my textbooks. :twilightsmile:

I am on tenterhooks to see what you discovered. Computational literary analysis is an absolutely fascinating idea and I am most amused imagining how it may affect postmodernists. Spontaneous combustion, I shouldn't wonder.

3396772
Oh, I can't wait until he puts all of this into a book. I hear Metro-Goldwyn-Moskva has already got the rights, and there's some saucy casting rumors about who they'll get to play the neural network.

:pinkiehappy:

I want to claim I’ve discovered the first principal dimension of theories of art. I’m going to show (hopefully) that the position of different cultures on this dimension predicts something important about what type of art they value.

Bad Horse, why are you posting this on Fimfiction?? It sounds like you could potentially break new theoretical ground, or at the very least provide something very meaty to chew on. It belongs as an article in a magazine somewhere--something official. Not wasted here. You can come back and tell us after you've published your ideas in a more secure location. We'll wait.

Furthermore, Kissiness needs to be an official tag. It needs to be.

3397337 Naw, it's not well-supported, and kind of obvious, and I want to see what you guys think.

3396922 A Kohonen map is similar, but you use it to map feature vectors onto nodes, usually in a 2D array. The nodes stay in one place; the mappings change. It's not a way of discovering where objects are in a metric space, because you have to already be able to measure the distance between any two feature vectors to use it. It's a way of making an array of dumb nodes map themselves onto a metric space.

3396771

I think you can claim that PCA is a good chunk of any modern machine learning technique since it's so general a term, but the winning technique of the Netflix competition used more than just PCA variants. KNN, for example. Also, I think you have the wrong entry since the one you linked wasn't the winning one. Here's the (last, 2009) winning entry

The winning entry used a combination of many different variants of, I forget, maybe a dozen different approaches. But the PCA (actually SVD) approach was by far the best single approach, except for some guys at U Toronto who claimed to have gotten similar performance from a restricted Boltzmann machine, a result nobody could duplicate.

I linked to the Simon Funk blog post because he's the guy who figured it out, and instead of keeping it secret until the contest was over, he announced it on his blog. Everybody copied him. He didn't get any of the money.

To me, that summary sounds like "Ratings are linearly separable into a user preference matrix and a movie feature matrix. SVD can give us a factorization, but it's too large, so we compute an approximate SVD." The "real N-dimensional points" refer to vectors in the full SVD, which don't need to be computed if you just want the lower rank SVD.

Yes, exactly. My question is whether you can use PCA in this case when you don't have any of the movie or user feature vectors to start with. Everyone says PCA and SVD are equivalent, but I don't know how to use PCA in this case.

Is this the referenced paper: http://aclweb.org/anthology/E/E06/E06-1013.pdf ? I'll be honest, I don't want to read it if I'm not missing anything.

That is the referenced paper, but Simon Funk's explanation is much simpler. I recommend reading it, because it's really cool. You never compute the U, S, and V matrices normally derived in SVD; you just compute two matrices A and B such that AB = USV ~ your ratings matrix R. If A x B = R, then you would have 'done' the SVD. So you compute the error A x B - R. Then you differentiate that error equation and use it to hillclimb.

You don't get the eigenvalues, but you could probably figure them out from the constraint that U and V are both orthogonal matrices.
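Concretely, that hillclimb with one feature trained at a time on the residuals looks roughly like this (toy numpy, not Funk's actual code; the learning rate would need tuning on real data):

```python
import numpy as np

def funk_style(R, mask, k=10, lr=0.005, epochs=100, seed=0):
    """R: ratings matrix, with arbitrary values wherever mask == 0.
    Fits A (users x k) and B (k x movies) so that A @ B ~= R on the observed
    entries, one feature at a time, each trained on the residual the
    previous features leave behind."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    A = np.zeros((n_users, k))
    B = np.zeros((k, n_movies))
    residual = np.where(mask, R, 0.0)
    for f in range(k):
        a = rng.normal(scale=0.1, size=n_users)
        b = rng.normal(scale=0.1, size=n_movies)
        for _ in range(epochs):
            err = mask * (residual - np.outer(a, b))   # only observed entries count
            a += lr * err @ b            # gradient steps on the squared error
            b += lr * err.T @ a
        A[:, f], B[f] = a, b
        residual -= mask * np.outer(a, b)              # train the next feature on what's left
    return A, B
```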

3397359

not well-supported

*looks through all the description of the math behind it*
Sure, not supported.

kind of obvious

Yes, in fact, twenty of us woke up with just this idea yesterday morning. We don't know what took you so long.

Need I say more?

This is why political debates rarely change anyone’s mind; the people on opposite sites literally cannot understand each other. Their brains automatically compress their opponents’ statements into dimensions in which the distinctions they’re making are lost.

Isn't figuring out the, ah, 'dimensions' that other people are operating on, and then framing your arguments to make the most sense within their views, exactly how one conducts a debate? Otherwise people are just quoting opinions at each other.

Okay this:

Sexiness = 30*Sex + 10*Mature - 300*Everyone + count(grope) + count(hard) + count(soft) + 3*count(erect) + ...

Violence = 10*Gore + 5*Mature + 3*count(bloody) + 5*count(battle) + count(hard) - count(soft) - count(fuzzy)

Second_Person = 10*count(you) + 5*count(your) - 10*count(I)

Where do all those coefficients come from?

They look like weighting factors but if so, how do you decide on how to weight?

And if not, what do they do?

3396863 Yes. A real recommendation system using this approach would train on user data. Then, if there's some feature that's important to a lot of people, like "the main character dies in the end", that would stick out in the data, and the system would learn a feature that had high values for movies where the main character dies in the end and for users who like those movies.

I wanted to get across what I mean when I call something a principal dimension, not explain how a good recommendation system should work.

3397540 No, it was pretty clear what you were going for. I was just interested in whether I was completely off base in my assumption, and whether somebody had built an app with this that could count the number of characters in a story who suffer nervous breakdowns.

Next time someone tells me that nothing deep occurs on my little pony fan sites, I am showing them this.

This is a topic I know really well.

I want to say things about it but I am tired. But: I will do that thing later.

3396772 Is this a joke, a pop-culture allusion, or reality? :applejackunsure:

3396976

Oh, and your description of PCA was particularly cogent and elegant. I wish you wrote my textbooks. :twilightsmile:

Except that you couldn't actually do PCA given my description.

I am on tenterhooks to see what you discovered.

Eh, I'm probably over-hyping it. If I found multiple dimensions and multiple affected properties, that would be cool. But there are too many confounding factors.

3397774

Except that you couldn't actually do PCA given my description.

Shush, and take your compliment. :twilightsmile:

And you are not overhyping it. I'm not expecting it to cure postmodernism or make Gertrude Stein retroactively make sense. I'm expecting it to be interesting and it can't fail to be that.

3397774

I...wh-what...you've never heard of this?

"...be that as it may, some of you may have had occasion to run into mathematicians and to wonder therefore how they got that way..."

3397694

If I got that right, what you said is basically "Kohonen Maps have a fixed topology", as the distance between nodes is determined entirely by the topology.

A Kohonen mapping preserves topology in mapping from the input space onto a neural network (when it's successful). It maps points in a metric space onto a fixed set of "neurons", which have a known, fixed distance or neighborhood predicate between any two of them. But you have to already know the metric space before you start. The process I described reconstructs a metric space, given pairs of distances, and there are no neurons.

3396415

A Netflix employee said something (in Wired, I think) to the effect that it turns out that what users want isn't to predict which movies they'd rate highly, but he didn't say what it was they did want.

I'll hazard a guess on this. To avoid linguistic awkwardness, I'm going to abstract users as "Bob". Bob doesn't want Netflix to predict what movies he'll rate highly. He might think that that's what he wants, but it isn't quite true. Bob wants Netflix to predict the movies that Bob will predict he will rate highly (and ideally also will in fact rate highly). More concretely, if Bob usually watches movies about giant robots and alien invasions, and there's some common thread there that leads to a prediction that Bob will like Memoirs of a Geisha, Bob doesn't want to know that. He'll take one look at the recommendation for Memoirs of a Geisha and decide that the system is clearly broken, even if he would in fact like that movie. It's too far outside his comfort zone.

In short, users want comfortable recommendations.

The intermediate-level concepts produced, like “pretty”, “harmonious”, and “cruel”, are not real things. They are fake summary dimensions, each a sum (or function) of lots of real dimensions, that capture a lot of the differences between real things.

Something that should be obvious to anyone that read Hogfather:

“All right," said Susan. "I'm not stupid. You're saying humans need... fantasies to make life bearable."

REALLY? AS IF IT WAS SOME KIND OF PINK PILL? NO. HUMANS NEED FANTASY TO BE HUMAN. TO BE THE PLACE WHERE THE FALLING ANGEL MEETS THE RISING APE.

"Tooth fairies? Hogfathers? Little—"

YES. AS PRACTICE. YOU HAVE TO START OUT LEARNING TO BELIEVE THE LITTLE LIES.

"So we can believe the big ones?"

YES. JUSTICE. MERCY. DUTY. THAT SORT OF THING.

"They're not the same at all!"

YOU THINK SO? THEN TAKE THE UNIVERSE AND GRIND IT DOWN TO THE FINEST POWDER AND SIEVE IT THROUGH THE FINEST SIEVE AND THEN SHOW ME ONE ATOM OF JUSTICE, ONE MOLECULE OF MERCY. AND YET—Death waved a hand. AND YET YOU ACT AS IF THERE IS SOME IDEAL ORDER IN THE WORLD, AS IF THERE IS SOME...SOME RIGHTNESS IN THE UNIVERSE BY WHICH IT MAY BE JUDGED.

"Yes, but people have got to believe that, or what's the point—"

MY POINT EXACTLY.”

3397372

http://sifter.org/~simon/journal/20061211.html

That is pretty interesting.

My question is whether you can use PCA in this case when you don't have any of the movie or user feature vectors to start with.

Simon's algorithm rewritten:
One-hot boolean input layer A (user)
N-node linear hidden layer B (intermediate representation of user, N=40)
One-hot boolean input layer C (movie)
N-node linear hidden layer D (intermediate representation of movie, N=40)
1 node output layer = B dot D (rating)
Gradient descent with RMS error, learn one feature per hidden layer at a time

You have two linear PCAs followed by one multilinear PCA.

3398506 What does "one-hot" mean?
Simon's algorithm doesn't really do either PCA or SVD. SVD finds a set of matrices U, S, V such that USV = ratings matrix R. A low-rank approximation finds low-rank U, S, V. Simon's algorithm instead finds UQ, transpose(Q)xV, where Q x transpose(Q) = S.

You may be able to figure out what Q is from UQ and QV, since you know that U x transpose(U) = I, and the same for V.

My question was whether a similar gradient search approach could "use PCA". PCA is a different process, which finds the eigenvalues of X * transpose(X), where X is a matrix of feature vectors. We don't start with any feature vectors.


Wow, haven't heard from you in a while, BH. You been okay?

3398791
"V" and "V-transposed" below use the convention in your post, not the one in Wikipedia.

I think S is just going to be a diagonal matrix where the ith diagonal element is magnitude(ith row of U) * magnitude(ith column of V). If you factor out S (i.e. divide out the magnitude of each row of U and each column of V), then Simon's algorithm computes SVD assuming that each row of U ends up being orthogonal and each column of V ends up being orthogonal.

(X) Simon's algorithm does have feature vectors. Each movie is described by a vector of user ratings, and each user is described by a vector of movie ratings.

I'm still not sure what your question is, but I think I'm down to two possibilities: (1) "Is it possible to rephrase Simon's algorithm as an eigenvector problem?" or (2) "Is it possible to compute the linear PCA of to-be-left-uncomputed data?"

I haven't spent enough time with SVD to know the answer for (1) with reasonable certainty, but I think you'll end up with this:
U = principal components of either movies or users using the feature vectors in (X).
V-transposed = principal components of the other using the feature vectors in (X).
(Edit: I confirmed that this works with a simple example using numpy.)


The answer to (2) is "Yes, though it's not useful in the case where the to-be-left-uncomputed features are a linearly transformed version of the original." You'd just be adding an extra linear transformation between the raw data and the linear PCA, which is itself a linear transformation. There's nothing the aggregate transformation can do that a single transformation can't. With nonlinear PCA or nonlinear transformations, PCA can be represented by an autoencoder. The to-be-left-uncomputed features would be represented by the hidden layers.

One-hot means there is a single "1" element, and the rest are "0".

3400546

"V" and "V-transposed" below use the convention in your post, not the one in Wikipedia.

The Wikipedia convention is the usual one. I forgot to say "transpose".

I think S is just going to be a diagonal matrix where the ith diagonal element is magnitude(ith row of U) * magnitude(ith column of V). If you factor out S (i.e. divide out the magnitude of each row of U and each column of V), then Simon's algorithm computes SVD assuming that each row of U ends up being orthogonal and each column of V ends up being orthogonal.

Let's see: Each row in U is orthogonal to each other row, and the same for each column. Supposedly. If we did SVD. Also, the magnitude of each row of U and V is 1. So you mean the diagonal element of S is the magnitude of what I called ith row of UE * ith column of EV (really EV*). Right. So you can easily recover the eigenvalues.

Though I don't think anybody ever checked to see that the particular matrix decomposition this comes up with is SVD. It seems like it ought to be. But, fun fact, if you don't train the features one-by-one, each one on the residual left after a prediction used all the previous ones, but instead train them all simultaneously on the same error, you get equally good results! I once said that was doing multiple linear regression rather than SVD, which I think is probably right.

(X) Simon's algorithm does have feature vectors. Each movie is described by a vector of user ratings, and each user is described by a vector of movie ratings.

You can use those as feature vectors, but would you get the results you wanted if you ran PCA on that ratings matrix? I... guess you would. It isn't as obvious what you're doing, though. And how would you do PCA on a sparse matrix?

When Simon says this isn't subject to local minima, it turns out he's wrong for some approaches. When you start training a feature, and you've initialized the features to have an average > 0 (which is a good thing, IIRC because the ratings matrix is supposed to be invertible for this to work), there's a not-quite-symmetry between training the feature as meaning "has this feature" (if value is high) or "lacks this feature" (if value is high). The former gives better results.

Also, Simon Funk's real name is Brandynn Webb, which IMHO is an even cooler name than Simon Funk.

Now that it's far too late for anypony else to see what I'm writing, here's what I wanted to say.

I'm more familiar with this topic through Latent Semantic Analysis (LSA), which is a specific approach of the same concept. LSA is commonly used in natural language processing. The "components" in LSA are derived by beginning with one dimension for each term, then reducing the dimensions significantly through SVD. The goal is to end up with X number of dimensions that still retain enough information to discriminate effectively between categories.

This process is amazingly accurate for many purposes. It's how the GRE assesses the written portion of the exams. There's a human rater in the process to be safe, but the reliability of LSA is as high as the inter-rater reliability between two expert human graders. This is true even though LSA for the GRE looks only at unigrams, so if you took a high-ranking essay and rearranged all the words, it would still be ranked just as highly. This sounds ridiculous, but it works: winning test takers who use certain words at certain rates on a particular topic are also the ones more likely to organize those words appropriately, so the order can be removed entirely, dimensions crunched, and still the rating is as accurate as can be.

The interesting thing about LSA (and PCA in general) is that the "dimensions" you end up with are difficult and sometimes impossible to label. The categories you want to identify are easy enough, because that's where you begin from: perhaps I want to categorize stories into "funny", "sad", and "weird" buckets. But the dimensions themselves may seem spurious: one might expect to see "pony" and "foal" in the same dimension, but something odd like "pony" and "styrofoam" might appear together even when the dimensions are hardly reduced, and there will not be any comforting non-statistical insight to explain how lumping those two terms together assists with the categorization process. Frequently with LSA you can't even identify a theme to some of the dimensions.
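A minimal sketch of that pipeline, for anyone who wants to poke at it: unigram term-document counts, then a truncated SVD (toy data, obviously nothing GRE-grade here):

```python
import numpy as np

docs = ["the pony kissed the other pony",
        "blood and battle in the old castle",
        "the foal snuggled close"]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document count matrix: one row per term, one column per document.
X = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# Reduce dimensions with SVD; keep only the top k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # each document as a k-dimensional point
```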

3407303

But, fun fact, if you don't train the features one-by-one, each one on the residual left after a prediction used all the previous ones, but instead train them all simultaneously on the same error, you get equally good results!

You'll get the same results minus the orthogonal vectors, I think, though you can get the vectors to be orthogonal by adding (W*W.T - diag(W*W.T))**2 to the error term. Without that, I don't think the results can be used for PCA.

You can use those as feature vectors, but would you get the results you wanted if you ran PCA on that ratings matrix?

SVD finds R = UV* (after multiplying in E), where the R element at the ith row, jth column is given by taking the dot product between the ith row of U and the jth column of V*. Brandynn Webb's algorithm computes an R = UV* where taking the dot product between the ith row of U and the jth column of V* yields the rating given by the ith user to the jth movie. If you construct the matrix of all those ratings, you get exactly the ratings matrix described in 3400546. V* in UEV* (factoring out E again) contains the principal components of that ratings matrix, and U in UEV* contains the principal components of the transpose of that ratings matrix. (Keep in mind that the movie ratings matrix is the transpose of the user ratings matrix.)

And how would you do PCA on a sparse matrix?

Sparse PCA shouldn't be any different from non-sparse PCA, though the Netflix matrix is full of empty values, not zeros. It's because of situations like these that I prefer the neural network representation of statistical algorithms. Linear PCA on an incomplete matrix is this:
Input vector A, masked with M (M is 1 where A exists, 0 otherwise)
N-dimensional hidden vector B = W*A
Output vector C = W.T * B, masked with M
Gradient descent with RMS error, target C = A, train one hidden layer node at a time

All the empty values get properly ignored with this, as they do in Brandynn Webb's algorithm.
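Spelled out as code, that masked reconstruction looks roughly like this: a tied-weight linear autoencoder trained only on observed entries (a toy sketch that trains all k hidden nodes at once rather than one at a time, not what anyone actually shipped):

```python
import numpy as np

def masked_linear_pca(X, mask, k=10, lr=0.001, epochs=200, seed=0):
    """X: data matrix with arbitrary values where mask == 0.
    Learns W (features x k) so that W @ (W.T @ x) reconstructs the observed
    entries of each row x. The columns of W roughly span the principal subspace."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.01, size=(d, k))
    for _ in range(epochs):
        for i in range(n):
            m = mask[i]
            x = np.where(m, X[i], 0.0)       # missing values contribute nothing
            h = W.T @ x                      # hidden representation
            err = m * (W @ h - x)            # error only on observed entries
            # Chain rule through both uses of W (encoder and decoder are tied).
            W -= lr * (np.outer(err, h) + np.outer(x, err @ W))
    return W
```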

When you start training a feature, and you've initialized the features to have an average > 0, there's a not-quite-symmetry between training the feature as meaning "has this feature" (if value is high) or "lacks this feature" (if value is high). The former gives better results.

I'm not sure what you mean here. Can you explain this?

3408687

When you start training a feature, and you've initialized the features to have an average > 0, there's a not-quite-symmetry between training the feature as meaning "has this feature" (if value is high) or "lacks this feature" (if value is high). The former gives better results.

I'm not sure what you mean here. Can you explain this?

I'm working off some notes I made 6 years ago, so I might be completely wrong. This is my attempt to reconstruct what I think they meant:

You don't know ahead of time what the features will mean (and usually never find out). But sometimes they have evident meanings, like "fast-paced", maybe.

I usually initialized features to have positive mean values. I don't remember why, but I certainly experimented with mean zero values. I think zero mean might make the matrix hard to invert?

If a feature corresponds to something that relatively few movies have, it works better if the movies that have it have large values rather than small values.

I think that was what I meant. Anyway, there was definitely a problem under some circumstances with local minima, which I avoided by training first on movies and users with many ratings.

3407429 LSA is one of the techniques used by stylo.

LSA was patented in the 1990s, so I had applications then that I couldn't use it for, because you had to license it. Hopefully the patent has expired now. That was a stupid patent to grant. It would be like patenting the application of a t-test to medical testing.

3412255
Yeah, I'm pretty sure you can use LSA without a patent now.

That sounds so ridiculous. It's a widely-known algorithm. How can you patent that?

3413148 First, algorithms aren't patentable. However, due to a stupidly broad interpretation of an early case about computer software, a computer running an algorithm is considered a device, hence an invention, hence the combination of software implementing an algorithm, loaded onto a computer, taking input and producing output, is patentable. So while you can't patent PCA, if you phrase it in terms of a particular kind of input and output, you can call it a device and patent it.

Next, a patent can be rejected if it is "obvious to an average practitioner of the art". A law case established that something was obvious if and only if there was a “teaching, suggestion or motivation” for it. A later law case established that this meant that either a) someone had already done it, or b) someone had already suggested doing it.

So, for instance, I once processed an application for a patent on the use of linked lists in computer graphics. This is obviously obvious to a practitioner of the arts; a linked list is a basic data structure. It's like if, after one person invented the automobile, someone else patented the use of carbon steel in an automobile. Unfortunately I couldn't reject the application, because I couldn't find, in the 3 hours I had to read the 10-page application, search the literature, and write a response, anybody suggesting using a linked list in a computer graphics algorithm.
