Finding relevant topics: text analysis with Python
What do you talk about? What are we interested in?
If you were given access to some network, how would you identify what its members talk about?
David Blei made a nice video at Google which I already posted on
It turns out that the method described, Latent Dirichlet Allocation (LDA), can be extended in interesting directions, for instance to sentiment analysis, which enables automated news trading. LDA can also be implemented using Gibbs sampling, which is a general method for inferring latent variables. So I decided to give it a crack, and to get started with some Python as well.
As an example, I generate documents that are actually images: each word is a pixel.
The topics are then groups of pixels. For instance, the first topic ‘talks’ about pixels 1, 2, 3, 4, 5, and topic 2 about pixels 11, 12, 13, 14, 15.
From there I generate a set of documents. For each document, I draw a sample P from a Dirichlet distribution. P is a vector that represents the mixture of topics this document will speak about. Then, for each word in the document, I pick a topic from the mixture, and from that topic I pick a term (aka a pixel) according to the topic's distribution over terms.
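The generative process above can be sketched in a few lines of standard-library Python. This is only an illustration, not the code from the repository: the function names, the symmetric Dirichlet parameter `alpha`, the document length, and the choice of making each topic uniform over its pixels are all assumptions, and only the two topics mentioned above are used (with zero-based pixel indices).

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def generate_documents(topics, n_docs=100, doc_len=50, alpha=1.0, seed=0):
    """Generate documents as lists of pixel indices.

    topics: list of pixel-index lists. Assumption: each topic puts
    uniform probability on its own pixels and zero elsewhere.
    """
    rng = random.Random(seed)
    n_topics = len(topics)
    docs = []
    for _ in range(n_docs):
        # P: the mixture of topics for this document
        p = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            # pick a topic from the mixture, then a pixel from that topic
            k = rng.choices(range(n_topics), weights=p)[0]
            doc.append(rng.choice(topics[k]))
        docs.append(doc)
    return docs

# Hypothetical setup: pixels 1-5 and 11-15 of the post, zero-based.
topics = [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
docs = generate_documents(topics, n_docs=10, doc_len=20)
```

Each document is then a bag of pixel indices, and counting how often each pixel appears gives the "image" for that document.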
If we run LDA on such a generated set of documents, we find the following “topics”, with the likelihood of the data increasing over the iterations:
Likelihood: -119781, iteration #0
Likelihood: -103233, iteration #2
Likelihood: -89401, iteration #4
Likelihood: -78168, iteration #8
Likelihood: -73634, iteration #12
Likelihood: -71200, iteration #17
Likelihood: -69670, iteration #25
Likelihood: -66950, iteration #35
Likelihood: -65199, iteration #65
So we successfully recover the “topics”.
The crux of Gibbs sampling is this: we have N words, each associated with a hidden topic, and a model for their joint distribution. We want the distribution of the topics given those words, or at least a sample from it. It turns out that we can easily express the distribution of one topic given the rest (that is, given the other topics and the words; cf. the papers below). Gibbs sampling means starting from a random assignment of the N topics, then resampling one topic at a time, conditioned on all the others, across all N topics. Repeating this sweep many times eventually yields a sample from the joint distribution of the N topics given the words.
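To make the "one topic at a time" idea concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is not taken from the repository and may differ from it: the function name, the symmetric hyperparameters `alpha` and `beta`, and the count-table layout are assumptions. The conditional used for one topic given the rest is the standard collapsed form, proportional to (document-topic count + alpha) × (topic-term count + beta) / (topic total + V × beta), with the current word removed from the counts.

```python
import random

def gibbs_lda(docs, n_topics, n_terms, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Sketch of collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of term indices in [0, n_terms).
    Returns (phi, z): estimated topic-term distributions and the final
    topic assignment of every word.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]             # words of doc d in topic k
    topic_term = [[0] * n_terms for _ in range(n_topics)]  # times term w is in topic k
    topic_total = [0] * n_topics                           # words assigned to topic k

    # Start from a random point: one random topic per word.
    z = []
    for d, doc in enumerate(docs):
        zd = [rng.randrange(n_topics) for _ in doc]
        z.append(zd)
        for w, k in zip(doc, zd):
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word from the counts...
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_term[k][w] -= 1
                topic_total[k] -= 1
                # ...then resample its topic given all the other topics and words.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_term[j][w] + beta)
                    / (topic_total[j] + n_terms * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each topic's distribution over terms.
    phi = [
        [(topic_term[k][w] + beta) / (topic_total[k] + n_terms * beta)
         for w in range(n_terms)]
        for k in range(n_topics)
    ]
    return phi, z

# Tiny made-up corpus: two documents over a vocabulary of 5 terms.
example_docs = [[0, 1, 2, 0, 1], [3, 4, 3, 4, 4]]
phi, z = gibbs_lda(example_docs, n_topics=2, n_terms=5, n_iter=20)
```

Each row of `phi` is one recovered topic; on the pixel documents, reshaping a row to the image grid would show which pixels that topic talks about.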
The two detailed papers I found on the subject are “Parameter estimation for text analysis” and “Distributed Gibbs sampling of latent topic models”, and the Python code can be found on my GitHub page: http://github.com/nrolland/pyLDA. Python has a huge number of libraries and is a big time saver.






