Finding relevant topics: text analysis with Python

What do you talk about? What are we interested in?
If you were given access to some network, how would you identify what its members talk about?

David Blei made a nice video at Google which I already posted on

It turns out that the method described, Latent Dirichlet Allocation (LDA), can be extended in interesting directions, for instance to sentiment analysis, which enables automated news trading. LDA can also be implemented using Gibbs sampling, which is a general method for inferring latent variables. So I decided to give it a crack, and to get started with some Python as well.

As an example, I generate documents that are actually images.
The topics are then groups of pixels. For instance, the first topic ‘talks’ about pixels 1, 2, 3, 4 and 5, and the second topic about pixels 11, 12, 13, 14 and 15.

From there I generate a set of documents. For each document, I draw a sample P from a Dirichlet distribution. P is a vector that represents the mixture of topics this document will speak about. Then, for each word in the document, I pick a topic from the mixture, and from that topic's word distribution I pick a term (that is, a pixel).
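The generative process above can be sketched in a few lines of NumPy. This is only an illustration, not the code from my repository: the 5x5 image size, the choice of one topic per image row, and the names make_row_topics and generate_documents are all assumptions made for the example.

```python
import numpy as np

def make_row_topics(side=5):
    """Hypothetical topics: topic k is uniform over the pixels of row k
    in a side x side image (pixels are numbered from 0 here)."""
    topics = np.zeros((side, side * side))
    for k in range(side):
        topics[k, k * side:(k + 1) * side] = 1.0 / side
    return topics

def generate_documents(topics, n_docs=100, doc_len=50, alpha=1.0, seed=0):
    """Draw documents from the LDA generative model.

    topics: array of shape (n_topics, n_terms), each row a distribution over terms.
    Returns a list of integer arrays, one array of term (pixel) ids per document.
    """
    rng = np.random.default_rng(seed)
    n_topics, n_terms = topics.shape
    docs = []
    for _ in range(n_docs):
        # P: the mixture of topics for this document
        p = rng.dirichlet(alpha * np.ones(n_topics))
        # one hidden topic per word, drawn from the mixture
        z = rng.choice(n_topics, size=doc_len, p=p)
        # one term (pixel) per word, drawn from the chosen topic
        words = np.array([rng.choice(n_terms, p=topics[k]) for k in z])
        docs.append(words)
    return docs

topics = make_row_topics(side=5)
docs = generate_documents(topics, n_docs=10, doc_len=20)
```

A small alpha makes each document concentrate on a few topics, while a large alpha spreads it over many; the value 1.0 here is arbitrary.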

If we run LDA on such a generated set of documents, we find the following “topics”, with the likelihood of the data increasing over the iterations:

Likelihood: -119781, iteration #0
Likelihood: -103233, iteration #2
Likelihood: -89401, iteration #4
Likelihood: -78168, iteration #8
Likelihood: -73634, iteration #12
Likelihood: -71200, iteration #17
Likelihood: -69670, iteration #25
Likelihood: -66950, iteration #35
Likelihood: -65199, iteration #65

So we successfully recover the “topics”.

The crux of Gibbs sampling is this: we have N words, each associated with a hidden topic, and a model for their joint distribution. We want the distribution of the topics given those words, or at least a sample from it. It turns out that we can easily express the distribution of one topic given the rest (that is, given the other topics and the words; cf. the papers below). Gibbs sampling means starting from a random assignment of the N topics, then resampling one topic at a time from its conditional distribution given all the others. Repeating such sweeps many times eventually gives you a sample from the joint distribution of the N topics given the words.
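To make the sweep concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA. It is not the code from my repository; the function name gibbs_lda, the symmetric priors alpha and beta, their default values and the toy corpus are assumptions chosen for illustration. The conditional used is the standard one for collapsed LDA: the probability of topic k for a word is proportional to (count of k in the document + alpha) * (count of the term in k + beta) / (total count of k + n_terms * beta), with the current word excluded from all counts.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_terms, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of sequences of term ids in [0, n_terms).
    Returns the topic assignments z and the estimated topic-term distributions phi.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))  # topic counts per document
    n_kw = np.zeros((n_topics, n_terms))    # term counts per topic
    n_k = np.zeros(n_topics)                # total word count per topic

    # Start from a random point: one random topic per word.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this word from the counts.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Distribution of this one topic given all the others and the words.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_terms * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    # Point estimate of each topic's distribution over terms.
    phi = (n_kw + beta) / (n_k[:, None] + n_terms * beta)
    return z, phi

# Toy corpus: terms 0 and 1 tend to co-occur, as do terms 2 and 3.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
z, phi = gibbs_lda(toy_docs, n_topics=2, n_terms=4, n_iter=20)
```

Each row of phi can be reshaped into an image to look at a recovered topic. The pure Python loops are slow on a real corpus; the sketch favours readability over speed.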

The two detailed papers I found on the subject are “Parameter estimation for text analysis” and “Distributed Gibbs sampling of Latent topic model”, and the Python code can be found on my GitHub page: http://github.com/nrolland/pyLDA. Python has a huge number of libraries already implemented and is a big time saver.
