Weekly links for 10/08/29

joshuaclayton’s blueprint-css at master – GitHub
This is a CSS framework designed to cut down on your CSS development time. It gives you a solid foundation to build your own CSS on
SAX (Symbolic Aggregate approXimation)
Sax: Symbolic Aggregate approXimation — SAX is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure. In classic data mining tasks such as clustering, classification, index, etc., SAX is as good as well-known representations such as Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT), while requiring less storage space. In addition, the representation allows researchers to avail of the wealth of data structures and algorithms in bioinformatics or text mining, and also provides solutions to many challenges associated with current data mining tasks. One example is motif discovery, a problem which we recently defined for time series data. There is great potential for extending and applying the discrete representation on a wide class of data mining tasks. Source code has “non-commercial” license
jbooktrader – Project Hosting on Google Code
Introduction à Git pour les gens normaux | E-vidence
A very efficient introduction to Git, in French.
Particle Filters
author: Simon Godsill, University of Cambridge
Finding relevant topics : text analysis with python

What do you talk about? What are we interested in?
Given access to some network, how can you identify what its members talk about?

David Blei made a nice video at Google which I already posted on

It turns out that the method described, Latent Dirichlet Allocation (LDA), can be extended in interesting directions, for instance to sentiment analysis, which enables automated news trading. LDA can also be implemented using Gibbs sampling, which is a general method for inferring latent variables. So I decided to give it a crack, and to get started with some Python as well.

As an example, I generate documents that are actually images.
The topics are then groups of pixels. For instance, the first topic ‘talks’ about pixels 1,2,3,4,5, the second topic about pixels 11,12,13,14,15:

From there I generate a set of documents. For each document, I draw a sample P from a Dirichlet distribution. P is a vector that represents the mixture of topics this document will speak about. Then, for each word of the document, I pick a topic from the mixture, and from that topic I pick a term (aka a pixel) according to the topic’s word distribution.
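The generative process above can be sketched in a few lines of Python. This is only an illustration under assumptions of mine, not the code from the repository: the contiguous pixel blocks, the 25-pixel images, the symmetric Dirichlet parameter of 1.0 and the helper names (sample_dirichlet, make_topics, generate_document) are all hypothetical choices.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def make_topics(n_topics, pixels_per_topic, n_pixels):
    """Toy topics (assumed layout): topic k is uniform over its own block of pixels."""
    topics = []
    for k in range(n_topics):
        block = range(k * pixels_per_topic, (k + 1) * pixels_per_topic)
        topics.append([1.0 / pixels_per_topic if p in block else 0.0
                       for p in range(n_pixels)])
    return topics

def generate_document(topics, alpha, n_words, rng):
    """Return (mixture, words, word_topics) for one synthetic document."""
    mixture = sample_dirichlet(alpha, rng)  # P: topic proportions of this document
    words, word_topics = [], []
    for _ in range(n_words):
        # Pick a topic from the mixture, then a pixel from that topic.
        k = rng.choices(range(len(topics)), weights=mixture)[0]
        w = rng.choices(range(len(topics[k])), weights=topics[k])[0]
        words.append(w)
        word_topics.append(k)
    return mixture, words, word_topics

rng = random.Random(0)
topics = make_topics(n_topics=5, pixels_per_topic=5, n_pixels=25)
documents = [generate_document(topics, [1.0] * 5, 100, rng) for _ in range(10)]
```

Each entry of documents keeps the hidden topic of every word next to the word itself, which is handy for checking later whether an inference method recovers the structure.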

If we run the LDA on such a generated set of documents, we find the following “topics”, with increasing likelihood for the data :

Likelihood : -119781, iteration #0

Likelihood : -103233, iteration #2

Likelihood : -89401, iteration #4

Likelihood : -78168, iteration #8

Likelihood : -73634, iteration #12

Likelihood : -71200, iteration #17

Likelihood : -69670, iteration #25

Likelihood : -66950, iteration #35

Likelihood : -65199, iteration #65

So we successfully recover the “topics”.

The crux of Gibbs sampling is this: we have N words, each associated with a hidden topic, and a model for their joint distribution. We want access to the distribution of the topics given those words, or at least to be able to sample from it. It turns out that the distribution of one topic given the rest (that is, given the other topics and the words, cf. the docs below) is easy to express. Gibbs sampling means starting from a random point for the N topics, then resampling one topic at a time from that conditional distribution; after many sweeps over the N topics, this gives you (approximately) a sample from the joint distribution of the N topics given the words.
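As a rough sketch of that loop, here is a small collapsed Gibbs sampler for LDA using only the standard library. It is not the code from my repository: the function name gibbs_lda, the hyperparameters alpha and beta, and the toy documents are assumptions made for the illustration. The conditional used is the usual one for collapsed LDA, proportional to (count of topic k in the document + alpha) times (count of the word in topic k + beta) divided by (total words in topic k + V * beta).

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, n_sweeps, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is in topic k
    topic_total = [0] * n_topics                              # words assigned to topic k
    assignments = []

    # Start from a random point: give every word a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take this word's current topic out of the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional of this topic given all the others.
                weights = [
                    (doc_topic[d][j] + alpha) * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                # Resample the topic and put it back into the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word

toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 5, 3]]
assignments, doc_topic, topic_word = gibbs_lda(toy_docs, n_topics=2, vocab_size=6, n_sweeps=50)
```

Normalising each row of topic_word (after adding beta) gives an estimate of the word distribution of each topic, which is what gets displayed as a “topic”.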

The two detailed papers I found on the subject are “Parameter estimation for text analysis” and “Distributed gibbs sampling of Latent topic model”, and the Python code can be found on my GitHub page, http://github.com/nrolland/pyLDA. Python has a huge number of libraries available and is a big time saver.

Weekly links for 10/02/13

3. An Informal Introduction to Python — Python v2.6.4 documentation
The Python documentation.
Learning Python, Linux, Java, Ruby and more with Videos, Tutorials and Screencasts
Showmedo is about learning and (Free and) Open-source software (FOSS). We were inspired to start Showmedo by watching some very effective web video-tutorials/screencasts. These convinced us that web-videos can be a great way to quickly and efficiently acquire knowledge. It can even be fun, or at least painless. For some things there is no substitute to seeing it done.

notes

  • gnuradio: a set of software routines for signal processing. It enables you to sample a wide range of the radio spectrum and decode it in software, providing radio, GSM, DECT and GPS functionality with a single piece of hardware. Pretty awesome.
  • palantir: a software company that provides software for data mining and flexible correlation studies.

The anonymous financial room

That would leave only 5.7 per cent of Volkswagen’s ordinary shares
available to be traded on the market. However, hedge funds and other
traders had between them short sold shares equivalent to 12.9 per cent
of the total, and in consequence were obliged to buy and return them.
They understandably panicked, and the resultant frantic efforts to buy
Volkswagen shares caused the price to quadruple

http://www.lrb.co.uk/nl/v30/n23/mack01_.html

This story highlights (among other things) the difficulty of creating a market consensus. Had the hedge funds been able to gather and delegate the handling of this short to a common entity, which would then negotiate in their name, the problem would not have been so widespread and painful.

However, when faced with such a case, there are strong incentives not to disclose anything to competitors, for many reasons. This is why a platform that guarantees the authenticity of its participants while keeping them completely anonymous could be useful. With it, people with a shameful problem could at least discuss it without fear that the mere discussion will aggravate the situation.

I wonder what infrastructure can guarantee that kind of “anonymous authentication”…

no time to think

An interesting discussion by a Stanford professor

Star trader

Wall Street - Bull

Creative Commons License photo credit: David Paul Ohmer

August 15, 2008, WSJ: The two-year-old hedge fund founded by former UBS AG star trader JW is down about 85% from its inception through July, according to a person familiar with the matter.

August 03, 2006: UBS star trader goes solo with launch of hedge fund. One of the City’s most aggressive traders will soon be snapping even more ferociously at the heels of management as the head of a $5bn (£2.7bn) hedge fund. In his new role, Jon Wood will be set free from potential client conflict at UBS, the Swiss investment bank where he currently heads proprietary trading. Backers have already committed $3bn to his SRM global fund, which will be capped at $5bn and launched on 1 September. UBS, where Mr Wood has worked for the past 17 years, is investing $500m. Investors in SRM Global can choose from two fee structures – a 1 percent management fee to invest for a period of five years and 1.5 percent for three years. SRM Global will retain 25 percent of its profits.

Hey, good thing it was capped at USD 5bn !!

Proof network and knowledge repository

I have long been looking for a way to represent a mathematical corpus on the web. It would be tremendous if one could find on the web the different books written on a subject, in a form that enables verification of theorems and proofs.

Let’s reflect on what a lesson is: mostly, a lesson is about learning, one by one, different entities and the relations that those entities share. It is all about serializing a graph.
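To make “serializing a graph” concrete, here is a minimal sketch in Python. It assumes, for the sake of illustration, that the relations are prerequisites (concept A must be taught before concept B), and it orders the concepts with Kahn's topological sort. The function name lesson_order and the example concepts are hypothetical.

```python
from collections import deque

def lesson_order(prerequisites):
    """Order concepts so that each one comes after everything it depends on."""
    concepts = set(prerequisites)
    for deps in prerequisites.values():
        concepts.update(deps)
    # For every concept, the prerequisites that have not been taught yet.
    remaining = {c: set(prerequisites.get(c, ())) for c in concepts}
    # Reverse edges: which concepts become reachable once c is taught.
    dependents = {c: [] for c in concepts}
    for c, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(c)
    ready = deque(sorted(c for c, deps in remaining.items() if not deps))
    order = []
    while ready:
        c = ready.popleft()
        order.append(c)
        for nxt in sorted(dependents[c]):
            remaining[nxt].discard(c)
            if not remaining[nxt]:
                ready.append(nxt)
    if len(order) != len(concepts):
        raise ValueError("circular dependency between concepts")
    return order

course = {
    "limit": ["real numbers"],
    "derivative": ["limit"],
    "integral": ["limit"],
    "fundamental theorem": ["derivative", "integral"],
}
print(lesson_order(course))
```

Any order produced this way is one valid lesson plan; a given graph usually admits several, which is part of why the same material can be taught in different ways.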

If we were to improve this process of learning, we should identify what takes the most effort for the least added value, and I see 3 big bottlenecks.

  • This process has to occur at a level the audience can understand:

The same lesson can take a day or 5 minutes depending on the knowledge of the audience. One of the big problems hindering the reuse of lessons on a large scale is that every time one wants to adapt the level, a whole rewrite is necessary. Having a formal representation of books would allow for a seamless rewrite of proofs, down to the basic axioms if necessary.

  • Another source of variability is how one reuses the theorem or presentation of another.

Dissemination is currently informal, slow, and based on individual reading. Few ways exist for clever solutions to emerge. A repository of proof steps, which would allow reuse in other contexts, would enable indirect voting and promote the best practices.

  • Finally, another problem is the sheer size of the domain, which can only be tackled by many specialists.

Sometimes the difficulty lies not in the domain itself but in the many ways there are to show the same effect. There is no place for such collaboration to take place in a fruitful way, with people ranging from high school students to PhDs contributing and enriching each other. Wikipedia exists, but it is completely out of scope: what if I want to see the 10 different ways to prove an assertion, and how the theorems used apply in the specific case I am looking at? Wikipedia can’t handle this kind of data explosion, and no one will contribute the level of detail I might need (and that some other person won’t).
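As a toy illustration of the first point, here is a sketch of what such a repository could look like as a data structure. Everything in it is hypothetical: each statement maps to a list of alternative proofs, each proof is just the list of statements it relies on, and the function expand unfolds a proof until it reaches what the audience already knows (or statements with no recorded proof, treated here as axioms).

```python
def expand(statement, proofs, known, seen=None):
    """List, in order, the statements to present so that `statement` is justified
    for an audience that already accepts everything in `known`."""
    seen = set() if seen is None else seen
    if statement in known or statement in seen:
        return []
    seen.add(statement)
    steps = []
    alternatives = proofs.get(statement, [])
    if alternatives:
        # Naive choice: follow the first recorded proof. A real system could
        # rank the alternatives, for instance by votes or by length.
        for premise in alternatives[0]:
            steps.extend(expand(premise, proofs, known, seen))
    steps.append(statement)
    return steps

# Hypothetical repository: C has two recorded proofs, B has one, A and D are axioms.
proofs = {
    "C": [["A", "B"], ["D"]],
    "B": [["A"]],
}
for_beginners = expand("C", proofs, known=set())
for_experts = expand("C", proofs, known={"A", "B"})
```

The same target statement yields a long presentation for one audience and a short one for another, without anything being rewritten by hand.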

So, facing those issues, one might think that the computer science people in universities have found a way? No, and you want to know why? Are they researching ways to share proofs and to have a formal system for proof and representation? No. They are *waaaaayy* beyond that: they are looking to *automate* the creation of proofs. To me this really is the completely wrong way to go, and we should first concentrate on having a formal description system to tackle the 3 points I exposed. *Then*, with such a useful formal description, we will get ammo for automated proof, if we can ever solve it.

Fuse : fewer boundaries

FUSE (Filesystem in Userspace), through filesystems such as SSHFS, enables you to mount a remote filesystem locally.
It is common knowledge that computers are made for lazy people who spend 2 hours finding a way to save 5 minutes. I therefore mount my dedibox on ~/dediboxfs and, magic, I can publish and access all the remote files as if they were just there.

On Macs it’s a gift from Google, to be found there for MacFuse, and there for SSHFS; other filesystems are available elsewhere (ftp, gmail, etc.).

Getting a list of your ActiveRecord models

That’ll do

 

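# Load every model file so that all ActiveRecord subclasses are defined
Dir.glob("#{RAILS_ROOT}/app/models/**/*rb").each { |m| Dependencies.require_or_load m }

# Then list every loaded class that inherits from ActiveRecord::Base
Object.subclasses_of ActiveRecord::Base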