
17 Year-Old designs better micro-search for Twitter etc.

Cool article; it has implications for improved news mining of real-time Twitter results, and perhaps Financial Engineering implications for algo trading.

http://apps.ysf-fsj.ca/virtualcwsf/projectdetails.php?id=2740&switchlanguage=en
Nicholas Schiefer
Apodora: Markov Chain-Inspired Microsearch
Abstract: A novel information retrieval algorithm called "Apodora" is introduced, using limiting powers of Markov chain-like matrices to determine models for the documents and making contextual statistical inferences about the semantics of words. The system is implemented and compared to the vector space model. Especially when the query is short, the novel algorithm gives results with approximately twice the precision and has interesting applications to microsearch.

Related Article with a good interview

http://www.theglobeandmail.com/news...-better-way-to-search-internet/article2118962

A lot of traditional algorithms for information retrieval tend to break down when you apply them to micro-search. The reason is that nearly all existing algorithms make the independence assumption: that all words are completely independent of all other words.

Obviously, that is false, but it’s been shown to work pretty well.

But that assumption breaks down quite badly with micro-search. You do not have room to stuff your text full of synonyms and descriptions of everything you say so a search engine can find it.

For example, say you wanted to search tweets for the word “cat.” If a tweet contains the word “kitten,” that’s not going to be very helpful, because the model assumes cat and kitten are independent, even though they’re not.
 
joel_b :: Thanks so much for sharing. I actually have a friend designing a system that uses an approach like the one you mention.
 
The article actually doesn't go into ANY detail as to how the algorithm is different, except that it connects similar search words... is there a published paper?
 
He already works for IBM, so I would imagine they've already struck a deal with him. You won't see a paper published until he's received a bunch of money for it, if ever. Google won't publish their algo either.

It's written in Python, obviously; I would guess that he is building a large database of sentences/paragraphs, likely from the internet, a thesaurus, or other sources, and developing statistical relationships between all the words. Princeton's WordNet could be used for that as well, I would imagine: http://wordnet.princeton.edu/

Maybe more detail will come out, but I would imagine that a science competition for Grade 11 students wouldn't require him to fully describe the process, let alone really get it. Really blows my Volcano out of the water, hah!
 