Google your DNA

30 Jun 2010

Who will be the first to build a Google for genetic information? The arXiv blog reports on a paper* in which a Chinese researcher from SOSO.com, the third-largest search engine company in China, presents a technique that borrows the method used to index Chinese text for search and applies it to genetic "words".

In Chinese for example, the percentage of 1-gram words that appear only once is less than 50 per cent, the percentage of 2-gram words that appear only once is about 50 per cent and the percentage of 3-gram words is more than 50 per cent. So 2-gram words are a good average.

Liang applies the same criteria to find the average length of words in the genomes of arabidopsis, aspergillus, the fruit fly and the mouse. And he finds that a good average word length is about 12 letters. So the best way to index genome data is to look for 12-grams, he says.
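To make the criterion concrete, here is a rough Python sketch of how one might measure it. It assumes that "percentage of n-gram words that appear only once" means the share of distinct n-grams with a count of exactly one, and that a "good" word length is the n whose share lands closest to 50 per cent. Both readings are my interpretation of the quoted passage, not necessarily the paper's exact definition, and the function names `singleton_fraction` and `pick_word_length` are made up for illustration.

```python
from collections import Counter


def singleton_fraction(text: str, n: int) -> float:
    """Share of distinct n-grams in `text` that occur exactly once.

    Assumption: n-grams are overlapping character windows of length n.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    if not counts:
        return 0.0  # text is shorter than n, so there are no n-grams
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


def pick_word_length(text: str, max_n: int = 16, target: float = 0.5) -> int:
    """Return the n in 1..max_n whose singleton share is closest to `target`.

    Assumption: "closest to 50 per cent" is the selection rule.
    """
    return min(
        range(1, max_n + 1),
        key=lambda n: abs(singleton_fraction(text, n) - target),
    )
```

On a real genome you would pass in the full sequence as one string; on a toy string like `"ACGTAC"` the numbers are only useful for checking that the counting works.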

None of this needs any new technology to complete. Liang says that the open source search engine Lucene is the perfect forum in which to do the work and, impressively, has even used it to build a rudimentary bioinformatics search engine himself.
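For a feel of what such an index does, here is a minimal sketch of an inverted index over 12-grams in plain Python. It is a stand-in for illustration only: it does not use Lucene and is not Liang's code. The names `kmers`, `build_index` and `search`, and the ranking by number of shared 12-grams, are my own assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

K = 12  # word length suggested in the quoted passage


def kmers(seq: str, k: int = K) -> List[Tuple[int, str]]:
    """Return (position, word) pairs for every overlapping k-letter window."""
    seq = seq.upper()
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_index(sequences: Dict[str, str], k: int = K) -> Dict[str, List[Tuple[str, int]]]:
    """Map each k-letter word to the (sequence id, position) pairs where it occurs."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos, word in kmers(seq, k):
            index[word].append((seq_id, pos))
    return index


def search(index: Dict[str, List[Tuple[str, int]]], query: str, k: int = K) -> List[Tuple[str, int]]:
    """Rank sequences by how many distinct query words they contain."""
    hits: Counter = Counter()
    for word in {w for _, w in kmers(query, k)}:
        for seq_id in {sid for sid, _ in index.get(word, [])}:
            hits[seq_id] += 1
    return hits.most_common()
```

Usage would look something like `search(build_index({"a": "ACGTACGTACGTAA"}), "ACGTACGTACGT")`, which returns the ids of the sequences sharing at least one 12-gram with the query. A real search engine adds scoring, compression and on-disk storage on top of this idea.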

You can find the code here: DNA search engine.

* Wang Liang, "How to build a DNA search engine like Google?", arXiv:1006.4114v3


My name is Erik Stattin and this is my blog. I write about digital culture, more or less. Feel free to send me tips. Contact me at erik.stattin@gmail.com. I am mymarkup on Twitter and Delicious.