Big data: you can hardly pick up a newspaper without reading about some new scientific or business insight derived from mining previously untouched volumes of digital information. Well, I’m happy to say that genome sequence data, which certainly qualifies as big in both volume and velocity, is joining the party, and in a most meaningful way. When combined with information from medical records, genome data can be mined for new insights into treating disease.
Toward this vision, I have been working with researchers at the University of California, San Diego (UCSD) and have invented the Genome Query Language (GQL), which features three operators that allow error-resilient manipulation of genome intervals. This, in turn, abstracts a variety of existing genomic software tasks, such as variant calling (determining where a person's genome differs from the reference) and haplotyping (attributing each genomic variation to inheritance from the mother or the father). GQL is inspired by the classic database query language SQL and has similar operators; however, GQL introduces a major new operator: the fault-tolerant union of genomic intervals.
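To give a feel for what a fault-tolerant union of genomic intervals might look like, here is a minimal Python sketch. It is not GQL and not the authors' implementation: the function name `tolerant_union`, the `max_gap` parameter, and the reading of "fault-tolerant" as "merge intervals that overlap or sit within a small gap of each other" are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start, end) positions on a single chromosome; hypothetical representation.
Interval = Tuple[int, int]


def tolerant_union(intervals: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge intervals that overlap or whose gap (next start minus
    previous end) is at most max_gap bases."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far away: start a new merged interval.
            merged.append((start, end))
    return merged


reads = [(100, 150), (140, 200), (205, 260), (1000, 1050)]
print(tolerant_union(reads, max_gap=10))  # [(100, 260), (1000, 1050)]
print(tolerant_union(reads))              # [(100, 200), (205, 260), (1000, 1050)]
```

With a tolerance of 10 bases, the small gap between positions 200 and 205 is bridged, so a short stretch of missing or erroneous reads does not split one region into two. With no tolerance, the same input yields three separate intervals.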
To understand how GQL could be used on the Windows Azure platform in the cloud, imagine that a biologist is working on the ApoE gene, which is responsible for forming lipoproteins in the body. Wondering how ApoE gene variations affect cardiovascular disease (CV), the biologist types in a query with the parameters “ApoE, CV” on a tablet computer, just as you might enter a search-engine query. The query is sent to the GQL implementation in the cloud, which returns the ApoE region of the genome in patients with cardiovascular disease. Since the ApoE gene is quite small, the data is processed quickly in the cloud and returned in seconds to the biologist’s tablet. The biologist can then use customized bioinformatics software to mine the data to identify variations.
We have implemented GQL on Windows Azure and used it to query genomic data expeditiously. We have shown, for example, how GQL can be used to query The Cancer Genome Atlas for large structural variations by using only 5 to 10 lines of high-level code. The code took approximately 60 seconds to execute on Windows Azure when run on an 83-gigabyte human genome input file. GQL can also improve existing software by refactoring its queries, significantly speeding up results. It could also make it possible to browse the UCSC Genome Browser by query, not just by genomic location.
To make the GQL implementation run at interactive speeds, two optimizations were crucial: cached parsing and lazy joins. Combined, they sped up query processing by a factor of 100. I encourage interested readers to explore the details of our research (the GQL queries we used, the optimizations we implemented, and the experimental results we achieved) in the Microsoft Research Technical Report: Interactive Genomics: Rapidly Querying Genomes in the Cloud.
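The post does not describe how cached parsing and lazy joins are implemented, so the following Python sketch only illustrates the general ideas the two names suggest. Everything in it is an assumption: the tab-separated "start, end" record format, the helper names `parse_record` and `lazy_overlap_join`, and the use of `functools.lru_cache` and a generator as stand-ins for whatever the real system does.

```python
from functools import lru_cache
from typing import Iterable, Iterator, Tuple

Interval = Tuple[int, int]  # hypothetical (start, end) record


@lru_cache(maxsize=None)
def parse_record(line: str) -> Interval:
    """Parse a 'start<TAB>end' line; a repeated line is served from the cache."""
    start, end = line.split("\t")
    return int(start), int(end)


def lazy_overlap_join(
    left: Iterable[str], right: Iterable[str]
) -> Iterator[Tuple[Interval, Interval]]:
    """Yield overlapping (left, right) pairs one at a time
    instead of materializing the whole join result."""
    right_parsed = [parse_record(line) for line in right]
    for left_line in left:
        a = parse_record(left_line)
        for b in right_parsed:
            if a[0] <= b[1] and b[0] <= a[1]:  # the two intervals overlap
                yield a, b


reads = ["100\t150", "400\t450", "100\t150"]
genes = ["120\t300", "440\t500"]

pairs = lazy_overlap_join(reads, genes)
# The left input is consumed only as far as the first matching pair.
first = next(pairs)
print(first)
```

The caller pulls results on demand, so a query that needs only the first few matches never pays for the rest of the join, and a record that appears more than once is parsed only once.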
Welcome to the UCSC Genome Browser website. This site contains the reference sequence and working draft assemblies for a large collection of genomes. It also provides portals to the ENCODE and Neandertal projects.
We encourage you to explore these sequences with our tools. The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide. The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways. Blat quickly maps your sequence to the genome. The Table Browser provides convenient access to the underlying database. VisiGene lets you browse through a large collection of in situ mouse and frog images to examine expression patterns. Genome Graphs allows you to upload and display genome-wide data sets.
I so have no idea what to do with this, but I still think it's cool as heck. There's got to be a way I can work this into a Zombie novel or CSI kind of show... :P