OLAP is a powerful tool that is widely used by the business community to analyze large financial data sets. Despite the big data sets found in modern biology, OLAP has not been widely adopted by the biology community. We want to change that. BIOLAP is open source OLAP for biology data. We take the open source tools provided by the JPivot project, and extend them to handle biology data types. In our first release, we focus on extensions to analyze large genomic sequence databases.
The basic idea is that data sets are viewed as cubes with hierarchies along each axis. To navigate the cube, we specify an aggregation function to say how to summarize information about groups of cells within the cube. This structure allows us to view large data sets in a highly summarized form, and then expand the data along each axis as needed to see finer grained features. More details can be found here.
In traditional OLAP, we are trying to summarize sets of numbers. In biology, the data we are working with is often non-numeric and the correct way to summarize that data may not be number. As a first example, suppose we are interested in summarizing a set of protein sequences. Usually these sequences are represented as strings of letters and there are several reasonable ways to summarize a group of sequences. We might want a numeric similarity score, we might want text output that shows the most common motifs shared by the sequences, we might even want a scatter plot or a tree diagram which shows how the sequences cluster together. This is just one example. Other common biology types are matrices for microarray data, graphs for chemical structures, or images from various kinds of instruments. For this reason, in order to handle biology data sets we have two main requirements:
Sounds hard? Not if you have the right tools!
As an example, we start with a set of related olfactory and rhodopsin proteins grouped using a cladogram.
For each of these proteins, we have annotations describing their function and taxonomy. We use these annotations as dimensions for a cube and define an aggregation function (see the whitepaper for details) that measures sequence similarity. Here is what this gives us in JPivot:
We start by looking at the most aggregated view of the data, with all proteins grouped together.
We can expand this by protein function dimension. Proteins within the same functional category have a higher similarity.
Expand by taxonomy dimension. Proteins within the same taxonomy class and functional category have an even higher similarity.
The similarity scores reflect what we would expect from the cladogram which was calculated via a full multiple sequence alignment. Drill through to see the underlying mammalian sequence data for any cell. In this case, we drill on the mammalian rhodopsin sequences.
As a real-world test, we next applied this tool to the iProClass database. This database contains over 2 million protein sequences, annotated with GO classification, PIR superfamilies, motifs, protein domains etc. All of these annotations are potentially useful as dimensions for a cube. We chose two obvious ones, the GO protein function hierarchy and the taxonomy / lineage information for the sequence. By restricting our focus to sequences containing these annotations we came up with a total of 44,441 sequences for our cube. In this case, the initial similarity calculation for these 44k sequences took over 10 minutes. Fortunately, the OLAP browser allows us to pre-calculate similarities and store them in a table so that the response time of the viewer is near instantaneous. Again, an OLAP browser allows us to easily navigate this large data set.
Expand one level deep by GO classification and top level Taxonomy classification:
Expand by GO Hierarchy:
From any cell we can drill down to view the underlying data for that group of sequences:
Get feedback from you! Contact us on our forum with suggestions on what kinds of aggregation functions might be useful, or suggestions for applications.
Add more aggregation functions. One such function would be MEME to show conserved motifs in a set of protein sequences.
Either as text,
or as a block diagram,
or as LOGOS output.
More biology examples. Extend the iProClass example further or find other sequence databases where we can pick off useful genomic annotations to use as dimensions for a cube.
JPivot. The best place to start is with the basic JPivot tool. Download this tool along with the mondrian example data. Follow the directions to create an ODBC data connection called MondrianFoodMart. Deploy the the jpivot.war file to your tomcat webapps server directory.
BIOLAP Source Code + Examples.
Huyn N., OLAP
for Clinical Trials, Proceedings of the 27th
VLDB Conference, Roma, Italy, 2001.
Abstract: On-line analysis of clinical study data has not benefited from recent advances in OLAP technologies.We examine the analysis requirements posed by the biotech domain that are not met by traditional OLAP. To accommodate these new requirements, we propose the concept of Scientific OLAP which applies more broadly to data analysis in controlled scientific experiments. We describe our experience implementing such a system for the support of biomarker discovery and we identify some key challenges that must be overcome before OLAP can be widely adopted in the biotech industry.
Vinga S., Almeida J., Alignment-free
sequence comparison-a review, Bioinformatics:2003 Mar
Abstract: Genetic recombination and, in particular, genetic shuffling are at odds with sequence comparison by alignment, which assumes conservation of contiguity between homologous segments. A variety of theoretical foundations are being used to derive alignment-free methods that overcome this limitation. The formulation of alternative metrics for dissimilarity between sequences and their algorithmic implementations are reviewed.
Kriegel H., Seidel T., Shape-based Similarity Search in Large Biomolecular Databases.
Abstract: As a new approach to explore large structural molecular databases, we suggest a shape-based similarity model for 3D objects including proteins, DNA structures, or medical images. On top of the model, we have developed efficient index-based algorithms that support mining in very large biological databases which are expected to rapidly grow in the context of todays and tomorrows genomic, post-genomic and proteomic research activities. Experiments on a large 3D protein database demonstrate both, a good classification accuracy of our model (above 90% like FSSP or CATH) and the good performance of our flexible query processor.
Le S., Ho T., Phan T., A
Novel Graph-Based Similarity Measure for 2D Chemical Structures,
Genome Informatics 15(2): 82-91 (2004).
Abstract: In this paper, we propose a graph-based method to measure the similarity between chemical compounds described by 2D form. Our main idea is to measure the similarity between two compounds based on edges, nodes, and connectivity of their common subgraphs. We applied the proposed similarity measure in combination with a clustering method to more than eleven thousand compounds in the chemical compound database KEGG/LIGAND and discovered that compound clusters with highly similar structure compounds that share common names, take part in the same pathways, and have the same requirement of enzymes in reactions. Furthermore, we discovered the surprising sameness between pathway modules identified by clusters of similar structure compounds and that identified by genomic contexts, namely, operon structures of enzyme genes.
system for motif discovery from a set of sequences.