"Omics" data include genomics, transcriptomics, proteomics, and metabolomics data, among others, and they are accumulating at an extremely fast pace. So far, there are 387 published complete genomes, 608 ongoing eukaryotic genome projects, and 996 ongoing prokaryotic genome projects. A portion of transcriptomics data is microarray (gene expression) data, whose volume has grown substantially every year. Analyzing and managing these data has become a critical issue. Clustering microarray data groups together genes with similar behavior and helps to interpret gene function. K-means is one of the most popular methods for clustering microarray data because of its high computational performance. However, K-means may converge to a local optimum, and its result depends on the initialization step, which randomly generates the initial clustering. Recently, we developed a new algorithm, a Markov chain correlation based clustering algorithm, for clustering gene expression data. It performs much better than the existing K-means algorithm.
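
The sensitivity of K-means to its random initialization can be illustrated with a minimal sketch. This is plain K-means with random restarts, not the Markov chain correlation based algorithm described above. The toy expression profiles, the function names, and the choice of seeds are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, seed, max_iter=100):
    """Plain K-means; the initial centroids are k randomly chosen profiles."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # converged, possibly only to a local optimum
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster sum of squares: lower means tighter clusters.
    wcss = sum(squared_distance(p, centroids[a])
               for p, a in zip(profiles, assignment))
    return assignment, wcss

def best_of_restarts(profiles, k, seeds):
    """Run K-means once per seed and keep the run with the lowest WCSS."""
    runs = [kmeans(profiles, k, seed) for seed in seeds]
    return min(runs, key=lambda run: run[1])

# Hypothetical expression profiles: six genes measured under three conditions.
profiles = [
    [0.1, 0.2, 0.1], [0.2, 0.1, 0.0],
    [5.0, 5.1, 4.9], [5.2, 4.9, 5.0],
    [9.9, 0.1, 5.0], [10.1, 0.0, 4.8],
]
seeds = list(range(10))
single_runs = [kmeans(profiles, 3, seed) for seed in seeds]
best_assignment, best_wcss = best_of_restarts(profiles, 3, seeds)

for seed, (_, wcss) in zip(seeds, single_runs):
    print(f"seed {seed}: WCSS = {wcss:.3f}")
print(f"best of {len(seeds)} restarts: WCSS = {best_wcss:.3f}")
```

Any spread in the per-seed values reflects the dependence on initialization. Keeping the best of several restarts is a common mitigation, but it does not guarantee the global optimum.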

Classifying microarray data, such as cancer microarray data, to distinguish the classes corresponding to different subtypes of a specific cancer is important for disease diagnosis and prognosis. We have developed a couple of new methods for finding cancer marker genes by merging multiple microarray data sets from different platforms. We have also carefully compared various classification algorithms for gene expression data and found that the Support Vector Machine (SVM) performs best among them. Furthermore, we developed a user-friendly Java GUI application that allows users to perform SVM training, classification, and prediction. We demonstrated that our software can accurately classify phenotypes based on gene expression data.

To address the efficient management of genomic data, I will discuss how we built two databases: BeetleBase, an online genome database for Tribolium castaneum, and ESTMD (the EST model database), an integrated web-based model for the management, analysis, and retrieval of EST biological information.