BCF-SIB Research

Method Development

The emergence of "omics" science has presented challenges in data analysis, both in integrating multiple modes of assays (DNA, RNA, proteins, and everything else) and in integrating heterogeneous, independent datasets (such as multi-center cohorts in clinical studies). We have identified core problems in the following areas:

Availability of well-curated omics data, along with sample annotations such as clinical outcome
Adaptation of rigorous statistical methods with good properties to omics data, particularly to address the generalizability of discoveries made from omics data, such as in biomarker development, where performance tends to decrease as the set of test cases expands
Implementation of effective and efficient computational methods, for both data management and intensive numerical computation

Expression Data Collection and Curation

Although omics data can be viewed as a big matrix of concatenated data with multiple variables (and modes of assays) and samples (from many studies), tremendous effort is required to reformat and standardize mostly unstructured raw data (both private and public) before they can be considered "statistician friendly". We have developed capabilities to collect and curate microarray datasets, with emphasis on the following aspects:

identifying cohorts with non-redundant subjects
consideration of patient selection and characteristics
uniform representation of clinical variables
regular probe remapping according to up-to-date transcriptomic knowledge
quality control and renormalization of available expression measures
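
As an illustration only, three of the aspects above (uniform clinical variables, probe remapping, and renormalization) can be sketched in a few lines of Python. This is a minimal sketch and not the pipeline described here: the function names, the ER-status vocabulary, and the probe and gene identifiers are hypothetical, and the per-gene z-score is a simple stand-in for whatever renormalization a real curation workflow would apply.

```python
from statistics import mean, pstdev

# Hypothetical per-cohort spellings mapped onto one shared vocabulary.
ER_STATUS_MAP = {
    "positive": "positive", "pos": "positive", "+": "positive", "1": "positive",
    "negative": "negative", "neg": "negative", "-": "negative", "0": "negative",
}


def harmonize_er_status(raw):
    """Return 'positive', 'negative', or None when the value is unrecognized."""
    if raw is None:
        return None
    return ER_STATUS_MAP.get(str(raw).strip().lower())


def collapse_probes(expression, probe_to_gene):
    """Average probe-level values into gene-level values for one sample.

    Probes absent from the current mapping are dropped, so rerunning this
    with an updated mapping is what "remapping" amounts to in this sketch.
    """
    by_gene = {}
    for probe, value in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            by_gene.setdefault(gene, []).append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


def standardize_within_cohort(samples):
    """Z-score each gene across the samples of a single cohort.

    Only genes measured in every sample are kept. A gene with zero
    variance is set to 0.0 rather than dividing by zero.
    """
    genes = set.intersection(*(set(s) for s in samples))
    out = [dict() for _ in samples]
    for gene in genes:
        values = [s[gene] for s in samples]
        mu, sd = mean(values), pstdev(values)
        for row, value in zip(out, values):
            row[gene] = (value - mu) / sd if sd > 0 else 0.0
    return out
```

Standardizing within each cohort before concatenation is one simple way to put studies on a comparable scale; it assumes the cohorts have similar composition, which is why patient selection and characteristics need to be considered alongside it.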