Method Development
The emergence of "omics" science presented challenges in data analysis, both in integrating multiple modes of assays (DNA, RNA, proteins, and everything else) and in integrating heterogeneous, independent datasets (such as multi-center cohorts in clinical studies). We identified core problems in the following areas:
- Availability of well-curated omics data, along with sample annotations such as clinical outcome
- Adaptation of rigorous statistical methods with good properties to omics data, particularly to address the generalizability of discoveries made from omics data, such as in biomarker development, where performance tends to decrease as the set of test cases expands
- Implementation of effective and efficient computational methods, for both data management and intensive numerical computation
Expression Data Collection and Curation
Although omics data can be viewed as one big matrix of concatenated data with multiple variables (and modes of assays) and samples (from many studies), tremendous effort is required to reformat and standardize the mostly unstructured raw data (both private and public) before they can be considered "statistician friendly". We have developed capabilities to collect and curate microarray datasets, with emphasis on the following aspects:
- identifying cohorts with non-redundant subjects
- considering patient selection and characteristics
- uniform representation of clinical variables
- regular probe remapping according to up-to-date transcriptomic knowledge
- quality control and renormalization of available expression measures
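As an illustration of two of the aspects above (non-redundant subjects and uniform representation of clinical variables), here is a minimal Python sketch. It is not our actual pipeline: the record fields (`study`, `subject_id`, `status`, `followup`, `followup_unit`), the label vocabulary, and the helper names `harmonize` and `curate` are all hypothetical, and it assumes that redundant subjects can be recognized by a shared identifier, which in practice often requires more careful matching.

```python
from typing import Any, Dict, List, Optional

# Hypothetical vocabulary: study-specific vital-status labels mapped to
# 1 (event observed) or 0 (censored).
STATUS_TO_EVENT = {
    "dead": 1, "deceased": 1, "1": 1,
    "alive": 0, "censored": 0, "0": 0,
}

# Conversion factors to months; 365.25 days per year is an assumption.
UNIT_TO_MONTHS = {"months": 1.0, "years": 12.0, "days": 12.0 / 365.25}


def harmonize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return one sample annotation in the shared representation."""
    raw_status = str(record.get("status", "")).strip().lower()
    # Unrecognized labels become None instead of being guessed.
    event: Optional[int] = STATUS_TO_EVENT.get(raw_status)
    unit = str(record["followup_unit"]).strip().lower()
    months = float(record["followup"]) * UNIT_TO_MONTHS[unit]
    return {
        "study": record["study"],
        "subject_id": record["subject_id"],
        "event": event,
        "followup_months": months,
    }


def curate(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first record per subject and harmonize its annotations."""
    seen = set()
    curated = []
    for record in records:
        if record["subject_id"] in seen:
            continue  # subject already contributed by an earlier cohort
        seen.add(record["subject_id"])
        curated.append(harmonize(record))
    return curated


# Made-up records from two hypothetical studies; subject "P1" appears in both.
records = [
    {"study": "A", "subject_id": "P1", "status": "Dead",
     "followup": 2, "followup_unit": "years"},
    {"study": "B", "subject_id": "P1", "status": "deceased",
     "followup": 24, "followup_unit": "months"},
    {"study": "B", "subject_id": "P2", "status": "alive",
     "followup": 730.5, "followup_unit": "days"},
]
curated = curate(records)
```

Mapping unrecognized labels to `None` rather than to a default keeps ambiguous annotations visible for manual review, which matters when the curated outcome variables feed into downstream statistical analysis.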