

Review of Integrative Analysis Challenges in Systems Biology

Lei Zhu, Amit Bhattacharyya, Edit Kurali, Amber Anderson, Alan Menius, and Kwan Lee

In the pharmaceutical industry, systems biology is making a paradigm shift in drug discovery from a target-focused to a systems approach. It studies the molecular level changes in a biological system as an integrated and interacting network of genes, proteins, and metabolites instead of focusing on individual components. Integrative analysis of molecular level data becomes very important in unraveling the complex relationships and interactions in the biological system. One way to conduct integrative analysis is to identify correlations among genes, proteins, and metabolites. There are various analytic challenges: data preprocessing from various omics platforms; the number of variables much larger than the number of subjects; wide differences in the number of variables per platform, leading to imbalance in the platform contributions to the analysis; the issues of multicollinearity and multiple testing; and difficulty in proper model validation to avoid selection bias. This article is a comprehensive review of statistical methods that address analysis challenges in integrative data.

Key Words: Metabolic profiling; Partial least squares discriminant analysis; Principal component analysis; Omics platform; Selection bias; Variable selection.

1. Introduction

Systems biology is the study of an organism, viewed as an integrated and interacting network of genes, proteins, and biochemical reactions which give rise to life. Instead of analyzing individual components or aspects of the organism, such as sugar metabolism or a cell nucleus, systems biologists study the whole biological system as an integrated and interacting network of genes, proteins, and biochemical reactions. These interactions are ultimately responsible for an organism's form and functions. For example, the immune system is not the result of a single mechanism or gene.
Rather, the interactions of numerous genes, proteins, mechanisms, and the organism's external environment produce immune responses to fight infections and diseases. The target-centered drug discovery approach has been used by the pharmaceutical industry for the past three decades. However, the productivity of pharmaceutical R&D is declining due to rising costs and increased failure rates. Recently, a paradigm shift has been transforming the target-centered approach into a systems approach. An overview was published (van der Greef and McBurney 2005) on the systems pathology and systems pharmacology approach to rescuing drug discovery, using systems biology concepts applied to the discovery and development of drugs for efficacy or safety in the whole organism. An understandable model of the whole biological system could be developed by studying the relationships and interactions between various parts of the system (e.g., metabolic pathways, organelles, cells, physiological systems, organisms, etc.). Identifying biomarkers and related biological pathways that change under various perturbations will result in a better understanding of drug efficacy and of the predictive nature of animal models. A more complete understanding of the disease pathways and drug mechanisms of action in animals and humans could improve clinical development strategies.

Omics technologies, including genomics, proteomics, and metabonomics/metabolomics, are a rapidly growing area of biological study. The term "omics platforms" or "omics" refers to platforms that are capable of cataloging a myriad of biomolecular changes in an organism undergoing experimental perturbation. The ability of omics platforms to generate large quantities of measurements on various aspects of the organism has helped to advance biology into areas previously inaccessible.

© American Statistical Association. Statistics in Biopharmaceutical Research, 2011, Vol. 3, No. 4. DOI: 10.1198/sbr.2010.09027
The development in omics has made it possible for the systems biology approach to study drug efficacy and toxicity effects more completely and to generate insights into how the human body operates under the influence of drug and/or disease. The collective interactions of genes, proteins, and other components in an organism are often characterized together as an interactive network. A gene level change may trigger a change at the protein level, and the protein level change may then result in metabolite changes. Even if a direct causal relationship were absent, gene, protein, or metabolite changes could still be highly correlated with each other. The systems biology approach can help identify these types of changes, which would not be possible with traditional approaches. This understanding of an organism will ultimately transform our understanding of diseases and how to treat them.

In order to understand the interactive network in systems biology data, there are abundant publications on data analysis of multiple platforms focusing on a platform-by-platform approach. Eriksson et al. (2004) discussed multivariate techniques for analyzing omics platforms as independent entities. However, to assess and understand a whole series of alterations in biological systems, all possible platforms need to be analyzed simultaneously, which is termed "integrative analysis." Recent developments on the integration of multiple platforms were demonstrated in several publications, such as a systems biology approach to connect genes to metabolites in plants (Oksman-Caldentey, Inze, and Oresic 2004); statistical experimental design and partial least squares regression analysis on NMR and clinical chemistry data (Antti et al. 2004); and an extensive review of the current state of integration of the omics data streams covering both biological and data-driven integration strategies (Thomas and Ganji 2006).
The integrations discussed in these publications are of multiple platform data collected on the same subjects but on different variables, which is the main focus of this review. This is different from integrating or merging multiple platform data collected from different subjects but measuring the same or similar variables (Hwang et al. 2005).

2. Methods

2.1 Technological Platforms

The application of open, broad-based discovery technologies makes it possible for the systems biology approach to catalog a myriad of biomolecular changes in an organism undergoing experimental perturbations. For consistency, all molecular measurements from analytical platforms are defined as analytes. The analytes that change due to disease or drug are called disease or drug markers, respectively. Omics platforms such as transcriptomics, metabonomics/metabolomics, and proteomics are capable of registering many thousands of analytes from the same sample. The following is a selected list of commonly used platforms:

(a) Traditional blood chemistry (also called "non-omic") analytes, such as Cholesterol, Glucose, Hemoglobin, Leptin, and so on, are from well-developed assays, and the properties of the analytes are usually well studied and understood. The number of analytes typically ranges from a few to several dozen.

(b) The transcriptomics platform using Affymetrix GeneChip (Affymetrix, Santa Clara, CA) requires labeling complementary RNA derived from mRNA generated from extracted tissues and hybridizing it to oligonucleotides representative of human or murine transcripts. The intensity of signal, as a result of binding to a nucleotide sequence, corresponds to the relative amount of mRNA in the experimental sample. The dimension of the Affymetrix GeneChip data is usually 22,000 genes in animals and 54,000 genes in humans.

(c) Nuclear magnetic resonance (NMR) spectroscopic analysis is used for metabolic profiling.
In proton nuclear magnetic resonance (NMR) spectroscopic analysis, chemical entities in a sample are subjected to a high-intensity, oscillating magnetic field, which causes various nuclear resonances. The amount of shift in the resonance signal is characteristic of molecules, and can be used to provide information on the number and type of chemical entities in a molecule. The NMR data contain several hundred metabolites. The metabolite data from the NMR platform are referred to as Metabonomics.

(d) Liquid chromatography-mass spectrometry (LC-MS) is a technology in which liquid chromatography separates analytes and the mass spectrometer measures their apparent mass (i.e., mass-to-charge ratio). The fragmentation pattern of the molecule can be further analyzed to determine the molecular composition of the analytes. The LC-MS platform can be used in metabolite profiling or lipid profiling, generating several thousand metabolites or several thousand lipids. The metabolite data from the LC-MS platform are referred to as Metabolomics.

(e) The gas chromatography with flame-ionization detector (GC-FID) platform detects analytes by measuring an electrical current generated by electrons from burning carbon particles in the sample. It is capable of profiling several hundred lipids.

2.2 Integrative Analysis Challenges and Possible Solutions

The automated acquisition of large amounts of omics data creates exploratory and interpretative challenges in unraveling associations among non-omics analytes, transcripts, proteins, metabolites, and lipids. The challenges include data preprocessing, missing values, data integration, the multiple testing issue, selection bias, and proper model validation.
Some challenges are specific to individual platforms, such as data preprocessing and multiple testing. Some are unique challenges arising from data integration and integrative analysis, such as the missing data problem caused by data that are not available on every omics platform, platform dominance issues due to imbalance in the number of variables across platforms, and selection bias problems in integrating multiple platforms on selected variables. In this section, the challenges as well as possible solutions are discussed.

2.2.1 Data Preprocessing

A good preprocessing method is critical to producing high-quality data for analysis. Raw data from the open platforms are not ready for analysis without proper preprocessing. Various sources of systematic variation and bias exist in raw platform data. The preprocessing steps attempt to remove undesirable systematic variation from the data such that the samples or the analytes are comparable to each other. The statistical literature has abundant coverage of preprocessing methods for the microarray platform over the past several years. Bolstad et al. (2003) proposed normalization methods to reduce the systematic variation in microarray experiments. Spectral data preprocessing involves much more complicated steps such as peak alignment, data normalization, peak quantification, peak matching, and so on. Extensive details have been published on LC-MS spectral data preprocessing, such as peak alignment, baseline correction, normalization, and combining replicates to improve the signal-to-noise ratio (Listgarten and Emili 2005). All preprocessing methods have a common goal: to differentiate the signal from the different sources of systematic variation and to facilitate profile comparisons. Software packages and bioinformatics tools are being developed rapidly to meet the data-preprocessing challenge. However, current tools have much room for improvement; method development in preprocessing is still an area of active research.
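One of the normalization methods compared by Bolstad et al. (2003) is quantile normalization, which forces every sample to share the same intensity distribution. A minimal NumPy sketch on simulated intensities (the data and function name are illustrative, not from any specific platform pipeline):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a genes-by-samples intensity matrix.

    Each value is replaced by the mean of the values at its rank
    across samples, so every sample ends up with an identical
    intensity distribution.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value within its sample
    mean_profile = np.sort(X, axis=0).mean(axis=1)     # average k-th smallest value across samples
    return mean_profile[ranks]

rng = np.random.default_rng(0)
# 1,000 "genes" x 4 "samples"; sample 3 carries a systematic intensity shift
X = rng.lognormal(mean=6.0, sigma=1.0, size=(1000, 4))
X[:, 3] *= 2.0

Xn = quantile_normalize(X)
# After normalization every sample has the same distribution,
# so the column medians agree exactly.
print(np.ptp(np.median(Xn, axis=0)))  # → 0.0
```

Forcing a common distribution removes sample-to-sample systematic shifts, at the cost of assuming that most analytes do not change between samples.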
2.2.2 Missing Values

Missing values are common in most platforms. There are various reasons why values are missing: samples collected on one platform but failed on another (this case is specific to integration of multiple platforms); poor sample quality; misalignment of peaks or chemical noise in spectra; assay failure or data that failed technical quality control; values below the limit of detection (LOD); and so on.

All platforms have some extent of missing data problems. Affymetrix MAS5 data tend to be noisy at low intensity. An intensity threshold can be derived from the distribution of the control genes. If an intensity threshold is applied before analyzing Affymetrix MAS5 data, a missing value problem arises, and the missing data analysis approaches discussed in the following would apply. If an intensity threshold is instead applied after the analysis, the standard analysis procedure is followed by an extra step that filters out genes below the intensity threshold.

There are different methods for handling missing data, which depend on the circumstances under which the missing data occur. If data are missing systematically (e.g., missing due to being below the limit of detection for certain treatment groups), one approach is to test the association of the missing pattern with treatment or disease groups to check whether the observed informative missingness is by chance. For confirmed cases of informative missingness, further investigation by the scientists is highly recommended. Figure 1 is an example of informative missingness in data from LC/MS lipids. There are 9866 (25%) missing values in the data. The missing map shows that certain variables are missing only in certain treatment groups, which clearly violates the "missing at random" assumption.

Statistical treatment of missing data is usually based on the assumption of missing at random.
Under this assumption, in analysis of variance (ANOVA), missing experimental subjects are omitted from the analysis. For projection methods such as PCA or PLS-DA, the NIPALS (nonlinear iterative partial least squares) algorithm (Wold, Sjostrom, and Eriksson 2001), implemented in SIMCA (Umetrics Inc.), interpolates the missing points using a least squares fit, iteratively substituting the missing values with predictions from the model until convergence is reached. Missing values are considered to be the exact fit of the model, with zero residual. Other approaches based on the EM algorithm work better than the NIPALS algorithm when there is a large percentage of missing data (Grung and Manne 1998; Nelson, Taylor, and MacGregor 1996). Among the three missing-value imputation methods well suited to missing at random cases in mainstream software (multiple imputation, imputation by classification, and imputation by the NIPALS algorithm), none appears to differ significantly from the others with regard to the quality of the results (Preda et al. 2005).

Figure 1. An example of informative missingness in LC/MS lipid data. Rows are samples and columns are peaks/analytes. Samples are sorted by treatment. Black areas represent data; red areas represent missing values. There are 9866 (25%) missing values. The missing map shows that certain peaks/analytes are missing only in certain treatment groups, which clearly violates the "missing at random" assumption.

In any case, models or predictions are highly questionable if there is much missing data. A missing value threshold is recommended to remove variables that have over 50% missing data in both comparison groups in ANOVA contrasts. SIMCA also provides a missing percentage threshold option. A missing value threshold is often used for both ANOVA and SIMCA analyses on lipid and polar metabolite data where excessive missing values occur.
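The NIPALS treatment of missing values described above can be sketched in a few lines: each component is fit by least squares over the observed cells only, and missing cells are then filled with the model reconstruction (zero residual). The following is an illustrative NumPy reimplementation, not the SIMCA code; the function name and the simulated demo data are hypothetical:

```python
import numpy as np

def nipals_impute(X, n_comp=2, tol=1e-8, max_iter=500):
    """Impute NaN entries of X by NIPALS-style PCA.

    Scores and loadings are estimated by least squares over observed
    cells only; missing cells are treated as lying exactly on the
    model (zero residual) and filled with the reconstruction.
    """
    M = ~np.isnan(X)                          # observed-cell mask
    mu = np.nanmean(X, axis=0)                # column means from observed data
    R = np.where(M, X - mu, 0.0)              # centered residuals, NaN -> 0
    Xhat = np.tile(mu, (X.shape[0], 1))       # running model reconstruction
    for _ in range(n_comp):
        t = R[:, np.argmax((R ** 2).sum(axis=0))].copy()  # start from strongest column
        for _ in range(max_iter):
            p = (R * t[:, None]).sum(axis=0) / (M * t[:, None] ** 2).sum(axis=0)
            p /= np.linalg.norm(p)
            t_new = (R * p).sum(axis=1) / (M * p ** 2).sum(axis=1)
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        Xhat += np.outer(t, p)                # add this component's contribution
        R = np.where(M, R - np.outer(t, p), 0.0)
    return np.where(M, X, Xhat)               # keep observed values, fill the rest

# Rank-2 demo data with roughly 10% of cells knocked out at random.
rng = np.random.default_rng(1)
X_full = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 8))
X = X_full.copy()
miss = rng.random(X.shape) < 0.1
X[miss] = np.nan
X_imp = nipals_impute(X, n_comp=2)
```

Consistent with the caveat in the text, this interpolation is only trustworthy when the fraction of missing cells is modest; EM-based alternatives are preferable when much of the data is missing.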
For the above example of informative missing data, categorical analysis is recommended to test whether the missingness is associated with the treatment.

2.2.3 Methods for Analyzing High-Dimensional Data

In integrative cross-platform analysis, high dimensionality and multicollinearity among the analytes are common challenges to standard statistical inference. For example, ordinary least squares regression coefficients have highly inflated variances, and they become quite unstable for multicollinear data, or even nonunique for high-dimensional data.

In high-dimensional data, most variability is likely to exist in a relatively lower dimensional space. This relatively lower dimensional space is the "latent dimension" of the original high-dimensional data. Multivariate statistical techniques such as principal component analysis (PCA) and partial least squares (PLS) reduce the high-dimensional data into fewer dimensions while preserving the overall characteristics of the data, and hence are applied regularly in analyzing these types of data.

For the past several decades, these techniques have been used on high-dimensional data such as spectroscopy and chromatography (Eriksson et al. 2002a; Eriksson et al. 2002b; Eriksson et al. 2001; Hellberg, Sjostrom, and Wold 1986; Hellberg et al. 1991). Projection methods are also extensively used in quantitative structure-activity relationship (QSAR) analysis.

Figure 2. An example of the platform dominance problem. The plots are from principal component analysis (PCA). The subjects in black are from the normal group; the subjects in red are from the diseased group. The integrative analysis result is a mirror image of the gene platform analysis alone. The mirror image illustrates that the large number of genes masks the influence of the traditional blood chemistry measures (non-omics), which are known to be important.
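The latent-dimension idea is easy to demonstrate: when a few factors drive thousands of correlated analytes, PCA recovers almost all of the variability in the first few components. A small simulated sketch (the dimensions and variable names are illustrative):

```python
import numpy as np

# Simulated high-dimensional block: 30 subjects, 2,000 analytes whose
# variation is driven by 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
scores = rng.normal(size=(30, 2))
loadings = rng.normal(size=(2, 2000))
X = scores @ loadings + 0.1 * rng.normal(size=(30, 2000))

# PCA via SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()

# Nearly all variability lives in the first two latent dimensions.
print(explained[:2].sum())  # close to 1
```

With 2,000 variables and only 30 subjects, an ordinary regression on the raw variables would be nonunique, whereas the two-dimensional projection is stable and interpretable.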
The suitability of multivariate projection methods for the analysis of genomics and proteomics data is also well documented (Boulesteix and Strimmer 2006). Partial least squares discriminant analysis, which is often used, attempts to find and span a latent space for prediction of a response variable. The model searches for a set of latent vectors that performs a simultaneous decomposition of the explanatory variables and the response variables, with the constraint that these latent vectors explain as much as possible of the covariance between the explanatory variables and the response.

Regularized or shrinkage methods are also highly preferable for high-dimensional data analysis. Shrinkage often improves prediction accuracy, trading off decreased variance for a small increase in bias (Hastie, Tibshirani, and Friedman 2001). Shrinkage methods such as the elastic net (Zou and Hastie 2005) have the advantage of building a prediction model and selecting variables at the same time.

2.2.4 Data Integration

Cross-platform data integration is carried out by merging all platform data after preprocessing, if any, and performing statistical analysis on the combined data. This approach is sensible when the platforms are comparable in dimension. When the platforms vary in their dimensionalities, the larger platforms tend to dominate the integrative analysis over the smaller platforms. The following example, with integrated data of 20 traditional blood chemistry (non-omics) analytes and 12,488 genes, illustrates this platform dominance challenge. The principal component analysis (Figure 2) shows that the integrative analysis result is a mirror image of the gene platform analysis alone.
The mirror image illustrates that the large number of genes masks the influence of the traditional blood chemistry measures, which are known to be important.

A simple integration approach should work if the integrated platforms are not very different in terms of size, for example, integration of the NMR platform with the GC/FID or LC/MS platforms that have several hundred analytes.
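The dominance effect can be seen numerically by comparing the total variance each block contributes to the merged matrix. The sketch below also shows block scaling, a commonly used weighting device (an illustrative remedy assumed here, not a recommendation taken from the text), in which each platform block is divided by the square root of its number of variables so that every block contributes comparable total variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
# Two platforms on the same 40 subjects: 20 blood chemistry analytes
# vs. 5,000 genes, each variable simulated with unit variance (in
# practice each variable would be autoscaled first).
chem = rng.normal(size=(n, 20))
genes = rng.normal(size=(n, 5000))

def total_var(X):
    """Total variance of a block: sum of per-variable sample variances."""
    return X.var(axis=0, ddof=1).sum()

# Merged as-is, the gene block carries about 250x the total variance of
# the chemistry block, so it dominates any variance-driven projection.
print(total_var(genes) / total_var(chem))  # roughly 5000/20 = 250

# Block scaling: weight each block by 1/sqrt(number of variables).
chem_bs = chem / np.sqrt(chem.shape[1])
genes_bs = genes / np.sqrt(genes.shape[1])
merged = np.hstack([chem_bs, genes_bs])

# Now both platforms contribute comparable total variance.
print(total_var(chem_bs), total_var(genes_bs))  # both close to 1
```

Equal-variance weighting is only one of several block-weighting choices; the appropriate weights depend on how much influence each platform should be allowed in the joint model.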