Overview of methods for analyzing high-dimensional experimental data, including theory, methodologies, and applications
Analysis of Variance for High-Dimensional Data summarizes all the methods for analyzing high-dimensional data obtained from designed experiments in the life, food, and chemical sciences, with particular attention to methods developed in recent years.
Written by international experts who lead development in the field, Analysis of Variance for High-Dimensional Data includes information on:
Basic and established theories on linear models from a mathematical and statistical perspective
Available methods and their mutual relationships, including coverage of ASCA, APCA, PC-ANOVA, ASCA+, LiMM-PCA, RM-ASCA+, and PERMANOVA, as well as various alternative methods and extensions
Applications in metabolomics, microbiome, gene expression, proteomics, food science, sensory science, and chemistry
Commercially available and open-source software for application of these methods
Analysis of Variance for High-Dimensional Data is an essential reference for practitioners involved in data analysis in the natural sciences, including professionals working in chemometrics, bioinformatics, data science, statistics, and machine learning. The book is also valuable for developers of new methods in high-dimensional data analysis.
About the Authors
Foreword
Preface
1 Introduction
1.1 Types of Data
1.2 Statistical Design of Experiments
1.3 High-Dimensional Data
1.4 Examples
1.4.1 Metabolomics
1.4.2 Genomics
1.4.3 Microbiome
1.4.4 Proteomics
1.4.5 Food Science
1.4.6 Sensory Science
1.4.7 Chemistry
1.5 Complexities
1.5.1 Normalization
1.5.2 Different Measurement Scales
1.5.3 Different Distributions
1.5.4 Heteroscedastic Error
1.5.5 Comparability
1.5.6 Sparseness, Non-detects, and Missing Values
1.5.7 Unbalancedness
1.6 Direct Versus Indirect Methods
1.7 Some History
1.A Appendix
1.A.1 Types of Measurements
1.A.2 Notation and Terminology
1.A.3 Some Definitions
1.A.4 Abbreviations
2 Basic Theory and Concepts
2.1 Mathematical Background
2.1.1 Vector Spaces and Subspaces
2.1.2 Matrix Decompositions
2.1.3 Inverses and Generalized Inverses
2.1.4 Distances and Projections
2.1.4.1 Formal Description of Distances
2.1.4.2 Projections
2.1.5 Principal Component Analysis
2.2 Statistical Background
2.2.1 Estimation Methods
2.2.1.1 Least Squares
2.2.1.2 Maximum Likelihood
2.2.2 Regression Methods
2.2.2.1 Multiple Linear Regression: Full Rank Case
2.2.2.2 Multiple Linear Regression Using Dummy Variables
2.2.2.3 Multiple Linear Regression: Rank Deficient Case
2.2.2.4 Penalized Regression
2.2.2.5 Principal Component Regression
2.2.2.6 Partial Least Squares
2.2.2.7 Redundancy Analysis
2.2.3 Significance Tests
2.2.3.1 Classical Tests
2.2.3.2 Permutation Tests
2.2.3.3 Likelihood Ratio Tests
2.3 Association Measures
2.3.1 Pearson and Spearman Correlation Coefficients
2.3.2 Problems with Correlations
2.A Appendix
3 Linear Models
3.1 Introduction
3.2 Simple ANOVA Models
3.2.1 One-Way ANOVA
3.2.2 Two-Way ANOVA
3.2.2.1 Crossed Designs
3.2.2.2 Nested Designs
3.2.3 Unbalanced Designs
3.2.3.1 One-Way ANOVA
3.2.3.2 Two-Way ANOVA for Crossed Designs
3.2.3.3 Nested ANOVA
3.3 Regression Formulation, Estimability, and Contrasts
3.4 Coding Schemes
3.4.1 Codings for Balanced Designs
3.4.1.1 One-Way Layout
3.4.1.2 Two-Way Crossed Designs
3.4.1.3 Two-Way Nested Designs
3.4.2 Codings for Unbalanced Designs
3.5 Advanced Models
3.5.1 Variance Component Models
3.5.2 Linear Mixed Models
3.5.2.1 General Idea
3.5.2.2 Estimation of Model Parameters
3.5.2.3 Repeated Measures ANOVA
3.5.2.4 Cross-over Designs and Models
3.5.2.5 Longitudinal LMMs
3.6 Hasse Diagrams
3.6.1 Building a Hasse Diagram
3.7 Validation
3.7.1 Classical Tests Revisited
3.7.2 Expected Mean Squares from Hasse Diagrams
3.7.3 Permutation Tests
3.7.3.1 Exact Tests
3.7.3.2 Approximate Tests
3.8 Miscellaneous Models
3.8.1 Multivariate Analysis of Variance
3.8.1.1 Traditional Multivariate Analysis of Variance
3.8.1.2 Significance Testing in MANOVA
3.8.1.3 Regression Formulation of MANOVA
3.8.2 Multivariate LMMs
3.A Appendix
3.A.1 Proof
3.A.2 Relationships Between Codings
3.A.3 Practical Aspects of Codings
4 ASCA and Related Methods
4.1 ASCA
4.1.1 Basic Idea of ASCA
4.1.2 Properties of ASCA
4.1.3 Permutation Tests for ASCA
4.1.4 Back-Projection
4.1.5 Scaling in ASCA
4.1.6 Group-wise ASCA
4.1.7 Variable-Selection ASCA
4.1.8 REP-ASCA
4.1.9 ASCA as a Multivariate Multiple Regression Model
4.1.10 Geometry of ASCA
4.1.10.1 Geometry of ASCA in Row-Space
4.1.10.2 Geometry of ASCA in Column-Space
4.2 APCA
4.2.1 Basic Idea of APCA
4.2.2 Comparing APCA with ASCA
4.3 ASCA+
4.3.1 Confidence Ellipsoids for ASCA
4.3.2 ASCA and ASCA+ as RDA Models
4.4 Principal Response Curves
4.5 SMART
4.6 ASCA, PRC, and SMART Compared
4.7 MSCA
4.A Appendix
4.A.1 Proof of Equation 4.20
4.A.2 Proof of Equation 4.6
5 Alternative Methods
5.1 General Introduction
5.2 PLSR-Based Methods
5.2.1 ANOVA-TP
5.2.2 ANOVA Multiblock Orthogonal Partial Least Squares (AMOPLS)
5.3 LMM-Based Methods
5.3.1 RM-ASCA+ with Qualitative Time Models
5.3.2 Validation of the RM-ASCA+ Model
5.3.2.1 Validation of RM-ASCA+ Models with Nonparametric Bootstrap
5.3.2.2 Validation of RM-ASCA+ Models with Permutation Testing
5.3.2.3 Visualization
5.3.2.4 RM-ASCA+ with Quantitative Time Models
5.3.3 LiMM-PCA
5.3.3.1 Validation
5.3.3.2 Visualization of Effects in LiMM-PCA
5.4 Miscellaneous Methods
5.4.1 PC-ANOVA
5.4.1.1 Basic Idea of PC-ANOVA
5.4.1.2 Comparing PC-ANOVA with ASCA
5.4.2 PARAFASCA
5.4.3 PE-ASCA
5.4.4 rMANOVA
5.4.5 Fifty–Fifty MANOVA
5.4.6 AComDim
5.4.7 General Effect Modeling (GEM)
6 Distance-based Methods
6.1 Introduction
6.1.1 Double Zeros
6.1.2 Horseshoe Effect
6.1.3 Compositionality
6.2 Methods
6.2.1 Principal Coordinate Analysis
6.2.2 PERMANOVA
6.2.2.1 PERMANOVA Calculated from the Gower Matrix
6.2.2.2 PERMANOVA of Non-Euclidean Dissimilarity Matrices
6.2.3 Effect Sizes in PERMANOVA
6.2.4 Permutations in PERMANOVA
6.2.5 Assumptions for PERMANOVA
6.3 ANOSIM
7 Reviews and Reflections
7.1 Reviews
7.1.1 Metabolomics
7.1.1.1 Plant Science
7.1.1.2 Microbiology and Biotechnology
7.1.1.3 Animal Science
7.1.1.4 Human Science
7.1.2 Microbiome
7.1.3 Genomics
7.1.4 Proteomics
7.1.5 Food Science
7.1.6 Sensory Science
7.1.7 Chemistry
7.2 Reflections
7.2.1 Summary of Reviews
7.2.2 Overview of Methods
7.2.3 Remaining Challenges
7.2.3.1 ASCA+ and Partial RDA
7.2.3.2 Permutations: Correlations and Unbalancedness
7.2.3.3 PERMANOVA and Effect Sizes
7.2.3.4 Back-Projection Approach
7.2.3.5 Inferential Statistics
7.2.3.6 Advanced HD-ANOVA Methods
8 Software
8.1 HD-ANOVA Software
8.2 R Package HDANOVA
8.3 Installing and Starting the Package
8.4 Data Handling
8.4.1 Read from File
8.4.2 Data Pre-processing
8.4.2.1 Re-coding Categorical data
8.4.3 Data Structures for Analysis Including Blocks
8.4.3.1 Create List of Blocks
8.4.3.2 Create data.frame of Blocks
8.5 Analysis of Variance (ANOVA)
8.5.1 Simulated Data
8.5.2 Fixed Effect Models
8.5.2.1 One-Way ANOVA
8.5.2.2 Two-Way Crossed Effects ANOVA
8.5.2.3 Types of Sums of Squares
8.5.2.4 Coding Schemes
8.5.2.5 Fixed Effect Nested ANOVA
8.5.3 Linear Mixed Models
8.5.3.1 Least Squares − mixlm
8.5.3.2 Restrictions
8.5.3.3 Repeated Measures
8.5.3.4 REML
8.5.4 Multivariate ANOVA (MANOVA)
8.6 Basic ASCA Family
8.6.1 Fit ASCA Model
8.6.1.1 Permutation Testing
8.6.1.2 Random Effects
8.6.1.3 Scores and Loadings
8.6.1.4 Data Ellipsoids and Confidence Ellipsoids
8.6.1.5 Combined Effects
8.6.1.6 Quantitative Effects
8.6.2 ANOVA-PCA (APCA)
8.6.3 PC-ANOVA
8.6.4 MSCA
8.6.5 LiMM-PCA
8.6.6 Repeated Measures ASCA
8.7 Alternative Methods
8.7.1 Principal Response Curves (PRC)
8.7.2 Permutation-Based MANOVA (PERMANOVA)
8.8 Software Packages
8.8.1 R Packages
8.8.2 MATLAB Toolboxes
8.8.3 Python
References
Index
Age K. Smilde is Emeritus Professor of Biosystems Data Analysis at the Swammerdam Institute for Life Sciences at the University of Amsterdam. He also holds a part-time position at the Department of Plant and Environmental Sciences at the University of Copenhagen.
Federico Marini is Professor of Analytical Chemistry at the Department of Chemistry of the University of Rome “La Sapienza”.
Johan A. Westerhuis is Assistant Professor at the Swammerdam Institute for Life Sciences, University of Amsterdam, The Netherlands.
Kristian H. Liland is Professor of Statistics at the Faculty of Science and Technology, Norwegian University of Life Sciences, Norway.