A thought-provoking look at statistical learning theory and its role in understanding human learning and inductive reasoning

A joint endeavor from leading researchers in the fields of philosophy and electrical engineering, An Elementary Introduction to Statistical Learning Theory is a comprehensive and accessible primer on the rapidly evolving fields of statistical pattern recognition and statistical learning theory. Explaining these areas at a level and in a way that is not often found in other books on the topic, the authors present the basic theory behind contemporary machine learning and use its foundations as a framework for philosophical thinking about inductive inference.

Promoting the fundamental goal of statistical learning, knowing what is achievable and what is not, this book demonstrates the value of a systematic methodology when used along with the needed techniques for evaluating the performance of a learning system. First, an introduction to machine learning is presented that includes brief discussions of applications such as image recognition, speech recognition, medical diagnostics, and statistical arbitrage. To enhance accessibility, two chapters on relevant aspects of probability theory are provided. Subsequent chapters feature coverage of topics such as the pattern recognition problem, optimal Bayes decision rule, the nearest neighbor rule, kernel rules, neural networks, support vector machines, and boosting.

Appendices throughout the book explore the relationship between the discussed material and related topics from mathematics, philosophy, psychology, and statistics, drawing insightful connections between problems in these areas and statistical learning theory. All chapters conclude with a summary section, a set of practice questions, and a reference section that supplies historical notes and additional resources for further study.

An Elementary Introduction to Statistical Learning Theory is an excellent book for courses on statistical learning theory, pattern recognition, and machine learning at the upper-undergraduate and graduate levels. It also serves as an introductory reference for researchers and practitioners in the fields of engineering, computer science, philosophy, and cognitive science who would like to further their knowledge of the topic.

Preface

1 Introduction: Classification, Learning, Features, and Applications
  1.1 Scope
  1.2 Why Machine Learning?
  1.3 Some Applications
    1.3.1 Image Recognition
    1.3.2 Speech Recognition
    1.3.3 Medical Diagnosis
    1.3.4 Statistical Arbitrage
  1.4 Measurements, Features, and Feature Vectors
  1.5 The Need for Probability
  1.6 Supervised Learning
  1.7 Summary
  1.8 Appendix: Induction
  1.9 Questions
  1.10 References

2 Probability
  2.1 Probability of Some Basic Events
  2.2 Probabilities of Compound Events
  2.3 Conditional Probability
  2.4 Drawing Without Replacement
  2.5 A Classic Birthday Problem
  2.6 Random Variables
  2.7 Expected Value
  2.8 Variance
  2.9 Summary
  2.10 Appendix: Interpretations of Probability
  2.11 Questions
  2.12 References

3 Probability Densities
  3.1 An Example in Two Dimensions
  3.2 Random Numbers in [0,1]
  3.3 Density Functions
  3.4 Probability Densities in Higher Dimensions
  3.5 Joint and Conditional Densities
  3.6 Expected Value and Variance
  3.7 Laws of Large Numbers
  3.8 Summary
  3.9 Appendix: Measurability
  3.10 Questions
  3.11 References

4 The Pattern Recognition Problem
  4.1 A Simple Example
  4.2 Decision Rules
  4.3 Success Criterion
  4.4 The Best Classifier: Bayes Decision Rule
  4.5 Continuous Features and Densities
  4.6 Summary
  4.7 Appendix: Uncountably Many
  4.8 Questions
  4.9 References

5 The Optimal Bayes Decision Rule
  5.1 Bayes Theorem
  5.2 Bayes Decision Rule
  5.3 Optimality and Some Comments
  5.4 An Example
  5.5 Bayes Theorem and Decision Rule with Densities
  5.6 Summary
  5.7 Appendix: Defining Conditional Probability
  5.8 Questions
  5.9 References

6 Learning from Examples
  6.1 Lack of Knowledge of Distributions
  6.2 Training Data
  6.3 Assumptions on the Training Data
  6.4 A Brute Force Approach to Learning
  6.5 Curse of Dimensionality, Inductive Bias, and No Free Lunch
  6.6 Summary
  6.7 Appendix: What Sort of Learning?
  6.8 Questions
  6.9 References

7 The Nearest Neighbor Rule
  7.1 The Nearest Neighbor Rule
  7.2 Performance of the Nearest Neighbor Rule
  7.3 Intuition and Proof Sketch of Performance
  7.4 Using more Neighbors
  7.5 Summary
  7.6 Appendix: When People use Nearest Neighbor Reasoning
    7.6.1 Who Is a Bachelor?
    7.6.2 Legal Reasoning
    7.6.3 Moral Reasoning
  7.7 Questions
  7.8 References

8 Kernel Rules
  8.1 Motivation
  8.2 A Variation on Nearest Neighbor Rules
  8.3 Kernel Rules
  8.4 Universal Consistency of Kernel Rules
  8.5 Potential Functions
  8.6 More General Kernels
  8.7 Summary
  8.8 Appendix: Kernels, Similarity, and Features
  8.9 Questions
  8.10 References

9 Neural Networks: Perceptrons
  9.1 Multilayer Feedforward Networks
  9.2 Neural Networks for Learning and Classification
  9.3 Perceptrons
    9.3.1 Threshold
  9.4 Learning Rule for Perceptrons
  9.5 Representational Capabilities of Perceptrons
  9.6 Summary
  9.7 Appendix: Models of Mind
  9.8 Questions
  9.9 References

10 Multilayer Networks
  10.1 Representation Capabilities of Multilayer Networks
  10.2 Learning and Sigmoidal Outputs
  10.3 Training Error and Weight Space
  10.4 Error Minimization by Gradient Descent
  10.5 Backpropagation
  10.6 Derivation of Backpropagation Equations
    10.6.1 Derivation for a Single Unit
    10.6.2 Derivation for a Network
  10.7 Summary
  10.8 Appendix: Gradient Descent and Reasoning toward Reflective Equilibrium
  10.9 Questions
  10.10 References

11 PAC Learning
  11.1 Class of Decision Rules
  11.2 Best Rule from a Class
  11.3 Probably Approximately Correct Criterion
  11.4 PAC Learning
  11.5 Summary
  11.6 Appendix: Identifying Indiscernibles
  11.7 Questions
  11.8 References

12 VC Dimension
  12.1 Approximation and Estimation Errors
  12.2 Shattering
  12.3 VC Dimension
  12.4 Learning Result
  12.5 Some Examples
  12.6 Application to Neural Nets
  12.7 Summary
  12.8 Appendix: VC Dimension and Popper Dimension
  12.9 Questions
  12.10 References

13 Infinite VC Dimension
  13.1 A Hierarchy of Classes and Modified PAC Criterion
  13.2 Misfit Versus Complexity Trade-Off
  13.3 Learning Results
  13.4 Inductive Bias and Simplicity
  13.5 Summary
  13.6 Appendix: Uniform Convergence and Universal Consistency
  13.7 Questions
  13.8 References

14 The Function Estimation Problem
  14.1 Estimation
  14.2 Success Criterion
  14.3 Best Estimator: Regression Function
  14.4 Learning in Function Estimation
  14.5 Summary
  14.6 Appendix: Regression Toward the Mean
  14.7 Questions
  14.8 References

15 Learning Function Estimation
  15.1 Review of the Function Estimation/Regression Problem
  15.2 Nearest Neighbor Rules
  15.3 Kernel Methods
  15.4 Neural Network Learning
  15.5 Estimation with a Fixed Class of Functions
  15.6 Shattering, Pseudo-Dimension, and Learning
  15.7 Conclusion
  15.8 Appendix: Accuracy, Precision, Bias, and Variance in Estimation
  15.9 Questions
  15.10 References

16 Simplicity
  16.1 Simplicity in Science
    16.1.1 Explicit Appeals to Simplicity
    16.1.2 Is the World Simple?
    16.1.3 Mistaken Appeals to Simplicity
    16.1.4 Implicit Appeals to Simplicity
  16.2 Ordering Hypotheses
    16.2.1 Two Kinds of Simplicity Orderings
  16.3 Two Examples
    16.3.1 Curve Fitting
    16.3.2 Enumerative Induction
  16.4 Simplicity as Simplicity of Representation
    16.4.1 Fix on a Particular System of Representation?
    16.4.2 Are Fewer Parameters Simpler?
  16.5 Pragmatic Theory of Simplicity
  16.6 Simplicity and Global Indeterminacy
  16.7 Summary
  16.8 Appendix: Basic Science and Statistical Learning Theory
  16.9 Questions
  16.10 References

17 Support Vector Machines
  17.1 Mapping the Feature Vectors
  17.2 Maximizing the Margin
  17.3 Optimization and Support Vectors
  17.4 Implementation and Connection to Kernel Methods
  17.5 Details of the Optimization Problem
    17.5.1 Rewriting Separation Conditions
    17.5.2 Equation for Margin
    17.5.3 Slack Variables for Nonseparable Examples
    17.5.4 Reformulation and Solution of Optimization
  17.6 Summary
  17.7 Appendix: Computation
  17.8 Questions
  17.9 References

18 Boosting
  18.1 Weak Learning Rules
  18.2 Combining Classifiers
  18.3 Distribution on the Training Examples
  18.4 The Adaboost Algorithm
  18.5 Performance on Training Data
  18.6 Generalization Performance
  18.7 Summary
  18.8 Appendix: Ensemble Methods
  18.9 Questions
  18.10 References

Bibliography
Author Index
Subject Index

SANJEEV KULKARNI, PhD, is Professor in the Department of Electrical Engineering at Princeton University, where he is also an affiliated faculty member in the Department of Operations Research and Financial Engineering and the Department of Philosophy. Dr. Kulkarni has published widely on statistical pattern recognition, nonparametric estimation, machine learning, information theory, and other areas. A Fellow of the IEEE, he was awarded Princeton University's President's Award for Distinguished Teaching in 2007.

GILBERT HARMAN, PhD, is James S. McDonnell Distinguished University Professor in the Department of Philosophy at Princeton University. A Fellow of the Cognitive Science Society, he is the author of more than fifty published articles in his areas of research interest, which include ethics, statistical learning theory, psychology of reasoning, and logic.

"The main focus of the book is on the ideas behind basic principles of learning theory and I can strongly recommend the book to anyone who wants to comprehend these ideas." (Mathematical Reviews, 1 January 2013)

"It also serves as an introductory reference for researchers and practitioners in the fields of engineering, computer science, philosophy, and cognitive science that would like to further their knowledge of the topic." (Zentralblatt MATH, 2012)