Production-targeted Spark guidance with real-world use cases

Spark: Big Data Cluster Computing in Production goes beyond general Spark overviews to provide targeted guidance on using lightning-fast big-data clustering in production. Written by an expert team well known in the big data community, this book walks you through the challenges of moving from proof-of-concept or demo Spark applications to live Spark in production. Real use cases provide deep insight into common problems, limitations, challenges, and opportunities, while expert tips and tricks help you get the most out of Spark performance. Coverage includes Spark SQL, Tachyon, Kerberos, MLlib, YARN, and Mesos, with clear, actionable guidance on resource scheduling, database connectors, streaming, security, and much more.

Spark has become the tool of choice for many Big Data problems, with more active contributors than any other Apache Software Foundation project. General introductory books abound, but this book is the first to provide deep insight and real-world advice on using Spark in production. Specific guidance, expert tips, and invaluable foresight make this guide an incredibly useful resource for real production settings.
Review Spark hardware requirements and estimate cluster size
Gain insight from real-world production use cases
Tighten security, schedule resources, and fine-tune performance
Overcome common problems encountered using Spark in production

Spark works with other big data tools including MapReduce and Hadoop, and uses languages you already know like Java, Scala, Python, and R. Lightning speed makes Spark too good to pass up, but understanding limitations and challenges in advance goes a long way toward easing actual production implementation. Spark: Big Data Cluster Computing in Production tells you everything you need to know, with real-world production insight and expert guidance, tips, and tricks.
Ema Orhian, Kai Sasaki, Brennon York, Anikate Singh
John Wiley & Sons Inc
Country of Publication:
Professional and scholarly
Introduction

Chapter 1: Finishing Your Spark Job
Installation of the Necessary Components; Native Installation Using a Spark Standalone Cluster; The History of Distributed Computing That Led to Spark; Enter the Cloud; Understanding Resource Management; Using Various Formats for Storage; Text Files; Sequence Files; Avro Files; Parquet Files; Making Sense of Monitoring and Instrumentation; Spark UI; Spark Standalone UI; Metrics REST API; Metrics System; External Monitoring Tools; Summary

Chapter 2: Cluster Management
Background; Spark Components; Driver; Workers and Executors; Configuration; Spark Standalone; Architecture; Single-Node Setup Scenario; Multi-Node Setup; YARN; Architecture; Dynamic Resource Allocation; Scenario; Mesos; Setup; Architecture; Dynamic Resource Allocation; Basic Setup Scenario; Comparison; Summary

Chapter 3: Performance Tuning
Spark Execution Model; Partitioning; Controlling Parallelism; Partitioners; Shuffling Data; Shuffling and Data Partitioning; Operators and Shuffling; Shuffling Is Not That Bad After All; Serialization; Kryo Registrators; Spark Cache; Spark SQL Cache; Memory Management; Garbage Collection; Shared Variables; Broadcast Variables; Accumulators; Data Locality; Summary

Chapter 4: Security
Architecture; Security Manager; Setup Configurations; ACL; Configuration; Job Submission; Web UI; Network Security; Encryption; Event logging; Kerberos; Apache Sentry; Summary

Chapter 5: Fault Tolerance or Job Execution
Lifecycle of a Spark Job; Spark Master; Spark Driver; Spark Worker; Job Lifecycle; Job Scheduling; Scheduling within an Application; Scheduling with External Utilities; Fault Tolerance; Internal and External Fault Tolerance; Service Level Agreements (SLAs); Resilient Distributed Datasets (RDDs); Batch versus Streaming; Testing Strategies; Recommended Configurations; Summary

Chapter 6: Beyond Spark
Data Warehousing; Spark SQL CLI; Thrift JDBC/ODBC Server; Hive on Spark; Machine Learning; DataFrame; MLlib and ML; Mahout on Spark; Hivemall on Spark; External Frameworks; Spark Package; XGBoost; spark-jobserver; Future Works; Integration with the Parameter Server; Deep Learning; Enterprise Usage; Collecting User Activity Log with Spark and Kafka; Real-Time Recommendation with Spark; Real-Time Categorization of Twitter Bots; Summary

Index
Ilya Ganelin is a data engineer working at the Capital One Data Innovation Lab. Ilya is an active contributor to the core components of Apache Spark and a committer to Apache Apex.
Ema Orhian is a Big Data Engineer interested in scaling algorithms. She is the main committer on jaws-spark-sql-rest, a data warehouse explorer on top of Spark SQL.
Kai Sasaki is a software engineer working in distributed computing and machine learning. He is a Spark contributor who works mainly on the MLlib and ML libraries.
Brennon York has been a core contributor to Apache Spark since 2014, including development on GraphX and the core build environment.