Holden Karau and Rachel Warren

High Performance Spark
Best Practices for Scaling and Optimizing Apache Spark

Beijing · Boston · Farnham · Sebastopol · Tokyo
ISBN: 978-1-491-94320-5
High Performance Spark
by Holden Karau and Rachel Warren
Copyright © 2017 Holden Karau, Rachel Warren. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Production Editor: Kristen Brown
Copyeditor: Kim Cofer
Proofreader: James Fraleigh
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2017: First Edition
Revision History for the First Edition
2017-05-22: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. High Performance Spark, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Introduction to High Performance Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
What Is Spark and Why Performance Matters 1
What You Can Expect to Get from This Book 2
Spark Versions 3
Why Scala? 3
To Be a Spark Expert You Have to Learn a Little Scala Anyway 3
The Spark Scala API Is Easier to Use Than the Java API 4
Scala Is More Performant Than Python 4
Why Not Scala? 4
Learning Scala 5
Conclusion 6
2. How Spark Works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
How Spark Fits into the Big Data Ecosystem 8
Spark Components 8
Spark Model of Parallel Computing: RDDs 10
Lazy Evaluation 11
In-Memory Persistence and Memory Management 13
Immutability and the RDD Interface 14
Types of RDDs 16
Functions on RDDs: Transformations Versus Actions 17
Wide Versus Narrow Dependencies 17
Spark Job Scheduling 19
Resource Allocation Across Applications 20
The Spark Application 20
The Anatomy of a Spark Job 22
The DAG 22
Jobs 23
Stages 23
Tasks 24
Conclusion 26
3. DataFrames, Datasets, and Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Getting Started with the SparkSession (or HiveContext or SQLContext) 28
Spark SQL Dependencies 30
Managing Spark Dependencies 31
Avoiding Hive JARs 32
Basics of Schemas 33
DataFrame API 36
Transformations 36
Multi-DataFrame Transformations 48
Plain Old SQL Queries and Interacting with Hive Data 49
Data Representation in DataFrames and Datasets 49
Tungsten 50
Data Loading and Saving Functions 51
DataFrameWriter and DataFrameReader 51
Formats 52
Save Modes 61
Partitions (Discovery and Writing) 61
Datasets 62
Interoperability with RDDs, DataFrames, and Local Collections 62
Compile-Time Strong Typing 64
Easier Functional (RDD “like”) Transformations 64
Relational Transformations 64
Multi-Dataset Relational Transformations 65
Grouped Operations on Datasets 65
Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 66
Query Optimizer 69
Logical and Physical Plans 69
Code Generation 69
Large Query Plans and Iterative Algorithms 70
Debugging Spark SQL Queries 70
JDBC/ODBC Server 70
Conclusion 72
4. Joins (SQL and Core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Core Spark Joins 73
Choosing a Join Type 75
Choosing an Execution Plan 76
Spark SQL Joins 79
DataFrame Joins 79
Dataset Joins 83
Conclusion 84
5. Effective Transformations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Narrow Versus Wide Transformations 86
Implications for Performance 88
Implications for Fault Tolerance 89
The Special Case of coalesce 89
What Type of RDD Does Your Transformation Return? 90
Minimizing Object Creation 92
Reusing Existing Objects 92
Using Smaller Data Structures 95
Iterator-to-Iterator Transformations with mapPartitions 98
What Is an Iterator-to-Iterator Transformation? 99
Space and Time Advantages 100
An Example 101
Set Operations 104
Reducing Setup Overhead 105
Shared Variables 106
Broadcast Variables 106
Accumulators 107
Reusing RDDs 112
Cases for Reuse 112
Deciding if Recompute Is Inexpensive Enough 115
Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 116
Alluxio (nee Tachyon) 120
LRU Caching 121
Noisy Cluster Considerations 122
Interaction with Accumulators 123
Conclusion 124
6. Working with Key/Value Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
The Goldilocks Example 127
Goldilocks Version 0: Iterative Solution 128
How to Use PairRDDFunctions and OrderedRDDFunctions 130
Actions on Key/Value Pairs 131
What’s So Dangerous About the groupByKey Function 132
Goldilocks Version 1: groupByKey Solution 132
Choosing an Aggregation Operation 136
Dictionary of Aggregation Operations with Performance Considerations 136
Multiple RDD Operations 139
Co-Grouping 139
Partitioners and Key/Value Data 140
Using the Spark Partitioner Object 142
Hash Partitioning 142
Range Partitioning 142
Custom Partitioning 143
Preserving Partitioning Information Across Transformations 144
Leveraging Co-Located and Co-Partitioned RDDs 144
Dictionary of Mapping and Partitioning Functions PairRDDFunctions 146
Dictionary of OrderedRDDOperations 147
Sorting by Two Keys with SortByKey 149
Secondary Sort and repartitionAndSortWithinPartitions 149
Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 150
How Not to Sort by Two Orderings 153
Goldilocks Version 2: Secondary Sort 154
A Different Approach to Goldilocks 157
Goldilocks Version 3: Sort on Cell Values 162
Straggler Detection and Unbalanced Data 163
Back to Goldilocks (Again) 165
Goldilocks Version 4: Reduce to Distinct on Each Partition 165
Conclusion 171
7. Going Beyond Scala. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Beyond Scala within the JVM 174
Beyond Scala, and Beyond the JVM 178
How PySpark Works 179
How SparkR Works 187
Spark.jl (Julia Spark) 189
How Eclair JS Works 190
Spark on the Common Language Runtime (CLR)—C# and Friends 191
Calling Other Languages from Spark 191
Using Pipe and Friends 191
JNI 193
Java Native Access (JNA) 196
Underneath Everything Is FORTRAN 196
Getting to the GPU 198
The Future 198
Conclusion 198
8. Testing and Validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Unit Testing 201
General Spark Unit Testing 202
Mocking RDDs 206
Getting Test Data 208
Generating Large Datasets 208
Sampling 209
Property Checking with ScalaCheck 211
Computing RDD Difference 211
Integration Testing 214
Choosing Your Integration Testing Environment 214
Verifying Performance 215
Spark Counters for Verifying Performance 215
Projects for Verifying Performance 216
Job Validation 216
Conclusion 217
9. Spark MLlib and ML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Choosing Between Spark MLlib and Spark ML 219
Working with MLlib 220
Getting Started with MLlib (Organization and Imports) 220
MLlib Feature Encoding and Data Preparation 221
Feature Scaling and Selection 226
MLlib Model Training 226
Predicting 227
Serving and Persistence 228
Model Evaluation 230
Working with Spark ML 231
Spark ML Organization and Imports 231
Pipeline Stages 232
Explain Params 233
Data Encoding 234
Data Cleaning 236
Spark ML Models 237
Putting It All Together in a Pipeline 238
Training a Pipeline 239
Accessing Individual Stages 239
Data Persistence and Spark ML 239
Extending Spark ML Pipelines with Your Own Algorithms 242
Model and Pipeline Persistence and Serving with Spark ML 250
General Serving Considerations 250
Conclusion 251
10. Spark Components and Packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Stream Processing with Spark 255
Sources and Sinks 255
Batch Intervals 257
Data Checkpoint Intervals 258
Considerations for DStreams 259
Considerations for Structured Streaming 260
High Availability Mode (or Handling Driver Failure or Checkpointing) 268
GraphX 269
Using Community Packages and Libraries 269
Creating a Spark Package 271
Conclusion 272
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist. . . . . . . 273
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Preface
We wrote this book for data engineers and data scientists who are looking to get the most out of Spark. If you’ve been working with Spark and invested in it, but your experience so far has been mired in memory errors and mysterious, intermittent failures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature see “Supporting Books and Materials” on page x.
We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, which is usually more intuitive to the data scientist. Thus this book may be most useful to data engineers who are less experienced in thinking critically about the statistical nature, distribution, and layout of data when considering performance. We hope this book will help data engineers think more critically about their data as they put pipelines into production. We want to help our readers ask questions such as “How is my data distributed?”, “Is it skewed?”, “What is the range of values in a column?”, and “How do we expect a given value to group?”, and then apply the answers to those questions to the logic of their Spark queries.
However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you have a better shot at getting something to run the first time. We hope to guide data scientists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to explore their data more fully, more quickly, and to communicate effectively with anyone helping them put their algorithms into production.
Regardless of your job title, it is likely that the amount of data with which you are
working is growing quickly. Your original solutions may need to be scaled, and your
old techniques for solving new problems may need to be updated. We hope this book
will help you leverage Apache Spark to tackle new problems more easily and old
problems more efficiently.
First Edition Notes
You are reading the first edition of High Performance Spark, and for that, we thank you! If you find errors, mistakes, or have ideas for ways to improve this book, please reach out to us at high-performance-spark@googlegroups.com. If you wish to be included in a “thanks” section in future editions of the book, please include your preferred display name.
Supporting Books and Materials
For data scientists and developers new to Spark, Learning Spark by Karau, Konwinski, Wendell, and Zaharia is an excellent introduction (though we may be biased), and Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills is a great book for interested data scientists. For individuals more interested in streaming, the upcoming Learning Spark Streaming by François Garillot may also be of use once it is available.

Beyond books, there is also a collection of intro-level Spark training material available. For individuals who prefer video, Paco Nathan has an excellent introduction video series on O’Reilly. Commercially, Databricks as well as Cloudera and other Hadoop/Spark vendors offer Spark training. Previous recordings of Spark camps, as well as many other great resources, have been posted on the Apache Spark documentation page.

If you don’t have experience with Scala, we do our best to convince you to pick up Scala in Chapter 1, and if you are interested in learning, Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne is a good introduction (although it’s important to note that some of the practices it suggests are not common practice in Spark code).
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Examples prefixed with “Evil” depend heavily on Apache Spark
internals, and will likely break in future minor releases of Apache
Spark. You’ve been warned—but we totally understand you aren’t
going to pay much attention to that because neither would we.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download from the High Performance Spark GitHub repository, and some of the testing code is available at the “Spark Testing Base” GitHub repository and the Spark Validator repo. Structured Streaming machine learning examples, which are generally in the “evil” category discussed under “Conventions Used in This Book” on page xi, are available at https://github.com/holdenk/spark-structured-streaming-ml.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. The code is also available under an Apache 2 License. Incorporating a significant amount of example code from this book into your product’s documentation may require permission.
We appreciate, but do not require, attribution. An attribution usually includes the
title, author, publisher, and ISBN. For example: “High Performance Spark by Holden
Karau and Rachel Warren (O’Reilly). Copyright 2017 Holden Karau, Rachel Warren,
978-1-491-94320-5.”
If you feel your use of code examples falls outside fair use or the permission given
above, feel free to contact us at permissions@oreilly.com.
O’Reilly Safari
Safari (formerly Safari Books Online) is a membership-based
training and reference platform for enterprise, government,
educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact the Authors
For feedback, email us at high-performance-spark@googlegroups.com. For random ramblings, occasionally about Spark, follow us on Twitter:
Holden: http://twitter.com/holdenkarau
Rachel: https://twitter.com/warre_n_peace
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
The authors would like to acknowledge everyone who has helped with comments and suggestions on early drafts of our work. Special thanks to Anya Bida, Jakob Odersky, and Katharine Kearnan for reviewing early drafts and diagrams. We’d like to thank Mahmoud Hanafy for reviewing and improving the sample code as well as early drafts. We’d also like to thank Michael Armbrust for reviewing and providing feedback on early drafts of the SQL chapter. Justin Pihony has been one of the most active early readers, suggesting fixes in every respect (language, formatting, etc.).
Thanks to all of the readers of our O’Reilly early release who have provided feedback
on various errata, including Kanak Kshetri and Rubén Berenguel.
Finally, thank you to our respective employers for being understanding as we’ve
worked on this book. Especially Lawrence Spracklen who insisted we mention him
here :p.
CHAPTER 1
Introduction to High Performance Spark
This chapter provides an overview of what we hope you will be able to learn from this book and does its best to convince you to learn Scala. Feel free to skip ahead to Chapter 2 if you already know what you’re looking for and use Scala (or have your heart set on another language).
What Is Spark and Why Performance Matters
Apache Spark is a high-performance, general-purpose distributed computing system that has become the most active Apache open source project, with more than 1,000 active contributors (from http://spark.apache.org/). Spark enables us to process large quantities of data, beyond what can fit on a single machine, with a high-level, relatively easy-to-use API. Spark’s design and interface are unique, and it is one of the fastest systems of its kind. Uniquely, Spark allows us to write the logic of data transformations and machine learning algorithms in a way that is parallelizable, but relatively system agnostic. So it is often possible to write computations that are fast for distributed storage systems of varying kind and size.
However, despite its many advantages and the excitement around Spark, the simplest implementation of many common data science routines in Spark can be much slower and much less robust than the best version. Since the computations we are concerned with may involve data at a very large scale, the time and resources saved by tuning code for performance are enormous. Performance does not just mean run faster; often at this scale it means getting something to run at all. It is possible to construct a Spark query that fails on gigabytes of data but, when refactored and adjusted with an eye toward the structure of the data and the requirements of the cluster, succeeds on the same system with terabytes of data. In the authors’ experience writing production Spark code, we have seen the same tasks, run on the same clusters, run 100× faster using some of the optimizations discussed in this book. In terms of data processing, time is money, and we hope this book pays for itself through a reduction in data infrastructure costs and developer hours.
Not all of these techniques are applicable to every use case. Especially because Spark is highly configurable, and is exposed at a higher level than other computational frameworks of comparable power, we can reap tremendous benefits just by becoming more attuned to the shape and structure of our data. Some techniques can work well on certain data sizes or even certain key distributions, but not all. The simplest example of this is how, for many problems, using groupByKey in Spark can very easily cause the dreaded out-of-memory exceptions, while for data with few duplicates this operation can be just as quick as the alternatives that we will present. Learning to understand your particular use case and system, and how Spark will interact with it, is a must to solve the most complex data science problems with Spark.
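To make this concrete, here is a minimal sketch (our own illustration, not an example from this book) of a per-key sum written both ways; for data with many duplicate or skewed keys, the second version is far safer:

import org.apache.spark.rdd.RDD

// groupByKey materializes every value for a key on a single executor
// before summing, which is what can trigger out-of-memory failures
// for keys with many duplicate records.
def sumWithGroupByKey(rdd: RDD[(String, Int)]): RDD[(String, Int)] = {
  rdd.groupByKey().mapValues(_.sum)
}

// reduceByKey combines values map-side before the shuffle, so only a
// single running sum per key is held in memory and sent over the wire.
def sumWithReduceByKey(rdd: RDD[(String, Int)]): RDD[(String, Int)] = {
  rdd.reduceByKey(_ + _)
}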
What You Can Expect to Get from This Book
Our hope is that this book will help you take your Spark queries and make them faster, able to handle larger data sizes, and use fewer resources. This book covers a broad range of tools and scenarios. You will likely pick up some techniques that might not apply to the problems you are working with, but that might apply to a problem in the future and may help shape your understanding of Spark more generally. The chapters in this book are written with enough context to allow the book to be used as a reference; however, the structure of this book is intentional and reading the sections in order should give you not only a few scattered tips, but a comprehensive understanding of Apache Spark and how to make it sing.
It’s equally important to point out what you will likely not get from this book. This book is not intended to be an introduction to Spark or Scala; several other books and video series are available to get you started. The authors may be a little biased in this regard, but we think Learning Spark by Karau, Konwinski, Wendell, and Zaharia as well as Paco Nathan’s introduction video series are excellent options for Spark beginners. While this book is focused on performance, it is not an operations book, so topics like setting up a cluster and multitenancy are not covered. We are assuming that you already have a way to use Spark in your system, so we won’t provide much assistance in making higher-level architecture decisions. There are future books in the works, by other authors, on the topic of Spark operations that may be done by the time you are reading this one. If operations are your show, or if there isn’t anyone responsible for operations in your organization, we hope those books can help you.
Spark Versions
Spark follows semantic versioning, with the standard [MAJOR].[MINOR].[MAINTENANCE] scheme and with API stability for public nonexperimental nondeveloper APIs within minor and maintenance releases. Many of these experimental components are some of the more exciting from a performance standpoint, including Datasets—Spark SQL’s new structured, strongly typed data abstraction. Spark also tries for binary API compatibility between releases, using MiMa (the Migration Manager for Scala, which tries to catch binary incompatibilities between releases); so if you are using the stable API you generally should not need to recompile to run a job against a new version of Spark unless the major version has changed.
This book was created using the Spark 2.0.1 APIs, but much of the
code will work in earlier versions of Spark as well. In places where
this is not the case we have attempted to call that out.
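To make the versioning discussion concrete, the following is a minimal sbt sketch (ours, not from the book) pinning the Spark version used in this text; the exact version numbers are illustrative assumptions, and marking Spark as "provided" is a common convention rather than a requirement:

// build.sbt -- a minimal sketch; version numbers are illustrative.
name := "high-performance-spark-examples"

scalaVersion := "2.11.8"

// "provided" keeps the Spark JARs out of the assembly, since the
// cluster supplies its own Spark distribution at runtime.
libraryDependencies +=
  "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"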
Why Scala?
In this book, we will focus on Spark’s Scala API and assume a working knowledge of
Scala. Part of this decision is simply in the interest of time and space; we trust readers
wanting to use Spark in another language will be able to translate the concepts used
in this book without presenting the examples in Java and Python. More importantly,
it is the belief of the authors that “serious” performant Spark development is most
easily achieved in Scala.
To be clear, these reasons are very specific to using Spark with Scala; there are many
more general arguments for (and against) Scala’s applications in other contexts.
To Be a Spark Expert You Have to Learn a Little Scala Anyway
Although Python and Java are more commonly used languages, learning Scala is a worthwhile investment for anyone interested in delving deep into Spark development. Spark’s documentation can be uneven. However, the readability of the codebase is world-class. Perhaps more than with other frameworks, the advantages of cultivating a sophisticated understanding of the Spark codebase are integral to the advanced Spark user. Because Spark is written in Scala, it will be difficult to interact with the Spark source code without the ability, at least, to read Scala code. Furthermore, the methods in the Resilient Distributed Datasets (RDD) class closely mimic those in the Scala collections API. RDD functions, such as map, filter, flatMap, reduce, and fold, have nearly identical specifications to their Scala equivalents (although, as we explore in this book, the performance implications and evaluation semantics are quite different). Fundamentally Spark is a functional framework, relying heavily on concepts like immutability and lambda definition, so using the Spark API may be more intuitive with some knowledge of functional programming.
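As a small illustration of this parallel (our own sketch, not an example from the text), the same map/filter/reduce chain reads nearly identically on a local Scala collection and on an RDD:

import org.apache.spark.SparkContext

def sumOfEvenSquares(sc: SparkContext): (Int, Int) = {
  // Plain Scala collections version.
  val local = (1 to 100).map(x => x * x).filter(_ % 2 == 0).reduce(_ + _)

  // RDD version: the same method names and lambda syntax, but each
  // step is distributed across the cluster and lazily evaluated.
  val distributed = sc.parallelize(1 to 100)
    .map(x => x * x)
    .filter(_ % 2 == 0)
    .reduce(_ + _)

  (local, distributed)
}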
The Spark Scala API Is Easier to Use Than the Java API
Once you have learned Scala, you will quickly find that writing Spark in Scala is less painful than writing Spark in Java. First, writing Spark in Scala is significantly more concise than writing Spark in Java, since Spark relies heavily on inline function definitions and lambda expressions, which are much more naturally supported in Scala (especially before Java 8). Second, the Spark shell can be a powerful tool for debugging and development, and is only available in languages with existing REPLs (Scala, Python, and R).
Scala Is More Performant Than Python
It can be attractive to write Spark in Python, since it is easy to learn, quick to write,
interpreted, and includes a very rich set of data science toolkits. However, Spark code
written in Python is often slower than equivalent code written in the JVM, since Scala
is statically typed, and the cost of JVM communication (from Python to Scala) can be
very high. Last, Spark features are generally written in Scala first and then translated into Python, so to use cutting-edge Spark functionality, you will need to be in the JVM; Python support for MLlib and Spark Streaming is particularly behind.
Why Not Scala?
There are several good reasons to develop with Spark in other languages. One of the more important constant reasons is developer/team preference. Existing code, both internal and in libraries, can also be a strong reason to use a different language. Python is one of the most supported languages today. While writing Java code can be clunky and sometimes lag slightly in terms of API, there is very little performance cost to writing in another JVM language (at most some object conversions). Of course, in performance, every rule has its exception: mapPartitions in Spark 1.6 and earlier in Java suffers some severe performance restrictions that we discuss in “Iterator-to-Iterator Transformations with mapPartitions” on page 98.
While all of the examples in this book are presented in Scala for the
final release, we will port many of the examples from Scala to Java
and Python where the differences in implementation could be
important. These will be available (over time) at our GitHub. If you
find yourself wanting a specific example ported, please either email
us or create an issue on the GitHub repo.
Spark SQL does much to minimize the performance difference when using a non-JVM language. Chapter 7 looks at options to work effectively in Spark with languages outside of the JVM, including Spark’s supported languages of Python and R. This section also offers guidance on how to use Fortran, C, and GPU-specific code to reap additional performance improvements. Even if we are developing most of our Spark application in Scala, we shouldn’t feel tied to doing everything in Scala, because specialized libraries in other languages can be well worth the overhead of going outside the JVM.
Learning Scala
If after all of this we’ve convinced you to use Scala, there are several excellent options for learning it. Spark 1.6 is built against Scala 2.10 and cross-compiled against Scala 2.11, and Spark 2.0 is built against Scala 2.11, possibly cross-compiled against Scala 2.10, and may add 2.12 support in the future. Depending on how much we’ve convinced you to learn Scala, and what your resources are, there are a number of different options, ranging from books to massive open online courses (MOOCs) to professional training.
For books, Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne can be great, although much of the actor system material is not relevant while working in Spark. The Scala language website also maintains a list of Scala books.
In addition to books focused on Spark, there are online courses for learning Scala. Functional Programming Principles in Scala, taught by Martin Odersky, its creator, is on Coursera, as is Introduction to Functional Programming on edX. A number of different companies also offer video-based Scala courses, none of which the authors have personally experienced or recommend.
For those who prefer a more interactive approach, professional training is offered by
a number of different companies, including Lightbend (formerly Typesafe). While we
have not directly experienced Typesafe training, it receives positive reviews and is
known especially to help bring a team or group of individuals up to speed with Scala
for the purposes of working with Spark.
Conclusion
Although you will likely be able to get the most out of Spark performance if you have
an understanding of Scala, working in Spark does not require a knowledge of Scala.
For those whose problems are better suited to other languages or tools, techniques for
working with other languages will be covered in Chapter 7. This book is aimed at
individuals who already have a grasp of the basics of Spark, and we thank you for
choosing High Performance Spark to deepen your knowledge of Spark. The next
chapter will introduce some of Spark’s general design and evaluation paradigms that
are important to understanding how to efficiently utilize Spark.
CHAPTER 2
How Spark Works

This chapter introduces the overall design of Spark as well as its place in the big data ecosystem. Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop.1 As we will discuss in this chapter, Spark’s design principles are quite different from those of MapReduce. Unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop—although it often is. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ.2 However, Spark’s internals, especially how it handles failures, differ from many traditional systems. Spark’s ability to leverage lazy evaluation within memory computations makes it particularly unique. Spark’s creators believe it to be the first high-level programming language for fast, distributed data processing.3

1 MapReduce is a programmatic paradigm that defines programs in terms of map procedures that filter and sort data onto the nodes of a distributed system, and reduce procedures that aggregate the data on the mapper nodes. Implementations of MapReduce have been written in many languages, but the term usually refers to a popular implementation called Hadoop MapReduce, packaged with the distributed filesystem, Apache Hadoop Distributed File System.
2 DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset, and then exposes functions to transform data as methods defined on that dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark’s. However, DryadLINQ doesn’t use in-memory storage. For more information see the DryadLINQ documentation.
3 See the original Spark Paper and other Spark papers.

To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. In this chapter, we will provide a broad overview of Spark’s model of parallel computing and a thorough explanation of the Spark scheduler and execution engine. We will refer to the concepts in this chapter throughout the text. Further, we hope this explanation will provide you with a more precise understanding of some of the terms you’ve heard tossed around by other Spark users and encounter in the Spark documentation.
How Spark Fits into the Big Data Ecosystem
Apache Spark is an open source framework that provides methods to process data in parallel that are generalizable; the same high-level Spark functions can be used to perform disparate data processing tasks on data of different sizes and structures. On its own, Spark is not a data storage solution; it performs computations on Spark JVMs (Java Virtual Machines) that last only for the duration of a Spark application. Spark can be run locally on a single machine with a single JVM (called local mode). More often, Spark is used in tandem with a distributed storage system (e.g., HDFS, Cassandra, or S3) and a cluster manager—the storage system to house the data processed with Spark, and the cluster manager to orchestrate the distribution of Spark applications across the cluster. Spark currently supports three kinds of cluster managers: Standalone Cluster Manager, Apache Mesos, and Hadoop YARN (see Figure 2-1). The Standalone Cluster Manager is included in Spark, but using the Standalone manager requires installing Spark on each node of the cluster.
Figure 2-1. A diagram of the data processing ecosystem including Spark
Spark Components
Spark provides a high-level query language to process data. Spark Core, the main data processing framework in the Spark ecosystem, has APIs in Scala, Java, Python, and R. Spark is built around a data abstraction called Resilient Distributed Datasets (RDDs). RDDs are a representation of lazily evaluated, statically typed, distributed collections. RDDs have a number of predefined “coarse-grained” transformations (functions that are applied to the entire dataset), such as map, join, and reduce to manipulate the distributed datasets, as well as I/O functionality to read and write data between the distributed storage system and the Spark JVMs.

While Spark also supports R, at present the RDD interface is not available in that language. We will cover tips for using Java, Python, R, and other languages in detail in Chapter 7.
In addition to Spark Core, the Spark ecosystem includes a number of other first-party components, including Spark SQL, Spark MLlib, Spark ML, Spark Streaming, and GraphX (GraphX is no longer actively developed, and will likely be replaced with GraphFrames or similar), which provide more specific data processing functionality. Some of these components have the same generic performance considerations as the Core; MLlib, for example, is written almost entirely on the Spark API. However, some of them have unique considerations. Spark SQL, for example, has a different query optimizer than Spark Core.
Spark SQL is a component that can be used in tandem with Spark Core. It has APIs in Scala, Java, Python, and R, and supports basic SQL queries. Spark SQL defines an interface for a semi-structured data type called DataFrames and, as of Spark 1.6, a semi-structured, typed version of RDDs called Datasets (Datasets and DataFrames are unified in Spark 2.0: DataFrames are Datasets of “Row” objects that can be accessed by field number). Spark SQL is a very important component for Spark performance, and much of what can be accomplished with Spark Core can be done by leveraging Spark SQL. We will cover Spark SQL in detail in Chapter 3 and compare the performance of joins in Spark SQL and Spark Core in Chapter 4.
Spark has two machine learning packages: ML and MLlib. MLlib is a package of machine learning and statistics algorithms written with Spark. Spark ML is still in its early stages, and has only existed since Spark 1.2. Spark ML provides a higher-level API than MLlib, with the goal of allowing users to more easily create practical machine learning pipelines. Spark MLlib is primarily built on top of RDDs and uses functions from Spark Core, while ML is built on top of Spark SQL DataFrames (see the MLlib documentation). Eventually the Spark community plans to move over to ML and deprecate MLlib. Spark ML and MLlib both have additional performance considerations beyond those of Spark Core and Spark SQL—we cover some of these in Chapter 9.
Spark Streaming uses the scheduling of the Spark Core for streaming analytics on minibatches of data. Spark Streaming has a number of unique considerations, such as the window sizes used for batches. We offer some tips for using Spark Streaming in “Stream Processing with Spark” on page 255.
GraphX is a graph processing framework built on top of Spark with an API for graph
computations. GraphX is one of the least mature components of Spark, so we don’t
cover it in much detail. In future versions of Spark, typed graph functionality will be
introduced on top of the Dataset API. We will provide a cursory glance at GraphX in
“GraphX” on page 269.
This book will focus on optimizing programs written with the Spark Core and Spark
SQL. However, since MLlib and the other frameworks are written using the Spark
API, this book will provide the tools you need to leverage those frameworks more
efficiently. Maybe by the time you’re done, you will be ready to start contributing
your own functions to MLlib and ML!
In addition to these first-party components, the community has written a number of libraries that provide additional functionality, such as for testing or parsing CSVs, and offer tools to connect Spark to different data sources. Many libraries are listed at http://spark-packages.org/, and can be dynamically included at runtime with spark-submit or the spark-shell, or added as build dependencies to your maven or sbt project. We first use Spark packages to add support for CSV data in “Additional formats” on page 59, and then in more detail in “Using Community Packages and Libraries” on page 269.
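As an example of launch-time inclusion (a sketch of ours, with illustrative package coordinates that depend on your Spark and Scala versions):

# Pull the spark-csv package from spark-packages when starting the
# shell; Spark resolves and downloads the JARs before the REPL starts.
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0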
Spark Model of Parallel Computing: RDDs
Spark allows users to write a program for the driver (or master node) on a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs—immutable, distributed collections of objects—which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system. The Spark cluster manager handles starting and distributing the Spark executors across a distributed system according to the configuration parameters set by the Spark application. The Spark execution engine itself distributes data across the executors for a computation. (See Figure 2-4.)
Rather than evaluating each transformation as soon as it is specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver). Spark can keep an RDD loaded in-memory on the executor nodes throughout the life of a Spark application for faster access in repeated computations. As they are implemented in Spark, RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the existing one. As we will explore in this chapter, this paradigm of lazy evaluation, in-memory storage, and immutability allows Spark to be easy-to-use, fault-tolerant, scalable, and efficient.
Lazy Evaluation
Many other systems for in-memory storage are based on “fine-grained” updates to mutable objects, i.e., calls to a particular cell in a table, achieved by storing intermediate results. In contrast, evaluation of RDDs is completely lazy. Spark does not begin computing the partitions until an action is called. An action is a Spark operation that returns something other than an RDD, triggering evaluation of partitions and possibly returning some output to a non-Spark system (outside of the Spark executors); for example, bringing data back to the driver (with operations like count or collect) or writing data to an external storage system (such as copyToHadoop). Actions trigger the scheduler, which builds a directed acyclic graph (called the DAG) based on the dependencies between RDD transformations. In other words, Spark evaluates an action by working backward to define the series of steps it has to take to produce each object in the final distributed dataset (each partition). Then, using this series of steps, called the execution plan, the scheduler computes the missing partitions for each stage until it computes the result.
Not all transformations are 100% lazy. sortByKey needs to evaluate the RDD to determine the range of data, so it involves both a transformation and an action.
Performance and usability advantages of lazy evaluation
Lazy evaluation allows Spark to combine operations that don’t require communication with the driver (called transformations with one-to-one dependencies) to avoid doing multiple passes through the data. For example, suppose a Spark program calls a map and a filter function on the same RDD. Spark can send the instructions for both the map and the filter to each executor. Then Spark can perform both the map and filter on each partition, which requires accessing the records only once, rather than sending two sets of instructions and accessing each partition twice. This theoretically reduces the computational complexity by half.

Spark’s lazy evaluation paradigm is not only more efficient, it also makes it easier to implement the same logic in Spark than in a different framework—like MapReduce—that requires the developer to do the work to consolidate her mapping operations. Spark’s clever lazy evaluation strategy lets us be lazy and express the same logic in far fewer lines of code: we can chain together operations with narrow dependencies and let the Spark evaluation engine do the work of consolidating them, as in the sketch below.
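This is a minimal sketch of that consolidation (ours, not one of the book’s numbered examples); the chain involves only narrow dependencies, so Spark can run the map and the filter together in one pass over each partition when the count action finally fires:

import org.apache.spark.rdd.RDD

def countPalindromes(rdd: RDD[String]): Long = {
  // Neither transformation runs when these lines execute; Spark only
  // records them in the lineage of the resulting RDDs.
  val normalized = rdd.map(_.trim.toLowerCase)
  val palindromes = normalized.filter(s => s.nonEmpty && s == s.reverse)

  // count is an action: only now are both functions shipped to the
  // executors, where they run back-to-back on each partition.
  palindromes.count()
}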
Consider the classic word count example that, given a dataset of documents, parses the text into words and then computes the count for each word. The Apache docs provide a word count example, which even in its simplest form comprises roughly fifty lines of code (excluding import statements) in Java. A comparable Spark implementation is roughly fifteen lines of code in Java and five in Scala, and is available on the Apache website. The example excludes the steps to read in the data, simply mapping documents to words and counting the words. We have reproduced it in Example 2-1.
Example 2-1. Simple Scala word count example

def simpleWordCount(rdd: RDD[String]): RDD[(String, Int)] = {
  val words = rdd.flatMap(_.split(" "))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
A further benefit of the Spark implementation of word count is that it is easier to modify and improve. Suppose that we now want to modify this function to filter out some “stop words” and punctuation from each document before computing the word count. In MapReduce, this would require adding the filter logic to the mapper to avoid doing a second pass through the data. An implementation of this routine for MapReduce can be found at https://github.com/kite-sdk/kite/wiki/WordCount-Version-Three. In contrast, we can modify the preceding Spark routine by simply putting a filter step before the map step that creates the key/value pairs. Example 2-2 shows how Spark’s lazy evaluation will consolidate the map and filter steps for us.
Example 2-2. Word count example with stop words filtered

def withStopWordsFiltered(rdd: RDD[String], illegalTokens: Array[Char],
    stopWords: Set[String]): RDD[(String, Int)] = {
  val separators = illegalTokens ++ Array[Char](' ')
  val tokens: RDD[String] = rdd.flatMap(_.split(separators)
    .map(_.trim.toLowerCase))
  val words = tokens.filter(token =>
    !stopWords.contains(token) && (token.length > 0))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
Lazy evaluation and fault tolerance
Spark is fault-tolerant, meaning Spark will not fail, lose data, or return inaccurate results in the event of a host machine or network failure. Spark’s unique method of fault tolerance is achieved because each partition of the data contains the dependency information needed to recalculate the partition. Most distributed computing paradigms that allow users to work with mutable objects provide fault tolerance by logging updates or duplicating data across machines.

In contrast, Spark does not need to maintain a log of updates to each RDD or log the actual intermediary steps, since the RDD itself contains all the dependency information needed to replicate each of its partitions. Thus, if a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be parallelized to make recovery faster.
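You can inspect this lineage information yourself; the following sketch (ours, not from the text) uses the toDebugString method on RDDs to print the chain of dependencies Spark would replay to recompute a lost partition:

import org.apache.spark.SparkContext

def printLineage(sc: SparkContext): Unit = {
  val counts = sc.parallelize(Seq("a", "b", "a"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Prints the parent RDDs and the shuffle boundary introduced by
  // reduceByKey -- the information used for fault recovery.
  println(counts.toDebugString)
}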
Lazy evaluation and debugging
Lazy evaluation has important consequences for debugging, since it means that a Spark program will fail only at the point of action. For example, suppose that you were using the word count example, and afterwards were collecting the results to the driver. If the value you passed in for the stop words was null (maybe because it was the result of a Java program), the code would of course fail with a null pointer exception in the contains check. However, this failure would not appear until the program evaluated the collect step. Even the stack trace will show the failure as first occurring at the collect step, suggesting that the failure came from the collect statement. For this reason it is probably most efficient to develop in an environment that gives you access to complete debugging information.
Because of lazy evaluation, stack traces from failed Spark jobs
(especially when embedded in larger systems) will often appear to
fail consistently at the point of the action, even if the problem in
the logic occurs in a transformation much earlier in the program.
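A minimal sketch of this failure mode (our own illustration, echoing the stop-words scenario above): the null is introduced when the filter is defined, but the exception only surfaces at the collect action:

import org.apache.spark.rdd.RDD

def filterStopWords(rdd: RDD[String], stopWords: Set[String]): Array[String] = {
  // If stopWords is null, this line still "succeeds": the filter is
  // merely recorded in the lineage, not executed.
  val filtered = rdd.filter(word => !stopWords.contains(word))

  // The NullPointerException surfaces here, at the action, and the
  // stack trace points at collect rather than at the filter above.
  filtered.collect()
}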
In-Memory Persistence and Memory Management
Spark’s performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance increase is due to Spark’s use of in-memory persistence. Rather than writing to disk between each pass through the data, Spark has the option of keeping the data on the executors loaded into memory. That way, the data on each partition is available in-memory each time it needs to be accessed.

Spark offers three options for memory management: in-memory as deserialized data, in-memory as serialized data, and on disk. Each has different space and time advantages:

In memory as deserialized Java objects
The most intuitive way to store objects in RDDs is as the original deserialized Java objects that are defined by the driver program. This form of in-memory storage avoids the cost of serialization, but can take up considerably more space than its serialized counterpart.
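These options correspond to storage levels on the persist method; the following sketch (ours, not the book’s example) shows how each would be requested:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def persistExamples(rdd: RDD[String]): Unit = {
  // Deserialized Java objects in memory (fast to access, space hungry).
  rdd.persist(StorageLevel.MEMORY_ONLY)

  // The alternatives below are shown commented out because an RDD's
  // storage level can only be set once:
  // rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized bytes in memory
  // rdd.persist(StorageLevel.DISK_ONLY)       // stored on local disk
}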
  • 1. High Performance Spark Best Practices for Scaling and Optimizing Apache Spark 1st Edition Holden Karau download https://textbookfull.com/product/high-performance-spark-best- practices-for-scaling-and-optimizing-apache-spark-1st-edition- holden-karau/ Download full version ebook from https://textbookfull.com
  • 2. We believe these products will be a great fit for you. Click the link to download now, or visit textbookfull.com to discover even more! Stream Processing with Apache Spark Mastering Structured Streaming and Spark Streaming 1st Edition Gerard Maas https://textbookfull.com/product/stream-processing-with-apache- spark-mastering-structured-streaming-and-spark-streaming-1st- edition-gerard-maas/ Introducing .NET for Apache Spark: Distributed Processing for Massive Datasets 1st Edition Ed Elliott https://textbookfull.com/product/introducing-net-for-apache- spark-distributed-processing-for-massive-datasets-1st-edition-ed- elliott/ Spark in Action - Second Edition: Covers Apache Spark 3 with Examples in Java, Python, and Scala Jean-Georges Perrin https://textbookfull.com/product/spark-in-action-second-edition- covers-apache-spark-3-with-examples-in-java-python-and-scala- jean-georges-perrin/ Graph Algorithms Practical Examples in Apache Spark and Neo4j 1st Edition Mark Needham https://textbookfull.com/product/graph-algorithms-practical- examples-in-apache-spark-and-neo4j-1st-edition-mark-needham/
  • 3. Apache Spark 2 x Cookbook Cloud ready recipes for analytics and data science Rishi Yadav https://textbookfull.com/product/apache-spark-2-x-cookbook-cloud- ready-recipes-for-analytics-and-data-science-rishi-yadav/ Big Data SMACK A Guide to Apache Spark Mesos Akka Cassandra and Kafka 1st Edition Raul Estrada https://textbookfull.com/product/big-data-smack-a-guide-to- apache-spark-mesos-akka-cassandra-and-kafka-1st-edition-raul- estrada/ Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud Robert Ilijason https://textbookfull.com/product/beginning-apache-spark-using- azure-databricks-unleashing-large-cluster-analytics-in-the-cloud- robert-ilijason/ Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud 1st Edition Robert Ilijason https://textbookfull.com/product/beginning-apache-spark-using- azure-databricks-unleashing-large-cluster-analytics-in-the- cloud-1st-edition-robert-ilijason/ Spark GraphX in Action 1st Edition Michael Malak https://textbookfull.com/product/spark-graphx-in-action-1st- edition-michael-malak/
  • 4. Holden Karau & Rachel Warren High Performance Spark BEST PRACTICES FOR SCALING & OPTIMIZING APACHE SPARK
  • 6. Holden Karau and Rachel Warren High Performance Spark Best Practices for Scaling and Optimizing Apache Spark Boston Farnham Sebastopol Tokyo Beijing Boston Farnham Sebastopol Tokyo Beijing
  • 7. 978-1-491-94320-5 [LSI] High Performance Spark by Holden Karau and Rachel Warren Copyright © 2017 Holden Karau, Rachel Warren. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐ tutional sales department: 800-998-9938 or [email protected]. Editor: Shannon Cutt Indexer: Ellen Troutman-Zaig Production Editor: Kristen Brown Interior Designer: David Futato Copyeditor: Kim Cofer Cover Designer: Karen Montgomery Proofreader: James Fraleigh Illustrator: Rebecca Demarest June 2017: First Edition Revision History for the First Edition 2017-05-22: First Release The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. High Performance Spark, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
  • 8. Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1. Introduction to High Performance Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 What Is Spark and Why Performance Matters 1 What You Can Expect to Get from This Book 2 Spark Versions 3 Why Scala? 3 To Be a Spark Expert You Have to Learn a Little Scala Anyway 3 The Spark Scala API Is Easier to Use Than the Java API 4 Scala Is More Performant Than Python 4 Why Not Scala? 4 Learning Scala 5 Conclusion 6 2. How Spark Works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 How Spark Fits into the Big Data Ecosystem 8 Spark Components 8 Spark Model of Parallel Computing: RDDs 10 Lazy Evaluation 11 In-Memory Persistence and Memory Management 13 Immutability and the RDD Interface 14 Types of RDDs 16 Functions on RDDs: Transformations Versus Actions 17 Wide Versus Narrow Dependencies 17 Spark Job Scheduling 19 Resource Allocation Across Applications 20 The Spark Application 20 The Anatomy of a Spark Job 22 iii
  • 9. The DAG 22 Jobs 23 Stages 23 Tasks 24 Conclusion 26 3. DataFrames, Datasets, and Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Getting Started with the SparkSession (or HiveContext or SQLContext) 28 Spark SQL Dependencies 30 Managing Spark Dependencies 31 Avoiding Hive JARs 32 Basics of Schemas 33 DataFrame API 36 Transformations 36 Multi-DataFrame Transformations 48 Plain Old SQL Queries and Interacting with Hive Data 49 Data Representation in DataFrames and Datasets 49 Tungsten 50 Data Loading and Saving Functions 51 DataFrameWriter and DataFrameReader 51 Formats 52 Save Modes 61 Partitions (Discovery and Writing) 61 Datasets 62 Interoperability with RDDs, DataFrames, and Local Collections 62 Compile-Time Strong Typing 64 Easier Functional (RDD “like”) Transformations 64 Relational Transformations 64 Multi-Dataset Relational Transformations 65 Grouped Operations on Datasets 65 Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 66 Query Optimizer 69 Logical and Physical Plans 69 Code Generation 69 Large Query Plans and Iterative Algorithms 70 Debugging Spark SQL Queries 70 JDBC/ODBC Server 70 Conclusion 72 4. Joins (SQL and Core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Core Spark Joins 73 iv | Table of Contents
  • 10. Choosing a Join Type 75 Choosing an Execution Plan 76 Spark SQL Joins 79 DataFrame Joins 79 Dataset Joins 83 Conclusion 84 5. Effective Transformations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Narrow Versus Wide Transformations 86 Implications for Performance 88 Implications for Fault Tolerance 89 The Special Case of coalesce 89 What Type of RDD Does Your Transformation Return? 90 Minimizing Object Creation 92 Reusing Existing Objects 92 Using Smaller Data Structures 95 Iterator-to-Iterator Transformations with mapPartitions 98 What Is an Iterator-to-Iterator Transformation? 99 Space and Time Advantages 100 An Example 101 Set Operations 104 Reducing Setup Overhead 105 Shared Variables 106 Broadcast Variables 106 Accumulators 107 Reusing RDDs 112 Cases for Reuse 112 Deciding if Recompute Is Inexpensive Enough 115 Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 116 Alluxio (nee Tachyon) 120 LRU Caching 121 Noisy Cluster Considerations 122 Interaction with Accumulators 123 Conclusion 124 6. Working with Key/Value Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 The Goldilocks Example 127 Goldilocks Version 0: Iterative Solution 128 How to Use PairRDDFunctions and OrderedRDDFunctions 130 Actions on Key/Value Pairs 131 What’s So Dangerous About the groupByKey Function 132 Goldilocks Version 1: groupByKey Solution 132 Table of Contents | v
  • 11. Choosing an Aggregation Operation 136 Dictionary of Aggregation Operations with Performance Considerations 136 Multiple RDD Operations 139 Co-Grouping 139 Partitioners and Key/Value Data 140 Using the Spark Partitioner Object 142 Hash Partitioning 142 Range Partitioning 142 Custom Partitioning 143 Preserving Partitioning Information Across Transformations 144 Leveraging Co-Located and Co-Partitioned RDDs 144 Dictionary of Mapping and Partitioning Functions PairRDDFunctions 146 Dictionary of OrderedRDDOperations 147 Sorting by Two Keys with SortByKey 149 Secondary Sort and repartitionAndSortWithinPartitions 149 Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 150 How Not to Sort by Two Orderings 153 Goldilocks Version 2: Secondary Sort 154 A Different Approach to Goldilocks 157 Goldilocks Version 3: Sort on Cell Values 162 Straggler Detection and Unbalanced Data 163 Back to Goldilocks (Again) 165 Goldilocks Version 4: Reduce to Distinct on Each Partition 165 Conclusion 171 7. Going Beyond Scala. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Beyond Scala within the JVM 174 Beyond Scala, and Beyond the JVM 178 How PySpark Works 179 How SparkR Works 187 Spark.jl (Julia Spark) 189 How Eclair JS Works 190 Spark on the Common Language Runtime (CLR)—C# and Friends 191 Calling Other Languages from Spark 191 Using Pipe and Friends 191 JNI 193 Java Native Access (JNA) 196 Underneath Everything Is FORTRAN 196 Getting to the GPU 198 The Future 198 Conclusion 198 vi | Table of Contents
  • 12. 8. Testing and Validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Unit Testing 201 General Spark Unit Testing 202 Mocking RDDs 206 Getting Test Data 208 Generating Large Datasets 208 Sampling 209 Property Checking with ScalaCheck 211 Computing RDD Difference 211 Integration Testing 214 Choosing Your Integration Testing Environment 214 Verifying Performance 215 Spark Counters for Verifying Performance 215 Projects for Verifying Performance 216 Job Validation 216 Conclusion 217 9. Spark MLlib and ML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 Choosing Between Spark MLlib and Spark ML 219 Working with MLlib 220 Getting Started with MLlib (Organization and Imports) 220 MLlib Feature Encoding and Data Preparation 221 Feature Scaling and Selection 226 MLlib Model Training 226 Predicting 227 Serving and Persistence 228 Model Evaluation 230 Working with Spark ML 231 Spark ML Organization and Imports 231 Pipeline Stages 232 Explain Params 233 Data Encoding 234 Data Cleaning 236 Spark ML Models 237 Putting It All Together in a Pipeline 238 Training a Pipeline 239 Accessing Individual Stages 239 Data Persistence and Spark ML 239 Extending Spark ML Pipelines with Your Own Algorithms 242 Model and Pipeline Persistence and Serving with Spark ML 250 General Serving Considerations 250 Conclusion 251 Table of Contents | vii
  • 13. 10. Spark Components and Packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 Stream Processing with Spark 255 Sources and Sinks 255 Batch Intervals 257 Data Checkpoint Intervals 258 Considerations for DStreams 259 Considerations for Structured Streaming 260 High Availability Mode (or Handling Driver Failure or Checkpointing) 268 GraphX 269 Using Community Packages and Libraries 269 Creating a Spark Package 271 Conclusion 272 A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist. . . . . . . 273 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 viii | Table of Contents
  • 14. Preface We wrote this book for data engineers and data scientists who are looking to get the most out of Spark. If you’ve been working with Spark and invested in Spark but your experience so far has been mired by memory errors and mysterious, intermittent fail‐ ures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature see “Supporting Books and Materials” on page x. We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, usually more intuitive to the data scientist. Thus it may be more useful to a data engineer who may be less experienced with thinking criti‐ cally about the statistical nature, distribution, and layout of data when considering performance. We hope that this book will help data engineers think more critically about their data as they put pipelines into production. We want to help our readers ask questions such as “How is my data distributed?”, “Is it skewed?”, “What is the range of values in a column?”, and “How do we expect a given value to group?” and then apply the answers to those questions to the logic of their Spark queries. However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you may have a better shot of getting something to run the first time. We hope to guide data scien‐ tists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to Preface | ix
  • 15. 1 Though we may be biased. 2 Although it’s important to note that some of the practices suggested in this book are not common practice in Spark code. explore their data more fully, more quickly, and to communicate effectively with any‐ one helping them put their algorithms into production. Regardless of your job title, it is likely that the amount of data with which you are working is growing quickly. Your original solutions may need to be scaled, and your old techniques for solving new problems may need to be updated. We hope this book will help you leverage Apache Spark to tackle new problems more easily and old problems more efficiently. First Edition Notes You are reading the first edition of High Performance Spark, and for that, we thank you! If you find errors, mistakes, or have ideas for ways to improve this book, please reach out to us at [email protected]. If you wish to be included in a “thanks” section in future editions of the book, please include your pre‐ ferred display name. Supporting Books and Materials For data scientists and developers new to Spark, Learning Spark by Karau, Konwin‐ ski, Wendell, and Zaharia is an excellent introduction,1 and Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills is a great book for interested data scientists. For individuals more interested in streaming, the upcoming Learning Spark Streaming by François Garillot may also be of use once it is available. Beyond books, there is also a collection of intro-level Spark training material avail‐ able. For individuals who prefer video, Paco Nathan has an excellent introduction video series on O’Reilly. Commercially, Databricks as well as Cloudera and other Hadoop/Spark vendors offer Spark training. Previous recordings of Spark camps, as well as many other great resources, have been posted on the Apache Spark documen‐ tation page. If you don’t have experience with Scala, we do our best to convince you to pick up Scala in Chapter 1, and if you are interested in learning, Programming Scala, 2nd Edi‐ tion, by Dean Wampler and Alex Payne is a good introduction.2 x | Preface
  • 16. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. Examples prefixed with “Evil” depend heavily on Apache Spark internals, and will likely break in future minor releases of Apache Spark. You’ve been warned—but we totally understand you aren’t going to pay much attention to that because neither would we. Preface | xi
  • 17. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download from the High Performance Spark GitHub repository and some of the testing code is avail‐ able at the “Spark Testing Base” GitHub repository and the Spark Validator repo. Structured Streaming machine learning examples, which are generally in the “evil” category discussed under “Conventions Used in This Book” on page xi, are available at https://github.com/holdenk/spark-structured-streaming-ml. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. The code is also avail‐ able under an Apache 2 License. Incorporating a significant amount of example code from this book into your product’s documentation may require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “High Performance Spark by Holden Karau and Rachel Warren (O’Reilly). Copyright 2017 Holden Karau, Rachel Warren, 978-1-491-94320-5.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected]. O’Reilly Safari Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals. Members have access to thousands of books, training videos, Learning Paths, interac‐ tive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Pro‐ fessional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others. For more information, please visit http://oreilly.com/safari. xii | Preface
  • 18. How to Contact the Authors For feedback, email us at [email protected]. For random ramblings, occasionally about Spark, follow us on twitter: Holden: http://twitter.com/holdenkarau Rachel: https://twitter.com/warre_n_peace How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) To comment or ask technical questions about this book, send email to bookques‐ [email protected]. For more information about our books, courses, conferences, and news, see our web‐ site at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Acknowledgments The authors would like to acknowledge everyone who has helped with comments and suggestions on early drafts of our work. Special thanks to Anya Bida, Jakob Odersky, and Katharine Kearnan for reviewing early drafts and diagrams. We’d like to thank Mahmoud Hanafy for reviewing and improving the sample code as well as early drafts. We’d also like to thank Michael Armbrust for reviewing and providing feed‐ back on early drafts of the SQL chapter. Justin Pihony has been one of the most active early readers, suggesting fixes in every respect (language, formatting, etc.). Thanks to all of the readers of our O’Reilly early release who have provided feedback on various errata, including Kanak Kshetri and Rubén Berenguel. Preface | xiii
  • 19. Finally, thank you to our respective employers for being understanding as we’ve worked on this book. Especially Lawrence Spracklen who insisted we mention him here :p. xiv | Preface
  • 20. 1 From http://spark.apache.org/. CHAPTER 1 Introduction to High Performance Spark This chapter provides an overview of what we hope you will be able to learn from this book and does its best to convince you to learn Scala. Feel free to skip ahead to Chap‐ ter 2 if you already know what you’re looking for and use Scala (or have your heart set on another language). What Is Spark and Why Performance Matters Apache Spark is a high-performance, general-purpose distributed computing system that has become the most active Apache open source project, with more than 1,000 active contributors.1 Spark enables us to process large quantities of data, beyond what can fit on a single machine, with a high-level, relatively easy-to-use API. Spark’s design and interface are unique, and it is one of the fastest systems of its kind. Uniquely, Spark allows us to write the logic of data transformations and machine learning algorithms in a way that is parallelizable, but relatively system agnostic. So it is often possible to write computations that are fast for distributed storage systems of varying kind and size. However, despite its many advantages and the excitement around Spark, the simplest implementation of many common data science routines in Spark can be much slower and much less robust than the best version. Since the computations we are concerned with may involve data at a very large scale, the time and resources that gains from tuning code for performance are enormous. Performance does not just mean run faster; often at this scale it means getting something to run at all. It is possible to con‐ struct a Spark query that fails on gigabytes of data but, when refactored and adjusted with an eye toward the structure of the data and the requirements of the cluster, 1
  • 21. succeeds on the same system with terabytes of data. In the authors’ experience writ‐ ing production Spark code, we have seen the same tasks, run on the same clusters, run 100× faster using some of the optimizations discussed in this book. In terms of data processing, time is money, and we hope this book pays for itself through a reduction in data infrastructure costs and developer hours. Not all of these techniques are applicable to every use case. Especially because Spark is highly configurable and is exposed at a higher level than other computational frameworks of comparable power, we can reap tremendous benefits just by becoming more attuned to the shape and structure of our data. Some techniques can work well on certain data sizes or even certain key distributions, but not all. The simplest exam‐ ple of this can be how for many problems, using groupByKey in Spark can very easily cause the dreaded out-of-memory exceptions, but for data with few duplicates this operation can be just as quick as the alternatives that we will present. Learning to understand your particular use case and system and how Spark will interact with it is a must to solve the most complex data science problems with Spark. What You Can Expect to Get from This Book Our hope is that this book will help you take your Spark queries and make them faster, able to handle larger data sizes, and use fewer resources. This book covers a broad range of tools and scenarios. You will likely pick up some techniques that might not apply to the problems you are working with, but that might apply to a problem in the future and may help shape your understanding of Spark more gener‐ ally. The chapters in this book are written with enough context to allow the book to be used as a reference; however, the structure of this book is intentional and reading the sections in order should give you not only a few scattered tips, but a comprehen‐ sive understanding of Apache Spark and how to make it sing. It’s equally important to point out what you will likely not get from this book. This book is not intended to be an introduction to Spark or Scala; several other books and video series are available to get you started. The authors may be a little biased in this regard, but we think Learning Spark by Karau, Konwinski, Wendell, and Zaharia as well as Paco Nathan’s introduction video series are excellent options for Spark begin‐ ners. While this book is focused on performance, it is not an operations book, so top‐ ics like setting up a cluster and multitenancy are not covered. We are assuming that you already have a way to use Spark in your system, so we won’t provide much assis‐ tance in making higher-level architecture decisions. There are future books in the works, by other authors, on the topic of Spark operations that may be done by the time you are reading this one. If operations are your show, or if there isn’t anyone responsible for operations in your organization, we hope those books can help you. 2 | Chapter 1: Introduction to High Performance Spark
  • 22. 2 MiMa is the Migration Manager for Scala and tries to catch binary incompatibilities between releases. Spark Versions Spark follows semantic versioning with the standard [MAJOR].[MINOR].[MAINTE‐ NANCE] with API stability for public nonexperimental nondeveloper APIs within minor and maintenance releases. Many of these experimental components are some of the more exciting from a performance standpoint, including Datasets—Spark SQL’s new structured, strongly-typed, data abstraction. Spark also tries for binary API compatibility between releases, using MiMa2 ; so if you are using the stable API you generally should not need to recompile to run a job against a new version of Spark unless the major version has changed. This book was created using the Spark 2.0.1 APIs, but much of the code will work in earlier versions of Spark as well. In places where this is not the case we have attempted to call that out. Why Scala? In this book, we will focus on Spark’s Scala API and assume a working knowledge of Scala. Part of this decision is simply in the interest of time and space; we trust readers wanting to use Spark in another language will be able to translate the concepts used in this book without presenting the examples in Java and Python. More importantly, it is the belief of the authors that “serious” performant Spark development is most easily achieved in Scala. To be clear, these reasons are very specific to using Spark with Scala; there are many more general arguments for (and against) Scala’s applications in other contexts. To Be a Spark Expert You Have to Learn a Little Scala Anyway Although Python and Java are more commonly used languages, learning Scala is a worthwhile investment for anyone interested in delving deep into Spark develop‐ ment. Spark’s documentation can be uneven. However, the readability of the code‐ base is world-class. Perhaps more than with other frameworks, the advantages of cultivating a sophisticated understanding of the Spark codebase is integral to the advanced Spark user. Because Spark is written in Scala, it will be difficult to interact with the Spark source code without the ability, at least, to read Scala code. Further‐ more, the methods in the Resilient Distributed Datasets (RDD) class closely mimic those in the Scala collections API. RDD functions, such as map, filter, flatMap, Spark Versions | 3
  • 23. 3 Although, as we explore in this book, the performance implications and evaluation semantics are quite different. 4 Of course, in performance, every rule has its exception. mapPartitions in Spark 1.6 and earlier in Java suffers some severe performance restrictions that we discuss in “Iterator-to-Iterator Transformations with mapParti‐ tions” on page 98. reduce, and fold, have nearly identical specifications to their Scala equivalents.3 Fun‐ damentally Spark is a functional framework, relying heavily on concepts like immut‐ ability and lambda definition, so using the Spark API may be more intuitive with some knowledge of functional programming. The Spark Scala API Is Easier to Use Than the Java API Once you have learned Scala, you will quickly find that writing Spark in Scala is less painful than writing Spark in Java. First, writing Spark in Scala is significantly more concise than writing Spark in Java since Spark relies heavily on inline function defini‐ tions and lambda expressions, which are much more naturally supported in Scala (especially before Java 8). Second, the Spark shell can be a powerful tool for debug‐ ging and development, and is only available in languages with existing REPLs (Scala, Python, and R). Scala Is More Performant Than Python It can be attractive to write Spark in Python, since it is easy to learn, quick to write, interpreted, and includes a very rich set of data science toolkits. However, Spark code written in Python is often slower than equivalent code written in the JVM, since Scala is statically typed, and the cost of JVM communication (from Python to Scala) can be very high. Last, Spark features are generally written in Scala first and then translated into Python, so to use cutting-edge Spark functionality, you will need to be in the JVM; Python support for MLlib and Spark Streaming are particularly behind. Why Not Scala? There are several good reasons to develop with Spark in other languages. One of the more important constant reasons is developer/team preference. Existing code, both internal and in libraries, can also be a strong reason to use a different language. Python is one of the most supported languages today. While writing Java code can be clunky and sometimes lag slightly in terms of API, there is very little performance cost to writing in another JVM language (at most some object conversions).4 4 | Chapter 1: Introduction to High Performance Spark
While all of the examples in this book are presented in Scala for the final release, we will port many of the examples from Scala to Java and Python where the differences in implementation could be important. These will be available (over time) at our GitHub. If you find yourself wanting a specific example ported, please either email us or create an issue on the GitHub repo.

Spark SQL does much to minimize the performance difference when using a non-JVM language. Chapter 7 looks at options to work effectively in Spark with languages outside of the JVM, including Spark's supported languages of Python and R. This section also offers guidance on how to use Fortran, C, and GPU-specific code to reap additional performance improvements. Even if we are developing most of our Spark application in Scala, we shouldn't feel tied to doing everything in Scala, because specialized libraries in other languages can be well worth the overhead of going outside the JVM.

Learning Scala

If after all of this we've convinced you to use Scala, there are several excellent options for learning it. Spark 1.6 is built against Scala 2.10 and cross-compiled against Scala 2.11; Spark 2.0 is built against Scala 2.11, possibly cross-compiled against Scala 2.10, and may add 2.12 in the future. Depending on how much we've convinced you to learn Scala, and what your resources are, there are a number of different options, ranging from books to massive open online courses (MOOCs) to professional training.

For books, Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne can be great, although many of the actor-system references are not relevant while working in Spark. The Scala language website also maintains a list of Scala books. In addition to books focused on Spark, there are online courses for learning Scala. Functional Programming Principles in Scala, taught by Martin Odersky, Scala's creator, is available on Coursera; Introduction to Functional Programming is available on edX. A number of different companies also offer video-based Scala courses, none of which the authors have personally experienced or recommend.

For those who prefer a more interactive approach, professional training is offered by a number of different companies, including Lightbend (formerly Typesafe). While we have not directly experienced Typesafe training, it receives positive reviews and is known especially to help bring a team or group of individuals up to speed with Scala for the purposes of working with Spark.
Conclusion

Although you will likely be able to get the most out of Spark performance if you have an understanding of Scala, working in Spark does not require a knowledge of Scala. For those whose problems are better suited to other languages or tools, techniques for working with other languages will be covered in Chapter 7. This book is aimed at individuals who already have a grasp of the basics of Spark, and we thank you for choosing High Performance Spark to deepen your knowledge of Spark. The next chapter will introduce some of Spark's general design and evaluation paradigms that are important to understanding how to efficiently utilize Spark.
Chapter 2: How Spark Works

This chapter introduces the overall design of Spark as well as its place in the big data ecosystem. Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop.[1] As we will discuss in this chapter, Spark's design principles are quite different from those of MapReduce. Unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop, although it often is. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ.[2] However, Spark's internals, especially how it handles failures, differ from many traditional systems, and Spark's ability to leverage lazy evaluation within in-memory computations is particularly distinctive. Spark's creators believe it to be the first high-level programming language for fast, distributed data processing.[3]

[1] MapReduce is a programmatic paradigm that defines programs in terms of map procedures that filter and sort data onto the nodes of a distributed system, and reduce procedures that aggregate the data on the mapper nodes. Implementations of MapReduce have been written in many languages, but the term usually refers to a popular implementation called Hadoop MapReduce, packaged with the distributed filesystem, Apache Hadoop Distributed File System.

[2] DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset, and then exposes functions to transform data as methods defined on that dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark's. However, DryadLINQ doesn't use in-memory storage. For more information see the DryadLINQ documentation.

[3] See the original Spark paper and other Spark papers.

To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. In this chapter, we will provide a broad overview of Spark's model of parallel computing and a thorough explanation of the Spark scheduler and execution engine.
We will refer to the concepts in this chapter throughout the text. Further, we hope this explanation will provide you with a more precise understanding of some of the terms you've heard tossed around by other Spark users and encounter in the Spark documentation.

How Spark Fits into the Big Data Ecosystem

Apache Spark is an open source framework that provides generalizable methods to process data in parallel; the same high-level Spark functions can be used to perform disparate data processing tasks on data of different sizes and structures. On its own, Spark is not a data storage solution; it performs computations on Spark JVMs (Java Virtual Machines) that last only for the duration of a Spark application. Spark can be run locally on a single machine with a single JVM (called local mode). More often, Spark is used in tandem with a distributed storage system (e.g., HDFS, Cassandra, or S3) and a cluster manager, with the storage system housing the data processed with Spark, and the cluster manager orchestrating the distribution of Spark applications across the cluster. Spark currently supports three kinds of cluster managers: the Standalone Cluster Manager, Apache Mesos, and Hadoop YARN (see Figure 2-1). The Standalone Cluster Manager is included in Spark, but using the Standalone manager requires installing Spark on each node of the cluster.

Figure 2-1. A diagram of the data processing ecosystem, including Spark.
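As a minimal sketch of how the choice between local mode and a cluster manager surfaces in code (the application name and host names below are our own placeholder assumptions, not values from the book):

import org.apache.spark.{SparkConf, SparkContext}

// Local mode: driver and executors share a single JVM with two worker threads.
val conf = new SparkConf()
  .setAppName("ecosystem-demo")
  .setMaster("local[2]")

// Under a cluster manager, only the master URL changes, for example:
//   Standalone: spark://master-host:7077
//   Mesos:      mesos://master-host:5050
//   YARN:       yarn (the cluster is located via the Hadoop configuration)
val sc = new SparkContext(conf)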
Spark Components

Spark provides a high-level query language to process data. Spark Core, the main data processing framework in the Spark ecosystem, has APIs in Scala, Java, Python, and R. Spark is built around a data abstraction called Resilient Distributed Datasets (RDDs). RDDs are a representation of lazily evaluated, statically typed, distributed collections. RDDs have a number of predefined "coarse-grained" transformations (functions that are applied to the entire dataset), such as map, join, and reduce, to manipulate the distributed datasets, as well as I/O functionality to read and write data between the distributed storage system and the Spark JVMs. While Spark also supports R, at present the RDD interface is not available in that language. We will cover tips for using Java, Python, R, and other languages in detail in Chapter 7.

In addition to Spark Core, the Spark ecosystem includes a number of other first-party components, including Spark SQL, Spark MLlib, Spark ML, Spark Streaming, and GraphX,[4] which provide more specific data processing functionality. Some of these components have the same generic performance considerations as the Core; MLlib, for example, is written almost entirely on the Spark API. However, some of them have unique considerations: Spark SQL, for example, has a different query optimizer than Spark Core.

[4] GraphX is not actively developed at this point, and will likely be replaced with GraphFrames or similar.

Spark SQL is a component that can be used in tandem with Spark Core. It has APIs in Scala, Java, Python, and R, and supports basic SQL queries. Spark SQL defines an interface for a semi-structured data type, called DataFrames, and, as of Spark 1.6, a semi-structured, typed version of RDDs called Datasets.[5] Spark SQL is a very important component for Spark performance, and much of what can be accomplished with Spark Core can be done by leveraging Spark SQL. We will cover Spark SQL in detail in Chapter 3 and compare the performance of joins in Spark SQL and Spark Core in Chapter 4.

[5] Datasets and DataFrames are unified in Spark 2.0. Datasets are DataFrames of "Row" objects that can be accessed by field number.
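As a minimal sketch of the DataFrame side of Spark SQL (the entry point shown is the Spark 1.x-style SQLContext, and the data, column names, and table name are our own illustrative assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc is an existing SparkContext

// A DataFrame is a distributed collection of rows with a known schema.
val people = sqlContext
  .createDataFrame(Seq(("Ada", 36), ("Grace", 45), ("Alan", 41)))
  .toDF("name", "age")

// The same query expressed through the DataFrame API and through SQL:
people.filter(people("age") > 40).show()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 40").show()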
Spark has two machine learning packages: ML and MLlib. MLlib is a package of machine learning and statistics algorithms written with Spark. Spark ML is still in the early stages, and has only existed since Spark 1.2. Spark ML provides a higher-level API than MLlib, with the goal of allowing users to more easily create practical machine learning pipelines. Spark MLlib is primarily built on top of RDDs and uses functions from Spark Core, while ML is built on top of Spark SQL DataFrames.[6] Eventually the Spark community plans to move over to ML and deprecate MLlib. Spark ML and MLlib both have additional performance considerations beyond Spark Core and Spark SQL; we cover some of these in Chapter 9.

[6] See the MLlib documentation.

Spark Streaming uses the scheduling of Spark Core for streaming analytics on minibatches of data. Spark Streaming has a number of unique considerations, such as the window sizes used for batches. We offer some tips for using Spark Streaming in "Stream Processing with Spark" on page 255.

GraphX is a graph processing framework built on top of Spark with an API for graph computations. GraphX is one of the least mature components of Spark, so we don't cover it in much detail. In future versions of Spark, typed graph functionality will be introduced on top of the Dataset API. We will provide a cursory glance at GraphX in "GraphX" on page 269.

This book will focus on optimizing programs written with Spark Core and Spark SQL. However, since MLlib and the other frameworks are written using the Spark API, this book will provide the tools you need to leverage those frameworks more efficiently. Maybe by the time you're done, you will be ready to start contributing your own functions to MLlib and ML!

In addition to these first-party components, the community has written a number of libraries that provide additional functionality, such as for testing or parsing CSVs, and offer tools to connect Spark to different data sources. Many libraries are listed at http://spark-packages.org/, and can be dynamically included at runtime with spark-submit or the spark-shell, or added as build dependencies to your maven or sbt project. We first use Spark packages to add support for CSV data in "Additional formats" on page 59, and then in more detail in "Using Community Packages and Libraries" on page 269.
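For example, a CSV package such as spark-csv can be pulled in at launch with spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 (the exact coordinates and version are our assumption for a Scala 2.11 build) and then used like any built-in data source; the input path below is a placeholder:

// Assumes the spark-csv package was added via --packages or as a build dependency.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // treat the first line as column names
  .option("inferSchema", "true") // sample the data to guess column types
  .load("data/example.csv")      // hypothetical input path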
Spark Model of Parallel Computing: RDDs

Spark allows users to write a program for the driver (or master node) of a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs (immutable, distributed collections of objects), which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system. The Spark cluster manager handles starting and distributing the Spark executors across a distributed system according to the configuration parameters set by the Spark application. The Spark execution engine itself distributes data across the executors for a computation. (See Figure 2-4.)

Rather than evaluating each transformation as soon as it is specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver). Spark can keep an RDD loaded in memory on the executor nodes throughout the life of a Spark application, for faster access in repeated computations. As they are implemented in Spark, RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the existing one. As we will explore in this chapter, this paradigm of lazy evaluation, in-memory storage, and immutability allows Spark to be easy to use, fault-tolerant, scalable, and efficient.

Lazy Evaluation

Many other systems for in-memory storage are based on "fine-grained" updates to mutable objects, i.e., calls to a particular cell in a table, with intermediate results stored as they go. In contrast, evaluation of RDDs is completely lazy: Spark does not begin computing the partitions until an action is called. An action is a Spark operation that returns something other than an RDD, triggering evaluation of partitions and possibly returning some output to a non-Spark system (outside of the Spark executors); for example, bringing data back to the driver (with operations like count or collect) or writing data to an external storage system (such as copyToHadoop). Actions trigger the scheduler, which builds a directed acyclic graph (called the DAG) based on the dependencies between RDD transformations. In other words, Spark evaluates an action by working backward to define the series of steps it has to take to produce each object in the final distributed dataset (each partition). Then, using this series of steps, called the execution plan, the scheduler computes the missing partitions for each stage until it computes the result.

Note: not all transformations are 100% lazy. sortByKey needs to evaluate the RDD to determine the range of data, so it involves both a transformation and an action.

Performance and usability advantages of lazy evaluation

Lazy evaluation allows Spark to combine operations that don't require communication with the driver (called transformations with one-to-one dependencies) to avoid doing multiple passes through the data. For example, suppose a Spark program calls a map and a filter function on the same RDD. Spark can send the instructions for both the map and the filter to each executor, and then perform both on each partition, which requires accessing the records only once, rather than sending two sets of instructions and accessing each partition twice. This theoretically reduces the computational complexity by half.

Spark's lazy evaluation paradigm is not only more efficient, it also makes it easier to implement the same logic in Spark than in a different framework, like MapReduce, that requires the developer to do the work of consolidating her mapping operations. Spark's clever lazy evaluation strategy lets us be lazy and express the same logic in far fewer lines of code: we can chain together operations with narrow dependencies and let the Spark evaluation engine do the work of consolidating them.
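A minimal sketch of this behavior (the input path and variable names are our own placeholders): nothing below reads any data until the action on the last line, at which point the map and the filter are pipelined into a single pass over each partition.

val lines = sc.textFile("data/logs.txt") // hypothetical input path

// Transformations only: each returns a new RDD, and no work happens yet.
val lengths = lines.map(_.length)
val longLines = lengths.filter(_ > 80)

// The action triggers the scheduler to build the DAG and run the job.
val numLongLines = longLines.count()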
Consider the classic word count example that, given a dataset of documents, parses the text into words and then computes the count for each word. The Apache docs provide a word count example, which even in its simplest form comprises roughly fifty lines of code (excluding import statements) in Java. A comparable Spark implementation is roughly fifteen lines of code in Java and five in Scala, available on the Apache website. The example excludes the steps to read in the data; it covers mapping documents to words and counting the words. We have reproduced it in Example 2-1.

Example 2-1. Simple Scala word count example

def simpleWordCount(rdd: RDD[String]): RDD[(String, Int)] = {
  val words = rdd.flatMap(_.split(" "))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}

A further benefit of the Spark implementation of word count is that it is easier to modify and improve. Suppose that we now want to modify this function to filter out some "stop words" and punctuation from each document before computing the word count. In MapReduce, this would require adding the filter logic to the mapper to avoid doing a second pass through the data. An implementation of this routine for MapReduce can be found at https://github.com/kite-sdk/kite/wiki/WordCount-Version-Three. In contrast, we can modify the preceding Spark routine by simply putting a filter step before the map step that creates the key/value pairs. Example 2-2 shows how Spark's lazy evaluation will consolidate the map and filter steps for us.

Example 2-2. Word count example with stop words filtered

def withStopWordsFiltered(rdd: RDD[String], illegalTokens: Array[Char],
    stopWords: Set[String]): RDD[(String, Int)] = {
  val separators = illegalTokens ++ Array[Char](' ')
  val tokens: RDD[String] = rdd.flatMap(_.split(separators)
    .map(_.trim.toLowerCase))
  val words = tokens.filter(token =>
    !stopWords.contains(token) && (token.length > 0))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
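As a quick usage sketch (the input strings and expected pairs are our own illustration; output ordering is not guaranteed):

val docs = sc.parallelize(Seq("Spark is lazy.", "Lazy evaluation is efficient!"))
val counts = withStopWordsFiltered(docs, Array(',', '.', '!'), Set("is"))
counts.collect().foreach(println)
// (spark,1), (lazy,2), (evaluation,1), (efficient,1), in some order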
Lazy evaluation and fault tolerance

Spark is fault-tolerant, meaning Spark will not fail, lose data, or return inaccurate results in the event of a host machine or network failure. Spark's unique method of fault tolerance is achieved because each partition of the data contains the dependency information needed to recalculate the partition. Most distributed computing paradigms that allow users to work with mutable objects provide fault tolerance by logging updates or duplicating data across machines.

In contrast, Spark does not need to maintain a log of updates to each RDD or log the actual intermediary steps, since the RDD itself contains all the dependency information needed to replicate each of its partitions. Thus, if a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be parallelized to make recovery faster.

Lazy evaluation and debugging

Lazy evaluation has important consequences for debugging, since it means that a Spark program will fail only at the point of an action. For example, suppose that you were using the word count example, and afterwards were collecting the results to the driver. If the value you passed in for the stop words was null (maybe because it was the result of a Java program), the code would of course fail with a null pointer exception in the contains check. However, this failure would not appear until the program evaluated the collect step. Even the stack trace will show the failure as first occurring at the collect step, suggesting that the failure came from the collect statement. For this reason it is probably most efficient to develop in an environment that gives you access to complete debugging information.

Because of lazy evaluation, stack traces from failed Spark jobs (especially when embedded in larger systems) will often appear to fail consistently at the point of the action, even if the problem in the logic occurs in a transformation much earlier in the program.

In-Memory Persistence and Memory Management

Spark's performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance increase is due to Spark's use of in-memory persistence. Rather than writing to disk between each pass through the data, Spark has the option of keeping the data on the executors loaded into memory, so that the data on each partition is available in memory each time it needs to be accessed.

Spark offers three options for memory management: in-memory as deserialized data, in-memory as serialized data, and on disk. Each has different space and time advantages:

In memory as deserialized Java objects. The most intuitive way to store objects in RDDs is as the original deserialized Java objects that are defined by the driver program. This form of in-memory
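These three options map onto Spark's StorageLevel constants; as a minimal sketch of choosing among them (the RDD is our own example, and note that a given RDD can hold only one storage level at a time):

import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)

// In memory as deserialized Java objects (what cache() uses by default):
numbers.persist(StorageLevel.MEMORY_ONLY)

// In memory as serialized data (smaller footprint, more CPU to access):
//   numbers.persist(StorageLevel.MEMORY_ONLY_SER)
// On disk:
//   numbers.persist(StorageLevel.DISK_ONLY)

numbers.sum() // the first action materializes and caches the partitions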