Holden Karau and Rachel Warren

High Performance Spark
Best Practices for Scaling and Optimizing Apache Spark

Beijing · Boston · Farnham · Sebastopol · Tokyo
ISBN: 978-1-491-94320-5
High Performance Spark
by Holden Karau and Rachel Warren
Copyright © 2017 Holden Karau, Rachel Warren. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Shannon Cutt
Production Editor: Kristen Brown
Copyeditor: Kim Cofer
Proofreader: James Fraleigh
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
June 2017: First Edition
Revision History for the First Edition
2017-05-22: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. High Performance Spark, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Introduction to High Performance Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
What Is Spark and Why Performance Matters 1
What You Can Expect to Get from This Book 2
Spark Versions 3
Why Scala? 3
To Be a Spark Expert You Have to Learn a Little Scala Anyway 3
The Spark Scala API Is Easier to Use Than the Java API 4
Scala Is More Performant Than Python 4
Why Not Scala? 4
Learning Scala 5
Conclusion 6
2. How Spark Works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
How Spark Fits into the Big Data Ecosystem 8
Spark Components 8
Spark Model of Parallel Computing: RDDs 10
Lazy Evaluation 11
In-Memory Persistence and Memory Management 13
Immutability and the RDD Interface 14
Types of RDDs 16
Functions on RDDs: Transformations Versus Actions 17
Wide Versus Narrow Dependencies 17
Spark Job Scheduling 19
Resource Allocation Across Applications 20
The Spark Application 20
The Anatomy of a Spark Job 22
The DAG 22
Jobs 23
Stages 23
Tasks 24
Conclusion 26
3. DataFrames, Datasets, and Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Getting Started with the SparkSession (or HiveContext or SQLContext) 28
Spark SQL Dependencies 30
Managing Spark Dependencies 31
Avoiding Hive JARs 32
Basics of Schemas 33
DataFrame API 36
Transformations 36
Multi-DataFrame Transformations 48
Plain Old SQL Queries and Interacting with Hive Data 49
Data Representation in DataFrames and Datasets 49
Tungsten 50
Data Loading and Saving Functions 51
DataFrameWriter and DataFrameReader 51
Formats 52
Save Modes 61
Partitions (Discovery and Writing) 61
Datasets 62
Interoperability with RDDs, DataFrames, and Local Collections 62
Compile-Time Strong Typing 64
Easier Functional (RDD “like”) Transformations 64
Relational Transformations 64
Multi-Dataset Relational Transformations 65
Grouped Operations on Datasets 65
Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 66
Query Optimizer 69
Logical and Physical Plans 69
Code Generation 69
Large Query Plans and Iterative Algorithms 70
Debugging Spark SQL Queries 70
JDBC/ODBC Server 70
Conclusion 72
4. Joins (SQL and Core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Core Spark Joins 73
Choosing a Join Type 75
Choosing an Execution Plan 76
Spark SQL Joins 79
DataFrame Joins 79
Dataset Joins 83
Conclusion 84
5. Effective Transformations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Narrow Versus Wide Transformations 86
Implications for Performance 88
Implications for Fault Tolerance 89
The Special Case of coalesce 89
What Type of RDD Does Your Transformation Return? 90
Minimizing Object Creation 92
Reusing Existing Objects 92
Using Smaller Data Structures 95
Iterator-to-Iterator Transformations with mapPartitions 98
What Is an Iterator-to-Iterator Transformation? 99
Space and Time Advantages 100
An Example 101
Set Operations 104
Reducing Setup Overhead 105
Shared Variables 106
Broadcast Variables 106
Accumulators 107
Reusing RDDs 112
Cases for Reuse 112
Deciding if Recompute Is Inexpensive Enough 115
Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 116
Alluxio (nee Tachyon) 120
LRU Caching 121
Noisy Cluster Considerations 122
Interaction with Accumulators 123
Conclusion 124
6. Working with Key/Value Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
The Goldilocks Example 127
Goldilocks Version 0: Iterative Solution 128
How to Use PairRDDFunctions and OrderedRDDFunctions 130
Actions on Key/Value Pairs 131
What’s So Dangerous About the groupByKey Function 132
Goldilocks Version 1: groupByKey Solution 132
Choosing an Aggregation Operation 136
Dictionary of Aggregation Operations with Performance Considerations 136
Multiple RDD Operations 139
Co-Grouping 139
Partitioners and Key/Value Data 140
Using the Spark Partitioner Object 142
Hash Partitioning 142
Range Partitioning 142
Custom Partitioning 143
Preserving Partitioning Information Across Transformations 144
Leveraging Co-Located and Co-Partitioned RDDs 144
Dictionary of Mapping and Partitioning Functions PairRDDFunctions 146
Dictionary of OrderedRDDOperations 147
Sorting by Two Keys with SortByKey 149
Secondary Sort and repartitionAndSortWithinPartitions 149
Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 150
How Not to Sort by Two Orderings 153
Goldilocks Version 2: Secondary Sort 154
A Different Approach to Goldilocks 157
Goldilocks Version 3: Sort on Cell Values 162
Straggler Detection and Unbalanced Data 163
Back to Goldilocks (Again) 165
Goldilocks Version 4: Reduce to Distinct on Each Partition 165
Conclusion 171
7. Going Beyond Scala. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Beyond Scala within the JVM 174
Beyond Scala, and Beyond the JVM 178
How PySpark Works 179
How SparkR Works 187
Spark.jl (Julia Spark) 189
How Eclair JS Works 190
Spark on the Common Language Runtime (CLR)—C# and Friends 191
Calling Other Languages from Spark 191
Using Pipe and Friends 191
JNI 193
Java Native Access (JNA) 196
Underneath Everything Is FORTRAN 196
Getting to the GPU 198
The Future 198
Conclusion 198
8. Testing and Validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Unit Testing 201
General Spark Unit Testing 202
Mocking RDDs 206
Getting Test Data 208
Generating Large Datasets 208
Sampling 209
Property Checking with ScalaCheck 211
Computing RDD Difference 211
Integration Testing 214
Choosing Your Integration Testing Environment 214
Verifying Performance 215
Spark Counters for Verifying Performance 215
Projects for Verifying Performance 216
Job Validation 216
Conclusion 217
9. Spark MLlib and ML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Choosing Between Spark MLlib and Spark ML 219
Working with MLlib 220
Getting Started with MLlib (Organization and Imports) 220
MLlib Feature Encoding and Data Preparation 221
Feature Scaling and Selection 226
MLlib Model Training 226
Predicting 227
Serving and Persistence 228
Model Evaluation 230
Working with Spark ML 231
Spark ML Organization and Imports 231
Pipeline Stages 232
Explain Params 233
Data Encoding 234
Data Cleaning 236
Spark ML Models 237
Putting It All Together in a Pipeline 238
Training a Pipeline 239
Accessing Individual Stages 239
Data Persistence and Spark ML 239
Extending Spark ML Pipelines with Your Own Algorithms 242
Model and Pipeline Persistence and Serving with Spark ML 250
General Serving Considerations 250
Conclusion 251
10. Spark Components and Packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Stream Processing with Spark 255
Sources and Sinks 255
Batch Intervals 257
Data Checkpoint Intervals 258
Considerations for DStreams 259
Considerations for Structured Streaming 260
High Availability Mode (or Handling Driver Failure or Checkpointing) 268
GraphX 269
Using Community Packages and Libraries 269
Creating a Spark Package 271
Conclusion 272
A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist. . . . . . . 273
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Preface
We wrote this book for data engineers and data scientists who are looking to get the most out of Spark. If you’ve been working with Spark and invested in it, but your experience so far has been mired in memory errors and mysterious, intermittent failures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature see “Supporting Books and Materials” on page x.
We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, which is usually more intuitive to the data scientist. Thus this book may be most useful to data engineers who are less experienced in thinking critically about the statistical nature, distribution, and layout of data when considering performance. We hope this book will help data engineers think more critically about their data as they put pipelines into production. We want to help our readers ask questions such as “How is my data distributed?”, “Is it skewed?”, “What is the range of values in a column?”, and “How do we expect a given value to group?”, and then apply the answers to those questions to the logic of their Spark queries.
However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you have a better shot at getting something to run the first time. We hope to guide data scientists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to explore their data more fully, more quickly, and to communicate effectively with anyone helping them put their algorithms into production.
Regardless of your job title, it is likely that the amount of data with which you are
working is growing quickly. Your original solutions may need to be scaled, and your
old techniques for solving new problems may need to be updated. We hope this book
will help you leverage Apache Spark to tackle new problems more easily and old
problems more efficiently.
First Edition Notes
You are reading the first edition of High Performance Spark, and for that, we thank you! If you find errors, mistakes, or have ideas for ways to improve this book, please reach out to us at high-performance-spark@googlegroups.com. If you wish to be included in a “thanks” section in future editions of the book, please include your preferred display name.
Supporting Books and Materials
For data scientists and developers new to Spark, Learning Spark by Karau, Konwinski, Wendell, and Zaharia is an excellent introduction (though we may be biased), and Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills is a great book for interested data scientists. For individuals more interested in streaming, the upcoming Learning Spark Streaming by François Garillot may also be of use once it is available.

Beyond books, there is also a collection of intro-level Spark training material available. For individuals who prefer video, Paco Nathan has an excellent introduction video series on O’Reilly. Commercially, Databricks as well as Cloudera and other Hadoop/Spark vendors offer Spark training. Previous recordings of Spark camps, as well as many other great resources, have been posted on the Apache Spark documentation page.

If you don’t have experience with Scala, we do our best to convince you to pick up Scala in Chapter 1, and if you are interested in learning, Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne is a good introduction (although it’s important to note that some of the practices it suggests are not common practice in Spark code).
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Examples prefixed with “Evil” depend heavily on Apache Spark
internals, and will likely break in future minor releases of Apache
Spark. You’ve been warned—but we totally understand you aren’t
going to pay much attention to that because neither would we.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download from the High Performance Spark GitHub repository, and some of the testing code is available at the “Spark Testing Base” GitHub repository and the Spark Validator repo. Structured Streaming machine learning examples, which are generally in the “evil” category discussed under “Conventions Used in This Book” on page xi, are available at https://github.com/holdenk/spark-structured-streaming-ml.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. The code is also available under an Apache 2 License. Incorporating a significant amount of example code from this book into your product’s documentation may require permission.
We appreciate, but do not require, attribution. An attribution usually includes the
title, author, publisher, and ISBN. For example: “High Performance Spark by Holden
Karau and Rachel Warren (O’Reilly). Copyright 2017 Holden Karau, Rachel Warren,
978-1-491-94320-5.”
If you feel your use of code examples falls outside fair use or the permission given
above, feel free to contact us at permissions@oreilly.com.
O’Reilly Safari
Safari (formerly Safari Books Online) is a membership-based
training and reference platform for enterprise, government,
educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact the Authors
For feedback, email us at high-performance-spark@googlegroups.com. For random ramblings, occasionally about Spark, follow us on Twitter:
Holden: http://twitter.com/holdenkarau
Rachel: https://twitter.com/warre_n_peace
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
The authors would like to acknowledge everyone who has helped with comments and suggestions on early drafts of our work. Special thanks to Anya Bida, Jakob Odersky, and Katharine Kearnan for reviewing early drafts and diagrams. We’d like to thank Mahmoud Hanafy for reviewing and improving the sample code as well as early drafts. We’d also like to thank Michael Armbrust for reviewing and providing feedback on early drafts of the SQL chapter. Justin Pihony has been one of the most active early readers, suggesting fixes in every respect (language, formatting, etc.).
Thanks to all of the readers of our O’Reilly early release who have provided feedback
on various errata, including Kanak Kshetri and Rubén Berenguel.
Finally, thank you to our respective employers for being understanding as we’ve
worked on this book. Especially Lawrence Spracklen who insisted we mention him
here :p.
CHAPTER 1
Introduction to High Performance Spark
This chapter provides an overview of what we hope you will be able to learn from this book and does its best to convince you to learn Scala. Feel free to skip ahead to Chapter 2 if you already know what you’re looking for and use Scala (or have your heart set on another language).
What Is Spark and Why Performance Matters
Apache Spark is a high-performance, general-purpose distributed computing system that has become the most active Apache open source project, with more than 1,000 active contributors (from http://spark.apache.org/). Spark enables us to process large quantities of data, beyond what can fit on a single machine, with a high-level, relatively easy-to-use API. Spark’s design and interface are unique, and it is one of the fastest systems of its kind. Uniquely, Spark allows us to write the logic of data transformations and machine learning algorithms in a way that is parallelizable, but relatively system agnostic. So it is often possible to write computations that are fast for distributed storage systems of varying kind and size.
However, despite its many advantages and the excitement around Spark, the simplest implementation of many common data science routines in Spark can be much slower and much less robust than the best version. Since the computations we are concerned with may involve data at a very large scale, the time and resources saved by tuning code for performance are enormous. Performance does not just mean run faster; often at this scale it means getting something to run at all. It is possible to construct a Spark query that fails on gigabytes of data but, when refactored and adjusted with an eye toward the structure of the data and the requirements of the cluster, succeeds on the same system with terabytes of data. In the authors’ experience writing production Spark code, we have seen the same tasks, run on the same clusters, run 100× faster using some of the optimizations discussed in this book. In terms of data processing, time is money, and we hope this book pays for itself through a reduction in data infrastructure costs and developer hours.
Not all of these techniques are applicable to every use case. Especially because Spark is highly configurable, and is exposed at a higher level than other computational frameworks of comparable power, we can reap tremendous benefits just by becoming more attuned to the shape and structure of our data. Some techniques can work well on certain data sizes or even certain key distributions, but not all. The simplest example of this is how, for many problems, using groupByKey in Spark can very easily cause the dreaded out-of-memory exceptions, while for data with few duplicates this operation can be just as quick as the alternatives that we will present. Learning to understand your particular use case and system, and how Spark will interact with it, is a must to solve the most complex data science problems with Spark.
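To make this concrete, here is a minimal sketch (our own illustration, not an example from this book) of a per-key sum written both ways; for data with many duplicate or skewed keys, the second version is far safer:

import org.apache.spark.rdd.RDD

// groupByKey materializes every value for a key on a single executor
// before summing, which is what can trigger out-of-memory failures
// for keys with many duplicate records.
def sumWithGroupByKey(rdd: RDD[(String, Int)]): RDD[(String, Int)] = {
  rdd.groupByKey().mapValues(_.sum)
}

// reduceByKey combines values map-side before the shuffle, so only a
// single running sum per key is held in memory and sent over the wire.
def sumWithReduceByKey(rdd: RDD[(String, Int)]): RDD[(String, Int)] = {
  rdd.reduceByKey(_ + _)
}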
What You Can Expect to Get from This Book
Our hope is that this book will help you take your Spark queries and make them faster, able to handle larger data sizes, and use fewer resources. This book covers a broad range of tools and scenarios. You will likely pick up some techniques that might not apply to the problems you are working with, but that might apply to a problem in the future and may help shape your understanding of Spark more generally. The chapters in this book are written with enough context to allow the book to be used as a reference; however, the structure of this book is intentional and reading the sections in order should give you not only a few scattered tips, but a comprehensive understanding of Apache Spark and how to make it sing.
It’s equally important to point out what you will likely not get from this book. This book is not intended to be an introduction to Spark or Scala; several other books and video series are available to get you started. The authors may be a little biased in this regard, but we think Learning Spark by Karau, Konwinski, Wendell, and Zaharia as well as Paco Nathan’s introduction video series are excellent options for Spark beginners. While this book is focused on performance, it is not an operations book, so topics like setting up a cluster and multitenancy are not covered. We are assuming that you already have a way to use Spark in your system, so we won’t provide much assistance in making higher-level architecture decisions. There are future books in the works, by other authors, on the topic of Spark operations that may be done by the time you are reading this one. If operations are your show, or if there isn’t anyone responsible for operations in your organization, we hope those books can help you.
Spark Versions
Spark follows semantic versioning, with the standard [MAJOR].[MINOR].[MAINTENANCE] scheme and with API stability for public nonexperimental nondeveloper APIs within minor and maintenance releases. Many of these experimental components are some of the more exciting from a performance standpoint, including Datasets—Spark SQL’s new structured, strongly typed data abstraction. Spark also tries for binary API compatibility between releases, using MiMa (the Migration Manager for Scala, which tries to catch binary incompatibilities between releases); so if you are using the stable API you generally should not need to recompile to run a job against a new version of Spark unless the major version has changed.
This book was created using the Spark 2.0.1 APIs, but much of the
code will work in earlier versions of Spark as well. In places where
this is not the case we have attempted to call that out.
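To make the versioning discussion concrete, the following is a minimal sbt sketch (ours, not from the book) pinning the Spark version used in this text; the exact version numbers are illustrative assumptions, and marking Spark as "provided" is a common convention rather than a requirement:

// build.sbt -- a minimal sketch; version numbers are illustrative.
name := "high-performance-spark-examples"

scalaVersion := "2.11.8"

// "provided" keeps the Spark JARs out of the assembly, since the
// cluster supplies its own Spark distribution at runtime.
libraryDependencies +=
  "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"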
Why Scala?
In this book, we will focus on Spark’s Scala API and assume a working knowledge of
Scala. Part of this decision is simply in the interest of time and space; we trust readers
wanting to use Spark in another language will be able to translate the concepts used
in this book without presenting the examples in Java and Python. More importantly,
it is the belief of the authors that “serious” performant Spark development is most
easily achieved in Scala.
To be clear, these reasons are very specific to using Spark with Scala; there are many
more general arguments for (and against) Scala’s applications in other contexts.
To Be a Spark Expert You Have to Learn a Little Scala Anyway
Although Python and Java are more commonly used languages, learning Scala is a worthwhile investment for anyone interested in delving deep into Spark development. Spark’s documentation can be uneven. However, the readability of the codebase is world-class. Perhaps more than with other frameworks, the advantages of cultivating a sophisticated understanding of the Spark codebase are integral to the advanced Spark user. Because Spark is written in Scala, it will be difficult to interact with the Spark source code without the ability, at least, to read Scala code. Furthermore, the methods in the Resilient Distributed Datasets (RDD) class closely mimic those in the Scala collections API. RDD functions, such as map, filter, flatMap, reduce, and fold, have nearly identical specifications to their Scala equivalents (although, as we explore in this book, the performance implications and evaluation semantics are quite different). Fundamentally Spark is a functional framework, relying heavily on concepts like immutability and lambda definition, so using the Spark API may be more intuitive with some knowledge of functional programming.
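As a small illustration of this parallel (our own sketch, not an example from the text), the same map/filter/reduce chain reads nearly identically on a local Scala collection and on an RDD:

import org.apache.spark.SparkContext

def sumOfEvenSquares(sc: SparkContext): (Int, Int) = {
  // Plain Scala collections version.
  val local = (1 to 100).map(x => x * x).filter(_ % 2 == 0).reduce(_ + _)

  // RDD version: the same method names and lambda syntax, but each
  // step is distributed across the cluster and lazily evaluated.
  val distributed = sc.parallelize(1 to 100)
    .map(x => x * x)
    .filter(_ % 2 == 0)
    .reduce(_ + _)

  (local, distributed)
}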
The Spark Scala API Is Easier to Use Than the Java API
Once you have learned Scala, you will quickly find that writing Spark in Scala is less painful than writing Spark in Java. First, writing Spark in Scala is significantly more concise than writing Spark in Java, since Spark relies heavily on inline function definitions and lambda expressions, which are much more naturally supported in Scala (especially before Java 8). Second, the Spark shell can be a powerful tool for debugging and development, and is only available in languages with existing REPLs (Scala, Python, and R).
Scala Is More Performant Than Python
It can be attractive to write Spark in Python, since it is easy to learn, quick to write,
interpreted, and includes a very rich set of data science toolkits. However, Spark code
written in Python is often slower than equivalent code written in the JVM, since Scala
is statically typed, and the cost of JVM communication (from Python to Scala) can be
very high. Last, Spark features are generally written in Scala first and then translated into Python, so to use cutting-edge Spark functionality, you will need to be in the JVM; Python support for MLlib and Spark Streaming is particularly behind.
Why Not Scala?
There are several good reasons to develop with Spark in other languages. One of the more important constant reasons is developer/team preference. Existing code, both internal and in libraries, can also be a strong reason to use a different language. Python is one of the most supported languages today. While writing Java code can be clunky and sometimes lag slightly in terms of API, there is very little performance cost to writing in another JVM language (at most some object conversions). Of course, in performance, every rule has its exception: mapPartitions in Spark 1.6 and earlier in Java suffers some severe performance restrictions that we discuss in “Iterator-to-Iterator Transformations with mapPartitions” on page 98.
While all of the examples in this book are presented in Scala for the
final release, we will port many of the examples from Scala to Java
and Python where the differences in implementation could be
important. These will be available (over time) at our GitHub. If you
find yourself wanting a specific example ported, please either email
us or create an issue on the GitHub repo.
Spark SQL does much to minimize the performance difference when using a non-JVM language. Chapter 7 looks at options to work effectively in Spark with languages outside of the JVM, including Spark’s supported languages of Python and R. This section also offers guidance on how to use Fortran, C, and GPU-specific code to reap additional performance improvements. Even if we are developing most of our Spark application in Scala, we shouldn’t feel tied to doing everything in Scala, because specialized libraries in other languages can be well worth the overhead of going outside the JVM.
Learning Scala
If after all of this we’ve convinced you to use Scala, there are several excellent options for learning it. Spark 1.6 is built against Scala 2.10 and cross-compiled against Scala 2.11, and Spark 2.0 is built against Scala 2.11, possibly cross-compiled against Scala 2.10, and may add 2.12 support in the future. Depending on how much we’ve convinced you to learn Scala, and what your resources are, there are a number of different options, ranging from books to massive open online courses (MOOCs) to professional training.
For books, Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne can be great, although much of the actor system material is not relevant while working in Spark. The Scala language website also maintains a list of Scala books.
In addition to books focused on Spark, there are online courses for learning Scala. Functional Programming Principles in Scala, taught by Martin Odersky, its creator, is on Coursera, as is Introduction to Functional Programming on edX. A number of different companies also offer video-based Scala courses, none of which the authors have personally experienced or recommend.
For those who prefer a more interactive approach, professional training is offered by
a number of different companies, including Lightbend (formerly Typesafe). While we
have not directly experienced Typesafe training, it receives positive reviews and is
known especially to help bring a team or group of individuals up to speed with Scala
for the purposes of working with Spark.
Conclusion
Although you will likely be able to get the most out of Spark performance if you have
an understanding of Scala, working in Spark does not require a knowledge of Scala.
For those whose problems are better suited to other languages or tools, techniques for
working with other languages will be covered in Chapter 7. This book is aimed at
individuals who already have a grasp of the basics of Spark, and we thank you for
choosing High Performance Spark to deepen your knowledge of Spark. The next
chapter will introduce some of Spark’s general design and evaluation paradigms that
are important to understanding how to efficiently utilize Spark.
CHAPTER 2
How Spark Works

This chapter introduces the overall design of Spark as well as its place in the big data ecosystem. Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop.1 As we will discuss in this chapter, Spark’s design principles are quite different from those of MapReduce. Unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop—although it often is. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ.2 However, Spark’s internals, especially how it handles failures, differ from many traditional systems. Spark’s ability to leverage lazy evaluation within memory computations makes it particularly unique. Spark’s creators believe it to be the first high-level programming language for fast, distributed data processing.3

1 MapReduce is a programmatic paradigm that defines programs in terms of map procedures that filter and sort data onto the nodes of a distributed system, and reduce procedures that aggregate the data on the mapper nodes. Implementations of MapReduce have been written in many languages, but the term usually refers to a popular implementation called Hadoop MapReduce, packaged with the distributed filesystem, Apache Hadoop Distributed File System.
2 DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset, and then exposes functions to transform data as methods defined on that dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark’s. However, DryadLINQ doesn’t use in-memory storage. For more information see the DryadLINQ documentation.
3 See the original Spark Paper and other Spark papers.

To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. In this chapter, we will provide a broad overview of Spark’s model of parallel computing and a thorough explanation of the Spark scheduler and execution engine. We will refer to the concepts in this chapter throughout the text. Further, we hope this explanation will provide you with a more precise understanding of some of the terms you’ve heard tossed around by other Spark users and encounter in the Spark documentation.
How Spark Fits into the Big Data Ecosystem
Apache Spark is an open source framework that provides methods to process data in parallel that are generalizable; the same high-level Spark functions can be used to perform disparate data processing tasks on data of different sizes and structures. On its own, Spark is not a data storage solution; it performs computations on Spark JVMs (Java Virtual Machines) that last only for the duration of a Spark application. Spark can be run locally on a single machine with a single JVM (called local mode). More often, Spark is used in tandem with a distributed storage system (e.g., HDFS, Cassandra, or S3) and a cluster manager—the storage system to house the data processed with Spark, and the cluster manager to orchestrate the distribution of Spark applications across the cluster. Spark currently supports three kinds of cluster managers: Standalone Cluster Manager, Apache Mesos, and Hadoop YARN (see Figure 2-1). The Standalone Cluster Manager is included in Spark, but using the Standalone manager requires installing Spark on each node of the cluster.
Figure 2-1. A diagram of the data processing ecosystem including Spark
Spark Components
Spark provides a high-level query language to process data. Spark Core, the main data processing framework in the Spark ecosystem, has APIs in Scala, Java, Python, and R. Spark is built around a data abstraction called Resilient Distributed Datasets (RDDs). RDDs are a representation of lazily evaluated, statically typed, distributed collections. RDDs have a number of predefined “coarse-grained” transformations (functions that are applied to the entire dataset), such as map, join, and reduce to manipulate the distributed datasets, as well as I/O functionality to read and write data between the distributed storage system and the Spark JVMs.

While Spark also supports R, at present the RDD interface is not available in that language. We will cover tips for using Java, Python, R, and other languages in detail in Chapter 7.
In addition to Spark Core, the Spark ecosystem includes a number of other first-party components, including Spark SQL, Spark MLlib, Spark ML, Spark Streaming, and GraphX (GraphX is no longer actively developed, and will likely be replaced with GraphFrames or similar), which provide more specific data processing functionality. Some of these components have the same generic performance considerations as the Core; MLlib, for example, is written almost entirely on the Spark API. However, some of them have unique considerations. Spark SQL, for example, has a different query optimizer than Spark Core.
Spark SQL is a component that can be used in tandem with Spark Core. It has APIs in Scala, Java, Python, and R, and supports basic SQL queries. Spark SQL defines an interface for a semi-structured data type called DataFrames and, as of Spark 1.6, a semi-structured, typed version of RDDs called Datasets (Datasets and DataFrames are unified in Spark 2.0: DataFrames are Datasets of “Row” objects that can be accessed by field number). Spark SQL is a very important component for Spark performance, and much of what can be accomplished with Spark Core can be done by leveraging Spark SQL. We will cover Spark SQL in detail in Chapter 3 and compare the performance of joins in Spark SQL and Spark Core in Chapter 4.
Spark has two machine learning packages: ML and MLlib. MLlib is a package of machine learning and statistics algorithms written with Spark. Spark ML is still in its early stages, and has only existed since Spark 1.2. Spark ML provides a higher-level API than MLlib, with the goal of allowing users to more easily create practical machine learning pipelines. Spark MLlib is primarily built on top of RDDs and uses functions from Spark Core, while ML is built on top of Spark SQL DataFrames (see the MLlib documentation). Eventually the Spark community plans to move over to ML and deprecate MLlib. Spark ML and MLlib both have additional performance considerations beyond those of Spark Core and Spark SQL—we cover some of these in Chapter 9.
Spark Streaming uses the scheduling of the Spark Core for streaming analytics on minibatches of data. Spark Streaming has a number of unique considerations, such as the window sizes used for batches. We offer some tips for using Spark Streaming in “Stream Processing with Spark” on page 255.
GraphX is a graph processing framework built on top of Spark with an API for graph
computations. GraphX is one of the least mature components of Spark, so we don’t
cover it in much detail. In future versions of Spark, typed graph functionality will be
introduced on top of the Dataset API. We will provide a cursory glance at GraphX in
“GraphX” on page 269.
This book will focus on optimizing programs written with the Spark Core and Spark
SQL. However, since MLlib and the other frameworks are written using the Spark
API, this book will provide the tools you need to leverage those frameworks more
efficiently. Maybe by the time you’re done, you will be ready to start contributing
your own functions to MLlib and ML!
In addition to these first-party components, the community has written a number of libraries that provide additional functionality, such as for testing or parsing CSVs, and offer tools to connect Spark to different data sources. Many libraries are listed at http://spark-packages.org/, and can be dynamically included at runtime with spark-submit or the spark-shell, or added as build dependencies to your maven or sbt project. We first use Spark packages to add support for CSV data in “Additional formats” on page 59, and then in more detail in “Using Community Packages and Libraries” on page 269.
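As an example of launch-time inclusion (a sketch of ours, with illustrative package coordinates that depend on your Spark and Scala versions):

# Pull the spark-csv package from spark-packages when starting the
# shell; Spark resolves and downloads the JARs before the REPL starts.
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0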
Spark Model of Parallel Computing: RDDs
Spark allows users to write a program for the driver (or master node) on a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs—immutable, distributed collections of objects—which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system. The Spark cluster manager handles starting and distributing the Spark executors across a distributed system according to the configuration parameters set by the Spark application. The Spark execution engine itself distributes data across the executors for a computation. (See Figure 2-4.)
Rather than evaluating each transformation as soon as it is specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver). Spark can keep an RDD loaded in-memory on the executor nodes throughout the life of a Spark application for faster access in repeated computations. As they are implemented in Spark, RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the existing one. As we will explore in this chapter, this paradigm of lazy evaluation, in-memory storage, and immutability allows Spark to be easy-to-use, fault-tolerant, scalable, and efficient.
Lazy Evaluation
Many other systems for in-memory storage are based on “fine-grained” updates to mutable objects, i.e., calls to a particular cell in a table, achieved by storing intermediate results. In contrast, evaluation of RDDs is completely lazy. Spark does not begin computing the partitions until an action is called. An action is a Spark operation that returns something other than an RDD, triggering evaluation of partitions and possibly returning some output to a non-Spark system (outside of the Spark executors); for example, bringing data back to the driver (with operations like count or collect) or writing data to an external storage system (such as copyToHadoop). Actions trigger the scheduler, which builds a directed acyclic graph (called the DAG) based on the dependencies between RDD transformations. In other words, Spark evaluates an action by working backward to define the series of steps it has to take to produce each object in the final distributed dataset (each partition). Then, using this series of steps, called the execution plan, the scheduler computes the missing partitions for each stage until it computes the result.
Not all transformations are 100% lazy. sortByKey needs to evaluate the RDD to determine the range of data, so it involves both a transformation and an action.
Performance and usability advantages of lazy evaluation
Lazy evaluation allows Spark to combine operations that don’t require communication with the driver (called transformations with one-to-one dependencies) to avoid doing multiple passes through the data. For example, suppose a Spark program calls a map and a filter function on the same RDD. Spark can send the instructions for both the map and the filter to each executor. Then Spark can perform both the map and filter on each partition, which requires accessing the records only once, rather than sending two sets of instructions and accessing each partition twice. This theoretically reduces the computational complexity by half.

Spark’s lazy evaluation paradigm is not only more efficient, it also makes it easier to implement the same logic in Spark than in a different framework—like MapReduce—that requires the developer to do the work to consolidate her mapping operations. Spark’s clever lazy evaluation strategy lets us be lazy and express the same logic in far fewer lines of code: we can chain together operations with narrow dependencies and let the Spark evaluation engine do the work of consolidating them, as in the sketch below.
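This is a minimal sketch of that consolidation (ours, not one of the book’s numbered examples); the chain involves only narrow dependencies, so Spark can run the map and the filter together in one pass over each partition when the count action finally fires:

import org.apache.spark.rdd.RDD

def countPalindromes(rdd: RDD[String]): Long = {
  // Neither transformation runs when these lines execute; Spark only
  // records them in the lineage of the resulting RDDs.
  val normalized = rdd.map(_.trim.toLowerCase)
  val palindromes = normalized.filter(s => s.nonEmpty && s == s.reverse)

  // count is an action: only now are both functions shipped to the
  // executors, where they run back-to-back on each partition.
  palindromes.count()
}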
Consider the classic word count example that, given a dataset of documents, parses the text into words and then computes the count for each word. The Apache docs provide a word count example, which even in its simplest form comprises roughly fifty lines of code (excluding import statements) in Java. A comparable Spark implementation is roughly fifteen lines of code in Java and five in Scala, and is available on the Apache website. The example excludes the steps to read in the data, simply mapping documents to words and counting the words. We have reproduced it in Example 2-1.
Example 2-1. Simple Scala word count example

def simpleWordCount(rdd: RDD[String]): RDD[(String, Int)] = {
  val words = rdd.flatMap(_.split(" "))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
A further benefit of the Spark implementation of word count is that it is easier to modify and improve. Suppose that we now want to modify this function to filter out some “stop words” and punctuation from each document before computing the word count. In MapReduce, this would require adding the filter logic to the mapper to avoid doing a second pass through the data. An implementation of this routine for MapReduce can be found at https://github.com/kite-sdk/kite/wiki/WordCount-Version-Three. In contrast, we can modify the preceding Spark routine by simply putting a filter step before the map step that creates the key/value pairs. Example 2-2 shows how Spark’s lazy evaluation will consolidate the map and filter steps for us.
Example 2-2. Word count example with stop words filtered

def withStopWordsFiltered(rdd: RDD[String], illegalTokens: Array[Char],
    stopWords: Set[String]): RDD[(String, Int)] = {
  val separators = illegalTokens ++ Array[Char](' ')
  val tokens: RDD[String] = rdd.flatMap(_.split(separators)
    .map(_.trim.toLowerCase))
  val words = tokens.filter(token =>
    !stopWords.contains(token) && (token.length > 0))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
Lazy evaluation and fault tolerance
Spark is fault-tolerant, meaning Spark will not fail, lose data, or return inaccurate results in the event of a host machine or network failure. Spark’s unique method of fault tolerance is achieved because each partition of the data contains the dependency information needed to recalculate the partition. Most distributed computing paradigms that allow users to work with mutable objects provide fault tolerance by logging updates or duplicating data across machines.

In contrast, Spark does not need to maintain a log of updates to each RDD or log the actual intermediary steps, since the RDD itself contains all the dependency information needed to replicate each of its partitions. Thus, if a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be parallelized to make recovery faster.
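You can inspect this lineage information yourself; the following sketch (ours, not from the text) uses the toDebugString method on RDDs to print the chain of dependencies Spark would replay to recompute a lost partition:

import org.apache.spark.SparkContext

def printLineage(sc: SparkContext): Unit = {
  val counts = sc.parallelize(Seq("a", "b", "a"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Prints the parent RDDs and the shuffle boundary introduced by
  // reduceByKey -- the information used for fault recovery.
  println(counts.toDebugString)
}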
Lazy evaluation and debugging
Lazy evaluation has important consequences for debugging, since it means that a Spark program will fail only at the point of action. For example, suppose that you were using the word count example, and afterwards were collecting the results to the driver. If the value you passed in for the stop words was null (maybe because it was the result of a Java program), the code would of course fail with a null pointer exception in the contains check. However, this failure would not appear until the program evaluated the collect step. Even the stack trace will show the failure as first occurring at the collect step, suggesting that the failure came from the collect statement. For this reason it is probably most efficient to develop in an environment that gives you access to complete debugging information.
Because of lazy evaluation, stack traces from failed Spark jobs
(especially when embedded in larger systems) will often appear to
fail consistently at the point of the action, even if the problem in
the logic occurs in a transformation much earlier in the program.
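A minimal sketch of this failure mode (our own illustration, echoing the stop-words scenario above): the null is introduced when the filter is defined, but the exception only surfaces at the collect action:

import org.apache.spark.rdd.RDD

def filterStopWords(rdd: RDD[String], stopWords: Set[String]): Array[String] = {
  // If stopWords is null, this line still "succeeds": the filter is
  // merely recorded in the lineage, not executed.
  val filtered = rdd.filter(word => !stopWords.contains(word))

  // The NullPointerException surfaces here, at the action, and the
  // stack trace points at collect rather than at the filter above.
  filtered.collect()
}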
In-Memory Persistence and Memory Management
Spark’s performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance increase is due to Spark’s use of in-memory persistence. Rather than writing to disk between each pass through the data, Spark has the option of keeping the data on the executors loaded into memory. That way, the data on each partition is available in-memory each time it needs to be accessed.

Spark offers three options for memory management: in-memory as deserialized data, in-memory as serialized data, and on disk. Each has different space and time advantages:

In memory as deserialized Java objects
The most intuitive way to store objects in RDDs is as the original deserialized Java objects that are defined by the driver program. This form of in-memory storage avoids the cost of serialization, but can take up considerably more space than its serialized counterpart.
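These options correspond to storage levels on the persist method; the following sketch (ours, not the book’s example) shows how each would be requested:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def persistExamples(rdd: RDD[String]): Unit = {
  // Deserialized Java objects in memory (fast to access, space hungry).
  rdd.persist(StorageLevel.MEMORY_ONLY)

  // The alternatives below are shown commented out because an RDD's
  // storage level can only be set once:
  // rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized bytes in memory
  // rdd.persist(StorageLevel.DISK_ONLY)       // stored on local disk
}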
  • 1. High Performance Spark Best Practices for Scaling and Optimizing Apache Spark 1st Edition Holden Karau download https://textbookfull.com/product/high-performance-spark-best- practices-for-scaling-and-optimizing-apache-spark-1st-edition- holden-karau/ Download full version ebook from https://textbookfull.com
  • 2. We believe these products will be a great fit for you. Click the link to download now, or visit textbookfull.com to discover even more! Stream Processing with Apache Spark Mastering Structured Streaming and Spark Streaming 1st Edition Gerard Maas https://textbookfull.com/product/stream-processing-with-apache- spark-mastering-structured-streaming-and-spark-streaming-1st- edition-gerard-maas/ Introducing .NET for Apache Spark: Distributed Processing for Massive Datasets 1st Edition Ed Elliott https://textbookfull.com/product/introducing-net-for-apache- spark-distributed-processing-for-massive-datasets-1st-edition-ed- elliott/ Spark in Action - Second Edition: Covers Apache Spark 3 with Examples in Java, Python, and Scala Jean-Georges Perrin https://textbookfull.com/product/spark-in-action-second-edition- covers-apache-spark-3-with-examples-in-java-python-and-scala- jean-georges-perrin/ Graph Algorithms Practical Examples in Apache Spark and Neo4j 1st Edition Mark Needham https://textbookfull.com/product/graph-algorithms-practical- examples-in-apache-spark-and-neo4j-1st-edition-mark-needham/
  • 3. Apache Spark 2 x Cookbook Cloud ready recipes for analytics and data science Rishi Yadav https://textbookfull.com/product/apache-spark-2-x-cookbook-cloud- ready-recipes-for-analytics-and-data-science-rishi-yadav/ Big Data SMACK A Guide to Apache Spark Mesos Akka Cassandra and Kafka 1st Edition Raul Estrada https://textbookfull.com/product/big-data-smack-a-guide-to- apache-spark-mesos-akka-cassandra-and-kafka-1st-edition-raul- estrada/ Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud Robert Ilijason https://textbookfull.com/product/beginning-apache-spark-using- azure-databricks-unleashing-large-cluster-analytics-in-the-cloud- robert-ilijason/ Beginning Apache Spark Using Azure Databricks: Unleashing Large Cluster Analytics in the Cloud 1st Edition Robert Ilijason https://textbookfull.com/product/beginning-apache-spark-using- azure-databricks-unleashing-large-cluster-analytics-in-the- cloud-1st-edition-robert-ilijason/ Spark GraphX in Action 1st Edition Michael Malak https://textbookfull.com/product/spark-graphx-in-action-1st- edition-michael-malak/
  • 4. Holden Karau & Rachel Warren High Performance Spark BEST PRACTICES FOR SCALING & OPTIMIZING APACHE SPARK
  • 6. Holden Karau and Rachel Warren High Performance Spark Best Practices for Scaling and Optimizing Apache Spark Boston Farnham Sebastopol Tokyo Beijing Boston Farnham Sebastopol Tokyo Beijing
  • 7. 978-1-491-94320-5 [LSI] High Performance Spark by Holden Karau and Rachel Warren Copyright © 2017 Holden Karau, Rachel Warren. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐ tutional sales department: 800-998-9938 or [email protected]. Editor: Shannon Cutt Indexer: Ellen Troutman-Zaig Production Editor: Kristen Brown Interior Designer: David Futato Copyeditor: Kim Cofer Cover Designer: Karen Montgomery Proofreader: James Fraleigh Illustrator: Rebecca Demarest June 2017: First Edition Revision History for the First Edition 2017-05-22: First Release The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. High Performance Spark, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
  • 8. Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1. Introduction to High Performance Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 What Is Spark and Why Performance Matters 1 What You Can Expect to Get from This Book 2 Spark Versions 3 Why Scala? 3 To Be a Spark Expert You Have to Learn a Little Scala Anyway 3 The Spark Scala API Is Easier to Use Than the Java API 4 Scala Is More Performant Than Python 4 Why Not Scala? 4 Learning Scala 5 Conclusion 6 2. How Spark Works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 How Spark Fits into the Big Data Ecosystem 8 Spark Components 8 Spark Model of Parallel Computing: RDDs 10 Lazy Evaluation 11 In-Memory Persistence and Memory Management 13 Immutability and the RDD Interface 14 Types of RDDs 16 Functions on RDDs: Transformations Versus Actions 17 Wide Versus Narrow Dependencies 17 Spark Job Scheduling 19 Resource Allocation Across Applications 20 The Spark Application 20 The Anatomy of a Spark Job 22 iii
  • 9. The DAG 22 Jobs 23 Stages 23 Tasks 24 Conclusion 26 3. DataFrames, Datasets, and Spark SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 Getting Started with the SparkSession (or HiveContext or SQLContext) 28 Spark SQL Dependencies 30 Managing Spark Dependencies 31 Avoiding Hive JARs 32 Basics of Schemas 33 DataFrame API 36 Transformations 36 Multi-DataFrame Transformations 48 Plain Old SQL Queries and Interacting with Hive Data 49 Data Representation in DataFrames and Datasets 49 Tungsten 50 Data Loading and Saving Functions 51 DataFrameWriter and DataFrameReader 51 Formats 52 Save Modes 61 Partitions (Discovery and Writing) 61 Datasets 62 Interoperability with RDDs, DataFrames, and Local Collections 62 Compile-Time Strong Typing 64 Easier Functional (RDD “like”) Transformations 64 Relational Transformations 64 Multi-Dataset Relational Transformations 65 Grouped Operations on Datasets 65 Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 66 Query Optimizer 69 Logical and Physical Plans 69 Code Generation 69 Large Query Plans and Iterative Algorithms 70 Debugging Spark SQL Queries 70 JDBC/ODBC Server 70 Conclusion 72 4. Joins (SQL and Core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Core Spark Joins 73 iv | Table of Contents
  • 10. Choosing a Join Type 75 Choosing an Execution Plan 76 Spark SQL Joins 79 DataFrame Joins 79 Dataset Joins 83 Conclusion 84 5. Effective Transformations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Narrow Versus Wide Transformations 86 Implications for Performance 88 Implications for Fault Tolerance 89 The Special Case of coalesce 89 What Type of RDD Does Your Transformation Return? 90 Minimizing Object Creation 92 Reusing Existing Objects 92 Using Smaller Data Structures 95 Iterator-to-Iterator Transformations with mapPartitions 98 What Is an Iterator-to-Iterator Transformation? 99 Space and Time Advantages 100 An Example 101 Set Operations 104 Reducing Setup Overhead 105 Shared Variables 106 Broadcast Variables 106 Accumulators 107 Reusing RDDs 112 Cases for Reuse 112 Deciding if Recompute Is Inexpensive Enough 115 Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 116 Alluxio (nee Tachyon) 120 LRU Caching 121 Noisy Cluster Considerations 122 Interaction with Accumulators 123 Conclusion 124 6. Working with Key/Value Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 The Goldilocks Example 127 Goldilocks Version 0: Iterative Solution 128 How to Use PairRDDFunctions and OrderedRDDFunctions 130 Actions on Key/Value Pairs 131 What’s So Dangerous About the groupByKey Function 132 Goldilocks Version 1: groupByKey Solution 132 Table of Contents | v
  • 11. Choosing an Aggregation Operation 136 Dictionary of Aggregation Operations with Performance Considerations 136 Multiple RDD Operations 139 Co-Grouping 139 Partitioners and Key/Value Data 140 Using the Spark Partitioner Object 142 Hash Partitioning 142 Range Partitioning 142 Custom Partitioning 143 Preserving Partitioning Information Across Transformations 144 Leveraging Co-Located and Co-Partitioned RDDs 144 Dictionary of Mapping and Partitioning Functions PairRDDFunctions 146 Dictionary of OrderedRDDOperations 147 Sorting by Two Keys with SortByKey 149 Secondary Sort and repartitionAndSortWithinPartitions 149 Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 150 How Not to Sort by Two Orderings 153 Goldilocks Version 2: Secondary Sort 154 A Different Approach to Goldilocks 157 Goldilocks Version 3: Sort on Cell Values 162 Straggler Detection and Unbalanced Data 163 Back to Goldilocks (Again) 165 Goldilocks Version 4: Reduce to Distinct on Each Partition 165 Conclusion 171 7. Going Beyond Scala. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Beyond Scala within the JVM 174 Beyond Scala, and Beyond the JVM 178 How PySpark Works 179 How SparkR Works 187 Spark.jl (Julia Spark) 189 How Eclair JS Works 190 Spark on the Common Language Runtime (CLR)—C# and Friends 191 Calling Other Languages from Spark 191 Using Pipe and Friends 191 JNI 193 Java Native Access (JNA) 196 Underneath Everything Is FORTRAN 196 Getting to the GPU 198 The Future 198 Conclusion 198 vi | Table of Contents
  • 12. 8. Testing and Validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Unit Testing 201 General Spark Unit Testing 202 Mocking RDDs 206 Getting Test Data 208 Generating Large Datasets 208 Sampling 209 Property Checking with ScalaCheck 211 Computing RDD Difference 211 Integration Testing 214 Choosing Your Integration Testing Environment 214 Verifying Performance 215 Spark Counters for Verifying Performance 215 Projects for Verifying Performance 216 Job Validation 216 Conclusion 217 9. Spark MLlib and ML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 Choosing Between Spark MLlib and Spark ML 219 Working with MLlib 220 Getting Started with MLlib (Organization and Imports) 220 MLlib Feature Encoding and Data Preparation 221 Feature Scaling and Selection 226 MLlib Model Training 226 Predicting 227 Serving and Persistence 228 Model Evaluation 230 Working with Spark ML 231 Spark ML Organization and Imports 231 Pipeline Stages 232 Explain Params 233 Data Encoding 234 Data Cleaning 236 Spark ML Models 237 Putting It All Together in a Pipeline 238 Training a Pipeline 239 Accessing Individual Stages 239 Data Persistence and Spark ML 239 Extending Spark ML Pipelines with Your Own Algorithms 242 Model and Pipeline Persistence and Serving with Spark ML 250 General Serving Considerations 250 Conclusion 251 Table of Contents | vii
  • 13. 10. Spark Components and Packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 Stream Processing with Spark 255 Sources and Sinks 255 Batch Intervals 257 Data Checkpoint Intervals 258 Considerations for DStreams 259 Considerations for Structured Streaming 260 High Availability Mode (or Handling Driver Failure or Checkpointing) 268 GraphX 269 Using Community Packages and Libraries 269 Creating a Spark Package 271 Conclusion 272 A. Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist. . . . . . . 273 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 viii | Table of Contents
  • 14. Preface We wrote this book for data engineers and data scientists who are looking to get the most out of Spark. If you’ve been working with Spark and invested in Spark but your experience so far has been mired by memory errors and mysterious, intermittent fail‐ ures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark, and may be difficult to understand for those with little or no experience with Spark or distributed computing. For recommendations of more introductory literature see “Supporting Books and Materials” on page x. We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are doing primarily exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, usually more intuitive to the data scientist. Thus it may be more useful to a data engineer who may be less experienced with thinking criti‐ cally about the statistical nature, distribution, and layout of data when considering performance. We hope that this book will help data engineers think more critically about their data as they put pipelines into production. We want to help our readers ask questions such as “How is my data distributed?”, “Is it skewed?”, “What is the range of values in a column?”, and “How do we expect a given value to group?” and then apply the answers to those questions to the logic of their Spark queries. However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you may have a better shot of getting something to run the first time. We hope to guide data scien‐ tists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to Preface | ix
  • 15. 1 Though we may be biased. 2 Although it’s important to note that some of the practices suggested in this book are not common practice in Spark code. explore their data more fully, more quickly, and to communicate effectively with any‐ one helping them put their algorithms into production. Regardless of your job title, it is likely that the amount of data with which you are working is growing quickly. Your original solutions may need to be scaled, and your old techniques for solving new problems may need to be updated. We hope this book will help you leverage Apache Spark to tackle new problems more easily and old problems more efficiently. First Edition Notes You are reading the first edition of High Performance Spark, and for that, we thank you! If you find errors, mistakes, or have ideas for ways to improve this book, please reach out to us at [email protected]. If you wish to be included in a “thanks” section in future editions of the book, please include your pre‐ ferred display name. Supporting Books and Materials For data scientists and developers new to Spark, Learning Spark by Karau, Konwin‐ ski, Wendell, and Zaharia is an excellent introduction,1 and Advanced Analytics with Spark by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills is a great book for interested data scientists. For individuals more interested in streaming, the upcoming Learning Spark Streaming by François Garillot may also be of use once it is available. Beyond books, there is also a collection of intro-level Spark training material avail‐ able. For individuals who prefer video, Paco Nathan has an excellent introduction video series on O’Reilly. Commercially, Databricks as well as Cloudera and other Hadoop/Spark vendors offer Spark training. Previous recordings of Spark camps, as well as many other great resources, have been posted on the Apache Spark documen‐ tation page. If you don’t have experience with Scala, we do our best to convince you to pick up Scala in Chapter 1, and if you are interested in learning, Programming Scala, 2nd Edi‐ tion, by Dean Wampler and Alex Payne is a good introduction.2 x | Preface
  • 16. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. Examples prefixed with “Evil” depend heavily on Apache Spark internals, and will likely break in future minor releases of Apache Spark. You’ve been warned—but we totally understand you aren’t going to pay much attention to that because neither would we. Preface | xi
  • 17. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download from the High Performance Spark GitHub repository and some of the testing code is avail‐ able at the “Spark Testing Base” GitHub repository and the Spark Validator repo. Structured Streaming machine learning examples, which are generally in the “evil” category discussed under “Conventions Used in This Book” on page xi, are available at https://github.com/holdenk/spark-structured-streaming-ml. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. The code is also avail‐ able under an Apache 2 License. Incorporating a significant amount of example code from this book into your product’s documentation may require permission. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “High Performance Spark by Holden Karau and Rachel Warren (O’Reilly). Copyright 2017 Holden Karau, Rachel Warren, 978-1-491-94320-5.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected]. O’Reilly Safari Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals. Members have access to thousands of books, training videos, Learning Paths, interac‐ tive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Pro‐ fessional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others. For more information, please visit http://oreilly.com/safari. xii | Preface
  • 18. How to Contact the Authors For feedback, email us at [email protected]. For random ramblings, occasionally about Spark, follow us on twitter: Holden: http://twitter.com/holdenkarau Rachel: https://twitter.com/warre_n_peace How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) To comment or ask technical questions about this book, send email to bookques‐ [email protected]. For more information about our books, courses, conferences, and news, see our web‐ site at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Acknowledgments The authors would like to acknowledge everyone who has helped with comments and suggestions on early drafts of our work. Special thanks to Anya Bida, Jakob Odersky, and Katharine Kearnan for reviewing early drafts and diagrams. We’d like to thank Mahmoud Hanafy for reviewing and improving the sample code as well as early drafts. We’d also like to thank Michael Armbrust for reviewing and providing feed‐ back on early drafts of the SQL chapter. Justin Pihony has been one of the most active early readers, suggesting fixes in every respect (language, formatting, etc.). Thanks to all of the readers of our O’Reilly early release who have provided feedback on various errata, including Kanak Kshetri and Rubén Berenguel. Preface | xiii
  • 19. Finally, thank you to our respective employers for being understanding as we’ve worked on this book. Especially Lawrence Spracklen who insisted we mention him here :p. xiv | Preface
  • 20. 1 From http://spark.apache.org/. CHAPTER 1 Introduction to High Performance Spark This chapter provides an overview of what we hope you will be able to learn from this book and does its best to convince you to learn Scala. Feel free to skip ahead to Chap‐ ter 2 if you already know what you’re looking for and use Scala (or have your heart set on another language). What Is Spark and Why Performance Matters Apache Spark is a high-performance, general-purpose distributed computing system that has become the most active Apache open source project, with more than 1,000 active contributors.1 Spark enables us to process large quantities of data, beyond what can fit on a single machine, with a high-level, relatively easy-to-use API. Spark’s design and interface are unique, and it is one of the fastest systems of its kind. Uniquely, Spark allows us to write the logic of data transformations and machine learning algorithms in a way that is parallelizable, but relatively system agnostic. So it is often possible to write computations that are fast for distributed storage systems of varying kind and size. However, despite its many advantages and the excitement around Spark, the simplest implementation of many common data science routines in Spark can be much slower and much less robust than the best version. Since the computations we are concerned with may involve data at a very large scale, the time and resources that gains from tuning code for performance are enormous. Performance does not just mean run faster; often at this scale it means getting something to run at all. It is possible to con‐ struct a Spark query that fails on gigabytes of data but, when refactored and adjusted with an eye toward the structure of the data and the requirements of the cluster, 1
  • 21. succeeds on the same system with terabytes of data. In the authors’ experience writ‐ ing production Spark code, we have seen the same tasks, run on the same clusters, run 100× faster using some of the optimizations discussed in this book. In terms of data processing, time is money, and we hope this book pays for itself through a reduction in data infrastructure costs and developer hours. Not all of these techniques are applicable to every use case. Especially because Spark is highly configurable and is exposed at a higher level than other computational frameworks of comparable power, we can reap tremendous benefits just by becoming more attuned to the shape and structure of our data. Some techniques can work well on certain data sizes or even certain key distributions, but not all. The simplest exam‐ ple of this can be how for many problems, using groupByKey in Spark can very easily cause the dreaded out-of-memory exceptions, but for data with few duplicates this operation can be just as quick as the alternatives that we will present. Learning to understand your particular use case and system and how Spark will interact with it is a must to solve the most complex data science problems with Spark. What You Can Expect to Get from This Book Our hope is that this book will help you take your Spark queries and make them faster, able to handle larger data sizes, and use fewer resources. This book covers a broad range of tools and scenarios. You will likely pick up some techniques that might not apply to the problems you are working with, but that might apply to a problem in the future and may help shape your understanding of Spark more gener‐ ally. The chapters in this book are written with enough context to allow the book to be used as a reference; however, the structure of this book is intentional and reading the sections in order should give you not only a few scattered tips, but a comprehen‐ sive understanding of Apache Spark and how to make it sing. It’s equally important to point out what you will likely not get from this book. This book is not intended to be an introduction to Spark or Scala; several other books and video series are available to get you started. The authors may be a little biased in this regard, but we think Learning Spark by Karau, Konwinski, Wendell, and Zaharia as well as Paco Nathan’s introduction video series are excellent options for Spark begin‐ ners. While this book is focused on performance, it is not an operations book, so top‐ ics like setting up a cluster and multitenancy are not covered. We are assuming that you already have a way to use Spark in your system, so we won’t provide much assis‐ tance in making higher-level architecture decisions. There are future books in the works, by other authors, on the topic of Spark operations that may be done by the time you are reading this one. If operations are your show, or if there isn’t anyone responsible for operations in your organization, we hope those books can help you. 2 | Chapter 1: Introduction to High Performance Spark
  • 22. 2 MiMa is the Migration Manager for Scala and tries to catch binary incompatibilities between releases. Spark Versions Spark follows semantic versioning with the standard [MAJOR].[MINOR].[MAINTE‐ NANCE] with API stability for public nonexperimental nondeveloper APIs within minor and maintenance releases. Many of these experimental components are some of the more exciting from a performance standpoint, including Datasets—Spark SQL’s new structured, strongly-typed, data abstraction. Spark also tries for binary API compatibility between releases, using MiMa2 ; so if you are using the stable API you generally should not need to recompile to run a job against a new version of Spark unless the major version has changed. This book was created using the Spark 2.0.1 APIs, but much of the code will work in earlier versions of Spark as well. In places where this is not the case we have attempted to call that out. Why Scala? In this book, we will focus on Spark’s Scala API and assume a working knowledge of Scala. Part of this decision is simply in the interest of time and space; we trust readers wanting to use Spark in another language will be able to translate the concepts used in this book without presenting the examples in Java and Python. More importantly, it is the belief of the authors that “serious” performant Spark development is most easily achieved in Scala. To be clear, these reasons are very specific to using Spark with Scala; there are many more general arguments for (and against) Scala’s applications in other contexts. To Be a Spark Expert You Have to Learn a Little Scala Anyway Although Python and Java are more commonly used languages, learning Scala is a worthwhile investment for anyone interested in delving deep into Spark develop‐ ment. Spark’s documentation can be uneven. However, the readability of the code‐ base is world-class. Perhaps more than with other frameworks, the advantages of cultivating a sophisticated understanding of the Spark codebase is integral to the advanced Spark user. Because Spark is written in Scala, it will be difficult to interact with the Spark source code without the ability, at least, to read Scala code. Further‐ more, the methods in the Resilient Distributed Datasets (RDD) class closely mimic those in the Scala collections API. RDD functions, such as map, filter, flatMap, Spark Versions | 3
  • 23. 3 Although, as we explore in this book, the performance implications and evaluation semantics are quite different. 4 Of course, in performance, every rule has its exception. mapPartitions in Spark 1.6 and earlier in Java suffers some severe performance restrictions that we discuss in “Iterator-to-Iterator Transformations with mapParti‐ tions” on page 98. reduce, and fold, have nearly identical specifications to their Scala equivalents.3 Fun‐ damentally Spark is a functional framework, relying heavily on concepts like immut‐ ability and lambda definition, so using the Spark API may be more intuitive with some knowledge of functional programming. The Spark Scala API Is Easier to Use Than the Java API Once you have learned Scala, you will quickly find that writing Spark in Scala is less painful than writing Spark in Java. First, writing Spark in Scala is significantly more concise than writing Spark in Java since Spark relies heavily on inline function defini‐ tions and lambda expressions, which are much more naturally supported in Scala (especially before Java 8). Second, the Spark shell can be a powerful tool for debug‐ ging and development, and is only available in languages with existing REPLs (Scala, Python, and R). Scala Is More Performant Than Python It can be attractive to write Spark in Python, since it is easy to learn, quick to write, interpreted, and includes a very rich set of data science toolkits. However, Spark code written in Python is often slower than equivalent code written in the JVM, since Scala is statically typed, and the cost of JVM communication (from Python to Scala) can be very high. Last, Spark features are generally written in Scala first and then translated into Python, so to use cutting-edge Spark functionality, you will need to be in the JVM; Python support for MLlib and Spark Streaming are particularly behind. Why Not Scala? There are several good reasons to develop with Spark in other languages. One of the more important constant reasons is developer/team preference. Existing code, both internal and in libraries, can also be a strong reason to use a different language. Python is one of the most supported languages today. While writing Java code can be clunky and sometimes lag slightly in terms of API, there is very little performance cost to writing in another JVM language (at most some object conversions).4 4 | Chapter 1: Introduction to High Performance Spark
While all of the examples in this book are presented in Scala for the final release, we will port many of the examples from Scala to Java and Python where the differences in implementation could be important. These will be available (over time) at our GitHub. If you find yourself wanting a specific example ported, please either email us or create an issue on the GitHub repo.

Spark SQL does much to minimize the performance difference when using a non-JVM language. Chapter 7 looks at options to work effectively in Spark with languages outside of the JVM, including Spark's supported languages of Python and R. This section also offers guidance on how to use Fortran, C, and GPU-specific code to reap additional performance improvements. Even if we are developing most of our Spark application in Scala, we shouldn't feel tied to doing everything in Scala, because specialized libraries in other languages can be well worth the overhead of going outside the JVM.

Learning Scala

If after all of this we've convinced you to use Scala, there are several excellent options for learning it. Spark 1.6 is built against Scala 2.10 and cross-compiled against Scala 2.11; Spark 2.0 is built against Scala 2.11, possibly cross-compiled against Scala 2.10, and may add 2.12 in the future. Depending on how much we've convinced you to learn Scala, and what your resources are, there are a number of different options, ranging from books to massive open online courses (MOOCs) to professional training.

For books, Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne can be great, although many of the actor-system references are not relevant while working in Spark. The Scala language website also maintains a list of Scala books. In addition to books focused on Spark, there are online courses for learning Scala. Functional Programming Principles in Scala, taught by Martin Odersky, Scala's creator, is available on Coursera; Introduction to Functional Programming is available on edX. A number of different companies also offer video-based Scala courses, none of which the authors have personally experienced or recommend.

For those who prefer a more interactive approach, professional training is offered by a number of different companies, including Lightbend (formerly Typesafe). While we have not directly experienced Typesafe training, it receives positive reviews and is known especially to help bring a team or group of individuals up to speed with Scala for the purposes of working with Spark.
Conclusion

Although you will likely be able to get the most out of Spark performance if you have an understanding of Scala, working in Spark does not require a knowledge of Scala. For those whose problems are better suited to other languages or tools, techniques for working with other languages will be covered in Chapter 7. This book is aimed at individuals who already have a grasp of the basics of Spark, and we thank you for choosing High Performance Spark to deepen your knowledge of Spark. The next chapter will introduce some of Spark's general design and evaluation paradigms that are important to understanding how to efficiently utilize Spark.
Chapter 2: How Spark Works

This chapter introduces the overall design of Spark as well as its place in the big data ecosystem. Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop.[1] As we will discuss in this chapter, Spark's design principles are quite different from those of MapReduce. Unlike Hadoop MapReduce, Spark does not need to be run in tandem with Apache Hadoop, although it often is. Spark has inherited parts of its API, design, and supported formats from other existing computational frameworks, particularly DryadLINQ.[2] However, Spark's internals, especially how it handles failures, differ from many traditional systems, and Spark's ability to leverage lazy evaluation within in-memory computations is particularly distinctive. Spark's creators believe it to be the first high-level programming language for fast, distributed data processing.[3]

[1] MapReduce is a programmatic paradigm that defines programs in terms of map procedures that filter and sort data onto the nodes of a distributed system, and reduce procedures that aggregate the data on the mapper nodes. Implementations of MapReduce have been written in many languages, but the term usually refers to a popular implementation called Hadoop MapReduce, packaged with the distributed filesystem, Apache Hadoop Distributed File System.

[2] DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset, and then exposes functions to transform data as methods defined on that dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark's. However, DryadLINQ doesn't use in-memory storage. For more information see the DryadLINQ documentation.

[3] See the original Spark paper and other Spark papers.

To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. In this chapter, we will provide a broad overview of Spark's model of parallel computing and a thorough explanation of the Spark scheduler and execution engine.
We will refer to the concepts in this chapter throughout the text. Further, we hope this explanation will provide you with a more precise understanding of some of the terms you've heard tossed around by other Spark users and encounter in the Spark documentation.

How Spark Fits into the Big Data Ecosystem

Apache Spark is an open source framework that provides generalizable methods to process data in parallel; the same high-level Spark functions can be used to perform disparate data processing tasks on data of different sizes and structures. On its own, Spark is not a data storage solution; it performs computations on Spark JVMs (Java Virtual Machines) that last only for the duration of a Spark application. Spark can be run locally on a single machine with a single JVM (called local mode). More often, Spark is used in tandem with a distributed storage system (e.g., HDFS, Cassandra, or S3) and a cluster manager, with the storage system housing the data processed with Spark, and the cluster manager orchestrating the distribution of Spark applications across the cluster. Spark currently supports three kinds of cluster managers: the Standalone Cluster Manager, Apache Mesos, and Hadoop YARN (see Figure 2-1). The Standalone Cluster Manager is included in Spark, but using the Standalone manager requires installing Spark on each node of the cluster.

Figure 2-1. A diagram of the data processing ecosystem, including Spark.
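As a minimal sketch of how the choice between local mode and a cluster manager surfaces in code (the application name and host names below are our own placeholder assumptions, not values from the book):

import org.apache.spark.{SparkConf, SparkContext}

// Local mode: driver and executors share a single JVM with two worker threads.
val conf = new SparkConf()
  .setAppName("ecosystem-demo")
  .setMaster("local[2]")

// Under a cluster manager, only the master URL changes, for example:
//   Standalone: spark://master-host:7077
//   Mesos:      mesos://master-host:5050
//   YARN:       yarn (the cluster is located via the Hadoop configuration)
val sc = new SparkContext(conf)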
Spark Components

Spark provides a high-level query language to process data. Spark Core, the main data processing framework in the Spark ecosystem, has APIs in Scala, Java, Python, and R. Spark is built around a data abstraction called Resilient Distributed Datasets (RDDs). RDDs are a representation of lazily evaluated, statically typed, distributed collections. RDDs have a number of predefined "coarse-grained" transformations (functions that are applied to the entire dataset), such as map, join, and reduce, to manipulate the distributed datasets, as well as I/O functionality to read and write data between the distributed storage system and the Spark JVMs. While Spark also supports R, at present the RDD interface is not available in that language. We will cover tips for using Java, Python, R, and other languages in detail in Chapter 7.

In addition to Spark Core, the Spark ecosystem includes a number of other first-party components, including Spark SQL, Spark MLlib, Spark ML, Spark Streaming, and GraphX,[4] which provide more specific data processing functionality. Some of these components have the same generic performance considerations as the Core; MLlib, for example, is written almost entirely on the Spark API. However, some of them have unique considerations: Spark SQL, for example, has a different query optimizer than Spark Core.

[4] GraphX is not actively developed at this point, and will likely be replaced with GraphFrames or similar.

Spark SQL is a component that can be used in tandem with Spark Core. It has APIs in Scala, Java, Python, and R, and supports basic SQL queries. Spark SQL defines an interface for a semi-structured data type, called DataFrames, and, as of Spark 1.6, a semi-structured, typed version of RDDs called Datasets.[5] Spark SQL is a very important component for Spark performance, and much of what can be accomplished with Spark Core can be done by leveraging Spark SQL. We will cover Spark SQL in detail in Chapter 3 and compare the performance of joins in Spark SQL and Spark Core in Chapter 4.

[5] Datasets and DataFrames are unified in Spark 2.0. Datasets are DataFrames of "Row" objects that can be accessed by field number.
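As a minimal sketch of the DataFrame side of Spark SQL (the entry point shown is the Spark 1.x-style SQLContext, and the data, column names, and table name are our own illustrative assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // sc is an existing SparkContext

// A DataFrame is a distributed collection of rows with a known schema.
val people = sqlContext
  .createDataFrame(Seq(("Ada", 36), ("Grace", 45), ("Alan", 41)))
  .toDF("name", "age")

// The same query expressed through the DataFrame API and through SQL:
people.filter(people("age") > 40).show()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 40").show()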
Spark has two machine learning packages: ML and MLlib. MLlib is a package of machine learning and statistics algorithms written with Spark. Spark ML is still in the early stages, and has only existed since Spark 1.2. Spark ML provides a higher-level API than MLlib, with the goal of allowing users to more easily create practical machine learning pipelines. Spark MLlib is primarily built on top of RDDs and uses functions from Spark Core, while ML is built on top of Spark SQL DataFrames.[6] Eventually the Spark community plans to move over to ML and deprecate MLlib. Spark ML and MLlib both have additional performance considerations beyond Spark Core and Spark SQL; we cover some of these in Chapter 9.

[6] See the MLlib documentation.

Spark Streaming uses the scheduling of Spark Core for streaming analytics on minibatches of data. Spark Streaming has a number of unique considerations, such as the window sizes used for batches. We offer some tips for using Spark Streaming in "Stream Processing with Spark" on page 255.

GraphX is a graph processing framework built on top of Spark with an API for graph computations. GraphX is one of the least mature components of Spark, so we don't cover it in much detail. In future versions of Spark, typed graph functionality will be introduced on top of the Dataset API. We will provide a cursory glance at GraphX in "GraphX" on page 269.

This book will focus on optimizing programs written with Spark Core and Spark SQL. However, since MLlib and the other frameworks are written using the Spark API, this book will provide the tools you need to leverage those frameworks more efficiently. Maybe by the time you're done, you will be ready to start contributing your own functions to MLlib and ML!

In addition to these first-party components, the community has written a number of libraries that provide additional functionality, such as for testing or parsing CSVs, and offer tools to connect Spark to different data sources. Many libraries are listed at http://spark-packages.org/, and can be dynamically included at runtime with spark-submit or the spark-shell, or added as build dependencies to your maven or sbt project. We first use Spark packages to add support for CSV data in "Additional formats" on page 59, and then in more detail in "Using Community Packages and Libraries" on page 269.
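For example, a CSV package such as spark-csv can be pulled in at launch with spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 (the exact coordinates and version are our assumption for a Scala 2.11 build) and then used like any built-in data source; the input path below is a placeholder:

// Assumes the spark-csv package was added via --packages or as a build dependency.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // treat the first line as column names
  .option("inferSchema", "true") // sample the data to guess column types
  .load("data/example.csv")      // hypothetical input path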
Spark Model of Parallel Computing: RDDs

Spark allows users to write a program for the driver (or master node) of a cluster computing system that can perform operations on data in parallel. Spark represents large datasets as RDDs (immutable, distributed collections of objects), which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system. The Spark cluster manager handles starting and distributing the Spark executors across a distributed system according to the configuration parameters set by the Spark application. The Spark execution engine itself distributes data across the executors for a computation. (See Figure 2-4.)

Rather than evaluating each transformation as soon as it is specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver). Spark can keep an RDD loaded in memory on the executor nodes throughout the life of a Spark application, for faster access in repeated computations. As they are implemented in Spark, RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the existing one. As we will explore in this chapter, this paradigm of lazy evaluation, in-memory storage, and immutability allows Spark to be easy to use, fault-tolerant, scalable, and efficient.

Lazy Evaluation

Many other systems for in-memory storage are based on "fine-grained" updates to mutable objects, i.e., calls to a particular cell in a table, with intermediate results stored as they go. In contrast, evaluation of RDDs is completely lazy: Spark does not begin computing the partitions until an action is called. An action is a Spark operation that returns something other than an RDD, triggering evaluation of partitions and possibly returning some output to a non-Spark system (outside of the Spark executors); for example, bringing data back to the driver (with operations like count or collect) or writing data to an external storage system (such as copyToHadoop). Actions trigger the scheduler, which builds a directed acyclic graph (called the DAG) based on the dependencies between RDD transformations. In other words, Spark evaluates an action by working backward to define the series of steps it has to take to produce each object in the final distributed dataset (each partition). Then, using this series of steps, called the execution plan, the scheduler computes the missing partitions for each stage until it computes the result.

Note: not all transformations are 100% lazy. sortByKey needs to evaluate the RDD to determine the range of data, so it involves both a transformation and an action.

Performance and usability advantages of lazy evaluation

Lazy evaluation allows Spark to combine operations that don't require communication with the driver (called transformations with one-to-one dependencies) to avoid doing multiple passes through the data. For example, suppose a Spark program calls a map and a filter function on the same RDD. Spark can send the instructions for both the map and the filter to each executor, and then perform both on each partition, which requires accessing the records only once, rather than sending two sets of instructions and accessing each partition twice. This theoretically reduces the computational complexity by half.

Spark's lazy evaluation paradigm is not only more efficient, it also makes it easier to implement the same logic in Spark than in a different framework, like MapReduce, that requires the developer to do the work of consolidating her mapping operations. Spark's clever lazy evaluation strategy lets us be lazy and express the same logic in far fewer lines of code: we can chain together operations with narrow dependencies and let the Spark evaluation engine do the work of consolidating them.
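A minimal sketch of this behavior (the input path and variable names are our own placeholders): nothing below reads any data until the action on the last line, at which point the map and the filter are pipelined into a single pass over each partition.

val lines = sc.textFile("data/logs.txt") // hypothetical input path

// Transformations only: each returns a new RDD, and no work happens yet.
val lengths = lines.map(_.length)
val longLines = lengths.filter(_ > 80)

// The action triggers the scheduler to build the DAG and run the job.
val numLongLines = longLines.count()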
Consider the classic word count example that, given a dataset of documents, parses the text into words and then computes the count for each word. The Apache docs provide a word count example, which even in its simplest form comprises roughly fifty lines of code (excluding import statements) in Java. A comparable Spark implementation is roughly fifteen lines of code in Java and five in Scala, available on the Apache website. The example excludes the steps to read in the data; it covers mapping documents to words and counting the words. We have reproduced it in Example 2-1.

Example 2-1. Simple Scala word count example

def simpleWordCount(rdd: RDD[String]): RDD[(String, Int)] = {
  val words = rdd.flatMap(_.split(" "))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}

A further benefit of the Spark implementation of word count is that it is easier to modify and improve. Suppose that we now want to modify this function to filter out some "stop words" and punctuation from each document before computing the word count. In MapReduce, this would require adding the filter logic to the mapper to avoid doing a second pass through the data. An implementation of this routine for MapReduce can be found at https://github.com/kite-sdk/kite/wiki/WordCount-Version-Three. In contrast, we can modify the preceding Spark routine by simply putting a filter step before the map step that creates the key/value pairs. Example 2-2 shows how Spark's lazy evaluation will consolidate the map and filter steps for us.

Example 2-2. Word count example with stop words filtered

def withStopWordsFiltered(rdd: RDD[String], illegalTokens: Array[Char],
    stopWords: Set[String]): RDD[(String, Int)] = {
  val separators = illegalTokens ++ Array[Char](' ')
  val tokens: RDD[String] = rdd.flatMap(_.split(separators)
    .map(_.trim.toLowerCase))
  val words = tokens.filter(token =>
    !stopWords.contains(token) && (token.length > 0))
  val wordPairs = words.map((_, 1))
  val wordCounts = wordPairs.reduceByKey(_ + _)
  wordCounts
}
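As a quick usage sketch (the input strings and expected pairs are our own illustration; output ordering is not guaranteed):

val docs = sc.parallelize(Seq("Spark is lazy.", "Lazy evaluation is efficient!"))
val counts = withStopWordsFiltered(docs, Array(',', '.', '!'), Set("is"))
counts.collect().foreach(println)
// (spark,1), (lazy,2), (evaluation,1), (efficient,1), in some order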
Lazy evaluation and fault tolerance

Spark is fault-tolerant, meaning Spark will not fail, lose data, or return inaccurate results in the event of a host machine or network failure. Spark's unique method of fault tolerance is achieved because each partition of the data contains the dependency information needed to recalculate the partition. Most distributed computing paradigms that allow users to work with mutable objects provide fault tolerance by logging updates or duplicating data across machines.

In contrast, Spark does not need to maintain a log of updates to each RDD or log the actual intermediary steps, since the RDD itself contains all the dependency information needed to replicate each of its partitions. Thus, if a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be parallelized to make recovery faster.

Lazy evaluation and debugging

Lazy evaluation has important consequences for debugging, since it means that a Spark program will fail only at the point of an action. For example, suppose that you were using the word count example, and afterwards were collecting the results to the driver. If the value you passed in for the stop words was null (maybe because it was the result of a Java program), the code would of course fail with a null pointer exception in the contains check. However, this failure would not appear until the program evaluated the collect step. Even the stack trace will show the failure as first occurring at the collect step, suggesting that the failure came from the collect statement. For this reason it is probably most efficient to develop in an environment that gives you access to complete debugging information.

Because of lazy evaluation, stack traces from failed Spark jobs (especially when embedded in larger systems) will often appear to fail consistently at the point of the action, even if the problem in the logic occurs in a transformation much earlier in the program.

In-Memory Persistence and Memory Management

Spark's performance advantage over MapReduce is greatest in use cases involving repeated computations. Much of this performance increase is due to Spark's use of in-memory persistence. Rather than writing to disk between each pass through the data, Spark has the option of keeping the data on the executors loaded into memory, so that the data on each partition is available in memory each time it needs to be accessed.

Spark offers three options for memory management: in-memory as deserialized data, in-memory as serialized data, and on disk. Each has different space and time advantages:

In memory as deserialized Java objects. The most intuitive way to store objects in RDDs is as the original deserialized Java objects that are defined by the driver program. This form of in-memory
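These three options map onto Spark's StorageLevel constants; as a minimal sketch of choosing among them (the RDD is our own example, and note that a given RDD can hold only one storage level at a time):

import org.apache.spark.storage.StorageLevel

val numbers = sc.parallelize(1 to 1000000)

// In memory as deserialized Java objects (what cache() uses by default):
numbers.persist(StorageLevel.MEMORY_ONLY)

// In memory as serialized data (smaller footprint, more CPU to access):
//   numbers.persist(StorageLevel.MEMORY_ONLY_SER)
// On disk:
//   numbers.persist(StorageLevel.DISK_ONLY)

numbers.sum() // the first action materializes and caches the partitions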