A Programmer's Introduction To Mathematics - Jeremy Kun
A Programmer's Introduction To Mathematics - Jeremy Kun
Jeremy Kun
All rights reserved. This book or any portion thereof may not be reproduced or used
in any manner whatsoever without the express written permission of the publisher
except for the use of brief quotations in a book review.
All images used in this book are either the author’s original works or in the
public domain. In particular, the only non-original images are in the chapter on
group theory, specifically the textures from Owen Jones’s design masterpiece, The Grammar of Ornament (1856), M.C. Escher’s Circle Limit IV (1960), and two
diagrams in the public domain, sourced from Wikipedia.
pimbook.org
To my wife, Erin.
My unbounded, uncountable thanks goes out to the many people who read drafts at
various stages of roughness and gave feedback, including (in alphabetical order by
first name), Aaron Shifman, Adam Lelkes, Alex Walchli, Ali Fathalian, Arun Koshy,
Ben Fish, Craig Stuntz, Devin Ivy, Erin Kelly, Fred Ross, Ian Sharkey, Jasper
Slusallek, Jean-Gabriel Young, João Rico, John Granata, Julian Leonardo Cuevas
Rozo, Kevin Finn, Landon Kavlie, Louis Maddox, Matthijs Hollemans, Olivia Simpson,
Pablo González de Aledo, Paige Bailey, Patrick Regan, Patrick Stein, Rodrigo Zhou,
Stephanie Labasan, Temple Keller, Trent McCormick.
An extra thanks to the readers who submitted errata at pimbook.org for the first
edition, including Abhinav Upadhyay, Abhishek Bhatia, Alejandro Baldominos, Andrei
Paleyes, Arman Yessenamanov, Arthur Allshire, Arunoda Susiripala, Bilal Karim
Reffas, Brian Cloutier, Brian van den Broek, Britton Winterrose, Cedric Bucerius,
Changyoung Koh, Charlie Mead, Chris G, Chrislain Razafimahefa, Darin Brezeale,
David Bimmler, David Furcy, David Shockley, David Wu, Devin Conathan, Don-Duong
Quach, Fidel
Jaime, Jan Moren, Jason Hooper, K. Alex Mills, Kenytt Avery, Konstantin Weitz,
Leandro Motta Barros, Luke A., Marco Craveiro, Matthijs, Maximilian Schlund, Meji
Abidoye, Michael Cohen, Michaël Defferrard, Nicolas Krause, Nikita V., Oliver
Sampson, Ondrej Slamecka, Patrick Stingley, Rich Yonts, Rodrigo Ariel Sota, Ryan
Troxler, Seth Yastrov, Simon Skrede, Sriram Srinivasan, Steve Dwyer, Steven D.
Brown, Tim Wilkens, Timo
Special thanks to John Peloquin for his thorough technical review for the second
edition, and to Devin Ivy for technical review of parts of the first edition.
Contents

Our Goal
Chapter 1.
Chapter 2. Polynomials
    2.4  Realizing it in Code
    2.6  Cultural Review
    2.7  Exercises
    2.8  Chapter Notes
Chapter 3.
Chapter 4. Sets
    4.5  Cultural Review
    4.6  Exercises
    4.7  Chapter Notes
Chapter 5.
Chapter 6. Graphs
    6.2  Graph Coloring
    6.6  Approximate Coloring
    6.7  Cultural Review
    6.8  Exercises
    6.9  Chapter Notes
Chapter 7.
Chapter 8.
    8.2  Limits
    8.3  The Derivative
    8.4  Taylor Series
    8.5  Remainders
    8.8  Exercises
Chapter 9. On Types and Tail Calls
Chapter 10.
    10.4  Dimension
    10.5  Matrices
    10.11 Exercises
Chapter 12.
    12.8  Cultural Review
    12.9  Exercises
Chapter 16.
    16.8  Cultural Review
    16.9  Exercises
Appendix A. Notation
Appendix C.
    C.2  Polynomials
    C.5  Linear Algebra
    C.6  Optimization
    C.8  Topology
Index
Our Goal
This book has a straightforward goal: to teach you how to engage with mathematics.
Let’s unpack this. By “mathematics,” I mean the universe of books, papers, talks,
and blog posts that contain the meat of mathematics: formal definitions, theorems,
proofs, conjectures, and algorithms. By “engage” I mean that for any mathematical
topic, you have the cognitive tools to make progress toward understanding that
topic. I will “teach”
you by introducing you to—or having you revisit—a broad foundation of topics and
techniques that support the rest of mathematics. I say “with” because mathematics
requires active participation.
We will define and study many basic objects of mathematics, such as polynomials,
graphs, and matrices. More importantly, I’ll explain how to think about those
objects as seasoned mathematicians do. We will examine the hierarchies of
mathematical abstraction, along with many of the softer skills and insights that
constitute “mathematical intuition.” Along the way we’ll hear the voices of
mathematicians—both famous historical figures and my friends and colleagues—to
paint a picture of mathematics as both a messy amalgam of competing ideas and
preferences, and a story with delightfully surprising twists and connections. In
the end, I will show you how mathematicians think about mathematics.
So why would someone like you¹ want to engage with mathematics? Many software
engineers, especially the sort who like to push the limits of what can be done with
programs, eventually realize a deep truth: mathematics unlocks a lot of cool new
programs.
These are truly novel programs. They would simply be impossible to write (if not inconceivable!) without mathematics. That includes the programs in this book about cryptography, data science, and art, but also many revolutionary technologies in
industry, such as signal processing, compression, ranking, optimization, and
artificial intelligence. As importantly, a wealth of opportunity makes programming
more fun! To quote Randall Munroe in his XKCD comic Forgot Algebra, “The only
things you HAVE to know are how to make enough of a living to stay alive and how to
get your taxes done. All the fun parts of life are optional.” If you want your
career to grow beyond shuffling data around to meet arbitrary business goals, you
should learn the tools that enable you to write programs that captivate and delight
you. Mathematics is one of those tools.
Your experience with functions, logic, and protocols gives you an intuitive familiarity with basic
topics such as boolean algebra, recursion, and abstraction. You can rely on this to
make mathematics less foreign, progressing all the faster to more nuanced and
stimulating topics.
By contrast, most educational math content is for students with no background. Such
content focuses on rote exercises and passing tests. This book will omit most of
that.
All told, this book is not a textbook. I won’t drill you with exercises, though
drills have their place. We won’t build up any particular field of mathematics from
scratch. Though we’ll visit calculus, linear algebra, and many other topics, this
book is far too short to cover everything a mathematician ought to know about these
topics. Moreover, while much of the book is appropriately rigorous, I will
occasionally and judiciously loosen rigor when it facilitates a better
understanding and relieves tedium. I will note when this occurs, and we’ll discuss
the role of rigor in mathematics more broadly.
Indeed, rather than read an encyclopedic reference, you want to become comfortable
with the process of learning mathematics. In part that means becoming comfortable
with discomfort, with the struggle of understanding a new concept, and the
techniques that mathematicians use to remain productive and sane. Many people find
calculus difficult, or squeaked by a linear algebra course without grokking it.
After this book you should have a core nugget of understanding of these subjects,
along with the cognitive tools that will enable you to dive as deeply as you like.
As a necessary consequence, in this book you’ll learn how to read and write proofs.
The simplest and broadest truth about mathematics is that it revolves around
proofs. Proofs are both the primary vehicle of insight and the fundamental measure
of judgment. They are the law, the currency, and the fine art of mathematics. Most
of what makes mathematics mysterious and opaque—the rigorous definitions, the
notation, the overloading of terminology, the mountains of theory, and the unspoken
obligations on the reader—is due to the centrality of proofs. A dominant obstacle
to learning math is an unfamiliarity with this culture. In this book I’ll cover the
basic methods of proof, and each chapter will use proofs to build the subject
matter. To be sure, you don’t have to understand every proof to finish this book,
and you will probably be confounded by a few. Embrace your humility. Each proof
contains layers of insight that are genuinely worthwhile, but few gain a complete
picture of a topic in a single sitting. As you grow into mathematics, the act of
reading even previously understood proofs provides both renewed and increased
wisdom. So long as you identify the value gained by your struggle, your time is
well spent.
I’ll also teach you how to read between the mathematical lines of a text, and
understand the implicit directions and cultural cues that litter textbooks and
papers. As we proceed through the chapters, we’ll gradually become more terse, and
you’ll have many opportunities to practice parsing, interpreting, and understanding math. All of the topics
in this book are explained by hundreds of other sources, and each chapter’s
exercises include explorations of concepts beyond these pages. Finally, I’ll
discuss how mathematicians approach problems, and how their process influences the
culture of math.
You will not learn everything you want to know in this book, nor will you learn
everything this book has to offer in one sitting. Those already familiar with math
may find early chapters offensively slow and detailed. Those genuinely new to math
may find the later chapters offensively fast. This is by design. I want you to be
exposed to as much mathematics as possible. Learn the main definitions. See new
notations, conventions, and attitudes. Take the opportunity to explore topics that
pique your interest.
A number of readers have reached out to me to describe their struggles with proofs.
They found it helpful to read a companion text on the side with extra guidance on
sets, functions, and methods of proof—particularly for the additional exercises and
consistently gradual pace. In this second edition, I added two appendices that may
help readers struggling with the pace. Appendix B contains more detail about
the formalities underlying proofs, along with strategies for problem solving.
Appendix C contains a list of books, and specifically a section for books on
“Fundamentals and Foundations” that cover the basics of set theory, proofs, and
problem solving strategies.
A number of topics are conspicuously missing from this book, my negligence of which
approaches criminal. Except for a few informal cameos, we ignore complex numbers,
probability and statistics, differential equations, and formal logic. In my humble
opinion, none of these topics is as fundamental for mathematical computer science
as those I’ve chosen to cover. After becoming comfortable with the topics in this
book, for example, probability will be very accessible. Chapter 12 on eigenvalues
includes a miniature introduction to differential equations. The notes for Chapter 16 on groups briefly summarize complex numbers. Probability underlies our
discussion of random graphs in Chapter 6
and machine learning in Chapter 14. Moreover, many topics in this book are
prerequisites for these other areas. And, of course, as a single human self-
publishing this book on nights and weekends, I have only so much time.
The first step on our journey is to confirm that mathematics has a culture worth
becoming acquainted with. We’ll do this with a comparative tour of the culture of
software that we understand so well.
Chapter 1

Like Programming, Mathematics Has a Culture

–David Hilbert
Do you remember when you started to really learn programming? I do. I spent two
years in high school programming games in Java. Those two years easily contain the
worst and most embarrassing code I have ever written. My code absolutely reeked.
Even after I started studying software in college, it was another year before I
knew what a stack frame or a register was, another year before I was halfway
competent with a terminal, another year before I appreciated functional
programming, and to this day I still have an irrational fear of systems programming
and networking. I built up a base of knowledge over time, with fits and starts at
every step.
In a college class on C++ I was programming a Checkers game, and my task was to
generate a list of legal jump-moves from a given board state. I used a depth-first
search and a few recursive function calls. Once I had something I was pleased with,
I compiled it and ran it on my first non-trivial example. Despite following test-
driven development, I saw those dreaded words: Segmentation fault. Dozens of test
cases and more than twenty hours of confusion later, I found the error: my
recursive call passed a reference when it should have been passing a pointer. This
wasn’t a bug in syntax or semantics—I understood pointers and references well
enough—but a design error. As most programmers can relate, the most aggravating
part was that changing four characters (swapping a few ampersands with asterisks)
fixed it. Twenty hours of work for four characters! Once I begrudgingly verified it
worked, I promptly took the rest of the day off to play Starcraft.
Such drama is the seasoning that makes a strong programmer. One must study the
topics incrementally, learn from a menagerie of mistakes, and spend hours in a
befuddled stupor before becoming “experienced.” This gives rise to all sorts of
programmer culture, Unix jokes, urban legends, horror stories, and reverence for
the masters of C that make the programming community so lovely. It’s like a secret
club where you know all the handshakes, but should you forget one, a crafty use of
grep and sed will suffice. The struggle makes you appreciate the power of debugging
tools, slick frameworks, historically enshrined hacks, and new language features
that stop you from shooting your own foot.
When programmers turn to mathematics, they seem to forget these trials. The same
people who invested years grokking the tools of their trade treat new mathematical
tools and paradigms with surprising impatience. I can see a few reasons why. For
one, we were forced to take math classes for many years in school. That forced
investment shouldn’t have been pointless. But the culture of mathematics and the
culture of mathematics education—elementary through lower-level college courses—are
completely different.
Even college math majors have to reconcile this. I’ve had many conversations with
such students, including friends, colleagues, and even family, who by their third
year decided they didn’t really enjoy math. The story often goes like this: a
student who was good at math in high school reaches the point of a math major at
which they must read and write proofs in earnest. It requires an ambiguous, open-
ended exploration that they don’t enjoy. Despite being a stark departure from the
rigid structure of high school math, incoming students are not warned in advance.
After coming to terms with their unfortunate situation, they decide that their best
option is to persist until they graduate, at which point they return to the
comfortable setting of pre-collegiate math, this time in the teacher’s chair.
I don’t mean to insult teaching as a profession—I love teaching and understand why
one would choose to do it full time. There are many excellent teachers who excel at
both the math and the trickier task of engaging aloof teenagers to think critically
about it.
But this pattern of disenchantment among math teachers is prevalent, and it widens
the conceptual gap between secondary and “college level” mathematics. Programmers
often have similar feelings. The subject they once were good at is suddenly
impenetrable. It’s a negative feedback loop in the education system. Math takes the
blame.
Another reason programmers feel impatient is because they do so many things that
relate to mathematics in deep ways. They use graph theory for data structures and
search.
They study enough calculus to make video games. They hear about the Curry-Howard
correspondence between proofs and programs. They hear that Haskell is based on a
complicated math thing called category theory. They even use mathematical results
in an interesting way. I worked at a “blockchain” company that implemented a
Bitcoin wallet, which is based on elliptic curve cryptography. The wallet worked,
but the implementer didn’t understand why. They simply adapted pseudocode found on
the internet. At the risk of a dubious analogy, it’s akin to a “script kiddie” who
uses hacking tools as black boxes, but has little idea how they work.
Mathematicians are on the other end of the spectrum. Why things work takes priority
over practical implementation.
There’s nothing inherently wrong with using mathematics as a black box, especially
the sort of applied mathematics that comes with provable guarantees. But many
programmers want to dive deeper. This isn’t surprising, given how much time
engineers spend studying source code and the internals of brittle, technical
systems. Systems that programmers rely on, such as dependency management, load
balancers, search engines, alerting systems, and machine learning, all have rich
mathematical foundations. We’re naturally curious.
A second obstacle is that math writers are too terse. The purest fields of
mathematics take a sort of pretentious pride in how abstract and compact their work
is. I can think of a handful of famous books, for which my friends spent weeks or
months on a single chapter! This acts as a barrier to entry, especially since
minute details matter for applications.
What programmers consider “sloppy” notation is one symptom of the problem, but
there are other expectations on the reader that, for better or worse,
decelerate the pace of reading. Unfortunately I have no solution here. Part of the
power and expressiveness of mathematics is the ability for its practitioners to
overload, redefine, and omit in a suggestive manner. Mathematicians also have
thousands of years of “legacy” math that require backward compatibility. Enforcing
a single specification for all of mathematics—a suggestion I frequently hear from
software engineers—would be horrendously counterproductive.
Ideas we take for granted today, such as algebraic notation, drawing functions in
the Euclidean plane, and summation notation, were at one point actively developed
technologies. Each of these notations had a revolutionary effect on science, and
also, to quote Bret Victor, on our capacity to “think new thoughts.” One can draw a
line from the proliferation of algebraic notation to the invention of the
computer.¹ Borrowing software terminology, algebraic notation is among the most
influential and scalable technologies humanity has ever invented. And as we’ll see
in Chapter 10 and Chapter 16, we can find algebraic structure hiding in exciting
places. Algebraic notation helps us understand this structure not only because we
can compute, but also because we can visually see the symmetries in the formulas.
This makes it easier for us to identify, analyze, and encapsulate structure when it
occurs.
Finally, the best mathematicians study concepts that connect decades of material,
while simultaneously inventing new concepts which have no existing words to
describe them.
There are good reasons why mathematics is the way it is, though the reasons may not
always be clear. I like to summarize the contrast by claiming that mathematical
notation is closer to spoken language than to code. There is a historical and
cultural context missing from many criticisms of math. It’s a legacy system, yes,
but a well-designed one. We should understand it, learn from its advantages, and
discard the obsolete parts. Those obsolete parts are present, but rarer than they
seem.
To fairly evaluate mathematics, we must first learn some mathematics. Only then can
we compare and contrast programming and mathematics in terms of their driving
questions, their values, their methods, their measures of success, and their
cultural expectations. Programming, at its core, focuses on how to instruct a
computer to perform some task. But the broader driving questions include how to
design a flexible system, how to efficiently store and retrieve data, how to design
systems that can handle various modes of failure, how to scale, and how to tame
growth and complexity.
Contrast this with mathematics which, at its core, focuses on how to describe a
mathematical object and how to prove theorems about its behavior. The broader
driving questions include how to design a unified framework for related patterns,
how to find computationally useful representations of an object, how to find
interesting patterns to study, and most importantly, how to think more clearly
about mathematical subjects.
A large chunk of this book expands on this summary through interludes between each
chapter and digressions after introducing technical concepts. The rest covers the
fundamental objects and methods of a typical mathematical education. So let’s begin
our journey into the mathematical mists with an open mind.
Chapter 2
Polynomials
We are not trying to meet some abstract production quota of definitions, theorems
and proofs. The measure of our success is whether what we do enables people to
understand and think more clearly and effectively about mathematics.
–William Thurston
We start with the definition of a polynomial. The problem, if you’re the sort of
person who struggled with math, is that reading the definition as a formula will
make your eyes glaze over. In this chapter we’re going to overcome this.
The reason I’m so confident is that I’m certain you’ve overcome the same obstacle
in the context of programming. For example, my first programming language was Java.
And my first program, which I didn’t write but rather copied verbatim, was likely
similar to this monstrosity.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
It was roughly six months before I understood what all the different pieces of this
program did, despite the fact that I had written ‘public static void main’ so many
times I had committed it to memory. Computers don’t generally require you to
understand a code snippet to run. But at some point, we all stopped to ask, “what
do those words actually mean?” That’s the step when my eyes stop glazing over.
That’s the same procedure we need to invoke for a mathematical definition,
preferably faster than six months.
Now I’m going to throw you in the thick of the definition of a polynomial. But stay
with me! I want you to start by taking out a piece of paper and literally copying
down the definition (the entire next paragraph), character for character, as one
would type out a program from scratch. This is not an idle exercise. Taking notes
by hand uses a part of your brain that both helps you remember what you wrote, and
helps you read it closely.
Each individual word and symbol of a mathematical definition affects the concept being defined, so it’s important to parse everything slowly.

Definition 2.1. A single variable polynomial with real coefficients is a function f that takes a real number as input, produces a real number as output, and has the form

f(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n,

where the a_i are real numbers. The a_i are called the coefficients of f. The degree of the polynomial is the integer n.
Let’s analyze the content of this definition in three ways. First, syntactically,
which also highlights some general features of written definitions. Second,
semantically, where we’ll discuss what a polynomial should represent as a concept
in your mind. Third, we’ll inspect this definition culturally, which includes the
unspoken expectations of the reader upon encountering a definition in the wild. As
we go, we’ll clarify some nuance to the definition related to certain “edge cases.”
Syntax
A proper mathematical treatment might also define what a “real number” is, but we simply don’t have the time or space.¹ For now, think of a real number as a floating point number without the emotional baggage that comes from trying to fit all decimals into a finite number of bits.
We used a strange phrase in Definition 2.1, that “f has the form” of some
expression.
This means that we’re choosing specific values for the data defining f. It’s making
a particular instance of the definition, as if it were a class definition in a
program. In this case the choices are:
1. The names for all the variables involved. The definition has chosen f for the
function, x for the input variable name, a for the array of coefficients, and n for
the degree. One can choose other names as desired.
Semantics
Let’s start with a simple example polynomial, where I pick g for the function name, t for the input name, b for the coefficients, and define n = 3 and b_0, b_1, b_2, b_3 = 2, 0, 4, −1. Written out:

g(t) = 2 + 0t + 4t^2 + (−1)t^3.

Evaluating at t = 2 gives

g(2) = 2 + 4(2^2) − 2^3 = 10.
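As a sanity check, here is a short Python sketch (the function name is my own) that evaluates a polynomial directly from its coefficient list, the naive way:

```python
def evaluate(coefficients, t):
    """Evaluate b_0 + b_1*t + ... + b_n*t^n from coefficients [b_0, ..., b_n]."""
    return sum(b * t**i for i, b in enumerate(coefficients))

b = [2, 0, 4, -1]      # the coefficients of g above
print(evaluate(b, 2))  # prints 10, matching the hand computation
```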
Really what’s being said is that a polynomial is any function of a single input that can be written in the required form, even if you might write it a different way sometimes. This makes our internal concept of a polynomial more general than the letter of Definition 2.1.

¹ If you’re truly interested in how real numbers are defined from scratch, Spivak’s text Calculus devotes Chapter 29 to a gold-standard treatment. You might be ready for it after working through a few chapters of this book, but be warned: Spivak starts Chapter 29 with, “The mass of drudgery which this chapter necessarily contains…”
A polynomial is any function of a single numeric input that can be expressed using
only addition and multiplication and constants, along with the input variable
itself. So the following is a polynomial:
g(t) = (t − 1)(t + 6)^2
You recover the precise form of Definition 2.1 by algebraically simplifying and
grouping terms. The form described in Definition 2.1 is not ideal for every
occasion! For example, if you want to evaluate a polynomial quickly on a computer,
you might represent the polynomial so that evaluating it doesn’t redundantly
compute the powers t^1, t^2, t^3, …, t^n. One such scheme is called Horner’s
method, which we’ll return to in an Exercise. The form in Definition 2.1 might be
called a “canonical” or “standard” form, and it’s often useful for manipulation in
proofs. As we’ll see later in this chapter, it’s easy to express a generic sum or
difference of two polynomials in the standard form.
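Horner’s method itself fits in a few lines. This is only an illustration of the nesting trick a_0 + t(a_1 + t(a_2 + ⋯)), not a solution to the exercise:

```python
def horner(coefficients, t):
    """Evaluate a_0 + a_1*t + ... + a_n*t^n without computing any powers of t,
    using the nesting a_0 + t*(a_1 + t*(a_2 + ...))."""
    result = 0
    for a in reversed(coefficients):
        result = result * t + a
    return result

horner([2, 0, 4, -1], 2)  # 10, the same value as the worked example earlier
```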
Suffice it to say, there are many representations of the same abstract polynomial.
You can do arithmetic and renaming to get to a standard representation. f(x) = x + 1 is the same polynomial as g(t) = 1 + t, though they differ syntactically.
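One way to see this in code: if we pick the standard form’s coefficient list as a canonical representation, the input variable’s name vanishes entirely, and the two expressions become identical data. A minimal sketch (the representation choice is mine):

```python
# Represent a polynomial in standard form by its coefficients [a_0, ..., a_n].
f = [1, 1]  # f(x) = x + 1
g = [1, 1]  # g(t) = 1 + t
assert f == g  # syntactically different expressions, same abstract polynomial
```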
There are other ways to think about polynomials, and we’ll return to polynomials in
future chapters with new and deeper ideas about them. Here are some previews of
that.
The first is that a polynomial, as with any function, can be represented as a set
of pairs called points. That is, if you take each input t and pair it with its
output f( t), you get a set of tuples ( t, f( t)), which can be analyzed from the
perspective of set theory. We will return to this perspective in Chapter 4.
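As a preview of that perspective, here is a tiny sketch (the sample polynomial and the sample inputs are my own choices) that materializes a polynomial as a set of (input, output) pairs:

```python
def f(t):
    return 2 + 4 * t**2 - t**3  # an arbitrary example polynomial

# A finite sample of the set of points {(t, f(t))}; the true set is infinite.
points = {(t, f(t)) for t in range(-2, 3)}
print(points)
```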
Using the curves they “carve out” in space, polynomials can be regarded as
geometric objects with geometric properties like “curvature” and “smoothness.” In
Chapter 8 we’ll return to this more formally, but until then one can guess how they
might faithfully describe a plot like the one in Figure 2.1. The connection between
polynomials as geometric objects and their algebraic properties is a deep one that
has occupied mathematicians for centuries. For example, the degree gives some
information about the shape of the curve.
Figure 2.2 shows plots of generic polynomials of degrees 3 through 6. As the degree
goes up, so does the number of times the polynomial “changes direction” between
increasing and decreasing. Making this mathematically rigorous requires more nuance
—after all, the degree five polynomial in Figure 2.1 only changes direction twice—
but the pattern suggested by Figure 2.2 is no coincidence.
That is, polynomials are a family of increasingly expressive objects, which get
more complex as the degree increases. This idea is the foundation of the
application for this chapter (sharing secrets), and it will guide us to use Taylor
polynomials to approximate things in Chapters 8 and 14.
Polynomials occur with stunning ubiquity across mathematics. It makes one wonder
exactly why they are so central. It’s because polynomials encapsulate the full
expressivity of addition and multiplication. As programmers, we know that even such
simple operations as binary AND, OR, and NOT, when combined arbitrarily, allow us
to build circuits that make a computer. Those three operations yield the full gamut
of algorithms.
Polynomials fill a similar role for arithmetic. Indeed, polynomials with multiple
variables can represent AND, OR, and NOT, if you restrict the values of the
variables to be zero and one (interpreted as false and true, respectively).
AND(x, y) = xy
NOT(x) = 1 − x
OR(x, y) = 1 − (1 − x)(1 − y)
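Since each variable is restricted to 0 and 1, we can exhaustively verify that these three polynomials reproduce the boolean truth tables. A quick sketch:

```python
from itertools import product

def AND(x, y): return x * y
def NOT(x): return 1 - x
def OR(x, y): return 1 - (1 - x) * (1 - y)

# Check all four input combinations against Python's boolean operators.
for x, y in product([0, 1], repeat=2):
    assert AND(x, y) == int(bool(x) and bool(y))
    assert OR(x, y) == int(bool(x) or bool(y))
    assert NOT(x) == int(not x)
```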
Polynomials are expressive enough to capture all of boolean logic. This suggests
that even single-variable polynomials should have strikingly complex behavior. The
rest of the chapter will display bits of that dazzling performance.
Culture
The most important cultural expectation, one every mathematician knows, is that the
second you see a definition in a text you must immediately write down examples.
Generous authors provide examples of genuinely new concepts, but an author is never
obligated to do so. The unspoken rule is that the reader should not continue unless
the reader understands what the definition is saying. That is, you aren’t expected
to master the concept, most certainly not at the same speed you read it. But you
should have some idea going forward of what the defined words refer to.
Software testing provides a good analogy. You start with the simplest possible
tests, usually setting as many values as you can to zero or one, then work your way
up to more complicated examples. Later, when you get stuck on some theorem or proof
—an
unavoidable occupational hazard—you return to those examples and test how the
claims in the proof apply to them. This is how one builds so-called “mathematical
intuition.” In the long term, that intuition allows you to absorb new ideas faster.
Here is a list of candidate polynomials, and your job is to check them against the definition. Take your time, and you can check your answers in the Chapter Notes.
f(x) = 0
g(x) = 12
h(x) = 1 + x + x^2 + x^3
i(x) = x^{1/2}
j(x) = 1/2 + x^2 − 2x^4 + 8x^8
k(x) = 4.5 − 5/x^2
l(x) = π^{−1} x^5 + e π^3 x^{10}
m(x) = x + x^2 − x^π + x^e
Like software testing, examples weed out pesky edge cases and clarify what is permitted by the definition. For example, the exponents of a polynomial must be nonnegative integers, though I only stated it implicitly in the definition.
When reading a definition, one often encounters the phrase “by convention.” This
can refer to a strange edge case or a matter of taste. A common example is the
factorial n! = 1 · 2 · · · · · n, where 0! = 1 by convention. This makes formulas
cleaner and provides a natural default value of an “empty product,” a sensible base
case for a loop that computes the product of a (possibly empty) list of numbers.
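In code, the same convention shows up as the initial value of an accumulator. A minimal Python sketch (nothing here is from the book's own code; it just illustrates the "empty product" default):

```python
import math

def product(numbers):
    # The empty product is 1, the same convention that makes 0! = 1.
    result = 1
    for n in numbers:
        result *= n
    return result

# The base case agrees with the factorial convention.
assert product([]) == 1 == math.factorial(0)
# And the general case agrees with n! = 1 * 2 * ... * n.
assert product([1, 2, 3, 4]) == math.factorial(4) == 24
```

Starting the accumulator at 1 is exactly what makes the loop correct on an empty list.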
To avoid the ambiguity of tacking trailing zero coefficients onto the same polynomial, we amend Definition 2.1 so that the last coefficient a_n is required to be nonzero. But then the function f(x) = 0 is not allowed to be a polynomial!
So, by convention, we define a special exception, the function f( x) = 0, as the
zero polynomial.
By convention, the zero polynomial is defined to have degree − 1. Note that every
time a definition includes the phrase “by convention,” a computer program gains an
edge case.2
This edge case made us reconsider the right definition of a polynomial, but it was
mostly a superficial change. Other times, as we will confront head on in Chapter 8
when we define limits, dealing with an edge case reveals the soul of a concept.
It’s curious how mathematical books tend to start with the final product, instead of the journey to the right definition. Perhaps teaching the latter is much harder and more time consuming, with fewer tangible benefits. But in advanced mathematics, deep understanding comes in fits and starts. Often, no such distilled explanation is known.
2 You may wonder: is it possible to represent the same polynomial with two formulas that have different degrees? Theorem 2.3 can be used to prove this is impossible. Exercise 4 asks you to prove it using elementary means.
In any case, examples are the primary method to clarify the features of a
definition.
Having examples in your pocket as you continue to read is important, and coming up
with the examples yourself is what helps you internalize a concept.
For contrast, here is a "data definition" of a polynomial, in the style of a grammar:
coefficient = number
variable = 'x'
term = coefficient | coefficient * variable | coefficient * variable ^ number
polynomial = term | term + polynomial
The problem is that this definition doesn’t tell you what polynomials are all about. That’s why an author usually starts with a conceptual definition like Definition 2.1.
Underlying all the definitions is an abstract concept we keep in our minds. The
definition is one way to make that concept concrete while also expressing one
particular facet of its properties for the task at hand.
I want to make this extremely clear because in mathematics it’s implicit. My math
teachers in college and grad school never explicitly discussed why one would use
one definition over another, because somehow along the arduous journey through a
math education, the folks who remained understood it. It also explains why
understanding a definition is such an important prerequisite to reading the
mathematics that follows.
Polynomials may seem a frivolous way to illustrate the difference between an object-as-concept and an object-as-representation, but the same pattern lurks behind more complicated definitions. First the
author will start with the best conceptual definition—the one that seems to them,
with the hindsight of years of study, to be the most useful way to communicate the
idea behind the concept. For us that’s Definition 2.1. Often these definitions seem
totally useless from a programming perspective.
Then ten pages later (or a hundred!) the author introduces another definition,
often a data definition, which turns out to be equivalent to the first. Any
properties defined in the first definition automatically hold in the second and
vice versa. But the data definition is the one that allows for nice programs. You
might think the author was crazy not to start with the data definition, but it’s
the conceptual definition that sticks in your mind, generalizes, and guides you
through proofs. This interplay between intuitive and data definitions will take
center stage in Chapter 10, our first exposure to linear algebra.
It’s also worth noting that the multiplicity of definitions arose throughout
history.
Polynomials have been studied for many centuries, but parser-friendly forms of
polynomials weren’t needed until the computer age. Likewise, algebra was studied
before the graphical representations of Descartes allowed us to draw polynomials as
curves. Each new perspective and definition was driven by an additional need. As a
consequence, the
“best” definition of a concept can change. Throughout history math has been shaped
and reshaped to refine, rigorize, and distill the core insights, often to ease the
fashionable calculations of the time.
In any case, the point is that we will fluidly convert between the many ways of
thinking about polynomials: as expressions defined abstractly by picking a list of
numbers, or as functions with a special structure. Effective mathematics is
flexible in this way.
When defining a function, one often uses the compact arrow notation f : A → B to
describe the allowed inputs and outputs. All possible inputs are collectively
called the domain, and all possible outputs are called the range. There is one
caveat I’ll explain via programming. Say you have a function that doubles the
input, such as
int f(int x) {
    return 2*x;
}
The inputs are integers, and the type of the output is also integer, but 3 is not a
possible output of this particular function.
In math we disambiguate this with two words. Range is the set of actual outputs of
a function, and the “type” of outputs is called the codomain. The notation f : A →
B
specifies the domain A and codomain B, while the range depends on the semantics of
f.
When one introduces a function, as programmers do with type signatures and function
headers, we state the notation f : A → B before the function definition.
Mathematics has special symbols for many types. The symbol for the set of real numbers is R. The font is called
called
“blackboard-bold,” and it’s the standard font for denoting number systems. Applying
the arrow notation, a polynomial is f : R → R. A common phrase is to say a
polynomial is “over the reals” to mean it has real coefficients. As opposed to,
say, a polynomial over the integers that has integer coefficients.
Most famous number types have special symbols. The symbol for integers is Z, and
the positive integers are denoted by N, often called the natural numbers. 3 There
is an amusing dispute of no real consequence between logicians and other
mathematicians on whether zero is a natural number, with logicians demanding it is.
Finally, I’ll use the ∈ symbol, read “in,” to assert or assume membership in some
set.
Having seen some definitions, we’re ready to develop the main tool we need for
secret sharing: the existence and uniqueness theorem for polynomials passing
through a given set of points.
First, a word about existence and uniqueness. Existence proofs are classic in
mathematics. They come in all shapes and sizes. Mathematicians like to take
interesting properties they see on small objects, write down the property in
general, and then ask things like,
“Are there arbitrarily large objects with this property?” or, “Are there infinitely
many objects with this property?” I imagine a similar pattern in physics. Given
equations that govern the internal workings of a star you might ask, would these
equations support arbitrarily massive stars?
One simple existence question is quite famous: are there infinitely many pairs of prime numbers of the form p, p + 2? For example, 11 and 13 work, but 23 is not part of such a pair, since neither 21 = 3 · 7 nor 25 = 5 · 5 is prime. It’s an open question whether there are infinitely many such pairs. The assertion that there are is called the Twin Prime Conjecture.
In some cases you get lucky, and the property you defined is specific enough to
single out a unique mathematical object. This is what will happen to us with
polynomials. Other times, the property (or list of properties) you defined are too
restrictive, and there are no mathematical objects that can satisfy it. For
example, Kleinberg’s Impossibility Theorem for Clustering lays out three natural
properties for a clustering algorithm (an algorithm that finds dense groups of
points in a geometric dataset) and proves that no algorithm can satisfy all three
simultaneously. See the Chapter Notes for more on this. Though such theorems are
often heralded as genius, more often than not mathematicians avoid impossibility by
turning small examples into broad conjectures.
That’s how we’ll approach existence and uniqueness for polynomials. Here is the theorem we’ll prove, stated in its most precise form. Don’t worry, we’ll go carefully through every bit of it, but try to read it now.
3 The Z stands for Zahlen, the German word for “numbers.”
The one piece of new notation is the exponent on R2. This means “pairs” of real
numbers. Likewise, Z3 would be triples of integers, and N10 length-10 tuples of
natural numbers.
A briefer, more informal way to state the theorem: there is a unique polynomial of degree at most n passing through a choice of n + 1 points. Now just like with
definitions, the first thing we need to do when we see a new theorem is write down
the simplest possible examples. In addition to simplifying the theorem, it will
give us examples to work with while going through the proof. Write down some
examples now. As mathematician Alfred North Whitehead said, “We think in generalities,
but we live in details.”
Back already? I’ll show you examples I’d write down, and you can compare your
process to mine. The simplest example is n = 0, so that n + 1 = 1 and we’re working
with a single point. Let’s pick one at random, say (7 , 4). The theorem asserts
that there is a unique degree zero polynomial passing through this point. What’s a
degree zero polynomial? Looking back at Definition 2.1, it’s a function like a_0 + a_1 x + a_2 x^2 + · · · + a_d x^d (I’m using d for the degree here because n is already taken), where we’ve chosen to set d = 0.
Setting d = 0 means that f has the form f( x) = a 0. So what’s such a function with
f (7) = 4? There is no choice but f ( x) = 4. It should be clear that it’s the only
degree zero polynomial that does this. Indeed, the datum that defines a degree-zero
polynomial is a single number, and the constraint of passing through the point (7 ,
4) forces that one piece of data to a specific value.
Let’s move on to a slightly larger example which I’ll allow you to work out for
yourself before going through the details. When n = 1 and we have n + 1 = 2 points,
say (2 , 3) , (7 , 4), the theorem claims a unique degree 1 polynomial f with f (2)
= 3 and f (7) = 4. Find it by writing down the definition for a polynomial in this
special case and solving the two resulting equations. 6
f ( x) = a 0 + a 1 x.
Writing down the two equations f(2) = 3 , f(7) = 4, we must simultaneously solve: a
0 + a 1 · 2 = 3
a 0 + a 1 · 7 = 4
6 If you’re comfortable solving basic systems of equations, you may want to skip
ahead to the next section.
Subtracting the first equation from the second eliminates a_0 and gives 5 a_1 = 1, so a_1 = 1/5 and then a_0 = 3 − 2/5 = 13/5. The result is
f(x) = 13/5 + (1/5) x.
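As a sanity check, this particular 2-by-2 system can be solved with exact rational arithmetic in a few lines of Python (a sketch, not part of the chapter's code; it uses the standard library's fractions module):

```python
from fractions import Fraction

# The system from the text: a0 + 2*a1 = 3 and a0 + 7*a1 = 4.
# Subtracting the first equation from the second eliminates a0: 5*a1 = 1.
a1 = Fraction(4 - 3, 7 - 2)  # a1 = 1/5
a0 = 3 - 2 * a1              # a0 = 13/5

def f(x):
    return a0 + a1 * x

# The polynomial really does pass through (2, 3) and (7, 4).
assert (a0, a1) == (Fraction(13, 5), Fraction(1, 5))
assert f(2) == 3 and f(7) == 4
```

Using Fraction avoids the floating-point rounding that would otherwise muddy the check.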
One hypothesis of the theorem deserves special attention: the requirement that x_1 < x_2 < · · · < x_{n+1}. In our example, this means we require x_1 < x_2. So
this is where we run a sanity check. What happens if x 1 = x 2? Think about it, and
if you can’t tell then you should try to prove it wrong: try to find a degree 1
polynomial passing through the points (2 , 3) , (2 , 5).
The problem could be that there is no degree 1 polynomial passing through those
points, violating existence. Or, the problem might be that there are many degree 1
polynomials passing through these two points, violating uniqueness. It’s your job
to determine what the problem is. And despite it being pedantic, you should work
straight from the definition of a polynomial! Don’t use any mnemonics or heuristics
you may remember; we’re practicing reading from precise definitions.
In case you’re stuck, let’s follow our pattern from before. If we call a 0 + a 1 x
our polynomial, saying it passes through these two points is equivalent to saying
that there is a simultaneous solution to the following two equations f(2) = 3 and
f(2) = 5.
a 0 + a 1 · 2 = 3
a 0 + a 1 · 2 = 5
What happens when you try to solve these equations like we did before? Try it.
What about for three points or more? Well, that’s the point at which it might start
to get difficult to compute. You can try by setting up equations like those I wrote
above, and with some elbow grease you’ll solve it. Such things are best done in
private so you can make plentiful mistakes without being judged for it.
Now that we’ve worked out two examples of the theorem in action, let’s move on to
the proof. The proof will have two parts, existence and uniqueness. That is, first
we’ll show that a polynomial satisfying the requirements exists, and then we’ll
show that if two polynomials both satisfied the requirements, they’d have to be the
same. In other words, there can only be one polynomial with that property.
We will show existence by direct construction. That is, we’ll “be clever” and find
a general way to write down a polynomial that works. Being clever sounds scary, but
the process is actually quite natural, and it follows the same pattern as we did
for reading and understanding definitions: you start with the simplest possible
example (but this time the
example will be generic) and then you work up to more complicated examples. By the
time we get to n = 2 we will notice a pattern, that pattern will suggest a formula
for the general solution, and we will prove it’s correct. In fact, once we
understand how to build the general formula, the proof that it works will be
trivial.
Let’s start with a single point (x_1, y_1) and n = 0. I’m not specifying the values of x_1 and y_1, so that the construction works for any choice of point. The degree zero polynomial through this point is simply
f(x) = y_1
Next take two points (x_1, y_1), (x_2, y_2) with x_1 ≠ x_2, and ask for a degree 1 polynomial passing through both. Here is one, written in a deliberately strange way:
f(x) = y_1 (x − x_2)/(x_1 − x_2) + y_2 (x − x_1)/(x_2 − x_1)
Let’s verify that this works. If I evaluate f at x 1, the second term gets x 1 − x
1 =
0 in the numerator and so the second term is zero. The first term, however, becomes
y_1 (x_1 − x_2)/(x_1 − x_2) = y_1, since the fraction evaluates to 1.
Likewise, if you evaluate f( x 2) the first term is zero and the second term
evaluates to y 2. So we have both f( x 1) = y 1 and f( x 2) = y 2, and the
expression is a degree 1
polynomial. How do I know it’s degree one when I wrote f in that strange way? For
one, I could rewrite f like this:
f(x) = [y_1/(x_1 − x_2)] (x − x_2) + [y_2/(x_2 − x_1)] (x − x_1),
and simplify with typical algebra to get the form required by the definition:
f(x) = [(y_1 − y_2)/(x_1 − x_2)] x + (x_1 y_2 − x_2 y_1)/(x_1 − x_2).
What a headache! Instead of doing all that algebra, I could observe that no powers
of x appear in the formula for f that are larger than 1, and we never multiply two
x’s together.
Since these are the only ways to get degree bigger than 1, we can skip the algebra
and be confident that the degree is 1.
The key to the above idea, and the reason we wrote it down in that strange way, is
so that each constraint (e.g., f( x 1) = y 1) could be isolated in its own term,
while all the other terms evaluate to zero. For three points ( x 1 , y 1) , ( x 2 ,
y 2) , ( x 3 , y 3) we just have to beef up the terms to maintain the same
property: when you plug in x 1, all terms except the first evaluate to zero and the
fraction in the first term evaluates to 1. When you plug in x 2, the second term is
the only one that stays nonzero, and likewise for the third. Here is the
generalization that does the trick.
f(x) = y_1 (x − x_2)(x − x_3)/[(x_1 − x_2)(x_1 − x_3)] + y_2 (x − x_1)(x − x_3)/[(x_2 − x_1)(x_2 − x_3)] + y_3 (x − x_1)(x − x_2)/[(x_3 − x_1)(x_3 − x_2)]
Now you come in. Evaluate f at x 1 and verify that the second and third terms are
zero, and that the first term simplifies to y 1. The symmetry in the formula should
convince you that the same holds true for x 2 , x 3 without having to go through
all the steps two more times. Then argue why f is degree 2.
Add up a bunch of terms, and for the i-th term you multiply yi by a fraction you
construct according to the rule: the numerator is the product of x − xj for every j
except i, and the denominator is a product of all the ( xi − xj) for the same j’s
as the numerator. It works for the same reason that our formula works for three
terms above. By now, the process is clear enough that you could write a program to
build these polynomials quite easily, and we’ll walk through such a program
together at the end of the chapter.
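As a preview of that program, here is a sketch of the rule just described, written with exact rational arithmetic as a plain Python function rather than the coefficient-list Polynomial class the chapter builds later (the name lagrange is my own; the book's version is called interpolate):

```python
from fractions import Fraction

def lagrange(points):
    """Return the interpolating polynomial as a callable function."""
    def f(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(points):
            # The i-th term: y_i times the product of (x - xj)/(xi - xj)
            # over every j except i.
            term = Fraction(yi)
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total
    return f

# Three points on y = x^2; the unique degree-2 interpolant is x^2 itself.
f = lagrange([(1, 1), (2, 4), (3, 9)])
assert all(f(x) == y for (x, y) in [(1, 1), (2, 4), (3, 9)])
assert f(5) == 25
```

By uniqueness (Theorem 2.2), agreeing with x^2 on three points forces the interpolant to be x^2 everywhere, which is why f(5) = 25.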
Here is the notation version of the process we just described in words. It’s a
mess, but we’ll break it down.
f(x) = ∑_{i=1}^{n+1} ( y_i · ∏_{j≠i} (x − x_j)/(x_i − x_j) )
Start with the ∑, read “sum.” It plays the role of a for loop accumulating a total: ∑_{i=1}^{n} expr(i) corresponds to this code:
int i;
sometype theSum = defaultValue;
for (i = 1; i <= n; i++) {
    theSum += expr(i);
}
return theSum;
Note that by indexing from 1 and including the upper limit in the for loop condition, we are deviating from the standard programming style. Indexing from zero, like ∑_{i=0}^{n−1} expr(i), produces an equivalent sum.
I used the undefined tokens defaultValue and sometype to highlight that the meaning of the sum depends on what the conventional ‘zero object’ is in that setting. For adding numbers the zero object is zero, and for adding polynomials it’s the zero polynomial. It gets exotic with more advanced mathematics, which we’ll see in Chapter 16 when we study group theory. Moreover, explaining the product notation ∏ is now simple: replace += with *= and reinterpret the “default value” as what makes sense for multiplication. Functional programmers will know this pattern well, because both are a “fold” (or “reduce”) function with a particular choice of binary operation and initial value.
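In Python, for instance, both patterns are one call to functools.reduce, differing only in the binary operation and the initial value (a small illustration, not from the book):

```python
from functools import reduce

terms = [1, 2, 3, 4]

# A sum is a fold with + and default value 0...
the_sum = reduce(lambda acc, t: acc + t, terms, 0)
# ...and a product is the same fold with * and default value 1.
the_product = reduce(lambda acc, t: acc * t, terms, 1)

assert the_sum == 10 and the_product == 24
# The defaults also make the empty sum 0 and the empty product 1.
assert reduce(lambda acc, t: acc + t, [], 0) == 0
assert reduce(lambda acc, t: acc * t, [], 1) == 1
```

The initial value passed to reduce is exactly the "zero object" discussed above.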
The notation ∏_{j≠i} adds three caveats. First, recall that in this context i is fixed by the enclosing sum, while j is the index of the product. Second, the range of j is left implicit, to be inferred from context.
Though it sometimes makes me cringe to say it, give the author the benefit of the
doubt. When things are ambiguous, pick the option that doesn’t break the math. In
this respect, you have to act as both the tester, the compiler, and the bug fixer
when you’re reading math. The best default assumption is that the author is far
smarter than we are, and if you don’t understand something, it’s likely a user
error and not a bug. In the occasional event that the author is wrong, it’s often a
simple mistake or typo, to which an experienced reader would say, “The author
obviously meant ‘foo’ because otherwise none of this makes sense,” and continue
unscathed.
Finally, the j ≠ i part is an implied filter on the range of j. Inside the for loop you add an extra if statement to skip that iteration if j = i. Read out loud, ∏_{j≠i} would be “the product over j not equal to i.” If we wanted to write out the product-nested-in-a-sum as a nested loop, it would look like this:
f(x) = ∑_{i=1}^{n+1} ( bar(i) · ∏_{j≠i} foo(i, j) )

int i, j;
sometype theSum = defaultValue;
for (i = 1; i <= n + 1; i++) {
    sometype theProduct = multiplicativeDefault;
    for (j = 1; j <= n + 1; j++) {
        if (j != i) {
            theProduct *= foo(i, j);
        }
    }
    theSum += bar(i) * theProduct;
}
return theSum;
Compare the math and code, and make sure you can connect the structural pieces.
Often the inner parentheses are omitted, with the default assumption that everything to the right of a ∑ or ∏ belongs inside the sum or product.
If the formula on the right still seems impenetrable, take solace in your own
experience: the reason you find the left side so easy to read is that you’ve spent
years building up the cognitive pathways in your brain for reading code. You can
identify what’s filler and what’s important; you automatically filter out the noise
in the syntax. Over time, you’ll achieve this for mathematical formulas, too.
You’ll know how to zoom in to one expression, understand what it’s saying, and zoom
out to relate it to the formula as a whole. Everyone struggles with this, myself
included.
One additional difficulty of reading mathematics is that the author will almost
never go through these details for the reader. It’s a rather subtle point to be
making so early in our journey, but it’s probably the first thing you notice when
you read math books.
Instead of doing the details, a typical proof of the existence of these polynomials
looks like this.
f(x) = ∑_{i=1}^{n+1} y_i ∏_{j≠i} (x − x_j)/(x_i − x_j)
Clearly the constructed polynomial f(x) has degree at most n because each term has degree at most n. For each i, plugging in x_i causes all but the i-th term in the sum to vanish, and the i-th term evaluates to y_i, as desired. □
The square □ is called a tombstone and marks the end of a proof. It’s a modern
replacement for QED borrowed from magazines.
The proof writer gives a relatively brief overview and you are expected to fill in
the details to your satisfaction. It sucks, but if you do what’s expected of you—
that is, write down examples of the construction before reading on—then you build
up those neural pathways, and eventually you realize that the explanation is as
simple and clear as it can be. Meanwhile, your job is to evaluate the statements
made in the proof on your examples. Practice allows you to judge how much work you
need to put into understanding a construction or definition before continuing. And,
more importantly, you’ll understand it more thoroughly for all your testing.
Now for the uniqueness part. This is a straightforward proof, but it relies on a
special fact about polynomials. We’ll state the fact as a theorem that we won’t
prove. Some terminology: a root of a polynomial f : R → R is a value z for which f(
z) = 0.
Theorem 2.3. The zero polynomial is the only polynomial over R of degree at most n
which has more than n distinct roots.
So suppose f, g are two such polynomials. Consider the polynomial ( f − g)( x),
which we define as ( f − g)( x) = f( x) − g( x). Note that f − g is a polynomial
because, if the coefficients of f are ai and the coefficients of g are bi, the
coefficients of f − g are ci = ai − bi. If f and g have different degrees, then ci
is simply ai or −bi, depending on which of f, g has a larger degree. It is crucial
to this proof that f − g is a polynomial.
What do we know about f − g? Its degree is certainly at most n, because you can’t
magically produce a coefficient of x 7 if you subtract two polynomials whose
highest-degree terms are x 5. Moreover, we know that ( f − g)( xi) = 0 for all i.
Recall that x is the generic input variable, while xi are the input values of the
specific list of points ( x 1 , y 1) , . . . , ( xn+1 , yn+1) that f and g are
assumed to agree on. Indeed, for every i, f ( xi) = g( xi) = yi, so subtracting
them gives zero.
Now we apply Theorem 2.3. If we call d the degree of f − g, we know that d ≤ n, and
hence that f −g can have no more than n roots unless it’s the zero polynomial. But
there are n + 1 points xi where f − g is zero! Theorem 2.3 implies that f − g must
be the zero polynomial, meaning f and g have the same coefficients.
Just for completeness, I’ll write the above argument more briefly and put the whole
proof of the theorem together as it would show up in a standard textbook. That is,
extremely tersely.
f(x) = ∑_{i=1}^{n+1} y_i ∏_{j≠i} (x − x_j)/(x_i − x_j)
Clearly the constructed polynomial f( x) is degree ≤ n because each term has degree
at most n. For each i, plugging in xi causes all but the i-th term in the sum to
vanish, and the i-th term clearly evaluates to yi, as desired.
To show uniqueness, let g( x) be another polynomial that passes through the same
set of points given in the theorem. We will show that f = g. Examine f −g. It is a
polynomial with degree at most n which has all of the n + 1 values xi as roots. By
Theorem 2.3, we conclude that f − g is the zero polynomial, or equivalently that f = g. □
We spent quite a few pages expanding the details of a ten-line proof. This is par
for the course. When you encounter a mysterious or overly brief theorem or proof it
becomes your job to expand and clarify it as needed. Much like with reading
programs written by others, as your mathematical background and experience grows
you’ll need less work to fill in the details.
Now that we’ve shown the existence and uniqueness of a degree at most n polynomial
passing through a given list of n + 1 points, we’re allowed to give “it” a name.
It’s called the interpolating polynomial of the given points. The verb interpolate
means to take a list of points and find the unique minimum-degree polynomial
passing through them.
Let’s write a Python program that computes the interpolating polynomial. I’m going
to assume the existence of a polynomial class that accepts as input a list of
coefficients (in the same order as Definition 2.1, starting from the degree zero
term) and has methods for adding, multiplying, and evaluating at a given value. All
of this code, including the polynomial class, is available at this book’s Github
repository. 9 Note the polynomial class is not intended to be perfect. The goal is
not to be industry-strength, but to help you understand the constructions we’ve
seen in the chapter.
ZERO = Polynomial([])
f = Polynomial([1, 2, 3])  # represents 1 + 2x + 3x^2
f(1) == 6  # True; evaluating at x = 1 sums the coefficients
Now we write the main interpolate function. It uses the yet-to-be-defined function single_term that computes a single term of the interpolating polynomial for a given point. Note we use Python list comprehensions, for which [EXPRESSION for x in my_list] is equivalent to the following loop, with output_list being the value of the whole expression:

output_list = []
for x in my_list:
    output_list.append(EXPRESSION)
output_list
9 See pimbook.org.
def interpolate(points):
    """Return the unique polynomial of degree at most n passing
    through the given n+1 points.
    """
    if len(points) == 0:
        raise ValueError('Must provide at least one point.')
    x_values = [p[0] for p in points]
    if len(set(x_values)) < len(x_values):
        raise ValueError('Not all x values are distinct.')
    terms = [single_term(points, i) for i in range(len(points))]
    return sum(terms, ZERO)
The first two blocks check for the edge cases: an empty input or repeated x values.
The last block creates a list of terms of the sum from the proof of Theorem 2.2.
The return statement sums all the terms, using the zero polynomial as the starting
value. Now for the single_term function.
def single_term(points, i):
    """Return one term of the interpolating polynomial.

    Arguments:
      points: a list of (x, y) pairs
      i: an index into points
    """
    the_term = Polynomial([1.])
    xi, yi = points[i]
    for j, p in enumerate(points):
        if j == i:
            continue
        xj = p[0]
        # Multiply in the linear factor (x - xj) / (xi - xj).
        the_term = the_term * Polynomial([-xj / (xi - xj), 1.0 / (xi - xj)])
    return the_term * Polynomial([yi])
>>> interpolate(points1)
1.0
>>> interpolate(points2)
>>> f = interpolate(points3)
>>> f
Next we’ll use polynomial interpolation to “share secrets” in a secure way. Here’s
the scenario. Say I have five daughters, and I want to share a secret with them,
represented as a binary string and interpreted as an integer. Perhaps the secret is
the key code for a safe which contains my will. The problem is that my daughters
are greedy. If I just give them the secret one might do something nefarious, like
forge a modified will that leaves her all my riches at the expense of the others.
Moreover, I’m afraid to even give them part of the key code. They might be able to
brute force the rest and gain access. Any daughter of mine will be handy with a
computer. Even worse, three of the daughters might get together with their pieces
of the key code, guess the rest, and exclude the other two daughters. So what I
really want is a scheme that has the following properties.
1. Each daughter receives her own piece of data, a “share” of the secret.
2. If four of the daughters collude without the fifth, they cannot use their shares to reconstruct the secret.
3. If all five of the daughters combine their shares, they can reconstruct the
secret.
In fact, I’d be happier if I could prove, not only that any four out of the five
daughters couldn’t pool their shares to determine the secret, but that they’d
provably have no information at all about the secret. They can’t even determine a
single bit of information about the secret, and they’d have an easier time breaking
open the safe with a jackhammer.
The magical fact is that there is such a scheme. Not only is it possible, but it’s
possible no matter how many daughters I have (say, n), and no matter what minimum
size group I want to allow to reconstruct the secret (say, k ≤ n). So I might have
20 daughters, and I may want any 14 of them to be able to reconstruct the secret,
but prevent any group of 13 or fewer from doing so.
Polynomial interpolation gives us all of these guarantees. Here is the scheme.
First represent the secret s as an integer. Construct a random polynomial f( x) so
that f(0) =
s. We’ll say in a moment what degree d to use for f ( x). If we know d, generating
f is easy.
To do so, pick d random numbers a_1, . . . , a_d and set f(x) = s + a_1 x + · · · + a_d x^d. Then give the i-th daughter the share f(i), a single point on the polynomial.
What do we know about subsets of points? If any k people get together, they can
construct the unique degree k − 1 polynomial g( x) passing through all those
points.
The question is, will the resulting g( x) be the same as f( x)? If so, they can
compute g(0) = f (0) to get the secret! This is where we pick d, to control how
many shares are needed. If we want k to be the minimum number of shares needed to
reconstruct the secret, we make our polynomial degree d = k − 1. Then if k people
get together and interpolate g( x), they can appeal to Theorem 2.2 to be sure that
g( x) = f( x).
Let’s be more explicit and write down an example. Say we have n = 5 daughters, and
we want any k = 3 of them to be able to reconstruct the secret. Pick a polynomial
f( x) of degree d = k − 1 = 2. If the secret is 109, we generate f as f ( x) = 109
+ random · x + random · x 2
Note that if you’re going to actually use this to distribute secrets that matter,
you need to be a bit more careful about the range of these random numbers. For the
sake of this example let’s say they’re random 10-bit integers, but in reality you’d
want to do everything with modular arithmetic. See the Chapter Notes for further
discussion.
Suppose the two random numbers chosen are −55 and 271, giving
f(x) = 109 − 55x + 271x^2,
so the five shares are the points (1, 325), (2, 1083), (3, 2383), (4, 4225), (5, 6609).
The polynomial interpolation theorem tells us that with any three points we can
completely reconstruct f( x), and then plug in zero to get the secret.
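The whole scheme can be sketched end to end in a few lines of Python. This is a toy illustration only: it uses exact rational arithmetic instead of the modular arithmetic a real implementation needs, and make_shares and reconstruct are names invented here, not the chapter's code.

```python
import random
from fractions import Fraction

def make_shares(secret, k, n):
    # f(0) = secret; the remaining k - 1 coefficients are random.
    coeffs = [secret] + [random.randint(0, 1023) for _ in range(k - 1)]

    def f(x):
        return sum(c * x ** e for e, c in enumerate(coeffs))

    # Daughter i receives the single point (i, f(i)).
    return [(i, f(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange-interpolate the shares and evaluate the result at zero.
    total = Fraction(0)
    for i, (xi, yi) in enumerate(shares):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(shares):
            if j != i:
                term *= Fraction(0 - xj, xi - xj)
        total += term
    return total

shares = make_shares(109, k=3, n=5)
assert reconstruct(shares[:3]) == 109             # any 3 shares suffice
assert reconstruct([shares[0], shares[2], shares[4]]) == 109
```

Any subset of k = 3 shares determines the unique degree-2 polynomial, so evaluating it at zero always recovers the secret.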
For example, using our polynomial interpolation algorithm, if we feed in the first, third, and fifth shares we reconstruct the polynomial exactly:
>>> points = [(1, 325), (3, 2383), (5, 6609)]
>>> interpolate(points)
At this point you should be asking yourself: how do I know there’s not some other
way to get f( x) (or even just f(0)) if you have fewer than k points? You should
clearly understand the claim being made. It’s not just that one can reconstruct
f(0) when given enough points on f, but also that no algorithm can reconstruct f(0)
with fewer than k points.
Indeed it’s true, and two little claims show why. Say f is degree d and you have d
points (just one fewer than the theorem requires to reconstruct). The first claim
is that there are infinitely many different degree d polynomials passing through
those same d points. Indeed, if you pick any new x value, say x = 0, and any y
value, and you add ( x, y) to your list of points, then you get an interpolated
polynomial for that list whose “decoded secret” is different. Due to Theorem 2.2,
each choice of y gives a different interpolating polynomial.
The second claim is a consequence of the first. If you only have d points, then not
only can f(0) be different, but it can be anything you want it to be! For any value
y that you think might be the secret, there is a choice of a new point that you
could add to the list to make y the “correct” decoded value f(0).
Let’s think about this last claim. Say your secret is an English sentence s =
“Hello, world!” and you encode it with a degree 10 polynomial f( x) so that f(0) is
a binary representation of s, and you have the shares f(1) , . . . , f(10). Let y
be the binary representation of the string “Die, rebel scum!” Then I can take those
same 10 points, f (1) , f (2) , . . . , f (10), and I can make a polynomial passing
through them and for which y = f (0). In other words, your knowledge of the 10
points gives you no information to distinguish between whether the secret is “Hello
world!” or “Die, rebel scum!” Same goes for the difference between “John is the
sole heir” and “Joan is the sole heir,” a case in which a single-character
difference could change the entire meaning of the message.
To drive this point home, let’s go back to our small example secret 109 and encoded
polynomial
f ( x) = 109 − 55 x + 271 x 2
I give you just two points, (2 , 1083) , (5 , 6609), and a desired “fake” decrypted
message, 533. The claim is that I can come up with a polynomial that has f (2) =
1083 and f (5) =
6609, and also f(0) = 533. Indeed, we already wrote the code to do this: add the desired point (0, 533) to the two known shares and interpolate. Evaluating the resulting polynomial at zero gives 533.0, and Figure 2.3 plots it alongside other polynomials that agree on the same two points.
Note that the coefficients of the fake secret polynomial are no longer integers,
but this problem is fixed when you do everything with modular arithmetic instead of
floating point numbers (again, see the Chapter Notes).
Figure 2.3: A plot of four different curves that agree on the two points (2 , 1083)
, (5 , 6609), but have a variety of different “decoded secret” values.
The property of being able to “decode” to any possible plaintext given an encrypted
text is called perfect secrecy, and it’s an early topic on a long journey through
mathematical cryptography.
1. Whenever you see a definition, you must immediately write down examples.
2.7 Exercises
3. Does the above fact work when f or g are the zero polynomial, using our
convention that the zero polynomial has degree − 1? If not, can you think of a
better convention?
4. Prove that two polynomial formulas with different degrees cannot be equal as
functions. That is, there must be some input on which they disagree.
2.1. Two integers a, b are said to be relatively prime if their only common divisor
is 1. Let n be a positive integer, and define φ(n) (for n > 1) to be the number of
positive integers less than n that are relatively prime to n. Describe why one
might reasonably add the restriction n > 1.
g(x)h(x), for some polynomial h. It is said that f can be "factored" into g and h.
Note that g and h must both have real coefficients and be of smaller degree than f.
2.3. Verify the following theorem using the examples from the previous exercise.
That is, write down examples and check that the theorem works as stated. If a, n
are relatively prime integers, then a^φ(n) has remainder 1 when divided by n. This
result is known as Euler's theorem (pronounced "OY-lurr"), and it is the keystone
of the RSA cryptosystem.
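Exercise 2.3 invites exactly this kind of machine checking; a brute-force sketch, with φ computed by trial division (nothing clever):

```python
from math import gcd

def phi(n):
    """Euler's totient: count the k with 1 <= k < n and gcd(k, n) == 1."""
    return sum(1 for k in range(1, n) if gcd(k, n) == 1)

# Euler's theorem: if gcd(a, n) == 1 then a^phi(n) has remainder 1 mod n.
for a, n in [(3, 10), (7, 20), (2, 9)]:
    assert gcd(a, n) == 1
    assert pow(a, phi(n), n) == 1
```

Python's three-argument `pow` does the modular exponentiation efficiently, which is exactly the operation RSA relies on.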
like √2 are algebraic, while numbers like π and e are famously not algebraic. The
golden ratio is another famous algebraic number.
2.6. Prove the product of two algebraic numbers is algebraic. Similarly (but much
harder), prove the sum of two algebraic numbers is algebraic. Despite the fact that
π
and e are not algebraic, it is not known whether π + e or πe are algebraic. Look up
a proof that they cannot both be algebraic. Note that many such proofs appeal to
vector spaces, the topic of Chapter 10.
2.7. Let f(x) = aₙxⁿ + · · · + a₁x + a₀ be a polynomial of degree n with roots
r₁, . . . , rₙ. Prove that

∑ᵢ rᵢ = −aₙ₋₁/aₙ and ∏ᵢ rᵢ = (−1)ⁿ a₀/aₙ,

where the sum and product run over i = 1, . . . , n.
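These identities are Vieta's formulas, and they're easy to spot-check numerically before proving them; a sketch that expands a polynomial from chosen roots and compares against the coefficients:

```python
from math import prod

def poly_from_roots(roots):
    """Expand prod_i (x - r_i) into coefficients [a0, a1, ..., an]."""
    coeffs = [1.0]
    for r in roots:
        coeffs = [0.0] + coeffs           # multiply the polynomial by x
        for i in range(len(coeffs) - 1):  # then subtract r * (old polynomial)
            coeffs[i] -= r * coeffs[i + 1]
    return coeffs

roots = [1.0, 2.0, -4.0, 0.5]
c = poly_from_roots(roots)        # c[i] is the coefficient of x^i
n, an = len(roots), c[-1]
assert abs(sum(roots) - (-c[n - 1] / an)) < 1e-9
assert abs(prod(roots) - ((-1) ** n * c[0] / an)) < 1e-9
```

The polynomial here is monic by construction; the division by aₙ matters once you scale all coefficients by a constant.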
2.8. Look up a proof of Theorem 2.3. There are many different proofs. Either read
one and understand it using the techniques we described in this chapter (writing
down examples and tests), or, if you cannot, then write down the words in the
proofs that you don’t understand and look for them later in this book.
2.9. There are many ways to skin a cat. The polynomial interpolation construction
from this chapter is just one, often called Lagrange interpolation. Another is
called Newton interpolation. Find a source that explains what it is, try to
understand how these two interpolation methods differ, and implement Newton
interpolation. Compare the two interpolation methods in terms of efficiency.
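As a head start on Exercise 2.9, here is a minimal sketch of Newton interpolation via divided differences (the structure follows the standard algorithm; the function name is my own):

```python
def newton_interpolate(points):
    """Return a function evaluating the interpolating polynomial
    through `points`, built via Newton's divided differences."""
    xs = [x for x, _ in points]
    coef = [y for _, y in points]  # coef[k] becomes f[x0, ..., xk]
    n = len(points)
    for k in range(1, n):
        for i in range(n - 1, k - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - k])

    def f(x):
        # Horner-style evaluation of the Newton form.
        result = coef[-1]
        for i in range(n - 2, -1, -1):
            result = result * (x - xs[i]) + coef[i]
        return result

    return f

# The chapter's example secret polynomial: f(x) = 109 - 55x + 271x^2.
f = newton_interpolate([(0, 109), (2, 1083), (5, 6609)])
print(f(0), f(1))  # 109.0 325.0
```

A notable efficiency difference: adding one more data point to the Newton form only extends the divided-difference table, while the Lagrange construction starts over.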
2.10. Bézier curves are single-variable polynomials that draw a curve controlled by
a given set of “control points.” The polynomial separately controls the x and y
coordinates of the Bézier curve, allowing for complex shapes. Look up the
definition of quadratic and cubic Bézier curves, and understand how they work. Write
a program that computes a generic Bézier curve, and animates how the curve is
traced out by the input. Bézier curves are most commonly seen in vector graphics
and design applications as the “pen tool.”
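For Exercise 2.10, one standard evaluation method is De Casteljau's algorithm, which computes a point on a Bézier curve of any degree by repeated linear interpolation; a minimal sketch (the control points are chosen arbitrarily):

```python
def bezier_point(controls, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] by
    De Casteljau's repeated linear interpolation."""
    pts = list(controls)
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

# A quadratic curve from (0, 0) to (2, 0), pulled toward (1, 2).
quad = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
print(bezier_point(quad, 0.5))  # (1.0, 1.0)
```

Animating the curve amounts to plotting `bezier_point(quad, t)` as t sweeps from 0 to 1.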
2.11. Consider the polynomial

w(x) = (x − 1)(x − 2) · · · (x − 20)
2.12. Write a web app that implements the distribution and reconstruction of the
secret sharing protocol using the polynomial interpolation algorithm presented in
this chapter, using modular arithmetic with a 32-bit modulus p.
2.13. The extended Euclidean algorithm computes the greatest common divisor of two
numbers, but it also works for polynomials. Write a program that implements the
Euclidean algorithm to compute the greatest common divisor of two monic
polynomials.
2.15. Perhaps the biggest disservice in this chapter is ignoring the so-called
Fundamental Theorem of Algebra, that every single-variable monic polynomial of
degree k can be factored into linear terms p(x) = (x − a₁)(x − a₂) · · · (x − aₖ).
The reason is that the values aᵢ are not necessarily real numbers. They might
be complex. Moreover, all of the proofs of the Fundamental Theorem are quite hard.
In fact, one litmus test for the
“baby” fundamental theorem, which says that every single-variable polynomial with
real coefficients can be factored into a product of linear and degree-2 terms
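In a concrete case the complex roots are easy to exhibit: x² + 1 has no real roots, but the quadratic formula over the complex numbers factors it completely. A sketch using Python's cmath (the helper function is my own):

```python
import cmath

def quadratic_roots(a, b, c):
    """Both roots of a*x^2 + b*x + c = 0, complex if necessary."""
    disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)

r1, r2 = quadratic_roots(1, 0, 1)
print(r1, r2)  # the factorization x^2 + 1 = (x - 1j)(x + 1j)
```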
Twin Primes
The Twin Prime Conjecture, the assertion that there are infinitely many pairs of
prime numbers of the form p, p + 2, is one of the most famous open problems in
mathematics.
Its origin is unknown, though the earliest record of it in print is in the
mid-1800s in a text of de Polignac. In an exciting turn of events, in 2013 a
then-unknown mathematician named Yitang Zhang13 published a breakthrough paper
making progress on Twin Primes.
His theorem is not about Twin Primes, but a relaxation of the problem. This is a
typical strategy in mathematics: if you can’t solve a problem, make the problem
easier until you can solve it. Insights and techniques that successfully apply to
the easier problem often work, or can be made to work, on the harder problem. Zhang
successfully solved the following relaxation of Twin Primes, which had been
attempted many times before.
Theorem. There is a constant M, such that infinitely many primes p exist such that
the next prime q after p satisfies q − p ≤ M.
If M is replaced with 2, then you get Twin Primes. The thinking is that perhaps
it's easier to prove that there are infinitely many prime pairs within distance 6
of each other, or 100. In fact, Zhang's paper established it for M approximately 70
million. But it was the first bound of its kind, and it won Zhang a MacArthur
"genius award" in addition to his choice of professorships.
As of this writing, subsequent progress, carried out by some of the world’s most
famous mathematicians in an online collaboration called the Polymath Project,
brought M down to 246. Assuming a conjecture in number theory called the Elliott-
Halberstam conjecture, they reduced this constant to 6.
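At small scales the conjecture's objects are easy to enumerate; a sketch that sieves for primes and collects consecutive pairs at gap 2:

```python
def primes_up_to(n):
    """A simple Sieve of Eratosthenes."""
    is_prime = [False, False] + [True] * (n - 1)
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, n + 1, p):
                is_prime[m] = False
    return [p for p in range(n + 1) if is_prime[p]]

def twin_pairs(n):
    """All pairs of primes (p, p + 2) with p + 2 <= n."""
    ps = primes_up_to(n)
    return [(p, q) for p, q in zip(ps, ps[1:]) if q - p == 2]

print(len(twin_pairs(100)))  # 8 pairs below 100, starting with (3, 5)
```

Replacing the condition `q - p == 2` with `q - p <= M` counts the pairs relevant to Zhang's relaxation.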
Impossibility of Clustering
A clustering function takes as input:
• A list of points S,
• A distance function d on pairs of points in S,
13 Though he had a Ph.D., early in his career Zhang had been unable to find academic
work, and had stints in a motel, as a delivery driver, and at a Subway sandwich
shop before he found a position as a lecturer at the University of New Hampshire.
and produces as output a clustering of S, i.e., a choice of how to split S into
non-overlapping subsets. The individual subsets are called “clusters.”
The function d is also required to have some properties that make it reasonably
interpretable as a "distance" function. In particular, all distances are
nonnegative, d(x, y) = 0 if and only if x = y, and d(x, y) = d(y, x).
One can interpret this theorem as an explanation (in part) for why clustering is a
hard problem. While there are hundreds of clustering algorithms to choose from,
none “just works” the way we humans intuitively want one to. This may be, as
Kleinberg suggests, because our naive brains expect these three properties to hold,
despite the fact that they are mathematically incompatible.
It also suggests that the "right" clustering function depends more on the
application you use it for, which raises the question: how can one pick a
clustering function in a principled way?
It turns out, if you allow the required number of output clusters to be an input to
the clustering algorithm, you can avoid impossibility and instead achieve
uniqueness. For more, see the 2009 paper “A Uniqueness Theorem for Clustering” of
Zadeh and Ben-David.
The authors proceeded to study how to choose a clustering algorithm “in principle”
by studying what properties uniquely determine various clustering algorithms;
meaning if you want to do clustering in practice, you have to think hard about
exactly what properties your application needs from a clustering. Suffice it to
say, this process is a superb example of navigating the border separating
impossibility, existence, and uniqueness in mathematics.
The secret sharing scheme presented in this chapter was originally devised by Adi
Shamir (the same Shamir of RSA) in a two-page 1979 paper called "How to share a
secret." In this paper, Shamir writes in a terse style and does not remind the
reader how the interpolating polynomial is constructed.
He does, however, mention that in order to make this scheme secure, the
coefficients of the polynomial must be computed using modular arithmetic. Here’s
what is meant by that, and note that we’ll return to understand this in Chapter 16
from a much more general perspective.
Given an integer n and a modulus p (in our case a prime integer), we represent n
by its remainder a after dividing by p, written

a ≡ n mod p.
The syntactical operator precedence is a bit weird here: “mod” is not a binary
operation, but rather describes the entire equation, as if to say, “everything here
is considered modulo p.”
We chose a prime p for the modulus because doing so allows you to "divide." Indeed,
for a given n not divisible by p, there is a unique k such that (n · k) ≡ 1 mod p.
Again, an interesting example of existence and uniqueness. Note that it takes some
work to find k, and the extended Euclidean algorithm is the standard method. When
evaluating a polynomial function like f(x) at a given x, the output is taken
modulo p and is guaranteed to be between 0 and p − 1.
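The inverse k mentioned above can be computed with the extended Euclidean algorithm; a minimal recursive sketch (the function names are my own):

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def mod_inverse(n, p):
    """The unique k in [1, p) with (n * k) % p == 1, when gcd(n, p) == 1."""
    g, x, _ = extended_gcd(n % p, p)
    if g != 1:
        raise ValueError("n and p are not relatively prime")
    return x % p

print(mod_inverse(3, 7))  # 5, since 3 * 5 = 15 has remainder 1 mod 7
```

"Dividing by n" modulo p then means multiplying by `mod_inverse(n, p)`.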
Moreover, when you use modular arithmetic you can prove that picking a uniformly
random (d + 1)-th point in the secret sharing scheme will produce a uniformly
random decoded "secret" f(0). That is, uniformly random between 0 and p − 1. Without
bounding the allowed size of the integers, it doesn't make sense to have a
"uniform" distribution.
Finally, from discussions I’ve had with people using this scheme in industry,
polynomial interpolation is not fast enough for modern applications. For example,
one might want to do secret sharing between three parties at streaming-video rates.
Rather, one should use so-called “linear” secret sharing schemes, which are based
on systems of linear equations. Such schemes are best analyzed from the perspective
of linear algebra, the topic of Chapter 10.
Chapter 3
You enter the first room of the mansion and it’s completely dark. You stumble
around bumping into the furniture but gradually you learn where each piece of
furniture is.
Finally, after six months or so, you find the light switch, you turn it on, and
suddenly it’s all illuminated. You can see exactly where you were. Then you move
into the next room and spend another six months in the dark. So each of these
breakthroughs, while sometimes they’re momentary, sometimes over a period of a day
or two, they are the culmination of, and couldn’t exist without, the many months of
stumbling around in the dark that precede them.
– Andrew Wiles
We learned a lot in the last chapter. One aspect that stands out is just how slow
the process of learning unfamiliar math can be. I told you that every time you see
a definition or theorem, you had to stop and write stuff down to understand it
better. But this isn't all that different from programming. Experienced coders know
when to fire up a REPL.
The main difference for us is that mathematics has no debugger or REPL. There is no
reference implementation. Mathematicians often get around this hurdle by
conversation, and I encourage you to find a friend to work through this book with.
As William Thurston writes in his influential essay, “On Proof and Progress in
Mathematics,” mathematical knowledge is embedded in the minds and the social fabric
of the community of people thinking about a topic. Books and papers support this,
but the higher up you go, the farther the primary sources stray from textbooks.
If you are reading this book alone, you have to play the roles of the program
writer, the tester, and the compiler. The writer for when you’re conjuring new
ideas and asking questions; the tester for when you’re reading theorems and
definitions; and the compiler to check your intuition and hunches for bugs. This
often slows reading mathematics down to a crawl, for novices and experts alike.
Mathematicians always read with a pencil and notepad handy.
When you first read a theorem, you expect to be confused. Let me say it again: the
rule is that you are confused, the exception is that everything is clear.
Mathematical culture requires being comfortable being almost continuously in a
state of little to no understanding. It's a humble life, but once you nail down
what exactly is unclear,
you can make progress toward understanding. The easiest way to do this is by
writing down lots of examples, but it’s not always possible to do that. We’ve
already seen an example, a theorem about the impossibility of having a nonzero
polynomial with more roots than its degree.
In the quote at the beginning of this chapter, Andrew Wiles discusses what it’s
like to do mathematical research, but the same analogy holds for learning
mathematics. Speaking with experienced mathematicians and reading their books makes
you feel like an idiot.
Whatever they’re saying is the most basic idea in the world, and you barely stumble
along.
I’ve been in the student’s shoes a thousand times. Indeed, if I’m not in those
shoes at least once a day then it wasn’t a productive day! I say at least a dozen
stupid things daily and think countlessly many more stupid thoughts in search of
insight. It’s a rare moment when I think, “I’m going to solve this problem I don’t
already know how to solve,” and there is no subsequent crisis. Even in reading what
should be basic mathematical material (there’s a huge list of things that I am
embarrassed to be ignorant about) I find myself mentally crying out, “How the hell
does that statement follow⁉”
In Andrew Wiles’s analogy, my friend is still in the dark room, but she’s feeling
some object precisely enough to understand that it’s a vase. She still has no idea
where the light switch is, and the vase might give her no indication as to where to
look next. But if piece by piece she can construct a clear enough picture of the
room in her mind, then she will find the switch. What keeps her going is that she
knows enough little insights will lead her to a breakthrough worth having.
Though she is working on far more complicated and abstract mathematics than you
are likely to, we must all adopt her attitude if we want to learn mathematics. If
it sounds like all of this will take way too much of your time (all day to learn a
single little thing!), remember two things. First, my colleague works on much more
abstract and difficult mathematics than the average programmer interested in
mathematics would encounter.
She's looking for the meta-insights that are many levels above the insights found
in this book. As we'll see in Chapter 11, insights are like a ladder, and every
rung is useful.
1 You can watch it at http://youtu.be/KdxEAt91D7k
Second, the more you practice reading and absorbing mathematics, the better you get
at it. When my colleague says she spent an entire day understanding something, she
efficiently applied tools she had built up over time. She has a bank of examples to
bolster her. She knows how to cycle through applicable proof techniques, and how to
switch between different representations to see if a different perspective helps.
Some of these techniques are described in Appendix B.
But most importantly, she's being inquisitive! Her journey is led as much by her
task as by her curiosity. As mathematician Paul Halmos said in his book, "I Want to
Be a Mathematician,"
Don’t just read it; fight it! Ask your own questions, look for your own examples,
discover your own proofs.
Mathematician Terence Tao expands on this in his essay, "Ask yourself dumb
questions – and answer them!":
When you learn mathematics, whether in books or in lectures, you generally only see
the end product—very polished, clever and elegant presentations of a mathematical
topic.
However, the process of discovering new mathematics is much messier, full of the
pursuit of directions which were naive, fruitless or uninteresting.
While it is tempting to just ignore all these “failed” lines of inquiry, actually
they turn out to be essential to one’s deeper understanding of a topic, and (via
the process of elimination) finally zeroing in on the correct way to proceed.
So you’ll get confused. We all do. A good remedy is finding the right pace to make
steady progress. And when in doubt, start slow.
Chapter 4
Sets
God created infinity, and man, unable to understand infinity, created finite sets.
– Gian-Carlo Rota
In this chapter we'll lay the foundation for the rest of the book. Most of the
chapter is devoted to the mathematical language of sets and functions between sets.
Sets
and functions serve not only as the basis of most mathematics related to computer
science, but also as a common language shared between all mathematicians. Sets are
the modeling language of math. The first, and usually simplest, way to convert a
real world problem into math involves writing down the core aspects of that problem
in terms of sets and functions. Unfortunately set theory has a lot of new
terminology. The parts that are new to you are best understood by writing down lots
of examples.
After converting an idea into the language of sets, you may use the many existing
tools and techniques for working with sets. As such, the work one invests into
understanding these techniques pays off across all of math. It’s largely the same
for software: learning how to decompose a complex problem into simple, testable,
maintainable functions pays off no matter the programming language or problem
you’re trying to solve. The same goes for the process of modeling business rules in
software in a way that is flexible as the business changes. Sets are a fundamental
skill.
At the end of the chapter we’ll see the full modeling process for an application
called stable marriages, which is part of an interdisciplinary field of mathematics
and economics called market design. In economics, there are occasionally markets in
which money can’t be used as a medium of exchange. In these instances, one has to
find some other mechanism to allow the market to function efficiently. The example
we’ll see is the medical residency matching market, but similar ideas apply to
markets like organ donation and housing allocation. As we’ll see, the process of
modeling these systems so they can be analyzed with mathematics requires nothing
more than fluency with sets and functions.
In Python they are simply called “sets.” In Java they go by HashSet, and in C++ by
unordered_set. Functionally they are all equivalent: a collection of objects
without repetition. While set implementations often have a menagerie of details—
such as immutability of items, collision avoidance techniques, complexity of
storing/lookup—mathematical sets “just work.” In other words, we don’t care how
items enter and leave sets, and mutability is not a concern because we aren’t
hashing anything to look it up. Efficiency is irrelevant.
To start, we need to know how to describe sets. The simplest way is with words. For
example, I can describe the set of integers divisible by seven, or the set of
primes, or the set of all syntactically correct Java programs.1 Often the goal of
analyzing a mathematical object is to come up with a concrete description of a set,
but implicit definitions are a great starting point.
Set-builder notation provides a more syntactic way to describe sets. For example,
the set of all positive integers divisible by seven can be written:
S = {x : x ∈ N, x is divisible by 7}
The notation reads like the sentence in words, where the colon stands for “such
that.”
I.e., “The set of values x such that x is in N and x is divisible by 7.” Sometimes
a vertical bar | is used in place of the colon. The symbols separate the
constructive expression from the membership conditions (it’s not an output-input
pipe as in shell scripting). The ∈
symbol denotes membership in a set, and the objects in a set are called elements.
Lists made with list comprehensions need not have unique elements, while
mathematical sets must. Set-builder notation is also more expressive: put
whatever conditions you like after the colon, even if you don't know how to compute
them! The left hand side of the colon may also be an expression, as in
{(x, 2x + 1) : 0 ≤ x < 10}
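The comparison with comprehensions is easiest to see side by side; a sketch using Python's set comprehensions, with a finite range standing in for the infinite set N:

```python
# S = {x : x in N, x is divisible by 7}, truncated at 100 to stay finite.
S = {x for x in range(1, 100) if x % 7 == 0}

# An expression left of the colon: {(x, 2x + 1) : 0 <= x < 10}.
pairs = {(x, 2 * x + 1) for x in range(10)}

print(sorted(S)[:3])  # [7, 14, 21]
```

Of course, the mathematical notation has no such restriction to computable conditions or finite ranges.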
Now we turn to some definitions you may already be familiar with. If not, remember
it’s your job to write down examples. In either case, mathematical texts typically
define 1 There are some strange “meta” things you are not allowed to describe as
sets, such as the set of all sets. It turns out this is not a set, and it caused a
lot of grief to early 20th century mathematicians who really cared about the
logical foundations of mathematics. This book omits these topics, since all of our
sets will be comfortably finite or concrete like R.
something once and only once. I will occasionally repeat definitions that are used
across chapters, but generally authors will not. You’re expected to have understood
a definition to an appropriate degree of comfort before continuing.
Definition 4.1. The cardinality or size of a set A, denoted |A|, is the number of
elements in A when that number is finite, and otherwise we say A has infinite
cardinality. 2 A set with no elements is called the empty set, and it has
cardinality zero.
Proving one set is a subset of another is usually easy, but not always. The
standard technique is to fix b to be an arbitrary element of B, and use whatever
characteristic defines B to show that b ∈ A as well. Here’s a brief example: the
set of integers divisible by 57 is a subset of the set of integers divisible by 3,
because any number b divisible by 57 has the form b = 57 · k = 3 · (19 · k), which
means it’s also divisible by 3. No alarms and no surprises.
The symbol ∉ denotes the negation of a membership claim or query. Other slashed
operators include ≠, ⊄, ≁.
Definition 4.3. Given two sets A and B, the complement of B in A is the set {a ∈
A : a ̸∈ B}. The complement is denoted either by A \ B or A − B, and sometimes B C
when B ⊂ A and A is clear from context.
You can already see I’m starting to be creatively flexible with set-builder
notation. Here a ∈ A might be interpreted as a boolean-valued expression,
suggesting the set has only boolean-valued members. However, reading it as a
sentence makes sense of it instead as an assertion: “The set of a in A such that a
is not in B.” Writing it more verbosely,
{a : a ∈ A and a ̸∈ B} is extra work without significant gain for the reader. If you
prefer the verbose version, it’s likely because you’ve spent so long phrasing your
thoughts to be machine readable. Appeal to your inner voice here, not your inner
type-checker.
2 It is not trivial to prove formally that every set has a well-defined size. This
fact is intertwined with the formal axiomatic framework set theory is based on,
called the Zermelo-Fraenkel set theory, often abbreviated as ZF or ZFC. Axiomatic
set theory is beyond the scope of this book, but it is one of those topics that
every mathematician has seen at least once.
If you want some practice working with basic set definitions, prove that for any
two sets A, B, the following containments hold: A ∩ B ⊂ A and A ⊂ A ∪ B.
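Before proving these containments, it's worth checking them on concrete sets; a quick sketch (the example sets are arbitrary, and Python's <= operator is the subset test):

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

assert A & B <= A   # A intersect B is a subset of A
assert A <= A | B   # A is a subset of A union B
print(A & B, A | B)
```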
Definition 4.5. The product of two sets A, B denoted A × B, is the set of all
ordered pairs of elements in A and elements in B. In set-builder notation it is: A
× B = {( a, b) : a ∈ A and b ∈ B}
The product is the usual way we turn the real line R into the real plane R². That
is, R² = R × R. For triples of real numbers there are two ways to build them from
pairwise products:

(R × R) × R = {((a, b), c) : a ∈ R, b ∈ R, c ∈ R}
R × (R × R) = {(a, (b, c)) : a ∈ R, b ∈ R, c ∈ R}
We want these sets to be considered the same. Indeed, the difference between the
two is the kind of distinction that programmers are very familiar with, because
compilers will refuse to proceed unless the parentheses align. But mathematicians,
for reasons we’ll see shortly,5 brush aside the difference and just say they’re the
“same” set, and they’re both equivalent to
(R × R) × R = R × (R × R) = {( a, b, c) : a ∈ R , b ∈ R , c ∈ R }
We will return later in this chapter, and again in Chapters 9 and 16 when
complexity will beg for a rigorous and useful abstraction called the quotient, to
understand why it’s okay to call these two sets “the same.” For now, simply define
an n-fold product to collapse pairs into tuples of length n:
Rⁿ = R × · · · × R  (n times)
This notation can be used for any set. Next we define functions as special subsets
of a product.
and B is called the codomain of F . To denote this, we use the arrow notation F : A
→ B.
5 There’s a bijection!
You should be writing down examples, but this one needs some help. For the example,
say F is the set of pairs of positive integers and their squares,

F = {(1, 1), (2, 4), (3, 9), (4, 16), . . . } = {(x, x²) : x ∈ N}.
It’s a subset of N × N. Now we can add a bit of notation: instead of saying that (3
, 9) ∈
F we use the mapping notation F (3) = 9. With this, we could describe F the way we
wanted to all along, as F ( x) = x 2. The conditions in Definition 4.6 ensure that
every input x has some output F ( x), and that each input x has only one output F (
x). Providing a concrete algorithm to compute the output from the input makes these
conditions trivial, as is the case with squared integers, but an algorithm is not
needed to define a function.
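A function-as-a-set translates directly into code; a sketch that stores a finite piece of F as a set of pairs and checks the two conditions of the definition, with a dict lookup playing the role of the mapping notation:

```python
# A finite piece of F = {(x, x^2) : x in N}, stored literally as a set of pairs.
F = {(x, x * x) for x in range(1, 11)}

# The two conditions: every input has an output, and no input has two outputs.
inputs = {x for x, _ in F}
assert inputs == set(range(1, 11))
assert len(F) == len(inputs)

# Recover the mapping notation F(3) = 9 via a lookup.
as_dict = dict(F)
print(as_dict[3])  # 9
```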
Reiterating a note from Chapter 2, the codomain B is not strictly encoded in the
data of a function F : A → B. The codomain is the set of allowed outputs.
So why go through all the trouble of defining functions in terms of sets? Part of
the answer is historical. The concept of sets as a modeling tool has probably
existed for as long as mathematics, but it was primarily used in its language form
(“I declare, consid-ereth only those heavenly numbers whose factorisation into
prymes containeth nary a repeated factor!”). The notation y = f( x) was invented in
the 1700’s by Leonhard Euler, and in those times most functions were only defined
in terms of formulas that were easy to write down. It was not until the late 19th
century that mathematicians formally studied sets, and proposed them as a logical
foundation for all of mathematics. To do so requires restating all existing
concepts in terms of sets. Definition 4.6 does this for functions. Similar
definitions exist defining integers and ordered tuples in terms of sets.
How tedious.
In this light, our initial definition of a set was completely imprecise. There is a
more precise definition, but it is the sort that only a logician would love, called
Zermelo-Fraenkel set theory. In brief, its base concepts are the empty set, set
membership, a notion of infinity, and a restricted choice of ways to build sets
from other sets. Using this one can define numbers, functions—even all of calculus—
from “first principles.” To instill this idea in future mathematicians, many
introductory proof textbooks define everything in terms of sets, and do formal
proofs to a degree of precision most mathematicians avoid in their day to day work.
In theory, mathematicians like the idea that everything can be reduced to sets.
Actually doing it in practice will drive you mad. It’s like writing all your
programs in pure binary.
Few do it, but we all take comfort in the idea that we could peel back the layers
to reveal the raw assembly instructions. In reality, abstractions keep us
productive. Likewise, defining the entirety of mathematics in sets is like “bare
metal” programming, but without any of the speed benefits of the finished program.
Someone ironed out set theory once, and we have a record of their work. Now we
can get back to doing mathematics.
The special notation for functions highlights our conceptual emphasis. We think of
functions differently than regular sets, with a semantic input-output dependence
that set notation doesn’t natively convey.
Now we turn to a few useful definitions about subsets of inputs and outputs of a
function. A seasoned programmer is less likely to be familiar with the remainder of
the definitions in this chapter, but we will rely on them throughout the book.
Definition 4.7. Given a function f : A → B, we define the image of f (or the image
of A under f ) as the set
f(A) = {f(a) : a ∈ A}

An equivalent way to write this is im f = {b ∈ B : ∃a ∈ A with f(a) = b}.
We won’t rely heavily on the ∃ notation, but it is quite common. Now we define the
preimage, the set of inputs mapping to a specified set of outputs.
Figure 4.2: An example of a surjection, where every element of the codomain is hit
by some element of the domain mapped through f. The dots are elements of the set,
and the arrows show the mapping. This example is also a non-injection.
Another bit of notation, just like ∃ meaning “there exists,” the symbol ∀ is a
shorthand for “for all.” I remember it by the backwards E standing for Exists,
while the upside-down A stands for All. So the surjective property can be written
hyper-compactly as
∀b ∈ B, ∃a ∈ A such that f( a) = b.
The symbols ∀, ∃ are called quantifiers and an expression in which every variable
is bound by a quantifier is called “fully quantified.”
I will shy away from such dense notation in this book, though it will come in handy
when we study Calculus in Chapter 8. While this example is not particularly
difficult to parse, unrestrained use of ∀, ∃ can quickly spin out of control. Just
as programmers shouldn’t cram a lot of complex logic into a single line of code,
bad mathematical writers cram many quantifiers into a single line of math when it’s
not necessary. That being said, familiarity with the symbols is broadly assumed.
Bijections are nice because they can be used to say that two sets have the same
cardinality (size), and it makes sense for infinite sets. If there is a bijection A
→ B then |A| = |B|.
Likewise, if there is an injection A → B then |A| ≤ |B|, and the opposite works for
surjections. See the exercises for more on this. Figure 4.3 shows the typical
picture for a bijection.
All bijections are invertible, and vice versa invertible functions must be
bijections.
Here are two such propositions we’ll use much later in our study of linear algebra
concerning the existence and structure of inverses. If you feel emotionally drained
by all the definitions in this chapter so far, feel free to skip these and come
back when we refer to them in Chapter 12.
Proof. It’s crucial here that f is surjective (otherwise the theorem is not true!).
Given b ∈ B, we need to show that f ( g( b)) = b. Start by choosing an a ∈ A for
which f ( a) = b. Then g( b) = g( f ( a)) = a. Apply f to both sides to get f
( g( b)) = f ( a) = b, as desired.
R ×(R × R) by “brushing aside” the differences between the two. There is a rigorous
way to do this, but I’ll only explain half of the rigor right now. The essential
reason is because there is a bijection (R × R) × R → R × (R × R) that maps (( a, b)
, c) to ( a, ( b, c)). Often when mathematicians want to “call” two things the
same, they’ll come up with such a bijection, and say the two things on either side
of such a bijection should be considered the same. It’s like an implicit typecast,
always reversible in this case. The formal idea is called a “quotient,” which we’ll
see in Chapter 9.
Now that we have the basic language of sets to model our problems, on to some
problems. Say you want to count the size of a set. Since sets can be defined
implicitly, it may not be obvious how. A useful tool used all over math is the
trick of coming up with a clever bijection. This can transform a seemingly
difficult counting problem into an elegantly trivial one. 7
Say you start with a thousand players. Let’s entertain a naive computation. In the
first round of the tournament, each player is paired up with another and 500 games
are played. In the second round there are 500 remaining players, and they again
pair off to play 250 games. In the third, 125 games. In the fourth round you hit an
edge case, because there are an odd number of players and one must sit out. Fine,
you keep going, diligently tracking the players who sit out, and eventually you get
to a number. You should try this yourself, and verify that the answer is 999 games.
Isn’t that a weird coincidence? We got 1 less than the total number of players.
Does this pattern hold for other tournament sizes?
The answer is yes. To prove it, we apply the technique of finding a clever
bijection. It will make you feel like our computation was a complete waste of time,
but if you did the exercise you’ll appreciate the elegance of this method that much
more.
The primary observation is that every loser loses exactly one game. So if we want
to count the number of games, we can instead count the number of losers. But there
is only one player who is not a loser: the winner. Hence 999 games.
7 Here’s a neat fact I learned from John D. Cook: in the Middle Ages, people
studied a “quadrivium” of mathematical arts: arithmetic, geometry, music, and
astronomy. This followed the “trivium” of grammar, rhetoric and logic. So when I
say a result is “trivial,” I’m not trying to insult anyone, but rather informing
that no new ideas are needed above basic logic. The best and most pleasing
mathematics takes a hard-seeming problem, and rephrases it in a clever way so that
the proof is trivial.
Let’s rephrase that elegant argument in the language of sets. Let X be the set of
games and Y the set of players. Define a function f : X → Y by calling f(x) the
loser of game x. This function is not a surjection. Rather, the image f(X) is the
subset L ⊂ Y of losers. However, f is an injection (different games have different
losers), and f defines a bijection between X and L. This means that X and L have
the same size, and the fact that there is only one winner of the entire tournament
means that |L| = |Y| − 1. So if there are n players then there will always be n −
1 games.
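If you did the naive computation by hand, you can also let a computer do it. A quick simulation sketch (my own, for illustration; the pairings and winners are random, since the count doesn’t depend on them):

```python
import random

def count_games(num_players):
    """Simulate a single-elimination tournament, tracking byes, and
    count the total number of games played."""
    players = list(range(num_players))
    games = 0
    while len(players) > 1:
        random.shuffle(players)
        # with an odd number of players, one sits this round out
        bye = players.pop() if len(players) % 2 == 1 else None
        winners = []
        for i in range(0, len(players), 2):
            # coin-flip winner; each game eliminates exactly one player
            winners.append(random.choice([players[i], players[i + 1]]))
            games += 1
        players = winners if bye is None else winners + [bye]
    return games
```

Every run reports count_games(1000) == 999, matching the loser-counting argument: each game eliminates exactly one player, and all but the winner are eliminated.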
To make sure you understand this argument, extend it to the case of a double-elimination tournament.
This general strategy for counting has applications any time you need to count or
estimate the size of a set. Imagine you want to estimate the number of homeless
people in a city, a problem the US Census Bureau faces regularly. You might
implicitly count them by observing the residual effects of their actions. This is
precisely looking for functions between sets that are close to bijections, or
double- or triple-covers of the set you want to count.
define the quantity $\binom{X}{2}$, read “X choose two,” to be the set of all unordered pairs
of elements of X. The size of $\binom{X}{2}$ doesn’t
depend on the particular elements in X, just the size of X. In words, $\binom{n}{2}$ is the number of ways
to choose two objects from a set of n objects.8 The problem is, can we come up with
an arithmetic formula for $\binom{n}{2}$ in terms of n? We’ll show by way of a bijection that it’s
equal to the quantity

1 + 2 + · · · + (n − 1).
g : Y → $\binom{X}{2}$: given any ball y ∈ Y, you draw two diagonals as in the picture and you
get g(y) as the pair of squares at the end of both diagonals. The picture should
convince you that two different choices of balls give you different diagonals,
i.e., g is an injection.
Now we count: how many balls and squares are there? The last row has n − 1 = 6
balls, and each row has one fewer ball than the row underneath it, so |Y| = 1 + 2
+ · · · + (n − 1).
You may wonder: how can we use a picture as the central part of our proof? Didn’t
we only prove that this bijection works for n = 7? Technically you’re right: no
mathematician would consider a picture as a rigorous proof in and of itself.
However, when the goal is to communicate the central nugget of wisdom in a proof, a
small example with all the essential features of a general proof is often good
enough. Consider one alternative.
You could represent the balls as points inside R2. You’d need a generic way to
construct coordinates for them, and a generic way to describe the diagonals. That’s
a huge pain in the ass for something so simple! Every mathematician would agree it
could be done but it would be a colossal waste of time to actually do it.
stantly reading papers, and there is rarely enough time to verify all the details
of every argument. If you’re not an official reviewer of the paper before it’s been
published, it is usually enough to be convinced that something should be true,
especially if the details are messy but clear, while focusing on the high level
picture. An example with all the essential features of a general solution is an
effective substitute. And this doubles for readers of mathematics too: finding a
simple example with the essential features of a general solution, and testing
claims on the example, is one of the best ways to read a proof!
Next we’re going to see two rigorous methods of proof that are used in all areas of
math.
The first is induction, but you’re likely familiar with it by a different name:
recursion.
with fib(0) = fib(1) = 1. Most programmers have implemented some version of this
function. A proof by induction of a statement P(n) has the same structure:
1. First, prove the base case: directly verify that P(n) is true for the smallest relevant n.
2. Second, do the inductive step, where one uses the assumption that P(n) is true
to prove that P(n + 1) is true. Equivalently, one can use P(n − 1) to prove P(n).
Just like with recursion, you get a chain of proofs: P(6) implies P(7)
implies . . . implies P(n) for any n you like. One bit of terminology: one often
invokes the inductive hypothesis, which is the assumption that P(n) is true. It’s
helpful when P(n) is cumbersome to restate.
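The parallel with the fib function mentioned above is exact: the base cases of the recursion play the role of the base case of the induction, and each recursive call plays the role of the inductive hypothesis. A sketch:

```python
def fib(n):
    # Base cases, analogous to proving P(0) and P(1) directly.
    if n == 0 or n == 1:
        return 1
    # Inductive step, analogous to using P(n-1) and P(n-2) to get P(n):
    # we assume the recursive calls are correct for smaller inputs.
    return fib(n - 1) + fib(n - 2)
```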
Theorem. For all n ≥ 2, $\binom{n}{2} = 1 + 2 + \cdots + (n - 1)$.

Proof. Call the statement to be proved P(n). We prove this by induction for n ≥
2. For the base case10 n = 2, we need to prove

P(2): $\binom{2}{2} = 1$.

We argue $\binom{2}{2}$ is trivially 1. There is only one way to choose two items from a set of
two items.
10 When n = 0 or 1 we are asking how many ways there are to choose two things from
a set of fewer than two things. According to our definition this is zero (which you
saw if you wrote your test cases starting from the simplest ones), and one usually
calls an empty summation to be zero. But the first n that’s not “vacuously” true is n = 2.
For the inductive step, we assume P(n) and use it to prove P(n + 1), where

P(n): $\binom{n}{2} = 1 + 2 + \cdots + (n - 1)$.

P(n + 1): $\binom{n+1}{2} = 1 + 2 + \cdots + n$.

Let X = {1, 2, . . . , n + 1}, and count the number of ways to
pick two elements from X. Note that we are using numbers as elements of X instead
of
“arbitrary objects.” We might have instead called them “ball 1, ball 2, ball 3” and
discuss how many ways to select two balls from a bin.11 For simplicity we’ll use
the numbers themselves. Now X is a set of size n + 1 and we want to express the size of $\binom{X}{2}$ in terms of
our (inductively assumed) formula for $\binom{n}{2}$. Pick any element of X, say n + 1, and define

Y = X − {n + 1} = {1, 2, . . . , n}.
Now let’s split the elements of $\binom{X}{2}$ into two parts: the part where both chosen elements
are in Y, and the part where one of the two chosen elements is n + 1. Since there
are no other options and no overlap between the two options, we can add the sizes
of both parts. The first part is exactly $\binom{Y}{2}$, whose size is given by the inductive
hypothesis. The second part pairs n + 1 with one of the n elements of Y, so it has
size n.12 In total, the size of $\binom{X}{2}$ is

1 + 2 + · · · + (n − 1) + n,
Try a small example: write down the elements of $\binom{X}{2}$. Follow the steps through the inductive step of the proof
on this example, and your understanding of the general case will feel like an
epiphany.
12 If you read this part carefully, you’ll notice we’re defining a bijection. One
can define the mapping as f({a, b}) = min(a, b). Then f is a bijection between Y and the subset {S ∈ $\binom{X}{2}$ : n + 1 ∈ S}.
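Before (or after) wading through the proof, it’s worth testing the claim numerically, in the spirit of the test cases mentioned in the footnote. A sketch using brute-force enumeration of unordered pairs:

```python
from itertools import combinations

def choose_two_size(n):
    """The size of (X choose 2) for X = {1, ..., n}, by listing all pairs."""
    return len(list(combinations(range(1, n + 1), 2)))

# The theorem: |X choose 2| = 1 + 2 + ... + (n - 1).
for n in range(2, 25):
    assert choose_two_size(n) == sum(range(1, n))
```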
The second proof technique is called “proof by contradiction.” There’s a simple
puzzle I often use to illustrate the technique.
You’re at a party. You’re chatting with your friend, and out of curiosity you ask
how many friends he has at the party. He counts them up, there are five, and you
realize that you also have five friends at the party. What a coincidence! Putting
on your mathematician hat, you poll everyone at the party and you’re shocked to
find that a few other people also have five friends at the party. The puzzle is: is
this true of every party? Maybe not five exactly, but will there always be at least
two people with the same number of friends who are at the party?
Before I give the solution by contradiction, let’s iron out what I mean by
“friendship.” I insist that friendship is symmetric: you can’t be friends with
someone who is not friends with you. And moreover you can’t be friends with
yourself.13
You’ll appreciate the answer to this problem best if you spend some time trying to
solve it first.
Back already? The answer is yes, there will always be a pair of people with the
same number of friends. The technique we use to prove it is called proof by
contradiction. It works by assuming the opposite of what you want to prove is true,
and using that assumption to deduce nonsense.
Proof. Suppose for the sake of contradiction that there is some party where
everybody has a different number of friends at the party. Say the party has n > 1
people, then everyone must have between zero and n − 1 friends. Since there are n
people and n different numbers between zero and n − 1, we can map each person to
the number of friends they have, and this map will be a bijection. Now here comes
the contradiction: someone must have zero friends at the party, and someone must
have n − 1 friends, i.e., someone must be friends with everyone. But the person who
is friends with everyone must be friends with the person that has no friends! The
only way to resolve this contradiction is if the original assumption is actually
false. That is, there must be two people with the same number of friends.
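You can spot-check this theorem by generating random parties, modeled (as in the footnote) as simple undirected graphs, where friendship counts are vertex degrees. A sketch:

```python
import random
from itertools import combinations

def has_repeated_friend_count(n, p=0.5):
    """Build a random party of n people with symmetric friendships, and
    check whether two people have the same number of friends."""
    friends = {person: set() for person in range(n)}
    for a, b in combinations(range(n), 2):
        if random.random() < p:  # friendship is symmetric by construction
            friends[a].add(b)
            friends[b].add(a)
    counts = [len(friends[person]) for person in range(n)]
    return len(set(counts)) < n  # a repeat means fewer distinct counts

# The theorem says this always holds for n > 1.
assert all(has_repeated_friend_count(10) for _ in range(100))
```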
This is how every proof by contradiction goes, but they’re usually a bit more
concise. They always start with, “Suppose to the contrary” to signal the method. And there
is no warning when the contradiction will come. A proof writer usually just states
the contradiction and follows it with “which is a contradiction,” ending the proof.14
The point of a proof by contradiction is to get an object with a property that you
can work with. If you’re trying to prove that no object with some special property
exists, a proof by contradiction gives you an instance of such an object, and you
can use its special property to go forward in the proof. In this case the object was a special
friendship count among partygoers, and in the next section we’ll apply the same
logic to “marriages.”

13 Looking forward to Chapter 6 on graph theory, we’re saying that the social
connections at our “party” form a simple, undirected graph.

14 A professor of mine had a funny refrain to end his proofs by contradiction. If,
say, x was assumed to be prime, he’d arrive at a contradiction and say, “and this
is very embarrassing for x because it was claiming to be prime.”
For those readers who are interested in a bit more detail about what makes a
mathematical proof, or how to approach proving things, in this second edition I
added two appendices that may help. Appendix B contains more detail about the
formalities underlying proofs, along with a section at the end called “How does
one actually prove things?” Appendix C contains a list of books under “Fundamentals
and Foundations” that cover the basics of set theory, proofs, and problem solving
strategies. Readers of the first edition have told me that following along with
these books has helped immensely.
Now we’re ready to apply the tools in this chapter to implement a Nobel Prize-
winning algorithm for the stable marriage problem. The problem is set up as
follows. Say you have n men and n women. Your end goal is to choose who should
marry whom. Same-sex marriages are excluded, not for political or religious reasons
but because it’s a more difficult problem. So if we call M the men and W the women,
our output will be a bijection M → W describing the marriages (or equivalently W →
M). I will freely switch between “bijection” and “marriage” in this section.
Of course, we don’t just want any bijection. This is where the “stable” part comes
in.
We want to choose the marriage so that everyone is happy in some sense. Let’s make
this precise. Say that each man has a ranking of the women, mathematically a
bijection W → {1, 2, . . . , n}, with 1 being the most preferred and n being the
least. In other words, if we call the bijection p then p(w) < p(x) means that
this particular man prefers woman w over woman x. Likewise, each woman has a
ranking of the men M → {1, 2, . . . , n}.
Now we obviously can’t ensure that every woman gets her top choice and vice versa;
the men could all prefer the same woman. So we need a subtler notion of happiness:
that no (man, woman) pair mutually prefer each other over their assigned partners.
Before I state what “not cheating” means mathematically for the marriage problem, I
encourage you to write down a small example of sets M, W of size n = 4, rankings
pref_w(m) for each w ∈ W and pref_m(w) for each m ∈ M, and a candidate marriage
f : M → W.

2. The pair m and w mutually prefer each other over their assigned matches.16 I.e.,
both pref_m(w) < pref_m(f(m)) and pref_w(m) < pref_w(f⁻¹(w)).
In other words, the bijection is called stable if there is no pair of people with
mutual incentive to cheat on their assigned spouses. This is not to say cheating
can’t happen, but if it does one of the two involved will be “lowering their
standards.”
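The definition translates directly into a checker. Here’s a sketch, with each person’s ranking represented as a dict from partner to rank (this representation and the function name are my own choices for illustration, not the book’s code):

```python
def is_stable(marriage, men_prefs, women_prefs):
    """marriage maps each man to his assigned woman; men_prefs[m][w] is m's
    rank of w (lower is better), and likewise women_prefs[w][m].
    Returns True when no (man, woman) pair mutually prefer each other
    over their assigned partners."""
    husband = {w: m for m, w in marriage.items()}
    for m, assigned_w in marriage.items():
        for w, assigned_m in husband.items():
            if w == assigned_w:
                continue
            m_prefers_w = men_prefs[m][w] < men_prefs[m][assigned_w]
            w_prefers_m = women_prefs[w][m] < women_prefs[w][assigned_m]
            if m_prefers_w and w_prefers_m:
                return False  # a mutual incentive to cheat
    return True
```

For example, with two men and two women who all rank partner 0 first, the marriage {0: 0, 1: 1} is stable while {0: 1, 1: 0} is not.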
The algorithmic question is, given lists of preferences as input, can we find a
stable marriage? Can we even guarantee a stable marriage will exist for any set of
preferences?
The answer to both questions is yes, and it uses an algorithm called deferred
acceptance.
In the first round, every man proposes to his most preferred woman, and each woman
tentatively holds on to her favorite proposal while rejecting the rest. The
rejected men are sad, but in the next round they recover and propose to their
next most preferred woman, and again the women reject all but one. The men keep
proposing until every man is tentatively held by some woman, or until all women
have rejected them. That is not a happy place to imagine. But actually, the theorem
that we’ll prove says that this process always ends with each woman holding onto a
man, and no men are left out; the set of women’s held picks forms a stable
bijection.
Before we prove that the algorithm works, let’s state it more formally in Python
code. A complete working program is available on this book’s Github repository.17
In the interest of generality, I’ve defined classes Suitor and Suited to
differentiate: Suitors propose to Suiteds.
class Suitor:
    def __init__(self, id, preference_list):
        self.preference_list = preference_list
        self.index_to_propose_to = 0
        self.id = id

    def preference(self):
        return self.preference_list[self.index_to_propose_to]

    def post_rejection(self):
        self.index_to_propose_to += 1
The Suitor class is simple. Instances are uniquely identified by an id, which I’m
defining to be the index in a global list of Suitors. A Suitor has a
preference_list, which is a list of Suited ids sorted from most preferred to least
preferred. The preference method returns the id of the Suited that the Suitor will
propose to next, and post_rejection moves on to the next choice after a rejection.
class Suited:
    def __init__(self, id, preference_list):
        self.preference_list = preference_list
        self.held = None
        self.current_suitors = set()
        self.id = id

    def reject(self):
        """Hold onto the best current Suitor, and return the rest as rejected."""
        if len(self.current_suitors) == 0:
            return set()

        self.held = min(
            self.current_suitors,
            key=lambda suitor: self.preference_list.index(suitor.id))
        rejected = self.current_suitors - set([self.held])
        self.current_suitors = set([self.held])
        return rejected

    def add_suitor(self, suitor):
        self.current_suitors.add(suitor)
Here current_suitors are the new proposals in a given round, and held is the
Suited’s held pick. In the method reject, a Suited looks at all her current
suitors, chooses the best in her preference_list, and returns all others as
rejected Suitors.
Finally, we have the main routine for the deferred acceptance algorithm.
def stable_marriage(suitors, suiteds):
    unassigned = set(suitors)

    while len(unassigned) > 0:
        for suitor in unassigned:
            next_to_propose_to = suiteds[suitor.preference()]
            next_to_propose_to.add_suitor(suitor)
        unassigned = set()

        for suited in suiteds:
            unassigned |= suited.reject()

        for suitor in unassigned:
            suitor.post_rejection()

    return dict([(suited.held, suited) for suited in suiteds])
The dictionary at the end is the type we use to represent a bijection. Now let’s
prove this algorithm always produces a stable marriage.
We will argue that the algorithm terminates by monotonicity. Here’s what I mean by
that: say you have a sequence of integers a_1, a_2, . . . which is monotonically
increasing, meaning that a_1 < a_2 < · · ·. Say moreover that you know none of the
a_i are larger than 50 (the a_i are bounded from above), but each a_{i+1} ≥ a_i + C for some
constant C > 0. Then it’s trivial to see that either the sequence stops before it
hits 50, or eventually it hits 50.
To show an algorithm terminates, you can cleverly choose an integer a_t for each
iteration t of the core loop, and show that a_t is monotonically increasing (or
decreasing) and bounded. Then show that if the algorithm hits the bound then it’s
forced to finish, and otherwise it finishes on its own.
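As a toy illustration of the technique (my example, not the book’s), here is a loop instrumented with such a quantity: a_t = hi − lo is nonnegative and decreases by at least 1 each iteration, so the loop must terminate.

```python
def binary_search(items, target):
    """Find the index of target in a sorted list, or -1 if absent."""
    lo, hi = 0, len(items)
    while lo < hi:
        width_before = hi - lo  # the monotone quantity a_t
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
        # progress: a_{t+1} <= a_t - 1, and a_t >= 0, forcing termination
        assert hi - lo <= width_before - 1
    return -1
```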
Theorem 4.15. The deferred acceptance algorithm always terminates, and the
bijection produced at the end is stable.
Proof. For the deferred acceptance algorithm we have a nice monotonic sequence. For
round t set a_t to be the sum of all the Suitors’ index_to_propose_to variables.
Recall that this variable also represents the number of rejections of each Suitor.
Since there are exactly n preferences in the list and exactly n Suitors, we get the
bound a_t ≤ n²
(each Suitor could be at the very end of their list; come up with an example to
show this can happen!).

Moreover, in each round one of two things happens. Either no Suitor is rejected,
in which case every Suitor is held by some Suited and the algorithm terminates, or
some Suitor is rejected and a_{t+1} ≥ a_t + 1. Since a_t is bounded by n², the
algorithm must terminate.
Now that we’ve shown the algorithm will stop, we need to show the bijection f
produced as output is stable. The definition of stability says there is no Suitor m
and Suited w with mutual incentive to cheat, so for contradiction’s sake we’ll
suppose that the f output by the algorithm does have such a pair, i.e., for some m,
w, both pref_m(w) < pref_m(f(m)) and pref_w(m) < pref_w(f⁻¹(w)).

What had to happen to w during the algorithm? Well, m ended up with f(m) instead
of w, and if pref_m(f(m)) > pref_m(w), then m must have proposed to w at some earlier
round. Likewise, the held pick of w only increases in quality when w rejects a
Suitor, but w ended up with some Suitor f⁻¹(w) while pref_w(m) < pref_w(f⁻¹(w)). So at some
round w must have rejected m in favor of a Suitor she preferred to m. Since her held
pick only improves from then on, she must prefer f⁻¹(w) to m, i.e., pref_w(f⁻¹(w)) <
pref_w(m), contradicting our assumption. Hence f is stable.
>>> suitors = [
>>> suiteds = [
{Suitor(0): Suited(3),
 Suitor(1): Suited(2),
 Suitor(2): Suited(5),
 Suitor(3): Suited(0),
 Suitor(4): Suited(4),
 Suitor(5): Suited(1)}
1. Sets and functions between sets are a modeling language for mathematics.
2. Bijections show up everywhere, and they’re a central tool for understanding the
same object from two different perspectives.
3. Mathematicians usually accept silent type conversions between sets when it makes
sense to do so, i.e., when there is a very clear and natural bijection between the
two sets.
5. A picture or example that captures the spirit of a fully general proof is often
good enough.
4.6 Exercises
4.1. Write down examples for the following definitions. A set A (finite or
infinite) is called countable if it is empty, or if there is a surjection N → A.
The power set of a set A, denoted 2^A, is the set of all subsets of A. For two sets
A, B, we denote by B^A the set of all functions from A to B. This makes sense with
the previous notation 2^A if we think of “2” as the set of two elements 2 = {0, 1},
and think of a function f : A → {0, 1} as describing a subset C ⊂ A by
sending elements of C to 1 and elements of A − C to 0. In other words, the subset
defined by f is C = f⁻¹(1).
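The correspondence between subsets and functions A → {0, 1} in this exercise can be made concrete in code, with a function represented as a dict (a sketch; the names are my own):

```python
def subset_to_indicator(A, C):
    """The indicator function of a subset C of A, as a dict A -> {0, 1}."""
    return {a: 1 if a in C else 0 for a in A}

def indicator_to_subset(f):
    """Recover the subset as the preimage f^{-1}(1)."""
    return {a for a, value in f.items() if value == 1}
```

Composing the two in either order is the identity, which is exactly the bijection between 2^A and the subsets of A.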
4.3. Look up a formula online for the quantity $\binom{n}{k}$, the number of ways to choose k
items from a set of n items.
4.4. Look up a statement of the pigeonhole principle, and research how it is used
in proofs.
4.6. For each n ∈ N, let A_n be a countably infinite set, such that all the A_n have
empty intersection. Prove that the union of all the A_n is countable. Hint: use the
previous problem.
R : x ≥ 1 }?
4.8. I would be remiss to omit Georg Cantor from a chapter on set theory. Cantor’s
Theorem states that the set of real numbers R is not countable. The proof uses a
famous technique called “diagonalization.” There are many expositions of this proof
on the internet, ranging in difficulty. Find one that you can understand and read
it. The magic of this theorem is that it means there is more than one kind of
infinity, and some infinities are bigger than others.

18 Note, this ambiguous notation conflicts with the previous exercise, and takes a
different meaning here. ABC: Always Be Contextualizing.
4.11. Continuing the previous exercise, a Steiner system may not exist for every
choice of n > k > t. Prove that if an (n, k, t)-system exists, then so must an
(n − 1, k − 1, t − 1)-system.
4.12. Continuing the previous exercise, the non-existence of Steiner systems for
some choices of n suggests a modified problem of finding a minimal size family F of
size-k subsets such that every t-size subset is in at least one set in F. For
(n, k, t) arbitrary, find a lower bound on the size of F. Try to come up with an
algorithm that gets close to this lower bound for small values of k, t.
4.13. A generalization of Steiner systems is called a block design. A block design
F is again a family of size-k subsets of X = {1, . . . , n} covering all size-t
subsets, but also with parameters controlling: the number of sets in F that contain
each x ∈ X, and the number of sets covering each size-t subset (i.e., it can be
more than one). Block designs are used in the theory of experimental design in
statistics when, for example, one wants to test multiple drugs on patients, but the
outcome could be confounded by which subset of drugs each patient takes, as well as
which order they are taken in, among other factors.
4.15. The formal mathematical foundations for set theory are called the Zermelo-
Fraenkel axioms (also called ZF-set theory, or ZFC). Research these axioms and
determine how numbers and pairs are represented in this “bare metal” mathematics.
Look up Russell’s paradox, and understand why ZF-set theory avoids it.
4.17. Write a program that extends the deferred acceptance algorithm to the setting
of
“marriages with capacity.” That is, imagine now that instead of men and women we
have medical students and hospitals. Each hospital may admit multiple students as
residents, but each student attends a single hospital. Find the most natural
definition for what a stable marriage is in this context, and modify the algorithm
in this chapter to find stable marriages in this setting. Then implement it in
code. See the chapter notes for historical notes on this algorithm.
4.18. Come up with a version of stable marriages that includes the possibility of
same-sex marriage. This variant is sometimes called the stable roommate problem. In
this setting, there is simply a pool of people that must be paired off, and
everybody ranks everyone else. Perform the full modeling process: write down the
definitions, design an algorithm, prove it works, and implement it in code.
4.19. Is the stable marriage algorithm biased? Come up with a concrete measure of
how
“good” a bijection is for the men or the women collectively, and determine if the
stable marriage algorithm is biased toward men or women for that measure.
Residency Matching
Medical residency matching was the setting for one of the major accomplishments of
Alvin Roth, currently an economics professor at Stanford. He applied this and
related algorithms to kidney exchange markets and schooling markets. Along with
Lloyd Shapley, one of the original designers of the deferred acceptance algorithm,
their work designing and implementing these systems in practice won the 2012 Nobel
Prize in economics.
Measured by a different standard, their work on kidney markets has saved thousands
of lives, put students in better schools, and reduced stress among young doctors.
Roth gives a fascinating talk19 about the evolution of the medical residency market
before he stepped in, detailing how students and hospitals engaged in a maniacal
day-long sprint of telephone calls, and all the ways unethical actors would try to
game the protocol in their favor.
19 https://youtu.be/wvG5b2gmk70
Marriage
Please don’t treat marriage as an allocation problem in real life. I hope it’s
clear that the process of doing mathematics—and the modeling involved in converting
real world problems to sets—involves deliberately distilling a problem down to a
tractable core. This often involves ignoring features that are quite crucial to the
real world. A quote often attributed to Albert Einstein speaks truth here, that “a
problem should be made as simple as possible, but no simpler.” Indeed, the unstated
hope is that by analyzing the simplified, distilled problem, one can gain insights
that are applicable to the more complex, realistic problem. Don’t remove the core
of the problem when phrasing it in mathematics, but remove as much as you need to
make progress. Then gradually restore complexity until you have solved the original
problem, or fail to make more progress. Marriage is used as a communication device
for this particular simplification. It’s not the problem being solved.
The idea that one can reduce complex human relationships to a simple allocation
problem is laughable, and borderline offensive. In the stable marriage problem the
actors are static, unchanging symbols that happen to have preferences. In reality,
the most important aspect of human relationships is that people can grow and
improve through communication, introspection, and hard work.
Chapter 5
– Henri Poincaré
names, how they overload and abuse notation, and how the words they use to describe
things are essentially nonsense words made up for the sole purpose of having a new
word.
This causes bizarre sentences like “Map each co-monad to the Hom-set of quandle
endomorphisms of X.” I just made that up, by the way, though each word means
something individually. One question programmers rarely ask is why mathematicians
do this. Is it to feign complexity? Historical precedent? A hint of malice?
Of course there are bad writers out there, along with people who like to sound
smart.
On the other hand, good software is measured (after it’s deemed to work) by
maintainability, extensibility, modularization, testability, robustness, and a
whole host of other metrics which are primarily business metrics. You care about
modularization because you want to be able to delegate work to many different
programmers without stepping on each other’s toes. You want extensibility because
customers never know what features they actually want until you finish designing
the features they later decide are no good.
You want to ensure that your software is idiot-proof because your company just
hired three idiots. These metrics are good targets because they save time and
money.
Mathematicians don’t experience these scaling problems to the same degree of tedium
because mathematics isn’t a business. Mathematics isn’t idiot-proof because the
success of a mathematical theory doesn’t depend on whether the next idiot that comes along
understands it.1 In fact, mathematical sophistication in the business world is
extraordinary.
I should make a side note that saying “mathematics isn’t a business” is overly
naive.
Mathematicians need to make money just like everyone else, and this manifests
itself in some strange practices in academic journals, conferences, and the
multitude of committees that decide who is worth hiring and giving tenure.
Mathematicians, like folks in industry, bend over backwards to game (or
accommodate) the system. But all of that is academia. What I’m talking about is
established mathematics which has been around for decades, or even centuries, which
has been purified of political excrement. This applies to basically every topic in
this book.
That’s not to say that mathematics isn’t designed to scale. To the contrary, the
invention of algebraic notation was one of humanity’s first massively scalable
technologies.
On the other end of the spectrum, category theory—which you can think of as a newer
foundation for math roughly based on a new notation that goes beyond what sets and
functions can offer—provides the foundation for much of modern pure mathematics.
It’s considered by many as a major advancement.
Rather than being designed to scale to millions of average users, mathematics aims
to scale far up the ladder of abstraction. Algebra—literally, the marks on paper—
boosted humanity from barely being able to do arithmetic through to today’s machine
learning algorithms and cryptographic protocols. Sets, which were only invented in
the late 1800’s, hoisted mathematical abstraction even further. Category theory is
a relative rocket fuel boosting one through the stratosphere of abstraction (for
better or worse).
The result of this, as the argument goes, is that mathematicians have optimized
their discourse for more relevant metrics: maximizing efficiency and minimizing
cognitive load after deep study.
• Variable names
• Operator overloading
• Sloppy notation
they know that when they see these letters out of context, they should at least
behave like a natural number and a function, respectively. Seeing n(f) out of
context would momentarily startle me, though I can imagine situations making it
appropriate.2 Similarly, if f is a function and you can use f to construct another
function in a “canonical” (forced, unique) way, then a mathematician might typically
adorn f with a star like f∗.
Two related objects often inhabit the same letter with a tick, like x and x′. Even
if you forget what they represent, you know they’re related.
Every field of mathematics has its own little conventions that help save time. This
is especially true since mathematics is often done in real time (talking with
colleagues in front of a blackboard, or speaking to a crowd). The time it takes to
write f∗ while saying out loud “the canonical induced homomorphism,”3 is much
faster than writing down InducedHomomorphismF in ten places. And then when you need
an h∗ to compose f∗h∗, half of the characters help you distinguish it from h∗f∗.
Whereas determining the order of
InducedHomomorphismF.compose(InducedHomomorphismH)
is harder with more characters, and Gauss forbid you have to write down an identity
about the composition of three of these things! A single statement would fill up an
entire blackboard, and you’d never get to the point of your discussion.
More deeply, there is often nothing more a name can do to elucidate the nature of a
mathematical object. Does saying f∗ really tell you less about what an object is
than something like InducedF? It’s related to f, its definition is somehow
“induced,” and what? The further up the ladder of abstraction you go, the more
contrived these naming conventions would get. Rather than say, for example,
But this is a trade-off. You can use long words that make it difficult to put
everything you want to say in front of your face at the same time—thus making it
harder to reason. Or, you can use fonts and foreign alphabets to differentiate
concepts. Sans-serif is for one purpose, the curly-scripty font is for another.
Why not invent a better name? They do! Just later. In fact, because the expression
H 1( X, O∗) is so important in the study of algebraic geometry, it was renamed to
Pic( X) named after Picard who studied them. But it might take decades to get to
the point where you realize this object is worth giving a name, and in the mean
time you just can’t use 80-character names and expect to get things done.
One reason mathematicians can get away with single-character variable names is that
they spend so much time studying them. When a mathematician comes up with a new
definition, it’s usually the result of weeks of labor, if not months or years!
These objects aren’t just variables in some program whose output or process is the
real prize. The variables represent the cool things! It’s as if you returned to
rewrite and recheck and retest the same twenty-line program every day for a month.
You’d have such an intimate understanding of every line that you could recite it
while drunk or asleep. Now imagine that the intimate understanding of every line of
that program was the basis of every program you wrote for the next year, and you
see how ingrained this stuff is in the mind of a mathematician.
Mathematicians don’t just write a proof and file it away under “great tool; didn’t
read.”
They constantly revisit the source. It’s effective to gild meaning and subtext into
the bones of single letters, because after years you don’t have to think about it
any more. It eliminates the need to keep track of types. Clearly f is a function, z
is probably a complex variable, and everyone knows that ℵ₀ is the countably
infinite cardinal. If you use b and β in the same place, I will know that they are
probably related, or at least play analogous roles in two different contexts, and
that will jump-start my understanding in a way that descriptive variable names do
not.
Operator overloading. Much of what I said above for variable names holds for
operator overloading too. One key feature that stands out for operator overloading
is that it highlights the intended nature of an operation.
We’ll get to this more in Chapter 9, but mathematicians use just a handful of
boolean logic operations for almost everything. There are the standard inequalities
and equalities. Then there are operators that look like ∼, ≃, or ≅, which signal "equal, up to
differences that we don't care about." In Java terms, mathematicians regularly roll
their own .equals() methods, with proofs that their notions behave as equality should.
they prove it satisfies the properties required of an equivalence relation, which
is the mathematical version of saying “equals agrees with hashing and toString.”
And so typically mathematicians will drop whatever the original operator symbol was
and replace it with the equal sign. The core properties of = are respected even if
not identity. We’ll see this in detail in Chapters 9 and 16, but the same idea goes
behind the reuse of standard arithmetic operations like addition and
multiplication. It suggests what behavior to expect from the operation. For
example, it is considered bad form to use the symbol + for an operation that is not commutative.
With this in mind it’s the mathematician’s turn to criticize programmers. For
example,
reading programming style guides has always amused me. It makes sense for a company
to impose a style guide on their employees (especially when your IDE is powerful
enough to auto-format your programs) because you want your codebase to be uniform.
In the same way, a mathematician would never change notational convention in the
same paper, except to introduce a new notation. But to have a programming language
designer declare style edicts for the entire world, like the following from the
Python Style Guide, is just ridiculous:
Yes: import os
     import sys

No:  import sys, os
Okay, so you have an arbitrary idea of what a pretty program looks like, but
wouldn’t you rather spend that time and energy on actually understanding and
writing a good program? Besides, if there were truly a good reason for the first
option, why wouldn’t the language designer just disallow the second option in the
syntax? Of course, programmers get away with it because automated tools apply style
guides for them.
It’s much harder to do that in math, where the worst offenses are not resolvable
(or discoverable!) from syntax alone. Still, I don't doubt there could be some
progress made in automating some aspects of a mathematical style guide.
In an ideal world, a compiler would see how I use the “stdout” variable and be able
to infer the semantics from a shared understanding about the behavior of standard
output in basically every program ever. This would eliminate the need to declare
module imports or even define stdout! That’s basically how math solves the problem
of overloaded operators.
Sloppy notation. This is probably the area where mathematicians get the most flak,
and where they could easily improve their communication with those aiming to learn.
A simple first example is summation notation, ∑, which has four parts: an
index variable, a minimum and maximum value for the index, and an expression being
summed. So

∑_{i=0}^{9} (2i + 1)

sums the first ten positive odd integers. This is the kind of notation you learn first.
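In programming terms, that summation is exactly a for loop; a quick sketch:

```python
# Sum the first ten positive odd integers: the loop is the ∑,
# range(10) is the index bounds, and 2*i + 1 is the summed expression.
total = 0
for i in range(10):
    total += 2 * i + 1

print(total)  # prints 100
```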
However, this notation is so convenient that it’s been overloaded to include many
other syntax forms. A simple one is to replace the increment-by-one range of
integers with a
"all elements in this set" notation. For example, if B is a set, you can write

∑_{b∈B} b²

to sum the squares of all the elements of B.
But wait, there's more! It often happens that B has an implicit, or previously
defined, order of the elements B = {b_1, . . . , b_n}, in which case one takes the
liberty of writing ∑_i b_i² ("the sum over relevant i") with no mention of the set
in the (local) syntax at all!
As we saw in Chapter 2 with polynomials, one can additionally add conditions below
the index to filter only desired values, or even have the constraint implicitly
define the
variable range! So you can write the following to sum b_i² + 3 over all odd b_i ∈ B:

∑_{b_i odd} (b_i² + 3)
The reason this makes any sense is that, as is often the case, the notation comes
from speech. You're literally saying, "over all b_i that are odd, sum the terms
b_i² + 3." Equations are written to mimic conversation, not the other way around.
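The filtered sum translates directly to a generator expression with a filter clause; a sketch with a made-up set B:

```python
B = {1, 2, 3, 4, 5}  # a made-up example set

# ∑ over odd b in B of (b² + 3): the condition under the ∑ becomes
# the `if` clause of the generator expression.
odd_sum = sum(b**2 + 3 for b in B if b % 2 == 1)
print(odd_sum)  # (1 + 3) + (9 + 3) + (25 + 3) = 44
```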
You see it when you’re in the company of mathematicians explaining things. They’ll
write their formulas down as they talk, and half the time they’ll write them
backwards!
For a sum, they might write the body of the summation first, then add the sum sign
and the index. Because out loud they'll be emphasizing the novel parts of the
equation first, filling in the surrounding parts for completeness.
Finally, the things being summed need not be numbers, so long as addition is
defined for those objects and it satisfies the properties addition should satisfy.
In Chapter 10 we’ll see a new kind of summation for vectors, and it will be clear
why it’s okay for us to reuse
∑ in that context. The summing operation needs to have properties that result in
the final sum not depending on the order the operations are applied.
Another prominent example of summation notation being adapted for an expert
audience is so-called Einstein notation, in which the ∑ symbol is itself implied
from context! For example, rather than write

y = ∑_{k=1}^{n} a_k x_k,

the sum and the bounds on the indices are implied from the presence of the repeated
index k, as in

y = a_k x_k.
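In code, an implied repeated-index sum like this is just a dot product; a small sketch with made-up values:

```python
a = [2, 0, 1]  # made-up coefficients a_k
x = [5, 7, 3]  # made-up values x_k

# y = a_k x_k with the sum over the repeated index k implied in the
# notation, but spelled out in the code.
y = sum(a_k * x_k for a_k, x_k in zip(a, x))
print(y)  # 2*5 + 0*7 + 1*3 = 13
```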
What makes all of this okay is when the missing parts are fixed throughout the
discussion or clear from context. What counts as context is (tautologically)
context dependent.
More often than not, mathematicians will preface their abuse to prepare you for the
new mental hoop. The benefit of these notational adulterations is to make the
mathematics less verbose, and to sharpen the focus on the most important part: the
core idea being presented. These “abuses” reduce the number of things you see, and
as a consequence reduce the number of distractions from the thing you want to
understand.
Chapter 6
Graphs
One will not get anywhere in graph theory by sitting in an armchair and trying to
understand graphs better. Neither is it particularly necessary to read much of the
literature before tackling a problem: it is of course helpful to be aware of some
of the most important techniques, but the interesting problems tend to be open
precisely because the established techniques cannot easily be applied.
– Tim Gowers
In this chapter we won’t learn any new tools. Instead we’ll apply the tools above
to study graphs. Most programmers have heard about graphs before, perhaps in the
context of breadth-first and depth-first search or data structures like heaps.
Instead of discussing the standard applications of graphs to computer science,
we’ll focus on a less familiar topic that still finds use in computer science:
graph coloring.
The definition of a graph is best done by picture, as in Figure 6.1. Take some
“things”
and describe which things are “connected.” The result is a graph. As a simple
example, the
“things” might be airports, and two airports are “connected” if there is a flight
between
Figure 6.1: An example of a graph with labeled vertices and edges.
the two. Or the "things" are people, and two people are "connected" if they are friends. We draw the things
and connections using dots and lines to erase the application from our minds. All
we care about is the structure of the connections.
Let’s lay out the definitions, using sets as the modeling language. The “things”
are called vertices (or often nodes) and the “connections” are called edges (or
links). For shorthand in the definition, I’ll reuse a definition from Chapter 4 for
the set of all ways to choose two things from a set.
(V choose 2) = {{v_1, v_2} : v_1 ∈ V, v_2 ∈ V, v_1 ≠ v_2}.

This is like V × V, but the order of the pair does not matter.
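This set is easy to enumerate in code; a quick sketch using the standard library (the vertex names are made up):

```python
from itertools import combinations

V = {'a', 'b', 'c', 'd'}

# Every way to choose two distinct vertices, ignoring order; each
# frozenset plays the role of an unordered pair {v_1, v_2}.
possible_edges = {frozenset(pair) for pair in combinations(V, 2)}
print(len(possible_edges))  # C(4, 2) = 6
```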
A graph G is a pair (V, E), where V is a finite set of vertices and the edge set E
is a subset of (V choose 2).¹

Alternatively, one can think of E as just any set, and require a function
f : E → (V choose 2) to
describe which edges connect which pairs of vertices. This view is used when one
wants to define a graph in a context where the vertices are complicated. We will
briefly see one from compiler design later in this chapter. Despite the definition
of an edge e ∈ E
as a set of size two like {u, v}, mathematicians will sloppily write it as an
ordered pair e = (u, v).²
Here’s some notation and terminology used for graphs. We always call n = |V | the
number of vertices and m = |E| the number of edges, and for us these values will
always be finite. When two vertices u, v ∈ V are connected by an edge e = (u, v),
we call them adjacent; the set of vertices adjacent to v is called the neighborhood
of v, written N(v).

1 This is not the most general definition for a graph, but we will not need graphs
with self loops, weights, double edges, or direction. You'll explore some of these
extensions in the exercises.
2 I have suspicions about why this abuse is commonplace. Curly braces are more
cumbersome to draw than parentheses, and in the typesetting language LaTeX,
typesetting braces requires an escape character. They’re also visually harder to
parse when nested. Finally, directed edges use the ordering of a tuple.
The size of a neighborhood (and the number of incident edges) is called the degree
of a vertex, and the function taking a vertex v to its degree is called deg : V →
Z. To practice the new terms, see Figure 6.2, which labels the graph from
Figure 6.1 (vertices are labeled v, edges e).
Another concept we’ll need in this chapter is the concept of a connected graph.
First, a path in a graph is a sequence of alternating vertices and edges
(v_1, e_1, v_2, e_2, . . . , v_t) so that each e_i = (v_i, v_{i+1}) connects the
two vertices next to it in the list. Visually, a path is just a way to traverse
through the vertices of G by following edges from vertex to vertex.
In Figure 6.2, there are many different paths from v 4 to v 6, four of which do not
repeat any vertices. Many authors enforce that paths do not repeat vertices by
definition, and give the name “trail” or “walk” to a path which does repeat
vertices. For us, the difference won’t matter. A cycle is a path that starts and
ends at the same vertex.
A graph is called connected if there is a path from each vertex to each other
vertex, and otherwise it is called disconnected. Equivalently (you will prove this
in an exercise), G = (V, E) is disconnected if and only if V can be split into two
nonempty subsets X, Y such that no edge of G has one endpoint in X and the other in Y.
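Connectivity is easy to test with a breadth-first search; a minimal sketch on a plain adjacency-dict representation (not the igraph one used later in the chapter):

```python
from collections import deque

def is_connected(adj):
    """Return True if the graph, given as a dict mapping each vertex to a
    list of its neighbors, has a path between every pair of vertices."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    # Connected exactly when the search reached every vertex.
    return len(seen) == len(adj)
```

For example, the path a-b-c is connected, while two isolated vertices are not.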
By now you should know to write down examples for small n and k before moving on.
Because this is a crucial definition, here is a more complicated example. The
Petersen
graph is shown in Figure 6.3. The Petersen graph has a distinguished status in
graph theory as a sort of smallest serious unit test. Conjectures that are false
tend to fail on the Petersen graph. 3 The Petersen graph is 3-colorable (find a 3-
coloring!) but not 2-colorable.
Definition 6.3. The chromatic number of a graph G, denoted χ( G), is the minimum
integer k for which G is k-colorable.
Recall from Chapter 4 that mathematicians often define functions without knowing
how to compute them. The chromatic number is an excellent example. We define the
concept to clarify what it is we want to study, and the modeling language of sets
allows us to start to reason about it.
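For small graphs, nothing stops us from computing χ(G) by brute force, even without an efficient algorithm; a sketch (exponential time, so toy inputs only):

```python
from itertools import product

def chromatic_number(vertices, edges):
    """Smallest k for which some assignment of k colors to the vertices
    gives every edge two differently colored endpoints."""
    vertices = list(vertices)
    for k in range(1, len(vertices) + 1):
        for assignment in product(range(k), repeat=len(vertices)):
            coloring = dict(zip(vertices, assignment))
            if all(coloring[u] != coloring[v] for (u, v) in edges):
                return k
    return 0  # only reached for the empty vertex set

# The triangle needs three colors.
triangle = chromatic_number('abc', [('a', 'b'), ('b', 'c'), ('a', 'c')])
print(triangle)  # 3
```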
If you believe that the Petersen graph is not 2-colorable—or you do the exercise
that proves this—then we know the Petersen graph has chromatic number 3. Here is a
simple fact about the chromatic number.
3 Why? Part of it is that the Petersen graph is highly symmetric. We’ll revisit
this in the exercises for Chapter 16.
Figure 6.3: The Petersen graph.
The greedy coloring algorithm, which colors vertices one at a time and gives each
the smallest color not used by its neighbors, uses at most max_{v∈V} deg(v) + 1
colors; this is Proposition 6.4. There is a sneaky, elegant proof of it. See if you
can find it. On the other hand, this bound can be quite loose. Here “loose” means
that there are graphs which meet the conditions of the proposition, but whose true
χ(G) is much smaller than the proposition's bound. Consider the "star" graph, which
has n vertices and only one vertex of degree n − 1, pictured in Figure 6.4. Clearly
the star graph is 2-colorable, but the max degree is n − 1. The guarantee of the
proposition is effectively useless.
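The greedy algorithm behind that bound fits in a few lines; a sketch on adjacency dicts (the vertex order is arbitrary, which is exactly why the bound can be loose):

```python
def greedy_color(adj):
    """Color vertices one at a time, giving each the smallest color not
    already used by a neighbor. Uses at most (max degree) + 1 colors."""
    coloring = {}
    for v in adj:
        taken = {coloring[w] for w in adj[v] if w in coloring}
        color = 0
        while color in taken:
            color += 1
        coloring[v] = color
    return coloring

# The star graph: center 0 adjacent to 1..5. Greedy uses only 2 colors,
# far below the proposition's guarantee of (max degree) + 1 = 6.
star = {0: [1, 2, 3, 4, 5], **{i: [0] for i in range(1, 6)}}
coloring = greedy_color(star)
print(len(set(coloring.values())))  # 2
```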
This perspective can be used to design coloring algorithms. Start with an improper
or unfinished coloring, and fiddle with it to correct the improprieties. We will do
this in the main application of this chapter, coloring planar graphs. But right now
we’re going to take a quick detour to see why graph coloring is useful.
The wishy-washy way to motivate graph coloring is to claim that many problems can
be expressed as an “anti-coordination problem,” where you win when no agent in the
system behaves the same as any of their neighbors. A totally made up example is radio

4 A partition of X is a set of non-overlapping (disjoint) subsets A_i ⊂ X whose
union is ∪_i A_i = X.
frequencies. Radio towers pick frequencies to broadcast, but if nearby towers are
broadcasting on the same frequency, they will interfere. So the vertices of the
graph are towers, nearby towers are connected by an edge, and the colors are
frequencies.
The connection to graph coloring is beginning to reveal itself: the vertices are
the logical variables and the colors are physical registers, but I haven’t yet said
how to connect two vertices by an edge. Intuitively, it depends on whether the
logical variables “overlap”
in the scope of their use. The structure of scope overlap is destined to be studied
with graph theory.
To simplify things, we’ll do what a compiler designer might reasonably do, and
compile a program down to almost assembly code, where the only difference is that
we allow infinitely many “virtual” registers, which we’ll just call variables. So
for a particular program P, there is an n_P ∈ N that is the number of distinct
variable names used in the program. Each of the integers 1, . . . , n_P is a vertex in G.
As an illustrative example, say that the almost-compiled program looks like this,
where the dollar sign denotes a variable name:
whileBlock:
$41 = $41 - 1
endBlock:
In this example variables 41 and 42 cannot share a physical register. They have
different values and are used in the same line to compute a difference. Call a
variable live at a statement in the code if its value is used after the end of that
statement. Thinking of it in reverse: a variable is dead in all of the lines of
code between when it was last read and when it is next written to. Whenever a
variable is dead we know it’s safe to reuse its physical register (storing the
value of the dead variable in memory).
Now we can define the edges. Two variables $i and $j "interfere," and hence we add
the edge (i, j) to G, if they are ever live at the same time in the program. With
a bit of work (not coincidentally, using graphs to do a flow analysis), one can
efficiently compute the places in the code where each variable is live and
construct this graph G. Then if we can compute the chromatic number of G and find
an actual χ(G)-coloring, we can
assign physical registers to the variables according to the coloring. Without some
deeper semantic analysis, this provides the most efficient possible use of our
physical registers.⁵
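To make interference concrete, here is a toy sketch: given the set of statement indices at which each variable is live (the live sets below are invented for illustration), two variables interfere exactly when those sets overlap:

```python
from itertools import combinations

# live[v] = the set of statement indices at which variable v is live.
# These live sets are made up purely for illustration.
live = {
    '$41': {1, 2, 3},
    '$42': {2, 3, 4},
    '$43': {5, 6},
}

# Two variables interfere exactly when their live ranges overlap;
# each interference becomes an edge of the graph G.
edges = {
    frozenset((u, v))
    for u, v in combinations(live, 2)
    if live[u] & live[v]
}
print(sorted(tuple(sorted(e)) for e in edges))  # [('$41', '$42')]
```

A coloring of this graph then assigns registers: $41 and $42 need different registers, while $43 can reuse either.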
Unfortunately, in general you should not hope to compute the chromatic number of
an arbitrary graph. This problem is what’s called “NP-hard,” which roughly means
there is no known provably correct (in the worst case) and provably efficient
algorithm for computing it. Moreover, if there were, the same algorithm could be
adapted to solve a whole class of problems that are also believed to be
intrinsically hard to solve. The notion of efficiency here is—as usual for
algorithm analysis—in terms of the runtime compared to the size of the input as the
input grows. This is called “asymptotic analysis” or “big-O.”
In fact, it is even NP-hard to approximate the chromatic number, in the following
sense: for any fixed constant c < 1, it is NP-hard, given any graph G with n
vertices as input, to output a number Z ≥ χ(G) with the property that Z/χ(G) < n^c.
But I digress. The takeaway is that coloring is a hard problem. This is a sad
result for people who really want to color their graphs, but there are other ways
to attack the problem. You can assume that your graph has some nice structure. This
is what we’ll do in the next section, and there it turns out that the chromatic
number will always be at most 4. Alternatively, you could assume that you know your
graph’s chromatic number, and try to color it without introducing too many
improperly colored edges. We’ll see this approach in Section 6.6.
The condition we’ll impose on a graph to make coloring easier is called planarity.
A graph G = ( V, E) is called planar if one can draw it on a plane in such a way
that no edges cross. Figure 6.6 contains an example.
Here’s a little exercise: come up with an example of a graph which is not planar.
Don't

5 In fact, it can happen that the chromatic number of G is greater than the total
number of registers on the target machine. In this case you have to spill some
variables into memory.
Figure 6.6: An example of a planar graph which can be drawn with no edges crossing.
be surprised if you’re struggling to prove that a given graph is not planar. You
personally failing to draw a specific graph without edges crossing is not a proof
that it is impossible to do so. There is a nice rule that characterizes planar
graphs, but it is not trivial. See the chapter exercises for more.
Now that you’ve tried the exercise: Figure 6.7 depicts two important graphs that
are not planar. The left one is called the complete graph on 5 vertices, denoted
K_5. The word "complete" here just means that all possible edges between vertices
are present. The second graph is called the complete bipartite graph K_{3,3}.
"Bipartite" means "two parts," and the completeness refers to all possible edges
going between the two parts. The subscript of K_{a,b} for a, b ∈ N means there are
a vertices in one part and b in the other.
One nice feature of planar graphs is that when you draw one in such a way that no
edges cross, you get a division of R² into distinct regions called "faces."
Figure 6.8 shows a graph with four faces, noting that by convention I'm calling the
"outside" of the drawing also a face. If we call f the number of faces, and
remember n is the number of vertices and m is the number of edges, then we can
notice⁷ a nice little pattern: n − m + f = 2.
The amazing fact is that this equation does not depend on how you draw the graph!
So long as your drawing has no crossing edges, the value n − m + f will always be
2. We can prove it quite simply with induction.
Theorem 6.5. For any connected planar graph G = (V, E) with at least one vertex,
and any drawing of G in the plane R² defining a set F of faces, we have
|V| − |E| + |F| = 2.
7 Why anyone would have reason to analyze this quantity is a historical curiosity;
it was discovered by Euler for certain geometric shapes in three dimensions called
convex polyhedra. See the following for more: http://mathoverflow.net/q/154498/6429
Figure 6.7: The complete graph K_5 (left) and the complete bipartite graph K_{3,3}
(right), neither of which is planar.

Figure 6.8: A planar drawing of a graph with four faces F_1, F_2, F_3, F_4,
counting the outer face.
Proof. We proceed by induction on the total number of vertices and edges. The base
case is a single isolated vertex, for which |V| = 1, |E| = 0, and |F| = 1 (just the
outer face), so |V| − |E| + |F| = 2 and the theorem holds.
Now suppose we have a graph G for which the theorem holds, i.e., |V| − |E| + |F| = 2,
and we will make it larger and show that the theorem still holds. In particular, we
will do induction on the quantity |V| + |E|. There are two cases: either we add a
new edge connecting two existing vertices, or we add a new edge connected to a new
vertex (which then has degree 1). Adding a vertex by itself is not allowed because
the graph must stay connected at all times.

In the first case, the new edge splits an existing face into two, so the edge count
and the face count each grow by one, and

|V| − (|E| + 1) + (|F| + 1) = |V| − |E| + |F| = 2.
Notice how it does not matter how we drew the edge, so long as it doesn’t cross any
other edges to create more than one additional face. The second case is similar,
except adding an edge connected to a new vertex does not create any new faces.
Convince
yourself that any vertex involved in a path that encloses a face has to have degree
at least two. So again we get that for the new graph |V | + 1 − ( |E| + 1) + |F | =
2. This finishes the inductive step.
So the quantity |V| − |E| + |F| is a number associated to a planar
graph that doesn't depend on the choices made to draw it! This is called an
invariant, and we'll discuss invariants more in Chapter 10 when we study linear
algebra, and Chapter 16 when we study geometry. For now it will remain a deep
mathematical curiosity. Lastly,
note that the connectivity requirement is crucial for the theorem to hold, since a
graph with n vertices and no edges has |V| − |E| + |F| = n + 1.
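As a quick sanity check of the theorem, the well-known vertex, edge, and face counts of a few standard planar drawings (counting the outer face) all satisfy the formula:

```python
# (name, vertices, edges, faces) for some standard planar drawings;
# the face counts include the outer face.
drawings = [
    ('triangle', 3, 3, 2),
    ('tetrahedron graph K4', 4, 6, 4),
    ('cube graph', 8, 12, 6),
]

for name, n, m, f in drawings:
    assert n - m + f == 2, name
print('n - m + f = 2 for every drawing')
```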
The four color theorem says that every planar graph is 4-colorable. This was proved
by Kenneth Appel and Wolfgang Haken in 1976 after being open for over a hundred
years. You may have heard of it because of its notoriety: it was the first major
theorem to be proved with substantial aid from a computer.
Unfortunately the proof is very long and difficult (on the order of 400 pages of
text!). Luckily for us there is a much easier theorem to prove.
If you’re like me and frequently make off-by-one errors, then the five color
theorem is just as good as the four color theorem. In order to prove it we need
three short lemmas.
Lemma 6.8. In any graph G = (V, E), the sum of the degrees is twice the number of
edges: ∑_{v∈V} deg(v) = 2|E|.
Proof. The important observation is that the degree of a vertex is just the number
of edges incident to it, and every edge is incident to exactly two vertices.
This is where the proof would usually end. As a variation on a theme, you can (and
should) think of this as constructing a clever bijection like we did in Chapter 4,
but it’s difficult to clearly define a domain and codomain. Let me try: the domain
consists of
“edge stubs” sticking out from each vertex, and the codomain is the set of edges E.
We’re mapping each edge stub to the edge that contains that stub. This map is a
surjection in which each edge has exactly two preimages (its two stubs), giving
2|E| = ∑_{v∈V} deg(v).
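Lemma 6.8 is easy to sanity-check in code; a tiny sketch on a made-up adjacency dict:

```python
# A small graph as an adjacency dict; each undirected edge appears in
# both endpoints' neighbor lists, so we collect it once as a sorted tuple.
adj = {
    'a': ['b', 'c'],
    'b': ['a', 'c'],
    'c': ['a', 'b', 'd'],
    'd': ['c'],
}
edges = {tuple(sorted((u, w))) for u in adj for w in adj[u]}

degree_sum = sum(len(ws) for ws in adj.values())
print(degree_sum == 2 * len(edges))  # True
```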
Lemma 6.9. If a planar graph G has m ≥ 2 edges and f faces, then 2m ≥ 3f, i.e.,
f ≤ (2/3)m.
Proof. Pick your favorite embedding (drawing) of G in the plane. We'll use a
counting argument similar to the one in Lemma 6.8: in any planar drawing, every
face is enclosed by at least three edges, and every edge touches at most two
faces.⁸ Count the face-edge incidences: each face is "counted" by at least three
edges, so the incidences number at least 3f; each edge "counts" at most two faces,
so they number at most 2m. Hence 3f ≤ 2m.
8 An edge incident to a vertex of degree 1 will touch the “outside” face twice, but
this only counts as one face.
The requirement that m ≥ 2 is necessary, since if there is only one edge (or zero),
then the outside face is the only face. It only gets "counted" twice (or zero
times) by the edges it touches. Once we get to two edges, the outside face is
counted twice (2m = 4). As you add more edges, either you add dangling edges (or
subdivide existing edges), which increases 2m but not 3f, or you add edges that
create new faces. In the case of a single edge creating a single new face, the
lower bound 3f increases by exactly 3, but the upper bound 2m only increases by 2.
Despite having just read a proof, this may be surprising: can’t we keep adding
face-creating edges to make the lower bound of 3 f exceed the upper bound of 2 m?
It’s instructive to take a moment and play with examples. You’ll eventually get to
a situation in which all interior faces are triangles, and the inequality is either
an equality or very close. Then the creation of new faces requires a sufficient
number of non-face-creating edges to be made first, which loosens the inequality.
The proof above explains how this loosening and tightening of the inequality
corresponds to the geometry of a graph drawn in the plane. It translates the
geometry to algebra. When the algebra seems to misbehave, we can call back to the
geometry to understand.
You should do what I did for Lemma 6.8 and think about how to express this as an
injection from one set to another. The last lemma is the key to the five color
theorem.
Lemma 6.10 states that every planar graph has a vertex of degree at most 5. To
prove it, suppose to the contrary that every vertex of G has degree at least six.
Substituting the inequality from Lemma 6.9 into the Euler characteristic equation
gives

2 = |V| − |E| + |F| ≤ |V| − |E| + (2/3)|E|.

Rearranging terms to solve for |E| gives |E| ≤ 3|V| − 6. Now we want to use Lemma
6.8, so we multiply by two to get 2|E| ≤ 6|V| − 12. Since 2|E| is the sum of the
degrees, and each vertex has degree at least six, 2|E| is at least as large as
6|V|. Combining these,

6|V| ≤ 2|E| ≤ 6|V| − 12,

which is a contradiction.
As a quick side note that we’ll need in the next theorem, along the way to proving
Lemma 6.10 we get a bonus fact: the complete graph K_5 is not planar. This is
because we proved that all planar graphs satisfy |E| ≤ 3|V| − 6, and for K_5,
|E| = 10 > 15 − 6 = 9.
This argument doesn't work for showing K_{3,3} is not planar, but if you're willing
to do a bit of extra work (and take advantage of the fact that K_{3,3} has no
cycles of length 3), then you can sharpen the bound from Lemma 6.10 enough to make
it work. In particular, because K_5 is not planar, no planar graph can contain K_5
as a subgraph.
The five color theorem states that every planar graph is 5-colorable.

Proof. By induction on |V|. For the base case, every graph which has 5 or fewer
vertices is 5-colorable by using a different color for each vertex.
So why is G′ planar? To argue this, we can show that for any planar drawing of G,
removing v leaves w_i and w_j in the same face. This is equivalent to being able to
trace a curve in the plane from w_i to w_j without hitting any other edges, since
we could then "drag" w_i along that curve to w_j and "lengthen" the edges incident
to w_i as we go. When the two vertices merge and "become" x, we get a planar
drawing of G′. The picture in my head is like the strands of a spider web, shown in
Figure 6.9.

The key is that G is planar and that v has all of the w's as neighbors. If we want
to merge w_i into w_j, we can use the curve already traced by the edges from w_i to
v and from v to w_j. By planarity this is guaranteed not to cross any of the other
edges of G, and hence of G′. To say it a different way, if we took the drawing
above and continued drawing G′, and the result required an edge to cross one of the
edges above, then it would have crossed through one of the edges going from v to
w_i or v to w_j!
That proof neatly translates into a recursive algorithm for 5-coloring a planar
graph.
We’ll finish this section with Python code implementing it. In order to avoid the
toil of writing custom data structures for graphs, we’ll use a Python library
called igraph to handle our data representation. As a very quick introduction, one
can create graphs in igraph as follows.
9 The tick is called the "prime" symbol, and it is used to denote that two things
are closely related, usually that the primed thing is a minor variation on the
unprimed thing. So using G′ here is a reminder to the reader that G′ was
constructed from G.
Figure 6.9: The "strands of a spider web" image guides the proof that G′ is planar.
import igraph
G = igraph.Graph(n=10)
For example, given a graph and a list of nodes in the graph, one might use the
following function to find two nodes which are not adjacent (a sketch; the book's
repository has its own version):

def find_non_adjacent(graph, nodes):
    # Return the first pair of distinct nodes with no edge between them.
    for x in nodes:
        for y in nodes:
            if x != y and not graph.are_connected(x, y):
                return x, y
Also, the vertices of an igraph graph can have arbitrary "attributes" that are
assigned like dictionary indexing. We use this to assign colors to the vertices,
using graph.vs['color']. For example, this is the base case of our induction:
trivially color each vertex of a ≤ 5 vertex graph with all different colors.
colors = list(range(5))
def planar_five_color(graph):
n = len(graph.vs)
if n <= 5:
graph.vs['color'] = colors[:n]
return graph
...
The igraph library overloads the assignment operator to allow for entry-wise
assignment. In the statement graph.vs['color'] = colors[:n], the nodes of G are
being assigned the first n colors in the list of colors.
The rest of the planar_five_color function involves finding the vertices of the
needed degree, forming the graph G′ to recursively color, and keeping track of
which vertices were modified to make G′ so you can use its coloring to color G.
Here is the part where we find vertices of the right degree and do bookkeeping:
deg_at_most5_nodes = graph.vs.select(_degree_le=5)
deg_at_most4_nodes = deg_at_most5_nodes.select(_degree_le=4)
deg5_nodes = deg_at_most5_nodes.select(_degree_eq=5)
g_prime = graph.copy()
g_prime.vs['old_index'] = list(range(n))
The select functions are igraph-specific: they allow one to filter a vertex list by
various built-in predicates, such as whether the degree of the vertex is equal to
5. The old_index attribute keeps track of which vertex in G′ corresponded to which
vertex in G, since when you modify the vertex set of an igraph graph, the locations
of the vertices within the data structure change (which changes each vertex's index
in the list of all vertices).
Next we construct G′. This is where the two cases in the proof show up.
if len(deg_at_most4_nodes) > 0:
    v = deg_at_most4_nodes[0]
    g_prime.delete_vertices(v.index)
else:
    v = deg5_nodes[0]
    neighbor_indices = [w.index for w in v.neighbors()]
    g_prime.delete_vertices(v.index)
    neighbors_in_g_prime = g_prime.vs.select(old_index_in=neighbor_indices)
We implemented a function called merge_two that merges two vertices, but the
implementation is technical and not interesting. The official igraph function we
used is called contract_vertices. The remainder of the algorithm executes the
recursive call, and
then copies the coloring back to G, computing the first unused color with which to
color the originally deleted vertex v.
colored_g_prime = planar_five_color(g_prime)
for w in colored_g_prime.vs:
    graph.vs[w['old_index']]['color'] = w['color']

neighbor_colors = set(w['color'] for w in v.neighbors())
v['color'] = next(c for c in colors if c not in neighbor_colors)
return graph
The entire program is in the GitHub repository for this book.¹⁰ The second case of
the algorithm is not trivial to test. One needs to come up with a graph which is
planar, and hence has some vertex of degree 5, but has no vertices of degree 4 or
less. Indeed, there is a planar graph in which every vertex has degree 5. Figure
6.10 shows one that I included as a unit test in the repository.
Earlier I remarked that coloring is probably too hard for algorithms to solve in
the worst case. To get around the problem we added the planarity constraint. Though
a practical coloring algorithm would likely use an industry standard optimization
engine to approximately color graphs, let’s try something different to see the
theory around graph coloring. Suppose we’re promised a graph can be colored with 3
colors, and let's

10 See pimbook.org.
The first algorithm of this kind colors a 3-colorable graph with 4⌈√n⌉ colors,
where n = |V|. To make the numbers concrete, for a 3-colorable graph with 1000
vertices, this algorithm will use no more than 127 colors. Sounds pretty rotten,
but the algorithm is simple: as long as there is a vertex v of degree at least √n,
introduce three new colors. Use one for v, and the other two to color N(v). Then
remove all these vertices from the graph and repeat. If there are no vertices of
degree √n or more, then use the greedy algorithm to color the remaining graph.
Theorem 6.11. This algorithm colors any 3-colorable graph using at most 4⌈√n⌉
colors.¹²
Proof. Let G be a 3-colorable graph. For the first case, where there is a vertex v
of degree ≥ √n, we have to prove that the neighborhood N(v) can be colored with two
colors. But this follows from the assumption that G is 3-colorable: in any
3-coloring of G, v uses a color that none of its neighbors may use. Only two colors
remain for N(v).
In the second case, every vertex has degree at most ⌈√n⌉ − 1, and we proved in
Proposition 6.4 that the greedy algorithm will use no more than ⌈√n⌉ colors on such
a graph.
Now we have to count how many colors get used in total. The first case can only
happen √n times, because each time we color v and its neighbors, we remove at
least √n + 1 of the vertices from G (and √n · √n = n). Since we add 3 new colors
in each step, this part uses at most 3⌈√n⌉ colors. The greedy algorithm uses at
most ⌈√n⌉ colors, so in total we get at most 4⌈√n⌉ colors.
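That argument turns into a short program. Below is a sketch of the whole algorithm on adjacency dicts; the function names and representation are mine, and the 2-coloring step silently assumes each neighborhood really is bipartite (true whenever G is 3-colorable):

```python
import math
from collections import deque

def two_color_neighborhood(adj, nodes, offset):
    # BFS 2-coloring of the subgraph induced on `nodes`, using colors
    # offset and offset + 1. In a 3-colorable graph, every neighborhood
    # is bipartite, so this produces a proper coloring there.
    nodes = set(nodes)
    colors = {}
    for start in nodes:
        if start in colors:
            continue
        colors[start] = offset
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w in nodes and w not in colors:
                    colors[w] = offset + (colors[u] == offset)
                    queue.append(w)
    return colors

def color_3colorable(adj):
    # Sketch of the 4⌈√n⌉-coloring algorithm on an adjacency-dict graph.
    adj = {v: set(ws) for v, ws in adj.items()}  # private mutable copy
    threshold = max(1, math.isqrt(len(adj)))     # roughly √n
    coloring, palette = {}, 0
    while True:
        big = [v for v, ws in adj.items() if len(ws) >= threshold]
        if not big:
            break
        v = big[0]
        coloring[v] = palette                    # one fresh color for v
        coloring.update(two_color_neighborhood(adj, adj[v], palette + 1))
        removed = adj[v] | {v}                   # drop v and N(v)
        for u in removed:
            del adj[u]
        for ws in adj.values():
            ws -= removed
        palette += 3
    # Low-degree leftover: greedy coloring with fresh colors.
    for v in adj:
        taken = {coloring.get(w) for w in adj[v]}
        c = palette
        while c in taken:
            c += 1
        coloring[v] = c
    return coloring

# A 5-cycle is 3-colorable; the sketch produces a proper coloring of it.
c5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
coloring = color_3colorable(c5)
```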
One might naturally ask whether we can improve √n to something like log(n), or even
some very large constant. This is actually an open question. Recent breakthroughs
using a technique called semidefinite programming got the number of colors down to
roughly n^0.2. For reference, a thousand-node 3-colorable graph would have
n^0.2 ≈ 4. That's quite good! When I say such a problem is "possible" or
"impossible" to solve, I mean that there exists (or does not exist, respectively)
an efficient algorithm that achieves the desired worst-case guarantee on all
inputs. In particular, there is no evidence for either claim that it is possible or
impossible to color a 3-colorable graph with log(n) colors (or anything close to
that order of magnitude, like (log(n))^10).
11 Ideally we might hope to color a 3-colorable graph with 4 colors, but this was
shown to be NP-hard as well.
See http://dl.acm.org/citation.cfm?id=793420.
12 The symbol ⌈−⌉ denotes the ceiling of the argument, which is the smallest integer greater than or equal to the input. Similarly, ⌊−⌋ denotes the floor. These are mathematical ways to say round up or down.
2. Sometimes if you want to come up with the right rigorous definition for an
intuitive concept (like a planar graph), you need to develop a much more general
framework
for that concept. But in the meantime, you can still do mathematics with the informal notion.
6.8 Exercises
6.1. Write down examples for the following definitions. A graph is a tree if it
contains no cycles. Two graphs G, H are isomorphic if they differ only by
relabeling their vertices.
6.2. Look up the statement of Wagner’s theorem, which characterizes planar graphs
in terms of contractions and the two graphs K 3 , 3 and K 5. Find a proof you can
understand.
6.3. In Section 6.1 we claimed that the following two definitions of a connected
graph are equivalent: (1) there is a path between every pair of vertices, (2) it is
impossible to split V into two nonempty subsets X, Y such that no edge e = ( a, b)
has a ∈ X and b ∈ Y . Prove this.
6.4. Here’s a simple way to make examples of planar graphs: draw some non-
6.5. Given a graph G, the chromatic polynomial of G, denoted PG( x), is the unique
polynomial which, when evaluated at an integer k ≥ 0, computes the number of proper
colorings of G with k colors. Compute the chromatic polynomial for a path on n
vertices, a cycle on n vertices, and the complete graph on n vertices. Look up the
chromatic polynomial for the Petersen graph.
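If you want to sanity-check your answers to this exercise, brute force works for small graphs. A sketch (my own code and names, not the book’s):

```python
from itertools import product

def count_proper_colorings(edges, n, k):
    # Evaluate P_G(k) by brute force: try all k^n color assignments of
    # vertices 0..n-1 and count those with no monochromatic edge.
    # Exponential in n, so only useful for small sanity checks.
    return sum(
        all(c[u] != c[v] for u, v in edges)
        for c in product(range(k), repeat=n)
    )

def path_edges(n):
    return [(i, i + 1) for i in range(n - 1)]

def cycle_edges(n):
    return path_edges(n) + [(n - 1, 0)]
```

The counts can be checked against the closed forms k( k − 1)^( n−1) for a path and ( k − 1)^n + (−1)^n ( k − 1) for a cycle.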
6.7. In the chapter I remarked that the Euler characteristic is a special quantity
because it is an invariant. Look up a source that explains why the Euler
characteristic is special.
6.8. Find a simple property that distinguishes 2-colorable graphs from graphs that
are not 2-colorable. Write a program which, when given a graph as input, determines
if it is 2-colorable and outputs a coloring if it is.
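One possible sketch of such a program (my own code; reading it closely may spoil the property the exercise asks you to find):

```python
from collections import deque

def two_coloring(graph):
    # graph: dict vertex -> iterable of neighbors (undirected).
    # Returns a dict vertex -> 0/1, or None when no 2-coloring exists.
    coloring = {}
    for start in graph:
        if start in coloring:
            continue
        coloring[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in coloring:
                    coloring[v] = 1 - coloring[u]
                    queue.append(v)
                elif coloring[v] == coloring[u]:
                    return None  # found the obstructing structure
    return coloring
```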
6.10. A directed graph is a graph in which edges are oriented (i.e., they’re
ordered pairs instead of unordered pairs). The endpoints of an edge e = ( u, v) are
distinguished as the source u and the target v. A directed graph gives rise to
natural directed paths, which are like normal paths, but you can only follow edges
from source to target. A graph is called strongly connected if every pair of
vertices is connected by a directed path. Write a program that determines if a
given directed graph is strongly connected.
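A minimal sketch (my own code): a directed graph is strongly connected exactly when some single vertex reaches every vertex and is reached by every vertex, which needs only two traversals, one of them on the reversed graph.

```python
def strongly_connected(graph):
    # graph: dict vertex -> iterable of out-neighbors.
    if not graph:
        return True

    def reachable(g, start):
        # iterative depth-first search collecting everything reachable
        seen = {start}
        stack = [start]
        while stack:
            u = stack.pop()
            for v in g[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    reversed_graph = {v: [] for v in graph}
    for u in graph:
        for v in graph[u]:
            reversed_graph[v].append(u)

    start = next(iter(graph))
    n = len(graph)
    return (len(reachable(graph, start)) == n
            and len(reachable(reversed_graph, start)) == n)
```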
6.11. A directed acyclic graph (DAG) is a directed graph which has no directed
cycles (paths that start and end at the same vertex). DAGs are commonly used to
represent dependencies in software systems. Often, one needs to resolve
dependencies by evaluating them in order so that no vertex is evaluated before all
of its dependencies have been evaluated. One often solves this problem by sorting
the vertices using what’s called a
“topological” sort, which guarantees every vertex occurs before any downstream
dependency. Write a program that produces a topological sort of a given DAG.
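One possible sketch using Kahn’s algorithm (my own code): repeatedly emit a vertex all of whose dependencies have already been emitted.

```python
from collections import deque

def topological_sort(dag):
    # dag: dict vertex -> iterable of out-neighbors, with edges pointing
    # from a dependency to the things that depend on it.
    indegree = {v: 0 for v in dag}
    for u in dag:
        for v in dag[u]:
            indegree[v] += 1
    # vertices with no unprocessed incoming edges are ready to emit
    queue = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in dag[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(dag):
        raise ValueError("input graph has a directed cycle")
    return order
```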
6.12. A weighted graph is a graph G for which each edge is assigned a number we ∈
R.
Weights on edges often represent capacities, such as the capacity of traffic flow
in a road network. Look up a description of the maximum flow problem in directed,
weighted
graphs, and the Ford-Fulkerson algorithm which solves it. Specifically, observe how
the maximum flow problem is modeled using a graph. Find real-world problems that
are
6.13. A hypergraph generalizes the size of an edge to contain more than two
vertices.
Hypergraphs are also called set systems or families of sets. Edges of a hypergraph
are called hyperedges, and a k-uniform hypergraph is one in which all of its hyperedges have size k. Look up a proof of the Erdős-Ko-Rado theorem: let G be a k-uniform hypergraph with n ≥ 2 k vertices, in which every pair of hyperedges shares a vertex in common. Then G has at most ( n − 1 choose k − 1) hyperedges.
The reason a planar graph is so hard to define rigorously is because the right
definition of what it means to “draw” one thing inside another is deep and deserves
to be defined in general. And such a definition requires some amount of topology,
the subfield of mathematics that deals with the intrinsic shape of space without
necessarily having the ability to measure distances or angles.
• Every fe is injective.
• Whenever e1 ≠ e2, the only solutions to fe1( t1) = fe2( t2) have t1, t2 ∈ {0, 1}, i.e., the images of fe1 and fe2 do not intersect except possibly at their endpoints.
• Whenever there are two edges ( u, v) and ( u, w), the corresponding functions
must intersect at one endpoint, and these intersections must be consistent across
all the vertices. I.e., every u ∈ V corresponds to a point xu ∈ R2 such that for
every edge ( u, v) incident to u, either f( u,v)(0) = xu or f( u,v)(1) = xu.
The problem is that the definition is full of a bunch of “except” and special cases
(like that the endpoint could either be zero or one). This makes for ugly
mathematics, and the mathematical perspective is to spend a little bit more time
understanding exactly what we want from this definition. We are humans, after all,
who are inventing this mathematics so that we can explain our ideas easily to
others and appreciate the beautiful proofs and algorithms. Keeping track of edge
cases is dreary.
And because we said we don’t want any of the edges to cross each other in the
plane, we probably want f to be injective. Finally, because the drawing has to be a
sensible drawing, we need f to be continuous. Recall from calculus that a
continuous function intuitively maps points that are “close together” in the domain
to points that remain close together
in the codomain. Without continuity, a “drawing” could break edges into disjoint
pieces and there would be chaos.
The real question is: what is the domain of this function? It can’t be G as a set
because we don’t have a notion of “closeness” for pairs of vertices, and we really
want to think of an edge as a line-like thing.
The trick is to start imagining abstract spaces that are not sitting in any ambient
geometric space. This is where the formalisms of topology shine, but unfortunately
a satisfying overview of the basic definitions of topology is beyond the scope of
this book. It suffices for our purposes to understand two concepts:
The first idea is that one can take the disjoint union of two abstract spaces and get another abstract space in which the points comprising the two pieces are different. In other words, we can take lots of different copies of the same space (in our case [0 , 1]); their disjoint union is like a bunch of lines, but we aren’t presuming any way to compare the different pieces. Each piece retains its internal geometry in the composite.
The second idea is that one can identify two points in an abstract space.
Intuitively, one can “glue together” two points and maintain the rest of the space
unhindered. For us, if a copy of [0 , 1] represents an edge, then we’ll want two
edges incident to the same vertex to have one of their two endpoints identified.
This foreshadows a topic in a later chapter called the equivalence relation, which
formalizes how to identify points in a consistent way.
Putting these two ideas together, the abstract space XG corresponding to a graph G
is the disjoint union of copies of [0 , 1] for each edge, with endpoints identified
when two edges intersect at a vertex. Then we can define a function f : XG → R2,
require it to be injective, and call it continuous if points that are close in XG,
using the natural distance for points in the interval [0 , 1], get sent to points
that are close in f( XG). How do I measure distance between two points a, b ∈ XG
that might be on different edges?
Well, a, b are either vertices or on some copy of [0 , 1], so I can find a path in the graph G that gets from one edge to another (if not, then the distance can be called infinite). Then I could measure the length of each full edge on this path, and add up the partial edges required to get from a or b to the desired endpoint of the edge they’re in.
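As a sketch of how one might compute this metric in code (my own conventions, not the book’s): represent a point of XG as a pair (( u, v), t), meaning the point a fraction t of the way from u to v along edge ( u, v), and give every edge length 1.

```python
import heapq

def vertex_distances(graph, source):
    # Dijkstra with every edge of length 1, since each edge of X_G is a
    # copy of [0, 1]. graph: dict vertex -> set of neighbors.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v in graph[u]:
            if d + 1.0 < dist.get(v, float("inf")):
                dist[v] = d + 1.0
                heapq.heappush(heap, (d + 1.0, v))
    return dist

def point_distance(graph, point_a, point_b):
    # For points on the same edge the straight route along that edge wins;
    # otherwise minimize over the four endpoint combinations: walk to an
    # endpoint of one edge, through the graph, and into the other edge.
    (ua, va), ta = point_a
    (ub, vb), tb = point_b
    if {ua, va} == {ub, vb}:
        if (ua, va) != (ub, vb):  # same edge, opposite orientation
            tb = 1.0 - tb
        return abs(ta - tb)
    best = float("inf")
    for end_a, off_a in ((ua, ta), (va, 1.0 - ta)):
        dist = vertex_distances(graph, end_a)
        for end_b, off_b in ((ub, tb), (vb, 1.0 - tb)):
            if end_b in dist:
                best = min(best, off_a + dist[end_b] + off_b)
    return best
```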
This is a very fancy way to say that I can impose the same geometry that was on
[0 , 1]
onto the different pieces of XG and patch them together. But once you get
comfortable with that idea, you have a natural way to define an embedding of any
abstract space into any other abstract space: a continuous injective function!
If this interests you and you’d like to see it made more formal, pick up a book on
topology. Appendix C contains some references. Unfortunately I haven’t yet found a
topology book that I genuinely like. Most books tend to be terse and contain few
pictures (which is the opposite of how topology is done!). Topology aims to
generalize much of calculus, so waiting until after Chapter 14 might be prudent.
Chapter 7
Some people may sit back and say, “I want to solve this problem” and they sit down
and say, “How do I solve this problem?” I don’t. I just move around in the
mathematical waters, thinking about things, being curious, interested, talking to
people, stirring up ideas; things emerge and I follow them up. Or I see something
which connects up with something else I know about, and I try to put them together
and things develop. I have practically never started off with any idea of what I’m
going to be doing or where it’s going to go. I’m interested in mathematics; I talk,
I learn, I discuss and then interesting questions simply emerge. I have never
started off with a particular goal, except the goal of understanding mathematics.
– Alfréd Rényi
There is a fascinating bit of folklore, which as far as I know originated with a 2010 blog post of Ben Tilly, that you can tell what type of mathematician you are by how
you eat corn on the cob. It turns out there are multiple ways to eat corn, and they
are roughly grouped as “eat in rows like a typewriter, left to right,” and “eat in
a spiral, teeth scraping the corn into your mouth.”
According to Tilly, who surveyed 40-ish mathematicians and received countless more
self-selected responses via the internet, corn eating predicts mathematical
preference with surprising accuracy. Since his post, this observation has become a bit of folklore that reinforces the idea that mathematics has many subcultures organized around preference and character.
If you are unsure to which class you belong, then consider the following two
statements.
Most mathematicians would say that there is truth in both (1) and (2). Not all
problems are equally interesting, and one way of distinguishing the more
interesting ones is to demonstrate that they improve our understanding of
mathematics as a whole. Equally, if somebody spends many years struggling to
understand a difficult area of mathematics, but does not actually do anything with
this understanding, then why should anybody else care?
The Hungarian mathematician Paul Erdős was a pillar of the problem solving camp.
Though this short essay could not possibly do justice to his outlandish life story,
I will try to summarize. Erdős is the most prolific mathematician in history, by
count of papers published (over 1500). He was able to do this because he renounced
every aspect of life beyond mathematics. He had no home, and lived out of a
suitcase while traveling from university to university. At each stop, he would show
up, knock on the department chair’s office door, and be provided housing and food
by an attendant professor. In the subsequent weeks, Erdős and his host would work
on problems and usually publish a paper or two, until such time as Erdős decided to
move on to his next host. As Erdős said,
Erdős would often do bizarre things like wake up his host in the middle of the
night, exclaiming, “My mind is open,” meaning he was ready to do mathematics. He
was a
He also claimed God kept a Book of the most beautiful proofs of every theorem. He
didn’t believe in God, but he believed in the Book.
Erdős’s hosts tolerated his idiosyncratic behavior because his presence was a boon
to one’s career. Mathematicians jumped at the chance to work with Erdős, and in
turn they started to track their so-called Erdős number. In the graph whose
vertices are people and whose edges are coauthorship, your Erdős number tracks the
length of the shortest path from you to Erdős. 1
1 You didn’t ask, but my Erdős number is three, by way of György Turán → Endre
Szemerédi (and others) →
Erdős.
His work focused on problems in combinatorics, number theory, graph theory, and
incidence geometry (statements about configurations of points and lines), the sort
of counting arguments that we saw in Chapters 4 and 6—though much more
sophisticated and interesting. As he spread his ideas from university to
university, he gave combinatorics credibility as a field of study and established
its reputation as a field that prioritizes problem solving over theory building. To
Erdős, mathematics was “conjecture and proof.”
Indeed, as Tim Gowers writes, graph theory tends not to benefit from extensive
theory-building.
At the other end of the spectrum is, for example, graph theory, where the basic
object, a graph, can be immediately comprehended. One will not get anywhere in
graph theory by sitting in an armchair and trying to understand graphs better.
Neither is it particularly necessary to read much of the literature before tackling
a problem: it is of course helpful to be aware of some of the most important
techniques, but the interesting problems tend to be open precisely because the
established techniques cannot easily be applied.
In mathematics, ideas and concepts come first, then come questions and problems. At
this stage the search for solutions begins, one looks for a method or strategy.
Once you have convinced yourself that the problem has been well-posed, and that you
have the right tools for the job, you then begin to think hard about the
technicalities of the proof.
Before long you may realize, perhaps by finding counterexamples, that the problem
was incorrectly formulated. Sometimes there is a gap between the initial intuitive
idea and its formalization. You left out some hidden assumption, you overlooked
some technical detail, you tried to be too general. You then have to go back and
refine your formalization of the problem. It would be an unfair exaggeration to say
that mathematicians rig their questions so that they can answer them, but there is
undoubtedly a grain of truth in the statement. The art in good mathematics, and
mathematics is an art, is to identify and tackle problems that are both interesting
and solvable.
Proof is the end product of a long interaction between creative imagination and
critical reasoning.
I interpret this in more of a metaphysical sense than a literal sense; one needs to
know what questions are worth asking before one can provide a proof answering them.
For whatever reason, Atiyah doesn’t consider the validations or refutations of
these initial ideas as “proofs” in the formal sense.
One person who might be said to be the stylistic antithesis to Paul Erdős is the
French mathematician Alexander Grothendieck. He also lived a curiously eccentric
lifestyle involving radical anti-military politics and an eventual self-exile to a
small village in Southern France. Grothendieck declined various prizes for his
life’s work, and decried the mathematical establishment as being obsessed by status
to the point of intellectual bankruptcy.
Toward the end of his life he also turned to mysticism and spiritualism, almost
starving himself to death via unusual diets and fasting.
Although mathematics became more and more abstract and general throughout the 20th
century, it was Alexander Grothendieck who was the greatest master of this trend.
His unique skill was to eliminate all unnecessary hypotheses and burrow into an
area so deeply that its inner patterns on the most abstract level revealed
themselves—and then, like a magician, show how the solution of old problems fell
out in straightforward ways now that their real nature had been revealed.
Grothendieck’s idea was to find out which theorems are important, and then rewrite the basic definitions of mathematics until those theorems become completely trivial. In his mind, a theory is powerful only insofar as what it makes obvious. A
radical conviction indeed!
Among the German geometers of this century, two names above all are illustrious,
those of the two scientists who have founded the general theory of functions,
Weierstrass and
Riemann. Weierstrass leads everything back to the consideration of series and their
analytic transformations; to express it better, he reduces analysis to a sort of
prolongation of arithmetic; you may turn through all his books without finding a
figure. Riemann, on the contrary, at once calls geometry to his aid; each of his
conceptions is an image that no one can forget, once he has caught its meaning.
We’ll see the two sides of this analytic/geometric coin in the forthcoming
chapters: the view that geometric ideas should be studied using series is how we
will approach Calculus in Chapter 8 (and to a lesser extent Chapter 14), while the
geometric view is the heart of the study of hyperbolic geometry in Chapter 16.
These could have easily been swapped, with geometric ideas founding calculus and
analytic ideas underlying hyperbolic geometry.
Styles fall along a spectrum, depending on the occasion and whether one has had a
full breakfast. Whether Poincaré, Mumford, Atiyah, or Tilly, the mathematical
universe is as varied in attitudes and preferences as any other community, and
mathematics reaps the benefits of diversity.
For the record, I eat corn like a typewriter, and I do prefer algebra. Although,
much of my mathematical research involved analysis-style arguments, and I have come
to appreciate the beauty of a good bound. Maybe next time I’m in a rush I’ll try
scraping that corn.
Chapter 8
I can remember absorbing each of these concepts as something new and interesting,
and spending a good deal of mental time and effort digesting and practicing with
each, reconciling it with the others. I also remember coming back to revisit these
different concepts later with added meaning and understanding.
– William Thurston
Much of the mastery of calculus (and any subject!) comes with practice. Even so, in
this chapter and Chapter 14 we can survey many of the important features of a
complete calculus course and do a bit of machine learning at the end. This chapter
will be about calculus for functions with one input, while Chapter 14 will cover
functions with many inputs.
If you’ve seen a lot of calculus before, you can probably tell that I don’t regard
it as reverently as most other authors. While I can appreciate its place in history
and its applications to physics and everything else, my esteem for calculus is
essentially limited to
“It’s a great tool for computation.” I avoid nonsense rhetoric about calculus like the plague (“With calculus you can hold infinity in the palm of your hand!”). I’d
much rather use it to do something useful and draw divine inspiration from other
areas of math. But that’s a personal preference.
Besides calculus, in this chapter we’ll dive into more detail about the process of
designing a good mathematical definition. In doing this we’ll introduce the idea of
a quantifier, which is the basis for compound (recursive) conditions and claims.
We’ll also come to understand the idea of well-definition in mathematics, which is
how a mathematician proves (or asserts) that the definition of a concept doesn’t
depend on certain irrelevant details in its construction. Finally, we’ll level up
our proof skills by using multiple definitions in conjunction to prove theorems.
The application for this chapter is an analysis of the classic Newton’s method for
finding roots of functions.
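As a preview of that application, the core iteration is tiny. A sketch with my own naming, not the chapter’s full treatment:

```python
def newtons_method(f, df, x0, iterations=20):
    # Follow the tangent line at the current guess down to where it
    # crosses zero, and repeat: x <- x - f(x) / df(x).
    x = x0
    for _ in range(iterations):
        x = x - f(x) / df(x)
    return x
```

For instance, newtons_method(lambda x: x**2 - 2, lambda x: 2*x, 1.0) closes in on √2.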
Let’s start with something we know well. If you give me a line in the plane, with
tick marks forming integer coordinates like in Figure 8.1, then I can tell you how
“steep”
the line is. That is, I can assign a number to the line, and larger numbers
correspond to steeper lines while smaller numbers correspond to more gradual lines.
Also recall that the picture with coordinate axes is just one representation of the
line. Another might be as a set of points {( x, y) ∈ R2 : 2 y − x = 2 }. How we
choose to draw the line isn’t as important as the set-with-equation definition, but
a good drawing swiftly reveals qualitative facts about the line (such as whether
its “steepness” goes up or down).
slope( L) = ( y2 − y1) / ( x2 − x1)
The difference in the y’s corresponds to a vertical change, while the difference in
x’s corresponds to a horizontal change. The slope is an invariant of the line
because it does not depend on the choice of points. This can be proved rigorously
by appealing to similar triangles. Lines and other simple functions often represent
the 1-dimensional position of an object over time, while the steepness—the ratio of
the change in position to the change in time—is the velocity of that object.
Before graduating from lines, let me point out that not all lines are functions from the x coordinate to the y coordinate. 1 If you pick a line which is a function y = f ( x), then the slope is

slope( f ) = ( f ( x2) − f ( x1)) / ( x2 − x1)

1 For example, the line {( x, y) : x = 1 } cannot be written as a function of x. It can be written as a function of the y variable, but then the concept of slope is rotated by 90 degrees.
In this way, the concept of slope requires an orientation of the line and the
coordinate system it is represented in. The input coordinate is defined as
“horizontal” while the output coordinate is “vertical.” This is the mathematician’s
choice, though calling x the
Now let’s try to translate the concepts to the curved function f( x) in Figure 8.2.
It has a complicated formula we won’t write down. The curve is steeper at some
places (e.g., A) and less steep at others ( B). Despite the self-evident fact that
the curve is steep at A and gradual at B, if we were pressed to say numerically and
consistently how the two steepnesses compare, we’d be at a loss. The picture gives
only qualitative information.
To motivate an exact answer, let’s approximate steepness using tools we know. Focus
on the point labeled A, and call it A = ( x, f( x)). After a moment of thought, the
idea naturally occurs to draw a line between ( x, f( x)), and a nearby point ( x′,
f( x′)), and have our approximation be the slope of that line, as in Figure 8.3.
steepness at A ≈ ( f( x′) − f( x)) / ( x′ − x)
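You can watch this secant-line approximation settle down numerically. For instance (my example, not the book’s figure), take f( x) = x² at x = 1, whose true steepness turns out to be 2:

```python
def secant_slope(f, x, x_prime):
    # slope of the line through (x, f(x)) and (x', f(x'))
    return (f(x_prime) - f(x)) / (x_prime - x)

f = lambda x: x ** 2
# move x' closer and closer to x = 1
approximations = [secant_slope(f, 1.0, 1.0 + 10.0 ** -i) for i in range(1, 6)]
# roughly 2.1, 2.01, 2.001, ... creeping toward the true steepness 2
```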
2 It’s a shame that the tick symbol is also used in calculus to denote the
derivative of a function, but this will be a good opportunity to practice
disambiguating notation using context.
Figure 8.2: For a general curve, steepness depends on where you measure.
Figure 8.3: We can use the slope of a line as a proxy for the corresponding
“steepness”
measurement on a curve.
Figure 8.4: Two different lines show how the approximation can be better or worse,
depending on where it is.
We also use the ≈ symbol as a stand-in for the phrase “is approximately.” I also
went back to using the word “steepness” instead of slope because we’re using the
slope of a line to reason about this new kind of steepness.
steepness at A ≈ ( f( x1) − f( x)) / ( x1 − x)
Our yearnings are destined for iteration. Do it again, and again, getting ( f( x2) − f( x)) / ( x2 − x) and ( f( x3) − f( x)) / ( x3 − x), and so on. With each step the line approximation gets better and better, closer and closer to the true steepness at A.
How do we reason about the “end” of this process? We get a number at every step.
If we were to run this loop forever, would these approximate numbers approach some
concrete number? If so, we could reasonably call that number the “true” steepness
of f at A.
That is exactly what limits do. A limit is computational machinery that allows one
to say “this sequence of increasingly good approximations would, if followed
forever, end up at a specific value.” The limit of this particular line-
approximation-scheme is called the derivative. We’ll return to derivatives in a
bit. Note in particular that whether this “limiting process” works shouldn’t depend
on how we move x′ closer to x. A good definition should work so long as x′
approaches x somehow.
8.2 Limits
In the last section we saw a strong motivation for inventing limits, and an
intuitive understanding for what a limit should look like. It’s the “end result” of
iteratively improving an approximation forever. You have some quantity an indexed
by a positive integer n, and as n grows, an eventually gets closer and closer to
some target. For example, if an = 1 − 1/ n, the numbers in the sequence 0 , 1/2 , 2/3 , 3/4 , 4/5 , . . . seem to approach 1.
From a specification standpoint, you care mostly about how one intends to use an
interface. When actually writing the program you have to worry about people
misusing your code, intentionally or not. You have to anticipate and defend against
the edge case inputs which are syntactically allowed but semantically unnatural.
Anyone who has spent time designing a software library has spent hours upon hours
thinking about:
• How to avoid writing a mess of extra code just to handle edge cases.
Ideally a library author wants to meet all of these criteria at once! We have the
same problem in mathematics.
Most concepts in math—in this case limits—usually make intuitive sense in the
overwhelming majority of cases you encounter in real life. However, 99% of the work
in making the math rigorous is converting the concept into concrete definitions
that can handle pathological counterexamples. By pathological, I mean examples that
are mathematically valid, but which nobody would ever encounter in the wild. 3 The
best pathological examples are edge cases on steroids, and some mathematicians gain
fame for constructing particularly vexing pathological examples. They’re the
penetration testers of mathematics. You may have heard of a particularly famous one called the Cantor Set.
Indeed, much like a program, once a mathematical definition is written down it must
be judged on its own merits. It must behave properly under any “input.” Best
practices also suggest definitions reduce cognitive load and avoid too many special
cases. Achieving the right balance is a serious challenge.
An unfortunate consequence of all this is that math books start with the final definition (though it was really a team effort over decades). Modern calculus textbooks are a strange mix. They want to capture the informality of Leibniz, feel obliged to Weierstrass’s rigor, but can’t commit to either approach fully. Going full Leibniz would be error-prone. On the other hand, the cult of Weierstrass requires detailed
proof-reading skills. Alas, mathematicians are usually the only ones who enjoy the
elaborate tour of blunders and false starts that historically sculpted a modern
definition. One could hardly cajole the average student to care, or even the
brightest student, until after those blunders come to bear on their own work.
To my delight, you’re still reading. My goal for the rest of the chapter is to whet
your appetite for definition crafting. Let’s continue with the “steepness of a
function” as our prototypical example of a limit. Here’s one of those pathological
examples that makes limits hard. I’m going to define a non-curve and not-even-
connected function f : R → R
But I dastardly chose f in such a way that the limit changes depending on how you pick the sequence x 1 , x 2 , . . . . In fact, if you pick xk = 1/ k, every slope in the sequence is 2, implying the limit is 2. There isn’t even an approximation because the values in the sequence are constant. But if you choose xk = 1/( k + 0.5),
3 This is relative, of course. Once upon a time complex numbers like 1 + i were
thought to be pathological, but now they’re standard.
Figure 8.6: This pathological function admits two different possibilities for the
derivative depending on the sequence of approach.
This will be the last pathological example I inflict upon you, 4 but it emphasizes
an important point. However we choose to define derivatives, it should not depend
on the arbitrary choice of which points you use to do the approximation. It should
be a definition like “no matter how your x values approach the target, the slope
limit is the same.”
The generic mathematical term for this is that the derivative should be well-
defined. Two of the definitions we scrutinize in this chapter—the limit of a
function (Definition 8.4) and the derivative (Definition 8.6)—will encounter the
issues above. The quality of Definition 8.1, which defines the limit of a sequence
of numbers, and its subsequent analysis provide a foundation that ensures well-
definition.
With that thought, let’s start with the limit of a sequence of numbers, which will
be used to define limits for functions. 5 Since sequences of numbers can have
repetition, we won’t use set notation (though some authors do). Instead we’ll use a
comma notation x 1 , x 2 , . . . which the strongly-typed programmer can think of
as the output of an iterator which never terminates, or a tuple/array of infinite
length ( x 1 , x 2 , . . . ). The ε character is a lower-case Greek epsilon,
contextually used across mathematics as a small positive real number.
This is the first time we’ve encountered a definition that relies heavily on alternating quantifiers (for every…, there is…), so let’s discuss it in detail. A statement like “for every FOO, there is a BAR” asserts that, given a FOO as input, I can produce a BAR with the desired property as output.6 In fact there may be many such BARs. Interpreting this for Definition 8.1, the input is a real number threshold ε > 0, and the output is an integer k with a special property. So the relationship is:

int sequence_index_from_threshold(float epsilon) {
  return k;
}

4 If you want more, check out the book “Counterexamples in Calculus.”
5 We say “the” limit because the definition makes it unique. You will prove this in Exercise 8.4.
They’re at least as close as specified by ε.
Concretely, suppose we want all the terms xn = 1 − 1/ n past some index to be within 1/4 of 1. I.e., all these xn’s should satisfy 1 − 1/4 < xn < 1 + 1/4. Another way to write this is with the absolute value: |xn − 1 | < 1/4. Since we already see that 3/4, also known as 1 − 1/4, is one of the sequence elements, it should be easy to guess that everything starting at k = 5 will be close enough to 1. Indeed, we can do the algebra:

|xn − 1 | = |1 − 1/ n − 1 | = |−1/ n| = 1/ n,

which is less than 1/4 exactly when n > 4.
Now let ε > 0 be unknown, but fixed. We can do the same algebra as above. How large
of an index k do we need to ensure |xn − 1 | < ε for all n > k? In other words, can
I write ε in terms of n so that all of the above equations and inequalities are
still true when I replace 1/4 with ε?
Proof. Let ε > 0 be fixed. Pick any integer k > 1/ ε. 7 We will show that |xn − 1 | < ε for all n ≥ k. Indeed,

|xn − 1 | = |1 − 1/ n − 1 | = |−1/ n| = 1/ n,

and since n ≥ k > 1/ ε, we have 1/ n ≤ 1/ k < ε, as desired.
6 It isn’t strictly true in math that it can be computed. Sometimes you can prove a
thing exists without knowing how to compute it. But in most important cases you can
compute, and it makes the explanation here simpler.
7 The fraction 1/ ε doesn’t risk division by zero because Definition 8.1 requires ε
> 0.
You can think of this ε-to- k process as a game. A skeptical contender doesn’t
believe xn converges to L, and challenges you to find the tail of the sequence that
stays within ε =
1/2 of L. You provide such a k, but the contender isn’t happy and re-ups the
challenge using ε = 1/100. You comply with a bigger k. The contender retorts with ε
= (1/2)99.
Unfazed, you still produce a working k. If there’s any way for the contender to
stump you in this game, then xn doesn’t converge to L. But if you can always
produce a good k no matter what, the sequence converges to L.
As a notational side note, the phrase “for every x there is a y” can be long and
annoying to write all the time. It also makes it difficult to study the syntactic
structure of statements like this, since dependence among variables may be unclear.
Mathematicians designed an unambiguous notation for this situation called
quantifiers. We briefly introduced quantifiers in Chapter 4, and promised we
wouldn’t use them in this book. However, standard textbook definitions often use
the symbols heavily, so this digression helps put what you might see elsewhere in
context.
The first quantifier is the symbol ∀, which means “for all” (the upside-down A
stands for All). The second is ∃, which stands for “there exists” (the backwards E
in “Exists”).
When I write the statement

∃x ∈ R , ∀y ∈ R , x + y = 3 ,
I’m saying I can come up with a real number x, such that no matter which y you
produce, it’s true that x + y = 3. Obviously no such x exists, so the statement is
false.
Note the meaning changes if the order of the quantifiers is reversed: for every y,
there is indeed an x for which x + y = 3, it’s x = 3 − y.
If I were to state the definition of the limit in its briefest form, I might say:
xn converges to L if: ∀ε > 0 , ∃k > 0 , ∀n > k, |xn − L| < ε.
We’ve just packed the math like sardines in a tin box. That being said (and now
we’re really digressing), some situations benefit from writing logical statements
in this form.
Particularly in the realm of formal logic, it turns out that as you add more "alternating" quantifiers (∀∃∀∃ . . . ) to a statement, the computational complexity of deciding whether the statement is true grows.
Back to limits. The definition of a limit allows a sequence to have no limit, like
the sequence 0 , 1 , 0 , 1 , 0 , . . . , which isn’t pathological at all. For this
sequence you can’t even satisfy the limit definition with ε = 1/3 (no matter what
you think the limit L might be!).
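The ε = 1/3 claim has a quick numeric sanity check: the terms 0 and 1 each occur beyond every k, they are distance 1 apart, and two numbers within 1/3 of the same L would be at most 2/3 apart. A sketch over a handful of candidate limits:

```python
def x(n):
    return n % 2  # the sequence 0, 1, 0, 1, ...

# For any candidate limit L, at least one of the terms 0 and 1 sits
# at distance >= 1/3 from L, so the definition fails with epsilon = 1/3.
for L in [0, 1, 0.5, 1/3, 0.9]:
    assert abs(x(2) - L) >= 1/3 or abs(x(3) - L) >= 1/3
```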
Figure 8.7: Starting in the top left corner, we want to deduce the top right corner. We do this by taking the longer route down and around.
Sometimes we abbreviate the claim that xn converges to L by the notation lim n→∞ xn
= L, and sometimes even more compactly as xn → L. In this setting, the symbol ∞
doesn’t have any concrete mathematical meaning by itself, it’s just notation to
remind us that we’re talking about n’s that get arbitrarily large.
The notation f( xn) is shorthand for a sequence yn = f( xn). In this context we’re
implicitly “mapping” f across the sequence xn as one would say in functional
programming, or alternatively we’re “vectorizing” f. The notation x → c is used to
signify that xn is a sequence converging to c, and the value of x is used in the
expression inside the limit.
Proposition 8.5.

lim_{x→2} (x² − 1) = 3.
Proof. Let ε > 0 be the threshold required by the definition of f( xn) → 3. We’ll
use the proof of the fact that xn → 2 as a subroutine for some special ε′ that we
choose, and use the index we get as output to prove that f( xn) → 3.
Figure 8.7 contains a diagram to illustrate the gymnastics. The top row is the
theorem we want to prove, with the input on the left and the desired output on the
right. Likewise,
the bottom row is the black box subroutine for xn → 2. Given the initial ε > 0 that
we don’t get to pick, we choose a threshold ε′ to use for xn → 2. Picking a useful
ε′ is the tricky part of these kinds of proofs.
|f(xn) − 3| = |xn² − 4| = |(xn + 2)(xn − 2)| < ε
We get to control how close xn is to 2 and how fast it gets there—this will be the
subroutine proving that xn → 2 and the choice of ε′. Through that control we can
make the term |xn − 2 | small. As long as we can make |xn − 2 | smaller faster than
the other term ( xn + 2) grows (which may be a consequence of us trying to make |xn
− 2 | small), we’ll be able to make the product |( xn + 2)( xn − 2) | as small as
we need.
Since we know xn will eventually be close to 2, we can analyze that growth in the
range 1 < xn < 3, i.e., when ε′ < 1. In this range, |xn + 2 | < 5. This bounds the
growth as described, and simplifies the expression.
We are nearly victorious. Now we want to choose ε′ smaller than 1, such that 5|xn − 2| < ε. Since ε′ also controls the size of xn − 2 (it's the threshold for the subroutine for xn → 2) we arrive at 5|xn − 2| < 5ε′. Finally, we can solve: 5ε′ < ε when ε′ < ε/5. Combining the pieces together, we can start the proof from the beginning, explicitly invoking the subroutine this time.
Let ε > 0 be arbitrary. Choose ε′ < min(1, ε/5), and choose k such that |xn − 2| < ε′ for all n > k. Then

|f(xn) − 3| = |xn² − 4| = |(xn + 2)(xn − 2)| < 5ε′ < ε.
All of this was a formal way of saying that to compute lim x→ 2 x 2 − 1, you may
“plug in” 2 to the expression x 2 − 1. Indeed, in almost all cases where the
expression inside the limit is defined (and continuous) at the limiting input (in
this case x = 2), you can do that. But there are non-pathological functions with
useful limits (not just the derivative) for which you can’t simply “plug the value
in.” See the exercises for a famous example.
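The ε′ < ε/5 bookkeeping from the proof of Proposition 8.5 can itself be sanity-checked numerically. A sketch (the halving below is just one way to pick a threshold strictly smaller than min(1, ε/5)):

```python
def f(x):
    return x**2 - 1

def epsilon_prime(epsilon):
    # The proof's choice: any threshold strictly below min(1, epsilon/5).
    return min(1, epsilon / 5) / 2

for epsilon in [0.5, 0.01, 1e-6]:
    ep = epsilon_prime(epsilon)
    for x in [2 - 0.99 * ep, 2 + 0.5 * ep, 2 + 0.999 * ep]:
        # Whenever x is within epsilon' of 2, f(x) is within epsilon of 3.
        assert abs(x - 2) < ep
        assert abs(f(x) - 3) < epsilon
```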
To reiterate from earlier, all of this hefty calculus machinery was invented to
deal with those difficult functions.
This proof embraces a style of mathematics called analysis. The term “analysis” can
refer to specific subfields of study, such as real analysis or complex analysis
which are the formalizations of calculus for real and complex numbers. More
broadly, an area of math called “analysis” stresses proof techniques that deal with
bounds and approximations. The error in these approximations can be controlled to
achieve the necessary goals: loosely when attempting to simplify complexity that is
irrelevant to the goal, or tightly
As we saw with our pathological "two lines" example from Figure 8.6, not every function has a limit at every point. For the "two lines" f(x), we computed the slope as lim_{x→0} (f(x) − f(0))/(x − 0). We found two different sequences an, bn converging to zero, but their slope-sequences (f(an) − f(0))/(an − 0) and (f(bn) − f(0))/(bn − 0) gave different values. As a consequence, the limit cannot be equal to either value. So we've seen that this
definition of the limit passes a litmus test: good functions have limits, and bad
functions do not.
Before continuing, here are a few basic propositions for working with limits that
will come in handy in the rest of the chapter and in the exercises. Most calculus
or real analysis textbooks will contain a detailed proof. Basically, they say that
most arithmetic operations are compatible with limits, provided the limits involved
exist. These formalize the general rule that, absent any strange function behavior, you can "plug in" the sequence limit to get a function limit, i.e., that f(a) = lim_{x→a} f(x).
• lim x→a( f( x) + g( x)) = lim x→a f( x) + lim x→a g( x), provided that each limit
on the right hand side exists.
• lim x→a f( x) g( x) = (lim x→a f( x)) ·(lim x→a g( x)), provided that each limit
on the right hand side exists.
• lim x→a g( f( x)) = g( f( a)), provided that lim x→a f( x) = L exists and g is
continuous at L.
lim_{x→c} (f(x) − f(c)) / (x − c)
This value is denoted f′(c).8 In the limit, sequences xn → c are taken so that xn ≠ c to avoid division by zero.

8 Here is where the prime ′ is being used to denote the derivative.
Here f(x) − f(3) = x² − 6x + 9 = (x − 3)², and the computation unfolds as

f′(3) = lim_{x→3} (f(x) − f(3)) / (x − 3)
      = lim_{x→3} (x² − 6x + 9) / (x − 3)
      = lim_{x→3} ((x − 3)(x − 3)) / (x − 3)

The limit only considers sequences xn → 3 for which xn ≠ 3, so we may cancel the common factor of x − 3:

f′(3) = lim_{x→3} ((x − 3)(x − 3)) / (x − 3)
      = lim_{x→3} (x − 3)
      = 0
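The limit can also be probed numerically. A sketch evaluating the difference quotient, taking f(x) = x² − 6x + 9 as in the displayed numerator (for which f(3) = 0 and the quotient is exactly x − 3):

```python
def f(x):
    return x**2 - 6*x + 9   # equivalently, (x - 3)**2

def slope(x):
    # The difference quotient (f(x) - f(3)) / (x - 3); note f(3) = 0.
    return (f(x) - f(3)) / (x - 3)

# The quotients shrink toward 0 as x approaches 3, matching f'(3) = 0.
for n in [10, 1000, 10**6]:
    assert abs(slope(3 + 1/n) - 1/n) < 1e-8
```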
This was a nice exercise, but it’s tedious to compute derivatives over and over
again for every input. It would be much more efficient to instead compute a compact
representation of the derivative at all possible points. That is, we want a process
which, when given a differentiable function f : R → R as input, produces another
function g : R → R as output, such that g( c) = f′( c) for every c. While computing
the limit may be tedious, our representation of g should make subsequent derivative
calculations as computationally easy as evaluating f.
If you ask a mathematician how to come up with such a g, you’d probably receive the
reply, “You just do it.” This means we can calculate directly from the definition.
If, for example, f(x) = x²,

f′(c) = lim_{x→c} (f(x) − f(c)) / (x − c)
      = lim_{x→c} (x² − c²) / (x − c)
      = lim_{x→c} ((x − c)(x + c)) / (x − c)
      = lim_{x→c} (x + c)
      = 2c
Forever after, we may plug in the desired value of c to get the derivative at c.
Most mathematicians don’t switch variables, so they’d call the derivative function
f′( x) instead of f′( c). This has the added advantage of displaying patterns in
derivative computations.
For example, if you compute the derivative of x⁴, you get 4x³, and the derivative of x⁸ is 8x⁷.
Theorem 8.7. For any real numbers x, c and any positive integer n,

x^n − c^n = (x − c)(x^(n−1) + x^(n−2)c + x^(n−3)c² + · · · + xc^(n−2) + c^(n−1)).

Proof. Start to multiply out the right-hand side and notice that each term, except the first and last, pairs off and sums to zero. In particular, you get

x^n + [−c · x^(n−1) + x · x^(n−2)c]
    + [−c · x^(n−2)c + x · x^(n−3)c²]
    ...
    + [−c · xc^(n−2) + x · c^(n−1)]
    + (−c · c^(n−1))

Each bracketed pair is zero, leaving only x^n − c^n.
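Before trusting the telescoping argument, the factorization x^n − c^n = (x − c)(x^(n−1) + x^(n−2)c + · · · + c^(n−1)) can be spot-checked numerically; a quick sketch with integer inputs (so the arithmetic is exact):

```python
def factored_form(x, c, n):
    # (x - c) * (x^(n-1) + x^(n-2) c + ... + x c^(n-2) + c^(n-1))
    return (x - c) * sum(x**(n - 1 - i) * c**i for i in range(n))

for n in [1, 2, 5, 8]:
    for (x, c) in [(2, 1), (3, -2), (7, 7)]:
        assert factored_form(x, c, n) == x**n - c**n
```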
At this point in a standard calculus course, a student would spend a few weeks (or
months) learning:
2. When given two functions f, g whose derivatives you know separately, how to
compute the derivative of an elementary combination of f and g, such as f + 3 g and
f( g( x)).
3. How to use special values of the derivative (such as zero) to find maxima and
minima of various functions, such as maximizing profit from selling a widget
subject to costs for creating certain variations of that widget.
Because this book can only give you a taste of calculus, and because we’re rushing
to an interesting application, we’ll skip most of this in favor of stating the
facts that are, in my view, the most important for applications.
Let F be the set of all functions R → R that have derivatives. Let D be the
function that takes as input a function f and produces as output its derivative f′.
Note the domain of D is F , but its codomain is not F because some differentiable
functions are not twice differentiable.
As a function, “cf” is the function that takes as input x and produces as output c
· f( x).
consistent notation like (d/dx)(f) versus df/dx, and forces one to choose a variable name x. In my opinion, this notation exists for bad reasons: backwards compatibility with legacy math, and trying to trick you into thinking that derivatives are fractions so you'll guess the forthcoming chain rule. But it is too widespread to avoid.
Theorem 8.9 immediately lets us compute the derivative of any polynomial, because
we can use Theorem 8.8 to compute the derivatives of each term and add them up.
E.g., the derivative of 3 + 2x − 5x³ is 2 − 15x². Quick spot check exercise: using intuition, reason that a constant function like f(x) = 3 has derivative f′(x) = 0. If your intuition fails you, use the definition of the limit to compute it.
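The "derivative of any polynomial" observation is mechanical enough to write as code. A sketch, representing a polynomial by its coefficient list (an assumed convention here: lowest degree first):

```python
def polynomial_derivative(coefficients):
    # coefficients[i] is the coefficient of x^i; apply the power rule termwise.
    return [i * c for i, c in enumerate(coefficients)][1:]

# 3 + 2x - 5x^3  has derivative  2 - 15x^2
assert polynomial_derivative([3, 2, 0, -5]) == [2, 0, -15]
# A constant function has derivative zero (the empty list here).
assert polynomial_derivative([3]) == []
```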
10 I sneer, but if you're serious about mathematics then at some point you need to become intimately familiar with the specific derivatives of elementary functions. This book is not the place for that, and I suspect many of my readers will have seen calculus at least once before, and know how to google "derivative of arccos(x)."
The other crucial fact, which we’ll use later, is the chain rule.
Theorem 8.10 (The chain rule). Let f, g : R → R be two functions which have derivatives. Then the composition h(x) = f(g(x)) also has a derivative, and h′(x) = f′(g(x)) · g′(x).
In the chapter exercises you'll look up a proof of this theorem. The chain rule makes it easy to compute derivatives that would otherwise require a lot of algebra, such as (x² + 1)^50, whose derivative is 50(x² + 1)^49 · (2x). The chain rule also lets us compute derivatives that would otherwise be
completely mysterious, such as that of sin(e^x). If you're told what the derivatives of sin(x) and e^x are separately, then you can compute the derivative of the composition.
As a notational side note, let me explain the “fractions make you guess the chain
rule”
remark. Call h(x) = f(g(x)). Then if we use the fraction notation dh/dx for the derivative of h, the standard way to write the chain rule for this would be dh/dx = (dh/dg) · (dg/dx). The "hint"
of the notation is that if you’re a reckless miscreant, you might jump to the
conclusion that the dg’s “cancel” like fractions do. Rest assured that is not how
it works, but calculus students the world over are encouraged to do it this way
because the resulting rule is correct. We’ll return to this in Chapter 14.
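The chain rule for sin(e^x) can be checked against a numerical difference quotient. A sketch, taking as given that sin′ = cos and that e^x is its own derivative:

```python
import math

def h(x):
    return math.sin(math.exp(x))

def h_prime(x):
    # Chain rule: the derivative of sin(e^x) is cos(e^x) * e^x.
    return math.cos(math.exp(x)) * math.exp(x)

for x in [0.0, 0.5, 1.0]:
    step = 1e-6
    quotient = (h(x + step) - h(x - step)) / (2 * step)
    assert abs(quotient - h_prime(x)) < 1e-5
```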
Approximation by a Line
If you got ten mathematicians in a room they’d come up with twenty different ways
to motivate calculus. In this chapter we used, “generalize the slope of a line to
curvy things,”
but here’s another. One prevalent idea is to take a complicated thing and
approximate it by simpler things. Without calculus, the simplest function we fully
understand is a straight line. So we might ask, “Given a function f : R → R and a
point x ∈ R at which f is differentiable, what line best approximates f at x?”
If you define “best approximates” in a particular but reasonable way, the answer to
this question uniquely defines the derivative. Call L( x) the line approximation of
f we get using the derivative of f at x = c. That is, L( x) = f′( c)( x − c) +
f( c). This is just the line passing through ( c, f( c)) with slope f′( c), often
called the “tangent line” to f at c.
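The tangent line is a one-liner in code. A sketch using f(x) = x² (whose derivative 2x we computed above, so f′(3) = 6):

```python
def tangent_line(f_prime_at_c, f_at_c, c):
    # L(x) = f'(c)(x - c) + f(c): the line through (c, f(c)) with slope f'(c).
    return lambda x: f_prime_at_c * (x - c) + f_at_c

f = lambda x: x**2
L = tangent_line(6, 9, 3)   # f'(3) = 6, f(3) = 9
assert L(3) == 9                          # passes through (c, f(c))
assert abs(L(3.01) - f(3.01)) < 1e-3      # error near c is (x - c)^2
```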
Figure 8.8: The line between A and A′ does not approximate f well close to A.
in the statement. Sadly, this doesn’t work. Take our example from earlier,
replotted in Figure 8.8. There, the line between A and A′ is not the tangent line
at A, and it is also far closer to f at A′ than the tangent line would be. However,
for points close to A, the tangent line is a much better approximator. If we’re
trying to approximate f “at” A, we care more about points closer to A than points
far from A. Here’s how we make this clear in the math.
Take any line K( x) that is supposedly challenging the tangent line for the title
of
“best approximating line of f at x = c.” Then I claim I can choose a small enough
interval around c (the width of this interval depends on the features of the
challenger K) so that L
beats K on all points in this interval. Here’s the formal theorem I’ll prove
momentarily.
Notation time: people often write the set of points {x ∈ R : |x − c| < ε} using the
“open interval” notation ( c − ε, c + ε). They also often call this an epsilon-ball
around c. Using this, the last sentence of the theorem might read, “For all x ∈ ( c
− ε, c + ε), it holds that |L( x) − f( x) | ≤ |K( x) − f( x) |.” This makes the
statement clearer. Instead of saying “if this then that,” you’re saying what you
want to say outright, that “FOO is always true in my domain of interest.”
|f′(c) − (f(x) − f(c))/(x − c)| ≤ |m − (f(x) − f(c))/(x − c)|

The fraction (f(x) − f(c))/(x − c), which appears on both sides, is most of the definition of the derivative, missing only the limit. And f′(c) is the value of that limit, whereas m is some other number. This should already make it pretty clear that the inequality above holds, but let's prove it formally by contradiction.
Suppose to the contrary that no matter which ε I choose, there is some x in (c − ε, c + ε) that contradicts the inequality above. I would like to pick a sequence of x values going to c that violates the definition of the derivative. I will do that by picking a sequence of ε's, using the supposed hypothesis that the inequality above is false for every ε, and arriving at the sequence of x's needed for my contradiction. Let εn = 1/n, and let xn be a choice of x in (c − 1/n, c + 1/n) for which the inequality fails. Then xn → c, so the slopes (f(xn) − f(c))/(xn − c) must converge to f′(c). The contradictory hypothesis says it's closer to m instead. This contradicts the definition of the derivative.
We have proved that derivatives provide the best linear approximation to a function
at a point for a concrete sense of “best.” This raises a natural question. Can we
improve this approximation by using more complicated functions than lines? The
answer is yes.
Taylor Polynomials
One nice thing about polynomials is that they have a grading. By that I mean, if
you increase the degree of your polynomial, you can express a wider variety of
functions. In principle, higher degree allows a polynomial to express more
complexity, and produces better approximations of f.
You can derive exactly how this works by following the steps of Theorem 8.11, and
asking for a degree at most 2 polynomial whose derivative best approximates f′
close to a.
Suppose our candidate is the following (where below q∗ ∈ R is the unknown parameter we must set to get a degree 2 polynomial).

p(x) = f(a) + f′(a)(x − a) + q∗(x − a)²

Taking the derivative of p gives
p′( x) = 2 q∗( x − a) + f ′( a) .
Plugging in x = a leaves only f′( a). In the same way, in Theorem 8.11 we couldn’t
avoid using f( a) for the constant term because the line had to pass through ( a,
f( a)).
And so if we want to optimize p′(x) by choosing q∗, it's almost exactly the same proof as Theorem 8.11, with the difference being an extra factor of 2. We'll leave it as an exercise for the reader to redo the steps, but at the end you get q∗ = f′′(a)/2, where f′′ is the second derivative of f.
Two quick asides. First, the attempt to use the second derivative only makes sense
if f has a first derivative at that point, and as we saw not all functions have
derivatives at all points. Second, adding more and more primes to denote repeated
applications of the derivative operation is cumbersome. Rather, it’s customary to
use a parenthetical superscript notation f( n)( x) for the n-th derivative of f.
You call a function n-times differentiable if it has n derivatives at every point.
If f has infinitely many derivatives (i.e., it is n-times differentiable for every
n ∈ N), f is called smooth. The typical example of a smooth function is sin(x) or 2^x. A default modeling assumption is that life is smooth, and when it's not you
pay very close attention.
p(x) = f(a) + f′(a)(x − a) + (f′′(a)/2)(x − a)²
A proof by induction, which the reader should finish (we just did the step from n =
1
to n = 2 which has all the features of the general induction), extends the Baby
Taylor Theorem to the Adolescent Taylor Theorem. Note that by n! we mean the
factorial function n 7→ n · ( n − 1) · ( n − 2) · · · · · 2 · 1 where n is a
positive integer. We’re not merely excited about n, though it is bittersweet to
have watched n grow up so fast.
p(x) = f(a) + ∑_{n=1}^{d} (f^(n)(a) / n!) (x − a)^n
As if possessed by the spirit of Leonhard Euler, we write down examples. Here are
the first three terms of the general summation.
f(a) + (f′(a)/1!)(x − a) + (f^(2)(a)/2!)(x − a)² + (f^(3)(a)/3!)(x − a)³
To have an example that's not already a polynomial, let f(x) = e^x. Recall or learn now that the derivative of e^x is also e^x. In fact, the number e is uniquely defined by this property. Then the degree 4 Taylor polynomial for e^x at x = 0 is particularly simple because e^0 is 1 in every term:
1 + x + x²/2 + x³/6 + x⁴/24
Figure 8.9 contains a picture of e^x and its approximation by the degree 4 Taylor polynomial. The approximation is faithful to the original function, but only close to x = 0.
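The quality of the approximation near 0, and its decay far from 0, is easy to check numerically; a sketch:

```python
import math

def taylor_exp_degree_4(x):
    # 1 + x + x^2/2! + x^3/3! + x^4/4!
    return 1 + x + x**2 / 2 + x**3 / 6 + x**4 / 24

assert abs(taylor_exp_degree_4(0.1) - math.exp(0.1)) < 1e-6   # faithful near 0
assert abs(taylor_exp_degree_4(3) - math.exp(3)) > 1          # poor far away
```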
The Taylor polynomial is one of the most often used applications of mathematics to
itself. The reason is because when you’re analyzing a mathematical problem, it’s
easy to define functions with convoluted behavior. One example of this is in
machine learning, when you analyze the probability that a classifier is wrong. You
can often write down the probability as a massive product, but can’t compute it
exactly. Instead, one often uses a small-degree Taylor polynomial to approximate
it. With knowledge of whether the Taylor polynomial is an over- or under-
approximation of the truth, one can bound the complicated behavior enough to prove,
for example, that the classification error decreases with more data.
Theorem 8.13 seems to show us that every function can be approximated arbitrarily
well using polynomials. As useful as polynomials are, it turns out this is not
entirely true.
Let’s say we’re working with a function where the polynomial approximation does get
progressively better at higher degrees. If you’re in the proper mindset for
calculus, you naturally ask what happens in the limit? If I call pk the degree k
Taylor polynomial for f at a = 0, how can we make sense of the expression
lim_{k→∞} pk(x) . . . ?
Remember, we only defined what it means for a sequence of numbers to converge, but
this is a sequence of functions R → R. Convergence of functions requires a
definition of what it means for two functions to be “close” together, which has
subtleties beyond the scope of this chapter. But suppose we did that and we can
make sense of this expression,
we’d hope that this limit was also equal to f, at least when x is sufficiently
close to 0. This expression, the limit of Taylor polynomials, is called the Taylor
series of f at that point.
Mathematics is not so kind to us here. There are certain simple functions for which
the Taylor series breaks down in certain regions. In particular, if f( x) = log(1 +
x) and you compute the limit at a = 0, the resulting function would only be equal
to f( x) between x = − 1 and x = 1. When x > 1 the sequence does not converge, even
though log(1 + x) exists for x > 1. In that case, you have to compute a different
Taylor series at, say, a = 2.
The complete function is then joined together piece-wise by enough Taylor series
pieces until you get the whole function. The functions which can be reconstructed
in this way (and aren’t sensitive to which points you choose within a region, again
in the interest of well-definition) are called analytic functions.13
There are somewhat natural functions that fail to accommodate Taylor series worse than the logarithm. Let f(x) = 2^(−1/x²) when x ≠ 0, and let f(0) = 0. Figure 8.10
contains a plot of this function. You will prove in Exercise 8.11 that f^(n)(0) = 0 for every n ∈ N. As a consequence, all of its Taylor polynomials at x = 0 are the zero function, and the "limit function" should be the constant zero function.14 In this case, the Taylor series tells you nothing about the function except its value at x = 0. Polynomials aren't able to express what f looks like near zero.

13 There is a more rigorous way to say "not sensitive to the points you choose," which is to say that computing the Taylor series of f at every input a in the domain of f converges to f in some open set around a. Defining an "open" set is another can of worms, but for most functions R → R this just means "any interval containing a." This can fail, e.g., when the Taylor series at a only equals f at a finite set of other points.
This highlights the shortcomings of Taylor polynomials. They’re not the perfect
tool for every job. It also leads us to ask why, for this mildly pathological f,
the Taylor series fails so spectacularly. Complex analysis provides a satisfactory
answer, but the subject is unfortunately beyond the scope of this book.
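The flatness driving this failure can be seen numerically. A sketch using f(x) = 2^(−1/x²), the function plotted in Figure 8.10: near zero it is smaller than any fixed power of x, so no polynomial term can track it.

```python
def f(x):
    # The flat function: 2^(-1/x^2) for x != 0, with f(0) = 0.
    return 2 ** (-1 / x**2) if x != 0 else 0.0

# Even x^10 dwarfs f(x) as x approaches 0.
for x in [0.2, 0.1, 0.05]:
    assert 0 < f(x) < x**10
```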
8.5 Remainders
The Adolescent Taylor Theorem tells us how to compute the best polynomial of a
given degree that approximates the behavior of a function. In fact, it approximates
the behavior of a function’s “slope” (first derivative) and more informally its
curvature (higher derivatives), provided you’re willing to compute enough terms.
The Adolescent Taylor Theorem, however, doesn’t allow us to quantify how good the
approximation is. As we just saw, there are pesky functions whose Taylor
polynomials at certain rotten points are all zero. They’re so flat they tricked the
poor polynomial!
As you might have guessed, there is an Adult Taylor Theorem—just called the Taylor
Theorem—which gets one much closer to quantifying the error of the Taylor
polynomial.
Unfortunately, the proof of this theorem requires the Mean Value Theorem, which
does not fit in this book, but we can state the Taylor theorem easily enough.
f(x) = pd(x) + (f^(d+1)(z) / (d + 1)!) (x − a)^(d+1)
In words, the exact value of f( x) can be computed from the Taylor polynomial
pd( x) plus a remainder term involving a magical z plugged into the ( d+1)-st
derivative instead of x.
The dependence of the variables on each other are a bit confusing. Let’s make it
explicit with some pseudocode. In particular, the needed value of z depends on the
specific input x.
def f_at(f, d, a, x):
    '''Arguments: a function f, a degree d, a center a, and an input x.'''
    p = taylor_polynomial(f, d, a)
    z = magical_z(f, d, a, x)  # magical_z, nth_derivative: hypothetical helpers
    return p(x) + nth_derivative(f, d + 1)(z) * (x - a)**(d + 1) / factorial(d + 1)
Figure 8.11: A function whose root does not have a nice formula.
Let’s say you have a function f( x) and you want to find its zeros,15 that is, an
input r producing f(r) = 0. Let's also say that you can compute both f(x) and f′(x) at any given input. An example of such a function is x⁵ − x − 1. Try to algebraically solve for f(x) = 0, if you dare. On the other hand, f′(x) = 5x⁴ − 1 is simple enough to compute.16
Figure 8.11 contains a plot of f( x). The root is just under 1 . 2, but coming up
with an algebraic formula for the root in terms of the coefficients is impossible
in general (this is a deep theorem known as the Abel-Ruffini theorem).
One idea that should feel very natural by this point is to approximate the root of
f by starting with some value close to the root (which we can guess), and
progressively improving it. In theory, we want to find a sequence x 1 , x 2 , . . .
, such that lim n→∞ xn = r, where f( r) = 0.
One initial thought is obvious: perform a binary search. That is, pick two guesses
c, d, where f( c) < 0 < f( d), and then let your improved guess be the midpoint ( c
+ d)/2, updating your upper and lower search bounds in the obvious way depending on
whether 15 For polynomials, zeros are sometimes called roots, and I will use these
terms interchangeably.
Binary search does produce a sequence approaching a root of f, but it turns out to
be much slower than the forthcoming Newton’s method.17 In Newton’s method you
choose your next guess xn+1 depending on the derivative of f at xn. To convince you that this could be faster than binary search, suppose you chose bad bounds for binary search as in Figure 8.12.
The tangent line at the point ( d, f( d)) intersects the x-axis quite close to the
root, whereas the midpoint between c and d is rather far away. A binary search
would slowly approach the root from the left, whereas the tangent line guides us
close to the root in the first step.
If this isn’t convincing enough, we can provide something much better: a proof. But
first, we have to make the algorithm explicit. Phrased geometrically, start from
some intermediary x-value guess, calling it xn for the n-th step in the algorithm.
Draw the tangent line at xn, which is y = f( xn)+ f′( xn)( x−xn), and let xn+1 be
the intersection of this line with the x-axis. This is illustrated in Figure 8.13.
To find the intersection point, set y = 0 in the equation for the tangent line, and solve for x:

0 = f(xn) + f′(xn)(x − xn)
0 = f(xn)/f′(xn) + (x − xn)
x = xn − f(xn)/f′(xn)
def newton_sequence(f, f_derivative, starting_x):  # a generator of guesses
    x = starting_x
    while True:
        yield x
        x -= f(x) / f_derivative(x)
Obviously, if f′( xn) = 0 then we’re dividing by zero which is highly embarrassing.
So let’s assume f′( xn) ̸= 0, i.e., that the tangent line to f is never horizontal,
and we’ll make this formal in a moment.
When Taylor’s theorem is your hammer, the world is full of nails. It takes no
inspiration to come up with this algorithm. As we’ll see in the proof below,
literally all you do is rearrange the degree 1 Taylor polynomial and squint at the
remainder. Still, without going through the proof it’s not entirely clear that
Newton’s method should outperform binary search, other than the fuzzy reasoning
that an algorithm that somehow uses the derivative should do better than one that
does not.
Indeed, we’ll wield a Taylor polynomial like a paring knife to prove Newton’s
method works. The theorem says that not only does xn converge to a root r of f, but
that if x 1
starts close enough, then in every step the number of correct digits roughly
doubles. That is, the error in step n + 1, which is |xn+1 − r|, is roughly the
square of the error in step n, i.e. |xn − r| 2. Binary search, on the other hand,
improves by only a constant number of digits in each step.
This theorem we’ll treat like a cumulative review of proof reading. That is, we’ll
be more terse than usual and it’s your job to read it slowly, parse the individual
bits, and generate tests if you don’t understand part of it.
Let f : R → R be a function which is “nice enough” (it has some properties we’ll
explain after the proof). Let r ∈ R be a root of f inside a known interval c < r <
d, and pick a starting value x1 in that interval. Define x2, x3, . . . using the formula xn+1 = xn − f(xn)/f′(xn).
Theorem 8.15 (Convergence of Newton's Method). For every k ∈ N, the error ek+1 ≤ C · (ek)², where

C = max_{c ≤ z, y ≤ d} |f′′(z)| / (2|f′(y)|)
In other words, the error of Newton’s method vanishes quadratically fast in the
number of steps of the algorithm.
Proof. Fix step k. Compute the degree 1 Taylor polynomial for f at xk. This is
exactly the tangent line to f at xk. Use that Taylor polynomial to approximate
f( r), the value of f at the unknown root r.
Recall we want to analyze the error of the approximation ek+1 = |xk+1 − r|, so at some point we must use the formula for xk+1 in terms of xk. The next three steps are purely algebraic rearrangements to enable this.
Taylor's theorem applied at xk and evaluated at the root r says 0 = f(r) = f(xk) + f′(xk)(r − xk) + (f′′(z)/2)(r − xk)² for some z between xk and r. Dividing through by f′(xk) and rearranging:

0 = f(xk)/f′(xk) + (r − xk) + (f′′(z)/(2f′(xk)))(r − xk)²

xk − f(xk)/f′(xk) − r = (f′′(z)/(2f′(xk)))(r − xk)²

ek+1 = |xk+1 − r| = (|f′′(z)|/(2|f′(xk)|))(ek)² ≤ C(ek)²
Despite all the algebraic brouhaha in the proof above, all we did was take some value x = xk (though calling it xk was only relevant in hindsight), write down the degree 1 Taylor polynomial with its remainder, and rearrange.
Speak of the devil! The proof allows us to identify the requirements of a “nice
enough”
function:
• f′( x) can never be zero between c and d, except possibly at the root r itself,
in which case you can check to see if f( x) = 0 at each step to avoid the edge case
of hitting r exactly. Otherwise we risk dividing by zero, or worse, getting stuck
in a loop (as we’ll see in the example below).
• f has to have first and second derivatives everywhere between c and d. Otherwise
the claims in the proof that use those values are false.
• f′( x) should never be very close to zero, and f′′( x) should never be very far
from zero, or else C will be impractically large.
Now let's run some code implementing Newton's method for f(x) = x⁵ − x − 1.
THRESHOLD = 1e-12

def newton_sequence(f, f_derivative, starting_x):
    x = starting_x
    function_at_x = f(x)
    while abs(function_at_x) > THRESHOLD:
        yield x
        x -= function_at_x / f_derivative(x)
        function_at_x = f(x)

def f(x):
    return x**5 - x - 1

def f_derivative(x):
    return 5 * x**4 - 1

starting_x = 1
approximation = newton_sequence(f, f_derivative, starting_x)
i = 0
for x in approximation:
    print((x, f(x)))
    i += 1
    if i == 100:
        break
After only six iterations we have reached the limit of the display precision.
Figure 8.14: An example where the starting point of Newton’s method fails to
converge due to an unexpected loop.
(1, -1)
(1.25, 0.8017578125)
(1.1784593935169048, 0.09440284131467558)
(1.16753738939611, 0.001934298548380342)
(1.1673040828230083, 8.661229708994966e-07)
(1.1673039782614396, 1.7341683644644945e-13)
(1.1673039782614187, 6.661338147750939e-16)
(1.1673039782614187, 6.661338147750939e-16)
(1.1673039782614187, 6.661338147750939e-16)
Let’s see the same experiment with the starting_x changed to 0 instead of 1. This
is an input which, as you can see from Figure 8.14, drives Newton’s method in the
wrong direction! By the end of a hundred iterations, Newton’s method cycles between
three points:
...
(0.08335709970125815, -1.083353075191566)
(-1.0002575619492795, -1.001030911349579)
(-0.7503218281592572, -0.4874924386834848)
...
This behavior is allowed by Theorem 8.15, because in between the starting point and
the true root, the derivative f′( x) is zero, making the error bound C from Theorem
8.15
undefined (and indeed, unboundedly large for x values close to where f′( x) is
zero). Newton’s method is very powerful, but take care to choose a wise starting
point.
Newton’s method stirs up a mathematical hankering: why stop at the degree 1 Taylor
polynomial? Why not degree 2 or higher? All we did to “derive” Newton’s method was
take a random point, write down the degree 1 Taylor polynomial p( x), and solve
p( x) = 0.
By rearranging to isolate the error terms, we got the formula for xk+1 for free.
For degree 2, why not simply use the degree 2 Taylor polynomial instead?

p(x) = f(x_k) + f′(x_k)(x − x_k) + f′′(x_k)(x − x_k)^2 / 2!
There are two obstacles: (a) this polynomial might not even hit the x axis; it's trickier to nail down for quadratics than lines, and (b) even if it does, it might be hard to find the intersection, since finding roots is the problem we started with! Now the function we need to approximate a root for is the Taylor polynomial, and we don't know how to find its roots.
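Obstacle (a) can be seen concretely. The following is a hypothetical sketch (the helper name and example functions are mine, not the text's) that attempts a "degree-2 step" by applying the quadratic formula to the degree 2 Taylor polynomial:

```python
import math

# Try to solve p(x) = f(c) + f'(c)(x - c) + f''(c)(x - c)^2 / 2! = 0
# for the step t = x - c using the quadratic formula.
def degree2_step(f, df, ddf, c):
    a, b, k = ddf(c) / 2, df(c), f(c)
    disc = b * b - 4 * a * k
    if disc < 0:
        return None  # obstacle (a): the parabola never hits the x axis
    root = math.sqrt(disc)
    # choose the root giving the smaller step away from c
    t = min((-b + root) / (2 * a), (-b - root) / (2 * a), key=abs)
    return c + t

# f(x) = x^2 + 1: its degree 2 Taylor polynomial at 0 is itself,
# and it has no real root at all.
print(degree2_step(lambda x: x**2 + 1, lambda x: 2 * x, lambda x: 2.0, 0.0))

# f(x) = x^5 - x - 1 at c = 1: here the parabola does cross, and the
# step lands nearer the true root 1.1673... than Newton's step to 1.25.
quintic_step = degree2_step(
    lambda x: x**5 - x - 1, lambda x: 5 * x**4 - 1, lambda x: 20 * x**3, 1.0)
print(quintic_step)
```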
• Good definitions are designed to match a visual intuition while withstanding (or
excluding) pathological counterexamples.
• Much of the murkiness of calculus comes from the fact that it must support a long history of manual calculations and pathological counterexamples. The “normal” cases are far simpler.
8.8 Exercises
used as a number, but rather notation for the concept, “xn grows in magnitude
without bound.” This unifies it with the usual limit definition.
of a curve?
8.2. Prove the following basic facts using the definitions from Exercise 8.1.

1. lim_{x→a} (f(x) + g(x)) = lim_{x→a} f(x) + lim_{x→a} g(x), whenever both limits on the right exist.
2. lim_{x→a} f(x)g(x) = (lim_{x→a} f(x)) (lim_{x→a} g(x)), whenever both limits on the right exist.
3. Prove that a_n = 2^n / n^10 diverges as n → ∞.
5. Let xn be a sequence of real numbers. Suppose that for every ε > 0, there is an
N ∈ N (depending on ε), such that for every n, m > N it holds that |xn−xm| < ε.
Such a sequence is called a Cauchy sequence. Look up the statement of the Bolzano-
Weierstrass theorem and use it to prove that every Cauchy sequence converges.
8.3. Prove that the numeric value for the slope of a line doesn’t depend on the
choice of points.
8.4. Prove that the limit of a sequence, if it exists, is unique. In other words,
the limit L
does not change depending on the choices of ε and k used to satisfy the definition.
This justifies us calling it “the” limit of a sequence. Hint: Suppose you had an L
and an L′ that both worked, and prove that L = L′.
8.6. Compute the Taylor series for f(x) = e^(−2x), and compare this to the procedure of plugging z = −2x into the Taylor series for e^z. Find an explanation of why this works.
The formula

C(r) = Pr(1 + r)^N / ((1 + r)^N − 1)

gives the monthly interest payment on a loan with compounding monthly interest rate r, total number of months N, and principal P. In the December 1996 issue of Mathematics Magazine, Peyman Milanfar described a slightly modified version of the linear Taylor approximation, C*(r) ≈ (1/N)(P + (1/2)PNr), which has been used by Persian merchants for hundreds of years to compute monthly loan payments in their heads. Compare C*(r) with the exact degree-1 Taylor polynomial for C(r). What is the error of the Persian method? Under what conditions on N and P is the Persian method accurate?
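For a numeric feel, here is a sketch comparing the exact payment C(r) = Pr(1 + r)^N / ((1 + r)^N − 1) with the Persian rule C*(r) = (1/N)(P + (1/2)PNr); the loan parameters below are made up for illustration:

```python
# Exact monthly payment versus the Persian mental approximation.
def C(r, P, N):
    return P * r * (1 + r) ** N / ((1 + r) ** N - 1)

def C_star(r, P, N):
    return (P + P * N * r / 2) / N

P, N = 100_000, 360  # hypothetical principal and 30-year term
for r in [0.001, 0.003, 0.005]:  # monthly interest rates
    exact, approx = C(r, P, N), C_star(r, P, N)
    print(r, round(exact, 2), round(approx, 2), round(abs(exact - approx), 2))
```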
8.9. There are some functions which are challenging to compute limits for, but they aren't considered “pathological.” One particularly famous function is

f(x) = x sin(1/x).

Compute the limit for this function as x → 0. The difficulty is that sin(1/x) is not defined at x = 0, and algebra doesn't provide a way to simplify sin(1/x). Instead, you have to use “common sense” reasoning about the sine function. This common-sense reasoning is made rigorous by the so-called Squeeze Theorem. Look it up after trying this problem. A plot of f will help.
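A quick numeric sanity check of the squeeze idea (a sketch, not a proof): since |sin(1/x)| ≤ 1, the values of f are trapped between −|x| and |x|.

```python
import math

# |f(x)| <= |x|, so f(x) is forced toward 0 as x -> 0.
f = lambda x: x * math.sin(1 / x)
for x in [0.1, 0.01, 0.001, 1e-6]:
    print(x, f(x), abs(f(x)) <= abs(x))
```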
8.10. Find a differentiable function f : R → R with the property that lim x→∞ f( x)
= 0, but lim x→∞ f′( x) does not exist.
8.11. Define the function

f(x) = { 2^(−1/x^2)  if x ≠ 0
       { 0           if x = 0
This function has derivatives of all orders at x = 0, and despite the fact that
f( x) is not flat, all of its derivatives are zero at x = 0. Prove this or look up
a proof, as the computation is quite involved. These functions are sometimes called
flat functions, since they’re literally so flat that they avoid detection of any
curvature by derivatives.
8.12. There are two definitions of the number e. One is the number used as an
exponent base in the exponential function ex, for which the derivative of ex is ex.
The other is

e = lim_{n→∞} (1 + 1/n)^n.

First, prove the somewhat surprising fact that this limit is not equal to 1. Second, understand why these two definitions result in the same quantity.
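A few terms of the limit, computed numerically (sketch):

```python
# The base 1 + 1/n tends to 1, but the n-th power compensates:
# the values approach e = 2.71828..., not 1.
for n in [1, 10, 100, 10_000, 1_000_000]:
    print(n, (1 + 1 / n) ** n)
```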
8.14. Look up a proof of the chain rule on the internet, and try to understand it.
Note that there are many proofs, so if you can’t understand one try to find
another. Come up with a good geometric interpretation.
8.15. Write a program that implements the binary search root-finding algorithm and
compare its empirical convergence to Newton’s method. Find an example input for
which (gasp!) they have the same convergence rate, and analyze the statement of
Theorem 8.15.
8.16. Look up a proof of the Taylor theorem, which may depend on other theorems in
single-variable calculus like Rolle’s theorem or the Intermediate Value Theorem.
8.17. Look up an exposition of the degree-2 Householder method for finding roots of
differentiable functions, and implement it in code.
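The degree-2 Householder iteration is often written as Halley's method, with update x ← x − 2ff′ / (2f′^2 − ff′′). Here is a sketch under that assumption, applied to the chapter's running example:

```python
# Halley's method (degree-2 Householder) for f(x) = x^5 - x - 1.
def f(x):
    return x**5 - x - 1

def df(x):
    return 5 * x**4 - 1

def ddf(x):
    return 20 * x**3

x = 1.0
for _ in range(10):
    fx, d, dd = f(x), df(x), ddf(x)
    x -= 2 * fx * d / (2 * d * d - fx * dd)

print(x, f(x))  # converges to the root 1.1673039782614187
```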
8.18. In the chapter I mentioned that parts of calculus and real analysis are
formalized in such a way that maintains backwards compatibility with “legacy math.”
The experienced programmer might protest: why not redesign analysis from scratch to
avoid that?
This has been done, and the field of nonstandard analysis is one such redesign.
Look up an introductory exposition about nonstandard analysis and identify where it
becomes backwards incompatible with standard calculus.
Chapter 9

On Types and Tail Calls
By relieving the brain of all unnecessary work, a good notation sets it free to
concentrate on more advanced problems, and in effect increases the mental power of
the race.
– Alfred Whitehead
There are two topics I want to discuss in this chapter that don’t fit elsewhere in
the progression of the book. First, on how the organizational structure of a proof
can guide the reader’s attention. Second, on equivalence relations and quotients,
the standard abstraction for building and representing complicated mathematical
spaces. Both are new ways to reduce the reader's cognitive burden by hiding
technicalities. The latter will also prepare us for the use of equivalence
relations through the rest of the book.
The recipe for doing this is taught in most undergraduate calculus courses. It reduces the optimal parameter choice from a continuum of options to a discrete set to check by hand. Define f : R → R whose input is the parameter of interest, and whose output you'd like to minimize (maximizing is analogous). Then compute the derivative f′, solve f′(x) = 0, and check the resulting critical points, along with the endpoints of the restricted input range,1 to find the best candidate.
1 If you don’t want to restrict to a range, you have to worry about the limiting
behavior of f as the input tends to ±∞. When f blows up to ∞ or −∞, these are sort
of “trivial” optima, as well as being unattainable by a fixed input. But if, for
example, you can compute that both infinite limits are −∞, then that leaves open
the possibility of a finite global maximum.
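The recipe can be sketched in code; the function and interval below are my own hypothetical example, not from the text:

```python
# Minimize f(x) = x^3 - 3x on the interval [0, 2] by the calculus recipe:
# find where f'(x) = 0, then check critical points and endpoints by hand.
def f(x):
    return x**3 - 3 * x

def f_prime(x):
    return 3 * x**2 - 3

def bisect(g, lo, hi, tol=1e-12):
    # find a root of g in [lo, hi], assuming g changes sign there
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

critical = bisect(f_prime, 0, 2)   # f'(x) = 0 at x = 1
candidates = [0.0, 2.0, critical]  # a continuum reduced to three points
best = min(candidates, key=f)
print(best, f(best))  # the minimum is at x = 1, where f(1) = -2
```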
The analysis of an algorithm using the above recipe is so routine that authors
seldom remark on it. In research papers they often skip the entire argument
assuming the reader will recognize it. Life is similar for the ubiquitous Taylor
polynomial. Such brevity can seem like malicious obfuscation, but it makes sense as
a cognitive “tail call optimization”
for proofs. 2
The core of the proof is the primary focus. It requires all your working memory.
Optimizing a parameter using standard tools is easy once you’ve done it enough
times. Leaving it to the end compartmentalizes the two jobs. Big picture
comprehension first, and rote computation last. Indeed, the ability to maximize an
elementary function rarely depends on memory of how you created that function, so
why not shed a few mental stack frames while you do the real work?
This is also a justification for why one might write the statement of a theorem
like we did in the last chapter.
Theorem (Convergence of Newton's Method). For every k ∈ N, the error e_{k+1} ≤ C e_k^2, where

C = max_{c ≤ z, y ≤ d} |f′′(z)| / (2|f′(y)|).
Now let’s move on to discuss two technical tools for making complicated types
(realized as sets): equivalence relations and quotients.
Since quotients are often formed from set products, let’s briefly review. The
direct product of sets, A × B, is the most common mathematical way to make a
compound data type. It’s the set {( a, b) : a ∈ A, b ∈ B}. To reiterate from
Chapter 4, if we repeat this operation, we tend to ignore the nested grouping of
tuples, so that A × B × C is viewed as a set of triples (a, b, c) rather than nested pairs.

2 For unfamiliar readers, tail call optimization is a feature of certain programming languages whereby a function whose last operation is a recursive call can actually shed its stack frame. It doesn't need it because there is no work left after the recursive call but to return. In this way, functions written in tail-call style will never cause a stack overflow.
In your mind you can replace f( a, b) = 1 with “a and b are equivalent.” A more
common notation for this is a squiggle ∼, so that a ∼ b if and only if f( a, b) =
1, with a ̸∼ b if f ( a, b) = 0. The squiggle reminds one of the equal sign without
asserting that it’s an equivalence relation before it’s proved to be.
To define an equivalence relation is to say, “Here are the terms by which I want to
think of different things as the same.” We are essentially overloading equality
with a specific implementation. As long as the equivalence relation satisfies these
three properties, you can rest assured it has the most important properties of the
equality operator.
But 1/2 is not equivalent to 1. We call the set of all things equivalent to one
object an equivalence class. So in this case Z is an equivalence class, as is the
set of half-fractions
Back to our example with R, the quotient R/ ∼ has a simpler representation. Since
equivalence classes partition R, and every real number shows up in some equivalence
class, we can identify each equivalence class in R/ ∼ with our favorite
“representative”
3 Most math books introduce the generic notion of a relation, and then use
relations to define functions. We’ll instead use functions as the primitive type
and jump straight to an equivalence relation without defining relations at all.
Concretely, let's choose the representative from each class in R/∼ that's between 0 and 1:

R/∼ = {[x] : 0 ≤ x < 1}.
Curious plants spring from fertile soil. In this world [1 + 1] = [0], and a sequence which diverges in R converges here: x_n = [n + 1/2^(n+1)], which in R/∼ is the same as [1/2^(n+1)], a sequence converging to [0].
different things to be the same in a principled manner. You override equality, show
it meets standards of decency, and then introduce it to your friends.
We can now make the “ignoring” of nested pairs in the set product rigorous. Define
the sets
For example, 8^3000 = 64^1500 ≡ 1^1500 = 1 mod 9. This tells you that 8^3000 is one plus a multiple of 9. Similar tricks with conveniently chosen moduli can extract useful information about 8^3000 without computing it exactly, such as the last few digits of the number in base 10.
useful tool when studying equations of integer variables is to recognize that if an
equation has a solution, then the same solution must exist if the equation is
considered mod n for any n.
You can also freely choose the most advantageous equivalence class representative
for your task, possibly easing computation. It’s similar to the programmer’s adage:
work hard now to allow yourself to be lazy later. Mathematicians are well practiced
in that philosophy.
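These tricks are easy to sanity-check in code, a sketch using Python's built-in modular exponentiation:

```python
# 8^3000 = 64^1500, and 64 is congruent to 1 mod 9, so the huge power
# is congruent to 1 mod 9 -- one plus a multiple of 9.
assert pow(8, 3000, 9) == 1
assert pow(64, 1500, 9) == 1

# A conveniently chosen modulus extracts the last two base-10 digits
# without writing out all of the digits of 8^3000.
print(pow(8, 3000, 100))
```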
Chapter 10
Linear Algebra
There is hardly any theory which is more elementary [than linear algebra], in spite
of the fact that generations of professors and textbook writers have obscured its
simplicity by preposterous calculations with matrices.
– Jean Dieudonné
For a long time mathematicians focused on studying interesting sets, like numbers
and solutions to various equations. In Chapter 6 we saw graphs, which are
interesting kinds of sets. In Chapter 8 we saw sets of numbers (sequences) and sets
of pairs of numbers (functions R → R). One could spend a lifetime studying
interesting graphs or interesting sets of numbers. However, more recent trends in
mathematics have shifted the main focus from studying sets with interesting
structure to studying functions with interesting structure. 1
To ease into it, let’s first consider the familiar concept of a compiler. A
compiler is a function mapping the set of programs in a source language to the set
of programs in a target language, often assembly. A compiler preserves the semantic
behavior of a valid input program in the target language when you run it. In that
sense, it preserves the structure of the input by representing that structure
appropriately in the codomain.
In practice, though, the semantics of a source program end up being defined by the behavior of the compiler. This is never more visible than when dealing with language forms that have “undefined behavior.” Different compilers, run on the same source, produce programs that behave differently. Languages like C, in which behavior can vary depending on the arbitrary contents of uninitialized memory, widen such pitfalls.
This isn’t how we want to work with programs. We want to consider programs in their
most natural environment, the semantics defined by a language’s documentation.
Chapter 9). This allows us to identify and isolate structure in new settings, and
mentally disregard impertinent information.
The vector space, which encompasses mathematical objects with a linear structure,
is a foundational example. It’s the basic object of study in linear algebra. The
main tool that we use to relate two vector spaces is the linear map. As we will
see, linear maps have a useful computational representation called matrices
(singular, matrix). Matrices are
The definition of a linear map requires a bit of groundwork to nail down precisely, but the crucial underlying intuition is simple. A function f : A → B is called linear if the following identity2 is always true, no matter what x, y ∈ A are:

f(x + y) = f(x) + f(y)

Simple, yet something is missing. Take a moment to identify what that is.
The problem is that we don’t know what “+” means in this context. Because I used
the + symbol you may have guessed that A and B are sets of numbers, but this need
not be the case. Instead, we’ll isolate the important properties of addition, and
the result will be called a vector space. Any set can be a vector space, and we
call the elements of a vector space vectors. One defines a + operation and
establishes that the isolated addition properties hold.
Now we can define a vector space. The gist is that vectors can be any type, scalars
must be nicely-behaved numbers, and almost every arithmetic identity you expect to
be true is true, so long as you formally prove the axioms according to this
definition. The only missing thing is that vector spaces don't have multiplication or division of vectors by other vectors. Moreover, the concepts defined here, particularly the zero vector and additive inverses, can be proven to be unique from the definition. You will do this in the exercises, and it justifies the use of the notation post hoc.

2 We also need preservation of scalar multiples, but we are in inspiration mode. The formal definition is in Section 10.2
Definition 10.1. A set V is called a vector space over R if it has two operations + and ·, where + : V × V → V adds two vectors 3 and · : R × V → V scales a vector by a real number. Elements of R are called scalars, and using the operation · is called scaling. 4 Rather than denoting the operations by the prefix notation +(x, y) and ·(a, v), we'll use the infix notation x + y and a · v. The operations must satisfy the following conditions. First, for all u, v, w ∈ V:

a) v + w = w + v
b) (u + v) + w = u + (v + w)
c) there is a special vector 0 ∈ V with 0 + u = u + 0 = u
d) every u ∈ V has an additive inverse −u ∈ V with u + (−u) = (−u) + u = 0

Second, for all a, b ∈ R and v, w ∈ V:

a) a · (v + w) = a · v + a · w
b) 1 · v = v
c) a · (b · v) = (ab) · v
d) (a + b) · v = a · v + b · v

These conditions guarantee that the manipulations of + and · you'd expect are valid are indeed valid. These properties are the minimum set of requirements to force the needed arithmetic to work.
3 Another word commonly used here is that V is closed under this operation:
applying + to vectors in V stays in V . We ensure this by stating the codomain of +
is V , but it is a more stringent requirement if the vector space is built from a
subset of some well-known set.
5 Some authors write all vectors bold, but I will only do it when disambiguation is
needed. More often than not the choice of letters suffices, u, v, w, x, y, z for
vectors and a, b, c or Greek letters for scalars.
This is a monumental definition, and it’s not even the most general definition (see
the Chapter Notes for more). But it’s entirely contained in the implementation of
the operations + and ·. The miniature proofs that + , · have the needed properties
constitute a proof that the chosen implementation is a vector space. This proof is
rarely a challenge.
In the examples that follow, I’ll skip detailed proofs, but if you want more
practice, fill in the details.
The simplest natural vector space is R, with R also being the scalars. In this case
vectors are just numbers, + is addition of real numbers, and · is multiplication of
real numbers. The number zero is both the scalar identity and the zero vector.
Nothing about this should be surprising.
A more interesting example is one we’re familiar with from Chapter 2, polynomials.
Call V the set of all polynomials of a single variable. If t is our variable then 1
+ t ∈ V
Even more general is the vector space of all functions f : X → R for any set X. As
an exercise to the reader: go through the conditions from Definition 10.1 and
figure out what + and · could mean. There should only be one natural option. As a
specific example, the space of all differentiable functions f : R → R is a vector
space, and the derivative operation f 7→ f′ is a linear map from that space to the
space of all functions.
Another familiar example is R^n, tuples of n real numbers, with the operations defined entry-wise:

(a_1, a_2, . . . , a_n) + (b_1, b_2, . . . , b_n) = (a_1 + b_1, . . . , a_n + b_n).
All of the vector space axioms hold because they apply independently to each entry,
and each entry is just arithmetic in R.
With a few examples handy, let’s turn to the geometric side of Definition 10.1. A
vector space is designed to be the simplest way to define what addition means in a
context that is useful for geometry (defining an “algebra” for geometric objects).
Let’s expand this. The first thing a geometry needs is a space of points. In a
vector space, the points are the vectors themselves. In Figure 10.1, we draw some
vectors in R2 for the ease of visualization. For a reason we’ll explain shortly, we
also draw these points as arrows from the zero vector (the zero vector is called
the “origin,” in graphical parlance).
Figure 10.1: Some vectors in R^2, drawn as arrows from the origin: (1, 2), (−2, 1), and (1, −1).
Returning to our vector space, points are indeed simply vectors in R^n.
Second, a geometry needs lines. In a vector space, a line is the set of all ways to
scale a single nonzero vector. In symbols, a line through the origin and v is the
set Lv = {c · v : c ∈ R }. For example, drawn in Figure 10.3 you can scale v = (1 ,
2) by a factor of 2 to get (2 , 4), shrink it down to (0 . 5 , 1), or scale it
negatively to ( − 2 , − 4). The set of all possible ways to do this gives you all
the points on the line through (1 , 2).
You can further get a line not passing through the origin by taking some other
vector w and adding it to every point on the line, i.e. {w + c · v : c ∈ R }. This
is the line through the point w parallel to Lv, shown in Figure 10.4.
All this said, a plain vector space isn’t quite enough to get all of geometry. For
example, we can’t compute distances or angles without more structure in the vector
space. We will enhance the geometric picture by the end of the chapter, but for now
we see there are connections between vectors and geometry. We’ll keep this
geometric foundation in mind while dealing with linear maps more abstractly (which,
to be frank, is the hard part
Figure 10.2: An example of vector addition. The dark dashed vector is the sum of
the two solid vectors, and the light dashed vector shows the geometric addition
process.
Figure 10.3: The line L = {c·v : c ∈ R} through the origin, for v = (1, 2).
Figure 10.4: The line {w + c·v : c ∈ R}: all possible scalings of a given vector v, then shifted away from the origin by a second vector w.
of linear algebra). Our task for now is to study where Definition 10.1 takes us.
A linear map describes a function between two vector spaces that preserves the
linear structure of the input. The formal definition is just an iota more
complicated than our version from the beginning of the chapter.
Definition 10.2. Let X and Y be vector spaces. A function f : X → Y is called a linear map if for every v, w ∈ X and every c ∈ R:

1. f(v +_X w) = f(v) +_Y f(w)
2. f(c ·_X v) = c ·_Y f(v)
This notation +_X, ·_X burns my eyes, so we'll drop it and understand that when I
say f ( v + w) = f ( v) + f ( w), I mean that the + on the left hand side is
happening in X and the + on the right hand side is happening in Y . Likewise for
scaling, f( cv) = cf( v). Any other interpretation would be a fatal type error.
Moreover, as we go on I’ll begin to drop the · in favor of “juxtaposition”, so that
if a is a scalar and v is a vector, it’s understood that av = a · v. I will use the
dot only when disambiguation is needed.
Here’s a simple example of a linear map. Let X be the vector space of polynomials,
and Y = R. Define the evaluation at 7 function, which I’ll denote by eval7 : X → R,
as eval7(p) = p(7). Let's check the two conditions hold. If p, q are two polynomials, then

eval7(p) + eval7(q) = p(7) + q(7) = (p + q)(7).

In just a little bit more detail at the expense of a big ugly formula, if p = a_0 + a_1 x + · · · + a_k x^k and q = b_0 + b_1 x + · · · + b_m x^m with k ≤ m, then (p + q)(7) is

(a_0 + b_0) + (a_1 + b_1) 7 + · · · + (a_k + b_k) 7^k + b_{k+1} 7^{k+1} + · · · + b_m 7^m.
And we can distribute and rearrange all these terms to get exactly p(7)+ q(7).
Likewise, eval7( c · p) = c · p(7). Since the number 7 was arbitrary, the same
logic shows that eval a for any scalar a ∈ R is a linear map.
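A concrete check of this computation (a sketch; polynomials are represented as coefficient lists, with index i holding the coefficient of x^i):

```python
def evaluate(coeffs, x):
    # p(x) = a_0 + a_1 x + a_2 x^2 + ...
    return sum(c * x**i for i, c in enumerate(coeffs))

def add_polys(p, q):
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]

p = [1, 0, 2]     # 1 + 2x^2
q = [0, 3, 0, 1]  # 3x + x^3
# eval7(p) + eval7(q) agrees with eval7(p + q)
print(evaluate(p, 7) + evaluate(q, 7), evaluate(add_polys(p, q), 7))  # both are 463
```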
For the rest of the chapter, linear maps are the only kind of function we care
about for vector spaces. The reason, which we’ll spend the rest of the chapter
trying to understand, is that linear maps are the maps which preserve the structure
of a vector space. Indeed, we defined them to preserve the two operations that
define a vector space! But as we’ll see this covers all the bases. For example,
linear maps preserve the zero vector.
Proposition 10.3. If X, Y are vector spaces and f : X → Y is a linear map, then f(0) = 0.
As I did with + and ·, I’m using the same symbol 0 for the additive identity in
both vector spaces. In light of this fact it’s not so surprising: if there’s a
unique zero vector in every vector space, and every linear map preserves the zero,
then using the same symbol for both zero vectors is not so strange, even if the
types of the two zero vectors may be very different.
The proof of this fact “falls out” from the definition. To distinguish 0 the vector
from 0 the scalar, I’ll make the vector bold, like 0.
Proof. Let’s use the fact that · is preserved by a linear map. First, f(0) is the
same as f (0 · 0). Since f is linear, this is the same as 0 · f (0). But 0 · v = 0
no matter what v is.
f (0) = f (0 · 0) = 0 · f (0) = 0 ,
Subtracting f(0) from both sides gives 0 = f(0). Now it’s your turn: prove the
facts in Exercises 10.1-10.4 which establish basic properties of linear maps.
Though we defined a vector space as a set with two operations, you can’t do much
with that mental model. We need more concrete computational tools to work with a
vector space. The first tool is called a basis. In short, a basis for a vector
space V is a minimal set of vectors B from which you can get all vectors in V by
adding and scaling vectors in B. The important examples in this book—and crucially,
the proofs in this chapter—will focus on the case where B is finite. 6
The prototypical example is R^n with the standard basis {e_1, . . . , e_n}, where e_i is the vector with a 1 in position i and zeros elsewhere.
Two things to note about the R2 example. First, this is far from the only basis.
Almost any two vectors you can think of form a basis. Say, {(3 , 4) , ( − 1 , −
5) }. One way to show this is a basis is to write a known basis like (1 , 0) and (0
, 1) in terms of these two vectors: 5
4
(1 , 0) =
(3 , 4) +
( − 1 , − 5)
11
11
From the above, one can write (0 , 1) as 1((3 , 4) − 3 · (1 , 0)). Once (1 , 0) and
(0 , 1) are 4
expressed in terms of your basis, you can get any vector by using ( c, d) = c(1 ,
0)+ d(0 , 1).
3 a − b = 1
4 a − 5 b = 0
Solving for a and b gives a = 5/11 and b = 4/11. The fact that this works for most
pairs of vectors you can think of is no coincidence, but we’ll return to that later
in the chapter. The point for now is that there are many possible bases (“BAY-sees,” the plural of basis) of a vector space, and each basis allows you to write any vector in the vector space by summing and scaling the vectors in the basis.

6 An infinite size basis is possible. We will remark on them mostly as commentary for your enticement and further investigation.

Figure: the vector v = e_1 + 2 e_2, where e_1 = (1, 0) and e_2 = (0, 1).
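The coefficients can be computed mechanically. Here is a sketch for R^2 (my own illustration, using Cramer's rule on the 2×2 system):

```python
# Solve a*u + b*v = target for (a, b), where u, v form a basis of R^2.
def coords_in_basis(target, u, v):
    (t1, t2), (u1, u2), (v1, v2) = target, u, v
    det = u1 * v2 - u2 * v1  # nonzero exactly when {u, v} is a basis
    a = (t1 * v2 - t2 * v1) / det
    b = (u1 * t2 - u2 * t1) / det
    return a, b

a, b = coords_in_basis((1, 0), (3, 4), (-1, -5))
print(a, b)  # a = 5/11, b = 4/11
```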
is represented as ( − 1 , − 5).
The brief and formal way to say a vector v “can be written using sums and scales of
other vectors” is the following definition.
Definition. A vector x ∈ V is a linear combination of vectors v_1, . . . , v_n ∈ V if there exist scalars a_1, . . . , a_n ∈ R with

x = a_1 v_1 + · · · + a_n v_n = ∑_{i=1}^{n} a_i v_i
Figure: the vector v = (1, 2) expressed as (−1/3) v_1 + (−5/3) v_2, where v_1 = (2, −1) and v_2 = (−1, −1).
In particular, any way one could “add and scale” vectors reduces to this form,
provided one is willing to distribute scalar multiplication over addition, expand,
and group all the terms. This is the standardized way to express the existential
claim that x can be “built”
A bit of common terminology is the span of a set B of vectors, which is the set of all linear combinations of those vectors. That is,

span(B) = {a_1 v_1 + · · · + a_n v_n : n ∈ N, v_i ∈ B, a_i ∈ R}.
When we said informally that a basis is a set of vectors from which you can “get all vectors in V ,” we could have said the set spans V . That would have been incomplete, and now we're ready for the formal definition: a basis of V is a set of vectors that spans V and is minimal, in the sense that no proper subset of it spans V .
This definition makes it clear why we don’t say things like “{(1 , 0) , (2 , 0) ,
(3 , 0) , (0 , 1) }
is a basis for R2.” Because while it does span R2, it includes superfluous
information.
It doesn’t make sense as a coordinate system either, because points don’t have
unique representations.
We will have a lot more to say about bases. Many insights and applications of
linear algebra revolve around computing a clever basis. But first we need a few
more tools. One of the most important definitions in elementary linear algebra is
related to the existence and uniqueness of linear combinations.
Suppose some x ∈ V could be written in two different ways as a linear combination of vectors v_1, . . . , v_n, say as both ∑_{i=1}^{n} a_i v_i and ∑_{i=1}^{n} b_i v_i. Then the difference ∑_{i=1}^{n} (a_i − b_i) v_i would be a nontrivial way to write the zero vector! It's nontrivial because some a_i and b_i have to be different, by our assumption that x has two different representations.

We need to show two things: that B spans V , and that one cannot remove any vectors from B and still span V .
For the first, let x ∈ V be a vector not in B, and our task is to write x as a
linear combination of the vectors in B. First, we form the set C = B ∪ {x} by
adding x to B.
Because B is a maximal linearly independent set, C is linearly dependent, so some nontrivial linear combination a_0 x + a_1 v_1 + · · · + a_n v_n = 0, with a_0 ≠ 0 (otherwise the v_i alone would be linearly dependent). Rearranging,

x = −(1/a_0)(a_1 v_1 + · · · + a_n v_n)
This proves that x ∈ span( B). Because x was chosen arbitrarily from V , this
proves that V ⊂ span( B). Since span( B) ⊂ V by definition of a vector space, 7
we’ve shown span( B) = V (cf. Definition 4.2 for a reminder on using subsets to
prove set equality).
Second, we need to show that B is minimal with respect to spanning V . Indeed, you
cannot write v 1 as a linear combination of v 2 , . . . , vn, because v 1 , . . . ,
vn form a linearly independent set! Hence, removing v 1 from B would make the
resulting set not span V ; ( v 1 ̸∈ span {v 2 , . . . , vn}). The same goes for
removing any vi.
The above proof makes it clear that for any x ̸∈ B, the statements “x ∈ span( B)”
and “B ∪ {x} is a linearly dependent set” are logically equivalent. This theorem
also provides a simple algorithm to construct a basis (though it’s not quite
concrete enough to implement). Start with B = {}. While there exists some vector
not in span( B), find such a vector and add it to B. When this loop terminates, B
is a basis.
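For vectors in R^n the loop can be made concrete, using Gaussian elimination to test membership in span(B). The following is my own illustration, not the book's code:

```python
# Keep an echelon list of (pivot_index, row) pairs. A candidate vector is
# added only if, after eliminating against the current rows, something
# nonzero remains -- i.e., it was not already in the span.
def add_if_independent(echelon, v, eps=1e-9):
    v = list(v)
    for pivot, row in echelon:
        if abs(v[pivot]) > eps:
            c = v[pivot] / row[pivot]
            v = [vi - c * ri for vi, ri in zip(v, row)]
    for i, vi in enumerate(v):
        if abs(vi) > eps:
            echelon.append((i, v))
            return True
    return False  # v was already in span(B)

candidates = [(1, 0, 0), (2, 0, 0), (0, 1, 1), (1, 1, 1)]
echelon = []
basis = [v for v in candidates if add_if_independent(echelon, v)]
print(basis)  # [(1, 0, 0), (0, 1, 1)]
```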
With linear independence, spanning, and bases in hand, we can define dimension and
finally the matrix.
10.4 Dimension
At first the concept of a basis seems tame. But it unlocks a world of use. The
first thing it allows us to do is measure the size of a vector space. We can do
this because of the following fact:
Theorem 10.8 (The Steinitz exchange lemma). Let V be a vector space. Then every
basis of V has the same size.
8 The only other proof of this theorem I'm aware of uses all kinds of needless machinery regarding homogeneous systems of linear equations. Algorithms save the day!
for w 1, is nonzero. 9
Repeat this process with u 2, forming W 2 , U 2, and keep doing it until you get to
Un = {}, and Wn. In each step we can always remove a new wi—that is, we can find a
wi with a nonzero coefficient—because all of the u’s that we’re adding are linearly
independent, while Wi is still spanning. So the algorithm will reach the n-th step,
at which point either all of W is replaced by all of U (i.e. n = m), or there are
some wi left over ( n < m).
Definition 10.9. The dimension of a vector space V is the size of a basis. Denote
the dimension of V by dim( V ).
Theorem 10.8 ensures that the dimension of a vector space is well-defined: it does not depend on which basis you choose. This reinforces our intuitive
understanding of what dimension should be for R n, i.e., how many coordinates are
needed to uniquely specify a point. R is one-dimensional, the plane R2 is two-
dimensional, physical space at a fixed instant in time is 3-dimensional, etc. The
dimension of the space doesn’t (and shouldn’t) depend on the perspective, and for
linear algebra the perspective is the choice of a basis.
just relabel the vectors post-hoc so that w 1 is one of the vectors with a nonzero
coefficient.” You often need a mental spot-check to convince yourself this doesn’t
break the argument; in this case, the order of the wi is irrelevant. If we had to
program this, we might be forced to keep track, perhaps for efficiency gains
(relabeling would require a full loop through the wi). But in mathematical
discourse we can flexibly and usefully change the data to avoid crusty notation and
get to the heart of the proof.
As these two examples suggest, subspaces can be formed easily by taking a basis B
of V , and picking any subset A of B to form a basis of W = span( A) ⊂ V . The
converse also works: if you start with a set of vectors A = {v 1 , . . . , vk}
spanning a k-dimensional subspace of an n-dimensional vector space V , you can
iteratively add vectors not in the span of A until the resulting set spans all of V
. This process, though not well-defined algorithmically, is existentially possible,
and it’s called extending A to a basis of V . In Chapter 12 we’ll see a concrete
algorithm for it called the Gram-Schmidt process, which produces additional useful
properties of the resulting basis.
10.5 Matrices
Linear maps seem relatively complicated at first glance, but they have a rigid
structure uniquely determined once you fix a basis in the domain and codomain.
Let’s draw this out and discover what that structure is. In this section English
letters v, w, x, and y will always be vectors, while Greek letters α, β, and γ will
be scalars.
Start with a linear map f : V → W , maybe given by some formula. We want to compute f on an input x. You choose a basis {v_1, . . . , v_n} and a basis {w_1, . . . , w_m} for V and W , respectively. 10 Now fix x ∈ V to be arbitrary. Since the v_i form a basis, there is some way to write x as a linear combination of the v_i, say

x = α_1 v_1 + · · · + α_n v_n.

By linearity, f(x) = α_1 f(v_1) + · · · + α_n f(v_n), so the value of f at any input is determined by its values on the n basis vectors.
This is such an important revelation that I want to shout it from the mountaintops!
Chisel it on the forearm of the Statue of Liberty! Put a fuchsia HTML marquee on
the front page of Google!
This implies the data representation of any linear map f : V → W can be reduced to
a fixed number dim( V ) of vectors in W : the output of f for each input basis
vector.
Each f(v_i) is itself a vector in W , so it can be expanded in the basis for W , say f(v_i) = β[i, 1] w_1 + · · · + β[i, m] w_m. Substituting these expansions into f(x) = α_1 f(v_1) + · · · + α_n f(v_n) and collecting the terms for each w_j, the coefficient of w_j is ∑_{i=1}^{n} α_i β[i, j].
This is a mouthful of notation, but it’s completely generic. The αi’s let you
specify an arbitrary input vector x ∈ V , and the n-by- m array β[ i, j] contains
all the data we need to specify the linear map f. We’ve reduced this initially
enigmatic operation f to a simple table of numbers. Provided we’ve fixed a basis,
that is.
We’ve only cracked the tip of the iceberg. The problem with the notational mess
above is it adds too much cognitive load. It’s hard to keep track of so many
indices! You could make it more succinct by writing it in summation notation, but
we can do better. What we really need is a well-chosen abstraction.
The abstraction we’re about to see (the matrix) has two virtues. First, it eases
the cognitive burden of doing a calculation by representing the operations
visually. Second, it provides a rung on the ladder of abstraction which you can
climb up when you want
151
to consider the relationship between matrices, linear maps, and the basis you’ve
chosen more abstractly. It does this by defining a new algebra for manipulating
linear maps.
Both the visual representation and the algebra merge seamlessly with the functional
description of linear maps. As we’ll see, composition of functions corresponds to
matrix multiplication. Natural operations on linear maps correspond to operations
on the corresponding matrices, and conversely operations on matrices correspond to
new, useful operations on functions. We will explore this in even more detail in
Chapter 12.
So here’s the abstraction that works for any linear map f : V → W . Again, we fix a
basis {vi} for V and {wj} for W . Write the numbers from β describing the linear
map f : V → W in a table according to the following rule. The columns of the table
correspond to the basis of V , and the rows correspond to basis vectors of W . We
call this construction M( f), and the mapping f 7→ M( f) will be a bijection from
the set of linear maps (all using the same fixed basis) to the set of matrices. The
underscores denote the part of the construction I haven’t specified yet.
              v 1    v 2   · · ·   vn
        w 1 (  _      _    · · ·    _  )
        w 2 (  _      _    · · ·    _  )
M( f) =  ..  (  ..     ..     . .     ..  )
        wm  (  _      _    · · ·    _  )
The entries of a column i are defined as the expansion of f( vi) in terms of the
wj. That is, take the basis vector vi for that column, and expand f( vi) in terms
of the wj, getting f ( vi) = β[ i, 1] w 1 + · · · + β[ i, m] wm. The numbers β[ i,
j] (where j ranges from 1 to m) form the i-th column of M( f).
              v 1          v 2         · · ·   vn
        w 1 ( β[1 , 1]    β[2 , 1]    · · ·   β[ n, 1] )
        w 2 ( β[1 , 2]    β[2 , 2]    · · ·   β[ n, 2] )
M( f) =  ..  (    ..           ..         . .       ..      )
        wm  ( β[1 , m]    β[2 , m]    · · ·   β[ n, m] )
You will have noticed that we’ve flipped the indices β[ i, j] from their normal
orientation so that i is the column instead of the row. This is an occupational
hazard, but we trust a programmer can handle index wizardry. One clever way to
express the construction of M ( f ) with fewer indices is like this:
              v 1       · · ·    vn
        w 1 (   |                 |   )
M( f) =  ..  ( f( v 1)   · · ·  f( vn) )
        wm  (   |                 |   )
The vertical lines signal that f( vi) is “spread out” over column i by its
expansion in terms of {wj}.
    ( α 1 )
    ( α 2 )
x = (  ..  )
    ( αn )
Sometimes people call this a “column vector” to distinguish it from the obvious
analogue of writing the entries in a row. Let’s just call it a vector. Now to
compute f( x) using M = M ( f ), you write M and x side by side (as if the
operation were multiplication of integers).
          v 1          v 2         · · ·   vn
        ( β[1 , 1]    β[2 , 1]    · · ·   β[ n, 1] ) ( α 1 )   ( γ 1 )
        ( β[1 , 2]    β[2 , 2]    · · ·   β[ n, 2] ) ( α 2 )   ( γ 2 )
M x =  (    ..           ..         . .       ..      ) (  ..  ) = (  ..  ) = z
        ( β[1 , m]    β[2 , m]    · · ·   β[ n, m] ) ( αn )   ( γm )

where the output vector z has entries γj = ∑_{i=1}^{n} αiβ[ i, j].
The computation to get from the left-hand side of this equation to the right is the
same as how we grouped terms to get the coefficient of wi earlier. Take the row of
M
corresponding to wi, compute an entrywise product with x, and sum the result.11
be multiplied by. Then the sum gives you the first entry γ 1, and you continue down
the rows of M. Here's an example with a 2 × 3 matrix.

    ( 9    2    1 ) (  3 )   ( a )
    ( 7   − 2   0 ) ( − 1 ) = ( b )
                    (  4 )

The first row gives

    a = 9 · 3 + 2 · ( − 1) + 1 · 4 = 29

The second:

    b = 7 · 3 + ( − 2) · ( − 1) + 0 · 4 = 23
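The row-by-row recipe translates directly into code. Here is a minimal Python sketch of the same computation (the function name is mine, not notation from the text):

```python
# The row-by-row recipe in code: each output entry is the inner product
# of a row of M with the input vector x.
def matrix_vector_multiply(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

M = [[9, 2, 1],
     [7, -2, 0]]
x = [3, -1, 4]

assert matrix_vector_multiply(M, x) == [29, 23]   # matches a and b above
```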
It’s easy to get lost in the notation and miss the bigger picture. We’ve defined a
mechanical algebraic process for computing the output f( x) ∈ W from the input x ∈
V , provided we have chosen a basis for V and W and provided we can express vectors
in terms of a given basis. This is a new type of “multiplication” operator that has
very nice properties. For example:
Theorem 10.13. Let V, W be vector spaces and f, g : V → W two linear maps. The
mapping f 7→ M( f) is linear. That is, if f + g is the function x 7→ f( x) +
g( x) , then M ( f + g) = M ( f ) + M ( g) , and likewise M ( cf ) = cM ( f ) for
every scalar c.
Injectivity: if M ( f ) = M ( g), then M ( f − g) = M ( f ) + ( − 1) M ( g). If that's the matrix of all zeroes, then, because linear maps preserve zero, f − g must be the zero map. Surjectivity: if you specify a matrix A, the f mapping to A is the one with f( vi) equal to the linear combination defined by the i-th column of A.
This bijection allows us to say that linear maps and matrices are “the same thing”
without angry mathematicians throwing chalkboard erasers at us.13 The matrix
representation of a linear map is unique, so we can freely switch back and forth
between a linear map and its matrix, provided the bases do not change.
To define matrix multiplication, write a matrix B in terms of its columns, B = ( b 1 | · · · | b m). Then the product with a matrix A is defined column by column as

AB = ( Ab 1 | · · · | Ab m)

With this definition, composition of linear maps corresponds to multiplication of their matrices:

M ( g ◦ f ) = M ( g) M ( f ) .
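The column-by-column definition of the product, and its agreement with composition, can be checked with a small Python sketch (the helper names are mine, not the book's):

```python
# Matrix multiplication defined column by column: AB's i-th column is A
# applied to B's i-th column.
def mat_vec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def mat_mul(A, B):
    cols = list(zip(*B))                        # the columns b_1, ..., b_m
    out_cols = [mat_vec(A, b) for b in cols]    # A b_1, ..., A b_m
    return [list(row) for row in zip(*out_cols)]

A = [[1, 2], [3, 4]]    # say, M(g)
B = [[0, 1], [1, 0]]    # say, M(f): swap the two coordinates
x = [5, 7]

# Applying f then g agrees with applying the single matrix M(g)M(f).
assert mat_vec(mat_mul(A, B), x) == mat_vec(A, mat_vec(B, x))
```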
This whole process we’ve undertaken, going from an abstractly defined theory of
vector spaces and linear maps to the concrete world of matrices, is analogous to
the process of building a computational model for a real-world phenomenon. It’s
like we’re taking light, something which we observe obeys certain behaviors such as
reflecting on various surfaces, and casting it to a type where we can
quantitatively answer how much it reflects. We can say, without observation, what
its different components are in our model, and how two types of light we’ve never
observed interacting would interact. All of these things are possible because of
the computational model.
In some more concrete and advanced terminology, we’ve defined an algebra for linear
maps. We showed how to add and “multiply” (compose) linear maps, and these operations correspond to matrix addition and matrix multiplication. 14
13 This actually happened to a friend of mine, and there's an apocryphal
tale of the irascible wunderkind Évariste Galois, who, during an admittance exam to
a prestigious French university, was so frustrated by the examiner’s inability to
recognize his genius that Galois threw a chalkboard eraser at him. Needless to say,
Galois was not admitted.
The task of finding a route from a conceptually intuitive land (linear maps) to a
computationally friendly world (matrices) is a chief goal of much of mathematics.
This is the same goal of calculus—its namesake is “calculate”—to convert
computations on curves with an infinite nature to a domain where one can do
mechanical calculations. And we aren’t yet done doing this with linear algebra!
Because while we have said how to compute once you have chosen a basis, we haven’t
discussed the means of actually finding such bases. Many applications of linear
algebra are based on computing a useful basis, and that will be the subject of both
this chapter’s application and the next. As such, we must dive deeper.
Given a basis {v 1 , . . . , vn} of V and a vector x ∈ V , one can find the unique expression of x in terms of the basis.
In fact, the way we defined a basis ensures existence, but the only example I gave
so far to compute this decomposition was, for V = R2, to set up a system of two
linear equations with two variables, and solve them.
3 a − b = 1
4 a − 5 b = 0
One important thing to point out: even though we want to write x = (1 , 0) in terms
of v 1 , v 2, we actually had a representation of x in terms of a basis already! To
even write x down in this coordinate-form, we implicitly used the standard basis
for R2, e 1 =
(1 , 0) , e 2 = (0 , 1). In the example above x = 1 e 1 + 0 e 2. In order to
express x in terms of a given basis, you have to have already expressed it in terms
of some (maybe easy) basis.
This strategy generalizes. Let’s say we have an n-dimensional vector space V with
two bases:
E = {e 1 , e 2 , . . . , en}
B = {v 1 , v 2 , . . . , vn}
14 The map M provides an isomorphism of algebras, but rather than introduce this
term now, we will discuss it at length in Section 10.7, and again in later
chapters.
Say E is the “easy” basis, often the standard basis in R n, and B is the target
basis we wish to express some vector x = α 1 e 1 + · · · + αnen in. Write down a
system of n equations with n unknowns, as follows. First express each of the
vectors in B in terms of E. I’m going to use the notation (e.g.) v 2 , 4 to denote
the 4th coefficient of v 2 as it’s written in the basis E. Finally, write down an
equation for each ei, which asserts that the coefficient αi of x in E is the same
as the sum of the ei coefficients of the (hypothetical) representation of x in B.
Note that all symbols here represent numbers in R.
β 1 v 1 , 1 + · · · + βnvn, 1 = α 1
β 1 v 1 , 2 + · · · + βnvn, 2 = α 2
...
β 1 v 1 ,n + · · · + βnvn,n = αn
This was a mouthful, but refer back to the two-dimensional example above and
identify how that generalizes to this system of equations. Next, we can rewrite the
system of equations as a single matrix equation.
    ( v 1 , 1   · · ·   vn, 1 ) ( β 1 )   ( α 1 )
    ( v 1 , 2   · · ·   vn, 2 ) ( β 2 )   ( α 2 )
    (   ..         . .      ..   ) (  ..  ) = (  ..  )
    ( v 1 ,n   · · ·   vn,n  ) ( βn )   ( αn )
This makes it clear that expressing a vector in terms of a basis can be phrased as
computing the unknown input of a linear map, y = ( β 1 , . . . , βn), given a
specified output x = ( α 1 , . . . , αn). It’s worthwhile to break this down a bit
further.
The matrix A = ( vi,j) defined above converts a vector from the domain basis to the
codomain basis. The domain basis—which indexes the columns of A—is the target
basis.
It’s the one we want to express x in terms of. The codomain basis—indexing the rows
—is the “easy” basis E, the basis used to write x = ( α 1 , . . . , αn). Finally, y
is the vector of coefficients ( β 1 , . . . , βn) that expresses x in terms of v
1 , . . . , vn, which is what we want.
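This can be computed in a few lines of numpy. Here's a sketch using the same two-dimensional example as above, with v 1 = (3, 4) and v 2 = (−1, −5) read off from the columns of the system:

```python
import numpy as np

# Express x = (1, 0) in the basis {v1 = (3, 4), v2 = (-1, -5)}.
# The columns of A are the target basis vectors written in the easy
# (standard) basis, so solving A y = x recovers the coefficients.
A = np.array([[3.0, -1.0],
              [4.0, -5.0]])       # columns: v1 and v2
x = np.array([1.0, 0.0])

y = np.linalg.solve(A, x)         # the coefficients (a, b) = (5/11, 4/11)

# Sanity check: recombining the basis vectors reproduces x.
assert np.allclose(A @ y, x)
```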
157
· · · 0
0 1 0 · · · 0
· · · 0
n =
.. . . .
.
.. ..
.. ..
· · · 1
The matrix multiplication operation ensures that InA = AIn = A for any matrix A.
As an exercise, prove that if a linear map is a bijection, then its inverse is also
a linear map, and the linear-map-to-matrix correspondence preserves inverses.
P − 1 AP w
This expression works in sequence right to left: express w in basis E, apply A, and
convert the result back to B. The matrix P − 1 AP is exactly the linear map for A
expressed in terms of the B basis.
It's a shame that such matrices are merely called “similar” because we're really saying they're identical. If you look at a laptop on
your desk and then pick it up and hold it sideways above your head, it’s not
“similar” to the laptop on your desk, it’s the same thing from two different
perspectives! That’s exactly what happens when you conjugate a matrix. Taking a cue
from Chapter 9, matrix similarity is an equivalence relation, and the equivalence
classes correspond to linear maps.
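Here's a small numpy sketch of conjugation; the matrices are made up for illustration, with P's columns playing the role of the new basis written in the old one:

```python
import numpy as np

# Conjugating a matrix by a change-of-basis matrix P expresses the same
# linear map in a different basis.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])              # the map in the easy basis E
P = np.array([[1.0, 1.0],
              [0.0, 1.0]])              # columns: the basis B, written in E

B_matrix = np.linalg.inv(P) @ A @ P     # the same map, written in basis B

# In this particular basis, the map happens to become diagonal.
assert np.allclose(B_matrix, np.diag([2.0, 3.0]))
# Conjugating back recovers the original matrix.
assert np.allclose(P @ B_matrix @ np.linalg.inv(P), A)
```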
Let's restrict our attention back to finite dimensions. We'll argue why R n is essentially the only finite-dimensional vector space, by an illuminating example. Define Pm to be the vector space of polynomials of degree at most m. Note that the obvious basis is { 1 , t, . . . , tm}, making dim Pm = m + 1.
Proposition 10.16. Let Pm be the vector space of polynomials in one variable with
degree at most m. Then R m+1 ∼= Pm.
Proof. Let { 1 , t, t 2 , . . . , tm} be the usual basis for Pm, and fix the
standard basis of R m+1, i.e., {e 1 , . . . , em+1 }. Define f : Pm → R m+1 as f
( a 0 + a 1 t + · · · + amtm) = ( a 0 , a 1 , . . . , am) . First, f is a linear map:
when you add polynomials you add their same-degree coefficients together, and
scaling simply scales each coefficient. Second, f is a bijection: if two
polynomials are different, then they have at least one differing coefficient
(injection); and any vector ( b 0 , b 1 , . . . , bm) ∈ R m+1 is the image of the polynomial ∑_{k=0}^{m} bk tk under f (surjection).
This theorem isn’t meant to conclude that polynomials are the same as lists in
every respect. Quite the opposite, a polynomial comes with all kinds of extra
interesting structure (as we saw in Chapter 2). Rather, to phrase polynomials as a
vector space is to ignore that additional structure. It says: if all you consider
about polynomials is their linearity, then they have the same linear structure as
lists of numbers. At times it can be extremely helpful to “ignore” certain unneeded
aspects of a problem. As you’ll see in an exercise, the polynomial interpolation
problem from Chapter 2 relies only on the linear structure of polynomials. Noticing
this can inspire other (perhaps more efficient) techniques for doing secret
sharing.
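The isomorphism from the proposition is easy to see in code; here's a sketch where a polynomial is just its coefficient list (the helper names are mine):

```python
# The isomorphism f(a0 + a1 t + ... + am t^m) = (a0, a1, ..., am):
# polynomial addition and scaling act coefficient-wise, exactly like
# vector operations on lists.
def add(p, q):
    return [a + b for a, b in zip(p, q)]

def scale(c, p):
    return [c * a for a in p]

p = [1, 0, 2]     # the polynomial 1 + 2t^2
q = [0, 3, -1]    # the polynomial 3t - t^2

assert add(p, q) == [1, 3, 1]      # 1 + 3t + t^2
assert scale(2, p) == [2, 0, 4]    # 2 + 4t^2
```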
This plan has a wrinkle. We're about to define the inner product, which computes
angles in R n. However, the quantitative values of the inner product might not be
preserved by an isomorphism! As it turns out, you can always find a special
isomorphism that preserves the formula, allowing the inner product formula to work
in generality. We’ll see this happen in Chapter 12 in more detail.
Let e 1 , . . . , en be the standard basis for R n, so that v = ∑_{i=1}^{n} αiei and w = ∑_{i=1}^{n} βiei. The inner product of v and w is defined as

⟨v, w⟩ = α 1 β 1 + · · · + αnβn = ∑_{i=1}^{n} αiβi.
Figure 10.7: The lengths of the sides of the triangle satisfy the law of cosines.
First, a special case of the inner product: the norm of a vector v, denoted ∥v∥, is defined as ∥v∥ = √⟨v, v⟩. Written out using the formula for the inner product, ∥v∥ = √( α 1 2 + · · · + αn 2), the usual notion of length extended to n dimensions.
We’ll also need two facts in the proof, whose proofs follow from the formula for
the inner product and simple arithmetic. We will see in Chapter 12 how these
properties become a definition.
Proposition 10.19. The inner product is symmetric, i.e., ⟨v, w⟩ = ⟨w, v⟩, and
linear in each input. In particular for the first input: ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y,
z⟩ and ⟨cx, z⟩ = c⟨x, z⟩.
The same holds for the second input by symmetry of the two inputs.
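Before the theorem, here's a short Python sketch of the inner product, the norm, and the angle formula it leads to, checked on two concrete vectors (the function names are mine):

```python
import math

# The inner product, the norm, and the angle formula
# <v, w> = |v| |w| cos(theta), checked on two vectors in R^2.
def inner(v, w):
    return sum(a * b for a, b in zip(v, w))

def norm(v):
    return math.sqrt(inner(v, v))

v, w = [1.0, 0.0], [1.0, 1.0]
cos_theta = inner(v, w) / (norm(v) * norm(w))
theta = math.acos(cos_theta)             # the angle between v and w

assert abs(theta - math.pi / 4) < 1e-9   # 45 degrees, as geometry predicts
```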
Theorem 10.20. The inner product ⟨v, w⟩ is equal to ∥v∥∥w∥ cos( θ) , where 0 ≤ θ ≤ π is the angle between v and w.
Proof. If either v or w is zero, then both sides of the equation are zero and the
theorem is trivial, so we may assume both are nonzero. Label a triangle with sides
v, w and the third side v −w as in Figure 10.7. The length of each side is ∥v∥, ∥w∥,
and ∥v−w∥, respectively.
Assume for the moment that θ is not 0 or 180 degrees, so that this triangle has
nonzero area.
By the law of cosines, ∥v−w∥ 2 = ∥v∥ 2 + ∥w∥ 2 − 2 ∥v∥∥w∥ cos( θ). The left hand side is the inner product of v−w with itself, i.e. ∥v−w∥ 2 = ⟨v−w, v−w⟩. Expanding by linearity,

⟨v − w, v − w⟩ = ⟨v, v − w⟩ − ⟨w, v − w⟩ = ⟨v, v⟩ − 2 ⟨v, w⟩ + ⟨w, w⟩ = ∥v∥ 2 + ∥w∥ 2 − 2 ⟨v, w⟩

Combining our two offset equations, subtract ∥v∥ 2 + ∥w∥ 2 from each side and get − 2 ⟨v, w⟩ = − 2 ∥v∥∥w∥ cos( θ), and dividing by − 2 proves the claim for this case.
Now if θ = 0 or 180 degrees, the vectors are parallel and cos( θ) = ± 1. That means
we can write w = cv for some scalar c. In particular, c < 0 when θ = 180 and c > 0
for θ = 0, and ∥w∥ = c∥v∥ when c > 0 and ∥w∥ = −c∥v∥ when c < 0. So the inner product
is
⟨v, cv⟩ = c⟨v, v⟩ = c∥v∥ 2 = ( c∥v∥) ∥v∥ = ±∥w∥∥v∥, where the sign matches up with cos(
θ) ∈ {± 1 }.
Theorem 10.21. Two nonzero vectors v, w ∈ R n are perpendicular if and only if ⟨v, w⟩ = 0 .
When I say, “P is true if and only if Q is true,” I am claiming that the two
properties are logically equivalent. In other words, you cannot have one without
the other, nor can you exclude one without excluding the other. Proving such an
equivalence requires two sub-proofs, that P implies Q and that Q implies P .
Because logical implication is often denoted using arrows—“P implies Q” being
written P → Q, and “Q implies P ” being written P ← Q—these sub-proofs are
informally called “directions.” So one will prove an if-and-only-if by saying, “For
the forward direction, assume P …and hence Q,” and
“For the reverse/other direction, assume Q…and hence P .” Authors will also often
mix in proof by contradiction to complete the sub-proofs. The combined if-and-only-
if is often denoted with double-arrows: P ↔ Q, and when pressed for brevity,
mathematicians abbreviate “if and only if” with “iff” using two f’s. So “iff” is
the mathematical cousin of a classic Unix command: 2–3 letters and a long man page
to explain it.
Proof. For the forward direction, assume v and w are perpendicular. By definition the angle θ between them is 90 or 270 degrees, and cos( θ) = 0. Hence ⟨v, w⟩ = ∥v∥∥w∥ cos( θ) = 0. For the reverse direction, assume ⟨v, w⟩ = 0. Then ∥v∥∥w∥ cos( θ) = 0, meaning one of ∥v∥, ∥w∥, or cos( θ) must be zero. Perpendicularity is not defined if one of the two vectors is zero, 18 so both vectors must be nonzero and have a nonzero norm. Hence cos( θ) = 0, making θ = 90 or 270 degrees, and the vectors are perpendicular.
Proof. Suppose for contradiction that ⟨x, y⟩ = 0 but ax + by = 0 for some scalars a, b, not both zero.
c = ⟨v, w⟩ / ∥v∥ 2 .
Let me depict this formula geometrically. Say that v, the vector being projected onto, is special in that it has magnitude 1. Such a special vector is called a unit vector. 20 In this case the formula defined above for the projection is just proj_v( w) = ⟨v, w⟩v. Now (trivially) write w = proj_v( w) + [ w − proj_v( w)]
The terms above are labeled on the diagram in Figure 10.8, with v and w solid dark vectors, and the terms of the projection formula as dotted lighter vectors perpendicular to each other. To convince you that the inner product computes the pictured projection,
18 One can either say that perpendicularity as a concept only applies to nonzero vectors, or establish (by convention) that the zero vector is perpendicular to all vectors.
20 The words “unit” and “unity” refer to the multiplicative identity 1, and their
etymology is the Latin word for one, unus. The word also shows up in complex
numbers when we speak of “roots of unity,” being those complex numbers which are n-
th roots of 1. Someday they’ll make a biopic about collaborating mathematicians
called “Roots of unity,” and Erdős will roll over in his grave.
Figure 10.8: The vectors v and w, with w decomposed as proj_v( w) plus the perpendicular part w − proj_v( w).
I need to prove to you that the two terms proj_v( w) and w − proj_v( w) are geometrically perpendicular, i.e., that

⟨w − proj_v( w) , proj_v( w) ⟩ = 0

Indeed, since proj_v( w) = ⟨v, w⟩v, let's call p = ⟨v, w⟩ and expand:

⟨w − pv, pv⟩ = p⟨w, v⟩ − p 2 ⟨v, v⟩
= p⟨w, v⟩ − p 2 ∥v∥ 2
= p 2 − p 2 = 0

The last step used the assumption that ∥v∥ = 1, and again that p = ⟨w, v⟩ = ⟨v, w⟩.
You can prove the same fact with the version of the projection formula that does not require unit vectors, if you keep track of the extra norms. The essence of the proof is the same. The extra term in the formula for proj_v( w) dividing by ∥v∥ 2 is just to make v a unit vector in the two places v is used: once in the inner product and once as the vector being projected onto. Ideally you never project onto something which is not a unit vector, but if you must you can normalize it as part of the formula.
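The projection formula and the perpendicularity fact just proved can be checked numerically; here's a small Python sketch using the general (non-unit-vector) version of the formula, with helper names of my own choosing:

```python
# The projection formula proj_v(w) = (<v, w> / |v|^2) v, plus a check
# that the remainder w - proj_v(w) is perpendicular to v.
def inner(v, w):
    return sum(a * b for a, b in zip(v, w))

def project(v, w):
    c = inner(v, w) / inner(v, v)      # the scalar c = <v, w> / |v|^2
    return [c * a for a in v]

v = [1.0, 2.0]
w = [3.0, 1.0]
p = project(v, w)
remainder = [wi - pi for wi, pi in zip(w, p)]

assert abs(inner(remainder, v)) < 1e-12    # perpendicular, as proved above
```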
Figure 10.8 is accurate in suggesting the two vectors are actually perpendicular. By virtue of being perpendicular to the projection, the norm of the vector w − proj_v( w) measures the distance from w to the line spanned by v.
Finding the line of best fit for a collection of points is the base case of the SVD
algorithm, the application for this chapter.
proj( w) = ∑_{i=1}^{k} proj_{vi}( w) .
A brief summary of this chapter would rephrase the relationship between a matrix
and a linear map. A matrix is a useful representation of a linear map that is fixed
after choosing a basis, and the algebraic properties of a matrix correspond to the
functional properties of the map. That, and certain operations on vectors have nice
geometric interpretations.
We save the juiciest properties for Chapter 12, where we will discuss eigenvalues
and eigenvectors. Nevertheless, we have access to fantastic applications. The
technique for this chapter, the singular value decomposition (SVD), is a ubiquitous
data science tool.
It was also a crucial part of the winning entry for the million dollar Netflix
Prize. The Netflix Challenge, held from 2006–2009, was a competition to design a
better movie recommendation algorithm. The winning entry, awarded to Robert Bell,
Yehuda Koren, and Chris Volinsky, improved on the accuracy of Netflix’s algorithm
by ten percent. The singular value decomposition was used to represent the data
(movie ratings) as vectors in a vector space, and the “decomposition” part of SVD
chooses a clever basis that models the data. After finding this useful
representation, the Netflix Prize winners used the vector representation as input
to a learning algorithm. 21
Though true movie ratings require dealing with issues we will ignore (like missing
data), we’ll couch the derivation of the SVD in a discussion of movie ratings. The
geometric punchline is: treat the movie ratings as points in a vector space, and
find a low-dimensional subspace which all the points are close to. This low-
dimensional subspace
“approximates” the data by projecting onto the subspace. Using the subspace as a
model makes subsequent operations like clustering and prediction faster and more
stable in the presence of noise.
Let’s start with the idea of a movie rating database to understand the modeling
assumptions of the SVD. We have a list of people, say Aisha, Bob, and Chandrika,
who rate each movie with an integer 1–5. These intrepid movie lovers have watched
and critiqued every single movie in the database. We write their ratings in a
matrix A as in Figure 10.9.
21 Ironically, most of the details beyond the standard SVD and subsequent learning
algorithm were not used by Netflix, even after declaring the winner.
[Figure 10.9: the ratings matrix A. The rows are labeled by movies (Up, Skyfall, Thor, Amelie, Snatch, Casablanca, Bridesmaids, Grease) and the columns by people, with each entry a rating between 1 and 5.]
That is, we're saying the input is R3 and the basis vectors are people: x Aisha , x Bob , x Chandrika.
The codomain is R8 (if there are only 8 movies, as in this toy example), and the
basis vectors are y Up , y Skyfall, etc. By representing the ratings this way,
we’re imposing the hypothesis that the process of rating movies is linear in
nature. That is, the map A computes the decision making process from people to
ratings. The coefficients of A( x Aisha) written in terms of the basis of movies,
forms the first column of the matrix in Figure 10.9. In this way, each vector in
the domain can be seen as either a person, or purely as the movie ratings provided
by that person. Conversely, each vector on the codomain is purely defined in terms
of how it is assembled from the ratings of the basis movies. The movie rating
function A is also assumed to be one combined function, as opposed to different for
each person.
These assumptions should give us pause. Beyond the sociological assumptions made
here, the linear model also grants us strange new mathematical abilities. We
started with three concrete people, but now any linear combination of them plays the role of a “person.”
We may also ask for a “person” whose movie-rating preferences are half-way in
between Aisha and Bob, and ask how this person would rate Amelie. Indeed, the fact
that A is a linear map provides an immediate answer to this question: average the
ratings of Aisha and Bob. The behavior of A on any vector is determined by its
behavior on the basis.
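This linearity is one line of numpy to check; the ratings matrix below is a tiny made-up example, not the book's data:

```python
import numpy as np

# Linearity in action on a small ratings matrix (rows are movies,
# columns are people). Averaging two "people" vectors averages their
# rating columns, because A is linear.
A = np.array([[1.0, 2.0, 1.0],
              [4.0, 5.0, 3.0]])

aisha = np.array([1.0, 0.0, 0.0])
bob = np.array([0.0, 1.0, 0.0])
halfway = (aisha + bob) / 2

assert np.allclose(A @ halfway, (A @ aisha + A @ bob) / 2)
```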
We can also create nonsense when we subtract vectors, or scale them beyond
reasonable interpretations. What would the movie 75 y Grease − 8 y Thor look like?
You may conjure a cohesive explanation, but you’d be straining logic to fit the
image of gibberish. Very off brand.
The central point is that we can represent a movie (or a person) formally as a
linear combination in some abstract vector space. But we don’t represent a movie in
the sense of its content, only those features of the movie that influence its
rating. We don’t know what those features are, but we can presumably access them
indirectly through the data of how people rate movies. We don’t have a legitimate
mathematical way to understand that process, so the linear model is a proxy. What’s
amazing is how powerful a dumb linear proxy can be, given enough data.
It’s totally unclear what this means in terms of real life, except that you can
hope (or hypothesize, or verify), that if the process of rating movies is “linear”
in nature then this formal representation will accurately reflect the real world.
It’s like how physicists all secretly know that mathematics doesn’t literally
dictate the laws of nature, because humans made up math in their heads and if you
poke nature too hard the math breaks down. But math as a language is so convenient
to describe hypotheses (and so accurate in most cases!), that we can’t help but use
it to design airplanes. We haven’t yet found a better tool than math.
Likewise, movie ratings aren’t literally a linear map, but if we pretend they are
we can make algorithms that accurately predict how people rate movies. So if you
know that Skyfall gets ratings 1, 2, and 1 from Aisha, Bob, and Chandrika,
respectively, then a new person would rate Skyfall based on a linear combination of
how well they align with these three people on other ratings. In other words, up to
a linear combination, in this example Aisha, Bob, and Chandrika epitomize the
process of rating movies.
The idea in SVD is to use a better choice of people than Aisha, Bob, and Chandrika,
and a better choice of movies, by isolating the independent aspects of the process
into separate vectors in the basis. Concretely this means the following:
1. Choose a new basis p 1 , . . . , pn for the domain (the “people” space).
2. Choose a new basis q 1 , . . . , qm for the codomain (the “movie” space).
3. Do (1) and (2) in such a way that the resulting representation of A only has entries on the diagonal.22 I.e., A( p 1) = c 1 q 1 for some constant c 1, likewise for p 2, p 3, etc.
One might think of the pi as “idealized critics” and the qj as “idealized movies.”
If the world were unreasonably logical, then q 1 might correspond to the “ideal
action movie”
and p 1 to the “idealized action movie lover.” The fact that A only has entries on
the diagonal means that p 1 gives a nonzero rating to q 1 and only q 1. A movie is
represented by how it decomposes (linearly) into “idealized” movies. To make up
some arbitrary numbers, maybe Skyfall is 2/3 action movie, 1/5 dystopian sci-fi,
and − 6/7 comedic romance. A person would similarly be represented by how they
decompose (via linear combination) into an action movie lover, rom-com lover, etc.
To be completely clear, the singular value decomposition does not find the ideal
action movie. The “ideality” of the singular value decomposition is with respect to
the inherent linear structure of the rating data. In particular, the “idealized
genres” are related to how closely the data sits in relation to certain lines and
planes. This is the crux of why the SVD algorithm works, so we’ll explain it
shortly. But nobody has a strong idea of how the movie itself relates to the
geometric structure of this abstraction. It almost certainly depends on completely
superficial aspects of the movie, such as how much it was advertised or whether
it’s a sequel. Nevertheless, much of the usefulness of the SVD abstraction relies
on not being domain-specific. The more a model encodes about movie-specific
features, the less it applies to data of other kinds. One sign of a deep
mathematical insight is domain-agnosticism.
The takeaway is that this mental model of an idealized genre movie and an idealized
genre-lover grounds our understanding of the SVD. We want to find bases with
special structure related to the data. We know the analogy is wrong, but it’s a
helpful analogy nonetheless.
Earlier I said that the SVD is about finding a low-dimensional subspace that
approximates the data well. It won’t be clear until we dive into the algorithm, but
this is achieved by taking our special basis of idealized people, p 1 , . . . , pn
(likewise for movies), and ordering them by how well they capture the data. There
is a single best line, spanned by one of these pi, that the points are collectively
closest to. Once you’ve found that, there is a second best vector which, when
combined with the first, forms the best-fitting plane (two-dimensional subspace),
and so on.
22 Matrices whose only nonzero entries lie on the diagonal are often called “diagonal” matrices, and if a matrix is diagonal with respect to some choice of a basis, it's called “diagonalizable.”
The approximation aspect of the SVD is to stop at some step k, so that you have a
k-dimensional subspace that fits the data well. The matrix P whose rows are the
chosen p 1 , . . . , pk is the linear map that projects the input vector x to the
closest point in the subspace spanned by p 1 , . . . , pk. This is simply because
the matrix-vector multiplication P x involves an inner product ⟨pi, x⟩—the
projection formula onto a unit vector pi—between each row of P and x.
Hopefully, k is much less than m or n, but still captures the “essence” of the
data.23
Indeed, it turns out that if you define the special basis vectors in this way—
spanning the best-fitting subspaces in increasing order of dimension—you get
everything you want.
You can also build these best-fitting subspaces recursively. The best-fitting 2-
dimensional subspace is formed by taking the best line and finding the next best
vector you could add.
Likewise, the best 3-dimensional subspace is that best plane coupled with the next
best vector. We’re glomming on vectors greedily.
It should be shocking that this works. Why should the best 5-dimensional subspace
be at all related to the best 3-dimensional subspace? For most problems, in math
and in life, the greedy algorithm is far from optimal. When it happens, once in a
blue moon, that the greedy algorithm is the best solution to a natural problem—and
not obviously so—it’s our intellectual duty to stop what we’re doing, sit up
straight, and really understand and appreciate it.
First we’ll define what it means to be the “best-fitting” subspace to some data.
Below, by the “distance from a vector x to a subspace W ,” I mean the minimal
distance between x and any vector in W .
Next we study this definition to come up with a suitable quantity to optimize. Say
I have a set of m vectors w 1 , . . . , wm in R n, and I want to find the best
approximating 1-dimensional subspace. Given a candidate line spanned by a unit
vector v, measure the quality of that line by adding the sum-of-squares distances
from wi to v. Using the projection function defined earlier,
quality( v) = ∑_{i=1}^{m} ∥wi − proj_v( wi) ∥ 2
This formula, in a typical math writing fashion, exists only to help us understand
what we’re optimizing: squared distances of points from a line. To make it
tractable, we convert it back to the inner product. I’ll describe this process in
fine detail, with sidebars to explain some notational choices.
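The quality function itself is short to write down in code. Here's a numpy sketch with made-up sample points chosen to lie near the diagonal (the names are mine):

```python
import numpy as np

# The quality of a candidate unit vector v: the sum of squared distances
# from each data point w_i to the line spanned by v.
def quality(v, points):
    v = v / np.linalg.norm(v)                  # normalize, to be safe
    total = 0.0
    for w in points:
        proj = np.dot(v, w) * v                # proj_v(w) for a unit v
        total += np.linalg.norm(w - proj) ** 2
    return total

points = [np.array([1.0, 1.0]), np.array([2.0, 2.1]), np.array([-1.0, -0.9])]

# The diagonal direction fits these nearly-collinear points far better
# than the anti-diagonal does.
assert quality(np.array([1.0, 1.0]), points) < quality(np.array([1.0, -1.0]), points)
```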
We want to find the unit vector v that minimizes the quality function. We’d write
the goal of minimizing this expression as
min_{v} ∑_{i=1}^{m} ∥wi − proj_v( wi) ∥ 2 .

The notation min_v EXPR describes a function whose input is v and whose output is EXPR (depending on v), and the total expression (with the min) evaluates to the minimal output value considered over all possible inputs v.
The domain of v is usually defined in the prose, but if it’s helpful and fits, the
conditions on v can be expressed in the subscript, such as
min_{v ∈ R n, ∥v∥=1} EXPR ,
which is the minimum value of EXPR considered over all possible unit vectors in R
n.
Just to drive the point home, this is equivalent to the pseudo-Python snippet min(EXPR(v) for v in all_unit_vectors).
The analogous expression which evaluates to the input vector v (instead of the
value of the expression being optimized) is called “arg min.” The arg prefix
generally means, get the “argument,” or input, to the optimized expression. Note
that there can be multiple minimizers of an expression, so we are implicitly saying
we don’t care which minimizer is chosen. It’s a highly context-dependent bit of
notation. If I replaced min with arg min in the offset equation above, it would
correspond to the Python snippet min(all_unit_vectors, key=EXPR).
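Concretely, here's a runnable sketch of the min/argmin distinction, using a finite grid of numbers as a stand-in for the set of inputs:

```python
# min evaluates to the smallest *value*; argmin evaluates to the *input*
# achieving it. A finite grid of numbers stands in for the domain x >= 0.
xs = [x / 10 for x in range(0, 50)]

min_f = min(x**2 + 1 for x in xs)          # the min of x^2 + 1 is 1
min_g = min(x**2 for x in xs)              # the min of x^2 is 0
argmin_f = min(xs, key=lambda x: x**2 + 1)
argmin_g = min(xs, key=lambda x: x**2)

assert min_f != min_g                      # the minimal values differ...
assert argmin_f == argmin_g == 0.0         # ...but the argmins agree
```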
I introduced the argmin because we actually want to find the minimizing vector.
It's false to claim min_{x ≥ 0}( x 2 + 1) = min_{x ≥ 0} x 2, even though the argmins are unique and equal. So our line-of-best-fit problem is most rigorously written as:
arg min_{v ∈ R n, ∥v∥=1} ∑_{i=1}^{m} ∥wi − proj_v( wi) ∥ 2

Because proj_v( wi) and wi − proj_v( wi) are perpendicular, we can apply the Pythagorean theorem, in this case that ∥ proj_v( wi) ∥ 2 + ∥wi − proj_v( wi) ∥ 2 = ∥wi∥ 2. Substituting ∥wi − proj_v( wi) ∥ 2 = ∥wi∥ 2 − ∥ proj_v( wi) ∥ 2 gives

arg min_{v ∈ R n, ∥v∥=1} ∑_{i=1}^{m} ( ∥wi∥ 2 − ∥ proj_v( wi) ∥ 2 )
Next, notice that the ∥wi∥ 2 don’t depend on the input v, meaning we can’t optimize
them and can remove them from the expression without changing the argument of the
minimum (it does change the value of the min). The minimization problem is now (
arg min −
∥ proj ( w
i) ∥ 2
v
i=1
And because minimizing something is the same as maximizing its opposite, we can swap the optimization. Let's also use the inner product formula for the projection instead of the squared-norm. We've reduced the best fitting line optimization to

    argmax_{v ∈ R^n, ∥v∥=1} Σ_{i=1}^n ⟨wi, v⟩²
Maximizing the square of a non-negative value is the same as maximizing the value itself. And if we let A be the matrix whose rows are the data points wi, then Σ_{i=1}^n ⟨wi, v⟩² = ∥Av∥², so the problem is equivalent to finding a unit vector v maximizing ∥Av∥.
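Here is a quick numerical spot check of the identity Σ⟨wi, v⟩² = ∥Av∥², using numpy and random data:

```python
import numpy as np

# Spot check: sum_i <w_i, v>^2 == ||Av||^2 when the w_i are the rows of A.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))      # five data points w_i in R^3
v = rng.normal(size=3)
v = v / np.linalg.norm(v)        # a unit vector

sum_of_squares = sum(np.dot(w, v) ** 2 for w in A)
norm_squared = np.linalg.norm(A @ v) ** 2
```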
There are many algorithms that solve this optimization problem. We’ll use a
particularly simple one, and defer implementing it until after we see how this
problem can be used as a subroutine to compute the full singular value
decomposition.
Theorem 10.25 (The SVD Theorem). Computing the best k-dimensional subspace fitting
a dataset reduces to k applications of the one-dimensional optimization problem.
This is so astounding and useful that the solutions to each one-dimensional problem
are given names: the singular vectors. I will define them recursively. Let A be an
m × n matrix ( m rows for the movies, and n columns for the people) whose rows are
the data points wi. Let v1 be the solution to the one-dimensional problem

    argmax_{v ∈ R^n, ∥v∥=1} ∥Av∥

Call v1 the first singular vector of A, and call the value of the optimization problem, σ1(A) = ∥Av1∥, the first singular value of A.
Now we can move up in dimension. To find the best 2-dimensional subspace, you first
take the best line v 1, and you look for the next best line, considering only those
vectors perpendicular to v1. That optimization problem is written as (assuming henceforth that the domain is R^n)

    argmax_{∥v∥=1, ⟨v,v1⟩=0} ∥Av∥
The solution v 2 is called the second singular vector, along with the second
singular value σ 2( A) = ∥Av 2 ∥.
Often writers will use the binary operator ⊥ to denote perpendicularity of vectors
instead of the inner product. So v ⊥ v 1 is the assertion that v and v 1 are
perpendicular.
The ⊥ symbol has many silly names (“up tack” on Wikipedia). In my experience most
people call it the “perp” symbol, since in mathematical typesetting it’s denoted by
\perp.
Continuing with the recursion, the k-th singular vector vk is defined as the unit vector maximizing ∥Av∥ among unit vectors v perpendicular to every vector in span{v1, . . . , vk−1}. The corresponding singular value is σk(A) = ∥Avk∥. You can keep going until either you reach k = n and you have a full basis, or else some σk(A) = 0, in which case all the vectors in your data set lie in the span of {v1, . . . , vk−1}.
As a side note, by the way we defined the singular values and vectors,
σ1(A) ≥ σ2(A) ≥ · · · ≥ σn(A) ≥ 0.
This should be obvious, and if it’s not take a moment to do a spot check and see
why.
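A quick spot check using numpy's built-in SVD (purely as a reference implementation): the singular values come out non-increasing and non-negative, as the recursive definition demands.

```python
import numpy as np

# numpy returns singular values in non-increasing order,
# matching sigma_1 >= sigma_2 >= ... >= sigma_n >= 0.
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
sigmas = np.linalg.svd(A, compute_uv=False)
```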
Proof. Recall we’re trying to prove that the first k singular vectors span the k-
dimensional subspace of best fit for the vectors that are the rows of A. That is,
they span a linear subspace Y which maximizes the sum-of-squares of the projections
of the data onto Y .
We'll show that the subspace spanned by the two singular vectors v1, v2 is at least as good as Y (and hence, since Y is assumed optimal, equally good).
(We prove the case k = 2; the general case uses the same ideas in higher dimensions.) Choose an orthonormal basis y1, y2 of Y with y2 perpendicular to v1; this is possible because the vectors in Y perpendicular to v1 form a subspace of dimension at least 1. By the maximality of v1, we have ∥Av1∥² ≥ ∥Ay1∥², and by the maximality of v2 among unit vectors perpendicular to v1, we also have ∥Av2∥² ≥ ∥Ay2∥². Adding these shows the subspace spanned by {v1, v2} is at least as good as Y:

    ∥Av1∥² + ∥Av2∥² ≥ ∥Ay1∥² + ∥Ay2∥²

The right hand side of this inequality is maximal by assumption, so they must actually be equal and both be maximizers.
For the general case of k, the inductive hypothesis tells us that the first k terms of the objective for k + 1 singular vectors are maximized, and we just have to pick any vector yk+1 that is perpendicular to all of v1, v2, . . . , vk; the rest of the proof is just like the 2-dimensional case. We encourage the skeptical reader to fill in the details.
The singular vectors vi are elements of the domain. In the context of the movie
rating example, the domain was people, and so the singular vectors in that case are
“idealized people.” As we said earlier, we also want the same thing for the
codomain, the “idealized movies,” in such a way that A is diagonal when represented
with respect to these two bases.
Say the singular vectors are v 1 , . . . , vn, and the singular values are σ
1 , . . . , σn. That gives us two pieces of the puzzle: the diagonal representation
Σ (the Greek capital letter sigma, since its entries are the lower case sigma
singular values σi) defined as follows:
    Σ =  [ σ1   0   · · ·   0
            0   σ2  · · ·   0
            ⋮          ⋱    ⋮
            0   0   · · ·  σn
            0   0   · · ·   0
            ⋮                ⋮
            0   0   · · ·   0 ]

That is, Σ is an m × n matrix whose top n × n block is diagonal, with the singular values σ1, . . . , σn on the diagonal, and whose remaining m − n rows are all zeros.
And the domain basis: a matrix V whose columns are the vi, or equivalently V T
whose rows are the vi. 24 If we want to write A in this diagonal way, we just have
to fill in a change of basis matrix U for the codomain.
A = U Σ V T
Indeed, there's one obvious guess (which we'll later scale to unit vectors): define ui = Avi. Let's verify the ui form a basis. Note they can only form a basis of the image of A (the set {Av : v ∈ R^n}), since it can happen that m > n; to get a full basis of R^m, just extend the partial basis of ui's in any legal way. To show the ui form
a basis, take any vector w in the image of A, write it as w = Ax, and write x as a
linear combination of the vi:
w = A( c 1 v 1 + · · · + cnvn)
= c 1 Av 1 + · · · + cnAvn
= c 1 u 1 + · · · + cnun
It can be proved that the ui are perpendicular, but the only proof I have seen is somewhat technical and for brevity's sake I will skip it. Taking this on faith, the ui form a basis and one can express A = U Σ V^T, as desired. The fact that A = U Σ V^T is why the SVD is called a "decomposition": it factors A into a product of three simpler matrices.
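As a sanity check of the factorization, numpy's built-in SVD can verify numerically that the three factors multiply back to A:

```python
import numpy as np

# Numerical check: U, Sigma, V^T multiply back to the original matrix A.
rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reconstructed = U @ np.diag(s) @ Vt
```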
Now that we’ve seen that the SVD can be computed by greedily solving a one-
dimensional optimization problem, we can turn our attention to solving it. We’ll
use what’s called the power method for computing the top eigenvector. The next
chapter will be all about eigenvectors, but we don’t need to know anything about
eigenvectors to see this algorithm. In lieu of knowledge about eigenvectors, the
algorithm will just appear to use a clever trick.
The idea is to take A, the original input data matrix, and instead work with AT A.
Why is this helpful? Using our decomposition from the previous section, we can
write A = U Σ V T , where U, V are change of basis matrices (whose columns are
perpendicular unit vectors!) and V actually contains as its columns the vectors we
want to compute.
A^T A = (U Σ V^T)^T (U Σ V^T) = V Σ^T U^T U Σ V^T = V Σ² V^T
We’re using Σ2 to denote Σ T Σ, which is a square matrix whose diagonals are the
squares of the singular values σi(A)². Also note that because the columns of U are perpendicular unit vectors, the product U^T U is a matrix with 1's on the diagonal and zeros elsewhere; i.e., the identity matrix.

24 Here the superscript T denotes the transpose of V; that is, V^T has as its i,j entry the j,i entry of V. It swaps rows and columns, but we'll have much more to say in Chapter 12. For now, it's enough to note (and easy to verify) that if V has perpendicular unit vectors as columns, then V^T = V^{−1}, so we can use V^T as a change of basis from the standard basis to the basis defined by V.
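The claim that V^T = V^{−1} for a matrix with perpendicular unit columns is easy to check numerically; here a rotation matrix serves as the example:

```python
import numpy as np

# A rotation matrix has perpendicular unit columns, so V^T V = I,
# meaning V^T acts as V's inverse.
theta = 0.7
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
product = V.T @ V
```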
Using A^T A isolates the V part of the decomposition. Now for the algorithm:

Theorem 10.26 (The Power Method). Let x be a unit vector that has a nonzero component in the direction of v1 (a random unit vector has this property with high probability). Let B = A^T A, and define xk = B^k x. Then the normalized vectors xk/∥xk∥ converge to the first singular vector v1.
Proof. I will use σi as a shorthand for σi(A). First expand x in terms of the singular vectors, x = Σ_{i=1}^n ci vi. Since B = V Σ² V^T, each singular vector satisfies Bvi = σi² vi. Applying B repeatedly gives

    xk = B^k x = Σ_{i=1}^n ci σi^{2k} vi
Notice that, since σ1 is larger than σ2 (and hence all other singular values), the coefficient of v1 grows faster than the others. Normalizing xk causes the coefficient of v1 to tend to 1 while the other coefficients tend to 0.
The intuition to glean from this proof is that B = AT A, when applied to a vector,
“pulls” that vector a little bit toward the top singular vector. If you normalize
after each step, then the magnitude of the vector doesn’t change, but the direction
does.
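Here is that intuition as a short experiment, using numpy's built-in SVD purely as ground truth for the top singular vector:

```python
import numpy as np

# Power iteration: repeatedly apply B = A^T A and normalize.
# The iterate lines up with the top right-singular vector of A.
rng = np.random.default_rng(3)
A = rng.normal(size=(8, 4))
B = A.T @ A

x = rng.normal(size=4)
x = x / np.linalg.norm(x)
for _ in range(200):
    x = B @ x
    x = x / np.linalg.norm(x)

v1 = np.linalg.svd(A)[2][0]      # numpy's top right-singular vector
alignment = abs(np.dot(x, v1))   # close to 1 means same direction up to sign
```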
The relevant quantity tracking the coefficient growth is the ratio between the two biggest singular values, (σ1/σ2)^{2k}. Even if σ1 is only marginally bigger than σ2, this ratio still grows exponentially in the number of iterations k.
Code It Up
Here's the Python code that solves the one-dimensional problem, using the numpy library for matrix algebra. Note that numpy uses the dot method for all types of matrix-matrix, matrix-vector, and inner product operations.25 Also note the .T property returns the transpose of a matrix or vector.
First, some setup and defining a function that produces a random unit vector.
25 They, along with most applied linear algebraists, view vectors as matrices with
one column.
from math import sqrt
import random

import numpy as np
from numpy.linalg import norm

def random_unit_vector(n):
    unnormalized = [random.normalvariate(0, 1) for _ in range(n)]
    the_norm = sqrt(sum(x * x for x in unnormalized))
    return [x / the_norm for x in unnormalized]
And now the core subroutine for solving the one-dimensional problem.
def svd_1d(A, epsilon=1e-10):
    """Compute the first singular vector of A using the power method."""
    n, m = A.shape
    x = random_unit_vector(min(n, m))
    last_v = None
    current_v = x

    if n > m:
        B = np.dot(A.T, A)
    else:
        B = np.dot(A, A.T)

    iterations = 0
    while True:
        iterations += 1
        last_v = current_v
        current_v = np.dot(B, last_v)
        current_v = current_v / norm(current_v)

        if abs(np.dot(current_v, last_v)) > 1 - epsilon:
            return current_v
Since, as we saw in Chapter 8, the sequence will never quite achieve its limit, we stop after xk changes its angle (as computed using the inner product) by less than some threshold.
Now we can use the one-dimensional subroutine to compute the entire SVD. The helper
function we need for this is how to exclude vectors in the span of the singular
vectors you've already computed. Unfortunately, solving this problem opens up questions about a new topic, namely the rank of a matrix, which I've found hard to
fit into this already very long chapter. As much as it hurts me to do so, we will
save it for an exercise, and present the formula here. 26
The idea is this: to exclude vectors in the span of the first singular vector v 1
with corresponding u 1, subtract from the original input matrix A the rank 1 matrix
B 1 defined by bi,j = u 1 ,iv 1 ,j (the product of the i-th and j-th entries of u 1
, v 1, respectively). The name for this matrix is the “outer product” of u 1 and v
1, and it's closely related to a concept called the tensor product. Likewise, you can define Bi for each of the singular vectors vi. To exclude all the vectors in the span of {v1, . . . , vk}, you replace A with

    A − Σ_{i=1}^k Bi

26 And, again, I would like to stress that this book is far too small to provide a complete linear algebra education. The fantastic text "Linear Algebra Done Right" is an excellent such book for the aspiring mathematician. In that I mean, they exhaustively prove every fact about linear algebra from the ground up.
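A quick numerical check of the deflation idea, using numpy's built-in SVD: subtracting the top rank-1 piece removes exactly the first singular direction.

```python
import numpy as np

# Subtracting sigma_1 * outer(u_1, v_1) from A removes the top singular
# direction, leaving the remaining singular values intact.
rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

deflated = A - s[0] * np.outer(U[:, 0], Vt[0, :])
s_deflated = np.linalg.svd(deflated, compute_uv=False)
```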
In the following code snippet, we do this iteratively when we loop over svd_so_far
and subtract. The following assumes the case of n > m, with the other case handled
similarly in the complete program. 27 The parameter k stores the number of singular
values to compute before stopping.
def svd(A, k=None, epsilon=1e-10):
    A = np.array(A, dtype=float)
    n, m = A.shape
    svd_so_far = []
    if k is None:
        k = min(n, m)

    for i in range(k):
        matrix_for_1d = A.copy()

        # subtract the rank-1 pieces for the singular vectors found so far
        for sigma, u, v in svd_so_far:
            matrix_for_1d -= sigma * np.outer(u, v)

        v = svd_1d(matrix_for_1d, epsilon=epsilon)  # next singular vector
        u_unnormalized = np.dot(A, v)
        sigma = norm(u_unnormalized)  # next singular value
        u = u_unnormalized / sigma

        svd_so_far.append((sigma, u, v))

    singular_values, us, vs = [np.array(x) for x in zip(*svd_so_far)]
    return singular_values, us.T, vs
Let’s run this on some data. Specifically, we’ll analyze a corpus of news stories
and use SVD to find a small set of “category” vectors for the stories. These can be
used, for example, to suggest category labels for a new story not present in our
data set. We’ll sweep a lot of the data-munging details under the rug (see the
Github repository for full details), but here’s a summary:
1. Scrape a set of 1000 CNN stories, and a text file one-grams.txt containing a list of the most common hundred-thousand English words. These files are in the data directory of the book's Github repository.
2. Using the natural language processing library nltk, convert each CNN story into
a list of (possibly repeated) words, excluding all stop words and words that aren’t
in one-grams.txt. The output is the file all-stories.json.
3. Convert the set of all stories into a document-term matrix A, with m rows (one
for each word) and n columns (one for each document), where the ai,j entry is the
count of occurrences of word i in document j.
27 See pimbook.org
data = load(filename)
matrix = make_document_term_matrix(data)  # helper defined in the full program
matrix = normalize(matrix)
sigma, U, V = svd(matrix, k=10)
Here U is the basis for the subspace of documents, V for the words. However, these
basis vectors are very difficult to understand! If we go back to our interpretation
of such a word vector as an “idealized” word, then it’s a “word” that best
describes some large set of documents in our linear model. It’s represented as a
linear combination of a hundred thousand words!
To clarify, we can project the existing words onto the subspace, and then we can
cluster those vectors into groups and look at the groups. Here we use a black-box
clustering algorithm called kmeans2, provided by the scipy library.
projectedDocuments = np.dot(matrix.T, U)
projectedWords = np.dot(matrix, V.T)

documentClustering = cluster(projectedDocuments)  # wraps scipy's kmeans2
wordClustering = cluster(projectedWords)
Once we’ve clustered, we can look at the output clusters and see what words are
grouped together. As it turns out, such clusters often form topics. For example,
after one run the clusters have size:
>>> Counter(wordClustering)
The first cluster, as it turns out, contains all the words that don’t fit neatly in
other clusters—such as “skunk,” “pope,” and “vegan”—which explains why it’s so big.
28 The other clusters have more reasonable interpretations. For example, after one
run the second largest cluster contained primarily words related to crime:
>>> print(wordClusters[1])
This is just as we'd expect, because crime is one of the largest news beats. Other clusters include business, politics, and entertainment. We encourage the reader to run the code themselves and inspect the output.

28 It could also occur like this because we chose too few clusters: we have to pick ahead of time how many clusters we want kmeans2 to attempt to find, which I omitted from the simplified code above.
A natural question to ask is why not just cluster to begin with? Efficiency! In
this model, each word is a vector of length 1000 (one entry for each story), and
each document has length 100,000! Clustering on such large vectors is slow. But
after we compute the SVD and project, we get clusters of length k = 10. We trade
off accuracy for efficiency, and the SVD guarantees us that it’s extracting the
most important (linear) features of the data. Because of this, SVD is often called
a “dimensionality reduction” algorithm: it reduces the dimension of the data from
their natural dimension to a small dimension, without losing too much information.
But there’s more to the story. Recall our modeling assumption, that word meanings
“have the structure of” a low-dimensional vector space, but the values we see are
perturbed by some noise. A crime story might use the word “baseball” for
idiosyncratic reasons, but most crime stories do not. The low-dimensional subspace
captures the “essence”
of the data, ignoring noise, and the projection of the input word vectors onto the SVD subspace gives what's often called a word embedding.
Before I explain what that means, I need a caveat. What I’m about to describe
doesn’t strictly work for the code presented in this chapter. Since I wrote this
code with the goal to group news articles by topic, I counted frequency of terms
occurring in documents (and the dataset I used is quite small!). If you want to
reproduce the behavior below, you need a larger dataset and a different
preprocessing technique, which is basically to count how often word pairs co-occur
in a document. Check out Chris Moody’s lda2vec, 29 which does this.
Now the fun stuff. The vector representation of words produced by the SVD has a
semantic linear structure. For example, if you take the vector for the word “king,”
subtract the vector for “man” and add the vector for “woman,” the result
approximates the vector for “queen.” Indeed, the SVD representation has reproduced
the gender aspect of language. This occurs for all kinds of other properties of
words that fit into typical word-association style tests like “Paris is to France
as Berlin is to…”
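As a toy illustration of this arithmetic (with made-up two-dimensional vectors and hypothetical axes, not real learned embeddings):

```python
# Synthetic vectors chosen so the analogy arithmetic is easy to see.
embeddings = {
    "king":  (0.9, 0.8),   # hypothetical axes: (royalty, maleness)
    "queen": (0.9, 0.1),
    "man":   (0.1, 0.8),
    "woman": (0.1, 0.1),
}

def add(u, v): return tuple(a + b for a, b in zip(u, v))
def sub(u, v): return tuple(a - b for a, b in zip(u, v))
def dist2(u, v): return sum((a - b) ** 2 for a, b in zip(u, v))

# king - man + woman lands nearest to queen
target = add(sub(embeddings["king"], embeddings["man"]), embeddings["woman"])
nearest = min((w for w in embeddings if w != "king"),
              key=lambda w: dist2(embeddings[w], target))
```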
This is surprising, and it tells us that some aspect of this SVD representation of
words is much better than the original input of raw word counts. It’s surprising
because we think of language as a highly quirky, strange, perhaps nonlinear thing.
But when it comes to the relationships between words, or the semantic meaning of
document topics, these linear methods work well. One might argue that the core
insight behind this is that for language, context is linear in nature. And then
it’s immediately clear why this works: if you see a document with “child” and “she”
in it, and those words occur close together, you intuitively know that you're more likely to be talking about a daughter than a son.
29 https://github.com/cemoody/lda2vec, forked at
https://github.com/pim-book/lda2vec just in case the original is removed. Also note
that these techniques can also be produced by neural networks, the application of
Chapter 14.
Replace the “she” with a “he” and you expect to see the word son instead. The SVD
captures this.
1. The heart of linear algebra is a very concrete connection between linear maps
and matrices. The former is intuitive, useful for thinking about linear algebra
geometrically. The latter is computationally tractable, allowing us to discover and
apply useful algorithms. Operations on linear maps, such as function composition,
correspond pleasingly to operations on matrices, such as matrix multiplication.
2. Coordinate systems are arbitrary, and linear algebra gives you the power to
change coordinate systems—change the basis of the vector space—at will. A useful
basis is a treasure.
3. The matrix representation hides the difficult notation of working with linear
maps, reducing the cognitive burden of the mathematician.
4. The linear model is a powerful abstraction for working with real-world data, and
understanding linear algebra allows us to pinpoint the assumptions of this model,
and in particular where those assumptions might break down or limit the applicability of the model.
10.11 Exercises
10.1. Prove the 0 (the zero vector) is unique; that is, if there are two vectors v,
w both having the properties of the zero vector, then they are equal.
10.2. Prove that the composition of two linear maps is linear. I.e., the map x ↦ g(f(x)) is linear if g and f are linear.
10.3. Prove that if a linear map f is a bijection, then the inverse f^{−1} is also a linear map.
10.4. Let V, W be two vector spaces. Show that the direct product V × W is also a vector space by defining the two operations + and ·. How does the dimension of V × W relate to the dimensions of V and W?
10.5. Prove that the image of a linear map f : V → W is a subspace of the codomain,
W . Prove that the subset {v ∈ V : f ( v) = 0 } is a subspace of V .
10.6. In R2 we have colorful names for special classes of linear maps that
correspond to geometric transformations. Look up definitions and pictures to
understand matrices that perform rotation, shearing, and reflection through a line.
10.7. Research definitions and write down examples for the following concepts:
10.8. Prove that the standard inner product on R^n (Definition 10.18) is linear in the first input. I.e., if you fix y ∈ R^n, then x ↦ ⟨x, y⟩ is a linear map R^n → R. Argue by symmetry that the same is true of the second coordinate.
fa,b(0) = a
fa,b(1) = b
Prove that the set of all Fibonacci-type sequences forms a vector space (under what operations?). Find a basis, and thus compute its dimension.
10.13. The Bernstein basis is a basis of the vector space of polynomials of degree
at most n. In an exercise from Chapter 2, you explored this basis in terms of
Bézier curves. Like Taylor polynomials, Bernstein polynomials can be used to
approximate functions R → R to arbitrary accuracy. Look up the definition of the Bernstein basis, and read a
theorem that proves they can be used to approximate functions arbitrarily well.
10.14. Look up the process of Gaussian Elimination, and specifically pay attention
to the so-called elementary row operations. Each of these operations corresponds to
a change of basis, and is hence a matrix. Write down what these matrices are for
R3, and realize that every change of basis matrix is a product of some number of
these elementary matrices.
10.17. Continuing the previous exercise, the classical algorithm for solving linear
programs is called the simplex method. It was invented in the 1940’s by George
Dantzig30. At its core, the algorithm builds up a vector space basis corresponding
to the variables in the solution that have nonzero values. Then it iteratively uses
the objective (and Gaussian-elimination-style elementary row operations) to guide
how to improve the solution. Research this algorithm and implement it in its basic
form.
10.18. Look up the definition of an inner product space (a vector space equipped
with an inner product), and the definition of an isometry between two inner product
spaces.
Find, or discover yourself, the aforementioned proof that all n-dimensional inner
product spaces are isometric.
10.19. One fruitful area is the concept of a matroid. Matroids have a special place in
computer science, because they are the setting in which one studies greedy
algorithms in general.
That is, every problem that can be solved optimally with a greedy algorithm
corresponds to some matroid, and every matroid can be optimized using the greedy
algorithm. Look up an exposition on matroids and understand this correspondence.
Apply this to the problem of finding a minimum spanning tree in a weighted graph.
See Chapter 6, Exercise 6.12 for an introduction to weighted graphs.
10.21. The singular value decomposition code in this chapter has at least one
undesirable property: numerical instability. In general, numerical instability is
when an algorithm is highly sensitive to small perturbations in the input. The SVD
of a matrix which is not full rank (Cf. Exercise 10.7) contains values that are
zero. The algorithm in this chapter does not output these properly, and instead
produces non-deterministic mumbo-jumbo.
Audit the algorithm to verify this undesirable behavior occurs, and research a fix.
10.22. Research the details of the winning submission for the Netflix Prize
competition.
Identify what other ways a linear model is incorporated into the solution.
• Addition and multiplication have identity elements which are distinct. Call them
zero and one, respectively.
• Addition and multiplication both have inverses, and every element is invertible,
with the exception that zero has no multiplicative inverse.
The field is the triple ( K, + , ·), or just K if the operations are clear from
context.
The real numbers form a field, but there are many others. For example, the set of fractions of integers (rational numbers) forms a field denoted Q with the normal addition and multiplication. Another example is the binary field {0, 1}, where addition is logical XOR and multiplication is logical AND.
Now a vector space can be defined so that its scalars come from some field K in the
same way we used scalars from R. We say that V is a vector space over K to mean
that the scalars come from K. As long as the operations in K have the properties
outlined above, you can do all the same linear algebra we’ve done in this chapter.
To be particularly clear, a linear combination of vectors in V requires
coefficients coming from K, and so they’re called K-linear combinations. Also note
that K-linear combinations must be finite sums.
Linear algebra can have more nuance for some special fields, but to understand when
and how they are different you need to study a bit of field theory. If you’re
interested, look up the notion of field characteristic and in particular what
happens when fields have characteristic 2.
To leave you with one example of an interesting vector space over a field that’s
not R, consider V = R as a vector space over K = Q. This might not seem interesting
at first until you ask what a basis might be. Take the set C = { 1 , 2 , 3 , 4 ,
5 }, for example. Is it possible to write π (an element of V ) as a Q-linear
combination of the vectors in C?
You could only do so if π itself was rational, which it’s not. So how, then, might
one find a basis so that π (and every other irrational number) can be written as a
finite Q-linear combination of the elements in the basis? A curious thought indeed.
that constructs embeddings using neural networks, and there are many other tools
(such as GloVe) that have become popular since then.
Semantic word embeddings are an interesting case study into the shortcomings of
linear models. In a 2016 paper, "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings," researchers showed that word embeddings trained on news articles learn gender stereotypes, completing analogies exactly like the one in the paper's title. Whether one is willing to accept this outcome depends on the goal of the
application, but awareness is crucial. Mathematical assumptions baked into
algorithms and models—
even simple ones like linearity—can dupe the unwitting. Take care when applying
them to situations that involve people’s lives or livelihoods.
Chapter 11
Good mathematicians see analogies between theorems or theories. The very best ones
see analogies between analogies.
– Stefan Banach
During my PhD studies, my thesis advisor Lev and I would occasionally talk about
teaching. Among others, he taught algorithms and I taught calculus and intro
Python.
For those who don’t know (and apropos to an essay between two linear algebra
chapters) the Fourier Transform is a linear map that takes an input function f : R
→ R and outputs the coefficients for a representation of f with respect to a
special basis of sine and cosine functions, called the Fourier basis. 1 The Fourier
transform has a whole host of properties that make it useful for science, but in
brief, the input functions are often thought of as “signals,” such as composite
sound waves, and the output is thought of as constituent tonal frequencies. For
example, automated phone systems (“For English, press one…”) recognize the buttons
you press using Fourier analysis. Each button corresponds to two overlapping pure
frequencies, and the receiving end applies the Fourier transform to identify the
frequencies, and hence which number was pressed. The Fast Fourier Transform, or FFT
for short, is a particularly efficient algorithm for writing (finite approximations
of) signals in the Fourier basis. It’s fast because it takes advantage of the
symmetries in sines and cosines. The discovery of this algorithm has been described
as the beginning of the information age.
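To make the phone-button example concrete, here is a sketch in numpy: synthesize one second of the two tones for the "1" key (697 Hz and 1209 Hz, the standard DTMF pair) and recover both frequencies from the two largest peaks of the FFT.

```python
import numpy as np

# Build a composite signal from two pure tones, then identify the
# constituent frequencies from the FFT's two largest peaks.
sample_rate = 8000                        # samples per second
t = np.arange(0, 1, 1 / sample_rate)      # one second of samples
f1, f2 = 697, 1209                        # DTMF frequencies for the "1" key
signal = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)
top_two = sorted(freqs[np.argsort(spectrum)[-2:]])
```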
But then he excitedly explained a new insight! It was something he learned about
the FFT while preparing his lecture notes. The details are irrelevant, but my
advisor also attempted to explain this new insight to his students. This was
probably not helpful for them. Instead of focusing on basic syntax and properties of the Fourier Transform, Lev tried to convey insights he had learned over his career. This would have been great for a graduate seminar, but unfortunately it was levels above his students' ability to comprehend. They were still missing the foundational tools needed to express these thoughts. Lev was tapping the beat of a song that played clearly in his head, but which his students had never heard before.

1 This is nontrivial because the vector spaces involved are infinite dimensional.
Pedagogical critiques aside,2 after that conversation I synthesized what felt like
an obvious truth in hindsight, about math, programming, and surely all endeavors
worth pursuing. Understanding comes in levels of insight. And as you learn—but more
importantly as you re-learn—you gain meta insights. Insights about insights. You
learn what parts of a thing to appreciate and what parts are cruft.
Most experienced programmers understand these levels well. You start with the basic
syntax and semantics of a given programming language. You move up to the basic
tenets of designing and maintaining software, such as how to extract and organize
functions for reuse, proper testing and documentation, and the role of various
protocols interfacing with your system. From there it grows to insights about a
particular area of specialization, such as how the choice of database affects the
performance of a web application, how to manage an ecosystem of interdependent
services, or the tradeoffs between development speed, maintainability, and
extensibility.
When you switch to a new language, syntactic scaffolding and new paradigms
initially hide the core idea of a program. This can be complex type declarations,
or the orthodoxy of a particular pattern (promises, streams, coroutines, etc.),
which are foundationally important, but unrelated to the core logic of a program.
Over time—and with experience, an improved mental model, and useful tooling—the
cruft becomes invisible. You see a program for its core logic while still taking
advantage of the features of the language.
Terry Tao summarizes it well in his essay,3 "There's more to mathematics than rigour and proofs."

2 Collegiate education at research institutions is a snake's nest of competing incentives and demands on one's time. Having been on the academic job market and seen what constitutes success in research, I can understand the need to conduct teaching as Lev did even if I want the world to be better.

3 https://terrytao.wordpress.com/career-advice/
The point of rigour is not to destroy all intuition; instead, it should be used to
destroy bad intuition while clarifying and elevating good intuition. It is only
with a combination of both rigorous formalism and good intuition that one can
tackle complex mathematical problems; one needs the former to correctly deal with
the fine details, and the latter to correctly deal with the big picture. Without
one or the other, you will spend a lot of time blundering around in the dark (which
can be instructive, but is highly inefficient). So once you are fully comfortable
with rigorous mathematical thinking, you should revisit your intuitions on the
subject and use your new thinking skills to test and refine these intuitions rather
than discard them. One way to do this is to ask yourself dumb questions; another is
to relearn your field.
This is a worthwhile endeavor for anyone who wants to understand mathematics more
deeply than copying a formula from a book or paper. One aspect of this is that it’s
difficult to fully appreciate a definition or theorem the first time around.
Veterans of college calculus will appreciate our discussion of the motivation for
the “right” definition of a limit in Chapter 8, because typical calculus courses
are more about the mechanics—the syntax and basic semantics—of limits and
derivatives. A deep understanding of the elegance and necessity of the “supporting”
definitions, and how they generalize to ideas all across mathematics, is nowhere to
be found. To do so requires equal parts elementary proofs and sufficient time to
discuss counterexamples, neither of which are present for college freshmen in
computer science and engineering.
Another aspect is that mathematical definitions and theorems create a complex web
of generalization, specialization, and adaptation too vast to keep in your head at
once.
As one traverses a career, and studies some topics in more detail, reevaluating the
same ideas can produce new inspiration. While gnawing on a tough problem, returning
to teach basic calculus and thinking about limits might spur you to frame the
problem in the light of successively better approximations, providing a new avenue
for progress. While many researchers may find this more grueling than it’s worth—
dealing with the added distractions of grading, course design, and cheating
students—in theory it has benefits beyond the education of the pupils. My advisor’s
foray into Fourier Analysis is another example. He may not have found that insight
were he not required to prepare a lecture on the topic.
Linear algebra, even the basic stuff, is a perfect example of the web of variation
and generalization. One can take the idea of linear independence of vectors, and
generalize it to the theory of matroids, which, it turns out, is a cozy place to study
greedy algorithms (Cf. Chapter 10, Exercise 10.19). In number theory, vector spaces
drive the idea of transcendental numbers, those numbers like e and π which can’t be
represented as the root of a polynomial with rational coefficients. Since R is a
vector space over Q, one studies a transcendence basis of this vector space (cf.
Chapter 2, Exercise 2.5). In fields like algebraic geometry or dynamical systems, a
central tool is to take a complicated object and
“linearize” it, via a transformation that, say, adds new variables and equations,
so that
techniques from linear algebra can be applied. The form and function of the
applications shape one's understanding of the basic theory.
Linear algebra has higher levels of abstraction as well. We spent time, and will
continue to spend time, discussing how to cleverly choose a basis. But there is a
whole other side of linear algebra that builds up the entire theory basis-free. As
we discussed about the definition of the limit, the “right” definition of a concept
shouldn’t depend on arbitrary choices. But almost everything we’ve seen about
linear algebra depends on the choice of a basis! Recreating linear algebra without
a basis requires more complicated and nuanced definitions, but often results in
more enlightening proofs that generalize well to harder problems. As the
mathematician Emil Artin once said, “Proofs involving matrices can be shortened by
50% if one throws the matrices out.” Though we don’t have the bandwidth in this
book to cover this perspective, it’s a higher rung on the ladder.
One might expect a basis-free theory could completely eliminate messy matrix
algebra.
It could hardly be further from the truth. There is a famous quote of Irving
Kaplansky, an influential 20th century mathematician who worked in abstract algebra
(among other topics), discussing how he and his colleagues approach problems that
use linear algebra: “We share a philosophy about linear algebra: we think basis-free,
we write basis-free, but when the chips are down we close the office door and
compute with matrices like fury.”
In that respect, “cumbersome” syntax is like the manuals, READMEs, and automated
scripts that you write for yourself and refer to every time you forget how to
configure your web server. Writing things down in a precise, computational syntax
also has the benefit of isolating and clarifying the nuance and essential
characteristics of difficult examples. It’s much easier to focus on the bigger
picture, to look at a mess and point to the interesting core—as one would with a
large program—once one can freely create and manipulate the atomic units. It’s the
same reason I say (fully aware of the irony) that the primary goal of a calculus
class is to master algebra.
You don’t learn calculus until you do differential equations. And then you don’t
learn calculus until you study smooth manifolds. And then you don’t learn calculus
until you write programs that do calculus. And then you don’t learn calculus until
you teach calculus. You basically never learn calculus, and every time you use it
in a new setting you get new insights about it. I learned calculus while writing
this book! As you mature,
those insights become more nuanced, and your continued appreciation for that nuance
is what keeps mathematics fresh and enjoyable. This isn’t a unique feature to
mathematics (appreciation for nuance is as important over a long career in politics
or tennis as it is in mathematics), but the layman’s attitude toward mathematics is
that of stark facts. In reality, theories evolve and take on new colors over time.
As you revisit a subject, you must repeat the useful mechanism I've been touting throughout this
book: to write down characteristic examples that serve as your mental model for a
general pattern. Keeping examples in mind—picturesque examples with enough detail
that you can descend the ladder of abstraction to compute if necessary—is what
fortifies an idea and fertilizes the orchard from which you can pick ripe
analogies.
The final aspect is that relearning one’s field allows one to revisit the proofs of
the central theorems of that subject. The maturity afforded by not spending most of
one’s effort trying to understand the proof allows one to then judge the proof on
its merits.
It’s like reading the code for a system you designed, long after you’ve implemented
and maintained it. You have a much better understanding of the real requirements
and failures of the system. Such considerations often result in alternative proofs,
which generalize and adapt in new and novel ways. Or one can gain a deeper
understanding of the benefits and limitations of a proof technique, and how they
apply (or don’t) to a problem in the back of one’s head.
Back down to earth, this book is roughly a second or third level of insight. The
first level would be functional fluency with symbol manipulation. Though it sounds
like it’s quite basic, most of college mathematics education for engineers does not
tread far off this path. This includes even differential equations, statistics, and
linear algebra, often considered the terminal math courses for future software
engineers.
The second level is largely about proof. Can you logically prove that the symbolic
manipulations in the first level are correct? It’s a meta level of insight, but in
another sense it’s still a kind of basic fluency. For many undergraduate
mathematics majors, becoming fluent in the language of proof is the central goal of
their studies. This is why almost all advanced math courses are proof-based
courses, and why we’ve spent so much time in this book proving and discussing
methods for proof.
The next level of insight, which usually comes after being able to prove the basic
facts about an object, concerns why the existence and prevalence of that object
makes sense. This often occurs through proof, but also through a non-rigorous
hodgepodge of examples, discussion, connections to other objects, and the
consideration of alternatives by which one becomes acquainted with a thing.
Further tiers revolve around new research. Understanding what questions are
interesting, sketching why a theorem should be true before a proof is found,
generalizing families of proofs into a theory that makes all those proofs trivial.
And all the while one traverses the ladder of abstraction as needed, sometimes
diving into the muddy waters to crack a tough integral, other times homing in on
the importance of one particular property of an object.
It sounds negligent to speak about math in such an imprecise manner, and
mathematicians like to poke fun at themselves. John von Neumann (of computer architecture
fame) once told a physicist colleague, “In mathematics you don’t understand things.
You just get used to them.” How deliciously blasphemous! More seriously, my
interpretation is that this quote continues, “…until you find that next level of
insight.” It’s true, at least, in my experience, that one must gain sufficient
comfort in mechanics before one can attempt proof, and one must gain some level of
comfort with proof before the next-level insights about definitions can be
appreciated.
It’s not just professional mathematicians who experience this. This happens at
every level of the hierarchy. My wife is a math professor at a community college,
and despite having spent years of her undergraduate career doing proofs by
induction, it was not until she taught it a few times that the deeper understanding
of why it worked dawned on her. She had a similar experience re-learning algebraic
topology for a qualifying exam, and I distinctly recall her gleeful yelp when she
realized that she intimately understood what she was doing and why it worked. She
shouted, “The proof is trivial!”
The cognitive scientist Douglas Hofstadter asserts that analogies are the core
mechanism of human cognition. Part of his evidence is the wealth of analogies that
surround us in language: the commonplace concept of an airport “hub” relies on
analogies between the spokes of a bicycle wheel and notions of centrality in a
network, each of which relies on lower-level analogies of position and motion. These
ideas are paired with ideas about corporations and brands, to say nothing of the web
of concepts around human conceptions of airplane flight. This is all summarized by
the single word “hub.”
Mathematical cognition is also largely built on analogies. And just like humans
understand the concepts of motion or a wheel long before we’re able to understand
the concept of an airport hub, we’re able to understand the lower levels of
mathematical abstraction (and must become comfortable with them) before we can draw
the analogies necessary to make use of the more complex and nuanced abstractions.
And then, much later, we can look back at the bicycle wheel, or the derivative,
with a new appreciation for its purpose and use. Mathematical intuition in
particular is the graduation from purely analytical and mechanical analysis to a
visceral feeling of why a thing should behave the way it does.
No matter where you currently stand, there are insights to be found and analogies
to draw. Don’t underestimate their value, even if they lie among “simple” things
that you think you should have mastered years ago.
Chapter 12
The notion of eigenvalue is one of the most important in linear algebra, if not in
algebra, if not in mathematics, if not in the whole of science.
– Paolo Aluffi
If you polled mathematicians on what the “most interesting” topic in linear algebra
was, they'd probably agree on eigenvalues. The definition of an eigenvalue is so
simple that I can state it now without further ado: given a linear map f : V → V , a
nonzero vector v ∈ V is an eigenvector of f with eigenvalue λ if f ( v) = λv.
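As a quick sanity check of the defining equation Av = λv, here is a pure-Python sketch; the matrix, vector, and helper function are illustrative choices, not from the text:

```python
def matvec(A, v):
    """Multiply a matrix A (given as a list of rows) by a vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[2, 1],
     [1, 2]]
v = [1, 1]

# Av equals 3 * v, so v is an eigenvector of A with eigenvalue 3.
Av = matvec(A, v)
assert Av == [3 * x for x in v]
```

Nothing deep happens here; the point is only that checking a candidate eigenvector is a single matrix-vector multiplication.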
The question of why eigenvalues are so central to linear algebra and its
applications is a deep one, and there is no easy answer. In a vague sense, the
eigenvectors and eigenvalues of a linear map encode the most important data about
that map in a natural, efficient way. More concretely, in the scope of this chapter
eigenvectors provide the “right” basis in which to study a linear map V → V . They
transform our perspective so that the important features of a map can be studied in
isolation. If you accept that premise, it’s no surprise that eigenvalues are useful
for computation. But to say anything more concrete than that, to explain the
universality of eigenvalues, is difficult.
The application for this chapter is a deep dive into how eigenvectors and
eigenvalues explain the dynamics of a particular physical system describing one-
dimensional waves.
In no uncertain terms, eigenvalues are the scientific theory that reveals the inner
nature of this physical system.1

1 “Trivial” gets new meaning in this context that is partially subjective. To conjure
“the nontrivial solutions,” one must know which solutions count as trivial. In this
book we will state explicitly what the “trivial” solutions are, but elsewhere you may
have to infer.
In Chapter 14 we’ll see how eigenvalues encode information about smooth surfaces in
a way that enables optimization. And the singular values we saw in Chapter 10 are
closely related to eigenvectors and eigenvalues in a way we didn’t have the
language to explain in that chapter (see the exercises for more on that).
I could spend all day giving examples of how eigenvectors are used in practice. But
to get to the heart of what makes them useful is another task entirely. The word
eigenvalue itself doesn’t have any intrinsic meaning that might hint at an answer.
Eigenvalue comes from the German word eigen, simply meaning “own,” in the sense of
the phrase, “I have my own principles to uphold and refuse to use emacs.” In that
sense, eigenvalue simply means a value that is intrinsic to the linear map. The
importance of the study of eigenvalues and eigenvectors is analogous to the
importance of the roots of a polynomial to the study of polynomials. Knowing the
roots of a polynomial allows you to write the polynomial in a simpler form, and
“read off” information about the polynomial from the simpler representation. So it
is with eigenvalues and eigenvectors. The eigenvalues of a linear map are even the
roots of a special polynomial (See Exercise 12.11).
Theorem 12.2. Let A be a matrix and U be a change of basis matrix, with B = UAU− 1.
If v is an eigenvector of A with eigenvalue λ, then Uv is an eigenvector of B with
the same eigenvalue λ.
Proof. We need to show that BUv = λUv. To do this, expand B = UAU− 1 and apply
algebra: BUv = UAU− 1 Uv = UAv = U( λv) = λUv.2

In what follows, In is the n-by- n identity matrix, i.e., the representation of the
function I( v) = v that is the same for every basis.
So while (the coordinates of) eigenvectors are not preserved across different
bases, the eigenvalues are. A technical way to say this is that eigenvalues of a
linear map f are invariant properties of f. Invariance means that the property
doesn't change under some relevant class of transformations; here, a change of basis.

2 I hopefully assured you in Chapter 10 that basic algebra operations such as
regrouping parentheses are legal in matrix algebra, without requiring a detailed and
painful derivation of that fact. Such work belongs in textbooks, and we have more
exciting things to do here.
We’ll have more to say on this when we study hyperbolic geometry in Chapter 16.
This is the best high-level intuition I can give without getting too deep in the
math.
Before we do, let’s see a compelling example of why eigenvalues are so interesting
and complex for specific matrices called adjacency matrices. In the next section we
won’t prove any of the theorems we state.
In the exercises, you will write down a description of this matrix as a linear map
and interpret what it means in graph-theoretic terms. In particular, each of the
standard basis vectors ei = (0 , . . . , 0 , 1 , 0 , . . . , 0) can be thought of
as identifying the i-th vertex vi of G. Figure 12.1 is an example graph and its
adjacency matrix. We call a graph bipartite if its vertices can be partitioned into
two parts in such a way that all edges cross from one part to the other. The graph
G in Figure 12.1 is bipartite because it can be partitioned into
{ 1 , 3 } and { 2 , 4 , 5 }.
A( G) =
     e1 e2 e3 e4 e5
e1 [ 0  1  0  1  1 ]
e2 [ 1  0  0  0  0 ]
e3 [ 0  0  0  1  0 ]
e4 [ 1  0  1  0  0 ]
e5 [ 1  0  0  0  0 ]
Bipartite graphs are common in applications, because they naturally encode networks
in which there are two classes of things, where things within a class don’t relate
to each other. For example: students and teachers, with edges being class
membership; wholesale factories and distributors, with edges being shipments; or
files and users, with edges being access logs. Problems that can be intractable on
general graphs can be easy to solve on bipartite graphs, which is a compelling
reason to study them.
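Bipartiteness is also easy to test algorithmically by attempting a two-coloring with breadth-first search. Here is a minimal sketch; the helper name is my own, and the edge list matches the graph G of Figure 12.1:

```python
from collections import deque

def is_bipartite(n, edges):
    """Two-color vertices 1..n by BFS; the graph is bipartite exactly
    when no edge joins two vertices of the same color."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = {}
    for start in range(1, n + 1):
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    queue.append(w)
                elif color[w] == color[u]:
                    return False
    return True

# The graph G of Figure 12.1 is bipartite, with parts {1, 3} and {2, 4, 5}.
G_edges = [(1, 2), (1, 4), (1, 5), (3, 4)]
assert is_bipartite(5, G_edges)

# A triangle is the smallest non-bipartite graph.
assert not is_bipartite(3, [(1, 2), (2, 3), (1, 3)])
```

The BFS runs in time linear in the number of vertices and edges, which is one sense in which bipartiteness is an "easy" structural property.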
Now here is a fantastic theorem that we won't prove. Let A( G) be the adjacency
matrix of a (not-necessarily bipartite) graph G. Let λ 1 be the largest eigenvalue,
λ 2 the second largest, etc., so that λn is the smallest. Note that these
eigenvalues may be negative. Also note that adjacency matrices have n eigenvalues,
though to see why we'll need the theory built up in this chapter (Propositions
12.11 and 12.14). The theorem: a connected graph G is bipartite if and only if
λn = −λ 1.
This is just one of the many ways that the eigenvalues of the adjacency matrix of G
reveal the structure of G.
Here is another theorem, which I will paraphrase slightly to hide the nitty-gritty
details.
It says that the eigenvector for the second-largest eigenvalue of the adjacency
matrix encodes information about tightly-knit clusters of vertices in a graph. In
fact, it encodes this information better than simple statistics in the following
concrete setting.
3 More specifically, it's called an Erdős-Rényi random graph, and the output is a
draw from the uniform distribution over all graphs on n vertices.
One can show (though we will not) that for a random graph, with overwhelming
probability the densest cluster of vertices will have almost exactly 2 log( n)
vertices in it. It’s also widely believed that no efficient algorithm can reliably
find the densest cluster.
So to make this cluster-finding problem easier, after creating the graph in this
random way, pick a random subset of vertices of size t, and connect all remaining
edges among those vertices. We’ll call the chosen subset a planted clique. In
general, a clique is a subset of vertices with a complete set of edges among them.
It’s a subgraph that forms the complete graph Kt for some t. You might expect that
such a dense cluster of vertices would be detectable, simply by being a statistical
anomaly. Maybe you could just count up how many edges are on each vertex, looking
at the ones that are unusually large, to find the planted clique. I won’t prove so
here, but for this method to work, the planted clique would need to be quite large.
The following spectral algorithm works for smaller t:4

1. Compute an eigenvector v for the second-largest eigenvalue of A( G), and take
the √n vertices whose corresponding entries in v are the largest in absolute value.
Call this set T .
2. Output the set of vertices of G that are adjacent to at least 3/4 of the
vertices in T .
By now I hope I have convinced you that eigenvectors and eigenvalues, together
often called an eigensystem, encode useful information about linear maps, and the
underlying data those linear maps represent.
However, we still have little understanding about why eigensystems reveal such
valuable information. The briefest possible answer might be formulated as
“eigenvectors, scaled by their eigenvalues, provide the most natural coordinate
system in which to view linear maps V → V .”
One wrinkle is that a matrix with real entries can fail to have real eigenvalues;
a rotation of the plane, for example, moves every nonzero vector off its own span.
Introducing complex numbers makes other things simpler, while making some things
more complicated.
of complex numbers, you will have difficulty interpreting how they relate to a
linear map for vectors of real numbers. This book skips complex numbers, so we will
not be able to give a complete picture.
A second reason is that multiple linearly independent eigenvectors can exist for
the same eigenvalue, and there may or may not be “enough” eigenvectors to provide a
complete picture. This topic is nuanced—and not needed for our application—so we
omit it except to mention some pointers in Section 12.5.
Luckily, there is a nice way to avoid dealing with these problems while still
seeing the lion’s share of eigenvalue power in practice. That is the following
theorem:
Theorem 12.6. Let f : R n → R n be a linear map and let A be its associated matrix.
If A is symmetric, meaning A[ i, j] = A[ j, i] for every i, j, then A has n real
eigenvalues (not necessarily distinct) with n corresponding real eigenvectors.
A useful notation when working with symmetric matrices is the transpose. Define AT
to be the matrix whose i, j entry is A[ j, i]. That is, you take A and flip it along
the top-left-to-bottom-right diagonal to get AT . With this
notation, saying A is symmetric is saying that A = AT . Here's an example of a
symmetric matrix:

[ 2  5   6 ]
[ 5  7   9 ]
[ 6  9  −1 ]
Let's now prove the important takeaway of that discussion: symmetric matrices play
nicely with the inner product.
First, one can verify that the standard inner product definition results in
⟨Ax, y⟩ = ⟨x, AT y⟩. With symmetry, this simplifies to ⟨Ax, y⟩ = ⟨x, Ay⟩. What's
special is that symmetric matrices can be defined by this property.
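As a numerical sanity check of this property, here is a sketch with an arbitrary symmetric matrix and arbitrary vectors (all values are illustrative):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

# A symmetric 3x3 matrix: A[i][j] == A[j][i] for all i, j.
A = [[2, 5, 6],
     [5, 7, 9],
     [6, 9, -1]]
x = [1, 2, 3]
y = [-1, 0, 4]

# <Ax, y> == <x, Ay> holds exactly for symmetric A (here both sides are 54).
assert dot(matvec(A, x), y) == dot(x, matvec(A, y))
```

With integer entries the two sides agree exactly, so the assertion is an equality rather than a floating-point tolerance check.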
Theorem 12.7. Let A be a real-valued n × n matrix, and let ⟨−, −⟩ denote6 the
standard inner product of real vectors. Then A is symmetric if and only if ⟨Ax, y⟩
= ⟨x, Ay⟩ for every pair of vectors x, y ∈ R n.
Proof. Symmetry gives the forward direction of the “if and only if,” since ⟨x, AT
y⟩ =
⟨x, Ay⟩. For the reverse direction, suppose that ⟨Ax, y⟩ = ⟨x, Ay⟩ for all x, y.
Let a 1 , . . . , an be the columns of A, and apply this fact to the vectors x =
ei, y = ej (the standard basis vectors with a 1 in positions i and j,
respectively). We have
⟨Aei, ej⟩ = ⟨ai, ej⟩ = A[ j, i]
And we can do the same thing with A on the other side, by assumption:
⟨Aei, ej⟩ = ⟨ei, Aej⟩ = ⟨ei, aj⟩ = A[ i, j]. Hence A[ i, j] = A[ j, i] for every
i, j, and A is symmetric.
We will use symmetry to prove that every symmetric matrix with real-valued entries
has a real eigenvalue. This is the central lemma needed to prove Theorem 12.6.
Funnily, we’ve spent so long preaching the virtues of eigenvalues, we haven’t even
considered the basic question of their existence!
Lemma 12.8. Let A be a symmetric real-valued matrix. Then A has a real eigenvalue.
Proof. Let x be a unit vector which maximizes7 the norm ∥Ax∥, and let c = ∥Ax∥. Then
Ax = cy for some unit vector y. If y is in the span of x (which happens most of the
time), then we are done, because Ax = cy = ( cd) x for some d ∈ R, which makes
cd an eigenvalue. Otherwise, we may assume going forward that x and y are linearly
independent. By the maximality of x we know that ∥Ay∥ ≤ c.

6 The notation ⟨−, −⟩ is used to signify that the function will be expressed in this
nonstandard “pairing” notation.
7 Why must such a vector exist? This is not trivial, but is true due to a
generalization of the Extreme Value Theorem to R n. It is a standard result which
usually involves a little bit of topology (compact sets and continuous functions),
and is hence beyond the scope of this book.
We will show that x + y is an eigenvector with eigenvalue c. After the proof we'll
explain as a side note why it makes sense in hindsight to consider x + y. Now
notice that

⟨x, Ay⟩ = ⟨Ax, y⟩ = ⟨cy, y⟩ = c.
The first equality is due to Theorem 12.7, the second is the definition of y, and
the third is because the inner product is linear in each argument and y is a unit
vector (Proposition 10.19).
The crucial observation is that ⟨x, Ay⟩ is the (signed) length of the projection of
Ay onto the unit vector x. Projecting a vector onto a unit vector can only make the
first vector shorter. You should have some intuitive sense that this is true after
our analysis—
particularly the pictures—in Chapter 10. We leave a rigorous proof for the
exercises. As a consequence, c = ⟨x, Ay⟩ ≤ ∥Ay∥.
Combining this fact with the earlier fact that ∥Ay∥ ≤ c gives us c = ⟨x, Ay⟩ = ∥Ay∥.
Equality in the projection bound forces Ay to lie in the span of x, and since
⟨x, Ay⟩ = ∥Ay∥, in fact Ay = cx. Then A( x + y) = Ax + Ay = cy + cx = c( x + y), so
x + y is an eigenvector with eigenvalue c.

As a side note, in hindsight it makes sense to consider x + y because it's
“halfway” between x and y. Indeed, it's in the span of the vector ( x + y)/2, which
is a more suggestive way to say the “average” of x and y. Symmetry was our guide: A
sends x to the span of y and vice versa. The seasoned linear algebraist would guess
—
and prove shortly thereafter—that the symmetry extends to the whole plane spanned
by {x, y}. Since the behavior of any linear map (on this subspace) only depends on
its behavior on the basis (of the subspace), we deduce that A behaves as a
reflection, flipping the entire plane span {x, y}. And every reflection in a plane
has a line of symmetry, which in this case is through x + y.
The inner product is starting to take center stage. We should study it in more
detail.
In order to express one very useful aspect of eigenvectors, we must revisit the
discussion from Chapter 10 about the inner product. In general, a vector space only
has a limited amount of geometry you can describe. However, if you specify an inner
product, you can describe angles, lengths, and more. The inner product is imposed
on a vector space, in the same way that a style guide is imposed on a programmer:
to give structure
to (or elucidate structure in) the underlying space. The standard inner product on
R n is defined by the formula
⟨x, y⟩ = x 1 y 1 + x 2 y 2 + · · · + xnyn.
This formula is intimately connected with geometry. It can be used to compute the
angle between two nonzero vectors (via cos θ = ⟨x, y⟩/( ∥x∥ · ∥y∥)), and its value
is the signed length of the projection of one argument onto the other (scaled by
the length of the other).
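The angle formula translates directly to code. A minimal sketch, with illustrative vectors and helper names of my own choosing:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def angle(u, v):
    """Angle between nonzero vectors via cos(theta) = <u,v> / (|u| |v|)."""
    return math.acos(dot(u, v) / (norm(u) * norm(v)))

# (1, 0) and (0, 1) are perpendicular: the angle is pi/2.
assert abs(angle([1, 0], [0, 1]) - math.pi / 2) < 1e-12
# (1, 1) makes a 45-degree angle with (1, 0).
assert abs(angle([1, 1], [1, 0]) - math.pi / 4) < 1e-12
```

The second assertion also illustrates the projection reading: the inner product of (1, 1) with the unit vector (1, 0) is 1, the signed length of its shadow on the x-axis.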
Over the years mathematicians have extracted the generic properties of this formula
that conjure up its geometric magic. The result is a distilled definition of an
inner product.
Definition 12.9. Let V be a vector space with scalars in R. An inner product for V
is a function ⟨−, −⟩ : V × V → R with the following properties:
1. Symmetric: For every v, w ∈ V swapping the order of the inputs doesn't change
the inner product, i.e. ⟨v, w⟩ = ⟨w, v⟩.
2. Linear in each input: fixing the second input w, we have
⟨cv, w⟩ = c⟨v, w⟩ for all c ∈ R, and ⟨u + v, w⟩ = ⟨u, w⟩ + ⟨v, w⟩. Likewise for
fixing the first input.
3. Positive definite: ⟨v, v⟩ ≥ 0 for every v ∈ V , with equality if and only if
v = 0.
A vector space V and a specific inner product ⟨−, −⟩ are together called an inner
product space.
This allows us to justify using the standard inner product on R n for applications
that lack a more principled choice.
More generally, the abstract definition of an inner product becomes more useful and
interesting when you’re dealing with infinite-dimensional vector spaces. We won’t
cover this in depth in this book, but a quick aside may pique your interest. The
gold standard example of an interesting inner product space is the space of
functions of a single real
variable f : R → R whose square has a finite integral.8 Call this space L 2(R), or
just L 2:

L 2(R) = { f : R → R : ∫_{−∞}^{∞} f ( x)2 dx is finite }
A typical example of where these functions occur in real life is as sound waves.
L 2 is a vector space: the sum of two functions f, g is defined pointwise by
( f + g)( x) = f ( x) + g( x), and with the requisite calculus one can prove that
the sum of two square-integrable functions is square-integrable. The case is
similar for the other required vector space properties. And finally, the jewel in
the crown, the inner product is9

⟨f, g⟩ = ∫_{−∞}^{∞} f ( x) g( x) dx.
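To make this inner product tangible, here is a sketch that approximates ⟨f, f⟩ by a Riemann sum for the square-integrable function f(x) = e^{−x²/2}; the exact value is the Gaussian integral ∫ e^{−x²} dx = √π. The truncation interval and step count are arbitrary choices of mine:

```python
import math

def l2_inner(f, g, lo=-10.0, hi=10.0, steps=200_000):
    """Approximate <f, g> = integral of f(x) g(x) dx by a Riemann sum.
    The tails beyond [lo, hi] are assumed negligible."""
    dx = (hi - lo) / steps
    return sum(f(lo + i * dx) * g(lo + i * dx) for i in range(steps)) * dx

def f(x):
    return math.exp(-x * x / 2)

# <f, f> should be close to the Gaussian integral sqrt(pi).
approx = l2_inner(f, f)
assert abs(approx - math.sqrt(math.pi)) < 1e-6
```

Because f decays so fast, truncating the integral at ±10 loses only an amount on the order of e^{−100}, far below the tolerance in the assertion.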
This inner product space—which actually satisfies some additional properties that
make it into a so-called Hilbert space—is different from vector spaces we’ve seen
so far. In particular, in R n there’s a “default” basis in which we express vectors
without realizing it: the standard basis. L 2 has no obvious basis. From our
discussion of Taylor series in Chapter 8, we know that polynomials can approximate
functions in the limit. One might hope that polynomials form a basis of this space,
perhaps { 1 , x, x 2 . . . }. But actually these functions are not even in L 2.
Moreover, many functions in L 2 aren’t differentiable everywhere, so Taylor series
can run into trouble.
As it happens, there are many interesting and useful bases for this space. For
example, the following basis is called the Hermite basis:10
{ x^k e^{−x²/2} : k = 0 , 1 , 2 , . . . }
But proving this is a basis is not trivial! There are other useful bases as well.
The Fourier basis, a staple of the signal-processing world and electrical
engineering, is the set of complex exponentials {e 2 πikx : k ∈ Z }. Since we’re
not officially covering complex numbers in this book, think of this basis as the
set of all sine and cosine functions with all possible periods.
These bases are difficult to discover. But even when we have one, how in the name
of Grace Hopper can one even write a function in such a basis? You can't set up a
system of linear equations as we did in R n; there would be infinitely many of them.

8 We won't cover integration in this book, but you don't need to know (or
remember) how to integrate functions to follow along; just think of the integral of
f as the area in between f ( x) and the x-axis. The meta-motivation for inner
products is well-worth any notational discomfort.
10 More specifically, the Hermite basis is what happens when you apply Gram-Schmidt
to orthogonalize and normalize this basis, which we'll see later in this chapter.
Using the inner product, and some work to modify the basis to make it geometrically
amenable, the process of writing a function with respect to one of these (modified)
bases reduces to computing an inner product. Once again, we translate an intuitive
but hard mathematical concept into a more computationally friendly language. This
should impress upon you the importance of the inner product. Not only does it endow
a vector space with new, geometric measurements; it also makes computing basis
representations possible where it might otherwise not be. A powerful revelation
indeed.
In the rest of this chapter, except for the application, the inner product will be
considered abstractly, as we study its generic properties and how it relates to
eigenvectors. We’ll also see how the inner product relates to simplifying the
computation of expressing a vector in terms of a basis.
Definition 12.9 implies some easy consequences. Here are two examples.
Proposition 12.10. Let 0 be the zero vector of V , and 0 the real number zero. Then
⟨v, w⟩ = 0 for every w ∈ V if and only if v = 0.
Proof. For the forward direction, if ⟨v, w⟩ = 0 for every w, then fix w = v. The
defining properties of an inner product require v = 0. For the reverse direction,
fix any w and note that f( v) = ⟨v, w⟩ is a linear map. Linear maps preserve the
zero vector, so f(0) = 0.
In the exercises you will prove some other basic facts about inner products, but
here is one too important to relegate to the end of the chapter.
Proposition 12.11. Let A be a symmetric real-valued matrix, and let v, w be
eigenvectors of A with distinct eigenvalues λ ̸= µ. Then ⟨v, w⟩ = 0.

Proof. By Theorem 12.7, ⟨Av, w⟩ = ⟨v, Aw⟩, which is to say ⟨λv, w⟩ = ⟨v, µw⟩.
Since this is an inner product, we can pull out the scalar multiples on the far
left and right-hand sides to get λ⟨v, w⟩ = µ⟨v, w⟩. The only way for this equation
to be true in spite of λ ̸= µ is if ⟨v, w⟩ = 0.
As we proved in Chapter 10, the standard inner product on R n allows one to compute
angles, and more specifically to determine when two vectors are perpendicular to
each other. In a generic inner product space, perpendicularity is undefined, and so
we define it by generalizing what we proved in R n. Perpendicularity and length get
new names.
Another way to say Proposition 12.11 is that if two eigenvectors are not
orthogonal, then they must have the same corresponding eigenvalue (this is the
contrapositive statement11).
Two vectors v, w in an inner product space are called orthogonal if ⟨v, w⟩ = 0, and
the norm of v is defined as ∥v∥ = √⟨v, v⟩. The quantity ⟨v, v⟩, without
a square root, is called the square norm. Vectors with norm 1 are called unit
vectors.
Most of the facts about perpendicularity and projection we proved for R n actually
don’t depend on the definition of the standard inner product. They can be re-proved
using any inner product, because the key ingredients from those proofs were
extracted into the definition of an inner product. Next we’ll show that orthogonal
vectors can be used to build up a basis.
Proposition. Let v 1 , . . . , vk be nonzero vectors which are pairwise orthogonal,
i.e., ⟨vi, vj⟩ = 0 whenever i ̸= j. Then the vi are linearly independent.

Proof. Suppose c 1 v 1 + · · · + ckvk = 0. To show linear independence, recall, we
need to show that all the ci = 0. Take the inner product of both sides with a fixed
vi. By pairwise orthogonality, all the terms in the sum are zero except
ci⟨vi, vi⟩. Thus, this sum reduces to ci⟨vi, vi⟩ = 0. Then either vi = 0 (ruled out
by assumption) or ci = 0. The same argument applies to every ci.
Orthonormal bases make it easy to write a vector in terms of the basis. Let V be an inner
product space, and suppose that {v 1 , . . . , vn} is a basis for V , where every
vi is a unit vector and
⟨vi, vj⟩ = 0 for every i ̸= j. Such a basis is called an orthonormal basis. The
“ortho”
is because each pair is orthogonal, and “normal” because each vector is a unit
vector (normalized). Having such a basis allows you to compute the basis
representation of any vector using inner products.
Proposition. Let {v 1 , . . . , vn} be an orthonormal basis of an inner product
space V . Then every x ∈ V can be written as x = ⟨x, v 1 ⟩v 1 + · · · + ⟨x, vn⟩vn.

Proof. Fix any basis vector vi and let x = c 1 v 1 + · · · + cnvn where cj are the
(unknown) coefficients of x's representation with respect to the basis. Then

⟨x, vi⟩ = c 1 ⟨v 1 , vi⟩ + · · · + cn⟨vn, vi⟩
        = c 1 · 0 + · · · + ci− 1 · 0 + ci · 1 + ci+1 · 0 + · · · + cn · 0
        = ci.
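The computation above in action: a sketch that decomposes a vector over a rotated orthonormal basis of R 2 and reconstructs it from the inner-product coefficients (the basis, angle, and vector are all illustrative choices):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# An orthonormal basis of R^2: the standard basis rotated by 30 degrees.
t = math.pi / 6
basis = [[math.cos(t), math.sin(t)],
         [-math.sin(t), math.cos(t)]]

x = [3.0, -2.0]

# Coefficients are just inner products: c_i = <x, v_i>.
coeffs = [dot(x, v) for v in basis]

# Reconstruct x as c_1 v_1 + c_2 v_2; it matches the original vector.
recon = [sum(c * v[j] for c, v in zip(coeffs, basis)) for j in range(2)]
assert all(abs(a - b) < 1e-12 for a, b in zip(x, recon))
```

No linear system was solved anywhere: two inner products did all the work, which is exactly the efficiency gain discussed next.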
For a general basis, computing a vector's representation requires solving a linear
system, say via Gaussian elimination. However, with an orthonormal basis all you
need to do is compute n inner products. The standard inner product only takes n
multiplications and n additions, meaning the entire decomposition only takes time
n 2. This is a huge improvement if, say, you compute an orthonormal basis once and
use it to compute basis representations many more times, as opposed to doing
Gaussian elimination for each vector you want to represent in the target basis.
It's also worth noting that in
practice there’s often a natural ordering on a basis, so that the first vectors in
the basis contribute “most significantly” to the space, and one can approximate a
basis representation using a constant-sized subset of the basis. The singular
values played this role in Chapter 10. For our physics application the eigenvalues
will determine the ordering.
But beyond that, in a space like L 2 where there’s no natural starting basis, this
gives us a feasible way to compute basis representations: just compute the inner
product! In L 2 you simply integrate.12
Theorem. Let B be a matrix whose columns v 1 , . . . , vn form an orthonormal
basis. Then B is invertible, with B− 1 = BT .

Proof. We can prove this directly by showing that BT B is the identity matrix,
i.e., the matrix 1 n with 1s on the diagonal and zeros elsewhere. Indeed, the
entries of BT B encode all pairwise inner products of the vectors in the basis. The
i, j entry of BT B is the inner product ⟨vi, vj⟩, which is 1 if and only if i = j,
and zero otherwise.
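A quick numerical check of BT B = 1 n for a small example: the columns of a rotation matrix form an orthonormal basis of R 2 (the angle below is an arbitrary choice):

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Multiply matrices: entry i,j is (row i of A) dot (column j of B)."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

t = 0.7  # any angle gives an orthonormal pair of columns
B = [[math.cos(t), -math.sin(t)],
     [math.sin(t), math.cos(t)]]

# B^T B is the 2x2 identity, up to floating point error.
I2 = matmul(transpose(B), B)
expected = [[1.0, 0.0], [0.0, 1.0]]
assert all(abs(I2[i][j] - expected[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

The diagonal entries reduce to cos² t + sin² t = 1 and the off-diagonal entries cancel, exactly the pairwise inner products described in the proof.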
12 Integration is not always computationally easy, but you choose the orthonormal
basis so that it is.
One may wonder if it’s also necessary to show BBT = 1 n in order to conclude that
BT is a proper inverse of B. A direct proof hits an immediate barrier, because the
inner products don’t line up as they did above. It turns out this barrier is a
mirage. By pure set theory, namely Proposition 4.13 from Chapter 4, a one-sided
inverse of a bijection is automatically a two-sided inverse. All change of basis
matrices are bijections.
If we wanted to prove this without set theory hijinks, we could have done so by
proving the identity ( AT )− 1 = ( A− 1) T .
Our next task is to compute orthonormal bases. For finite dimensional inner product
spaces there’s an algorithmic method called the Gram-Schmidt process. It falls
short of an algorithm by not defining how to do one important step. First, a
definition:
Definition 12.18. Let V be an inner product space and W ⊂ V a subspace with an
orthonormal basis B = {w 1 , . . . , wk}. Let v be a vector, and define the
projection of v onto the subspace W , denoted by projW ( v), as follows:

projW ( v) = Σ_{wi ∈ B} ⟨v, wi⟩ wi
1. Let S₀ = {} be the empty set. Sᵢ will contain the partial basis built up so far at step i.
2. For i = 1, …, n:
   a) Choose any vector v not in span(Sᵢ₋₁).
   b) Let v′ = v − proj_span(Sᵢ₋₁)(v).
   c) Let Sᵢ = Sᵢ₋₁ ∪ {v′/∥v′∥}.
3. Output Sₙ.
The Gram-Schmidt process doesn’t dictate how to find a vector not in the span of a
given set, but using that as a subroutine, the rest is well-defined arithmetic. The
proof that the result is an orthonormal basis is a simple exercise in induction.
The same algorithm allows one to start from a given basis (possibly of a subspace),
and transform it into an orthonormal basis with the same span. For this variant, if
you have a subspace basis
{v 1 , . . . , vk}, and you want to know what new vector to choose at step i, you
can simply choose vi.
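A sketch of this variant in Python (my own illustration, using numpy, which the chapter turns to later): each step subtracts the projection onto the span of the partial basis, then normalizes.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors."""
    basis = []
    for v in vectors:
        # proj(v) = sum of <v, w> * w over the orthonormal vectors found so far
        projection = sum((np.dot(v, w) * w for w in basis), np.zeros_like(v))
        v_prime = v - projection
        basis.append(v_prime / np.linalg.norm(v_prime))
    return basis

orthonormal = gram_schmidt([np.array([1.0, 1.0]), np.array([1.0, 0.0])])
```

The output vectors span the same subspace as the inputs, but are pairwise orthogonal unit vectors.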
Another reason why the analysis of eigenvalues is hard is that zero can be an
eigenvalue. The eigenvectors with eigenvalue zero span the preimage of the zero
vector.
If you believe that finding roots of single-variable polynomials is hard, you might
also be convinced that finding “roots” of linear maps is hard. In fact, you’ll
prove in an exercise that computing eigenvalues of linear maps is at least as hard
as computing roots of polynomials. And as we’ll see below, all eigenvalues can be
expressed in terms of kernels. For the next proposition, I denotes the identity map
I( x) = x, with corresponding matrix In for n-dimensions.
A =
[ 1  0  0 ]
[ 0  1  0 ]
[ 0  0  0 ]

To find the eigenvalues of A we inspect the kernel of A − λI₃, which is

A − λI₃ =
[ 1−λ   0    0  ]
[  0   1−λ   0  ]
[  0    0   −λ ]

A vector (a, b, c) in the kernel of this map (for some unknown λ) must satisfy a(1 − λ) = 0, b(1 − λ) = 0, and −cλ = 0. When λ = 1 this forces c = 0 and leaves a and b free, and when λ = 0 it forces a = b = 0 and leaves c free.
B =
[ 1  1  0 ]
[ 0  1  1 ]
[ 0  0  1 ]

This matrix clearly has one eigenvector, (1, 0, 0), for the eigenvalue λ = 1. But what about other potential eigenvectors? Indeed, we're looking for the kernel of B − I₃, which is

B − I₃ =
[ 0  1  0 ]
[ 0  0  1 ]
[ 0  0  0 ]
Aside from the span of (1, 0, 0), there are no zeroes. And moreover, for λ ≠ 1, B − λI₃ has only the trivial kernel {0} (set up the system of three equations and verify this).
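Taking A = diag(1, 1, 0) and B the matrix with 1s on the diagonal and superdiagonal as in the example above, the geometric multiplicities can be confirmed numerically (a small check of my own using numpy):

```python
import numpy as np

A = np.diag([1.0, 1.0, 0.0])
B = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# The geometric multiplicity of eigenvalue 1 is the dimension of the
# kernel of (M - I), i.e. 3 minus the rank of that matrix.
mult_A = 3 - np.linalg.matrix_rank(A - np.eye(3))
mult_B = 3 - np.linalg.matrix_rank(B - np.eye(3))
assert mult_A == 2  # a two-dimensional eigenspace for A
assert mult_B == 1  # but only a one-dimensional eigenspace for B
```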
For the matrix A above, the eigenvalue 1 has geometric multiplicity 2, but for B it has geometric multiplicity 1.
If you’re studying a linear map f : V → V , and for each eigenvalue you can find an
orthogonal set of eigenvectors spanning the eigenspace, then the representation of
the matrix for f is extremely simple. In this case, the eigenvectors form an
orthonormal basis (recall Propositions 12.11 and 12.14). The matrix for f, when
written with respect to that basis, has all its nonzero entries on the diagonal.
A =
[ λ₁   0   ···   0 ]
[  0   λ₂  ···   0 ]
[  ⋮   ⋮    ⋱    ⋮ ]
[  0   0   ···  λₙ ]
A linear map that can be written this way for some basis is called diagonalizable.

Theorem 12.22 (The Spectral Theorem). A real n × n matrix A is symmetric if and only if Rⁿ has an orthonormal basis of eigenvectors for A.

This theorem requires some nontrivial amount of work, pieces of which we have already proved in this chapter. The easy part is the reverse direction. It uses the fact that (AB)ᵀ = BᵀAᵀ, and Proposition 12.16 that for an orthonormal change of basis matrix U, U⁻¹ = Uᵀ.
Proof. There is a change of basis matrix U, whose columns are the orthonormal basis, for which A = UᵀDU, for D a diagonal matrix. A diagonal matrix is clearly symmetric, so

Aᵀ = (UᵀDU)ᵀ = UᵀDᵀ(Uᵀ)ᵀ = UᵀDU = A,

implying A is symmetric.
The strategy for the other half of the proof will be by induction on the dimension
of the vector space. That is, given the fact that every ( n − 1) × ( n − 1)
symmetric matrix has an orthonormal basis of eigenvectors, we’ll show that every n
× n symmetric matrix does as well.
Induction suggests we should find a way to “peel off” one dimension from the matrix
A in a way that’s independent of the rest of the argument. Given A, we’ll find an
eigenvector v with corresponding eigenvalue λ, normalize it, and use it as the
first vector in the basis.
A →
[ λ   0  ]
[ 0   A′ ]
In the above, the boldface 0 are to denote that zeroes take up the entire “area”
implied by the dimensions. If A is an n × n matrix, and λ is a scalar, then A′ is (
n − 1) × ( n − 1) and each boldface zero represents n − 1 zeroes in the only
allowable shape.
Intuitively, what we’re doing here is partially rewriting the basis in terms of one
known eigenvector. Indeed, we have to describe a full basis to get a block
decomposition, but as long as whatever process we use to make the basis maintains
the symmetry of A′, we win.
We know we can do this by Lemma 12.8. Normalize the eigenvector, call it v, and use
it as the first vector in a new basis of R n.
Note that only v need be an eigenvector; the other vectors in the basis are not
necessarily eigenvectors of A, but the whole basis is orthonormal.
A  --(change of basis by B)-->  BᵀAB =
[ λ   0  ]
[ 0   A′ ]
To prove the block form is as we say it is, we just need to reason about the first
column of this matrix: if you apply A to v you get λv, which includes none of the
other basis vectors. So in the new basis representation you get a column with a λ
and zeros elsewhere.
As we argued above, this block decomposition is symmetric, so the first row must
also have zeros as indicated.
Finally, we can invoke the inductive hypothesis for the matrix A′ (which is
symmetric because BT AB is) and the subspace W . I.e., A′ has an orthonormal basis
of eigenvectors, call it {u 2 , . . . , un}. Then the final basis is {v, u
2 , . . . , un}.
As you can probably tell from the book to this point, my favorite applications of
math are to computer science. Linear algebra is no different. However, it would be
intellectually dishonest to omit the influence of linear algebra in physics.
Nowhere else does the beauty and utility of eigenvalues shine so bright.
13 It is a simple exercise to show that for a fixed nonzero vector v, the set {x :
⟨x, v⟩ = 0 } is a subspace of dimension n − 1, and it’s called the orthogonal
complement of v.
Figure 12.2: A system in which five beads are equidistantly spaced on a taut
string.
The discrete analysis we’re about to do also generalizes both in dimension (waves
on a surface) and to a continuous setting (the wave equation). While we gave a
taste of what linear algebra and eigenvectors look like in infinite dimensions,
this application will hopefully motivate further study.
The Setup
Consider the system depicted in Figure 12.2 in which a string is pulled tight
through five equally spaced beads. If you pluck the string, it creates a wave that
propagates through the string from end to end.
First, we need to write down a formal mathematical model in which we can describe
the motion of a bead. We start by defining a function of time that represents an
object’s position. Ultimately, we’ll only care about the vertical motion of the
beads, but a priori we’ll need two dimensions to describe the forces involved.
Write x(t) for the position of an object at time t; its velocity is the derivative x′(t) and its acceleration is the second derivative x″(t). These should intuitively make sense when thinking of the derivative as a rate of change.
We must also describe a mathematical model for a physical force. Note that while
we’re doing everything here in two dimensions, the same principles apply to three
or more dimensions.
In the formulas below, we’re concerned with the force in a particular direction.
Indeed, given a force vector F ( t) at a specific time t, projecting F ( t) onto
the appropriate unit vector v gives the component of F in the direction of v. If we
choose the basis to align with the vertical direction, the projection is trivial:
just look at the second entry of the force vector. But in general you can use
projections to get the component of a force in any direction.
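Concretely, projecting onto a unit vector is a single inner product (toy numbers of my own):

```python
import numpy as np

force = np.array([3.0, -4.0])    # a hypothetical force vector
vertical = np.array([0.0, 1.0])  # unit vector for the vertical direction

# The component of the force in the vertical direction is <F, v>.
component = float(np.dot(force, vertical))
assert component == -4.0  # exactly the second entry of the force vector
```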
As part of the mathematical model, forces “act” on objects. By that I mean they are
applied to objects and influence their motion. If you pluck a string, it moves. The
following revolutionary observation allows us to describe exactly how forces that
act on an object influence their motion.
Model 12.26 (Newton's n-th law for some n). If F₁, …, Fₙ are all of the forces acting on an object with mass m whose position is described by x(t), then

∑_{i=1}^{n} Fᵢ = m x″(t)
In other words, the sum of the forces applied to an object determines the
acceleration of that object. More massive objects need larger forces to move them.
One Bead
Now let’s inspect our beaded string in the special case of a single bead in the
middle of a string. The bead has been plucked and released, as in Figure 12.3.
Our goal is to model the dynamics of this system as a linear system. At any given
time t, we should be able to calculate the acceleration x′′( t) of the bead as
a linear function
Figure 12.3: A simpler system that has only one bead, displaced from its
equilibrium and released.
of its current position. As we’ll see that’s enough to compute the position x( t)
at any time. When we extend the model to include all five beads, it will depend
linearly on the positions of multiple beads.
We’ll make a whole host of unrealistic assumptions to aid us. Let’s pretend the
string has no mass, the bead has no width, there is no friction or air resistance,
and let’s do away with gravity. More generously, we assume that all of these values
are “negligibly small” compared to the forces we care about. These kinds of
simplifying assumptions are the physics analogue of what mathematicians do when
they encounter a hard problem: keep stripping out the difficult parts until you can
solve it. If you simplify the problem in the right way, you’ll be analyzing just
the aspects of the problem that you really care about. After solving it, having
hopefully gained useful intuition in the process, you can replace each removed bit
and use your newfound intuition to find a solution of the harder problem. Or, if
you cannot, you can see how the simpler solution breaks with the new assumption,
and thus understand why the full problem is hard to solve. This process is by no
means as easy as it sounds, but it’s a powerful guide.
The above assumptions are minor, but there are two crucial assumptions that we have
to discuss in more detail. First, we assume the string is not stretched too far.
This allows us to use a Taylor series approximation for the sine and tangent of a
small angle. Second, assume the string is already stretched tightly when the beads
are plucked. This is what allows us to ignore the horizontal motion of the bead.
We’ll discuss these in more detail when we employ them.
Once we’ve eliminated gravity and its cohort, there are only two forces acting on
the bead: the force of tension in the string on the left and right sides of the
bead. When the bead is pulled downward, the string is stretched, and the bonds
between the string’s atoms create a force that “pulls” the string back to its
normal length. Luckily, tension is well understood. The standard model is Hooke’s
law.
Model 12.27 (Hooke's law). The force of tension in an elastic string that has been stretched from its resting length by a distance d ≥ 0 is −T d, where T ≥ 0 is a constant depending on the material of the string. This model only applies for a sufficiently small d that does not exceed a limit (which again depends on the material in the string).

Figure 12.4: The forces pull in opposite directions toward the wall, and together sum to a vertical force.
If the string is tied to a surface and you pull away from the surface, even at an
angle, the force is directed back along the string toward the surface. This gives
our bead two forces as in Figure 12.4.
Since we assumed the bead has no width (or, if you will, the forces act on the
center of mass of the bead), the tails of these vectors are the same point, and
when we sum them we get the net force pulling the bead upward.
In our system the string is taut, and we’ll suppose it’s stretched to begin with.
Call 2 l the natural length of the string (so that l is the length of one of the
two halves), T the tension constant, and 2 l init the length the string is
initially pulled to when the system is at rest. In that case, the two forces on the
bead have magnitude T ( l init − l) and face in opposite directions. The bead does
not move.
Let’s focus on the right hand side of the bead (the left side is symmetric) in
Figure 12.6.
Choose the resting point of the bead, when the string is completely straight, to be
(0 , 0).
Now we compute. Our choice of basis and the Pythagorean theorem give d(t) = √(l_init² + x₂(t)²).

Figure 12.6: The force pulling the bead rightward when the bead is displaced.

The unit vector pointing from the bead toward the right wall is (l_init, −x₂(t))/d(t), so the force is

F₁(t) = T(d(t) − l) · (l_init, −x₂(t))/d(t)
The magnitude of the vector has a nonlinear part d( t) −l involving d( t), so let’s
simplify that first. Since the string was initially stretched to length l_init, we have d(t) − l = (d(t) − l_init) + (l_init − l), and so

∥F₁(t)∥ = T(d(t) − l_init) + T(l_init − l).
The right hand term is the (constant) magnitude of tension when the system is at
rest.
For the left hand term, we can use a Taylor series approximation. First we do some
simplification.
d(t) = √(l_init² + x₂(t)²) = l_init √(1 + x₂(t)²/l_init²)

Now apply the Taylor series

√(1 + z²) = 1 + z²/2 − z⁴/8 + z⁶/16 − ···

with z = x₂(t)/l_init, keeping only the first two terms; to be more rigorous, we could hide the lower order terms in a big-O notation, but we'll save that for Chapter 15. In other words, the magnitude of the force of tension in the string is the initial tension, plus a small factor proportional to the square of the deviation:

∥F₁(t)∥ ≈ T x₂(t)²/(2 l_init) + T(l_init − l)
The formula above is why we can assume, as most physics texts do without nearly
as much fuss as we have displayed here, that the magnitude of tension in the string
is constant. This Taylor series approximation is the first assumption showing up in
the math: if the initial deviation x 2( t) is small, say much less than 1 unit of
measurement, then x 2( t)2 is even smaller and can be ignored, as can all higher
powers of x 2( t). Our computation shows that the first power x 2( t) does not show
up anywhere in the Taylor series, so if we’re committed to simplifying everything
to be linear, the Taylor series assures us we’re not accidentally ignoring terms we
want to preserve.
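The quality of the approximation d(t) ≈ l_init + x₂(t)²/(2 l_init) is easy to check numerically (sample values of my own):

```python
import math

l_init = 1.0  # a sample half-length; any positive value works
for x2 in [0.1, 0.01, 0.001]:
    exact = math.sqrt(l_init**2 + x2**2)
    approx = l_init + x2**2 / (2 * l_init)
    # The first neglected Taylor term is of order x2^4, so the error
    # shrinks by roughly 10^4 each time x2 shrinks by a factor of 10.
    assert abs(exact - approx) < x2**4
```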
I personally feel it’s important to see how the math justifies the assumptions
rather than relying entirely on “physical intuition.” Once you state which forces
you want to
consider—and once you’ve formalized the mathematical rules governing those forces—
the mathematics should stand on its own. In particular, many physics books say that
the constant tension assumption rests on the fact that the bead is not displaced
very far from rest. Strictly speaking, this is not enough information. What also
matters is the relationship between the displacement of the bead and the initial
stretch that holds the string taut at rest. The former must contribute an order of
magnitude smaller force than the latter to be negligible. The Taylor series
revealed this nuance, and further allows us to measure how big a displacement is
too big to ignore. 16
We continue with the assumption, then, that the magnitude of the force of tension
in the string is constant over the entire evolution of the system. From this point
on we’ll use T in place of T ( l init − l) to simplify the formulas (it’s all just
a constant anyway).
Recalling that we formed the unit vector by scaling by d( t), the force on the
right string is the vector
F₁(t) = T · (l_init, −x₂(t))/d(t)
Note that while we ignored the x 2( t)2 factor in the magnitude, we haven’t yet
ignored its contribution to the scaling of the unit vector. That begins now. Since
the two forces F 1( t) and F 2( t) are symmetric, we only need the component of F
1( t) in the vertical direction. We project F 1( t) onto the vector (0 , 1), i.e.,
isolate the second entry of the vector.
And if we expand d(t) = √(l_init² + x₂(t)²) and use the same Taylor series argument to justify setting x₂(t)² to zero, we get F_vert(t) = (0, −T x₂(t)/l_init).17
Now that all our forces are vertical, we can just work with the 1-dimensional
picture and see that the sum of the forces on the bead in the vertical direction is
F(t) = −2T x₂(t)/l_init. By Newton's law, this dictates the acceleration of the bead, giving

m x₂″(t) = −2T x₂(t)/l_init.
17 In physics texts you often see the author instead use the cosine formula Theorem
10.20, and the Taylor approximations for sin θ and tan θ. The way we laid it out
makes that unnecessary, but we will use those approximations when we generalize to
multiple beads.
1 , f ′(0) = 0. The restrictions on f (0) and f ′(0) are called initial conditions,
and as they change the solution changes. In the case of Theorem 12.28 the solution
only changes by constants. In fact, the way these values vary hints at two
independent dimensions which provide solutions to f′′ = −f.
The map d : U → U sending f ↦ f″ is a linear map on the space U of infinitely differentiable functions, and the sine and cosine functions are eigenvectors with eigenvalue −1. This hints at the deep truth that
sine and cosine are special, in part explaining why we expect Theorem 12.28 to be
true.
Despite how the initial conditions may vary, the solution is a linear combination c₁ sin(x) + c₂ cos(x). With a bit of algebra, you can solve for those coefficients based on the initial conditions. We will do this below.
First, we have to wrangle the extra coefficient of 2T. We can modify the theorem slightly. Note that for a scalar a, the derivative of sin(ax) is a cos(ax) (the chain rule, Theorem 8.10), but since we're differentiating twice we have a square in the second derivative −a² sin(ax). I.e., the solution to x₂″(t) = −(2T/(m l_init)) x₂(t) is a linear combination c₁ sin(ωt) + c₂ cos(ωt), where ω = √(2T/(m l_init)).
Combining this with the assumption that at time t = 0 the bead is displaced by some fixed amount and let go (has zero initial velocity), we get

x₂(0) = c₁ sin(ω · 0) + c₂ cos(ω · 0) = c₁ · 0 + c₂ · 1
x₂′(0) = c₁ω cos(ω · 0) − c₂ω sin(ω · 0) = c₁ω = 0

We can read off the solution as c₁ = 0, c₂ = x₂(0). This means that our lonely bead, plucked and left to wait all this time to learn its destiny, finally has an equation for its motion: x₂(t) = x₂(0) cos(ωt).
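The solution is easy to spot-check against the differential equation m x₂″(t) = −2T x₂(t)/l_init using a finite-difference derivative (the constants here are my own sample choices):

```python
import math

m, T, l_init = 1.0, 1.0, 1.0
omega = math.sqrt(2 * T / (m * l_init))
x0 = 0.3  # a hypothetical initial displacement: plucked and released

def x2(t):
    # proposed solution: x2(t) = x2(0) * cos(omega * t)
    return x0 * math.cos(omega * t)

# Estimate x2''(t) by central differences and compare to -2*T*x2(t)/l_init.
h = 1e-4
for t in [0.0, 0.5, 1.3]:
    second_deriv = (x2(t + h) - 2 * x2(t) + x2(t - h)) / h**2
    assert abs(m * second_deriv + 2 * T * x2(t) / l_init) < 1e-5
```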
Multiple Beads
Horizontal forces are a new concern. We want to retain our assumption of constant
tension in the string. But because the angles are different on different sides of a
bead, the fraction of that constant tension pulling the bead left and right can be
different, resulting in horizontal motion. We know that the tension in the string
will eventually pull the bead back to the center, but we want to feel secure that
these violations of our assumptions are minor enough that we can justify ignoring
them. We leave it as an exercise to the reader to adapt the setup for a single bead
to this scenario, and to use Taylor series approximations to find the conditions
under which horizontal motion can be ignored.
Figure 12.8: The beads b₁, …, b₅ on the string, with vertical displacements y₁, y₂, y₃, … and angles ϑ₁, ϑ₂, … between the string and the horizontal.
Since we are ignoring horizontal motion, we’ll simplify the notation so that the
forces, displacements, velocities, and accelerations are 1-dimensional vectors,
i.e., scalars representing vectors pointing in the vertical direction. Let b₁, …, b₅ be the beads of mass mᵢ, and let yᵢ be the displacement of bᵢ, with yᵢ′ and yᵢ″ the velocity and acceleration, as before. The natural resting point of the beads is zero. If we just think about
position—and as we saw this completely determines the forces and the acceleration—
then the state of this system is a vector y = ( y 1 , y 2 , y 3 , y 4 , y 5) ∈ R5.
The forces we’re about to compute will form a linear map A mapping y 7→ y′′.
Let’s now focus on bead b 2 as a generic example, shown in Figure 12.8. In the
figure, the vertical gap between b 1 and b 2 is y 2 − y 1, and the angle θ 1 is the
angle between the string and the horizontal. Likewise for the corresponding data on the right hand side of the bead. The tension is a constant T. The projected tension in
the vertical direction is
−T sin( θ 1) + T sin( θ 2), with the sign flip because the left side pulls the bead
down.18
The relevant Taylor series are

sin(θ) = θ − θ³/3! + θ⁵/5! − ···
tan(θ) = θ + θ³/3 + 2θ⁵/15 + ···
Because the leading terms are equal, for θ small enough to ignore θ³ and higher we can replace sin(θ) with tan(θ) wherever it occurs. This is the same reasoning as before, because we want to extract the linear aspects of the model.
The force on bead b₂ is

m₂ y₂″ = F₂(t)
       = −T sin(θ₁) + T sin(θ₂)
       = −T tan(θ₁) + T tan(θ₂)
       = −T (y₂ − y₁)/l_init + T (y₃ − y₂)/l_init

Multiplying both sides by l_init and collecting terms gives

m₂ l_init y₂″ = T (y₁ − 2y₂ + y₃)
Simplify the equation by setting m 2 = l init = T = 1. The forces for the other
beads are analogous, with the beads on the end having slightly different formulas
as they’re attached to the wall on one side. As a whole, the equations are
y₁″ = −2y₁ + y₂
y₂″ = y₁ − 2y₂ + y₃
y₃″ = y₂ − 2y₃ + y₄
y₄″ = y₃ − 2y₄ + y₅
y₅″ = y₄ − 2y₅
A =
[ −2   1   0   0   0 ]
[  1  −2   1   0   0 ]
[  0   1  −2   1   0 ]
[  0   0   1  −2   1 ]
[  0   0   0   1  −2 ]
At last, we turn to eigenvalues. This matrix is symmetric and real valued, and so by Theorem 12.22 it has an orthonormal basis of eigenvectors with respect to which A is diagonal.

18 When b₁ is above b₂, the angle is negative and that reverses the sign: sin(−θ) = −sin(θ). So the orientations work out nicely.

Let's compute them for this matrix using the Python scientific computing library numpy, which wraps fast Fortran implementations of vector operations and eigenvector computations for Python.
After defining a helper function that shifts a list to the right or left (omitted
for brevity), we define a function that constructs the bead matrix, foreseeing our
eventual desire to increase the number of beads.
def bead_matrix(dimension=5):
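A direct construction of this matrix (a sketch of my own that bypasses the omitted shift helper and fills in the tridiagonal entries explicitly):

```python
import numpy as np

def bead_matrix(dimension=5):
    # Row i encodes y_i'' = y_{i-1} - 2*y_i + y_{i+1}; the walls pin the
    # string at zero on both ends, so the first and last rows drop a term.
    A = np.zeros((dimension, dimension))
    for i in range(dimension):
        A[i, i] = -2.0
        if i > 0:
            A[i, i - 1] = 1.0
        if i < dimension - 1:
            A[i, i + 1] = 1.0
    return A
```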
Next we invoke the numpy routine to compute eigenvalues and eigenvectors, and sort
the eigenvectors in order of decreasing eigenvalues. For those unfamiliar with
numpy, the library uses an internal representation of a matrix with an overloaded
index/slicing operator [ ] that accepts tuples as input to select rows, columns,
and index subsets in tricky ways.
eigenvalues, eigenvectors = np.linalg.eigh(A)
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
And, finally, a simple use of the matplotlib library for plotting the eigenvectors.
Here our x-axis is the index of the eigenvector being plotted, and the y-axis is
the entry at that index. Plotting with five beads gives the plot in Figure 12.9.
In case it's hard to see (there will be a clearer, more obvious diagram at the end of the section), let's inspect it in detail. The top eigenvalue, λ = −0.267…, corresponds to the eigenvector in the chart above with circular markers. The eigenvector entry starts at 0.29, increases gradually to 0.58, and then back down to 0.29, a sort of quarter-period of a full sine curve. The second largest eigenvalue, λ = −1 with triangular markers, has an eigenvector starting at −0.5 and increasing up to 0.5, performing a half-period of sorts.
The next eigenvector for λ = − 2 performs a single full period, and so on.
Now this is something to behold! The eigenvectors have a structure that mirrors the
waves in the vibrating string, and as the corresponding eigenvalue decreases, the
“frequency” of the wave plotted by the eigenvector increases. That is, the wave
exhibits faster oscillations.
This wave is not a metaphor. If you simulate the beaded string with initial
position set to one of these eigenvectors, you’d see a standing wave whose shape is
exactly the plot of that eigenvector. In fact, I implemented a demo of this in
Javascript, which you can
Eigenvalue |   y₁     y₂     y₃     y₄     y₅
−0.27      |  0.29   0.50   0.58   0.50   0.29
−1.00      | −0.50  −0.50   0.00   0.50   0.50
−2.00      |  0.58   0.00  −0.58   0.00   0.58
−3.00      | −0.50   0.50   0.00  −0.50   0.50
−3.73      | −0.29   0.50  −0.58   0.50  −0.29
Figure 12.9: The rounded entries of the eigenvectors of the 5-bead system (top) and
their plots (bottom).
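In fact these eigenvalues have a known closed form: for the n-bead matrix (−2 on the diagonal, 1 on the off-diagonals) they are λₖ = −2 + 2cos(kπ/(n + 1)), a standard fact about tridiagonal Toeplitz matrices that the chapter doesn't prove. A numerical check for n = 5:

```python
import math
import numpy as np

# The 5-bead matrix: -2 on the diagonal, 1 on the off-diagonals.
A = np.diag([-2.0] * 5) + np.diag([1.0] * 4, 1) + np.diag([1.0] * 4, -1)

computed = sorted(np.linalg.eigvalsh(A), reverse=True)
expected = [-2 + 2 * math.cos(k * math.pi / 6) for k in range(1, 6)]
for c, e in zip(computed, expected):
    assert abs(c - e) < 1e-9
```

The largest, −2 + 2cos(π/6) ≈ −0.268, matches the −0.27 in the table.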
Because of this, if you set the initial positions of the beads to be quite large,
you’ll see irregularities caused by horizontal motion. These are highlighted by how
the demo draws the force vector acting on each bead at every instant. It’s fun to
watch, and it provides a hint as to what assumption allows one to ignore horizontal
motion. Indeed, if you set the position to the top eigenvector 100 v 1 (scaled to
account for the units being pixels), you can see the same shape as v 1 in the plot
above. If you scale it even larger, you can see the horizontal forces come into
play. For example, try setting the initial positions to 300 v 1 = (87 , 150 , 174 ,
150 , 87).
Let's witness how the formulas work out for the first eigenvector v₁, when the positions start as that eigenvector y = v₁ ≈ (0.29, 0.5, 0.58, 0.5, 0.29). In that case each bead oscillates as a pure cosine, y(t) = cos(√(−λ₁) t) · v₁.

19 Note the demo is written in ES6 using d3.js, and the implementation is available in the Github repository linked at pimbook.org.
We have the tools to understand this eigenvector phenomenon beyond concrete computation. The basis vectors are the independent components of the joint forces acting on all
the beads. What’s more, the proof of the Spectral Theorem explains why the
eigenvectors have a natural ordering. The way we choose an eigenvector at each step
is, according to Lemma 12.8, by maximizing ∥Av∥ over unit vectors v. In the proof
of the Spectral Theorem we then removed that vector, and its span, from
consideration for the next vector.20
So the largest magnitude eigenvalue (in this case the most negative one) is the
first one extracted, and that corresponds to the highest frequency. The next
eigenvector chosen corresponds to the second largest magnitude eigenvalue, and so
on, each having a smaller frequency than the last.
But wait, there’s more! Because it’s an orthonormal basis of eigenvectors, we can
express any evolution of this system in terms of the eigenvectors, and do it as
simply as taking inner products.
Take, for example, the complex evolution that occurs when you pluck the second
bead.
def eigenbasis_coefficients(vector, eigenvectors):
    coefficients = {}
    for i in range(len(vector)):
        # the coefficient is the inner product with the i-th orthonormal eigenvector
        coefficients[i] = np.dot(eigenvectors[:, i], vector)
    return coefficients
With results printed below rounded for legibility, the coefficients for our chosen
y can be computed and used to reconstruct the original vector.
>>> A = bead_matrix(5)
>>> print(coeffs)
[0, 5.0e-01, 0, 0, 0]
If z denotes the coordinates of y in the orthonormal eigenvector basis, and D the diagonal matrix of eigenvalues, then the system decouples:

y″ = Ay  ⟹  z″ = Dz
We can solve each of these differential equations separately, just as we solved the
single-bead equation, and then combine them by converting back to the standard
basis of bead positions. The result will give us the trajectory of each bead
expressed as a sum of simple cosine waves.
The equations, with initial conditions placed adjacent, are (with some rounding to simplify):

z₁″ = −0.27 z₁;   z₁(0) = 0.25,   z₁′(0) = 0
z₂″ = −z₂;        z₂(0) = −0.25,  z₂′(0) = 0
z₃″ = −2 z₃;      z₃(0) = 0,      z₃′(0) = 0
z₄″ = −3 z₄;      z₄(0) = 0.25,   z₄′(0) = 0
z₅″ = −3.73 z₅;   z₅(0) = 0.25,   z₅′(0) = 0
The solutions, by the single-bead analysis, are

z₁(t) = 0.25 cos(0.52 t)
z₂(t) = −0.25 cos(t)
z₃(t) = 0
z₄(t) = 0.25 cos(1.73 t)
z₅(t) = 0.25 cos(1.93 t)
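As a quick check that these frequencies are right: each should be √(−λ) for its eigenvalue λ, as in the single-bead analysis.

```python
import math

# (eigenvalue, rounded frequency) pairs from the solutions above
pairs = [(-0.27, 0.52), (-1.0, 1.0), (-3.0, 1.73), (-3.73, 1.93)]
for eigenvalue, omega in pairs:
    # z(t) = c*cos(omega*t) solves z'' = eigenvalue*z exactly when
    # omega**2 == -eigenvalue
    assert abs(math.sqrt(-eigenvalue) - omega) < 0.01
```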
Finally, as you may have guessed from the arbitrary choice of five beads, we can
generalize this system to any number of beads. If we take even just a hundred
beads, and plot the eigenvectors for the top few eigenvalues as we did above, we
see smoother, more obvious waves. Figure 12.10 shows this. With such natural shapes
of increasing complexity, it makes sense to give a name to these eigenvectors. They're called the fundamental modes of the system, and the frequencies of the "sinusoidal curve" of each eigenvector21 are the natural frequencies at which the string vibrates.
If one decreases the distance between beads and increases the number of beads in
the limit, the result is the wave equation. This is a differential equation (in
both time and position along the string) that one can use to track the motion of a
traveling wave through a string. See the exercises for more on that. But more
importantly for us, the vector space for that continuous model has infinite
dimension, it still has a basis of eigenvectors, and they correspond to proper sine
curves instead of discrete approximations. In this case, since the “zero-width”
beads are now at every position of the string, you can think of them as cross
sections of molecules that make up the string itself, with atomic forces playing

21 Or rather, the curves implied to underlie these discrete points.
Figure 12.10: The plot of the top five eigenvectors for a hundred-bead system, with eigenvalues λ ≈ −0.00097, −0.0039, −0.0087, −0.0155, −0.0241.
the role of Hooke’s law. These eigenvectors then describe the intrinsic properties
of the string itself.
So there you have it. Eigenvectors have revealed the secrets of waves on a string.
1. Eigenvalues and eigenvectors often provide the best perspective (basis) with which to study a linear map.
2. Orthonormal bases make inner product computations and basis decompositions easy.
3. A powerful modeling technique is to strip a hard problem down to a simpler one. One can then solve that simplified problem and gain insight. Then gradually add complexity back to the problem and, using the new insights, attempt to solve the harder problem.
12.9 Exercises
12.2. Let V be an n-dimensional inner product space, whose norm ∥x∥² = ⟨x, x⟩ is given by the inner product. Prove the following: with d(x, y) = ∥x − y∥, the triangle inequality d(x, y) + d(y, z) ≥ d(x, z) holds for all x, y, z ∈ V.
12.3. Prove that a linear map f : R n → R n preserves the standard inner product—
i.e. ⟨x, y⟩ = ⟨f( x) , f( y) ⟩ for all x, y—if and only if its matrix
representation A has orthonormal columns with respect to the standard basis. Hint:
use the fact that ⟨x, y⟩ = xᵀy.
12.4. Let A be a square matrix with an inverse. Using only the fact that (BC)ᵀ = CᵀBᵀ for all matrices B, C, prove that (Aᵀ)⁻¹ = (A⁻¹)ᵀ.
12.5. Prove the following basic facts about eigenvalues, eigenvectors, and inner
products.
1. Fix a vector y and let fy( x) = ⟨x, y⟩. Prove that if x is restricted to be a
unit vector, then fy( x) is maximized when x = y/ ∥y∥.
2. Let V, W be two n-dimensional inner product spaces with inner products ⟨−, −⟩V
3. Fix the inner product space R n with the standard inner product. Let A : R n → R
n be a change of basis matrix. Find an example of A for which ⟨x, y⟩ ≠ ⟨Ax, Ay⟩.
In other words, an arbitrary change of basis does not preserve the formula for the
standard inner product. As we saw in the chapter, only an orthonormal change
of basis does this. Determine a formula (that depends on the data of A), that shows
how to convert inner product calculations in one basis to inner product
calculations in another.
12.6. Look up a proof of Theorem 12.28, on the uniqueness of the sine function,
that uses Taylor series. The analytical tool required to understand the standard
proof is the concept of absolute convergence. The central difficulty is that if
you’re defining a function by an infinite series, you have to make sure that series
converges with the properties needed to make it a valid Taylor series. Repeat the
proof for sin( ax).
12.7. In Definition 12.3 we defined the adjacency matrix A( G) of a graph G = ( V,
E).
12.9. Implement the algorithm presented in the chapter to generate a random graph
on n vertices with edge probability 1/2, and a planted clique of size k. For the
rest of this
exercise fix k = ⌈√(n log n)⌉. Determine the average degree of a vertex that is in
the plant, and the average degree of a vertex that is not in the plant, and use
that to determine a rule for deciding if a vertex is in the clique. Implement this
rule for finding planted cliques of
size at least
12.10. As in the previous problem, implement the algorithm in this chapter for
finding
we mean that p( f) is the zero map. Look up a proof that λ is a root of p if and
only if λ
is an eigenvalue of f.
12.12. We proved that symmetric matrices have a full set of eigenvectors and eigenvalues. Now consider the companion matrix of a polynomial p(x) = a₀ + a₁x + ··· + a_{n−1}x^{n−1} + xⁿ, whose eigenvalues are the roots of p:

p =
[  0     1     0   ···    0      ]
[  0     0     1   ···    0      ]
[  ⋮     ⋮           ⋱    ⋮      ]
[  0     0     0   ···    1      ]
[ −a₀   −a₁   −a₂  ···  −a_{n−1} ]

Notice that this matrix is not symmetric. Because the roots of a polynomial might be
complex numbers, this implies the eigenvalues of a matrix (when viewed as a linear
map on a vector space of complex numbers) might also be complex. Walk away from
this
exercise with a new appreciation for the convenience of symmetric matrices, and the
inherent difficulty of writing a generic eigenvalue solver.
12.13. Implement the Gram-Schmidt algorithm using the following method for finding
vectors not in the span of a partial basis: choose a vector with random entries
between zero and one, repeating until you find one that works. How often does it
happen that you have to repeat? Can you give an explanation for this?
12.14. Look up the derivation of the wave equation from Hooke’s law for a beaded
string (or equivalently, beads on springs) as the distance between adjacent beads
tends to zero.
12.15. Look up a proof that the singular values of a non-square real matrix A are
the square roots of the eigenvalues of the matrix AT A. Use this to understand why
we computed AT A in the SVD algorithm from Chapter 10.
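A numerical companion to Exercise 12.15 (a check, not a proof), using a randomly generated matrix as an illustrative stand-in:

```python
import numpy as np

# For a random non-square real A, the singular values of A should match
# the square roots of the eigenvalues of A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

singular_values = np.linalg.svd(A, compute_uv=False)     # descending order
eigenvalues = np.linalg.eigvalsh(A.T @ A)                # ascending order
sqrt_eigs = np.sqrt(np.maximum(eigenvalues, 0.0))[::-1]  # reorder to match
```

The two arrays agree to floating-point precision.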
12.16. Generate a “random” symmetric 2000 × 2000 matrix via the following scheme:
pick a distribution (say, normal with a given mean and variance), and let the i, j
entry with i ≥ j be an independent draw from this distribution. Let the remaining i
< j entries be the symmetric mirror. Compute the eigenvalues of this matrix (which
are all real) and plot them in a histogram. What does the result look like? How
does this shape depend on the parameters of the distribution? On the choice of
distribution?
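A sketch of the matrix-generation scheme from Exercise 12.16. The demo uses a smaller matrix than the exercise's 2000 × 2000 so it runs quickly; the plotting step is left as a comment since it needs matplotlib:

```python
import numpy as np

def random_symmetric_eigenvalues(n, mean=0.0, std=1.0, seed=0):
    """Draw the i >= j entries i.i.d. from a normal distribution, mirror
    them to the i < j entries, and return the (necessarily real)
    eigenvalues of the symmetric result."""
    rng = np.random.default_rng(seed)
    lower = np.tril(rng.normal(mean, std, size=(n, n)))  # the i >= j entries
    M = lower + np.tril(lower, -1).T                     # mirror the strict lower part
    return np.linalg.eigvalsh(M)

eigs = random_symmetric_eigenvalues(500)  # the exercise asks for n = 2000
# To see the histogram (requires matplotlib):
#   import matplotlib.pyplot as plt
#   plt.hist(eigs, bins=60); plt.show()
```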
12.18. Using Taylor series, find appropriate conditions under which horizontal
motion in the 5-bead system can be ignored.
12.21. PageRank is a ranking algorithm that was a major factor in the Google search
engine’s domination of the early internet search market. The algorithm involves
setting up a linear system based on links between webpages, and computing the
eigenvector for the largest eigenvalue. Find an exposition of this algorithm and
implement it in code.
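A sketch of the power-iteration approach to Exercise 12.21. The damping factor, iteration count, and dangling-page handling below are standard conventions, not specifications from the text:

```python
import numpy as np

def pagerank(links, damping=0.85, iterations=100):
    """Approximate the PageRank vector of a tiny web. `links` maps each
    page index to the list of pages it links to; power iteration
    approximates the dominant eigenvector of the damped link matrix."""
    n = len(links)
    M = np.zeros((n, n))
    for page, outlinks in links.items():
        if outlinks:
            for target in outlinks:
                M[target, page] = 1.0 / len(outlinks)
        else:
            M[:, page] = 1.0 / n  # a dangling page links uniformly to all pages
    G = damping * M + (1 - damping) / n  # every column still sums to 1
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        rank = G @ rank
    return rank

# a tiny 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
ranks = pagerank({0: [1], 1: [2], 2: [0, 1]})
```

Page 1, which receives links from both other pages, ends up ranked above page 0.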
is injective, and
dimension.
This construction works without need for an inner product, but if you have an inner product, you get an obvious way to take a general basis {v_1, . . . , v_n} of V to a dual basis of V^* by mapping v to the function x ↦ ⟨v, x⟩. If the {v_i} were an orthonormal basis, this would be the same "coordinate picking" function as we did for the standard basis, due to Proposition 12.15.
Moreover, every linear functional on R^n can be expressed as the inner product with a single vector (not necessarily a basis vector). Expressed in terms of matrices, the linear functional can be written as a (1 × n)-matrix—since it is a linear map from an n-dimensional vector space to a 1-dimensional space. Say we call it f_v(x) = ⟨v, x⟩. If you start from the perspective that all vectors are columns, then the matrix representation of f_v is v^T, and the "matrix multiplication" v^T x is a scalar (and also another way to write the inner product, as we saw in this chapter).
Now we finally get to the transpose, which just extends this linear functional picture to a finite number of independent functionals, the outputs of which are grouped together in a vector. Let f : V → W be a linear map with matrix representation A, an (m × n)-matrix for n-dimensional V and m-dimensional W. Define the transpose of f (sometimes called the adjoint) as the linear map f^T : W^* → V^* which takes as input (a linear functional!) g : W → R and produces as output the composition g ∘ f : V → R.
Since W^* and W are isomorphic, and V^* and V are isomorphic, you may wonder if you can apply this to realize the dual f^T as a map W → V as well. Indeed you can, and it can even be defined without referring to dual vector spaces at all. Let V, W be inner product spaces and f : V → W a linear map. Define the transpose f^T : W → V as the unique linear map satisfying ⟨f(x), y⟩ = ⟨x, f^T(y)⟩ for every x ∈ V and y ∈ W. In matrix terms, if A is the matrix of f, this is the familiar identity ⟨Ax, y⟩ = x^T A^T y.

Note that these two definitions of the transpose can only be said to be the same in the case that the vector space has scalars in R. If you allow for complex number scalars, things get a bit trickier.
From this definition, we can see that the algebraic multiplicities of λ = 1 are different for A and B in Section 12.5. Taking successive powers of B − I_3 gives first (0, 1, 0) and then (0, 0, 1) in the kernels, while the algebraic multiplicity for A is just 1.

These are square sub-matrices with λ on the diagonal and 1's on the adjacent diagonal. For example, the 3 × 3 block for λ is

$$J_{\lambda, 3} = \begin{pmatrix} \lambda & 1 & 0 \\ 0 & \lambda & 1 \\ 0 & 0 & \lambda \end{pmatrix}$$
The Jordan canonical form theorem states that for any linear map V → V (with
complex scalars) there is a basis for V , for which the matrix of that linear map
consists entirely of Jordan blocks along the diagonal. There may be more than one
Jordan block for a given eigenvalue, but the size and number of blocks are
determined by the algebraic and geometric multiplicities of that eigenvalue,
respectively.
All of this is to note two things: it's possible to compute all of the eigenvalues and eigenvectors for a linear map, and these, along with some auxiliary data (some of which I've left out from this text), do in fact give a complete characterization of the map. However, it's a more nuanced characterization, and one whose benefits are not as easily displayed as when you have an orthonormal basis of eigenvectors. The Jordan canonical form is an important theorem that has generalizations and adaptations in other fields of mathematics.
Finally, as a quick aside, the set of all eigenvalues together with their geometric
multiplicities is called the spectrum of a linear map.
Definition 12.30. Let f : V → V be a linear map between vector spaces. Define the
spectrum of f as the set
Spec(f) = {(λ, dim ker(f − λI)) : f(v) = λv for some nonzero v ∈ V}.
It is interesting to note that most scientific uses of the word “spectrum” refer to
this mathematical idea, for example the spectrum of wavelengths of light or the
spectrum of an atom.
Chapter 13
Mathematics as we practice it is much more formally complete and precise than other sciences, but it is much less formally complete and precise for its content than computer programs. The difference has to do not just with the amount of effort: the kind of effort is qualitatively different. In large computer programs, a tremendous proportion of effort must be spent on myriad compatibility issues: making sure that all definitions are consistent, developing good data structures that have useful but not cumbersome generality, deciding on the right generality for functions, etc. The proportion of energy spent on the working part of a large program, as distinguished from the bookkeeping part, is surprisingly small. Because of compatibility issues that almost inevitably escalate out of hand because the right definitions change as generality and functionality are added, computer programs usually need to be rewritten frequently, often from scratch.

—William Thurston
Programmers who brave mathematical topics often come away wondering why mathematics isn't more like programming. We've discussed some of the issues surrounding this question already in this book, like why mathematicians tend to use brief variable names, and how conventions will differ from source to source. Beneath these relatively superficial concerns is a question about rigor.
Thurston’s observations above were as true in the mid 90’s as they are over twenty
years later. Software is far more rigorous than mathematics, and most of the work
in software is about interface and data compatibility—“bookkeeping,” as Thurston
calls it.
This is the kind of work required by the rigor of software. You need to care
whether your strings are in ASCII or Unicode, that data is sanitized, that
dependent systems are synchronized, because ignoring this will make everything fall
apart.
I once took a course on compiler design. The lectures were taught in the
architecture building on campus. One day, the architecture students were having a
project fair in the building, marveling over their structures and designs. In a
lightly mocking tone, my compilers professor observed that software architecture
was much more impressive than building architecture. Their buildings wouldn’t fall
over if they forgot a few nails or slightly changed the materials. But a few misplaced characters in software have caused destruction, financial disaster, and death.
My professor had a point. Regular mayhem is caused by software security lapses,
with root causes often related to improper string validation or bad uses of memory
copying.
A single improperly set permission bit can cause troves of private data to become
public. Financial insecurity is almost synonymous with digital currencies, one
particularly relevant example being the 2016 hack of the “Decentralized Autonomous
Organization,”
a sort of hedge fund governed by an Ethereum contract that contained a bug allowing
a hacker to withdraw the equivalent of 50 million USD before it was mitigated. The
root cause was a bug in the contract allowing an infinite recursion. Multiple
(unmanned) space probes, costing hundreds of millions of dollars each, have been
destroyed shortly after launch due to coding errors. The Ariane 5 crashed in 1996
because of a bug with integer overflow. The Mariner 1 was destroyed in 1962 because of a missing hyphen. Finally, in 1991, a bug in the Patriot missile defense system resulted in
the death of 28 soldiers at a military base in Saudi Arabia. The bug was an
inaccurate calculation of wall-clock time due to a poor choice of rounding. I have
little doubt there will be additional deaths¹ caused by lapses and insecurities in
self-driving car software, in addition to the damage already caused by accidents
(many of which went unreported, according to some 2018 journalism).
These sorts of bugs cause internal debacles at every company with alarming
regularity.
One consequence is a general feeling among many engineers that “all software is
shit.”
More optimistically, the best engineers work very hard to design interfaces and
abstractions that, to the best of software’s ability, prevent mistakes. Those who
design aircraft control systems do this quite well. Once you’ve made enough
mistakes of your own, you learn a certain humility. No matter how smart,
even the best engineers get tired, grumpy, overworked, or forgetful—each of which
is liable to make them forget a hyphen.
In the subfield of computer science dealing with distributed systems, these issues
are exacerbated by the extreme difficulty of even telling whether a system
satisfies the guarantees you need it to. A titan of this area is mathematician
turned computer scientist Leslie Lamport. Through his work, Lamport essentially
defined distributed computing as a field of study. Many of the concepts you have
heard of in this area—synchronized clocks, Paxos consensus, mutexes—were invented
by Lamport.
Lamport has no particular love of mathematical discourse. In his 1994 essay, “How
to Write a Proof,” he admits, “Mathematical notation has improved over the past few
centuries,” but goes on to claim that the style of mathematical proof employed by
most of mathematics (including in this book)—mixing prose and formulas in a web of
propositions, lemmas, and theorems—is wholly inadequate.
Much of Lamport’s seminal work in the last few decades grew out of his frustration
with errors in distributed systems papers. As he attests, some researcher would
propose (say) a consensus algorithm. It might seem correct at first glance, but
inevitably it would contain mistakes—if not be wrong outright. Lamport concludes
that guarantees about the behavior of distributed systems are particularly hard to establish with the rigor that is needed.

¹ I personally attribute the 2018 death of Elaine Herzberg more to engineers intentionally disabling safety features and cutting personnel costs than to software bugs.
Lamport writes,
These proofs are seldom deep, but usually have considerable detail. Structured
proofs provided a way of coping with this detail. The style was first applied to
proofs of ordinary theorems in a paper I wrote with Martín Abadi. He had already
written conventional proofs—proofs that were good enough to convince us and,
presumably, the referees.
Rewriting the proofs in a structured style, we discovered that almost every one had
serious mistakes, though the theorems were correct. Any hope that incorrect proofs
might not lead to incorrect theorems was destroyed in our next collaboration. Time
and again, we would make a conjecture and write a proof sketch on the blackboard—a
sketch that could easily have been turned into a convincing conventional proof—only
to discover, by trying to write a structured proof, that the conjecture was false.
Since then, I have never believed a result without a careful, structured proof. My
skepticism has helped avoid numerous errors.
This is coming from a Turing Award winner, a man considered a luminary of computer
science. Even the smartest theorem provers among us make ample mistakes.
Consequently, Lamport designed a proof assistant called TLA+, which he has used to
check the correctness of various claims about distributed systems.² TLA+ is
supposed to prevent you from shooting your own mathematical foot. TLA+ falls in
step with a body of work related to automated proof systems. Some systems you may
have heard of include Coq and Isabelle. Some of these systems claim the ability to
prove your theorems for you, but I’ll instead focus just on the correctness
checking aspects.
So computer scientists like Lamport and software engineers are perturbed by the
lack of rigor in mathematics. Each remembers the fresh wounds of catastrophes due
to avoidable mistakes. Meanwhile, Lamport and others provide systems like TLA+ that
would allow mathematicians to achieve much higher certainty in their own results.
This raises the question, why don’t all mathematicians use automated proof
assistants like TLA+? This is a detailed and complex question. I will not be able
to answer it justly, but I can provide some perspective.
We have argued that the elegance of a proof is important. Mathematicians work hard
to be able to summarize the core idea of a proof in a few words or a representative
picture. Full rigor as the standard for all proofs would arguably strip many proofs
of their elegance, increasing the burden of transmitting intuition and insight
between humans.
The work you put into making an argument automatable is work you could have spent
on making math accessible to humans (via additional papers, talks, and working with
students). These extra activities already serve as correctness checks, so is there
significant added benefit to a formal specification? Lamport's counter is that making it accessible to humans is counterproductive when the result is incorrect.² He would also argue that a structured proof is easier to understand. One underlying issue Lamport's riposte ignores is that mathematics is a social activity, and formal proof specifications are decidedly antisocial. Good for those who want to ensure planes don't crash, bad for those who want to do mathematics.

² I particularly enjoyed his tutorial video course, which you can find at https://lamport.azurewebsites.net/video/videos.html.
well enough to make the proof trivial. Conversely, problem solvers might complain
that proof assistants limit their ability to employ clever constructions. Being
able to invoke a result from a disconnected area of math requires you to re-
implement that entire field in your new context. Dependency management would turn
few-page arguments into
work so well for distributed systems theorems is that those theorems have
relatively few layers of indirection. A handful of bits might represent consensus.
On the other hand, in geometry you might think the thought, “this space is very
flat, and that should have such-and-such effect.” An automated proof assistant will
be of no use there, nor will it help you refine the degree to which your
hypothesized effect is present. You must lay everything out perfectly formally,
even if your definitions haven’t been finalized. Then too often you resort to
writing and rewriting, and before long you’ve stopped doing math entirely. If you
believe Michael Atiyah that the proof is the very last step of mathematical
inquiry, a proof assistant is useless for the majority of your work.
As most engineers can understand, the degree of rigor to require is a tradeoff with
tangible benefits on both sides. Mathematicians opt to let some errors slip
through. Over time these errors will eventually be found and reverted or fixed.
Since technology rarely goes straight from mathematical publication to space probe
control software, the world has enough slack to accommodate it.
1. that there is uniform, objective and firmly established theory and practice of
mathematical proof, and
Thurston instead prefers a question more leading to what he feels is the correct
answer: "How do mathematicians advance human understanding of mathematics?" Many would point to the four color theorem, the shortest proof of which to date involves much brute force case checking by computer. As much as rigor helps one establish correctness, it does not
guarantee synthesis and understanding.
Thurston continues,
Because we have a high standard for clear and convincing thinking and because we
place a high value on listening to and trying to understand each other, we don’t
engage in interminable arguments and endless redoing of our mathematics. We are
prepared to be convinced by others. Intellectually, mathematics moves very quickly.
Entire mathematical landscapes change and change again in amazing ways during a
single career. When one considers how hard it is to write a computer program even
approaching the intellectual scope of a good mathematical paper, and how much
greater time and effort have to be put into it to make it “almost” formally
correct, it is preposterous to claim that mathematics as we practice it is anywhere
near formally correct.
Rather, Thurston claims that reliability of mathematical ideas “does not primarily
come from mathematicians formally checking formal arguments; it comes from
mathematicians thinking carefully and critically about mathematical ideas.”
Chapter 14
—David Mumford
As the application for this chapter, we’ll write a neural network from scratch.
We’ll define the so-called computation graph of a function, and optimize its
parameters using the chain rule and gradient descent. We’ll apply this to the
classic problem of classifying handwritten digits. Along the way, we’ll get a
whirlwind introduction to the theory and practice of machine learning.
Let’s start with our fond memories of single-variable calculus. Recall Definition
8.6 of the derivative of a single-variable function.
Figure 14.1: The steepness of a surface depends on the direction you look.
$$f'(c) = \lim_{x \to c} \frac{f(x) - f(c)}{x - c}$$
On the real line, we defined the symbolic abstraction x → c to mean "any sequence x_n that converges to c," where we declared the derivative only exists if the limit doesn't depend on the choice of sequence. When we work in R^n (which, among many other properties, has a nice measure of distance for vectors d(x, y) = ∥x − y∥) the notion of a convergent sequence generalizes seamlessly. A sequence of vectors x_1, x_2, · · · ∈ R^n converges to c ∈ R^n if the sequence d_n = ∥x_n − c∥ of real numbers converges to zero.
Despite sequence convergence generalizing, the obvious first attempt to adapt the
derivative violates well-definition. We might try the same formula as Definition
14.1, interpreting x and c as vectors, and using the norm in the denominator.
Unfortunately, the “value” of this derivative depends on the sequence chosen.
For example, take f(x_1, x_2) = −x_2² at c = (1, 1), and consider the two sequences x_n = (1 + 1/n, 1) and x'_n = (1, 1 + 1/n), both of which are distance 1/n from c. Because f depends on the second coordinate quadratically (and doesn't depend on the first coordinate at all!), the direction along which x'_n approaches c is steeper than that of x_n. Using the former for "the derivative" would result in

$$\lim_{n \to \infty} \frac{f(x_n) - f(c)}{1/n} = \lim_{n \to \infty} \frac{-1 + 1}{1/n} = 0,$$

while the latter gives

$$\lim_{n \to \infty} \frac{f(x'_n) - f(c)}{1/n} = \lim_{n \to \infty} \frac{-(1 + 1/n)^2 + 1}{1/n} = -2.$$
from the natural world; a hiker traverses switchbacks to avoid walking straight up a hill, and a skier skis in an S shape to slow down their descent.¹ In fact, for f(x_1, x_2) = −x_2², and standing at the point (1, 1), every direction provides a slightly different slope.
This suggests one intuitive way to generalize the one-dimensional definition of the
derivative: parameterize by the direction of approach.
$$\mathrm{Dir}(f, c, v) = \lim_{t \to 0} \frac{f(c + tv) - f(c)}{t}$$
As we’ll see soon, a stronger derivative definition avoids these issues. It will
provide a linear map representing the whole function, and applying linear algebra
produces the directional derivative in any direction. Being linear algebra, we may
choose a beneficial basis, though I haven’t yet made it clear what the vector space
in question is. That will come as we refine what the right definition of “the”
derivative should be.
For dimension 1, the derivative of f had the distinction of providing the most accurate line approximating f at a point. The line through (c, f(c)) with slope f′(c) is closer to the graph of f near c than any other line. We proved this in detail in Theorem 8.11.

¹ I grew up on a hill-covered cattle ranch, and when I was young I noticed the trails traced out by the cows were always nearly flat along the side of the hill. Those massive beasts know how to get from place to place without wasting energy.
This approximator is more than just a line. It's a linear map, and now that we have the language of linear algebra we can discuss it. Define by L_{f,c} the linear map L_{f,c}(z) = f′(c)z. As input, this linear map takes a (one-dimensional) vector z representing a deviation from c. The output is the derivative's approximation of how much f will change as a result. The matrix for L_{f,c} is the single-entry matrix [f′(c)]. Moreover, L_{f,c}(z) is exactly the first-degree Taylor polynomial for the version of f that gets translated so that (c, f(c)) is at the origin. Figure 14.3 shows the difference.
If you don't like shifting f to the origin, we can define the affine linear map (affine just means a translation of a linear map away from the origin), which we'll call a linear approximation to f:

$$L_c(x) = f'(c)(x - c) + f(c).$$
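A quick numerical sketch of this approximation, using f = exp at c = 0 as an arbitrary example (so f′(0) = 1 and f(0) = 1):

```python
import math

# The error f(x) - L_c(x), even after dividing by (x - c), still tends
# to zero as x approaches c: the hallmark of the linear approximation.

def f(x):
    return math.exp(x)

c = 0.0
def L(x):
    return 1.0 * (x - c) + 1.0  # f'(0) = 1, f(0) = 1

for x in [0.1, 0.01, 0.001]:
    ratio = (f(x) - L(x)) / (x - c)
    print(x, ratio)  # the ratio shrinks roughly like x/2
```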
[Figure 14.3: the line y = f′(c)(x − c) + f(c) through the point (c, f(c)), compared with the line y = f′(c)x through the origin.]
The linear approximator has the following property, which is a restatement of the limit definition of the derivative.

Proposition 14.4. For any differentiable f : R → R and its linear approximation L_c(x),

$$\lim_{x \to c} \frac{f(x) - L_c(x)}{x - c} = 0$$

Proof.

$$\lim_{x \to c} \frac{f(x) - L_c(x)}{x - c} = \lim_{x \to c} \frac{f(x) - f(c)}{x - c} - \lim_{x \to c} \frac{f'(c)(x - c)}{x - c} = f'(c) - f'(c) = 0$$
I spell this out in such detail because the existence of a linear approximator (an affine linear function satisfying 14.4) becomes a definition for functions R^n → R:

$$\lim_{x \to c} \frac{f(x) - L_c(x)}{\|x - c\|} = 0$$
Figure 14.4: The linear subspace defined by the total derivative of f sits tangent
to the surface of f at the point the total derivative is evaluated at.
bad such an approximation might be). This rules out the confounding corkscrew
example.
f(c).³
It’s worthwhile to do some concrete examples. First in one dimension, then in
three.
Thus, the tangent space Tf (2) is a copy of R, and the total derivative at c = 2 is
A( x) =
The tangent space T_f(c) = R³, and so the total derivative A : R³ → R has three-dimensional inputs. We won't learn how to compute this map from the definition of f until Section 14.4, so for now we give the answer magically; it's the following 1 × 3 matrix:

$$A = \begin{pmatrix} 6 & 3 & -4 \end{pmatrix}.$$

And as a result,

$$L(x, y, z) = A(x - 3, y - 2, z - 1) + f(3, 2, 1) = 6(x - 3) + 3(y - 2) - 4(z - 1) + 11.$$
Many elementary calculus books have students compute this ("the equation of the tangent plane").
Proof. Suppose there are two functions LA with matrix A and LB with matrix B that
are both total derivatives of f at c. We will show that A = B, and hence that LA =
LB.
First, notice the difference of the two defining limits of the total derivative is
related to the difference between B and A. Below, LB,c( x) = B( x − c) + f( c) and
likewise for LA.
$$\lim_{x \to c} \frac{f(x) - L_{B,c}(x)}{\|x - c\|} - \lim_{x \to c} \frac{f(x) - L_{A,c}(x)}{\|x - c\|}
= \lim_{x \to c} \frac{f(x) - [B(x - c) + f(c)] - f(x) + [A(x - c) + f(c)]}{\|x - c\|}
= \lim_{x \to c} \frac{(A - B)(x - c)}{\|x - c\|}$$
246
Since both LA and LB are total derivatives, both of their defining limits exist and
are zero. This reduces the above to
$$0 = \lim_{x \to c} \frac{(A - B)(x - c)}{\|x - c\|}$$
Assume to the contrary that B ̸= A. Then there must be some unit vector v ∈ R n for
which ( A − B) v ̸= 0. Define the sequence xk → c by xk = c + (1/ k) v. Then, noting
the change in limit index from x to k,
$$\lim_{x \to c} \frac{(A - B)(x - c)}{\|x - c\|} = \lim_{k \to \infty} \frac{(1/k)(A - B)v}{\|(1/k)v\|} = (A - B)v \neq 0.$$

This contradicts the limit being zero, so A = B.
This validates us calling the total derivative the total derivative. There is no
other linear map that can satisfy the defining property. As such, we can define a
more convenient notation for the total derivative.
Definition 14.7. Define the notation Df( c) to mean the total derivative matrix A
of f at the point c.
A quick note on notation: D is a mapping from functions to functions, but the way it's written it looks like c is an argument to a function called "Df". To be formal one might curry the arguments: D(f)(c) is a concrete matrix of real numbers, and D(f) is a function that takes as input a point c and produces a matrix as output. Mathematicians often drop the parentheses to reduce clutter, and even drop the evaluation at c if it's clear from context. One might also subscript the c as in Df_c, or use a pipe (which usually means restriction), as in Df|_c.
Now we’d like to compute total derivatives. To make this process cleaner, we first
deviate to generalize the derivative to functions R n → R m.
Each f_i : R^n → R stands on its own as a function. Moreover, if one defines π_j : R^m → R to be the function that extracts the j-th coordinate of its input, then f_i = π_i ∘ f.⁴,⁵
The definition of the derivative is nearly identical with multiple outputs, but now all codomains are R^m and the limit numerator has a vector norm. The diff between Definitions 14.5 and 14.8 is literally four characters (two m's in R^m and two ∥'s).

$$\lim_{x \to c} \frac{\|f(x) - L_c(x)\|}{\|x - c\|} = 0$$
In most of the rest of this chapter, we’ll restrict to the special case m = 1.
However, the chain rule—a singularly powerful and beautiful tool that will guide
our proofs and application—shines most brightly in arbitrary dimensions. It says
that the derivative of a composition of two functions is the composition (product)
of their total derivative matrices.
Note that this should not be surprising! The best linear approximation of a
composition should naturally be the composition of the best linear approximations
of the composed functions. This formalizes it, and allows us to compute it using
matrices.
This tidy theorem will be the foundation of our neural network application. It is
not far off to say that all you need to train a neural network is “the chain rule
with caching.” However, we’ll delegate the proof—it’s admittedly technical and dull
—to the chapter notes.
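The chain rule can be checked numerically with finite differences. In the sketch below, the functions f and g are arbitrary illustrative choices, and the finite-difference Jacobian is an approximation, not the exact total derivative:

```python
import numpy as np

def finite_difference_jacobian(F, x, h=1e-6):
    """Approximate the total derivative matrix DF(x), one column per input."""
    x = np.asarray(x, dtype=float)
    Fx = np.asarray(F(x), dtype=float)
    J = np.zeros((Fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.asarray(F(x + e), dtype=float) - Fx) / h
    return J

# Illustrative choices: f : R^2 -> R^3 and g : R^3 -> R.
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
g = lambda y: np.array([y[0] + y[1] * y[2]])
c = np.array([0.5, 2.0])

D_gf = finite_difference_jacobian(lambda x: g(f(x)), c)  # derivative of g∘f at c
Dg = finite_difference_jacobian(g, f(c))                 # derivative of g at f(c)
Df = finite_difference_jacobian(f, c)                    # derivative of f at c
# The chain rule says D_gf should (approximately) equal the product Dg @ Df.
```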
The chain rule is an extremely useful tool, and despite being abstract, it lands us
within arms reach of our ultimate goal—easy derivative computations. In the next
section, we’ll see how finding these complicated matrices reduces to computing a
handful of directional derivatives.
Because indeed, it’s exactly the linear-algebraic projection onto the i-th basis
vector. This is also why you’ll see π used as a function, since π is the Greek “p,”
and “p” stands for projection.
Define g : R → R^n by g(t) = c + tv, and consider the composition

$$h : \mathbb{R} \xrightarrow{\; t \mapsto c + tv \;} \mathbb{R}^n \xrightarrow{\; f \;} \mathbb{R}$$

Then Dg(0) is the n × 1 matrix

$$Dg(0) = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}$$
Theorem 14.10 provides two pieces of insight. The first is that the directional derivative wasn't so far off from the "right" definition. For "nicely behaved" functions, the total derivative and the directional derivatives agree. There's even a theorem that relates the two.⁶

⁶ Note that if f were not defined on all of R^n—such as if it has a discontinuity or hole in its domain—this proof could be adapted by only defining g on some small epsilon ball around c.
The second insight is that we can compute any directional derivative easily by
first computing a small number of directional derivatives—one for each basis vector
—and then simply projecting onto the direction of our choice. This projection is
precisely the inner product with the vector of directional derivatives, or, for
multiple output variables, the corresponding matrix multiplication. Projection
works because it coincides with the way to express a vector in terms of an
orthonormal basis (Proposition 12.15).
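This recipe, computing one directional derivative per basis vector and then projecting, is easy to sketch with finite differences. The function below is the chapter's example f(x, y, z) = x²y + cos(z), and the step size h is an assumption of the approximation:

```python
import numpy as np

def directional_derivative(f, c, v, h=1e-6):
    """Finite-difference approximation of Dir(f, c, v)."""
    return (f(c + h * v) - f(c)) / h

def gradient(f, c, h=1e-6):
    """Assemble the gradient from the directional derivatives along the
    standard basis vectors; projection then recovers every other direction."""
    n = len(c)
    basis = np.eye(n)
    return np.array([directional_derivative(f, c, basis[i], h) for i in range(n)])

f = lambda x: x[0] ** 2 * x[1] + np.cos(x[2])
c = np.array([1.0, 2.0, 0.5])

v = np.array([1.0, -1.0, 2.0])
v = v / np.linalg.norm(v)  # an arbitrary unit direction

direct = directional_derivative(f, c, v)  # compute Dir(f, c, v) directly
projected = np.dot(gradient(f, c), v)     # or project the gradient onto v
```

The two values agree up to the finite-difference error.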
Speaking in terms of general bases is fine, and on occasion you’ll find derivatives
are easier to compute with a clever change of coordinates. However, it’s usually
easiest to use the same, simple basis: each basis vector is the standard basis
vector for R n, and is denoted dxi. This vector represents a change in a single
input variable while leaving all others constant. If you have names for your
variables, like f( x, y, z) = x 2 y + cos( z), then you would use dx, dy, and dz.
When we do examples, we’ll stick to using xi and dxi.
It’s one of those subtle, technical theorems that happens to show up as a core
technique for a lot of proofs.
An exercise will recommend you investigate, but we won’t explicitly use it.
of c), the example above can be written as ∂f/∂x_1 = 2x_1x_2. One refers to the operation of taking a partial derivative with respect to x by the function named ∂/∂x, with the juxtaposition ∂f/∂x denoting the application of this operation to f.
When your chosen basis is the standard basis for each variable, the resulting total
derivative matrix Df is called the gradient of f, denoted ∇f. The symbol ∇ is often
spoken “grad,” and officially called a “nabla.” We’ll discuss the gradient in more
detail below, because the gradient has a useful geometric property.
Below I will write the matrix generically in the sense that it works for any choice of c = (x_1, x_2, x_3), in the same way that when writing a single-variable derivative one uses the same variable before and after taking the derivative.

$$\nabla f = \begin{pmatrix} 2x_1x_2 & x_1^2 & -\sin(x_3) \end{pmatrix}$$
With this, we can compute the directional derivative in the direction of a vector, say the unit vector v = (1/2)(1, −1, √2):

$$\langle \nabla f, v \rangle = \begin{pmatrix} 2x_1x_2 & x_1^2 & -\sin(x_3) \end{pmatrix} \cdot \frac{1}{2}\begin{pmatrix} 1 \\ -1 \\ \sqrt{2} \end{pmatrix} = \frac{1}{2}\left(2x_1x_2 - x_1^2 - \sqrt{2}\sin(x_3)\right)$$
For a generic direction v = (v_1, v_2, v_3), evaluating the gradient at the point (1, 2, π/2) gives

$$\langle \nabla f(1, 2, \pi/2), v \rangle = \begin{pmatrix} 4 & 1 & -1 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = 4v_1 + v_2 - v_3$$

since 2x_1x_2 = 4, x_1^2 = 1, and −sin(x_3) = −1 at that point.
Any way you slice it, the value we want is just one inner product away!
Many authors don’t write the gradient as a vector in this way. Instead, they denote
the basis vectors as dxi, and the gradient is written as a single linear
combination of these basis vectors. For the example f we’ve been using, it would be
∇f = 2x_1x_2 dx_1 + x_1^2 dx_2 − sin(x_3) dx_3
This notation has the advantage that you can use it while still hating linear
algebra: this is just the inner product written out before choosing values for v
1 , v 2 , v 3, i.e., the coefficients of dx 1 , dx 2 , dx 3 in the vector v to
evaluate. It also helps you keep in mind that dxi are meant to represent deviations
of xi from the point being evaluated. Sometimes they’re written as a “delta”, ∆ xi
or δxi, since delta is commonly used to represent a change.9 On the other hand,
since it uses the symbols dxi, it’s easy to confuse the meaning with d/ dxi. We
learned to love linear algebra. We’ll stick to the vector notation.
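A quick numerical check of the worked example, where the finite-difference step is an assumption of the cross-check:

```python
import numpy as np

# For f(x1, x2, x3) = x1^2 * x2 + cos(x3), the directional derivative at
# (1, 2, pi/2) in a direction v should come out to 4*v1 + v2 - v3.

f = lambda x: x[0] ** 2 * x[1] + np.cos(x[2])
c = np.array([1.0, 2.0, np.pi / 2])
grad_at_c = np.array([2 * c[0] * c[1], c[0] ** 2, -np.sin(c[2])])  # = (4, 1, -1)

rng = np.random.default_rng(1)
v = rng.standard_normal(3)  # an arbitrary direction

inner = np.dot(grad_at_c, v)            # the inner product formula
fd = (f(c + 1e-6 * v) - f(c)) / 1e-6    # finite-difference cross-check
```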
Next we study the geometry of the gradient. Henceforth, when we say “differentiable
function” we mean a function with a total derivative, we’ll assume all functions
are differentiable, and we’ll seamlessly swap between total derivatives,
directional derivatives, linear maps, and matrices.
whose input is a “direction to look in” and whose output is how steep the
derivative is in that direction. Since ∇f is derived from f, it’s natural to ask
how the geometry of ∇f relates to the shape of f.
The answer reveals itself easily with a strong grasp of the projection function from linear algebra. Recall the function proj_v(w), which projects a vector w onto a unit vector v. We studied this in Chapters 10 and 12, and there we noted some interesting facts. Let's recall them here. Let v be a unit vector and w an arbitrary vector of the same dimension.
1. The standard inner product ⟨w, v⟩ is the signed length of proj ( w). The sign is
v
positive if the result of the projection points in the same direction as v and
negative if it points opposite to v.
2. If you project w onto v, and v is not on the same line as w, then ∥proj_v(w)∥ < ∥w∥.
3. An alternate formula for ⟨v, w⟩ is ∥v∥∥w∥ cos( θ), where θ is the angle between v
and w. In the case that ∥v∥ = 1, the formula is ∥w∥ cos( θ).
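These facts are easy to sanity-check numerically. Here's a quick sketch (our own illustration, with a made-up unit vector v and arbitrary w):

```python
import math

def inner(u, w):
    return sum(a * b for a, b in zip(u, w))

def norm(u):
    return math.sqrt(inner(u, u))

v = [3 / 5, 4 / 5]  # a unit vector
w = [2.0, -1.0]

# Fact 3: <v, w> = ||v|| ||w|| cos(theta), with theta the angle between them
theta = math.atan2(w[1], w[0]) - math.atan2(v[1], v[0])
print(abs(inner(v, w) - norm(v) * norm(w) * math.cos(theta)) < 1e-12)  # True

# Facts 1 and 2: the projection of w onto v has signed length <w, v>,
# and it is never longer than w itself
proj = [inner(v, w) * vi for vi in v]
print(abs(norm(proj) - abs(inner(v, w))) < 1e-12, norm(proj) <= norm(w))
```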
All of these point to the same general insight, which is a theorem with a famous name.

Theorem (The Cauchy-Schwarz inequality). Let v, w ∈ Rn be vectors and ⟨·, ·⟩ the standard inner product. Then |⟨v, w⟩| ≤ ∥v∥∥w∥, with equality holding if and only if v and w are linearly dependent.
The Cauchy-Schwarz inequality has many proofs. I’ll just share one that uses the
cosine formula above to emphasize the geometry. You’ll do a different proof in the
exercises.
Proof. From ⟨v, w⟩ = ∥v∥∥w∥ cos(θ), and since −1 ≤ cos(θ) ≤ 1, it follows that |⟨v, w⟩| = ∥v∥∥w∥ |cos(θ)| ≤ ∥v∥∥w∥. Equality holds exactly when cos(θ) = ±1, i.e., when v and w lie on a common line, which is the same as being linearly dependent.
The details of this proof show more than the statement. Since the directional
derivative is a projection of the gradient ∇f onto a unit vector v—i.e., ⟨∇f( x) ,
v⟩—if you want to maximize the directional derivative, v should point in the same
direction as ∇f( x). Said a different way, the gradient ∇f( x) points in the
steepest possible direction.
One is tempted to think this theorem is amazing (it is), but in light of our linear
algebraic preparation it is a trivial consequence of how linear projection works.
We can exploit this further. A level curve of f at c is the set of constant-height inputs { x ∈ Rn : f(x) = c }.
Since many things in life and science can be modeled using functions R n → R, a
common desire is to find an input x ∈ R n which maximizes or minimizes such a
function. For the sake of discussion, let’s suppose we’re looking for a minimum.
Even when a mathematical model f exists for a phenomenon, minimizing it might be
algebraically intractable for a variety of reasons. For example, it might involve
functions that are difficult to separate, such as trigonometric functions and
threshold functions. Alternatively, it might simply be so large as to avoid any
human analysis whatsoever, as is often the case with a neural network that has
millions of parameters related to labeled data. The rest of this chapter is devoted
to understanding how to tackle such situations, and the core idea is to “follow”
the direction indicated by the gradient.
Now we’ll use the geometry of the gradient to derive a popular technique for
optimizing functions R n → R. First, we review the situation for single-variable
functions. In Chapter 9 we outlined the steps to solve a one-dimensional
minimization problem, which I’ll repeat here:
• Define your function f : R → R whose input x you control, and whose output you'd like to minimize. Select a range of interest a ≤ x ≤ b.
• Compute the derivative f′ and find the critical points, i.e., the solutions of f′(x) = 0 that lie in [a, b].
• The optimal input x is the one minimizing f(x), where x is among the critical points, or x = a or x = b.
2 x). Equating this to the zero vector results in an infinite family of solutions
given by x +
If the equations are simple enough, one can apply a classical technique called
Lagrange multipliers to compute optima. This was a central workhorse of a lot of
pre-computer-era optimization. In general, Lagrange multipliers fail to help in
almost every modern application, so we relegate it to the exercises. We’ll instead
focus on a more general algorithmic technique that works best when the function
you’re optimizing is intractable for pen-and-paper analysis. The technique is
called gradient descent, and in modern times it has grown into a huge field of
study.
Gradient descent (or gradient ascent, if you’re maximizing) works as follows. Given
f, start at a random point x 0. Iteratively evaluate the gradient ∇f( xi), which
points in the direction of steepest ascent of f, and set xi+1 = xi − ε∇f( xi),
where ε is some small scalar. The subtraction is the focus: you “take a small step”
in the opposite direction of the gradient to get closer to a minimum of f. So long
as the gradient is a reasonable enough approximator of f at each xi, each f( xi+1)
is smaller than the f( xi) before it. Repeat this over and over again, and you
should find a minimum of some sort. 10
Gradient descent intuitively makes sense, but there are a few confounding details
that trick this algorithm into stopping before it reaches a minimum. The devil lies
in the details of the stopping condition: if we’re at a minimum, the gradient
should definitely be the zero vector (there’s no direction of ascent at all, so
there’s no “steepest” direction), but does it work the other way as well?
Definitely not. However, to get a useful feel for why, we have to correct an
injustice from Chapter 8: we never discussed the geometry of the second derivative.
The derivative of a single variable function represents the slope of that function
at a given point. Higher derivatives ( f′′, f(3) , f(4), etc.) correspond to
certain sorts of curvature.
There are definitions of curvature that are much more precise and expressive than
the second derivative. In fact, the second derivative is quite bad at it. It only
captures “second-order” curvature of a function. So it sees no curvature in f( x) =
x 4 at x = 0, despite that this function is very obviously concave up. The reason
is that close to zero x 4 is also very close to zero, and so it makes the function
quite flat in that region. Higher derivatives make up for the second derivative’s
failure, but looking at a finite number of derivatives will never provide the whole story.11

10 Or decrease without bound, but in our application zero will be an absolute lower bound by design.

[Figure: graphs of y = x², y = 0.5x², and y = 5x², showing how the coefficient scales the second-order curvature of a parabola.]

In particular, Theorems 14.14 and 14.18 are
only sufficiency tests for a max/min. They cannot guarantee the detection of
optima.
proves that f has a local max at (1 , 0), and likewise a local minimum close to
(3 , − 10).
Now we can prove the theorem that concavity is sufficient to detect a local
min/max.
Proof. The Taylor series is our hammer. Since f′(c) = 0, near c we can expand f(x) using a Taylor series that primarily depends on f′′(c).

f(x) = f(c) + (f′′(c)/2)(x − c)² + r(x)

Here r(x) is the remainder term of the Taylor Theorem (Theorem 8.14). It's a degree-3 polynomial in x − c whose coefficient depends on an evaluation of f(3)(z) at some unknown point z ∈ (c, x). The most important detail of this is that it's a degree-3 polynomial, but in complete detail, it's

r(x) = (f(3)(z)/6)(x − c)³

We need to argue that because x − c is very small when x is close to c, the value of (x − c)³ is dwarfed by the value of (x − c)², so that the min/max behavior of f is determined solely by the (x − c)² term.
Let’s suppose that f′′( c) > 0, so that we need to show f( c) is a local min. In
this case we want an interval ( a, b) on which f( c) ≤ f( x) for all x. Rearranging
the formula above,
11 As we saw in Chapter 8, there are nonzero functions so flat at a point that all
of their derivatives are zero!
257
f ( c) = f ( x) − f ′′( c) ( x − c)2 − r( x) .
If the term [ −f′′( c)( x − c)2 − r( x)] is not positive on ( a, b), then f( c) ≤
f( x). So 2
the theorem will be proved if we can find an interval on which that term is at most
zero.
6 f ′′( c)
Since the value of r( x) depends on z (which can be different for different values
of x), we can’t proceed unless we eliminate the dependence on z. We’ll do that by
estimating, i.e., replacing f(3)( z) with the max of f(3) over an interval. So
start with some fixed interval around c, say ( c − 0 . 01 , c + 0 . 01), 12 and let
M > 0 be the maximum value of
|f(3)( z)/(3 f′′( c)) | on that interval. I.e., M is the largest magnitude of the
coefficient of ( x − c)3 in the above inequality that can occur close to c. Then we
need to find an interval, perhaps smaller than ( c − 0 . 01 , c + 0 . 01), for
which the following (simplified) inequality is true for all x in that interval.
( x − c)2 ≥ M ( x − c)3
But this is easy! So long as x ̸= c we can simplify to see we just need a small
enough interval that ensures ( x − c) ≤ 1/ M. This will be true of either ( c − 1/
M, c + 1/ M) or ( c − 0 . 01 , c + 0 . 01), whichever is smaller.
That was a lot of work to achieve a proof. Recalling our discussion of waves in
Chapter 12, the reader might begin to understand why a working physicist would
rather erase terms with reckless abandon than wade through the strange existential
z’s that plague Taylor series. However, as was the case with matrix algebra
providing an elegant (though intentionally leaky) abstraction for linear maps,
mathematical analyses like these have their own abstractions to aid computation
while maintaining rigor. In this case, most programmers are aware of it: big-O
notation. We’ll display its use in Chapter 15.
Knowing all the local mins and maxes of f—along with a slight amount of extra information—determines the global min/max of f.
One can also think of a directional derivative as a sort of “local” property. It’s
the derivative when one “only looks” in a certain window, while the total
derivative is global.
If you can show that each directional derivative is continuous—or even just that
the partial derivatives are continuous—then you automatically get the global
(total) derivative.
You have built global structure out of local pieces. Of course, the total
derivative at a point is also a local construct from a different perspective. The
total derivative describes the approximate structure of f at a point, and with
enough information about the total derivative at every point of f (and a few bits
of extra information), you can completely reconstruct f. So there are multiple
scales of locality that allow one to discuss local and global properties, and how
they relate to each other.
The Hessian
While there are still local maxes and mins of the obvious sort, there are many ways
a local min/max can fail to exist. An important way is called a saddle point. The
shape of these is quite literal: the surface looks like the saddle of a horse, or
the shape of a potato chip, in which the curvature goes up along one direction and
down along another.
With many variables come many different directions along which curvature can differ. You might imagine a function with 5 variables, each axis giving two choices of up-curvature or down-curvature, for a total of 2⁵ = 32 different kinds of saddles (including the normal max/min). The way to get a handle on these forms is to look at the matrix of all ways to take second derivatives. First we define notation for second derivatives.
notation for second derivatives.
∂²f/∂xi∂xj = ∂/∂xi (∂f/∂xj)
Personally I hate this notation, particularly how arbitrarily it’s defined so that
the “numerator” of the variable names are smushed together. My inner programmer
cries out in anguish, because it’s breaking algebra and functional notation at the
same time by pretending they’re the same. Are we taking the squared derivative with
respect to a squared
variable? Multiplying the top and bottom of a function name separately? Your
syntactic sugar is rotting my brain! Alas, the notation is widespread, and the only alternative I know of, f_{xi xj}(x) = ∂²f/∂xj∂xi, is not all that much better.
One might expect the mixed partials with respect to xi, xj and xj, xi to be
different due to the order of the computation. Under sufficiently strong
conditions, they turn out to be the same.
∂²f/∂xi∂xj = ∂²f/∂xj∂xi
We quote this theorem without proof, but notice that, in addition to reducing our
computation duties by a half, it gives a hindsight rationalization for the fraction
notation. If the order of partial derivatives doesn’t matter, then we need not
bother with the functional notation that emphasizes order precedence.
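As a sanity check (our own, not from the text), we can estimate a mixed partial numerically and compare it with the analytic value. For f(x, y) = x²y + sin(y), both orders of differentiation give ∂²f/∂x∂y = 2x:

```python
import math

def f(x, y):
    return x ** 2 * y + math.sin(y)

def mixed_partial(f, x, y, h=1e-4):
    # symmetric finite difference for the mixed second partial
    return (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)

# analytically the mixed partial is 2x; at (1.5, 0.7) that's 3.0
print(abs(mixed_partial(f, 1.5, 0.7) - 3.0) < 1e-3)  # True
```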
Next we define the Hessian, which is the matrix of mixed partial derivatives of a
function.
H =
[ ∂²f/∂x1²       ∂²f/∂x1∂x2    · · ·   ∂²f/∂x1∂xn ]
[ ∂²f/∂x2∂x1     ∂²f/∂x2²      · · ·   ∂²f/∂x2∂xn ]
[     ...             ...                   ...    ]
[ ∂²f/∂xn∂x1     ∂²f/∂xn∂x2    · · ·   ∂²f/∂xn²   ]
Just like the gradient, H( f) is really a function whose input is a point x in the
domain of f, and the output is the matrix H( f)( x). The notation gets even hairier
since H( f)( x) is itself a linear map R n → R n. In an exercise you’ll interpret
this linear map to make more sense of it.
We’ll skip the proof for brevity, but our understanding of eigenvalues and
eigenvectors provides a tidy interpretation. The eigenvectors of nonzero
eigenvalues correspond to the directions (when looking from x) in which the
curvature of f is purely upward or downward, and maximally so. In a sense that can
be made rigorous, because H has an orthonormal basis of eigenvectors, these
curvatures “don’t interfere” with each other. If the surface were an ellipsoidal
bowl, the eigenvectors would be the “axes” of the bowl.
For a saddle point, the eigenvectors are the directions of the saddle that are
parallel and perpendicular to the imagined horse’s body. This is shown in Figure
14.8.
Of course, all of this breaks down if the sort of curvature we’re looking at can’t
be captured by second derivatives. There might be an eigenvalue of zero, in which
case you can’t tell if the curvature is positive, negative, or even completely
flat.
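As a concrete illustration (our own; f(x, y) = x² − y² is the canonical saddle, not a function from the text), the Hessian can be estimated by finite differences. Here it comes out diagonal, so its eigenvalues are the diagonal entries and its eigenvectors are the coordinate axes, one up-curving and one down-curving direction:

```python
def f(x, y):
    # the canonical saddle: curves up along x, down along y
    return x ** 2 - y ** 2

def hessian(f, x, y, h=1e-4):
    # finite-difference estimates of the second partials
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return [[fxx, fxy], [fxy, fyy]]

H = hessian(f, 0.0, 0.0)
# H is diagonal, so its eigenvalues are +2 and -2: a saddle point.
print(round(H[0][0]), round(H[1][1]))  # 2 -2
```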
But this raises a natural question: if the gradient gives you first derivative
information, and the Hessian gives you second derivative information, can we get
third derivative information and higher? Yes! And can we use these to form a sort
of “Taylor series”
for multivariable functions? More yes! One difficulty with this topic is the mess
of notation. A fourth-derivative-Hessian analogue is a four-dimensional array of
numbers. With more dimensions comes more difficulty of notation (or the need for a
better abstraction).
Nevertheless, we can at least provide the analogue of the Taylor series for the
first two terms:
Figure 14.8: A function with a saddle point. The eigenvectors of the Hessian at the
saddle point are shown as arrows, and represent the maximally positive and negative
curvatures at the saddle point.
f(x + v) ≈ f(x) + ⟨∇f, v⟩ + (1/2)⟨Hv, v⟩
There are two caveats to this. First, the Hessian is expensive to compute. Its size is the square of the size of the gradient. Second, a provable optimum is
something of a luxury. Most optimization problems benefit well enough from
progressively improving an approximate optimum. Gradient descent does precisely
this, and allows you to easily trade off solution quality for runtime.
Informally, gradient descent is the process: "go slowly in the opposite direction of the gradient until the gradient is zero." More formally, choose a stopping threshold ε > 0 and a step size η > 0, then:

1. Start at some initial point x.
2. While ∥∇f(x)∥ > ε, update x ← x − η∇f(x).
3. Output x.
This algorithm can be fast or slow depending on the choice of the starting point
and the smoothness of f. If x lands in a bowl, it will quickly find the bottom. If
x starts on a plateau of f, it will never improve. For this reason, one might run
multiple copies of this loop, and output the most optimal run. If the inputs are
chosen randomly, there’s a good chance one avoids the avoidable plateaus.
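Here is a minimal sketch of this loop in code (our own illustration; the step size and threshold are arbitrary choices), applied to a bowl-shaped function whose minimum we know:

```python
def gradient_descent(grad, x0, step=0.1, threshold=1e-8, max_steps=10000):
    # repeatedly take a small step opposite the gradient
    x = list(x0)
    for _ in range(max_steps):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < threshold:
            break  # the gradient is (nearly) zero; stop
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# minimize f(x, y) = (x - 1)^2 + (y + 2)^2, whose minimum is at (1, -2)
grad_f = lambda x: [2 * (x[0] - 1), 2 * (x[1] + 2)]
minimum = gradient_descent(grad_f, [5.0, 5.0])
print([round(c, 4) for c in minimum])  # [1.0, -2.0]
```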
One might wonder, if the Hessian gives more information about the curvature of f,
why not use the Hessian in determining the next step to take? You can! But
unfortunately, since the Hessian is often an order of magnitude more difficult to
compute than the gradient—and the gradient already requires mountains of
engineering to get right—
it’s simply not feasible to do so. And, as you’ll get to explore in the exercises,
there are alternative techniques that allow one to “accelerate” gradient descent in
a principled fashion without the Hessian.
The primary practical use of the chain rule is to allow us to compute complicated
derivatives mechanically. In particular, one decomposes a function into a large
composition of simple pieces, where the derivative of each piece is known, and
applies the chain rule to build up the full derivative from the pieces.
When the outer function g : Rm → R is scalar-valued (we can think of isolating one of the coordinates of g to inspect), the chain rule becomes the following product, where we hide the evaluation point for brevity.
D(g ∘ f) = (∇g) Df

         = [ ∂g/∂f1   ∂g/∂f2   · · ·   ∂g/∂fm ]  [ ∂f1/∂x1   ∂f1/∂x2   · · ·   ∂f1/∂xn ]
                                                  [ ∂f2/∂x1   ∂f2/∂x2   · · ·   ∂f2/∂xn ]
                                                  [    ...        ...                ... ]
                                                  [ ∂fm/∂x1   ∂fm/∂x2   · · ·   ∂fm/∂xn ]
Focusing on a single input variable, say x1, multiplying out the first column of this product gives

∂g/∂x1 = [ ∂g/∂f1   · · ·   ∂g/∂fm ]  [ ∂f1/∂x1 ]
                                       [ ∂f2/∂x1 ]
                                       [   ...   ]
                                       [ ∂fm/∂x1 ]

        = Σ_{i=1}^{m} (∂g/∂fi)(∂fi/∂x1)

The notation suggests that the fi in the last sum "cancel," which is not true but some find it helpful.
For a longer chain of dependencies, say f depending on g, which depends on h, then i, then j, and finally the input x, the chain rule telescopes:

∂f/∂x = (∂f/∂g)(∂g/∂h)(∂h/∂i)(∂i/∂j)(∂j/∂x)
Notice that the terms in this chain can be grouped and re-grouped arbitrarily. For example, if you've already computed ∂g/∂j, then to get ∂f/∂x you need only compute the missing terms:

∂f/∂x = (∂f/∂g)(∂g/∂j)(∂j/∂x)
This allows one to use caching to avoid recomputing derivatives over and over
again.
That’s especially useful when there are many dependency branches. In fact, as we’ll
realize concretely when we build a neural network, the concept of derivatives with
branching dependencies is core to training neural networks. To prepare for that,
we’ll describe the abstract idea of a computation graph and reiterate how the chain
rule is computed recursively through such a network.
[Figure: a small computation graph with vertices for operations such as +, x², and log.]
For us, the operations fv at each vertex will always be differentiable (with one
caveat), and hence G will be differentiable, though the definition of a computation
graph doesn’t require differentiability.
Now we’ll reiterate the chain rule for an arbitrary computation graph. Say we have
a programmatic representation of a computation graph for G, and somewhere deep in
the graph is a vertex with operation f( a 1 , . . . , an). We want to compute a
partial derivative of G with respect to an input variable that may be even deeper
than f. Using the chain rule, we’ll describe the algorithm for computing the
derivative generically at any vertex and then apply induction/recursion. More
specifically, at vertex f we’ll compute ∂G/ ∂f and multiply it by ∂f/ ∂ai to get
∂G/ ∂ai.
We know ∂f/∂ai by assumption, having designed the graph so the gradient ∇fv of each vertex v is easy to compute. By induction, for each vertex hi that consumes the output of f, we can compute ∂G/∂hi, and then

∂G/∂f = Σ_i (∂G/∂hi) · (∂hi/∂f).
Figure 14.10: A generic node of a computation graph. Node f has many inputs, its
output feeds into many nodes, and each of its inputs and outputs may also have many
inputs and outputs.
Once we have that, each ∂G/ ∂ai = ( ∂G/ ∂f) · ( ∂f/ ∂ai), as desired. Note that if
G
depends on ai via another path through the computation graph, then ∂G/ ∂ai sums
over all such paths.
Because we use the vertices that depend on f as the inductive step, the base case
is the output vertex, and there ∂G/ ∂G = 1. Likewise, the top of the recursive
stack are the input vertices, and at the end we’ll have ∂G/ ∂xi for all inputs xi.
As one can easily see, a network with heavily interdependent vertices requires one
to cache the intermediate values to avoid recomputing derivatives everywhere.
That’s exactly the strategy we’ll take with our neural network.
Neural networks are extremely popular right now, and have been since roughly 2010.
Perhaps surprisingly, the mathematical techniques that are used to train these
networks are largely the same as they were decades ago. They are all variations on
gradient descent, and the specific instance of gradient descent applied to training
neural networks is called backpropagation.
In this section, we’ll implement a neural network from scratch and train it to
classify handwritten digits with relatively decent accuracy. Along the way, we’ll
get a taste for the theory and practice of machine learning.
[Figure: a decision-tree-style digit classifier; internal nodes ask yes/no questions, and the leaves output answers like "It's a 7!", "It's a 4!", . . . , "It's a 0!"]
Machine learning is the process of using data to design a program that performs
some task.
1. Collect a large sample of handwritten digits, and clean them up (as all
programmers know, we must sanitize our inputs!).
2. Get humans to provide labels for which pictures correspond to which digits.
3. Run a machine learning training algorithm on the labeled data, and get as output
a classifier that can be used to label new, unseen data.
A slow, brutish training algorithm might be: generate all possible decision trees
in increasing order of size, and select the first one that’s consistent with the
data.
To get a more pungent whiff, let's jump right into the handwritten digit dataset we'll use in the remainder of this chapter. The dataset is a famous one that goes by the irrelevant acronym MNIST ("Modified National Institute of Standards and Technology," referring to the institution that created the original dataset). The database consists of 70,000 data points, each of which is a 28-by-28 pixel grayscale image of a handwritten digit. The digits have been preprocessed in various ways, including resizing, centering, and anti-aliasing. The raw dataset was originally created around 1995, and since 1998 the machine learning researchers Yann LeCun, Corinna Cortes, and Christopher Burges have provided
[Figure: a 28-by-28 grid of integer pixel values in [0, 255]; most entries are 0, with larger values (e.g., 192, 226, 252) tracing the strokes of the digit.]
Figure 14.12: A training point for a digit 7 (aligned to make it easier to see).
the cleaned copy on LeCun’s website.14 We also include a copy in the code samples
for this book, since their version of the dataset has a non-standard encoding
scheme.
MNIST is the Petersen graph of machine learning: every technique should first be tried on it.
The data is split into a training set and a test set, the former having 60,000
examples and the latter 10,000, which are stored in separate files. The separation
exists to give a simulation of how well a classifier trained on the training data
would perform on “new”
data. As such, to get a good quality estimate, it’s crucial that the training
algorithm uses no information in the test set. We load the data using a helper
function, which scales the pixel values from [0 , 255] to [0 , 1]. For our
application, we’ll simplify the problem a bit to distinguishing between two digits:
is it a 1 or a 7? The digit 1 corresponds to a label of 0, and a digit 7
corresponds to a label of 1.
14 http://yann.lecun.com/exdb/mnist/
def load_1s_and_7s(filename):
    examples = []
    with open(filename, 'r') as infile:
        for line in infile:
            tokens = [int(x) for x in line.strip().split(',')]
            label = tokens[0]
            # scale the pixel values from [0, 255] to [0, 1]
            example = [x / 255 for x in tokens[1:]]
            if label == 1:
                examples.append([example, 0])
            elif label == 7:
                examples.append([example, 1])
    print('Data loaded.')
    return examples
Before we go on, I must emphasize that the first two steps in the “machine learning
recipe,” collecting and cleaning data, are much harder than they appear. A misstep
in any part of these processes can cause wild swings in the quality of the output
classifier, and getting it right requires clear and strict procedures. See the
Chapter Notes for more on this.
A classifier is a function from examples to labels, in our case Rn → { 0, 1 }. For the handwritten digits example, think of this as the classifier for "is the digit a 7 or not?"
Next, one defines a so-called hypothesis class. This is the universe of all
possible output classifiers that a learning algorithm may consider. A useful
hypothesis class has natural parameters that vary the behavior of a hypothesis. The
learning algorithm learns by selecting parameters based on examples given to it.
One of the most common examples, and a building block of neural networks, is the linear threshold function, built from the inner product.
Definition 14.21. Given weights w ∈ Rn and a bias b ∈ R, the linear threshold function Lw,b : Rn → { 0, 1 } is defined by

Lw,b(x) = { 1   if ⟨w, x⟩ + b > 0
          { 0   otherwise.
Linear threshold functions have n + 1 parameters: the n weights w and the bias b.
The linear threshold function lives up to its name, thanks to the geometry of the
inner product.
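A direct translation of the definition into code (our own sketch; the weights below are made up):

```python
def linear_threshold(w, b):
    # L_{w,b}(x) = 1 if <w, x> + b > 0, else 0
    def L(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
    return L

# a hypothetical classifier on R^2: points above the line x1 + x2 = 1 get label 1
L = linear_threshold([1.0, 1.0], -1.0)
print(L([3.0, 0.5]), L([0.2, 0.3]))  # 1 0
```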
One must also decide how to measure the quality of a proposed classifier. Measures
vary depending on the learning model, but in practice it usually boils down to:
does the classifier accurately classify the slice of data that has been cordoned
off solely for the purpose of evaluation? This special slice of data is the test
set. In the exercises, we’ll explore a handful of theoretical learning models that
give provable guarantees. Though these models are theoretical—for example, they
assume the true labels have a particular structure—they serve as the foundation for
all principled machine learning models. In these models, if a classifier is
accurate on a test set, it will provably generalize to accurately classify new
data.
A simple example learning model and problem, which is a building block for many
other learning problems,16 is the following. Given labeled data points chosen
randomly from a distribution over R n that can be classified perfectly by a linear
threshold function, design an algorithm that finds a “good” threshold function,
i.e., one that will generalize well to new examples drawn from the same
distribution. We’ll explore this more in the exercises.
The trained network will evaluate an input and produce a binary label saying
whether the input is a 1 (a label of zero) or a 7 (a label of one).18
[Figure 14.13: the graph of the sigmoid y = e^x/(1 + e^x), which increases from 0 to 1.]
network.train(dataset)
network.evaluate(new_example)
The most important component operation that is used to build up a neural network is
the linear halfspace Lw,b from Definition 14.21. We’ll call a vertex of the
computation graph corresponding to a linear halfspace a linear node, and each
linear node will have its own independently tunable set of parameters, w and b.
In principle, there must be more to a neural network than linear nodes. As we know
well from linear algebra, a composition of linear functions is still linear. The
geometry of the space of handwritten digits is probably complicated enough to
warrant more help. We should include operations in our computation graph that
transform the input in nonlinear ways.
A historically prevalent operation is the sigmoid function, that is, the single-
variable function defined by σ( x) = ex/(1 + ex), with the graph depicted in Figure
14.13. The sigmoid is nonlinear, differentiable, and its output is confined to [0 ,
1]. You may hear this operation compared to the “impulse” of a neuron in a brain,
which is why the sigmoid is often called an activation function. Though neural
networks are called “neural,” the name is merely an inspiration. Simply put,
sigmoids and other activation functions introduce nonlinearity in a useful way.
Typically, one applies the single-input activation function to the output of every
linear node. The combined pair of a linear node and its activation function are
called a neuron.
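In code, the sigmoid is one line, and (a fact worth knowing for training, though not stated above) its derivative satisfies σ′(x) = σ(x)(1 − σ(x)). A small sketch of our own:

```python
import math

def sigmoid(x):
    # sigma(x) = e^x / (1 + e^x)
    return math.exp(x) / (1 + math.exp(x))

def sigmoid_derivative(x):
    # the identity sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

# sanity check against a finite difference at x = 2
h = 1e-6
numeric = (sigmoid(2.0 + h) - sigmoid(2.0)) / h
print(abs(numeric - sigmoid_derivative(2.0)) < 1e-5)  # True
```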
[Figure: the network architecture; inputs x1, . . . , x784 feed into layer 1 (10 linear → ReLU nodes), then layer 2 (10 linear → ReLU nodes), then a final linear → sigmoid output node.]
ReLU(x) = { x   if x ≥ 0
          { 0   otherwise
A ReLU needs no plot, as it’s simply the function: truncate negative values to
zero. The ReLU is particularly interesting because it is not differentiable!
However, it only fails to have a derivative at x = 0, and in practice one can
simply ignore the problem. The ReLU
implements the thresholding of the linear halfspace, but with the twist that
“activated”
neurons can express how activated they are. Another advantage, which is
particularly nice for hardware optimization, is that evaluating a ReLU and its
derivative requires only branching comparisons and constants. No exponential math
is required.
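In code (our own sketch), the ReLU and its derivative need only a comparison; we adopt the common convention of defining the derivative at 0 to be 0:

```python
def relu(x):
    # truncate negative values to zero
    return x if x >= 0 else 0

def relu_derivative(x):
    # undefined at exactly x = 0; by convention we return 0 there
    return 1 if x > 0 else 0

print(relu(3.5), relu(-2.0), relu_derivative(3.5), relu_derivative(-2.0))  # 3.5 0 1 0
```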
def build_network():
    input_nodes = InputNode.make_input_nodes(28*28)

    # two layers of 10 linear -> ReLU nodes each, as in the architecture figure
    first_layer = [LinearNode(input_nodes) for _ in range(10)]
    first_layer_relu = [ReluNode(node) for node in first_layer]

    second_layer = [LinearNode(first_layer_relu) for _ in range(10)]
    second_layer_relu = [ReluNode(node) for node in second_layer]

    linear_output = LinearNode(second_layer_relu)
    output = SigmoidNode(linear_output)
    error_node = L2ErrorNode(output)

    network = NeuralNetwork(input_nodes, output, error_node)
    return network
The final output of the network is a real number in [0 , 1]. Labels are binary
{ 0 , 1 }, and so we interpret the output as a probability of the label being 1.
Then we can say that the label predicted by a network is 1 if the output is at
least 1/2, and 0 otherwise.
You might be wondering how someone comes up with the architecture of a neural
network. The answer is that there are some decent heuristics, but in the end it's an
engineering problem with no clear answers. Our network is quite small, only about
7,500
tunable parameters in all (because it’s written in pure Python, training a large
network would be prohibitively slow). In real production systems, networks have
upwards of millions of parameters, and the process of determining an architecture
is more alchemy than science.
There is a now-famous 2017 talk by Ali Rahimi in which he criticized what he argued
was a loss of rigor in the field. He quoted, for example, how a change to the
default rounding mechanism in a popular deep learning library (from “truncate” to
“round”) caused many researchers’ models to break completely, and nobody knew why. The networks
The networks
still trained, but suddenly failed to learn anything. Rahimi argues that brittle
optimization techniques (gradient descent) applied to massively complex and opaque
networks create a house of cards, and that theory and rigor can alleviate these
problems. I tend to agree. But brittle or not, gradient descent on neural networks
has proved to be remarkably useful, making some learning problems tractable
despite the failure of decades of research into other techniques. So let’s
continue.
Once we’ve specified a neural network as a computation graph and obtained a dataset
S of labeled examples ( x, l), we need to choose a function to optimize. This is
often called a loss function. For a single labeled example ( x, l), it’s not so
hard to come up with a reasonable loss function. Let fw be the function computed by
the neural network and w the combined vector of all of its parameters. Then define
E( w) = ( fw( x) − l)2 as the “error” of a single example. This is just the squared
distance of the output of fw on an example from that example’s label. Note we’re
not doing any rounding here, so that fw( x) ∈ [0 , 1].
The total error over the dataset S is then defined as the average

E_total(w) = (1/|S|) Σ_{(x,l)∈S} (fw(x) − l)²
One could then run gradient descent to minimize E_total. However, this loss function requires us to loop over the entire training dataset for each step of gradient descent. That is
prohibitively slow. Instead, one typically applies what’s called stochastic
gradient descent. In stochastic gradient descent, one chooses an example ( x, l) at
random, and applies a gradient descent step update to E( w) = ( fw( x) − l)2. Each
subsequent gradient step update uses a different, randomly chosen example. The fact
that this usually produces a good result is not obvious.19
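A sketch of a single stochastic update (our own illustration; a plain linear model f_w(x) = ⟨w, x⟩ stands in for the full network, so the gradient of E(w) = (f_w(x) − l)² can be written by hand):

```python
import random

def predict(w, x):
    # stand-in model: f_w(x) = <w, x>
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_step(w, x, label, learning_rate=0.01):
    # gradient of E(w) = (f_w(x) - l)^2 with respect to w is 2 (f_w(x) - l) x
    error = predict(w, x) - label
    return [wi - learning_rate * 2 * error * xi for wi, xi in zip(w, x)]

random.seed(0)
target = [2.0, -3.0]  # the "true" weights generating the labels
w = [0.0, 0.0]
for _ in range(5000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    w = sgd_step(w, x, predict(target, x))
print([round(wi, 2) for wi in w])
```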
There are many different loss functions, and the loss function we chose above is called the L2-loss. The name L2 comes from mathematics, and the number 2 describes the 2's occurring in the norm

∥x∥2 = ( Σ_i xi² )^(1/2)

Replacing the 2's with 3's results in an L3 norm, and for a general p these are called Lp norms. You will explore different loss functions in the exercises.
As we outlined in Section 14.8, each vertex of our computation graph needs to know
about various derivatives related to the operation computed at that node, and that
these values need to be cached to compute a gradient efficiently. Now we’ll see one
way to manifest that in code. Let’s start by defining a generic base node class,
representing a generic operation in a computation graph. We’ll call the operation
computed at that node f, which has arguments z 1 , . . . , zm, and possibly tunable
parameters w 1 , . . . , wk.
f = f ( w 1 , . . . , wk, z 1 , . . . , zm)
Call the function computed by the entire graph E. The inputs to E are both the
normal inputs and all of the tunable parameters at every node. For the sake of
having good names, we’ll define the global derivative of some quantity x to mean
∂E/ ∂x, while the local derivative is ∂f/ ∂x (it’s local to the node we’re
currently operating with). These are not standard terms.
Now we define a cache to attach to each node, whose lifetime will be a single step
of the gradient descent algorithm.
class CachedNodeData(object):
def __init__(self):
self.output = None
self.global_gradient = None
self.local_gradient = None
self.local_parameter_gradient = None
self.global_parameter_gradient = None
The attributes are as follows, with each expression evaluated at the current input
x and the current choice of tunable parameters.
19 There are also compromises: pick a random subset of 100 examples, and compute
the average error and gradient for that “mini batch.” Variations abound.
Now we define a base class Node for the vertices of the computation graph. Its subclasses include InputNode, ConstantNode, LinearNode, SigmoidNode, ReluNode, and L2ErrorNode. Here's an example of how the subclasses of Node are used to build a computation graph:
input_nodes = InputNode.make_input_nodes(10)

linear_node_1 = LinearNode(input_nodes)
linear_node_2 = LinearNode(input_nodes)
linear_node_3 = LinearNode(input_nodes)

sigmoid_node_1 = SigmoidNode(linear_node_1)
sigmoid_node_2 = SigmoidNode(linear_node_2)
sigmoid_node_3 = SigmoidNode(linear_node_3)

linear_output = LinearNode([sigmoid_node_1, sigmoid_node_2, sigmoid_node_3])
output = SigmoidNode(linear_output)
error_node = L2ErrorNode(output)

network = NeuralNetwork(output, input_nodes, error_node=error_node)
network.evaluate(new_data_point)
class Node(object):
    def __init__(self, *arguments):
        self.has_parameters = False
        self.parameters = []
        self.arguments = arguments
        self.successors = []
        self.cache = CachedNodeData()

        # Argument nodes z_i will query this node f(z_1, ..., z_k) for ∂f/∂z_i,
        # so we register this node as a successor of each of its arguments.
        # The list of arguments is ordered, so that all inputs and gradients
        # correspond index-wise.
        for argument in self.arguments:
            argument.successors.append(self)
        self.argument_to_index = {node: index for (index, node) in enumerate(arguments)}
We’ll define the core methods in Node that perform gradient descent training
momentarily, but first we have to define what functions the subclasses need to
implement. They are:

- compute_output(inputs): compute the output of this node, by recursively calling evaluate on the argument nodes and applying f to the results.
- compute_local_gradient(): compute ∂f/∂z_i for each argument z_i.
- compute_local_parameter_gradient(): compute ∂f/∂w_i for each tunable parameter w_i.
- compute_global_parameter_gradient(): compute ∂E/∂w_i for each tunable parameter w_i.
The example of the linear node illustrates each of these pieces. Let
f(w, b, x) = ⟨w, x⟩ + b = b + ∑_{i=1}^{n} w_i x_i
We model the bias term b by adding an extra input as a ConstantNode. We also have a
simple InputNode for the input to the whole graph.
class ConstantNode(Node):
    '''A constant input, used to model the bias term of a linear node.'''
    def compute_output(self, inputs):
        return 1

class InputNode(Node):
    def __init__(self, input_index):
        super().__init__()
        self.input_index = input_index

    def compute_output(self, inputs):
        return inputs[self.input_index]

    @staticmethod
    def make_input_nodes(count):
        '''A helper to create a list of input nodes, one per index.'''
        return [InputNode(i) for i in range(count)]
Now we can define LinearNode. First, we initialize the weights and add a constant
node for the bias. In this way, the bias is treated the same as any other input,
which makes the formulas convenient.
class LinearNode(Node):
    def __init__(self, arguments):
        super().__init__(*arguments, ConstantNode())  # the constant input is the bias
        self.initialize_weights()
        self.has_parameters = True
        self.parameters = self.weights

    def initialize_weights(self):
        arglen = len(self.arguments)
        weight_bound = 1.0 / math.sqrt(arglen)
        self.weights = [random.uniform(-weight_bound, weight_bound)
                        for _ in range(arglen)]
The weights are initialized to random numbers in [−1/√d, 1/√d], where d is the number of weights. This aligns
with gradient descent: start at a random initial configuration and try to optimize.
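In isolation, this initialization scheme is:

```python
import math
import random

def initialize_weights(d):
    """Sample d weights uniformly from [-1/sqrt(d), 1/sqrt(d)]."""
    bound = 1.0 / math.sqrt(d)
    return [random.uniform(-bound, bound) for _ in range(d)]
```

Keeping the initial weights small (and shrinking with d) keeps the initial linear combinations from starting out in the flat, slow-to-train regions of the sigmoid.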
The rest of the class consists of the required implementations of the Node
interface.
Because f = ∑_{i=0}^{n} w_i x_i, we have

∂f/∂x_i = w_i,    ∂f/∂w_i = x_i,    ∂E/∂w_i = (∂E/∂f)(∂f/∂w_i)
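These formulas are easy to sanity-check with a finite-difference approximation (a quick test sketch of mine, separate from the network code):

```python
def f(w, x):
    # the linear node's function: f(w, x) = sum_i w_i * x_i
    return sum(w_i * x_i for (w_i, x_i) in zip(w, x))

w, x, eps = [0.5, -1.0], [2.0, 3.0], 1e-6

# ∂f/∂x_0 should equal w_0 = 0.5
df_dx0 = (f(w, [x[0] + eps, x[1]]) - f(w, x)) / eps

# ∂f/∂w_0 should equal x_0 = 2.0
df_dw0 = (f([w[0] + eps, w[1]], x) - f(w, x)) / eps
```

Since f is linear, the finite differences match the claimed partial derivatives up to floating-point error.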
class LinearNode(Node):
    [...]

    def compute_output(self, inputs):
        return sum(
            w * x.evaluate(inputs)
            for (w, x) in zip(self.weights, self.arguments)
        )

    def compute_local_gradient(self):
        return self.weights

    def compute_local_parameter_gradient(self):
        return [argument.output for argument in self.arguments]

    def compute_global_parameter_gradient(self):
        return [
            self.global_gradient * self.local_parameter_gradient_for_argument(argument)
            for argument in self.arguments
        ]

    def local_parameter_gradient_for_argument(self, argument):
        '''Return ∂f/∂w_i, where w_i is the weight for the given argument node.'''
        argument_index = self.argument_to_index[argument]
        return self.local_parameter_gradient[argument_index]
The other nodes are defined similarly, with the parameter functions returning empty lists, as LinearNode is the only node with tunable parameters. For each of the four gradient attributes cached in CachedNodeData, we define a lazily evaluated property on Node. For example:
class Node:
    [...]

    @property
    def local_gradient(self):
        if self.cache.local_gradient is None:
            self.cache.local_gradient = self.compute_local_gradient()
        return self.cache.local_gradient
The methods in the child classes use these properties when referring to their
arguments, so the values will be lazily evaluated and then cached as needed.
Finally, the computation of the global gradient for a node doesn’t depend on the
formula for that node, so it can be defined in the parent class.
class Node:
    [...]

    def compute_global_gradient(self):
        return sum(
            successor.global_gradient * successor.local_gradient_for_argument(self)
            for successor in self.successors
        )

    def local_gradient_for_argument(self, argument):
        '''Return ∂f/∂z_i, where z_i is the given argument node.'''
        argument_index = self.argument_to_index[argument]
        return self.local_gradient[argument_index]
At this point we’ve enabled the computation of all the gradients we need to do a
step of gradient descent.
class Node:
    [...]

    def do_gradient_descent_step(self, step_size):
        '''The core gradient step subroutine: compute the gradient for each of
        this node's tunable parameters, and adjust them in the direction
        opposite the gradient.'''
        if self.has_parameters:
            for (i, gradient_entry) in enumerate(self.global_parameter_gradient):
                self.parameters[i] -= step_size * gradient_entry
The very last node of the computation graph, which computes the error for a
training example, has some extra methods that depend on a training example’s label.
For the L2 error node, the implementation is:
class L2ErrorNode(Node):
    def compute_output(self, inputs, label):
        argument_value = self.arguments[0].evaluate(inputs)
        self.label = label  # cache the label for the gradient computation
        return (argument_value - label) ** 2

    def compute_local_gradient(self):
        last_input = self.arguments[0].output
        return [2 * (last_input - self.label)]

    def compute_global_gradient(self):
        return 1  # ∂E/∂E = 1 at the terminal node
Now we define a wrapper class NeuralNetwork that keeps track of the input and
terminal nodes of the computation graph, resets caches, and controls the training
of the network. We start with a self-explanatory constructor, and a helper function
for applying some function to each node of the computation graph exactly once.
class NeuralNetwork(object):
    def __init__(self, terminal_node, input_nodes, error_node=None):
        self.terminal_node = terminal_node
        self.input_nodes = input_nodes
        self.error_node = error_node or L2ErrorNode(self.terminal_node)

    def for_each(self, func):
        '''Apply func to each node in the graph exactly once.'''
        nodes_to_process = set([self.error_node])
        processed = set()
        while nodes_to_process:
            node = nodes_to_process.pop()
            func(node)
            processed.add(node)
            nodes_to_process |= set(node.arguments) - processed
The for_each function performs a classic graph traversal.20 We can use it to reset the caches at every node. We can also trivially define the evaluate and compute_error functions as wrappers.
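Stripped of the neural-network specifics, the traversal pattern looks like this (a standalone sketch with names of my own choosing):

```python
def for_each_node(roots, neighbors, func):
    """Apply func exactly once to every node reachable from roots."""
    nodes_to_process = set(roots)
    processed = set()
    while nodes_to_process:
        node = nodes_to_process.pop()
        func(node)
        processed.add(node)
        # enqueue unseen neighbors; sets prevent double-processing
        nodes_to_process |= set(neighbors(node)) - processed
```

Because both the frontier and the processed set are sets, each node is visited once even when the graph has multiple paths to it.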
class NeuralNetwork(object):
    [...]

    def reset(self):
        def reset_one(node):
            node.cache = CachedNodeData()
        self.for_each(reset_one)

    def evaluate(self, inputs):
        self.reset()
        return self.terminal_node.evaluate(inputs)

    def compute_error(self, inputs, label):
        self.reset()
        return self.error_node.evaluate(inputs, label)
Training the network consists of repeatedly choosing a random example, computing its error, and performing a gradient descent step at each node.

class NeuralNetwork(object):
    [...]

    def backpropagation_step(self, inputs, label, step_size):
        self.compute_error(inputs, label)
        self.for_each(lambda node: node.do_gradient_descent_step(step_size))

    def train(self, dataset, max_steps=10000, step_size=0.01):
        for i in range(max_steps):
            inputs, label = random.choice(dataset)
            self.backpropagation_step(inputs, label, step_size)
Now let’s apply this to the MNIST dataset. First we build our network, with two
fully connected layers of LinearNodes and ReluNodes, with a final LinearNode with a
SigmoidNode output.
def build_network():
    input_nodes = InputNode.make_input_nodes(28*28)

    # two hidden layers; the layer widths here are illustrative
    first_layer = [LinearNode(input_nodes) for i in range(10)]
    first_layer_relu = [ReluNode(node) for node in first_layer]

    second_layer = [LinearNode(first_layer_relu) for i in range(10)]
    second_layer_relu = [ReluNode(node) for node in second_layer]

    linear_output = LinearNode(second_layer_relu)
    output = SigmoidNode(linear_output)
    error_node = L2ErrorNode(output)

    return NeuralNetwork(output, input_nodes, error_node=error_node)
Then we split the training set into batches, separating from each batch a so-called validation set, which we use to measure the quality of the training as it progresses. At the end, we report the error on the held-out test set.
train = load_1s_and_7s('mnist/mnist_train.csv')
test = load_1s_and_7s('mnist/mnist_test.csv')
network = build_network()

for i in range(5):
    shuffle(train)
    # hold out a slice of the shuffled data for validation (size illustrative)
    train_piece, validation = train[:-500], train[-500:]
    print("Training on {} examples, validating on {}".format(
        len(train_piece), len(validation)))
    network.train(train_piece, max_steps=len(train_piece))
    print("Validation error={:.3f}".format(
        network.error_on_dataset(validation)))

print("Test error={:.3f}".format(network.error_on_dataset(test)))
Test error=0.011
[Image: sample MNIST digits (handwritten 1s and 7s) rendered as 28×28 binary grids.]
14.11 Exercises
14.1. Write down the definition of continuity for a multivariable function f at a point c: for every ε > 0 there exists a δ > 0 such that whenever ∥x − c∥ < δ it holds that |f(x) − f(c)| < ε. Use it to show that the function g(x, y, z) = xyz / (x³ + y³ + z³) (defining g(0, 0, 0) = 0) is not continuous at (0, 0, 0).
14.2. Prove the first part of the Cauchy-Schwarz inequality for real vectors, that
|⟨v, w⟩|² ≤ (∑_i v_i²)(∑_i w_i²).
14.3. Prove the analogue of Theorem 14.10 for functions R^n → R^m. In that case, if f = (f_1, . . . , f_m), the total derivative matrix should be:

    [ Dir(f_1, c, v_1)   Dir(f_1, c, v_2)   · · ·   Dir(f_1, c, v_n) ]
    [ Dir(f_2, c, v_1)   Dir(f_2, c, v_2)   · · ·   Dir(f_2, c, v_n) ]
    [        ...                ...                        ...       ]
    [ Dir(f_m, c, v_1)   Dir(f_m, c, v_2)   · · ·   Dir(f_m, c, v_n) ]

Hint: the same proof works, but the construction of the single-variable function to apply the chain rule to is slightly different.
14.4. Find and study a proof of the Mean Value Theorem. The Mean Value Theorem is one of the most powerful technical tools in the fields of mathematics that deal with continuous functions.
14.5. In the chapter I provided a corkscrew surface for which, if the direction of
the directional derivative changed slightly, the value of the directional
derivative changed drastically (i.e., it was not continuous in the choice of
direction). On the other hand, Theorem 14.11 fixes the basis vector and requires
the directional derivative to be continuous with respect to c, the position.
Reconcile these two perspectives. Squint at the corkscrew surface and see why
continuity with respect to c covers the same edge case as rotating the direction.
14.6. Find and study a proof of Schwarz’s theorem, that mixed partial derivatives
of sufficiently nice functions don’t depend on the order you take them in.
14.7. Prove that the rule for computing partial derivatives by assuming other
variables are constant is valid.
14.11. Perhaps the most famous theoretical machine learning model is called the
Probably Approximately Correct model (abbreviated PAC). This model formalizes much
of modern machine learning. Given a finite set X (the universe of possible inputs),
the PAC
model involves a probability distribution D over X used both for generating data
and evaluating the quality of a hypothesis. A machine learning algorithm gets as
input the ability to sample as much data as it wants from D, and its output
hypothesis h must have
high accuracy on D (hence the name “approximately” in PAC). Since the sampled data
is random, the learning algorithm may fail to produce an accurate classifier with
small probability. However—and this is the most stringent qualification—in order
for a learning algorithm to be considered successful in the PAC model, it must
provably succeed for any distribution on the data. If the distribution is uniformly
random or focused on just a small set of screwy points, a valid “PAC learner” must
be able to adapt. Look up the formal definition of the PAC model, find a simple
example of a problem that can be PAC-learned, and read a proof that a successful
algorithm does the trick.
14.12. Another important learning model involves an algorithm that, rather than passively analyzing data that's given to it (as in the PAC model of the previous
exercise), is allowed to formulate queries of a certain type, an “oracle” (a human)
answers those queries, and then eventually the algorithm produces a hypothesis.
Such a model is often called an “active learning” model. Perhaps the most famous
example is exact learning with membership and equivalence queries. Look up a formal
definition of this model, and learn about its main results and variations.
14.13. Write a program that uses gradient descent to learn linear threshold
functions. In particular: write a function that samples data uniformly from the set
[0, 1]⁵ ⊂ R⁵, and labels them (unbeknownst to the learning algorithm) according to
their value under a fixed linear threshold function Lw,b. Design a learning
algorithm to learn w and b from the data. That is, determine what the appropriate
loss function should be, determine a formula for the gradient, and enshrine it in
code. How much data is needed to successfully and consistently learn? How does this
change as the exponent 5 grows?
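To get started on this exercise, the data-generation step might look like the following sketch (the names and the ±1 label convention are my own choices, not prescribed by the exercise):

```python
import random

def sample_labeled_point(w, b, dim=5):
    """Sample x uniformly from [0,1]^dim, labeled by the sign of <w, x> + b."""
    x = [random.random() for _ in range(dim)]
    label = 1 if sum(w_i * x_i for (w_i, x_i) in zip(w, x)) + b > 0 else -1
    return (x, label)
```

A learning algorithm sees only the (x, label) pairs, never w and b themselves.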
14.14. In this chapter, our gradient descent used a fixed ε as the step size.
However, it can often make sense to adjust the rate of descent as the optimization
progresses. At the beginning of the descent, larger steps can provide quicker gains
toward an optimum.
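One simple scheme (a sketch; many variants exist under the umbrella of learning-rate schedules) shrinks the step size as the iteration count grows:

```python
def decayed_step_size(initial, t, decay=0.01):
    """Step size shrinking like 1/(1 + decay * t) at iteration t."""
    return initial / (1 + decay * t)
```

Early iterations take near-full-size steps, while late iterations take ever smaller ones, reducing the risk of overshooting an optimum.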
14.15. Another popular technique for training neural networks is the so-called
minibatch, where instead of a stochastic update for each example, one groups the
examples into batches and computes the average loss for the batch. Research why
minibatch is considered a good idea, and augment the program in this chapter to
incorporate it. Does it improve the error rate of the learned hypothesis?
14.16. There are many different loss functions for a neural network. Look up a list
of the most widely used loss functions, and research their properties.
14.17. One particularly relevant loss function is called softmax, because it
applies to a vector-valued input. Softmax is typically used to represent the loss
of a categorical (1
out of N options) labeling, and it’s particularly useful to adapt MNIST from a
binary two-digit discriminator to a full ten-digit classifier. Augment the code in
this chapter to incorporate softmax, and use this to implement a classifier for the
full MNIST dataset.
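As a starting point, the softmax function itself (usually paired with a cross-entropy loss; this is a minimal sketch) maps a vector of scores to a probability distribution:

```python
import math

def softmax(z):
    """Map scores z to probabilities proportional to exp(z_i)."""
    m = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

For a ten-digit classifier, the network would end with ten outputs fed through softmax, and the predicted digit is the index of the largest probability.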
Recall for single-variable functions f, g : R → R, the chain rule says that the derivative of f(g(x)) involves evaluating f′ at g, and multiplying the result by g′. I.e.,

(d/dx) f(g(x)) = f′(g(x)) g′(x)
Let’s first think about why this should be harder in principle than the single
variable case. Call x = (x_1, . . . , x_n) the variables input to f = (f_1, . . . , f_m), a function R^n → R^m.
The derivative of g ◦ f measures how much g depends on changes to each xi. But
while f depends on an input xi in a straightforward way, g depends on xi
transitively through the possibly many outputs of f. Computing ∂g/ ∂xi should
require one to combine the knowledge of ∂fj/ ∂xi for each j, and that combination
might be strange. The function g ◦ f has a dependency graph like in Figure 14.16,
where an arrow a → b indicates that b depends on a. A similar graph describes the dependence among the partial derivatives.
Luckily the relationship is quite elegant: for one dependent variable you multiply
along each branch and sum the results. Doing this for every input variable produces
exactly the
Figure 14.16: The dependence of g ◦ f on each xi contains paths through each of the
fj.
matrix multiplication that makes up the chain rule. We’ll prove a slightly simpler
version of the chain rule where g has only one output, which has all the necessary
features of the more general proof where g = ( g 1 , . . . , gk) is vector-valued.
∂(g ◦ h)/∂x_1 (c) = ∑_{i=1}^{m} (∂g/∂h_i)(h(c)) · (∂h_i/∂x_1)(c)
The other components of the gradient are defined by replacing x_1 with x_j.
Proof. For clarity, in this proof the boldface v will denote a vector of numbers or
functions (a function with multiple outputs). Write h(x) = (h_1(x), . . . , h_m(x)), so that we can conveniently abbreviate g(h_1(x), . . . , h_m(x)) as g(h(x)). Let H be the matrix representation of the total derivative of h, whose i-th row H_i is the total derivative of h_i:

    H = [ H_1 ]
        [ ... ]
        [ H_m ]
Let G be the matrix representation of the total derivative of g (i.e., ∇g). The
claimed total derivative matrix for g(h(x)) is the matrix multiplication GH. This
results in the formula claimed by the theorem. We need to show that GH satisfies
the linear approximation condition for g(h(x)), i.e., that
lim_{x→c} [g(h(x)) − g(h(c)) − GH(x − c)] / ∥x − c∥ = 0

or equivalently, substituting t = x − c,

lim_{t→0} [g(h(c + t)) − g(h(c)) − GH(t)] / ∥t∥ = 0
Now we define two functions that track the error of the linear approximators. More
specifically, the first function represents the error of H as a linear approximator
of h at c, and the second is the error of G as a linear approximator of g at h(c).
err_H(t) = h(c + t) − h(c) − H(t)
err_G(s) = g(h(c) + s) − g(h(c)) − G(s)

Note that
in err G, the vector s is in the domain of g, while in err H the vector t is in the
domain of the hi. We can use these formulas to simplify the limit above. Substitute
for h(c + t) a rearrangement of the definition of err_H, getting

lim_{t→0} [g(h(c) + H(t) + err_H(t)) − g(h(c)) − GH(t)] / ∥t∥
Define s = H(t) + err H(t), so that we can substitute g(h(c) + s) using a rewriting
of the definition of err G.
lim_{t→0} [G(s) + err_G(s) − GH(t)] / ∥t∥
Expand s, apply linearity of G, and cancel opposite terms, to reduce the limit to
lim_{t→0} [G(err_H(t)) + err_G(s)] / ∥t∥
To show this limit is zero, we split it into two pieces. The first is
lim_{t→0} G(err_H(t)) / ∥t∥

Because G is a linear map, there is a constant C with ∥G(v)∥ ≤ C∥v∥ for every v, so this limit is at most

C lim_{t→0} ∥err_H(t)∥ / ∥t∥ = 0.
This goes to zero because, by the definition of err_H, that is exactly the defining property of H as the total derivative of h. It remains to show the second part is zero:
lim_{t→0} err_G(s) / ∥t∥
We would like to bound this limit from above by a different limit we can more easily prove goes to zero. Indeed, suppose there were a constant B for which ∥t∥ ≥ B∥s∥. Then

err_G(s) / ∥t∥ ≤ err_G(s) / (B∥s∥),

and the right-hand side goes to zero as t → 0: s → 0 as t → 0, and err_G(s)/∥s∥ → 0 is the defining property of G as the total derivative of g. Since s = h(c + t) − h(c), the condition ∥t∥ ≥ B∥s∥ can be rewritten as

1/B ≥ ∥h(c + t) − h(c)∥ / ∥t∥
The quantity on the right hand side is familiar: it’s the inside of the limit for
the directional derivative of h (rather, a vector of directional derivatives). As
∥t ∥ → 0 it gets close to the directional derivatives, so for a sufficiently small
t, the quantity is no larger than twice the largest possible directional
derivative, i.e., 2 ∥( ∇h 1(c) , . . . , ∇hm(c)) ∥. Choose B so that 1/ B is larger
than this quantity, and the proof is complete.
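A quick numerical check of the chain rule formula for a concrete choice of g and h (my own example, verified against a finite difference):

```python
def g(u, v):
    return u * v

def h(x):
    return (x ** 2, 3 * x)  # h: R -> R^2

def composite(x):
    return g(*h(x))  # equals 3 * x**3

def chain_rule_derivative(x):
    # ∂g/∂u = v and ∂g/∂v = u, evaluated at h(x); h_1' = 2x, h_2' = 3
    u, v = h(x)
    return v * (2 * x) + u * 3

x, eps = 1.3, 1e-6
finite_difference = (composite(x + eps) - composite(x)) / eps
```

Both the chain rule sum and the finite difference agree with the direct derivative 9x² of 3x³.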
This was the most difficult proof in this book. And it’s easy to get lost in it. We
started from a relatable premise: find a formula for the chain rule for
multivariable functions.
To prove our formula worked, we passed through progressively trickier and more specialized arguments, boiling down to an arbitrary-seeming upper bound on a haphazard limit of an error term of a linear approximation.
To be sure, the steps in this proof were not obvious. One has to take a bit of a
leap of faith to guess that GH was the right formula (though it is the simplest and
most elegant option), and then jump from an obtuse limit to the realization that,
if one writes everything in terms of error terms, the hard parts ( g composed with
h) will cancel out. Suffice it to say that this proof was distilled from hard work
and many examples, and it leaves a taste of mystery in the mouth. Until, that is,
one dives deeper into the general subfield of mathematics known as “analysis,”
where arguments like this one are practiced until they become relatively routine.
One gains the nose for what sorts of quantities should yield their secrets to a
well-chosen upper bound. Contrast this to subjects like linear algebra and abstract
algebra (Chapter 16), in which pieces largely tend to fit together in a structured
manner that—in my opinion—tends to appeal to programmers in a way that analysis
doesn’t. Another demonstration of subcultures in mathematics.
We were fortunate enough to have LeCun and his colleagues vet MNIST for us. These
prepared datasets are like goods in supermarkets. A shopper doesn’t see,
appreciate, or viscerally comprehend the amount of work and resources required to
rear the cow and grow the almonds, nor even the general form of the pipeline. A
common refrain among data scientists and machine learning practitioners is that machine learning is 10% machine learning and 90% data cleaning.
For example, deciding on the meaning of a label is no simple task. It seems easy
for problems like handwritten digits, because it’s mostly unambiguous what the true
label for a digit is. But for many interesting use cases—detecting fraud/spam,
predicting what video a user will enjoy, or determining whether a loan applicant should receive a loan—the meaning of the true label is far less clear.
Another concern is bias in the training data. Not just statistical bias, which can
be a result of errors in data collection on the part of the process designer, but
human bias beyond one’s control. When you collect data on human preferences, it’s
easy for population majorities to overwhelm less prevalent signals. This happens
roughly because machine learning algorithms tend to look for the statistically
dominant trends first, and only capture disagreeing trends if the model is complex
enough to have both coexist. Think of Chapter 12 in which we studied a physical
model by throwing out small order terms. In this context, if those terms
corresponded to a coherent group of users, those users would be ignored or actively
harmed by the mathematical model.
Even worse, active discrimination can be encoded into training labels. If one
trains an algorithm to predict job fitness on a dataset of hiring information,
incorporating the reviews of human interviewers can muddy the dataset. You have to
be aware that humans, and especially humans in a position of power, can exhibit
bias for any number of superficial characteristics that are unrelated to job
fitness, most notably that an applicant looks and behaves like the people currently
employed. An algorithm trained on this data will learn to mimic the human
preferences, which may be unrelated to one’s goal.
for machine learning, which is why it’s sometimes called the “high interest credit
card of technical debt.” These sorts of problems, though interesting and important,
are beyond the scope of this book. Instead we’ll focus on the “easy” part, actually
training an algorithm and producing a classifier.
Our neural network and computation graph are almost laughably small. And, having
written our network in pure Python, training proceeds at a snail’s pace. It should
be obvious that our toy implementation falls far short of industry-strength deep
learning libraries, even though the underlying concepts of computation graphs are
the same. I’d like to lay out a few specific reasons.
Our network for learning (a subset of) MNIST has roughly 7,500 tunable parameters. State-of-the-art networks can have millions or billions. Many additional mathematical and engineering tricks are required to achieve such scale.
One aspect of this is hardware. Top-tier neural networks take advantage of the
structure of certain nodes (for example, many nodes are linear) and the typical
architecture of a network (nodes grouped in layers) to convert evaluation and
gradient computations to matrix multiplications. Once this is done, graphics cards
(GPUs) can drastically accelerate the training process. Even more, companies like
Google develop custom ASICs (application-specific integrated circuits) that are
particularly fast at doing the operations neural networks need for training. One
such chip is called a Tensor Processing Unit (TPU). The proliferation of graphics
cards and custom hardware has resulted in the ability to train more ambitious
models for applications like language translation and playing board games like Go.
However, fancy hardware won’t fix issues like overfitting, where a model with
billions of parameters essentially becomes a lookup table for the training data and
doesn’t generalize to new data. To avoid this, experts employ a handful of
engineering and architectural tricks. For example, between each layer of linear
nodes, one can employ a technique called dropout, in which the outputs of random
nodes are set to zero. This prevents nodes in subsequent layers from depending on
specific arguments in a fragile way. In other words, it promotes redundancy. Such
techniques fall under the umbrella of regularization methods.
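A minimal sketch of dropout (using the common "inverted" rescaling so the expected output is unchanged; real libraries differ in the details):

```python
import random

def dropout(outputs, rate=0.5):
    """Zero each output with probability rate; rescale survivors by 1/(1-rate)."""
    return [0.0 if random.random() < rate else o / (1 - rate)
            for o in outputs]
```

During training each forward pass drops a different random subset of outputs; at evaluation time dropout is disabled entirely.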
Other techniques are specific to certain application domains. For example, the
concept of convolution is used widely in networks that process image data. While
convolution has a mathematically precise definition, it suffices here to describe it as applying a "filter" to small regions of an image.
The individual computational nodes also get much consideration. Historically, the
original nonlinear activation node for a linear node was the sigmoid function.
However, because the function plateaus for large positive and negative values,
training a network that solely uses sigmoid activations can result in prohibitively
slow learning. The ReLU
function avoids this, but brings its own problems. In particular, when linear
weights
are randomly initialized as we did, ReLU nodes have an equal chance of being zero
or nonzero. When a ReLU activation is zero, that neuron (and all the input work to
get to that neuron) is essentially dead. Even if the neuron should contribute to
the output of an example, the gradient is zero and so gradient descent can’t update
it. Other activation functions have been defined and studied to try to get the best
of both worlds.
For the reader eager to dive deeper into production-quality neural networks, check
out the Keras library. Keras is a layer on top of Google’s TensorFlow library that
makes implementing neural networks in Python as straightforward as in this book.
The designer of Keras also wrote a book, “Deep Learning with Python,” which—beyond
including a multitude of examples—covers the nitty-gritty engineering details with
plenty of references.
Chapter 15
but in a satisfactorily controlled way. […] The extra time needed to introduce O
notation is amply repaid by the simplifications that occur later.
– Donald Knuth
As we’ve discussed, the bulk of software is bookkeeping, moving and reshaping data
to adhere to APIs of various specifications, and doing this in a way that’s easy to
extend and maintain. The ever-present specter of software is the fickle user who
thinks they know what they want, only to change their mind when you finish
implementing it. Big-O analysis doesn’t seem to play a part in that struggle.
One should try to see the other side of the coin as well. Often an interviewer
doesn’t particularly care about the exact big-O runtime of an algorithm. They
aren’t testing your aptitude to recall arbitrary facts and do algebra. They care
that you can reason about the behavior of the thing you just wrote on the
whiteboard. As we all know, beyond correctness, an important part of software is
anticipating how things will break in subtler ways. What kind of data will make the
system hog memory? For what sort of usage will a system thrash? Can you guarantee
there are no deadlocks? Most importantly, can you be concrete in your analysis?
Among the simplest things one could possibly ask is what part of the algorithm you
just wrote is the bottleneck at scale. To do that, you have to walk a fine line
between being precise and vague. Define the quantities of interest—whether they’re
joins in a database query or sending data across a network—and the simplifying
assumptions that make it possible to discuss in principle. You also have to sweep
an immense amount of complexity under the rug. Maybe you’ll ignore problems that
could occur due to multithreading, or the overhead of stack frame management
incurred by splitting code into functions in just such a way, or even ignore the
benefits of helpful compiler optimizations and memory locality, when the
application doesn’t depend on it.
And it’s not just about runtime. You can use big-O and its relatives to describe
the usage of any constrained resource, be it runtime, space, queries, collisions,
errors, or bits sent to a satellite.
Of course, like any tool big-O is not a panacea. Often one needs to peek behind the
curtain and optimize at a granular level. Customer attention is a matter of
milliseconds.
In time-critical engines like text editors and video games, frame rate and response
latency are the bottom line. But big-O has the advantage of being able to fit
entirely inside your head, unlike tables of measurements. As a language aid, a
first approximation, and a start to a conversation, big-O is hard to beat.
So in this short chapter I’ll introduce big-O notation, describe some of its
history, show how it simplifies some of the calculations in this book, and then
describe some of my favorite places where big-O takes center stage.
The original use of big-O notation was by Landau and Bachmann in the 1890's for asymptotic estimates in number theory, where it became a popular notation. It was not until the mid-1900s that big-O
found its way to computer science, in part because computer science had to be
invented. Donald Knuth opens a 1976 essay with, “Most of us have gotten accustomed
to [big-O notation],”
To explain what this means, recall that the Taylor series for sin(x) at x = 0 is

sin(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9! − · · ·
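Numerically, truncating after the x³ term leaves an error far smaller than x⁵, since the discarded terms begin at x⁵/5! (a quick check):

```python
import math

x = 0.1
approximation = x - x ** 3 / math.factorial(3)
# the discarded terms of the alternating series start at x^5 / 120
error = abs(math.sin(x) - approximation)
```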
Big-O says the x³ terms and smaller are dominated by the x term. What's unspoken here is what "dominates" means. In the analysis of algorithms, "dominates" usually means an upper bound as the size of the input grows larger. But here nothing is growing! Instead, here the big-O notation implies a limit x → 0. I.e., when x shrinks, x³ vanishes much faster than x. The formal definition is as a limit.
Definition 15.1. We say that f = O(g) as x → a if

lim_{x→a} f(x) / g(x) < ∞
The limit notation needs a disambiguation. We’re not saying that the limit has to
exist.
We simply need that the limit does not grow without bound. So when we say f =
O( g), we mean that g is a sort of upper bound on f under some limit. Usually the
limit point a is established once at the beginning of a discussion, or obvious from
context (e.g., you’re doing a Taylor series at a). In the rare cases one needs to
disambiguate, one can use O_{x→a}(g(x)).
Unpacking this definition a bit, consider the special case when the limit exists
and is finite. Then there is some constant C for which
lim_{x→a} f(x) / g(x) = C,
One simple property to verify: f = O(f) for any f.
Take care, because when we say f = O(g), the symbol = doesn't mean equals in the usual sense. For example, it's not symmetric or transitive; x³ = O(x) and x² = O(x) as x → 0, but x³ ≠ x². When someone uses big-O notation like f = O(g), it's best to read
= as “is,” and then the sentence makes sense: “f is (at most) order of g.”
Moreover, when we include O( g( x)) in the context of some larger expression, like
sin( x) = x + O( x 3), what we mean is that sin( x) = x + f( x) for some f( x) = O(
x 3). Fluent use of big-O
involves "native support" for this implicit association in your head, which can take time to get used to.
Continuing with the example of sin(x), say we wanted an estimate of sin(x)√(1 + x²). The Taylor series of the second factor is

√(1 + x²) = 1 + x²/2 − x⁴/8 + x⁶/16 − · · ·
Big-O can help. If we decide in advance how many terms we care about, then we can
truncate the two series with big-O and we’re left with a finite product. Note that
if these next computations look strange, it’s probably because you’re used to
seeing big-O as an infinite limit, whereas the big-O used here is a limit as x → 0.
In this context, x⁵ = O(x³).
sin(x) = x + O(x³),    √(1 + x²) = 1 + O(x²),

sin(x) · √(1 + x²) = (x + O(x³))(1 + O(x²))
                   = x + O(x³) + x · O(x²) + O(x³)O(x²)
                   = x + O(x³) + O(x³) + O(x³)
                   = x + O(x³)
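A numeric spot-check that sin(x)√(1 + x²) = x + O(x³) as x → 0:

```python
import math

errors = {}
for x in (0.1, 0.01):
    exact = math.sin(x) * math.sqrt(1 + x ** 2)
    # error of approximating by x alone; the series says it is about x^3 / 3
    errors[x] = abs(exact - x)
```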
In particular, this makes rigorous the idea that “( x+ something small), multiplied
by (1+ something small), is still ( x+ something small).” It’s the kind of
reasoning that one sees in physics books all the time, but instead of using the
mathematically valid big-O,
they say "we'll ignore this term" or "assume this term is zero." Being sloppy in this uncontrolled way can result in unforeseeable errors: missing error terms can combine until the total error is of the same order of magnitude as the term you care about. With big-O, error terms are still present,
but they’re present in a way that doesn’t complicate calculations too much more.
When two terms get combined, you’re forced to ask if the combined error is too big.
The interface helps prevent careless mistakes. Following one of the major themes of
this book, it reduces both the cognitive load of doing algebra, and the cognitive
load of keeping track of error terms.
Definition 15.2. We say that f = O(g) as x → ∞ if

lim_{x→∞} f(x) / g(x) < ∞
With the infinite limit, we’re saying |f( x) | ≤ D|g( x) | for all sufficiently
large x and some constant D. Here and elsewhere in math, “sufficiently large”
abbreviates the claim that some N exists, above which ( x > N) the property is
always true.
Definitions 15.1 and 15.2 have the same name because they satisfy the same
properties.
However, the hypotheses of these properties are different. For example, x² = O_{x→0}( x) and x³ = O_{x→0}( x), implying x² + x³ = O_{x→0}( x). But for infinite limits, x² ≠ O_{x→∞}( x) and x³ ≠ O_{x→∞}( x). Instead, x² = O_{x→∞}( x³), x³ = O_{x→∞}( x³), and so x² + x³ = O_{x→∞}( x³).
lim_{x→a} f( x)/g( x) = 0
We allow a = ∞, in which case the nonzero condition is again “sufficiently large.” Tacking on any growing factor, no matter how slowly it grows, can be the difference between big-O and little-o. In particular, as x → ∞ it’s true that x = o( x log x) and even x = o( x log(log(log( x)))).
The rest of the asymptotic notation family is defined by relation to big-O and
little-o.
Part of what makes this version of the derivative definition so elegant is that it
puts the core idea of derivatives—that we care about a linear approximator—front
and center.
For example, if f( x) = x², then ( x + ε)² = x² + 2 xε + ε² = x² + (2 x) ε + o( ε), exhibiting f′( x) = 2 x as the linear coefficient.
Recall the chain rule, Theorem 8.10, which you proved in an exercise and we
generalized in Chapter 14. We can prove this theorem using easy calculations.
Note that f′( g( x)) and g′( x) are constants relative to the little-o, so the
bracketed terms simplify to o( ε). What’s left is the coefficient of ε, which is f′
( g( x)) g′( x).
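That calculation can be sketched as follows, a reconstruction using only the little-o expansion h( x + ε) = h( x) + h′( x) ε + o( ε) applied to g and then to f:

```latex
\begin{align*}
f(g(x+\varepsilon))
 &= f\big(g(x) + g'(x)\varepsilon + o(\varepsilon)\big) \\
 &= f(g(x)) + f'(g(x))\big[g'(x)\varepsilon + o(\varepsilon)\big]
    + o\big(g'(x)\varepsilon + o(\varepsilon)\big) \\
 &= f(g(x)) + \big[f'(g(x))g'(x)\big]\varepsilon + o(\varepsilon).
\end{align*}
```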
Algorithm Analysis
To say anything meaningful about which algorithm is better, we want big-O for two
reasons. First, just as the interface for a software system shouldn’t depend on the
implementation, our analysis of the quality of an algorithm shouldn’t depend on the
fine-grained details of the implementation. If one decides to structure the
algorithm as three functions instead of four, the raw runtime will change; extra
steps are taken to push stack frames and handle return values! Of course, many
engineers spend a lot of important and valuable time studying the fine-grained
runtime of time-critical algorithms. There are experts in loop-unrolling, after
all. But big-O isn’t meant for those situations; rather, it’s meant for the life of
the system that comes before fine-tuning. Big-O is a first responder to the scene.
By the time you’re fine-tuning, big-O’s job is done.
Second, and closely related, the analysis of the quality of the algorithm shouldn’t
depend on features of the system the code is being run on that are beyond the
programmer’s control. If you’re sensitive to whether your C compiler is run with
aggressive or extremely aggressive optimization flags, then big-O will not help.
But most systems don’t ever reach that level of care in their entire lifetime. Big-
O allows you to ignore it.
is our main focus. When we ask, “can this problem be solved any faster?” we don’t mean can the constant be improved. Rather, we mean: can it be solved an order of magnitude faster, ignoring constants and the runtime for small inputs?
I often hear the complaint, “But what if the constant factor is a billion! Then
it’s completely useless to use big-O!” Computer scientists are well aware of the
possibility that the hidden constant might be absurd. A witty meme, whose origin I
can’t recall and failed to hunt down, involves the Black Knight of Monty Python and
the Holy Grail. This character famously loses his limbs in a sword fight, but
refuses to surrender, exclaiming,
“It’s just a flesh wound!” On this image, the meme superimposes the quote, “It’s
just a constant factor!” Joking aside, more often than not the constant factors are mere flesh wounds. Constants dominating runtime—i.e., when big-O misleads—is the exception to the rule, and usually a sign of recent or purely theoretical research. A famous example is the linear-time algorithm for polygon triangulation.
This algorithm has a large constant factor, and is so tricky to implement that it
has been called “hopeless” by Steve Skiena, the author of “The Algorithm Design
Manual.”
We’ve established that big-O can be used to measure things beyond algorithm runtime
and space usage, like the quality of an approximation. Indeed, big-O can be used to
discuss the usage of any constrained resource. For Taylor series the resource is
“deviation from the truth,” but in computer science there are a whole host of other
things that big-O
is used to analyze.
• Collisions: Load balancers have to assign jobs to servers with an extremely high
rate of jobs assigned per second. In particular, they almost never have enough time
to ask a server how many jobs it’s processing. Instead, load balancing algorithms
use randomness and reason about the expected worst-case load of a server. One
• Errors: In systems where data integrity is important, expensive, and bits are
often lost or flipped (such as data being transmitted through space, or on a
scratched up disc), one often employs redundancy schemes called error-correcting
codes that allow one to recover from these errors. Such schemes require one to
store additional bits, and so there’s a tradeoff between how many additional bits
one needs to store and the error tolerance of the scheme.
• Labeled examples: Most machine learning systems require labeled training data to produce a classifier. Since compute power is generally cheaper than getting labeled data, some systems are designed around a human in the loop that helps the machine with difficult examples. A human doing work is clearly a constrained resource.
Each of these topics has a rich history of design and analysis, and for each the
principles of the discussion revolve around asymptotic analysis. An interactive
learning system that takes n pieces of input data but requires Ω( n) queries to a human to learn can already be deemed unscalable, but one that only needs O(log( n)) queries might work. A load balancer
that spreads m jobs over n servers and causes the worst server to have Θ( m/ n + m)
jobs is almost certain to crash servers during peak hours compared to one that
guarantees O( m/ n + log n).
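The flavor of such statements can be simulated. The sketch below (my own illustration, not from the text) spreads m jobs over n servers uniformly at random and compares the busiest server's load to the average m/n:

```python
import random
from collections import Counter

def max_load(m, n, seed=0):
    """Assign m jobs to n servers uniformly at random; return the busiest load."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n) for _ in range(m))
    return max(counts.values())

m, n = 100_000, 100
worst = max_load(m, n)
# The busiest server exceeds the average m/n, but not by an order of magnitude.
assert m / n <= worst < 2 * (m / n)
```

With a fixed seed the run is reproducible; varying the seed shows the worst-case load hovering a few standard deviations above m/n.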
Big-O is a cognitive tool that allows a human to organize and make sense of a mess
of details in a rigorous fashion. It’s a tool for high level thinking. Software is
full of constrained resources, tradeoffs, and the desire for principled decision
making. Fluency in asymptotic language will help you navigate these decisions
efficiently and formulate hypotheses that can then be backed up by data.
Chapter 16
Groups
That, and that software matured as a discipline largely after these mathematical revolutions took hold.
Embodying part of this novelty are ideas like programs that transform other
programs.
You write programs. Compilers are programs that turn your programs into other
programs. A program analyzes the quality of a compiler. Programs test the
correctness of the compiler analyzer. Software automates the running of the tests
of the correctness of the compiler analyzer. And, of course, you use a program to
help refactor the programs that automate the running of the tests of the
correctness of the compiler analyzer. It’s programs all the way down.
What’s less obvious to a programmer is that studying the class of transformations
of an object provides insight into that object. By analogy, if you study the way a
refactoring tool changes the behavior of a program, that can help you understand
how the program works. Even more, it can help you understand how to write clearer
and more refactorable programs. Building up a theory based on transformations is
like a slick development framework, which you later learn applies to programs you
never anticipated writing.
Group theory is the mathematical study of symmetry. As we’ll see in this chapter,
symmetry has algebraic structure. We can work with symmetry in much the same way
We have group theory in part to thank for not wasting our time on the analytical
approach. Using group theory one can prove that it’s not merely difficult to find
an algebraic formula for the roots of a generic degree-5 polynomial. It’s
impossible. We foreshadowed this in Chapter 2 when we discussed existence and
uniqueness. This theorem—known as the Abel-Ruffini theorem—is a crown jewel of
mathematics. And though this book is too short to do the theorem justice, the
modern proof relies heavily on the shift in thought from objects to
transformations.
Algebra is the offer made by the devil to the mathematician. The devil says: “I will give you this powerful machine, it will answer any question you like. All you need to do is give me your soul: give up geometry and you will have this marvellous machine.” – Michael Atiyah
Hermann Weyl echoed a similar idea seventy years earlier: “In these days the angel
of topology and the devil of abstract algebra fight for the soul of each individual
mathematical domain.” While these seem like superstitious warnings to the
unsuspecting apprentice of mathematics, the utility of algebra for computation is
undeniable. If there’s anything to read from these quotes, it’s that geometric
arguments are considered fashionable, pure, and beautiful by a certain group of
influential mathematicians. Subcultures abound.
But you, dear programmer, would never patronize computation as mere contentedness.
The most common example of a group—and its raison d’etre—is the set of symmetries
of some object. That is to say, a group is nothing if it does not “act” on some set
by transforming it in a composable, reversible way. You use groups to elucidate the
symmetry in objects of interest. In this final chapter we’ll see how the concept
manifests itself in Euclidean and hyperbolic geometry, and in the exercises we’ll
explore groups as they show up in number theory, cryptography, polynomials, graphs,
and others.
We’ll finish off the chapter, and the book, with a dive into hyperbolic geometry.
We’ll see how geometry can be studied via the groups that transform geometric
space. Finally, we’ll apply what we learned to draw hyperbolic tessellations, of the same sort that M.C. Escher made famous.
For example, you could rotate the square counterclockwise by a quarter turn, or
reflect it across the AC diagonal, or both. These are rigid motions of the square.
As functions, they are bijections from the square to itself. Moreover, they
preserve the distances between all pairs of points. In symbols, let’s give
coordinates ( x, y) to the square. Say the square is the product of two intervals
• d( x, y) = 0 if and only if x = y.
• d( x, y) = d( y, x) for all x, y ∈ X.
• d( x, z) ≤ d( x, y) + d( y, z) for all x, y, z ∈ X (the triangle inequality).
These properties make an arbitrary function sensible enough that one could
reasonably call it a “distance” function. Of particular interest is the triangle
inequality, which says that taking a direct path from x to y is never worse than
taking an indirect path through z.
In Chapters 10 and 12 we discussed how the Euclidean inner product gives rise to a
distance metric
d( x, y) = ∥x − y∥ = √⟨x − y, x − y⟩.
This is the same metric used in Euclidean geometry. However, not all metrics
arise from an inner product. Our study of hyperbolic geometry will produce a highly
nonlinear metric, so it’s worth teasing apart the two concepts.
Back to our example of the square. Since we labeled the corners, we can track how
an isometry affects the corners. And in a sense that will become clear shortly, we
only care about how it affects the corners. If we denote a counterclockwise
quarter-turn by ρ
(the Greek lower-case rho) and a flip across the AC diagonal2 by σ (the Greek
lower-case sigma), we can write down a sequence of these operations like
ρρσρ,
where we apply the operations in order from right to left. That is, the above
operation is “rotate a quarter turn, then flip, then rotate twice more.” Figure
16.2 shows how the symmetries transform the square.
We often emphasize that we’re talking about isometries that preserve the square—map
points in the square to other points in the square—by calling these isometries
symmetries of the square. Such a provocative name encourages the natural question:
what are all of the different symmetries of the square? There are infinitely many ways to compose symmetries on paper, but two symmetries created via different methods can result in the same operation.
2 This flip is specific to the initial position of A and C. As A and C move around, the flip operation is still top-left-corner to bottom-right-corner. Of course, you want the definition of an operation to be independent of what operations are applied before or after it, so this configuration-independent definition is best.
Two different ways to compose symmetries can result in the same symmetry. Flipping
across the same diagonal twice is the same thing as doing nothing, and rotating
four times in the same direction is also the same thing as doing nothing. Note we
only consider the relative change of the square compared to how it started. To
apply the next rigid motion in a sequence, you need not know how it was previously
transformed.
As an exercise, flesh out this proof sketch in more detail. However, be warned that
not all possible labelings of the corners arise from symmetries of the square.
Opposite corners of the square cannot be mapped by an isometry to neighboring
corners.
Figure 16.3: The position of a point is uniquely determined by its distance from
the three corners.
With a handful of symmetries, such as our ρ and σ from earlier, we can write down
compositions of those symmetries, and make equations of symmetries. The following
three are some particularly simple ones:
ρ⁴ = 1
σ² = 1
ρσρ = σ
σρ⁹σρ⁻³σ = σ( ρ⁹σρ⁹) ρ⁻¹²σ
= σ( ρ⁸σρ⁸) ρ⁻¹²σ
...
= σ( ρσρ) ρ⁻¹²σ
= σ²ρ⁻¹²σ
= σ²( ρ⁴)⁻³σ
= 1 · (1)⁻³ · σ = σ.
As you might have guessed, the properties we’ve identified are what define a group,
and the algebra above is characteristic of doing algebra with a group structure.
Before we see the formal definition, here’s a more complicated example of a group:
the symmetries of the Rubik’s cube.3
In the same way that we can enumerate all possible symmetries of the square, one
could enumerate all possible symmetries of the Rubik’s cube. One can rotate any one
of the six faces of the cube, but the relationships between operations are not at
all obvious. The colored stickers take the place of the A, B, C, D labels to distinguish two configurations, but it’s not clear which (if any) stickers are superfluous.
Nevertheless, the same properties hold: there is a do-nothing operation, every
operation is reversible, and any two operations can be composed and the result is
still a viable operation. As we’ve suggested, if you want to understand the Rubik’s
cube, you should study its group of symmetries.
Definition 16.3. A group is a set G paired with a binary operation · : G × G → G, satisfying:
1. G contains an element e, called an identity element, for which e · x = x · e = x for every x ∈ G.
2. For every x ∈ G there is some element y ∈ G called an “inverse” for which x · y = e and y · x = e. (A priori there may be more than one such inverse.)
3. The group operation is associative.4 That is, x · ( y · z) = ( x · y) · z.
People often say that a set G is a group “under” an operation instead of “paired
with.”
There are a few issues we need to tackle regarding this definition and the notation
associated with it, but first let’s see some trivial examples.
4 Since most of our groups will be numbers, matrices, or functions, this axiom will
naturally hold. We will ignore it for brevity. For some groups, this is the hardest
axiom to establish.
The singleton set {e} with the binary operation · defined by asserting e · e = e is
a group. And there was much rejoicing. The set of integers Z forms a group under
the operation of addition. It is common knowledge that zero fits the definition of
the identity element, that the sum of two integers is an integer, that addition on
integers is associative, and that every integer x has an additive inverse −x.
Likewise, all of the number systems in this book except N are groups under
addition: rational numbers, real numbers, complex numbers, etc. If we want to work
with multiplication, it is not hard to see that R − { 0 } is a group, since every
nonzero real number has a multiplicative inverse, and 1 is the multiplicative
identity. Vector spaces are groups under vector addition; indeed, the group axioms
are a subset of the vector space axioms.
An important example comes from our discussion in Chapter 9, the set of integers
modulo n, denoted Z/ n Z, under the operation of addition modulo n. For example,
Z/4Z = { 0 , 1 , 2 , 3 }.
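For a finite group like Z/4Z, the group axioms can be checked exhaustively in a few lines; a sketch:

```python
# Exhaustively verify the group axioms for Z/4Z under addition mod 4.
n = 4
G = list(range(n))
op = lambda x, y: (x + y) % n

assert all(op(0, x) == x == op(x, 0) for x in G)       # 0 is an identity
assert all(any(op(x, y) == 0 for y in G) for x in G)   # inverses exist
assert all(op(x, y) in G for x in G for y in G)        # closure
assert all(op(op(x, y), z) == op(x, op(y, z))          # associativity
           for x in G for y in G for z in G)
```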
A few basic propositions clear up the ambiguities in Definition 16.3. For instance,
the uniqueness of the identity element follows from the other axioms of a group.
Here’s a proof: if there were two identity elements e, e′, then by the following
logic they must be equal:
e = e · e′ = e′
The first equality holds because e′ is an identity element, and the second because
e is.
A similar proof shows that the inverse of an element is unique. These facts justify the following notation: we call the identity element 1, and use subscripts 1 G, 1 H to distinguish between identity elements in different groups G, H. We also replace the explicit · operation with an invisible operation (juxtaposition), so that xyz replaces x · y · z. Moreover, we emulate repeated applications of the operation by writing xⁿ to mean x · x · · · x, the product of n copies of x.
One more caveat to support “legacy” math. If we’re talking about the integers Z
under addition, the juxtaposition operation (which implies multiplication) feels
unsanitary. It simply won’t do. In this case, and whenever we have a group of
numbers with a + symbol as the operation, we’ll use +. And instead of xn we’ll use
nx to mean x + x + · · · + x adding n copies. Here n is not considered an element
of Z as a group, but just the number of additions. Likewise, −x is the inverse of
x, while in a multiplicative group the inverse is x− 1. This is purely syntactic
sugar.
Now we demonstrate how two drastically different sets can have the same underlying
group structure, which will inform our dive into structure-preserving mappings
between groups. The first group we understand well: R under addition. For the
second, consider the set of 2 × 2 matrices of the following form, under the
operation of matrix multiplication.
G = { ( 1 a ; 0 1 ) : a ∈ R } ,

writing ( 1 a ; 0 1 ) for the 2 × 2 matrix with first row (1 , a) and second row (0 , 1). The identity matrix is the identity element. Notice G has some familiar structure:

( 1 a ; 0 1 )( 1 b ; 0 1 ) = ( 1 a + b ; 0 1 ) ,

and the correspondence x ↦ ( 1 x ; 0 1 ) converts addition in R into multiplication in G.
Any mathematical setting that expresses the abstract group R can be identified by
finding this sort of group-correspondence with (R , +). A mathematician sees this
wonderful example and dreams: can we classify all the different kinds of group
structures? Could we get a new perspective on the symmetry group of the square by
turning it into a suitable group of matrices?
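The correspondence can be checked numerically; a sketch, assuming G consists of the shear matrices with first row (1, a) and second row (0, 1):

```python
# Represent the matrix (1 a; 0 1) as a tuple of rows and check that
# multiplying matrices corresponds to adding the parameters.
def M(a):
    return ((1, a), (0, 1))

def matmul(A, B):
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

assert matmul(M(2.5), M(4.0)) == M(6.5)   # multiplication <-> addition
assert matmul(M(3), M(-3)) == M(0)        # inverses <-> negation; M(0) = identity
```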
Such structure-preserving maps between groups are called homomorphisms. In particular, they need not be bijections. But they do preserve the defining features of the group structure. To build up intuition we can do some simple proofs.
Proof. Let G be a group with identity 1 G, and H a group with identity 1 H. Let f : G → H be a homomorphism. Then

f (1 G) = f (1 G 1 G) = f (1 G) f (1 G)
Since H is a group, all its elements have inverses, including f(1 G). So multiply
both ends by f(1 G) − 1 to get
f (1 G) f (1 G) − 1 = f (1 G) f (1 G) f (1 G) − 1
1 H = f (1 G)1 H = f (1 G)
The extent to which a homomorphism degrades the structure of the input group is measured by its kernel:

ker f = {x : f( x) = 1 H}
An example: G = Z under addition and H = Z/10Z under addition modulo 10. Let f : G → H be the map n ↦ 2 n mod 10 (Exercise: prove this is a homomorphism). The kernel of f is { 0 , ± 5 , ± 10 , ± 15 , . . . }. Despite losing the multiples of 5, the
image f( G) still has a group structure inside H. Note f( G) = { 0 , 2 , 4 , 6 ,
8 }, and the group operation in H—applied only to elements of f( G)—maintains the
property of being in f( G). In other words, part of the structure of G is embedded
inside H using the operation of H, but not all of it.
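These kernel and image computations are easy to replicate; a sketch in Python (the finite window over Z is my own device, since we can't enumerate all of Z):

```python
# The homomorphism f : Z -> Z/10Z, n -> 2n mod 10, over a finite window of Z.
def f(n):
    return (2 * n) % 10

window = range(-20, 21)

# The kernel (within the window) is exactly the multiples of 5.
kernel = {n for n in window if f(n) == 0}
assert kernel == {n for n in window if n % 5 == 0}

# The image is {0, 2, 4, 6, 8}, and it is closed under addition mod 10.
image = {f(n) for n in window}
assert image == {0, 2, 4, 6, 8}
assert all((a + b) % 10 in image for a in image for b in image)
```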
A group that sits inside another group (and shares the containing group’s
operation) is called a subgroup.
Definition 16.8. Let H ⊂ G be two sets and let G be a group under the operation ·. We call H a subgroup of G if the following hold:
• 1 ∈ H.
• If x, y ∈ H, then x · y ∈ H.
• If x ∈ H, then so is x− 1.
Another term for the above conditions is that H is “closed” under · and the
inverse-taking operation ( −) − 1.
A homomorphism provides two useful subgroups: its image and its kernel.
Proof. First we prove that ker f is a subgroup of G. We’ll prove this directly, by
assuming x, y are arbitrary elements of ker f , and showing that xy ∈ ker f and x−
1 ∈ ker f . These are the second two conditions required of a subgroup by
Definition 16.8, and the first condition, 1 G ∈ ker f, is implied by Proposition
16.5.
A similar argument shows that im f is a subgroup of H; for example, Proposition 16.5 implies 1 H ∈ im f.
Take Z with addition, and the map f : Z → Z/10Z defined by x ↦ 2 x modulo 10.
Lemma 16.10. For any homomorphism f, the quotient set G/ ker f forms a group under
the operation [ a][ b] = [ ab] .
If G and H are isomorphic, they have identical group structure, and H is simply a relabeling of the elements of G. The boolean comparison (or assertion) that two groups G, H are isomorphic is denoted G ≅ H, and two isomorphic groups are considered the same “up to isomorphism,” meaning only their representations are different. For our earlier example f : Z → Z/10Z, the image f( G) = { 0 , 2 , 4 , 6 , 8 } ≅ Z/5Z. This is an instance of a general fact: for any homomorphism f, the image satisfies im f ≅ G/ ker f .
The most common, as we’ve seen multiple times in this chapter, are the integers
under addition, their subgroups, and their quotients, under addition and addition
modulo n.
These arise as the images of the maps Z → Z defined by x ↦ nx, and as the kernels of the quotient maps Z → Z/ n Z. The subgroups of Z have the form n Z = {nx : x ∈ Z } for some fixed n, including the trivial subgroups { 0 } and Z itself.
The groups Z and Z/ n Z both have the property that 1, when repeatedly added to
itself, produces the entire group. Because of this, 1 is called a generator of the
group. In general, an element x ∈ G is called a generator if the subgroup { 1 , x,
x 2 , x 3 , . . . } is equal to G.
Groups with such an element are called cyclic groups, and all cyclic groups are
isomorphic to Z or Z/ n Z under addition. In general, a set S ⊂ G is said to generate G if every x ∈ G can be written as a product of elements of S and their inverses.
Figure 16.4: Because 5 is odd, the lines of symmetry of the regular pentagon each
pass through a side and a vertex.
A group G may have generating sets of different sizes; hence, any concept of “group dimension” must be more nuanced. 5
One of the simplest ways to build a larger group from smaller pieces is the direct
product.
This construction simply forms the product of two groups as sets, and defines the
group operation component-wise. E.g., Z × Z/2Z is the set of pairs {( n, b) | n ∈ Z
, b ∈ { 0 , 1 }}, where ( n, b) + ( n′, b′) = ( n + n′, b + b′). If a group
decomposes as a direct product of subgroups, the symmetry structure can be seen to
have independent components.
The set Z/ n Z forms a group under multiplication if we remove the numbers k such
that gcd( n, k) ̸= 1. This guarantees that inverses exist. In the special case that
n is prime, we need only remove zero. This group is denoted (Z/ n Z) ×, and it’s
substantially more interesting than integers under addition. Up to isomorphism it
is always possible to write (Z/ n Z) × as a direct product of cyclic groups.
However, there is no known generic method for finding generators of the cyclic
pieces. This computational difficulty is exploited by RSA public-key cryptography,
which we will explore in an exercise.
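The construction of (Z/ n Z) × is short enough to sketch directly (my own illustration), verifying that removing the k with gcd( n, k) ≠ 1 leaves a group under multiplication:

```python
from math import gcd

def units(n):
    """(Z/nZ)^x: the residues mod n that are coprime to n."""
    return [k for k in range(1, n) if gcd(n, k) == 1]

G = units(10)
assert G == [1, 3, 7, 9]
assert all((a * b) % 10 in G for a in G for b in G)        # closure
assert all(any((a * b) % 10 == 1 for b in G) for a in G)   # inverses exist
assert units(7) == [1, 2, 3, 4, 5, 6]                      # prime n: drop only 0
```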
Next we have the symmetry groups of regular convex polygons6 in the plane, such as
the square we started this chapter with. The group corresponding to the polygon
with n ≥ 3 sides is called the dihedral group and is denoted D 2 n. It has 2 n
elements, corresponding to the n rotations by an angle of 2 π/ n and the n
reflections across lines passing through the vertices and sides. These lines of
symmetry depend on the parity of n, as is made clear by the lines of symmetry in
the pentagon and the hexagon in Figure 16.4.
Confusingly, the dihedral group for a polygon with n sides is sometimes denoted Dn
instead of D 2 n, which makes D 8 terribly ambiguous. We’ll use D 2 n.
5 A cyclic group has a generating set consisting of a single element, called “the”
generator, but “generator” can loosely refer to any element of a generating set,
even if that element alone does not generate the group.
6 Regular means all the angles have the same measure and all sides have the same
length, and convex means every line between points in the polygon is completely
contained in the polygon.
Next we have matrix groups. Given any reasonably well-behaved number system that
has addition and multiplication, say R for example, we can form a group of square
matrices under matrix multiplication, which is often called the general linear
group. Denote by GLn(R) the set of invertible n × n matrices with real entries. As
we saw in Section 16.2, asserting some specific structure on the groups often leads
to an interesting subgroup.
One famous subgroup of the general linear group is called the orthogonal group,
denoted On(R), consisting of matrices whose columns form orthonormal bases.
This group is closely related to the symmetry group of Euclidean space we’ll study
in Section 16.5. Another interesting facet of groups of matrices is that they have
enough structure that one can do calculus on them. In the formal jargon, the
general linear group is a smooth manifold. This is far beyond the scope of this
book, but at least explains why the general linear group gets such a special name.
The last example is called the symmetric group. Really, it should be called the
permutation group, since it is the set of all bijections of a fixed set to itself.
Let A be a set, and define the symmetric group S( A) to be the set of all bijections A → A. It is easy to see that if A, B are both finite sets of size n, then S( A) ≅ S( B), so this group is denoted by Sn. In the exercises you will study the structure of finite permutation groups, and a useful data representation for computation.
In particular, every finite group is isomorphic to a group of permutations: define f : G → S( G) by sending x to the bijection y ↦ xy; since ker f = { 1 }, we have that G ≅ f( G). So to write programs that do computations on finite groups, it’s enough to write programs that work with finite permutation groups. Indeed, most useful group-theoretic algorithms are algorithms on finite permutation groups. Entire books have been written about this.
For the rest of the chapter, we’re going to study geometry from the perspective of
groups. In fact, the modern mathematical attitude toward geometry is that it is the
study of groups. This view was espoused by Felix Klein in the late 1800’s. Around
this time,
special cases of projective geometry and hyperbolic geometry had been discovered,
but it was largely unclear how different geometries were related.
With these in hand, the symmetry group of the space is the set of bijections X → X
that preserve the quantity of interest. In Euclidean geometry, points and lines are
the usual points and lines in R n, and distance is the quantity of interest. Such
“quantities of interest” are called invariants. A different type of geometry might
only wish to preserve area of figures, or preserve the property of similarity
(invariance under scaling).
Klein’s view was that a geometry should be studied via its group of symmetries. The
classical concepts like angles, areas, and lengths are seen as measures that may or
may not be invariant under the application of a symmetry. Thus, geometry has two
approaches: given a group of symmetries, study the interesting quantities invariant
to those transformations; and given a quantity you think is important, find the
group of symmetries that preserves that quantity. Every geometry has a group. Every
group corresponds to some geometry.
Klein called his view the Erlangen Program. 7 One striking result8 was that all
geometries are a special case of projective geometry—a geometry that allows
projections to a possibly infinite horizon. In particular, even though different
geometries might have different axioms (regarding, say, configurations of parallel
lines), every geometry can be modeled inside of a projective geometry. For example,
hyperbolic geometry is projective geometry restricted to a particular surface
inside a larger projective space. Moreover, the group corresponding to this model
of hyperbolic geometry is a subgroup of the symmetry group of the projective
geometry. We get containments of the spaces as sets, and of the groups as
subgroups.
We won’t study this particular relationship in this book, but it shows how Klein’s
desire fits into the larger mathematical goal: to connect and unify disparate
geometries into a single theory. I encourage the reader interested in cryptography
to learn about projective geometry, in part because it’s the correct setting for studying elliptic curves. It’s also a great way to exercise your linear algebra muscles, as projective geometry is simply a quotient of the vector space R n by a suitable equivalence relation. 9
7 In mathematics, a “program” is a sort of long-term plan, usually one that is too large for a single mathematician to complete alone. In the case of Klein, mathematicians and physicists found new geometries and symmetry groups to study long after Klein died.
8 I’m not aware of this claim as a theorem, but rather a famous “attitude” voiced by Arthur Cayley.
We now turn to Euclidean geometry, and study it through the lens of groups.
Euclidean Geometry
Euclidean geometry is the study of isometries of R n with the usual distance metric
d( x, y) = ∥x − y∥. Recalling Definition 16.2, f : R n → R n is an isometry if
d( x, y) = d( f ( x) , f ( y)) for all x, y ∈ R n. Because isometries preserve
distance, and angle measure is determined by the lengths of the sides of triangles,
isometries also preserve angle measure.
With a few moments of thought, it’s easy to come up with examples of Euclidean
Remember that rotations, projections, and reflections are examples of linear maps.
Ignoring translations for a moment, it’s natural to wonder which linear maps double
as isometries.
Theorem 16.13. The isometries of R n that fix the origin are exactly the linear
maps whose columns form an orthonormal basis.
Proof. In Chapter 12, we observed that matrices with orthonormal columns preserve
the inner product. Let A be such a matrix. In R n, squared distance is d( x, y)2 =
⟨x−y, x−y⟩.
As a consequence,

d( Ax, Ay)² = ⟨A( x − y) , A( x − y) ⟩
= ⟨x − y, x − y⟩
= d( x, y)² .
Since distances are non-negative, the square roots are also equal.
To show that any isometry fixing the origin is a linear map with orthonormal
columns, we first show it is linear. We will use slick geometric arguments, but one
can prove it just as well with formulas involving inner products (which the reader
is encouraged to try).
9 A while back I wrote a blog series on this topic, in which I build up elliptic
curve cryptography from scratch. The second post in the series defines projective
geometry as a quotient. You can find it here:
https://jeremykun.com/2014/02/08/introducing-elliptic-curves/.
First, f( ax) = af( x). To prove this we first prove that any Euclidean isometry
maps lines to lines. We will use the fact that in Euclidean geometry a straight
line is the shortest path between any two points. In particular, if x lies on the
shortest path from 0 to ax, then f( x) lies on the shortest path from 0 to f( ax):
letting c = d(0 , x), then x minimizes the following:

min { d( y, ax) : y ∈ R n , d(0 , y) = c }
Using the fact that isometries map lines to lines, we continue. Since d(0 , ax) =
|a|∥x∥ = d(0 , f( ax)), the only way f( ax) can be on the same line through the
origin as f( x) is if f( ax) = ±af( x). We claim it must be f( ax) = af( x).
Suppose for contradiction that f( ax) = −af( x), then there are two cases. In the
first case, |a| ≥ 1.
Then, since x lies on the segment from 0 to ax,

d(0 , ax) = d(0 , x) + d( x, ax)
= ∥x∥ + ∥ − af ( x) − f ( x) ∥
= ∥x∥ + | − a − 1 |∥f ( x) ∥
= (1 + | − a − 1 |) ∥x∥
This implies |a| = 1 + |a + 1 |, which is only true for a = − 1. This provides a contradiction for all a ̸= − 1. But if a = − 1, then f( −x) = f( x), which contradicts f being injective.
We conclude that f( ax) = af( x). The second case, |a| < 1, is similar.
Next we show f( x + y) = f( x) + f( y). Define the distance between two lines L 1 , L 2 as the minimum of d( x, y) over x ∈ L 1 , y ∈ L 2. When L 1 , L 2 are parallel, this distance is a positive constant, and otherwise it is zero. Since the property is defined entirely in terms of distance, an isometry must preserve it.
Now consider the parallelogram with vertices 0 , v, w, and v + w, with opposite sides being parallel line segments and one vertex at the origin.
By our arguments above (isometries preserve length, angle measure, and parallelism
of lines), isometries map parallelograms to parallelograms. But a parallelogram is
precisely how we define addition of two vectors! The sum of the vectors
representing the sides is the diagonal vector drawn from the origin to the opposite
vertex.
Now that we’ve established isometries that fix the origin are linear maps, we
already know from linear algebra that a linear map preserves distance if and only
if it preserves the inner product ( d( x, y) = ∥x − y∥ is defined in terms of the
inner product), which happens if and only if its columns are orthonormal. (Cf.
Chapter 12, Exercise 12.3) This proof puts into practice Klein’s idea to study
invariants preserved by isometries.
The invariants that can be derived from distance preservation are highly
structured, allowing one to explicitly limit an isometry’s shenanigans. As an added
benefit, thinking in terms of invariants removes the need to rephrase geometric
concepts in symbolic language. If you found the epsilon-delta proofs of calculus
tedious, you might just be a geometer.
The group of n × n matrices with orthonormal columns is called the orthogonal group O(n). Recall it has the following characterization:

    O(n) = { A : AᵀA = Iₙ }.
We’ve already shown that this set forms a group under matrix multiplication. Still,
it’s worthwhile to check again in purely linear algebraic terms. Each matrix
represents a change of basis, and composing two basis-changes is again a change of
basis. The identity is a no-op basis change, and every basis change has an inverse.
Finally, orthogonality is preserved: if AᵀA = Iₙ and BᵀB = Iₙ, then (AB)ᵀ(AB) = Bᵀ(AᵀA)B = BᵀB = Iₙ. Likewise, A⁻¹ = Aᵀ is orthogonal.
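The closure computation above can be checked numerically. A minimal sketch (my own, not from the book's code) using 2 × 2 rotation and reflection matrices, which are members of O(2):

```python
import math

def mat_mul(A, B):
    # Product of two 2x2 matrices.
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def is_orthogonal(A, tol=1e-12):
    # A is in O(2) exactly when A^T A = I.
    P = mat_mul(transpose(A), A)
    return all(abs(P[i][j] - (1 if i == j else 0)) < tol
               for i in range(2) for j in range(2))

def rotation(theta):
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

reflection = [[1.0, 0.0], [0.0, -1.0]]  # reflect across the x-axis

A = rotation(0.7)
B = mat_mul(rotation(2.1), reflection)
```

Products of members of O(2) pass the AᵀA = I test, as does the transpose (which is the inverse), illustrating the group axioms.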
Because these isometries are linear maps, we can also infer that the complete
behavior of the isometry is determined by its behavior on n linearly independent
points. This is another example of local information being used to infer global
structure.
Given an arbitrary isometry f, let v = f(0). Then x ↦ f(x) − v is an isometry fixing the origin, hence a linear map x ↦ Ax with orthonormal columns, and f(x) = Ax + v, which has the form required to be a member of E(n). This maps an isometry f to a member of E(n). This mapping is a homomorphism by the following computation: if f = x ↦ Ax + v with v = f(0) and g = y ↦ By + w with w = g(0), then g ∘ f = x ↦ BAx + Bv + w, where Bv + w is precisely g(f(0)). Coupling this with the one-sided inverse (Ax + v is an isometry for any orthogonal A and any v), we get our bijection.
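The composition rule is mechanical enough to code up. A small sketch (the names and representation are mine): an isometry is a pair (A, v) meaning x ↦ Ax + v.

```python
def apply(f, x):
    (A, v) = f
    return [sum(A[i][k] * x[k] for k in range(2)) + v[i] for i in range(2)]

def compose(g, f):
    # (B, w) o (A, v) = (BA, Bv + w), matching g(f(0)) = Bv + w.
    (B, w), (A, v) = g, f
    BA = [[sum(B[i][k] * A[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    Bv = [sum(B[i][k] * v[k] for k in range(2)) for i in range(2)]
    return (BA, [Bv[i] + w[i] for i in range(2)])

# A quarter-turn followed by a translation, then a pure translation.
f = ([[0, -1], [1, 0]], [1, 0])
g = ([[1, 0], [0, 1]], [0, 2])
```

Applying the composed pair agrees with applying f and then g, which is exactly the homomorphism property.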
Hyperbolic Geometry
In antiquity, the Greek mathematician Euclid laid out a grand vision of geometry in
which every theorem can be proved from a core set of axioms. The axioms, one of
which was
“any two points can be connected by a straight line,” cannot be proved and must be taken as truisms. In modern form, the five axioms are:

1. Any two points can be connected by a straight line segment.

2. Any straight line segment can be extended indefinitely into a straight line.

3. For any straight line segment, there is a circle with that segment as its radius
   and one endpoint as its center.

4. All right angles are congruent to one another.

5. Given any straight line and a point not on that line, there is a unique
   straight line that passes through the point and never intersects the first
   line.
The fifth axiom, commonly called “the parallel postulate,” nagged mathematicians
for centuries. It always seemed possible that it could be converted from an axiom
to a theorem by deducing it from the other four axioms.
These efforts were sadly in vain. As is often the case, the more failed attempts at
proving a claim, the more it seems the claim might be false. Indeed, the parallel
postulate can be broken in a few ways. There are geometries that satisfy the first
four axioms, but no parallel lines exist (all possible lines intersect the first).
There are also geometries in which multiple parallel lines exist. Projective
geometry is an instance of the first breakage, and hyperbolic geometry the second.
Let’s now define a model of the hyperbolic plane and classify its symmetries. I say
“a”
because there are many models of the hyperbolic plane. The connections between them
are interesting and useful, but for this chapter we’ll work entirely in the model
called the Poincaré disk. The Chapter Notes and Exercises contain more historical
details.
The universe of points for the Poincaré disk is the interior of the unit disk,

    D² = { x ∈ ℝ² : ∥x∥ < 1 }.

In the Poincaré disk, there are two kinds of lines. The first kind includes the origin: these lines are simply diameters of the unit circle (not including the endpoints). The second kind is the arc of a Euclidean circle that intersects the boundary of the unit disk at right angles, again with the endpoints excluded.
Now we can immediately see why the parallel postulate fails: parallel lines are
just circles that don’t intersect! Given one such circle C and a point not on that
circle, we can find many circles passing through the point that don’t intersect C.
This is pictured in Figure 16.6, where C is the dotted line.
There is a bit of work to do to establish the axioms of geometry for this model. We
need to be able to draw a line between any two points, and to draw a circle with a
segment as its radius. A priori, it’s not clear what a circle would look like in
this model, since some lines are defined as parts of Euclidean circles. We will
have to define such a “Poincaré circle.” We also need to define the angle between
two hyperbolic lines, and verify that right angles are all congruent. For each of
these it helps to have our first hyperbolic symmetry in hand: inversion in a
Euclidean circle.
Definition 16.15. Let C be a Euclidean circle with center x and radius r. Let p be
a point different from x. Define the inverse of p with respect to C as the point p′
along the ray from x through p that satisfies:
    d(p, x) · d(p′, x) = r².
The verb for computing the inverse with respect to C is “inverting in C.” For the
classical geometric construction of the inverse of p in C: suppose p is in the
interior of C. Draw a ray from x through p, as in Figure 16.7. Then draw a
perpendicular segment from p to C to get a point q. Then the inverse p′ is the
intersection of the tangent to C at q with the ray x → p.
Figure 16.5: Lines in the Poincaré disk. The solid black line is the boundary of
the disk.
The dashed diameters are one type of line. The arcs of the dashed circles are
another. The circles must intersect the boundary of the disk at perpendicular
angles.
Figure 16.6: Given the dotted Poincaré line and the indicated point, all three
dashed lines pass through the point without ever intersecting the dotted line. The
parallel postulate fails.
If p is outside the circle, one can perform these steps backward: compute a tangent
to C through p to get q, then p′ is the intersection of the altitude of the
triangle ∆ xqp with the ray x → p. If p lies on the circle, then p is its own
inverse.
To see why this has the property required by Definition 16.15, look again at Figure
16.7.
Triangles ∆ xp′q and ∆ xqp are similar (a general truth about altitudes of right
triangles), meaning d( x, p′)/ r = r/ d( x, p).
Another way to construct the inverse is to “just do it.” You want a point along the ray from the center x through p compatible with its defining property. Simply compute p′ = x + r²(p − x)/∥p − x∥². Then

    d(p, x) d(p′, x) = ∥p − x∥ ∥p′ − x∥
                     = ∥p − x∥ · ∥ r²(p − x) / ∥p − x∥² ∥
                     = ∥p − x∥² r² / ∥p − x∥²
                     = r².
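The "just do it" construction translates directly into code; a short sketch (the function name is mine, not the book's):

```python
def invert_in_circle(p, center, r):
    """Compute the inverse of p with respect to the circle with the given
    center and radius r, via p' = x + r^2 (p - x) / ||p - x||^2.
    Assumes p is not the center."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    norm_sq = dx * dx + dy * dy
    scale = r * r / norm_sq
    return (center[0] + scale * dx, center[1] + scale * dy)
```

For example, inverting (1/2, 0) in the unit circle gives (2, 0), and a point on the circle is its own inverse.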
Proposition 16.16. If a circle passes through a point p and its inverse p′ with respect to a circle C, then it intersects C at right angles.

Proof. The proof is left as an exercise in geometry. As George Pólya said, geometry is the science of correct reasoning on incorrect figures. Take this to heart and make lots of bad drawings.
Figure 16.8: The construction of a Poincaré circle with center p and radius pq.
Next, we define a Poincaré line as the arc of a circle orthogonal to the boundary
of the unit disk. We ignore some special cases made precise in code in Section
16.7. Given two points p, q ∈ D2, pick one that’s not the center of D2 and invert
it in the unit circle to get a third point s outside the unit disk. By Proposition
16.16, the unique circle through these three points is orthogonal to the unit
circle, as desired. The arc of that circle that is between p and q and lies inside
the unit circle is defined to be the line segment between p and q, as well as the
shortest path between them. To extend this segment to a line, include the entire
arc within the interior of D2.
Second, we define a Poincaré circle. We take a cue from Euclidean geometry, where a
circle has the property that it is perpendicular to every line through its center,
and use this property to guide the construction. Again we use the inversion: fix a
line segment between p and q, and say we want to draw the circle with center p.
Pick any two hyperbolic lines L 1 , L 2 that pass through p but not q. Invert q in
both of these lines to get q′, q′′. Then the Euclidean circle passing through q, q
′, q′′ is the hyperbolic circle centered at p with radius pq. See Figure 16.8 for a
diagram. Curiously, a hyperbolic circle in the Poincaré disk is a Euclidean circle,
but its center is not the same point as the Euclidean center.
Finally, define the angle between hyperbolic lines as the usual Euclidean angle
between their tangents at their point of intersection. Since hyperbolic lines are
orthogonal to the unit circle, their centers necessarily lie outside of the
Poincaré disk. Hence, if two lines intersect they intersect at a single point.
Since the angles are defined in terms of Euclidean angles, all right angles are
congruent.
Together, these facts establish the axioms of a geometry for the Poincaré disk.
Taking a cue from Klein, let’s study the symmetries of the Poincaré disk. We
already have one symmetry: reflection across a hyperbolic line, which is inversion
with respect to the circle defining that line. In the case of a hyperbolic line
which is a diameter of D2, reflection is the same as Euclidean reflection in that
line. By Proposition 16.16, these operations preserve the boundary of the Poincaré
disk D2, and it’s not hard to prove that the interior of D2 is also mapped to
itself.
Definition 16.17. Let w, x, y, z be four distinct points (in a specific order). The cross ratio of w, x, y, z, denoted [wx; yz], is defined as

    [wx; yz] = (∥w − y∥ / ∥w − z∥) ÷ (∥x − y∥ / ∥x − z∥) = (∥w − y∥ ∥x − z∥) / (∥w − z∥ ∥x − y∥).
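In code, the cross ratio is a one-liner. A small sketch (points as tuples; the helper names are mine):

```python
import math

def cross_ratio(w, x, y, z):
    """[wx; yz] = (|w - y| * |x - z|) / (|w - z| * |x - y|)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return (dist(w, y) * dist(x, z)) / (dist(w, z) * dist(x, y))
```

For four equally spaced collinear points taken as w = 0, x = 3, y = 1, z = 2, the value is (1 · 1)/(2 · 2) = 1/4, and rigid motions of the four points leave it unchanged.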
The cross ratio holds the distinguished position of being the invariant quantity of
projective geometry. Since all geometries are special cases of projective geometry,
an appropriately contextualized version of the cross ratio should be invariant for
hyperbolic geometry as well.
Lemma 16.18. Two hyperbolic reflections agreeing on two distinct input-output pairs are equal. That is, the circle defining an inversion operation is uniquely determined by its behavior on two points with distinct images.
Proof. When reflecting across a diameter of D2, the lemma is true because
reflection in a Euclidean line is uniquely determined by its behavior on two points
(prove this as an exercise). The paragraphs to follow will heavily use Definition
16.15.
If the points and their images lie on a common line, then we may assume without loss of generality that they lie on the horizontal axis; otherwise we may make this true via a rotation about the origin of D², and uniqueness is unaffected. With this, we may set x = (a, 0), x′ = (a′, 0), y = (b, 0), y′ = (b′, 0), and we need to find the center z = (c, 0) and radius r > 0 of the inverting circle. Definition 16.15 gives the two equations (a − c)(a′ − c) = r² and (b − c)(b′ − c) = r², where c and r
Figure 16.9: The image of two points uniquely determines the circle of inversion
(the easy case).
are variables. Subtracting the two equations gives aa′ − bb′ + c(b + b′ − a − a′) = 0, which can be solved for c as long as a + a′ ≠ b + b′. Note that if b − a = a′ − b′, then the two points must be interchanged by the inversion, and hence they do not provide two distinct input-output pairs.
Lemma 16.18 fails in the case that the two points are exchanged by the inversion, which simplifies the pair of equations used in the proof to the single equation (a − c)(b − c) = r². If you arbitrarily choose a position for c to the right of both a and b, or to the left of both a and b, then you can solve this one equation for r. So an extra condition is required for uniqueness, and the condition relevant to the upcoming Lemma 16.21 is that the inverting circle is orthogonal to the unit disk.
Next, we show that the cross ratio is preserved by hyperbolic reflections. The
proof is trivial for reflection in a diameter of the Poincaré disk, so we focus on
the case of inversion in a circle.
Theorem 16.19. Let f(x) be inversion in a circle with center c and radius r. Let w, x, y, z ∈ ℝ² be four distinct points, none equal to c. Then f preserves their cross ratio.
Proof. For ease of notation, let w′ = f(w) (similarly for x, y, z), and let (ab) denote ∥a − b∥, the length of the line segment between two vectors a and b. We'll use · for multiplication to disambiguate. Then we must prove

    [(wy) · (xz)] / [(wz) · (xy)] = [(w′y′) · (x′z′)] / [(w′z′) · (x′y′)].

It suffices to show that for any two of these points, say w, y, that (wy)/(w′y′) = (cw)/(cy′). If we can show this, then (note the second equality below is where we apply the claim, and the rest is regrouping):
    [(wy) · (xz)] / [(wz) · (xy)] ÷ [(w′y′) · (x′z′)] / [(w′z′) · (x′y′)]
        = [(wy)/(w′y′)] · [(xz)/(x′z′)] · [(w′z′)/(wz)] · [(x′y′)/(xy)]
        = [(cw)/(cy′)] · [(cx)/(cz′)] · [(cz′)/(cw)] · [(cy′)/(cx)]
        = 1,

so the two quotients in the statement are equal.
To prove that (wy)/(w′y′) = (cw)/(cy′), we split into two cases depending on whether c, w, y are collinear. If they are not, then this follows from the similarity of the triangles ∆cwy ∼ ∆cy′w′: they share the angle at c, and the defining property of circle inversion implies (cw)/(cy′) = (cy)/(cw′). If they are collinear, consider the diagram in Figure 16.10. If w, y are on opposite sides of c, then

    (wy)/(cw) = [(cw) + (cy)] / (cw)
              = 1 + (cy)/(cw)
              = 1 + [r²/(cy′)] / [r²/(cw′)]
              = 1 + (cw′)/(cy′)
              = [(cy′) + (cw′)] / (cy′)
              = (w′y′)/(cy′),

which rearranges to (wy)/(w′y′) = (cw)/(cy′), our goal. If w and y are on the same side of c, then replacing the sum (wy) = (cy) + (cw) with (wy) = (cy) − (cw), or (cw) − (cy), as the case may be, yields the same result.
Definition 16.20. Let p, q ∈ D² be two distinct points. Form the hyperbolic line through those points, and let x, y be the intersections of that line with the boundary of D², so that x is closest to p and y to q. Define the distance between p and q to be

    d(p, q) = (1/2) · log [ (∥x − q∥ ∥y − p∥) / (∥x − p∥ ∥y − q∥) ].
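To experiment with this definition numerically, one can use the special case of a diameter, where the ideal points are x = (−1, 0) and y = (1, 0). The sketch below also uses an equivalent closed form, d(p, q) = (1/2) · arccosh(1 + 2∥p − q∥² / ((1 − ∥p∥²)(1 − ∥q∥²))), a standard fact about the Poincaré disk stated here without proof; the 1/2 matches this definition's normalization.

```python
import math

def poincare_distance(p, q):
    # Closed form avoiding the explicit ideal points x and y.
    pq_sq = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    denom = (1 - p[0] ** 2 - p[1] ** 2) * (1 - q[0] ** 2 - q[1] ** 2)
    return 0.5 * math.acosh(1 + 2 * pq_sq / denom)

def diameter_distance(t):
    # Definition 16.20 for p = (0, 0) and q = (t, 0) with 0 < t < 1:
    # x = (-1, 0) is closest to p and y = (1, 0) to q, so
    # d = (1/2) log((|x - q| |y - p|) / (|x - p| |y - q|)).
    return 0.5 * math.log((1 + t) / (1 - t))
```

The two computations agree along a diameter, and the distance blows up as q approaches the boundary, matching the remark that follows.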
Speaking somewhat vaguely, using these two special ideal points to compute the cross ratio provides a “canonical” choice that allows different distances to be compared against the same reference scale. As p and q approach the boundary of the circle, the denominators involved in the cross ratio tend to zero and the cross ratio (and hence the distance) grows without bound. See the exercises for more.
Lemma 16.21. The set of points equidistant from two distinct points x, y is a
hyperbolic line, and a hyperbolic reflection in this line exchanges x and y.
Proof. First, we establish that for any two points x, y, there is a unique
hyperbolic reflection f : D2 → D2 that exchanges x and y. Then we prove that a
point is fixed by f if and only if it is equidistant to x and y. Since we know that
a point is fixed by a circle inversion if and only if it lies on that circle,
this completes the proof.
The existence of f: if x and y both have the same Euclidean distance from the
origin, then one can use the diameter of D2 that bisects the angle between x, y,
and the center of D2. Otherwise, as per the postscript of Lemma 16.18 we follow the
steps of Lemma 16.18
with the added condition that the inverting circle is orthogonal to the unit
circle.
Rotate the center of the (unknown) circle of inversion so it, x, and y all lie on
the same horizontal line, which we may suppose without loss of generality is the
horizontal axis.
    ab − ∥d∥² + 1 + (2d₁ − a − b)c = 0
Figure 16.11: The line between x and f(z) is mapped to the line between y and z by the reflection, and the intersection of these lines is w.
This has a unique solution for c if and only if d 1 ̸= ( a + b)/2, i.e., if d does
not lie on the (Euclidean) perpendicular bisector of the line segment between x, y.
This exceptional case is exactly when we use a reflection in a diameter of D2,
i.e., the first case above.
Now let z be a point with d(x, z) = d(y, z), and suppose to the contrary that z ≠ f(z). Let L be the hyperbolic line defined by f; by swapping z and f(z) we may assume z is on the same side of L as x. In this case note that z, f(z) are exchanged by f, since f is a reflection. This implies that any point w fixed by f is also equidistant to z and f(z). We have the picture in Figure 16.11.
Now d(x, z) = d(x, f(z)): indeed, d(x, z) = d(y, z) = d(f(y), f(z)) = d(x, f(z)), using that f exchanges x and y. Consider the hyperbolic line segment between x and f(z), which intersects L (the hyperbolic line defining f) at a point w. This w is on the shortest path between x and f(z), meaning d(x, f(z)) = d(x, w) + d(w, f(z)), and note that w is fixed by f.
Finally,
d( x, z) = d( x, f ( z))
= d( x, w) + d( w, f ( z))
= d( x, w) + d( w, z)
This implies w is on the shortest path between x and z. This contradicts the
equality part of the triangle inequality: x and z are on the same side of L while w
is on L.
And now the finale, Theorem 16.22: every isometry of the Poincaré disk is a composition of at most three reflections. This proof relies on a fact whose proof I have omitted for brevity: isometries of the hyperbolic plane map lines to lines, just like in the Euclidean setting.
Proof. First, we claim that any isometry is determined by its effect on three non-collinear points x, y, z (points not all on a single Poincaré line). Suppose to the contrary there were two isometries f, g with f(x) = g(x), f(y) = g(y), f(z) = g(z), but for which some p ∉ {x, y, z} satisfies f(p) ≠ g(p). Since f and g are isometries, each of the points f(x), f(y), f(z) is equidistant from f(p) and g(p). By Lemma 16.21, the set of points equidistant from f(p) and g(p) forms a hyperbolic line, so f(x), f(y), f(z) are collinear. Since isometries map lines to lines, x, y, z would then also be collinear, a contradiction.
To show three reflections are enough to express any isometry f : D2 → D2, choose
any x, y, z not on a line. In the special case that x = f( x) and y = f( y), then
reflection in the hyperbolic line through x, y must map z to f( z). Indeed, z has
the same distance to x = f ( x) and y = f ( y) as f ( z), so Lemma 16.21 applies.
In this case f is just a reflection.
In the slightly less special case that only one of the three points equals its
image under f , say x = f ( x), then map y to f ( y) via reflection in the unique
hyperbolic line consisting of equidistant points to y and f( y) provided by Lemma
16.21. Again, since y and f( y) are equidistant from x = f( x), the line being
reflected must pass through x, meaning x is fixed by this reflection. With one
reflection we’ve reduced to the case x = f( x) , y = f( y); the first case adds one
more reflection to get f.
Finally, in the least special case that all three points are different from their
images, we can apply any reflection mapping x 7→ f( x), reducing to the second
case. This results in a simple algorithm:

1. If x ≠ f(x), let g₁ be the reflection in the hyperbolic line of points
   equidistant from x and f(x) (provided by Lemma 16.21), so that g₁(x) = f(x).

2. Do the same for g₁(y) and f(y), provided they are not equal, and call the
   resulting reflection g₂. This reflection fixes g₁(x) = f(x).

3. Do the same for g₂(g₁(z)) and f(z), provided they are not equal, and call
   the resulting reflection g₃. This reflection fixes both g₂(g₁(x)) = g₁(x)
   and g₂(g₁(y)).

Then g₃ ∘ g₂ ∘ g₁ agrees with f on x, y, z, so by the first claim f = g₃ ∘ g₂ ∘ g₁.
Tessellations

Even today, we tessellate our footballs with black-and-white pentagons and our tweed coats with herringbone. Look around you—tessellations!

Tessellations and groups are natural bedfellows. A fixed isometry of the ambient space containing a starting pattern will move the pattern to one of its repetitions, and the (usually infinitely large) set of all such transformations forms a group. This group uniquely describes the geometry of the tessellation.

Figure 16.16: Cloth, Hawaii. From The Grammar of the Ornament. A pattern which has two linearly independent directions of translational symmetry.
The Euclidean plane provides a notable example before we return to hyperbolic geometry. Let's consider the set of all patterns that have discrete repetition in two linearly
linearly independent directions (as opposed to a pattern that only repeats when
shifted, say, right), such as in Figure 16.16. The groups that describe such
patterns—which include the tessellations used in many historical decorations—have a
complete known classification.
They are called wallpaper groups, and there are exactly 17 of them, up to
isomorphism.
Simpler than classifying all wallpaper patterns, we can ask: what are the possible tessellations of the Euclidean plane by a convex polygon? For example, regular
squares (each interior angle having the same measure, and each side being the same
length) tile a plane via a group of translations isomorphic to Z × Z, a fact
familiar to anyone who has seen a chess or checkers board. And while regular
pentagons don’t tile the plane, irregular pentagons do, as depicted in Figure
16.18.
The shapes we’re narrowing down to study are convex, possibly irregular polygons.
Out of curiosity, if you try to tessellate the plane using an 8-sided convex polygon, you will struggle. Your struggle is true: it's impossible. The proof we'll see is quite interesting: it counts the angles in a large patch of the tessellation in two ways and invokes the Euler characteristic of a planar graph.

Theorem 16.23. No convex polygon with more than six sides tessellates the Euclidean plane.
Figure 16.18: An irregular pentagonal tiling, by David Eppstein.
Proof. Suppose for contradiction that there is an n-sided convex polygon P , scaled
to area 1, that tessellates the plane, and fix the set T of all polygons in such a
tessellation.
Our proof will have two steps: first, we will fix a bounded piece of the
tessellation of area A. Then we’ll count the number of angles of polygons contained
in that piece in two different ways, and arrive at an inequality of A in terms of
A. This inequality will be a contradiction for a sufficiently large A.
Fix a circle C of area A, and let S ⊂ T be the polygons in T that contain at least
one point within C. This finite set of polygons forms a graph G = ( V, E), where V
is the set of vertices of polygons in S, and E is the (possibly subdivided16) set
of polygon edges.
Moreover, this graph is planar since the tessellation S provides a literal drawing
in the plane. Call F the set of faces of G (i.e., the polygons plus the outside
face, as we did in Chapter 6). We summarize in Figure 16.19.
First, split each of V, E, F into “interior” and “exterior” subsets. The exterior
subsets correspond to those vertices, edges, and faces that are adjacent to the
outside of the graph.
I.e., these came from the polygons that are only partially in the circle C. The
interior vertices, edges, and faces are those that come from polygons entirely
inside C. Subscript V, E, F with “int” for interior and “ext” for exterior, like V
ext.
16 Two polygons in the tessellation can touch so that the vertex of one lies
partway along the edge of another.
Figure 16.19: The setup for a hypothetical tiling of the Euclidean plane by a convex 7-gon.
The bold circle has area A, and we include any polygon having at least one point
inside the disk with boundary C.
We will use the Euler characteristic formula from our chapter on graphs, Theorem 6.5, which says that for a planar graph |V| − |E| + |F| = 2. We first claim two facts (with constants depending on the fixed polygon P):

1. |F| = A + O(A^1/2), since each polygon has area 1 and only the polygons
   straddling the boundary circle, of which there are O(A^1/2), are partially
   inside C.

2. |E| = (n/2)A + O(A^1/2), since each face has n edges, each interior edge
   borders exactly two faces, and there are only O(A^1/2) exterior edges.

Together these imply the formula |V| = (n/2 − 1)A + O(A^1/2), which is attained by substituting the two facts into Euler's formula and combining.
Now we will count the number of interior angles of polygons in S in two different
ways. What I mean by “interior angle” is an angle at a vertex inside a face. The
first way is obvious, n( |F | − 1) = n|S| ≤ n|F |, because each polygon has n
interior angles by
definition (ignoring the exterior face). Second, we count by vertex, splitting into
interior and exterior cases. Call a_v the number of interior angles meeting at a vertex v ∈ V. Then

    # interior angles = Σ_{v ∈ V_int} a_v + Σ_{v ∈ V_ext} a_v.
For V int, there must be at least three interior angles at each vertex (one of
these angles may be part of an edge of some polygon, thus having measure π). This
bounds the first sum from below by 3 |V int |. The second sum is O( A 1/2) because
every exterior vertex touches an exterior edge, and fact (2) above shows the number
of exterior edges is O(A^1/2). This gives (# interior angles) ≥ 3|V_int| + O(A^1/2). Since |V_int| = |V| − |V_ext|, we have |V_int| = (n/2 − 1)A + O(A^1/2) as well.
Combining the two counts, nA + O(A^1/2) = n|F| ≥ (# interior angles) ≥ 3|V_int| + O(A^1/2) = 3(n/2 − 1)A + O(A^1/2). The right hand side is approximately 3(n/2 − 1)A = (3n/2 − 3)A, while the left hand side is approximately nA. For sufficiently large A, this forces n ≥ 3(n/2 − 1), which is equivalent to n ≤ 6, contradicting the assumption that a convex n-gon with n ≥ 7 tessellates the plane.
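The leading-order comparison in the final inequality, nA against 3(n/2 − 1)A, reduces to asking when n ≥ 3(n/2 − 1), which holds exactly for n ≤ 6. A tiny sketch:

```python
def passes_leading_order_count(n):
    # The proof's angle count requires, for large patch area A,
    # roughly n*A >= 3*(n/2 - 1)*A, i.e., n >= 3*(n/2 - 1).
    return n >= 3 * (n / 2 - 1)

admissible = [n for n in range(3, 13) if passes_leading_order_count(n)]
```

Only triangles through hexagons survive the count, matching the theorem.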
While this may disappoint hopeful weavers of the next great tapestry, one can
tessellate the hyperbolic plane with a 7-gon. Not only that, but there are
infinitely many ways to do it! Figure 16.20 shows two ways produced by the program
in this section. 18
In the figure, a regular 7-gon tessellates the Poincaré disk, with 3 polygons
meeting at each vertex. The two parameters implied by (7 , 3) provide an infinite
family of tessellations by regular, convex p-gons. Given a convex, regular, hyperbolic p-gon, let [p, q] denote the configuration in which q copies of the p-gon meet at each vertex.
18 The intrepid reader will revisit the proof of Theorem 16.23 and determine where
it fails for hyperbolic geometry.
Figure 16.20: Left: a tiling of the hyperbolic plane by 7-gons with 3 meeting per
vertex.
The artist M.C. Escher used a [6 , 4] tessellation to construct his Circle Limit
IV, displayed in Figure 16.21 with additional lines showing the hyperbolic lines
used in its design. The remainder of this chapter is devoted to drawing the
outlines of hyperbolic tessellations.
In an exercise you’ll extend the program to input a pattern (like the angel/devil
motif in Figure 16.21) and output an Escher-style drawing.
The building block is a fundamental triangle, a hyperbolic triangle with angles π/p, π/q, and π/2. If such a triangle has its π/p vertex centered at the origin, then Figure 16.23 shows why it produces a hyperbolic p-gon that tessellates the plane. In Figure 16.23, the fundamental triangle is the thick solid shape, and it's been repeatedly
reflected along the edges incident to the origin. Recall from Theorem 16.22 that
all isometries are products of reflections, and here we’re expressing rotations of
2 π/ p by two reflections. The result is that the triangle and its mirror are
rotated to produce a hyperbolic p-gon centered at the origin.
Likewise, the vertex with an angle of π/ q allows one to rotate around an exterior
vertex by an angle of 2 π/ q, forming a piece of each of the q distinct polygons at
each vertex.
Thus, if we can draw a fundamental triangle and reflect a set of points across a
hyperbolic line, we’ll be able to draw regular convex tessellations.
Figure 16.21: Left: Circle Limit IV, M.C. Escher, 1960. Right: annotated to show the center 6-gon that is tessellated.
Recall that a hyperbolic line between two points in the Poincaré disk is represented by the circle passing through those two points orthogonal to the unit circle (or by a diameter).
Moreover, reflection in that line is inversion in the circle (or reflection across
the diameter).
With these basic objects and operations, we can compute the hyperbolic line passing through two points. The inputs are two points which the hyperbolic line must
pass through, along with a circle it must be orthogonal to. The orthogonal circle
argument happens to be the boundary of D2, but the implementation does not depend
on this.
There is one simple case to start: when both points are already on the orthogonal
circle.
In this case, the hyperbolic line is the Euclidean circle whose center is the
intersection of the two tangent lines at the points, depicted in Figure 16.25. This
results in the following edge case in code.
"""Return a Circle that passes through the two given points and
is orthogonal to the given circle.
"""
# Edge case: both points lie on the given circle, so the center of the
# desired circle is the intersection of the two tangent lines.
line1 = circle.tangent_at(point1)
line2 = circle.tangent_at(point2)
center = line1.intersect_with(line2)
# [...construct the Circle centered at `center` through point1...]
If at least one point is not on the circle, then the output is computed as follows.
Invert the non-circle point in the circle (Proposition 16.16 guarantees
orthogonality), and the result is a set of three points, which uniquely determine
the equation of a circle.
def project(self, w):
    """Project self onto the input vector w."""

class Line:
    @staticmethod
    def through(p1, p2):
        """Return a Line through the two given points."""

class VerticalLine(Line):
    ...
The equation for the center of the circle passing through three given points can be
computed by setting up three equations and solving. The equations being solved are
built by substituting our known points into the equation of a circle. Here the
unknowns are cx, cy, and r.
    (x₁ − c_x)² + (y₁ − c_y)² = r²
    (x₂ − c_x)² + (y₂ − c_y)² = r²
    (x₃ − c_x)² + (y₃ − c_y)² = r²
A succinct way to express the solution to these equations is in terms of the ratios
of determinants of a cleverly chosen matrix. We haven’t talked about the
determinant in this book, but in addition to being a deeply meaningful quantity in
its own right, it shows up frequently in computational geometry. More about the
determinant in the Chapter Notes. In this case, the solution is summarized by
ratios of determinants of sub-matrices of the following matrix:

    [ x² + y²      x    y    1 ]
    [ x₁² + y₁²    x₁   y₁   1 ]
    [ x₂² + y₂²    x₂   y₂   1 ]
    [ x₃² + y₃²    x₃   y₃   1 ]

The determinant of such a matrix can be computed recursively: expand along a row, and for each entry recurse by computing the determinant of the smaller matrix obtained by deleting that entry's row and column, called a minor. Once the recursion reduces to determinants of 3 × 3 matrices, we can easily hard-code a formula.
[...edge case...]
point3 = circle.invert_point(point2)  # invert a point not on the circle

def row(point):
    (x, y) = point
    return [x ** 2 + y ** 2, x, y, 1]

# [...compute the determinants of the minors, then the center...]
circle_radius = math.sqrt(
    circle_center_x ** 2 + circle_center_y ** 2
    + detminor_1_4 / detminor_1_1)
return Circle(Point(circle_center_x, circle_center_y), circle_radius)
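To make the determinant recipe concrete, here is a self-contained sketch (plain tuples in place of the book's Point and Circle classes) that hard-codes the 3 × 3 determinant and reads the center and radius off the minors of the first row of the 4 × 4 matrix above:

```python
import math

def det3(m):
    # Hard-coded 3x3 determinant, the base case of the recursion.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def circle_through(p1, p2, p3):
    """Center and radius of the circle through three non-collinear points."""
    rows = [[x * x + y * y, x, y, 1] for (x, y) in (p1, p2, p3)]
    def minor(j):
        # The minor that deletes row 1 and column j+1 (1-indexed) of the
        # 4x4 matrix; only the three point-rows survive.
        return det3([[r[k] for k in range(4) if k != j] for r in rows])
    m11, m12, m13, m14 = minor(0), minor(1), minor(2), minor(3)
    cx = m12 / (2 * m11)
    cy = -m13 / (2 * m11)
    r = math.sqrt(cx * cx + cy * cy + m14 / m11)
    return (cx, cy), r
```

For instance, the circle through (1, 0), (0, 1), (−1, 0) comes out as the unit circle.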
This allows us to define relevant abstractions for a hyperbolic line and the
hyperbolic plane. An instance of the Poincaré disk is a circle, with methods to
compute a line through two given points. A hyperbolic line is a circle, which
happens to be orthogonal to the unit circle forming the boundary of the Poincaré
disk.
class PoincareDiskModel(Circle):
    def line_through(self, p1, p2):
        """Return the hyperbolic line through two points of the disk."""
        if orientation(p1, p2, self.center) == 'collinear':
            return Line.through(p1, p2)
        else:
            circle = ...  # [...the circle through p1, p2 orthogonal to self...]
            return PoincareDiskLine(circle.center, circle.radius)

class PoincareDiskLine(Circle):
    def reflect(self, point):
        """Reflecting in a hyperbolic line is inverting in its circle."""
        return self.invert_point(point)
To determine if three points are collinear, we again employ the determinant. More
generally, if you provide three points A = ( ax, ay) , B = ( bx, by) , C = ( cx,
cy) in sequence, one can determine via the sign of a determinant whether visiting
the points in order results in a clockwise turn, a counterclockwise turn, or a
straight line. The relevant matrix is
    [ a_x  a_y  1 ]
    [ b_x  b_y  1 ]
    [ c_x  c_y  1 ]
def orientation(a, b, c):
    a_x, a_y = a
    b_x, b_y = b
    c_x, c_y = c
    # The determinant of the matrix above, expanded by hand.
    det = (b_x - a_x) * (c_y - a_y) - (c_x - a_x) * (b_y - a_y)
    if det > 0:
        return 'counterclockwise'
    elif det < 0:
        return 'clockwise'
    else:
        return 'collinear'
The fundamental triangle has a vertex A at the origin with angle π/p, a vertex B with angle π/q, and a vertex D with a right angle. The requirement of the three angle measures, paired with the side AD lying on the horizontal axis, uniquely determines the positions of B and D. Let's derive this now.
Lemma 16.26. Define the constant Z = tan(π/p + π/q) tan(π/p). The coordinates of the fundamental triangle for the configuration [p, q] are A = (0, 0), B = (b_x, b_y), and D = (d_x, 0), where

    b_x = 1 / sqrt(1 + 2Z − (tan(π/p))²)
    b_y = b_x tan(π/p)
    g_x = b_x (Z + 1)
    r² = b_y² + (b_x − g_x)²
    d_x = g_x − r

and G = (g_x, 0) is the center of the circle of radius r containing the hyperbolic line through B and D.
Proof. The point B = ( bx, by) is defined to be on the line which makes an angle of
π/ p with the horizontal, i.e., y = tan( π/ p) x. Since A is the origin, hyperbolic
lines through A are the same as Euclidean lines. This gives the formula for by. B
also lies on a circle orthogonal to the unit circle that passes through D. Call
this unknown circle C, and suppose it has center G = (g_x, 0). Note that the y-coordinate of G must be zero in order for C to meet the horizontal axis at a right angle at D = (d_x, 0). Refer to Figure 16.26.
We’re asking for an angle of π/ q between the line y = tan( π/ p) x and the tangent
to this unknown circle C at B. Stare at the diagram in Figure 16.27 to convince
yourself that the desired tangent line must have an angle of π + π with the
horizontal, implying p
The equation of the unknown circle (in terms of our unknown quantities) is ( x −
gx)2+
by C′( bx, by) = −( bx − gx)/ by, and setting C′ = tan( π + π ), we solve for g p
x in terms
of bx as
(
( )
bx( Z + 1) = gx,
where Z = tan
tan
Figure 16.26: The unknown points computed in Lemma 16.26 are B, D, and G, which is
the center of the orthogonal circle C passing through B, D, that makes the desired
angle of π/ q with the top edge of the fundamental triangle.
Figure 16.27: By symmetry, the angle of the tangent line to C at B with the
horizontal is π/ p + π/ q.
If we can get another independent equation relating b_x and g_x, we can eliminate one variable and solve the entire system. The fact we have yet to use is that C and the unit circle are orthogonal. This gives a relationship between their radii, which form the legs of a right triangle: 1² + r² = g_x², where r² = b_y² + (b_x − g_x)². Solving this equation for b_x gives the formula stated in the lemma, and substitution provides the rest.
This results in the following code, whose documentation is far more tedious than
its implementation:
def compute_fundamental_triangle(tessellation_configuration):
    p = tessellation_configuration.num_polygon_sides
    q = tessellation_configuration.num_polygons_per_vertex

    # The formulas from Lemma 16.26.
    tan_p = math.tan(math.pi / p)
    Z = tan_p * math.tan(math.pi / p + math.pi / q)
    b_x = math.sqrt(1 / (1 + 2 * Z - tan_p ** 2))
    b_y = b_x * tan_p
    g_x = b_x * (Z + 1)
    r = math.sqrt(b_y ** 2 + (b_x - g_x) ** 2)
    d_x = g_x - r

    A = Point(0, 0)
    B = Point(b_x, b_y)
    D = Point(d_x, 0)
    return [A, B, D]
Finally, we have all the pieces we need to draw a tessellation. The majority of the
code is helpers. We output the drawing as an SVG file, and so in addition to using
a library to draw SVGs, we need to keep track of the differences in coordinate
systems. Beyond that, the core routine is quite simple.
First we define a configuration class for a tessellation (used above to compute the
fundamental triangle), followed by a class representing a tessellation. In the
latter, the compute_center_polygon method computes the center polygon by computing
the fundamental triangle, and then iteratively reflecting it across the appropriate
edges.
The remainder of the code21 involves rendering the edges of the polygons as SVG
arcs.
We also created a simple data structure that allows one to compare polygons for
equality in a principled way (since the process of reflecting them changes the
order of their vertices).
class TessellationConfiguration(
        namedtuple('TessellationConfiguration',
            ['num_polygon_sides', 'num_polygons_per_vertex'])):
    def __init__(self, *args):
        if not self.is_hyperbolic():
            raise Exception(
                "Configuration [%s, %s] must be hyperbolic." %
                (self.num_polygon_sides, self.num_polygons_per_vertex))

    def is_hyperbolic(self):
        # A [p, q] configuration is hyperbolic exactly when
        # 1/p + 1/q < 1/2, equivalently (p - 2)(q - 2) > 4.
        return ((self.num_polygon_sides - 2)
                * (self.num_polygons_per_vertex - 2) > 4)
class HyperbolicTessellation(object):
    def __init__(self, configuration):
        self.configuration = configuration
        # The disk model provides line_through and reflect; see the full
        # source at pimbook.org for its definition.
        self.disk_model = PoincareDiskModel(Point(0, 0), radius=1)
        self.center_polygon = self.compute_center_polygon()
        self.tessellated_polygons = self.tessellate()

    def compute_center_polygon(self):
        center, top_vertex, x_axis_vertex = compute_fundamental_triangle(
            self.configuration)
        p = self.configuration.num_polygon_sides

        """The center polygon's first vertex is the top vertex (the one that
        makes an angle of pi / q), because the x_axis_vertex is the center of
        an edge.
        """
        polygon = [top_vertex]
        p1, p2 = top_vertex, x_axis_vertex

        # Reflecting across two lines through the center rotates by twice
        # the angle between them, producing the next vertex each iteration.
        for _ in range(p - 1):
            p2 = self.disk_model.line_through(center, p1).reflect(p2)
            p1 = self.disk_model.line_through(center, p2).reflect(p1)
            polygon.append(p1)

        return polygon
21 See pimbook.org
    def tessellate(self, max_polygon_count=500):
        """Return the set of polygons that make up a tessellation of the
        center polygon. Keep reflecting polygons until drawing a total of
        max_polygon_count."""
        queue = deque()
        queue.append(self.center_polygon)
        tessellated_polygons = []
        processed = PolygonSet()

        while queue:
            polygon = queue.popleft()
            if processed.contains_polygon(polygon):
                continue

            edges = [(polygon[i], polygon[(i + 1) % len(polygon)])
                     for i in range(len(polygon))]
            for u, v in edges:
                line = self.disk_model.line_through(u, v)
                reflected_polygon = [line.reflect(p) for p in polygon]
                queue.append(reflected_polygon)

            tessellated_polygons.append(polygon)
            processed.add_polygon(polygon)

            if len(tessellated_polygons) >= max_polygon_count:
                break

        return tessellated_polygons
We close with some outputs for different configurations, shown in Figure 16.32.
1. Groups are the primary tool mathematics has for studying symmetry, and
geometric objects can be studied through the structure of those symmetries.
16.9 Exercises
16.1. Recall the symmetric group Sn is the set of all bijections of a set of n
elements.
Call the set being permuted {1, 2, 3, . . . , n}, and consider the following
helpful notation for a permutation: define a cycle notation whereby the tuple (1 3
4 2) represents the permutation σ mapping 1 ↦ 3, 3 ↦ 4, 4 ↦ 2, and 2 ↦ 1. All
other values are fixed by σ. Define a product of cycles, such as (going right to
left) (2 4)(1 2) = (1 4 2), as the composition of the corresponding maps. A cycle
of length 2 is called a transposition.
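To experiment with this notation, here is one possible Python sketch; the dict
representation and helper names are my own, not the book's.

```python
def cycle_to_map(cycle):
    """Convert a cycle like (1, 3, 4, 2) to the mapping it defines:
    each entry maps to the next, wrapping around at the end."""
    return {cycle[i]: cycle[(i + 1) % len(cycle)] for i in range(len(cycle))}

def compose(*cycles):
    """Compose cycles right to left, returning the resulting permutation
    as a dict on the elements that appear in some cycle."""
    support = sorted({x for c in cycles for x in c})

    def apply(x):
        for c in reversed(cycles):  # rightmost cycle acts first
            x = cycle_to_map(c).get(x, x)
        return x

    return {x: apply(x) for x in support}

# The text's example: (2 4)(1 2) = (1 4 2), i.e., 1 -> 4, 4 -> 2, 2 -> 1.
assert compose((2, 4), (1, 2)) == {1: 4, 2: 1, 4: 2}
```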
A [3, 7] tessellation
A [5, 5] tessellation
A [6, 6] tessellation
A [7, 7] tessellation
Prove that every permutation can be written as a product of disjoint cycles. Prove
that the n-cycle (1 2 3 · · · n) and a single transposition (1 2) are a generating
set for Sn.
1. aH = bH if and only if b⁻¹a ∈ H.
3. Let G be a group. Given a subgroup H ⊂ G, show that the set of all cosets of H
partition G into disjoint subsets. Conclude that “being in the same coset of H”
is an equivalence relation.
16.7. Let G be a finite group and H a subgroup. Prove |H| evenly divides |G|. Use
this to prove that for any a ∈ G, a|G| is the identity.
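A brute-force check of this corollary for one small example, the multiplicative
group (Z/nZ)× with the arbitrary choice n = 20; this is an illustration, not part
of the exercise's proof.

```python
from math import gcd

n = 20  # an arbitrary modulus for illustration
G = [a for a in range(1, n) if gcd(a, n) == 1]  # the group (Z/nZ)^x
order = len(G)  # |G| = phi(20) = 8

# The corollary of Lagrange's theorem: a^|G| is the identity for all a.
for a in G:
    assert pow(a, order, n) == 1
```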
16.9. Prove Theorem 16.12, assembling the pieces laid out in the chapter.
16.11. Research and implement the ElGamal digital signature scheme using (Z/nZ)×.
16.12. Look up the definition of a semi-direct product of groups, and use this to
understand the characterization of the dihedral group D2n as a semi-direct
product of Z/2Z acting on Z/nZ.
16.13. If you’re comfortable with complex numbers, find a source online that
discusses the symmetry groups of the roots of polynomials with coefficients in Q.
At the risk of referring to an interactive essay that has disappeared from the
internet after this book is published, see Fred Akalin’s essay, “Why is the Quintic
Unsolvable?” 22
16.15. Two graphs are called isomorphic if there is a bijection between their
vertex sets having the same property as a symmetry: all adjacencies and non-
adjacencies are preserved. The problem of efficiently computing whether two graphs
are isomorphic is one of the most famous open problems in computer science, called
the graph isomorphism problem. Prove that the graph isomorphism problem reduces to
the problem of computing a generating set of the symmetry group of a single graph.
16.16. Prove that any Euclidean isometry in E( n) can be written as the product of
at most n + 1 reflections.
16.17. Read about determinants and understand why the formula we presented in
Section 16.7 for the circle passing through three given points is correct.
16.18. Research the cross ratio in the context of projective geometry. How is it
defined there? What are the projective transformations, and why do they preserve
the cross ratio?
22 https://www.akalin.com/quintic-unsolvability
16.21. We neglected to give a good intuition for why the hyperbolic distance
function is intuitively a good choice. The reason is that the morally acceptable
way to think about this function involves integral calculus, which we avoided in
this book. To do this formally, one defines a metric tensor or line element that
describes the length of a curve via an integral. Research these topics to
understand how the hyperbolic metric is defined. Be warned that many sources jump
straight into advanced terminology and concepts. You’re looking for an
“introduction to tensor calculus” or an “introduction to Riemannian geometry.”
Because of the close relation to physics and general relativity, there are also
many sources explaining these concepts for physicists. Apply the usual caveats that
come with physicists explaining mathematics.
16.22. Extend the hyperbolic tessellation program in this chapter to one which,
when given an input motif (an image that replaces the fundamental triangle), draws
a hyperbolic polygon using that image and then tessellates the Poincaré disk.
16.23. A different model of hyperbolic geometry is the upper half-plane model. This
model has as points the complex numbers {a + bi : b > 0 }, and as lines the half
circles orthogonal to the horizontal axis b = 0, along with vertical rays. The line
b = 0 forms the
“boundary” analogous to the unit circle bounding the Poincaré disk. The isometries
of this model are the so-called Möbius transformations. For these exercises it may
help to read the section in the chapter notes about the complex matrix
representation of hyperbolic isometries. Prove the following.
1. The set of Möbius transformations, those mappings of the complex plane defined
by z ↦ (az + b)/(cz + d) with ad − bc ≠ 0, form a group under function
composition.
5. Find a bijection between the upper half plane and the Poincaré disk that
preserves hyperbolic lines.
16.24. Yet another model of hyperbolic geometry is the Minkowski hyperboloid model.
3. How hyperbolic lines in the Minkowski model correspond to lines in the Poincaré
disk model.
4. Using the above, write a program that draws a hyperbolic tessellation in the
Minkowski model, and then projects it to the Poincaré disk. What are the
advantages and disadvantages of doing it this way, instead of directly in the
Poincaré model?
16.25. In this exercise we’ll explore the symmetry group of the hyperbolic
tessellation of a regular convex p-gon with configuration [ p, q]. Fix the
fundamental triangle of the configuration, and consider the reflections α, β, γ
across each edge. What are the algebraic relations between these symmetries? Can
you identify the resulting (infinite) group of symmetries with a subgroup of a
familiar group?
Like many topics in mathematics, the discovery of the hyperbolic plane was far more
roundabout than its final form. The first hyperbolic geometry was discovered on the
surface of revolution of the so-called tractrix, which is itself derived indirectly
from the catenary curve—the name for the not-quite-parabolic shape formed by an
ideal rope hanging from its ends under its own weight.
and irrelevant for our purposes—in terms of symmetry groups. Specifically, let Sn
be the permutation group on the set of rows of a matrix A. See Exercise 16.1 and
its sequel for more details on what permutation groups look like. For σ ∈ Sn define
(−1)^σ to be the parity of σ (1 if σ is an even permutation and −1 if it is odd).
det A = ∑_{σ ∈ Sn} (−1)^σ ∏_{i=1}^{n} a_{i, σ(i)}
That is, for each permutation you take the products of the entries of A whose rows
and columns are input-output pairs of σ, scale by the parity of σ, and sum.
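As a sketch (not the book's implementation), the permutation-sum formula can be
transcribed directly into Python. It sums n! terms, so it is for illustration
only; the function names are my own.

```python
import math
from itertools import permutations

def sign(perm):
    """Parity of a permutation of (0, ..., n-1): +1 if even, -1 if odd,
    computed by counting inversions."""
    inv = sum(1 for i in range(len(perm))
                for j in range(i + 1, len(perm))
                if perm[i] > perm[j])
    return -1 if inv % 2 else 1

def det(A):
    """Determinant via the permutation-sum formula: for each permutation
    sigma, multiply the entries A[i][sigma(i)], scale by the parity of
    sigma, and sum over all sigma."""
    n = len(A)
    return sum(sign(sigma) * math.prod(A[i][sigma[i]] for i in range(n))
               for sigma in permutations(range(n)))

assert det([[1, 2], [3, 4]]) == -2
```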
For example, for the (signed) area of a triangle T with vertices (ax, ay),
(bx, by), (cx, cy), we used

    ⎡ ax  ay  1 ⎤
det ⎢ bx  by  1 ⎥
    ⎣ cx  cy  1 ⎦
Multiple times throughout this book, we’ve avoided using complex numbers, resulting
in some slightly nonstandard work. This was essentially a cop out.23 Be that as it
may, the group structure of hyperbolic isometries is best studied with complex
numbers.
The briefest review: the set C = {a+ ib : a, b ∈ R } is called the set of complex
numbers, where i is the “complex unit,” i.e., it’s a unit vector defined to be
linearly independent from 1. There is a bijection C → R2 via a + ib 7→ ( a, b), so
that complex numbers can be viewed as a plane. Using this view, denote by arg( a +
bi) the angle between ( a, b) and (1 , 0) (chosen to be in the interval [0 , 2 π)),
denote by |a + bi| the length of ( a, b), and define multiplication of a + ib =
( a, b) by i as the rotation of ( a, b) by 90
degrees counterclockwise. Extrapolate from this that i² = −1, and assert the usual
arithmetic rule that (a + ib)(c + id) = (ac − bd) + i(ad + bc).
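Python's built-in complex type can confirm this rule on random inputs; the snippet
below is just a sanity check, not part of the book's code.

```python
import random

# i * i should be -1 under the rotation-based definition.
assert 1j * 1j == -1

# Check (a + ib)(c + id) = (ac - bd) + i(ad + bc) on random inputs.
for _ in range(1000):
    a, b, c, d = (random.uniform(-10, 10) for _ in range(4))
    lhs = complex(a, b) * complex(c, d)
    rhs = complex(a * c - b * d, a * d + b * c)
    assert abs(lhs - rhs) < 1e-9
```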
23 I like complex numbers, but I thought the book was getting too long to fit a
full chapter. The topic deserves nothing less, and I’m aware of the irony of this
section.
f⁺_{a,b}(z) = (az + b) / (b̄z + ā)

f⁻_{a,b}(z) = (az̄ + b) / (b̄z̄ + ā)
Also force a, b to satisfy |a| 2 − |b| 2 = 1. These are the isometries of the
Poincaré disk.
Proof. The proof is left in the exercises for those who feel comfortable with
complex numbers.
The functions f⁺_{a,b} are “orientation preserving” isometries of D², meaning they
are a hyperbolic analogue of rotations rather than reflections.24 Each corresponds
to a 2 × 2 complex matrix:

f⁺_{a,b} ↦ [ a  b ]
           [ b̄  ā ]
And if you multiply the matrices, you get the composition of the two maps:

f⁺_{c,d} ∘ f⁺_{a,b} = f⁺_{ac + db̄, bc + dā}
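To sanity-check this composition law numerically, here is a short sketch using
Python's built-in complex numbers. The helper names and the sample values of a, b,
c, d are my own, normalized so that |a|² − |b|² = 1.

```python
def f_plus(a, b):
    """The disk isometry z -> (a z + b) / (conj(b) z + conj(a))."""
    return lambda z: (a * z + b) / (b.conjugate() * z + a.conjugate())

def normalize(a, b):
    """Scale (a, b) so that |a|^2 - |b|^2 = 1 (assumes |a| > |b|)."""
    s = (abs(a) ** 2 - abs(b) ** 2) ** 0.5
    return a / s, b / s

a, b = normalize(2 + 1j, 1 - 0.5j)
c, d = normalize(1.5 - 1j, 0.3 + 0.2j)

# The composition law: f_{c,d} o f_{a,b} = f_{ca + d*conj(b), cb + d*conj(a)},
# i.e., the parameters come from multiplying the two matrices.
a2, b2 = c * a + d * b.conjugate(), c * b + d * a.conjugate()

z = 0.3 + 0.4j  # an arbitrary point inside the unit disk
lhs = f_plus(c, d)(f_plus(a, b)(z))
rhs = f_plus(a2, b2)(z)
assert abs(lhs - rhs) < 1e-9
```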
Let G = {f⁺_{a,b} : a, b ∈ C} be the set of orientation preserving isometries.
The full isometry group of the Poincaré disk is G ∪ f⁻_{1,0}G, where f⁻_{1,0} is
complex conjugation z ↦ z̄. Another way to describe it is
that G, the orientation preserving isometries, is the quotient of the full isometry
group by the subgroup consisting of the identity and a single reflection.
24 Orientation has a technical definition that encodes the intuitive idea that
“reversing orientation” turns “hello” into its mirror image and vice versa—though
for hyperbolic isometries it will have the expected additional warping.
Chapter 17
A New Interface
We are no longer constrained by pencil and paper. The symbolic shuffle should no
longer be taken for granted as the fundamental mechanism for understanding quantity
and change. Math needs a new interface.
This book has been quite a journey. We laughed. We cried. We computed with matrices
like fury.
Math is a human activity. It’s messy and beautiful, complicated and elegant, useful
and bull-headedly frustrating. But in reading this book, dear reader, my dream is
that you have found the attitude, confidence, and enough prerequisite knowledge to
continue to engage with mathematics beyond these pages. I hope that you will find
the same joy that I have in the combination of math and programming.
You may be wondering what’s next. Each topic in this book was only covered lightly.
There’s a vast world of math out there, in the form of books, blog posts, video
lectures, and the questions from your own curiosity. So much to explore! I included
an annotated list of resources in Appendix C to whet your appetite.
In these closing words, I’d like to explore a vision for how mathematics and
software can grow together. Much of our effort in this book involved understanding
notation, and using our imagination to picture arguments written on paper. In
contrast, there’s a growing movement that challenges mathematics to grow beyond its
life on a chalkboard.
One of the most visible proponents of this view is Bret Victor. If you haven’t
heard of him or seen his fantastic talks, please stop reading now and go watch his
talk, “Inventing on Principle.” It’s worth every minute.1 Victor’s central thesis
is that creators must have an immediate connection to their work. As such, Victor
finds it preposterous that programmers often have to write code, compile, run,
debug, and repeat every time they make a change. Programmers shouldn’t need to
simulate a machine inside their head when designing a program—there’s a machine
sitting right there that can perform the logic perfectly!
1 https://vimeo.com/36579366
Victor reinforces his grand, yet soft-spoken ideas with astounding prototypes. But
his ideas are deeper than a flashy user interface. Victor holds a deep reverence
for ideas and enabling creativity. He doesn’t want to fundamentally change the way
people interact with their music library. He wants to fundamentally change the way
people create new ideas. He wants to enable humans to think thoughts that could not
previously have been thought at all. You might wonder what one could possibly mean
by “think new thoughts,” but fifteen minutes of Victor’s talk will show you, and
make you wonder how we ever made do with the typical software write-compile-run
loop. His demonstrations rival the elegance of the finest mathematical proofs.
Just as Lamport’s structured proof hierarchies and automated assistants are his key
to navigating complex proofs, and similarly to how Atiyah’s most effective tool is
a tour of ideas that pique his interest, Victor feels productive when he has an
immediate connection with his work. A large part of it is having the thing you’re
creating react to modifications in real time. Another aspect is simultaneously
seeing all facets relevant to your inquiry. Rather than watch a programmed car move
over time, show the entire trajectory for a given control sequence, the view
updating as the control sequence updates.
It should not surprise you, then, that Victor despises mathematical notation. In
his essay “Kill Math,” Victor argues that a pencil and paper is the most antiquated
and unhelpful medium for using mathematics. Victor opines on what a shame it is
that so much knowledge is only accessible to those who have the unnatural ability
to manipulate symbols on paper. How many good ideas were never thought because of
that high bar?
Overall, I agree with Victor’s underlying sentiment. Lots of people struggle with
math, and a better user interface for mathematics would immediately usher in a new
age of enlightenment. This isn’t an idle speculation. It has happened time and time
again throughout history. The Persian mathematician Muhammad ibn Musa al-Khwarizmi
invented
algebra (though without the symbols for it) which revolutionized mathematics,
elevating it above arithmetic and classical geometry, quickly scaling the globe.
Make no mistake, the invention of algebra literally enabled average people to do
contemporarily advanced mathematics.3 I’m surprised Victor does not reference
algebra as a perfect example of a tool for thinking new thoughts, even if he then
argues its time has passed.

2 It’s amusing to see an audience’s wild applause for this, when the same people
might easily have groaned as students being asked to sketch (or parse a plot of)
the trajectories of a differential equation, despite the two concepts being
identical. No doubt it is related to the use of a video game.
And it only gets better, deeper, and more nuanced. Shortly after the printing press
was invented, French mathematicians invented modern symbolic notation for algebra,
allowing mathematics to scale up in complexity. Symbolic algebra was a new user
interface that birthed countless new thoughts. Without this, for example,
mathematicians would never have discovered the connections between algebra and
geometry that are so prevalent in modern mathematics and which lay the foundation
of modern physics. Later came the invention of set theory, and shortly after
category theory, which were each new and improved user interfaces that allowed
mathematicians to express deeper, more unified, and more nuanced ideas than was
previously possible.
Meanwhile, many of Victor’s examples of good use of his prototypes are “happy
accidents.” Immediacy makes it dreadfully easy to explore examples, which is one of the most
important techniques I hope you take away from this book! But what algebraic
notation and its successors bring to the table beyond happenstance is to scale in
complexity beyond the problem at hand. While algebra limits you in some ways—you
can’t see the solutions to the equations as you write them—it frees you in other
ways. You need not know
how to find the roots of a polynomial before you can study them. You need not have
a complete description of a group before you start finding useful homomorphisms. As
Sir Arthur Eddington said, group theory studies operations that are as unknown as
the quantities that they operate on. We didn’t need to understand precisely how
matrices correspond to linear maps before studying them, as might be required to
provide a useful interface meeting Victor’s standards. Indeed, it was algebraic
grouping and rearranging (with cognitive load reduced by passing it off to paper)
that provided the derivation of matrices in the first place.
Then there are the many “interfaces” that we’ve even seen in this book: geometry
and the Cartesian plane, graphs with vertices and edges, pyramids of balls with
arrows, drawings of arcs that we assert are hyperbolic curves, etc. Mathematical
notation goes beyond
“symbol manipulation,” because any picture you draw to reason about a mathematical
object is literally mathematical notation.
I see a few ways Victor’s work falls short of enabling new modes of thought,
particularly insofar as it aims to replace mathematical notation. I’ll outline the
desiderata I think a new interface for mathematics must support if it hopes to
replace notation.
The last two properties are of particular importance for any interface. Important
interfaces throughout history satisfy the last two, including spoken language,
writing, most tools for making art and music, spreadsheets, touchscreens and
computer mice, keyboards,4 and even the classic text editors vim and emacs—anyone
can use them in a basic fashion, while experts dazzle us with them.
Lumped in with this is population reasoning. I need to be able to reason about the
entire class of all possible objects satisfying some properties. The set of all
algorithms that compute a function (even if no such algorithm exists), or the set
of all distance-preserving functions of an arbitrary space. These kinds of
deductions are necessary to organize and synthesize ideas from disparate areas of
math together (connecting us to
A different view is that a useful interface for mathematics must necessarily allow
the mathematician to make mistakes. But part of the point of a new interface was to
avoid the mistakes and uncertainty that pencil and paper make frequent! It’s not
entirely clear to me whether counterfactual reasoning necessarily enables mistakes.
It may benefit from a tradeoff between the two extremes.
Meaning Assignment
can write f(ab) = f(a)f(b) and overload which multiplication means what. I can
define a new type of arrow ↪ on the fly and say “this means injective map.”
Ideally the interface also makes the assignment and management of meaning easy.
That is, if I’ve built up an exploration of a problem involving pennies on a table,
I should easily be able to change those pennies to be coins of arbitrary unknown
denomination. And then allow them to be negative-valued coins. And then give them a
color as an additional property. And it should be easy to recall what semantics are
applied to which objects later. If each change requires me to redo large swaths of
work (as many programs built specifically to explore such a problem would), the
interface will limit me. With algebraic notation, I could simply add another index,
or pull out a colored pencil (or pretend it’s a color with shading), and continue
as before. In real life I just say the word, even if doing so makes the problem
drastically more difficult.
Flexible Complexity
Music is something that exhibits flexible complexity. A child raps the keys of a
piano and makes sounds. So too does Ray Charles, though his technique is
multifaceted and deliberate.
Mathematics has a similar dynamic range that can accommodate the novice and the
expert alike. Anyone can make basic sense of numbers and unknowns. Young children can
understand and generate simple proofs. With a decent grasp of algebra, one can
compute difficult sums. Experts use algebra to develop theories of physics, write
computer programs with provable guarantees, and reallocate their investment
portfolios for maximum profit.
The closest example of an interface I’ve seen that meets the kind of flexible
complexity I ask of a replacement for mathematics is Ken Perlin’s Chalktalk.5
Pegged as a “digital presentation and communication language,” the user may draw
anything they wish. If the drawing is recognized by the system, it becomes
interactive according to some prespecified rules. For example, draw a circle at the
end of a line, and it turns into a pendulum you can draw to swing around. Different
pieces are coupled together by drawing arrows; one can plot the displacement of the
pendulum by connecting it via an arrow to a plotting widget. Perlin displays
similar interactions between matrices, logical circuits, and various sliders and
dials.
5 https://github.com/kenperlin/chalktalk

Chalktalk falls short in that your ability to use it is limited by what has been
explicitly
programmed into it as a behavior. If you don’t draw the pendulum just right, or you
try to connect a pendulum via an arrow to a component that doesn’t understand its
output, you hit a wall. To explain to the interface what you mean, you write a
significant amount of code. This isn’t a deal breaker, but rather where I
personally found the interface struggling to keep up with my desires and
imagination. What’s so promising about Chalktalk is that it allows one to offset
the mental task of keeping track of interactions that algebraic notation leaves to
manual bookkeeping.
Incrementalism
Incrementalism means that if I want to repurpose a tool for a new task, I don’t
already need to be an expert in the target task to use the tool on it. If I’ve
learned to use a paintbrush to paint a flower on a canvas, I need no woodworking
expertise to paint a fence.
study classical geometry, and many such systems exist (Geogebra is a popular one,
and quite useful in its own right!). You could enable this system to draw and
transform various shapes on demand. You can phrase theorems from Euclidean geometry
in it, and explore examples with an immediate observation of the effect of any
operation.
Now suppose we want to study parallel lines; it may be as clear as day from
simulations that two parallel lines never intersect, but does this fact follow from
the inherent properties of a line? Or is it an artifact of the implementation of
the simulation? As we remember, efficient geometry algorithms can suffer from
numerical instability or fail to behave properly on certain edge cases. Perhaps
parallel lines intersect, but simply very far away and the interface doesn’t
display it well? Or maybe an interface that does display far away things happens to
make non-intersecting lines appear to intersect due to the limitations of our human
eyes and the resolution of the screen.
In this system, could one study the possibility of a geometry in which parallel
lines always intersect? With the hindsight of Chapter 16 we know such geometries
exist (projective geometry has this property), but suppose this was an unknown
conjecture. To repurpose our conventional interface for studying geometry would
seem to require defining a correct model for the alternative geometry in advance.
Worse, it might require us to spend weeks or months fretting over the computational
details of that model. We might hard-code an intersection point, effectively
asserting that intersections exist. But then we need to specify how two such hard-
coded points interact in a compatible fashion, and decide how to render them in a
useful way. If it doesn’t work as expected, did we mess up the implementation, or
is it an interesting feature of the model? All this fuss before we even know
whether this model is worth studying!
This is mildly unfair, as the origins of hyperbolic geometry did, in fact, come
from concrete models. The point is that the inventors of this model were able to
use the sorts of indirect tools that precede computer-friendly representations.
They didn’t need a whole
class of new insights to begin. If the model fails to meet expectations early on,
they can throw it out without expending the effort that would have gone into
representing it within our hypothetical interface.
Most of my objections boil down to the need to create abstractions not explicitly
programmed into the interface. Mathematics is a language, and its expressiveness
is a core feature. Like language, humans use it primarily to communicate to one
another. Like writing, humans use it to record thoughts in times of inspiration, so
that memory can be offset to paper and insights can be reproduced faithfully later.
Paraphrasing Thurston, mathematics only exists in the social fabric of the people
who do it. An interface purporting to replace mathematical notation must build on
the shoulders of the existing mathematics community. As Isaac Newton said, “If I
have seen further it is by standing on the shoulders of giants.”
The value of Victor’s vision lies in showing us what we struggle to see in our
minds.
Now let’s imagine an interface that satisfies our desiderata, but also achieves
immediacy with one’s work. I can do little more than sketch a dream, but here it
is.
Let’s explore a puzzle played on an infinite chessboard, which I first learned from
mathematician Zvezdelina Stankova via the YouTube channel Numberphile.6 You start
with an integer grid N × N, and in each grid cell ( i, j) you can have a person or
no person.
The people are called “clones” because they are allowed to take the following
action: if cells ( i + 1 , j) and ( i, j + 1) are both empty, then the clone in
cell ( i, j) can split into two clones, which now occupy spaces ( i + 1 , j) , ( i,
j + 1), leaving space ( i, j) vacant. You start with three clones in “prison” cells
(1 , 1) , (1 , 2) , (2 , 1), and the goal is to determine if there is a finite
sequence of moves, after which all clones are outside the prison. For this reason,
Stankova calls the puzzle “Escape of the Clones.”
Left: An example move in “Escape of the Clones” whereby the solid-bordered clone
transforms into the two dotted-border clones. Right: the starting configuration for
the puzzle.
6 http://youtu.be/lFQGSGsXbXE
Suppose that our dream interface is sufficiently expressive that it can encode the
rules of this puzzle, and even simulate attempts to solve it. If the interface is
not explicitly programmed to do this, it would already be a heroic accomplishment
of meaning assignment and flexible complexity.
Now after playing with it for a long time, you start to get a feeling that it is
impossible to free the clones. We want to use the interface to prove this, and we
can’t already know the solution to do so. This is incrementalism.
Then we can, with the aid of the interface, compute the weight-sum of any given
configuration. The starting region’s weight is 2, and it remains 2 after any
sequence of operations. It dawns on us to try filling the entire visible region
outside the prison with clones. We have assumed to the contrary that an escape
sequence exists, in which the worst case is that it fills up vast regions of the
plane. The interface informs us that our egregiously crowded region has weight
1.998283. We then ask the interface to fill the entire complement of the prison with
clones (even though that is illegal; the rules imply you must have a finite
sequence of moves!). It informs us that weight is also 2. We realize that if any
cell is cloneless, as must be true after a finite number of moves, we will have
violated the invariant. This is counterfactual reasoning.
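The invariant argument can be simulated directly. In the sketch below, the weight
function w(i, j) = 2^(2 − i − j) is an assumed reconstruction on my part (the
passage defining the weights was not reproduced here), chosen so that a clone move
preserves total weight and the prison has weight 2; the helper names are mine.

```python
from fractions import Fraction

def weight(cell):
    """w(i, j) = 2^(2 - i - j), chosen so that a clone split preserves
    total weight: w(i, j) = w(i + 1, j) + w(i, j + 1)."""
    i, j = cell
    return Fraction(1, 2 ** (i + j - 2))

def split(occupied, cell):
    """Apply the clone move at `cell` if it is legal (both target cells
    empty); otherwise return the configuration unchanged."""
    i, j = cell
    right, up = (i + 1, j), (i, j + 1)
    if cell in occupied and right not in occupied and up not in occupied:
        return (occupied - {cell}) | {right, up}
    return occupied

prison = {(1, 1), (1, 2), (2, 1)}
total = sum(weight(c) for c in prison)        # the invariant value: 2
after = split(prison, (2, 1))                 # a legal first move
assert split(prison, (1, 1)) == prison        # blocked: neighbors occupied
assert sum(weight(c) for c in after) == total # weight is preserved
```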
While we may never understand such deep questions, it’s clear that abstract logic
puzzles and their proofs provide an excellent test bed for proposals. Mathematical
puzzles are limited, but rich enough to guide the design of a proposed interface.
Games involve simple explanations for humans with complex analyses (flexible
complexity), drastically different semantics for abstract objects like chessboards
and clones (meaning assignment), there are many games which to this day still have
limited understanding by experts (incrementalism), and the insights in many games
involve reasoning about hypothetical
In his book “The Art of Doing Science and Engineering,” the mathematician and
computer scientist Richard Hamming put this difficulty into words quite nicely,
It has rarely proved practical to produce exactly the same product by machines as
we produced by hand. Indeed, one of the major items in the conversion from hand to
machine production is the imaginative redesign of an equivalent product. Thus in
thinking of mechanizing a large organization, it won’t work if you try to keep
things in detail exactly the same, rather there must be a larger give-and-take if
there is to be a significant success.
You must get the essentials of the job in mind and then design the mechanization to
do that job rather than trying to mechanize the current version—if you want a
significant success in the long run.
Mechanizing manual human processes requires arduously encoding the loose judgments made by
humans—often inconsistent and based on folk lore and experience. Software almost
always falls short of really solving your problem. Accommodating the shortcomings
requires a whole extra layer of process.
We write programs to manage our files, and in doing so we lose much of the spatial
reasoning that helps people remember where things are. The equivalent product is
that the files are stored and retrievable. On the other hand, for mathematics the
equivalent product is human understanding. This should be no surprise by now,
provided you’ve come to understand the point of view espoused throughout this book.
In this it deviates from software. We don’t want to retrieve the files, we want to
understand the meaning behind their contents.
My imagination may thus defeat itself by failing to give any ground. If a new
interface is to replace pencil and paper mathematics, must I give up the ease of
some routine mathematical tasks? Or remove them from my thinking style entirely?
Presuming I can achieve the same sorts of understanding—though I couldn’t say how—
the method of arrival shouldn’t matter. And yet, this attitude ignores my
experience entirely. The manner of insight you gain when doing mathematics is
deeply intertwined with the method of inquiry. That’s precisely why Victor’s
prototypes allow him to think new thoughts!
Pencil and paper may be the wrong tool for the next generation of great thinkers. But if we
hope to enable future insights, we must understand how and why the existing tools
facilitated the great ideas of the past. We must imbue the best features of history
into whatever we build. If you, dear programmer, want to build those tools, I hope
you will incorporate the lessons and insights of mathematics.
Until then!
Appendix A
Notation
A lookup table for the notation used in this book, roughly ordered by chapter.
Refer to the ‘notation’ entry of this book’s index to find the page where the
notation is introduced.
1 https://en.wikipedia.org/wiki/List_of_mathematical_symbols
Symbol          Meaning                          Related / Notes
N               Natural numbers
Z               Integers
R               Real numbers
∑_{i=1}^n       Sum
∏_{i=1}^n       Product
∈               Set membership, “is in”          ∉
f⁻¹             Preimage, inverse                f⁻¹(x), f⁻¹(A)
im              Image                            f(A)
⊂               Set subset, “is contained in”    ⊆, ⊊
□               End of proof                     QED
(n k)           “n choose k”
∘               Function composition
↦               “Maps to”
≈               Approximately
⌈−⌉             Ceiling
⌊−⌋             Floor
lim             Limit
f′              f “prime,” the derivative        df/dx
df/dx           Derivative                       f′
[a, b]          Closed interval                  The set {x ∈ R : a ≤ x ≤ b}
(a, b)          Open interval
∃               Exists
∀               For all
[−]             Equivalence class
X/∼             Set quotient                     ∼ is an equivalence relation
≡               Equivalent to                    a ≡ b mod n
⟨−, −⟩          Inner product
⊥               “perp,” orthogonal complement
arg             Argument
∇               Gradient
∂f/∂x           Partial derivative               f_x
∂²f/∂x∂y        Second partial derivative        f_xy
D               Total derivative
dx              Differential
|               Evaluated at
O               Order of magnitude               o, Ω, ω, Θ; cf. Chapter 15
ker             Kernel
C               Complex numbers
i               Complex unit                     a + bi
z̄               Complex conjugation
Appendix B
A Summary of Proofs
Most mathematical proofs share a common structure. Each has a theorem they’d like
to prove, starts from some true statement, and applies simple logical deductions to
eventually arrive at the desired claim. Mathematical logic is the mathematical
study of frameworks for proving theorems in such a way that could be parsed and
verified by a computer.
Most of the logic we need in this book is covered by propositional logic—the same
sorts of rules that govern the evaluation of a conditional test in a programming
language—along with quantifiers for reasoning about classes of objects. Together
this is called first-order logic. 1
Instead, I will approach propositional logic more casually. I will describe the
syntax and semantics of first-order logic in plain language, while a more typical
reference would appeal heavily to formulas and symbols. I think many programmers
would benefit from a syntactic approach, but then there is a process of returning
to plain-English proofs, because very few proofs are written in a style that
emphasizes the syntax of first-order logic. It simply makes proofs harder to read.
As I hope I’ve stressed enough in this book: mathematical proofs are intended to be
written in prose optimized for human readers, and to further human understanding in
a way that uses syntax and notation as one of many tools. Nevertheless, the formal
foundations for the correctness of mathematical proofs have occupied mathematicians
for centuries, and it is worthwhile to see how it is done, even if most proofs need
nowhere near as much formality.
Figure B.1: a truth table showing that P ↔ Q and “(P → Q) and (Q → P)” always agree.

P   Q   P → Q   Q → P   (P → Q) and (Q → P)   P ↔ Q
T   T     T       T             T               T
T   F     F       T             F               F
F   T     T       F             F               F
F   F     T       T             T               T
As a human you may not know how to tell if a statement is true or false (such as
“there are infinitely many prime integers”), but its truth value doesn’t depend on
information not specified in the proposition itself. Of course, you might use
the name P to refer to a generic proposition, but that is a “variable” of our
analysis, not part of the logic itself. Confusingly, some call generic propositions
“propositional variables.” This will be contrasted with first-order logic
momentarily, where variables are first-class citizens.
The core operations performed on propositions are logical connectives, like “and,”
“or,” and “if and only if.” The last of these connects two propositions P and Q by
asserting that the truth of P is identical to the truth of Q. That is, if P is true
then Q must be true, and if P is false then Q must be false.
“If-then” statements in propositional logic are often written using an arrow, which
denotes “logical implication.” You might see P → Q, which is the same as “if P then
Q.”
For any generic compound proposition, one can write down a truth table that
describes the full range of possible truth values the syntactic statement can
assume. For example, Figure B.1 shows the truth table that proves P ↔ Q is an
equivalent statement to “(P → Q) and (Q → P).”
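Since two propositions admit only four truth assignments, this equivalence can be checked by brute force. A small Python sketch (my own, for illustration):

```python
from itertools import product

def implies(p, q):
    # "P implies Q" is false only when P is true and Q is false.
    return (not p) or q

# Enumerate all four truth assignments and check that P <-> Q agrees
# with (P -> Q) and (Q -> P) on every one of them.
for p, q in product([True, False], repeat=2):
    iff = (p == q)                          # P <-> Q
    both = implies(p, q) and implies(q, p)  # (P -> Q) and (Q -> P)
    assert iff == both, (p, q)

print("P <-> Q is equivalent to (P -> Q) and (Q -> P)")
```

This brute-force check is exactly a truth table: one row per iteration of the loop.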
First-order logic adds variables to propositional logic, meaning statements can
have unknown truth values. A claim in first-order logic is called a formula. For
example if x is stated to be a variable ranging over the integers, then “x is even”
is a formula, but its truth value is undetermined absent more knowledge about x.
However, if you interpret x as 8, then “x is even” is a true formula; “for every x,
x is even,” is a false formula; and “there is an x such that x is even” is a true
formula. These are the three ways that a variable can become “bound” in first-order
logic. A variable can be assigned a concrete value. A variable can be universally
quantified, meaning we claim the formula is true for all possible assignments. Or,
finally, a variable can be existentially quantified, meaning we claim the formula
is true for at least one possible assignment. If all variables in a formula are
bound, then the formula has a truth value. Often the symbol ∀ is used for the
universal quantifier, and ∃ for the existential quantifier.
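Over a finite domain, the three ways of binding a variable correspond directly to Python constructs: concrete assignment, all(), and any(). A small sketch (the domain here is my own choice, for illustration):

```python
# A finite domain of integers, standing in for the values x may range over.
domain = range(-3, 4)

def is_even(x):
    return x % 2 == 0

# Binding x to a concrete value gives an ordinary proposition.
assert is_even(8)

# "For all x, x is even" -- false, since e.g. 1 is in the domain.
assert not all(is_even(x) for x in domain)

# "There exists an x such that x is even" -- true, e.g. x = 0.
assert any(is_even(x) for x in domain)
```

Of course, real quantifiers range over infinite domains like the integers, where no exhaustive loop can decide truth; that gap is exactly why proofs are needed.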
The domain of values variables can assume is specified by the logical framework
itself.
For example, one may describe a first-order logic for integers whose values are the
symbols {…, −3, −2, −1, 0, 1, 2, 3, …}—syntactically to the
logic they are mere symbols, arbitrary as any other, but in our hearts they are the
esteemed integers—and which has the additional symbols <, =. This would allow you
to syntactically phrase mathematical statements pertaining to the ordering of
integers.
Once you have a set of rules for constructing statements and interpreting their
truth values, you need a set of rules for inferring truth values of statements from
known truth values of other statements. There is a long list of inference rules,
most of which are common sense. For example, there is a rule (often called modus
ponens) that says that if you know P is true, and if you know P → Q is true, then
you may conclude that Q
is true. Similarly, if “P and Q” is true, then you can conclude that P is true. One
more: from “not not P ” being true, you may conclude that P is true. For a complete
list, refer to a book or website on first-order logic. None of the rules are
surprising.
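The mechanical flavor of applying inference rules can be illustrated with a toy forward-chaining sketch (my own simplification; real proof assistants are far more sophisticated). It applies modus ponens and “and”-elimination until no new facts appear:

```python
def close(known, implications):
    """Repeatedly apply two inference rules until nothing new is derived.

    Propositions are plain strings; implications are (hypothesis, conclusion)
    pairs standing for "hypothesis -> conclusion".
    """
    known = set(known)
    changed = True
    while changed:
        changed = False
        # Modus ponens: from P and (P -> Q), conclude Q.
        for hyp, concl in implications:
            if hyp in known and concl not in known:
                known.add(concl)
                changed = True
        # And-elimination: from "A and B", conclude A and B separately.
        for statement in list(known):
            if " and " in statement:
                for part in statement.split(" and "):
                    if part not in known:
                        known.add(part)
                        changed = True
    return known

facts = close({"P", "P and Q"}, {("P", "Q"), ("Q", "R")})
print(facts)  # contains "Q" (two ways) and "R" (by chaining modus ponens)
```

A proof in the formal sense is a trace of which rule produced which fact; this sketch computes only the set of derivable facts, not the trace.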
Putting all of these together we may start to construct proofs. The statement you’d
like to prove is a formula, and there are a set of hypothesis formulas that are
assumed to be true. Using the hypotheses, along with any tautologies you wish, a
proof is simply a list of logical inference rules applied to any previously proven
true formulas to arrive at the theorem.
While a formal proof can have any form legal according to first-order logic (or
second-order logic, as the case may be), it is helpful to identify and give names
to particular patterns of proof to help with human digestion.
There is another important technique, called proof by induction, that does not fit
neatly in every first-order logical framework (though it does in some, see below).
In second-order logic, induction is actually an axiomatic inference rule of the
form: for all boolean-valued functions P : N → {True, False}, if (P(1) and for
all k ∈ N, P(k) → P(k + 1)), then for all n ∈ N, P(n). As we have seen many
times in the book, to prove by induction you prove the base case (P(1)) and the
recursive/inductive step (for all k ∈ N, P(k) → P(k + 1)) separately, and you can
infer the theorem is true for any natural number.
For a first-order logic where the universe is the universe of sets, the concept of
natural numbers is usually baked into other axioms, and so the induction inference
rule can be proved as a theorem. In a logic whose universe of elements are
integers, it is baked into axioms about well-ordering. In the end, it is usually
singled out as a particularly handy proof technique for the times when you have no
other ideas on how to prove a theorem.
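Though induction itself cannot be verified exhaustively by a computer, its two proof obligations can be spot-checked on small cases before committing to a proof. A sketch with a hypothetical illustrative claim (my example, not the book's):

```python
# Claim P(n): the sum of the first n odd numbers equals n^2.
def P(n):
    return sum(2 * i - 1 for i in range(1, n + 1)) == n ** 2

# Base case: P(1).
assert P(1)

# Inductive step, spot-checked on small cases: P(k) implies P(k + 1).
for k in range(1, 100):
    assert (not P(k)) or P(k + 1)

print("base case and inductive step hold on all tested cases")
```

Passing such a check is evidence, not proof; the inductive step still needs an argument covering all k at once.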
Once you have an intuition for how proofs may be made formal, and you have a grasp
on the basic tools for proving simple statements, you will find yourself in a
position where you want to prove something and you have no idea how.
This is a frightening stage, but there are simple techniques that can help. In
fact, there is an entire book by George Pólya called “How to Solve It” devoted to
explaining these techniques. I will describe some techniques I use here, which
provide merely a subset of Pólya’s advice. Before we get to that, there are a
number of ways that you can stumble upon something you want to prove.

3 This last bit is not entirely obvious, and a truth table for negating P → Q helps.
One common way is when working on another problem and you notice a pattern. For
example, you may be working on a number theory problem about square numbers and
notice that the difference between successive square numbers is always odd. For
example, 25 − 16 = 9 and 81 − 64 = 17. You have already noticed a pattern; you know
what it is you want to prove, and you can set out trying to prove it.
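Before attempting a proof, the conjectured pattern is cheap to check on many cases. A quick sketch; the eventual proof falls out of the algebra (n + 1)² − n² = 2n + 1:

```python
# Check the observed pattern: differences of successive squares are odd.
for n in range(1, 1000):
    diff = (n + 1) ** 2 - n ** 2
    assert diff % 2 == 1       # the difference is odd
    assert diff == 2 * n + 1   # and in fact equals 2n + 1, suggesting the proof
```

Noticing that the computed differences are exactly 2n + 1 is itself a second, sharper conjecture, and one that points directly at the algebraic argument.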
Another more tenuous situation is when you have some known inputs, and a known
state you’d like to get to, but otherwise no clue on how to get there. For example
you may have some quantity you believe is bounded from above by 2, but you don’t
know how to prove it. An example I was working on at the time of this writing
consisted of a sum like

A cos(n2πt + p) + B cos(m2πt + q) + C cos(k2πt + r).

In this sum the numbers A, B, C are fixed positive reals and n, m, k are fixed,
distinct positive integers, but the numbers p, q, r ∈ [0 , 2 π] are variable. My
goal was to find the choice of p, q, r that made the maximum magnitude of the
resulting function as small as possible.
In these more ambiguous cases, first and foremost, you should write down what you
want to prove as precisely as you can. Identify the known and unknown parts of the
problem. For the cosine problem I’ve dressed it up a bit, to give the most general
version of the problem I care about, of which the three-term sum above was a
motivating example.
If you’re reading this after dipping your toes into the early chapters of this
book, pardon the notation salad, and just think of the simpler example above.
min_{p1, …, pn}  max_{t ∈ [0, 2π]}  | ∑_{i=1}^{n} Ai cos(mi 2πt + pi) |
With a clear problem in hand, the simplest next step is to write down many
examples, and draw pictures, and try to gain an understanding of why the problem
resists a proof.
Often, simple examples show that my belief about the problem was completely wrong,
and it’s actually false for trivial reasons. For the cosine problem above, I
plotted a number of different values of the various parameters, and tried to
understand a rough idea about what shifts would make the peaks line up, and what
shifts would make the troughs line up (both bad situations). I determined that in
this case this problem did actually have some meat to it, so I proceed.
Now there are a few techniques I can try. The simplest and most reliable technique,
in my opinion, is to make the problem progressively simpler and simpler until you
can solve it, and then slowly add back in complexity until you can’t solve it
anymore. For the cosine problem above, we can start by fixing all the Ai = 1, and
the mi to sequential
integers mi = i. After thinking about that version of the problem for a while, it’s
still too hard, so I simplify it further by fixing n to small values. Since n = 1
defeats the problem instantly (the maximum is unchanged no matter what you do), the
simplest nontrivial choice is n = 2. Shifting t lets us fix the first phase to
zero, leaving only the choice of r in

cos(2πt) + cos(4πt + r).
Now what further resists a proof? We could try to simplify further by letting the
two periods 2 π and 4 π be the same value (say, both 2 π). We can ignore for the
moment that this violates one of the constraints of the problem, in order to
determine if that constraint is important. Indeed, such a simplification makes the
problem too trivial, because an easily chosen shift of π cancels both curves out
completely to the zero function. The differing periods (and, it appears, the fact
that their ratio is rational) are core ingredients in the fact that a nontrivial
minimum can be achieved. At this point, one can try to manually optimize the
function to find the right value of r, using techniques from calculus.
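A numerical sketch of this last step (my own grid search, not the book's calculus approach): estimate, over a grid of phases r, the maximum of |cos(2πt) + cos(4πt + r)| by sampling one full period.

```python
import math

def peak(r, samples=1000):
    """Approximate max over one period of |cos(2*pi*t) + cos(4*pi*t + r)|.

    Both terms have period 1 in t, so sampling t in [0, 1) suffices.
    """
    return max(
        abs(math.cos(2 * math.pi * t) + math.cos(4 * math.pi * t + r))
        for t in (k / samples for k in range(samples))
    )

# Grid search over the phase r in [0, 2*pi).
candidates = [2 * math.pi * k / 200 for k in range(200)]
best_r = min(candidates, key=peak)

print(f"best r ~ {best_r:.3f}, minimax value ~ {peak(best_r):.3f}")
# With r = 0 the peaks align at t = 0 and the maximum is exactly 2;
# a good choice of r pushes the maximum strictly below 2.
```

Such an experiment does not prove anything, but it gives a target value that a calculus argument must reproduce, and a quick falsifier for wrong conjectures.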
Having accomplished this task, one reflects on the results. Will the techniques
applied generalize to the more complex case of 3 or more curves? If not, at what
step does it break down? What, precisely, is the core reason that technique fails?
Does that say anything about whether related techniques would also fail? Does that
provide any insight into what properties are required of a technique if it is to
succeed?
There are a number of other questions naturally raised when doing this simplify-
solve-generalize loop. What known problems seem related to this one? For example,
the problem above looks like a decomposition called the Fourier series, so one
could look for information pertaining to how to tell where the maximum of a finite
Fourier series lies.
Another question: can we restate the problem differently to suggest different ap-
proaches? For one, I notice that I will fail at my minimization goal if I unluckily
cause many peaks of different curves to line up, or many troughs. So somehow I want
to mis-align all the peaks relative to all the other peaks, and all the troughs
relative to the other troughs. But I can easily compute the peaks and troughs of
each curve (they form a discrete set), so maybe it is enough to find an alignment
that keeps the peaks and troughs as far away from each other as possible. This idea
of mis-aligning peaks and troughs is also a sort of heuristic reasoning that may
guide me to a more precise proof.
Another question: can I make the problem more general in a way that helps? Knowing
a bit about complex analysis and the famous formula eit = cos( t) + i sin( t)
suggests to write cos( k 2 πt + p) = Re( ei( k 2 πt+ p)) and work there. Indeed,
from that perspective the cosine is the projection of a vector onto the x-axis, and
the function is a sum of continuously rotating vectors. I want to keep the
projection of those vectors from sticking out too far in the horizontal direction
left or right (but they may stretch as high or as low as they want). It is worth
noting that complex numbers have a rich history of making
some complicated calculus problems much simpler, so it’s reasonable to expect they
might help in this situation.
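The identity behind this rewriting is easy to spot-check numerically. A small sketch (mine, not from the text) compares cos(θ) with the real part of e^{iθ} at random arguments:

```python
import cmath
import math
import random

# Check cos(k*2*pi*t + p) = Re(e^{i(k*2*pi*t + p)}) on random inputs.
# The complex form views the sum of cosines as the horizontal projection
# of a sum of rotating vectors in the plane.
random.seed(0)
for _ in range(100):
    k = random.randint(1, 10)
    t = random.uniform(0, 2 * math.pi)
    p = random.uniform(0, 2 * math.pi)
    theta = k * 2 * math.pi * t + p
    assert abs(math.cos(theta) - cmath.exp(1j * theta).real) < 1e-9
```

The payoff of the complex viewpoint is algebraic: products and sums of exponentials are easier to manipulate than sums of cosines with phase shifts.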
Another question: can I find a good approximation to the thing I want to prove?
Perhaps instead of finding the exact minimum of the exact function, I can find the
exact minimum of an approximation to the function, or an approximate minimum to the
exact function, or an approximate minimum to an approximate function. In each of
these I could apply various types of approximations, such as Taylor series (Chapter
8), and try to measure the quality of the approximations.
There are many other questions I could ask, but in puzzling over each I form a
rough plan of attack for the problem. I get leads for topics to read about that may
help, I find a new way to picture the problem, and I can apply each to my list of
examples to evaluate whether it is worth pursuing. As I learn more mathematics in
general, I find more and newer ways to approach problems.
What’s most important about having all of these leads is that one feels like one is
making progress. It is completely useless to write down a problem with nowhere to
go from there. The process of simplifying and generalizing in different ways, in
addition to the loop of conjecture and proof or refutation, preserves momentum,
upholds determination, and preserves sanity through trying times.
Beyond searching for leads, there is the practical matter of prioritization. How do
you choose which lead to follow, and for how long should you keep at it before
switching tacks? How do you keep track of your progress so that you can easily
resume where you left off, or revisit an approach later? In my view the answers to
this are deeply personal.
Everyone has different styles of managing their “To Do” list and project
management.
Many mathematicians I know, including myself, keep notebooks of various sorts for
the off-hand thoughts and tinkering that is too embarrassing to show to the world.
Some mathematicians take this a bit further. They rely on the idea expounded in
this book that, once you have a nugget of insight, you can hide the details to
reduce overhead and re-derive them as needed. As such, the way to “keep track” of a
lead may be as simple as a line “Try Fourier series.” One might spend a working
session working that angle on a blackboard or scratch paper, or do an extensive
literature search, and at the end derive one clear limitation or bit of progress,
such as “odd/even makes a big difference.”
The intermediate scratch work is often discarded, and can be recreated (often
clearer and more concisely) later. As my advisor’s advisor would say (told to me by
my advisor), “If you can’t recreate it later, then it was probably wrong anyway.”
Luckily, the nugget of insight is often much easier to write down or remember.
Many others use the process of typesetting their notes (and cleaning/pruning them)
to preserve the important bits. This has often helped me find mistakes in my
scratch work early—or at least, earlier than when my colleagues and I have declared
victory and
are ready to typeset it in a paper submission. Typesetting forces you to slow down
and reexamine your work, similar to how writing something manually with a paper and
pen makes it easier to remember and makes you more deliberate about what you write.
Appendix C
Annotated Resources
One of the primary outcomes I hope readers get from this book is the ability and
confidence to engage with mathematics outside these pages. Mathematics is full of
excellent books, lecture notes, blogs, videos, and talks. Most importantly,
mathematics is a broad community of people. With some of the ideas in this book, a
jump start on notation and proofs, and hopefully an idea of what you want to learn
next, you can take full advantage of the broad literature and wonderful people.
In putting so much content into this book, I necessarily had to compress and omit.
I encourage readers to read other books alongside this one. In particular, readers
have written to me that they found it helpful to read a supplementary introduction
to proofs, to reinforce and practice mechanics in tandem.
1. A Mathematician’s Lament, Paul Lockhart.
2. Naive Set Theory (UTM), Paul Halmos. A short derivation of set theory,
axiomatically from the ground up. The first 50 pages contain everything you would
need for this book.
3. A text on basic methods of proofs, with many exercises. Covers topics like
sequences, completeness, inequalities, and basic number theory. Notable for how
cheap the physical copy is.
4. Logic as Algebra, Paul Halmos, Steven Givant. A book that covers propositional
logic, but primarily works to show how logic exhibits algebraic structure (an aside
from Chapter 2). This is also the book where I first saw the tournament problem
proof described in Chapter 4.
5. Reading, Writing, and Proving: A Closer Look at Mathematics (UTM), Ulrich Daepp,
Pamela Gorkin. A more expository-focused book introducing propositional logic.
6. A standard college textbook introducing sets and proofs. This was the text I
learned from in school. While complete and full of exercises, not particularly
engaging.
7. Expositions about every topic under the sun. Foundational not for its ease of
reading, but for its breadth of coverage. Includes some essays and cultural
guidance as well.
C.2 Polynomials
2. Ideals, Varieties, and Algorithms, David Cox, John Little, Donal O’Shea. Covers
a large range of computationally relevant aspects of polynomials, and graduates the
reader toward a mature view of polynomials in terms of rings.
2. A Walk Through Combinatorics, Miklós Bóna. An extensive text covering the basics
of combinatorics and its particular methods of proof. After going through basic
counting problems and tools, it moves on to advanced tools like generating func-
tions, and then proceeds to cover a large subset of important combinatorics and
graph theory topics from matchings and colorings to error-correcting codes and
block designs (only in more recent editions). I studied the first half of this book
in detail as an undergraduate.
3. Graph Theory and Its Applications, Jonathan L. Gross, Jay Yellen, Mark Anderson.
A comprehensive undergraduate-level introduction to graph theory, with particular
attention paid to algorithms that are relevant in computer science.
4. Networks, Crowds, and Markets, David Easley, Jon Kleinberg. Covers a wide
breadth of applications of graph theory, specifically oriented around processes and
dynamic systems that occur on a graph. A natural next step for the reader who
enjoyed the discussion of stable matchings in this book. Leans heavily toward
modeling and is relatively light on mathematical technicalities.
3. Calculus, Michael Spivak. Revered as a classic, but dense and focused primarily
on proving single-variable calculus from the ground up with rigor. I have found it
useful as a reference and a third pass over calculus.
6. The Fourier Transform and its Applications, Brad Osgood. This is the text I
learned Fourier analysis from, along with Osgood’s excellent online lecture videos.
1. Linear Algebra Done Right, Sheldon Axler. The text I originally learned linear
algebra from. Focuses heavily on an axiomatic view focused on linear maps, bases,
and their correspondence with matrices. Provides a nice side-introduction to
complex
numbers.
2. Linear Algebra and Its Applications, Gilbert Strang. Strang is the author of a
number of classic and authoritative texts on linear algebra. Opposite to Axler’s
text, this one focuses heavily on practical aspects of matrix computations,
including matrix decompositions and determinants, at the expense of theoretical
foundations.
4. Quantum Algorithms via Linear Algebra, Richard Lipton and Kenneth Regan. A
self-contained approach to quantum computing algorithms, relying only on linear
algebra background knowledge. Includes proofs of all the amazing results on
factoring.
C.6 Optimization
1. A First Course in Abstract Algebra, John Fraleigh. The book I originally learned
group theory from. Comprehensive and well paced, but lacking on motivation and
practical applications.
2. Abstract Algebra, David Dummit, Richard Foote. A revered classic, but very
terse and also very oriented toward the applications of abstract algebra in pure
mathematics. Legend has it one of the two authors (I forget which) sprinkled the
text
with puns and jokes, and the other author insisted on removing them. The only
jokes that remain were the ones that were too subtle to be detected.
3. Algebra, Michael Artin. A denser book, but one which approaches abstract algebra
using linear algebra as the unifying representation. As an undergraduate I spent a
lot of time working through parts of this book as self-study, and found it got me
into the mindset of filling in the gaps left by authors.
advanced topics. This was my first-year graduate Algebra book, and it demystified
category theory for me while also acting as a synthesizing text.
6. Permutation Groups, John Dixon, Brian Mortimer. A graduate level text discussing
permutation groups as a proxy for all groups. Focuses on concrete representations
and touches on algorithmic aspects.
C.8 Topology
Topology is a very heavy subject, but it is fun to think about. I’d recommend not
diving into a standard reference of point-set topology (Munkres) until you feel
comfortable with set theory, calculus (Chapter 8), and standard proof techniques.
2. Introduction to Topology, Theodore Gamelin, Robert Greene. The (very cheap) text
I originally learned point-set topology from. Concise and with lots of exercises.
3. Topology, James Munkres. The gold standard topology book aimed at math
undergraduates. I have used it as a reference.
4. Algebraic Topology, Allen Hatcher. The gold standard for an advanced subfield of
topology whose task is to use group and ring structures to compute interesting
topological invariants.
2 That being said, this StackExchange question and its top voted answer give a
compelling reason why many of these topics are important and interesting:
https://cstheory.stackexchange.com/q/14811
on the various theoretical models and mathematical guarantees that practical ma-
chine learning algorithms can achieve. Short and well-paced. I have studied the
book in detail.
the concepts in the book. I use this book as a reference for blog posts about
cryptography.
hardness, and quantum computing. A hard text, but much of it is assumed knowl-
edge for those interested in reading about the cutting edge results. I read most of
this book as a graduate student.
6. Computational Geometry, Mark de Berg, Otfried Cheong, Marc van Kreveld, Mark
Overmars. A text that covers the algorithmic questions around geometry, such as
at CMU, Sanjeev Arora’s at Princeton, and Tim Roughgarden and Greg Valiant’s
course at Stanford. All are wonderful springboards to learning about new exciting
topics.
variety of yearly conferences for each subfield. I will list a few below, but note
that it is a large volume of papers and it is hard to sort through them for
accessible and interesting ones. An easy place to start is to look at best paper
awards and “test of time” awards.
of Computer Science) are the two “top tier” conferences in the field. Both for
h) ICML, NeurIPS, IJCAI, KDD all focus on aspects of practical machine learning
methods.
i) ITCS (Innovations in Theoretical Computer Science) focuses on new ideas
that don’t fit into mainstream conferences, or may not achieve the technical
2. My Best Mathematical and Logic Puzzles, Martin Gardner. Another book of puzzles
from the most prominent luminary of recreational mathematics.
3. Proofs from THE BOOK, Martin Aigner, Günter Ziegler. A volume full of aesthetic
proofs from all over mathematics, with each proof being roughly 2 pages long.
topics that are often not sufficiently complex for standard mathematical journals,
but are interesting nonetheless.
5. The Harmony of the World, Gerald Alexanderson (Ed). A selection of the editor’s
favorite articles from the 75 year history of Mathematics Magazine.
6. The Best Writing on Mathematics Ed. Mircea Pitici. An annual anthology of the
year’s best mathematics writing.
Jeremy Kun is a software engineer at Google, as part of a team that plans and
optimizes Google’s “fleet” of datacenter machines. Born in 1989 in San Francisco,
California, he earned his undergraduate degree in mathematics from California
Polytechnic State University at San Luis Obispo, and his doctorate in mathematics
from the University of Illinois at Chicago, where he was advised by Lev Reyzin.
Jeremy writes the blog Math ∩ Programming.
The cover art is Tableau no. 2 / Composition no. V. by Piet Mondrian (1914),
currently at the New York Museum of Modern Art.
Piet Mondrian is renowned for his embrace of abstract geometric art. His late works
are instantly recognizable, characterized by thick black lines outlining rectangles
and squares of white or primary colors. Mondrian painted Composition no. V in 1914,
at a time in his life (between 1910 and the end of World War I in 1918) when he was
both inspired by the cubist works of Picasso and Braque and reconciling his
spirituality with his art. In this period he discovered and found meaning in
abstraction, which shaped his work for the rest of his life.
I hope that you, dear reader, will discover and find meaning in mathematics. I
believe that the harmony and rhythm in these basic forms of beauty, supplemented if
necessary by programs, can become a work of art, even stronger than it is true.
Index
directional, 241
notation, 14
302
linear
approximator,
preimage, 44
112
range, 13
Appel, Kenneth, 78
of polynomials, 109
partial, 249
354
350
backpropagation, 265
203
Ben-David, Shai, 32
boolean logic, 9, 59
multiplicity, 207
graph, 348
Cantor, Georg, 59
category theory
Erdős number, 90
universality, 92
227
284, 296
complete, 76
connectivity, 71
impossibility, 14, 32
32, 47
263
definition, 70
concavity, 254
field, 182
degree, 71
conjugation, 157
embedding, 87
contrapositive, 202
function
Euler characteristic, 76
analytic, 116
incidence, 71
neighborhood, 71
path, 71
definition, 324
codomain, 13
concavity, 126
De Morgan’s law, 59
divergence, 126
subgraph, 71
deferred acceptance, 55
domain, 13, 78
definition, 6
image, 44
approximation, 75, 83
derivative, 138
inverse, 46
Grothendieck, Alexander, 92
group
bias, 288
∈, 14, 40
cyclic, 312
manifold, 132
definition, 307
∂ 2 f / ∂x∂y, 258
dihedral, 313
mod, 133
examples, 312
diagonalizable, 207
N, 14
negation, 41
product, 313
inverse, 156
∂, 249
quotient, 311
symmetric, 196
⊥, 171, 269
f − 1( x), 44
subgroup, 310
229
, 18
i=1
R, 14
Haken, Wolfgang, 78
set-builder, 40
halfspace, 268
⊂, 41
Halmos, Paul, 37
133, 308
, 18
Hamming, Richard, 361
i=1
monotonic, 57
□, 20
∀, 46, 104
NP-hard, 75
homomorphism, 309
optimization
notation
norm, 202
≈, 99
orthogonal, 202
arg, 351
→, 13, 105
perfect secrecy, 27
347, 348
( )
n , 49
Picard, Émile, 66
picture proofs, 50
C, 351
i, 351
◦, 247
definition, 6
a + bi, 351
degree, 6, 8
Kleinberg, Jon, 32
D, 246
interpolation, 15, 22
, 110
product
dx
dx, 249
of groups, 313
of sets, 42
132
≡, 133
limit
∃, 44, 104
proof techniques
definition, 105
gradient ∇, 250
of a function, 105
im( f), 44
sequence, 102
Z, 14
332, 367
interval, 112
( a, b), 112
diagonalization, 59
definition, 141
ker f, 310
direct, 367
L 2, 273
lim, 105
little-o, 295
7→, 110
201, 366
countable, 59
total derivative
208, 368
definition, 40
as a matrix, 248
monotonicity, 57, 148
membership, 14, 40
power set, 59
Turán, György, 90
product, 42
set-builder notation, 40
quantifier
size, 41
vector space
Shamir, Adi, 32
Shapley, Lloyd, 61
quotient, 131
181, 192
definition, 131
of a group, 311
spectrum, 231
dimension, 148
for R, 42
Squeeze theorem, 127
dual, 229
independence, 146
statistics, 60
ReLU, 270
Steiner system, 60
Rényi, Alfréd, 89
norm, 160
Szemerédi, Endre, 90
span, 145
Riemann, Bernhard, 93
subspace, 148
Roth, Alvin, 61
Tate, John, 92
Vieta’s formulas, 29
semidefinite
programming,
121
84
tessellation, 329
well-definition, 102
sequence
Cauchy, 126
233, 359
convergence, 102
Tilly, Ben, 89
Wiles, Andrew, 35
divergence, 125
tombstone, 20
set
cardinality, 41, 59
Zhang, Yitang, 31