ENGINEERING FAST INDEXES (DEEP DIVE)
Daniel Lemire
https://lemire.me
Joint work with lots of super smart people
Roaring : Hybrid Model
A collection of containers...
array: sorted arrays ({1,20,144}) of packed 16‑bit integers
bitset: bitsets spanning 65536 bits or 1024 64‑bit words
run: sequences of runs ([0,10],[15,20])
2
Keeping track
E.g., a bitset with few 1s needs to be converted back to an array.
→ we need to keep track of the cardinality!
In Roaring, we do it automagically
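The conversion mentioned above can be sketched in C as follows. This is a minimal illustration, not CRoaring's actual code: the names `bitset_to_array` and `BITSET_WORDS` are hypothetical, and `__builtin_ctzll` is a GCC/Clang builtin. The caller is assumed to know the cardinality already, which is exactly why it is tracked.

```c
#include <stddef.h>
#include <stdint.h>

#define BITSET_WORDS 1024 /* 1024 x 64 bits = 65536 bits (hypothetical name) */

/* Write the positions of the set bits of `words` into `out`, in increasing
 * order, and return how many were written. `out` must have room for the
 * container's cardinality. */
size_t bitset_to_array(const uint64_t *words, uint16_t *out) {
    size_t n = 0;
    for (size_t k = 0; k < BITSET_WORDS; k++) {
        uint64_t w = words[k];
        while (w != 0) {
            int bit = __builtin_ctzll(w); /* index of the lowest set bit */
            out[n++] = (uint16_t)(k * 64 + (size_t)bit);
            w &= w - 1; /* clear the lowest set bit */
        }
    }
    return n;
}
```

The inner loop runs once per set bit, so a sparse bitset converts quickly.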
3
Setting/Flipping/Clearing bits while keeping track
Important : avoid mispredicted branches
Pure C/Java:
q = p / 64;
ow = w[q];
nw = ow | (1ULL << (p % 64)); // 64-bit constant; use 1L in Java
cardinality += (ow ^ nw) >> (p % 64); // EXTRA: adds 1 only if the bit changed (use >>> in Java)
w[q] = nw;
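Wrapped into complete functions, the same idea might look like the sketch below. The function names are hypothetical; the clearing variant is an assumption that mirrors the setting code and is not taken from the slides.

```c
#include <stdint.h>

/* Set bit `p` and add 1 to `*cardinality` only if the bit was clear before.
 * No branch depends on the old value of the bit. */
void bitset_set_tracked(uint64_t *w, uint32_t *cardinality, uint16_t p) {
    uint32_t q = p / 64;
    uint32_t shift = p % 64;
    uint64_t ow = w[q];
    uint64_t nw = ow | (UINT64_C(1) << shift);
    *cardinality += (uint32_t)((ow ^ nw) >> shift); /* 1 if changed, else 0 */
    w[q] = nw;
}

/* Clear bit `p` and subtract 1 from `*cardinality` only if the bit was set. */
void bitset_clear_tracked(uint64_t *w, uint32_t *cardinality, uint16_t p) {
    uint32_t q = p / 64;
    uint32_t shift = p % 64;
    uint64_t ow = w[q];
    uint64_t nw = ow & ~(UINT64_C(1) << shift);
    *cardinality -= (uint32_t)((ow ^ nw) >> shift); /* 1 if changed, else 0 */
    w[q] = nw;
}
```

Because `ow ^ nw` can only differ in bit `shift`, shifting it right leaves exactly 0 or 1.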
4
In x64 assembly with BMI2 instructions:
shrx %[6], %[p], %[q] // q = p / 64
mov (%[w],%[q],8), %[ow] // ow = w[q]
bts %[p], %[ow] // ow |= (1 << (p % 64)); the old bit goes to the carry flag
sbb $-1, %[cardinality] // cardinality += 1 - carry
mov %[ow], (%[w],%[q],8) // w[q] = ow
sbb is the extra work
5
For each operation
union
intersection
difference
...
Must specialize by container type:
          array   bitset   run
array       ?       ?       ?
bitset      ?       ?       ?
run         ?       ?       ?
6
High‑level API or Sipping Straw?
7
Bitset vs. Bitset...
Intersection:
First compute the cardinality of the result.
If low, use an array for the result (slow), otherwise generate
a bitset (fast).
Union: Always generate a bitset (fast).
(Unless the cardinality is high, in which case a run container may be the better choice!)
We generally keep track of the cardinality of the result.
8
Cardinality of the result
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
c += Long.bitCount(A[k] & B[k]);
}
We have 1024 calls to  Long.bitCount .
This counts the number of 1s in a 64‑bit word.
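For comparison, a C counterpart of this loop might look like the following sketch. The function name is hypothetical, and `__builtin_popcountll` is a GCC/Clang builtin that typically compiles to a single `popcnt` instruction when the target supports it.

```c
#include <stddef.h>
#include <stdint.h>

/* Cardinality of the intersection of two bitsets of `nwords` 64-bit words,
 * computed without materializing the result. */
int bitset_and_cardinality(const uint64_t *a, const uint64_t *b, size_t nwords) {
    int c = 0;
    for (size_t k = 0; k < nwords; k++) {
        c += __builtin_popcountll(a[k] & b[k]);
    }
    return c;
}
```

For a Roaring bitset container, `nwords` would be 1024.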
9
Population count in Java
// Hacker's Delight
int bitCount(long i) {
// HD, Figure 5-14
i = i - ((i >>> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L)
+ ((i >>> 2) & 0x3333333333333333L);
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >>> 8);
i = i + (i >>> 16);
i = i + (i >>> 32);
return (int)i & 0x7f;
}
Sounds expensive?
10
Population count in C
How do you think that the C compiler  clang compiles this code?
#include <stdint.h>
int count(uint64_t x) {
int v = 0;
while(x != 0) {
x &= x - 1;
v++;
}
return v;
}
11
Compile with  -O1 -march=native on a recent x64 machine:
popcnt rax, rdi
12
Why care for  popcnt ?
 popcnt : throughput of 1 instruction per cycle (recent Intel CPUs)
Really fast.
13
Population count in Java?
// Hacker's Delight
int bitCount(long i) {
// HD, Figure 5-14
i = i - ((i >>> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L)
+ ((i >>> 2) & 0x3333333333333333L);
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >>> 8);
i = i + (i >>> 16);
i = i + (i >>> 32);
return (int)i & 0x7f;
}
14
Population count in Java!
Also compiles to popcnt if the hardware supports it
$ java -XX:+PrintFlagsFinal | grep UsePopCountInstruction
bool UsePopCountInstruction = true
But only if you call Long.bitCount itself; a hand-written copy of the code is not replaced
15
Java intrinsics
 Long.bitCount ,  Integer.bitCount 
 Integer.reverseBytes ,  Long.reverseBytes 
 Integer.numberOfLeadingZeros ,
 Long.numberOfLeadingZeros 
 Integer.numberOfTrailingZeros ,
 Long.numberOfTrailingZeros 
 System.arraycopy 
...
16
Cardinality of the intersection
How fast does this code run?
int c = 0;
for (int k = 0; k < 1024; ++k) {
c += Long.bitCount(A[k] & B[k]);
}
A bit over 2 cycles per pair of 64‑bit words.
load A, load B
bitwise AND
 popcnt 
17
Take away
Bitset vs. Bitset operations are fast
even if you need to track the cardinality.
even in Java
e.g.,  popcnt overhead might be negligible compared to other costs
like cache misses.
18
Array vs. Array intersection
Always output an array. Use galloping, O(m log n), if the sizes
differ a lot.
int intersect(A, B) {
if (A.length * 25 < B.length) {
return galloping(A,B);
} else if (B.length * 25 < A.length) {
return galloping(B,A);
} else {
return boring_intersection(A,B);
}
}
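The slides do not show `boring_intersection`. A plausible reading is a plain linear merge over the two sorted arrays, sketched here under that assumption (the name `intersect_merge` is hypothetical).

```c
#include <stddef.h>
#include <stdint.h>

/* Intersect two sorted arrays of distinct 16-bit values with a linear merge.
 * `out` must have room for min(na, nb) values. Returns the output length. */
size_t intersect_merge(const uint16_t *a, size_t na,
                       const uint16_t *b, size_t nb, uint16_t *out) {
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;
        } else if (b[j] < a[i]) {
            j++;
        } else { /* equal: keep the value and advance both sides */
            out[n++] = a[i];
            i++;
            j++;
        }
    }
    return n;
}
```

This runs in O(m + n) time, which is why it only wins when the two sizes are comparable.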
19
Galloping intersection
You have two arrays, a small one and a large one...
while (k2 < smallSet.length) {
if (largeSet[k1] < smallSet[k2]) {
find k1 by binary search such that
largeSet[k1] >= smallSet[k2]
(stop if there is no such k1)
}
if (smallSet[k2] < largeSet[k1]) {
++k2;
} else {
// got a match! (smallSet[k2] == largeSet[k1])
output smallSet[k2]; ++k2;
}
}
If the small set is tiny, runs in O(log(size of big set))
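A runnable C sketch of the galloping idea follows. It assumes the usual formulation, an exponential probe followed by a binary search inside the bracketed range; the function names are hypothetical and this is not presented as CRoaring's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest index i in [lo, n) with arr[i] >= target, or n if there is none.
 * Probes at distances 1, 2, 4, ... and then bisects the bracketed range. */
static size_t gallop_lower_bound(const uint16_t *arr, size_t lo, size_t n,
                                 uint16_t target) {
    if (lo >= n) return n;
    if (arr[lo] >= target) return lo;
    size_t step = 1;
    size_t hi = lo + step;
    while (hi < n && arr[hi] < target) { /* invariant: arr[lo] < target */
        lo = hi;
        step *= 2;
        hi = lo + step;
    }
    if (hi > n) hi = n;
    while (lo + 1 < hi) { /* answer lies in (lo, hi] */
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] < target) lo = mid; else hi = mid;
    }
    return hi;
}

/* Intersect a small sorted array with a large one. `out` needs room for
 * `ns` values. Returns the output length. */
size_t intersect_galloping(const uint16_t *small, size_t ns,
                           const uint16_t *large, size_t nl, uint16_t *out) {
    size_t k1 = 0, n = 0;
    for (size_t k2 = 0; k2 < ns; k2++) {
        k1 = gallop_lower_bound(large, k1, nl, small[k2]);
        if (k1 == nl) break; /* the large array is exhausted */
        if (large[k1] == small[k2]) out[n++] = small[k2];
    }
    return n;
}
```

Each lookup resumes from the previous position `k1`, so the large array is never scanned backwards.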
20
Array vs. Array union
Union: If sum of cardinalities is large, go for a bitset. Revert to an
array if we got it wrong.
union (A,B) {
total = A.length + B.length;
if (total > DEFAULT_MAX_SIZE) {// bitmap?
create empty bitmap C and add both A and B to it
if (C.cardinality <= DEFAULT_MAX_SIZE) {
convert C to array
} else if (C is full) {
convert C to run
} else {
C is fine as a bitmap
}
}
otherwise merge two arrays and output array
}
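The last branch, merging two sorted arrays into an array, can be sketched in C as below. The name `union_merge` is hypothetical; the bitmap branch is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge two sorted arrays of distinct 16-bit values into their sorted union.
 * `out` must have room for na + nb values. Returns the output length. */
size_t union_merge(const uint16_t *a, size_t na,
                   const uint16_t *b, size_t nb, uint16_t *out) {
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            out[n++] = a[i++];
        } else if (b[j] < a[i]) {
            out[n++] = b[j++];
        } else { /* present in both: emit once */
            out[n++] = a[i];
            i++;
            j++;
        }
    }
    while (i < na) out[n++] = a[i++]; /* copy whatever is left */
    while (j < nb) out[n++] = b[j++];
    return n;
}
```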
21
Array vs. Bitmap (Intersection)...
Intersection: Always an array.
Branchy (3 to 16 cycles per array value):
answer = new array
for value in array {
if value in bitset {
append value to answer
}
}
22
Branchless (3 cycles per array value):
answer = new array
pos = 0
for value in array {
answer[pos] = value
pos += bit_value(bitset, value)
}
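In C, the branchless variant might be written as in this sketch, where `bit_value` becomes a shift and a mask. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Keep the values of `array` whose bit is set in `bitset` (65536 bits).
 * `out` must have room for `n` values. Returns the output length. */
size_t intersect_array_bitset(const uint16_t *array, size_t n,
                              const uint64_t *bitset, uint16_t *out) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t v = array[i];
        out[pos] = v; /* always written; kept only if pos advances */
        pos += (size_t)((bitset[v >> 6] >> (v & 63)) & 1);
    }
    return pos;
}
```

A value that is not in the bitset is simply overwritten by the next candidate, so no data-dependent branch is needed.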
23
Array vs. Bitmap (Union)...
Always a bitset. Very fast. Few cycles per value in array.
answer = clone the bitset
for value in array { // branchless
set bit in answer at index value
}
Without tracking the cardinality ≈ 1.65 cycles per value
Tracking the cardinality ≈ 2.2 cycles per value
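A C sketch of this union with cardinality tracking is shown below. The function name, the `BITSET_WORDS` constant and the convention of passing the bitset's cardinality in are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITSET_WORDS 1024 /* 65536 bits (hypothetical name) */

/* answer = bitset OR {array values}. `bitset_card` is the cardinality of
 * `bitset`; the return value is the cardinality of `answer`. */
uint32_t union_array_bitset(const uint16_t *array, size_t n,
                            const uint64_t *bitset, uint32_t bitset_card,
                            uint64_t *answer) {
    memcpy(answer, bitset, BITSET_WORDS * sizeof(uint64_t)); /* clone */
    uint32_t card = bitset_card;
    for (size_t i = 0; i < n; i++) {
        uint16_t v = array[i];
        uint64_t ow = answer[v >> 6];
        uint64_t nw = ow | (UINT64_C(1) << (v & 63));
        card += (uint32_t)((ow ^ nw) >> (v & 63)); /* 1 only for new bits */
        answer[v >> 6] = nw;
    }
    return card;
}
```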
24
Parallelization is not just multicore + distributed
In practice, all commodity processors support Single instruction,
multiple data (SIMD) instructions.
Raspberry Pi
Your phone
Your PC
Working with words x × larger has the potential of multiplying the
performance by x.
No lock needed.
Purely deterministic/testable.
25
SIMD is not too hard conceptually
Instead of working with x + y you do
(x1, x2, x3, x4) + (y1, y2, y3, y4).
Alas: it is messy in actual code.
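One comparatively tidy way to express a four-lane addition in C is the vector extension offered by GCC and Clang, sketched below. This is a compiler extension rather than standard C, and the type name `v4i32` and function `add4` are made up for the example; production code more often uses intrinsics such as those in `immintrin.h`.

```c
#include <stdint.h>

/* GCC/Clang vector extension: four 32-bit lanes packed in 16 bytes. */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Lane-wise addition: (x1, x2, x3, x4) + (y1, y2, y3, y4). On targets with
 * 128-bit SIMD registers this is typically a single instruction. */
v4i32 add4(v4i32 x, v4i32 y) {
    return x + y;
}
```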
26
With SIMD small words help!
With scalar code, working on 16‑bit integers is not 2 × faster than
32‑bit integers.
But with SIMD instructions, going from 64‑bit integers to 16‑bit
integers can mean 4 × gain.
Roaring uses arrays of 16‑bit integers.
27
Bitsets are vectorizable
Logical ORs, ANDs, ANDNOTs, XORs can be computed fast with
Single instruction, multiple data (SIMD) instructions.
Intel Cannonlake (late 2017), AVX‑512
Operate on 64 bytes with ONE instruction
→ Several 512‑bit ops/cycle
Java 9's HotSpot can use AVX‑512
ARM v8‑A to get Scalable Vector Extension...
up to 2048 bits!!!
28
Java supports advanced SIMD instructions
$ java -XX:+PrintFlagsFinal -version |grep "AVX"
intx UseAVX = 2
29
Vectorization matters!
for(size_t i = 0; i < len; i++) {
a[i] |= b[i];
}
using scalar : 1.5 cycles per byte
with AVX2 : 0.43 cycles per byte (3.5 × better)
With AVX‑512, the performance gap exceeds 5 ×
Can also vectorize OR, AND, ANDNOT, XOR + population count
(AVX2‑Harley‑Seal)
30
Vectorization beats  popcnt 
int count = 0;
for(size_t i = 0; i < len; i++) {
count += popcount(a[i]);
}
using fast scalar (popcnt): 1 cycle per input byte
using AVX2 Harley‑Seal: 0.5 cycles per input byte
even greater gain with AVX‑512
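The core of Harley‑Seal is a carry-save adder that folds several input words into "ones", "twos" and "fours" accumulators, so that only a fraction of the words ever reach a population count. The scalar sketch below illustrates that idea on 64-bit words, four at a time. It is not the AVX2 code measured above, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Carry-save adder: per bit position, a + b + c = 2 * h + l. */
static inline void csa(uint64_t *h, uint64_t *l,
                       uint64_t a, uint64_t b, uint64_t c) {
    uint64_t u = a ^ b;
    *h = (a & b) | (u & c);
    *l = u ^ c;
}

/* Count the 1 bits in d[0..n) with one popcount per four input words. */
uint64_t popcount_harley_seal(const uint64_t *d, size_t n) {
    uint64_t total = 0, ones = 0, twos = 0;
    uint64_t twosA, twosB, fours;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        csa(&twosA, &ones, ones, d[i], d[i + 1]);
        csa(&twosB, &ones, ones, d[i + 2], d[i + 3]);
        csa(&fours, &twos, twos, twosA, twosB);
        total += (uint64_t)__builtin_popcountll(fours);
    }
    /* each bit in `fours` stood for 4 inputs, `twos` for 2, `ones` for 1 */
    total = 4 * total + 2 * (uint64_t)__builtin_popcountll(twos)
          + (uint64_t)__builtin_popcountll(ones);
    for (; i < n; i++) total += (uint64_t)__builtin_popcountll(d[i]); /* tail */
    return total;
}
```

With SIMD registers the same three `csa` calls process several 64-bit words per lane group at once, which is where the gain over scalar `popcnt` would come from.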
31
Sorted arrays
sorted arrays are vectorizable:
array union
array difference
array symmetric difference
array intersection
sorted arrays can be compressed with SIMD
32
Bitsets are vectorizable... sadly...
Java's HotSpot is limited in what it can autovectorize:
1. Copying arrays
2. String.indexOf
3. ...
And it seems that  Unsafe effectively disables autovectorization!
33
There is hope yet for Java
One big reason, today, for binding closely to hardware is to
process wider data flows in SIMD modes. (And IMO this is a
long‑term trend towards right‑sizing data channel widths, as
hardware grows wider in various ways.) AVX bindings are where
we are experimenting, today
(John Rose, Oracle)
34
Fun things you can do with SIMD: Masked VByte
Consider the ubiquitous VByte format:
Use 1 byte to store all integers in [0, 2^7)
Use 2 bytes to store all integers in [2^7, 2^14)
...
Decoding can become a bottleneck. Google developed Varint‑GB.
What if you are stuck with the conventional format? (E.g., Lucene,
LEB128, Protocol Buffers...)
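To make the format concrete, here is a scalar C sketch of VByte encoding and decoding. It follows the LEB128 convention (low 7 bits first, high bit set on every byte except the last); other systems may flip the meaning of the high bit. The function names are hypothetical, and this is the conventional byte-at-a-time decoder, not the Masked VByte algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `v` into `out` (room for 5 bytes). Returns the bytes written. */
size_t vbyte_encode(uint32_t v, uint8_t *out) {
    size_t n = 0;
    while (v >= 128) {
        out[n++] = (uint8_t)((v & 127) | 128); /* more bytes follow */
        v >>= 7;
    }
    out[n++] = (uint8_t)v; /* last byte: high bit clear */
    return n;
}

/* Decode one integer from `in` into `*value`. Returns the bytes consumed.
 * Assumes well-formed input of at most 5 bytes per integer. */
size_t vbyte_decode(const uint8_t *in, uint32_t *value) {
    uint32_t v = 0;
    unsigned shift = 0;
    size_t n = 0;
    for (;;) {
        uint8_t byte = in[n++];
        v |= (uint32_t)(byte & 127) << shift;
        if ((byte & 128) == 0) break; /* data-dependent branch per byte */
        shift += 7;
    }
    *value = v;
    return n;
}
```

The branch on the high bit of every byte is hard to predict when integer lengths vary, which is one reason decoding can become a bottleneck.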
35
Masked VByte
Joint work with J. Plaisance (Indeed.com) and N. Kurz.
http://maskedvbyte.org/
36
Go try it out!
Fully vectorized Roaring implementation (C/C++):
https://github.com/RoaringBitmap/CRoaring
Wrappers in Python, Go, Rust...
37

Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by Daniel Lemire
