Matrix storage Data Example: outputting many small matrices Example: Cholesky QR 
Data Structures and Performance for Scientific Computing with Hadoop and Dumbo
Austin R. Benson 
Computer Sciences Division, UC-Berkeley 
ICME, Stanford University 
May 15, 2012
1 Matrix storage 
2 Data 
3 Example: outputting many small matrices 
4 Example: Cholesky QR
Dense matrix storage
A = \begin{pmatrix}
11 & 12 & 13 & 14 \\
21 & 22 & 23 & 24 \\
31 & 32 & 33 & 34 \\
41 & 42 & 43 & 44
\end{pmatrix}
How do we store the matrix in HDFS?
In HDFS:
⟨1, [11, 12, 13, 14]⟩
⟨2, [21, 22, 23, 24]⟩
⟨3, [31, 32, 33, 34]⟩
⟨4, [41, 42, 43, 44]⟩
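As a minimal sketch of this layout in Python: each row becomes one key-value record keyed by its 1-based row index. The helper name `rows_as_records` is hypothetical and is not taken from the workshop code.

```python
def rows_as_records(matrix):
    """Yield one (row_index, row) record per matrix row, with 1-based keys."""
    for i, row in enumerate(matrix, start=1):
        yield i, list(row)


A = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
]

records = list(rows_as_records(A))
for key, value in records:
    print(key, value)
```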
Two rows per record
or we might use:
⟨1, [[11, 12, 13, 14], [21, 22, 23, 24]]⟩
⟨3, [[31, 32, 33, 34], [41, 42, 43, 44]]⟩
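A sketch of the blocked layout, assuming each record is keyed by the 1-based index of its first row, as in the records above. The function name `block_records` and its `rows_per_record` parameter are illustrative assumptions.

```python
def block_records(matrix, rows_per_record=2):
    """Yield (first_row_index, block) records; each block is a list of rows."""
    for start in range(0, len(matrix), rows_per_record):
        block = [list(row) for row in matrix[start:start + rows_per_record]]
        yield start + 1, block


A = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
]

blocks = list(block_records(A, rows_per_record=2))
for key, block in blocks:
    print(key, block)
```

Setting `rows_per_record` to the number of rows yields a single record that holds the whole matrix.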
Flattened list
or maybe
⟨1, [11, 12, 13, 14, 21, 22, 23, 24]⟩
⟨3, [31, 32, 33, 34, 41, 42, 43, 44]⟩
... but we lose the row boundaries here (which may not matter, for example if the number of columns is known)
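To make the lost information concrete, here is a small sketch with hypothetical helpers `flatten_block` and `unflatten`. The flattened list can only be turned back into rows if the reader supplies the column count separately.

```python
def flatten_block(block):
    """Concatenate the rows of a block into one flat list."""
    return [x for row in block for x in row]


def unflatten(flat, ncols):
    """Rebuild rows from a flat list; ncols must be known from elsewhere."""
    if ncols <= 0 or len(flat) % ncols != 0:
        raise ValueError("flat list length is not a multiple of ncols")
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]


block = [[11, 12, 13, 14], [21, 22, 23, 24]]
flat = flatten_block(block)

# The same eight numbers are consistent with several shapes.
as_two_by_four = unflatten(flat, 4)
as_four_by_two = unflatten(flat, 2)
print(as_two_by_four)
print(as_four_by_two)
```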
Full matrix
or maybe
⟨1, [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34], [41, 42, 43, 44]]⟩
What is the "best" way? 
Depends on the application... we will look at an example later.
Data Serialization
Small optimizations → 2.5x speedup!
*All data from the NERSC Magellan cluster
Data Serialization
Same experiment but a different matrix size (200 columns):
Again, 2.5x speedup!
Languages 
Switching from Python to C++... 
same general trend
More speedups 
Algorithm performance isn't the only place where we see speedups
Why can we expect these speedups? 
These are not high-performance implementations. We care about 
I/O performance.
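The slides do not say which serialization changes produced the speedup, so the following is only an illustration of why the encoding can matter for I/O, not a reconstruction of the experiment. It compares the byte size of a 200-column row written as comma-separated text with the same row packed as binary doubles using Python's `struct` module. The helper names are hypothetical.

```python
import struct


def encode_text(row):
    """Encode a row as comma-separated decimal text."""
    return ",".join(repr(x) for x in row).encode("ascii")


def encode_binary(row):
    """Encode a row as little-endian 8-byte doubles."""
    return struct.pack("<%dd" % len(row), *row)


def decode_binary(data):
    """Inverse of encode_binary."""
    count = len(data) // 8
    return list(struct.unpack("<%dd" % count, data))


# An arbitrary row with 200 columns (values chosen only for illustration).
row = [i / 7.0 for i in range(1, 201)]

text_bytes = encode_text(row)
binary_bytes = encode_binary(row)
print("text bytes:  ", len(text_bytes))
print("binary bytes:", len(binary_bytes))
```

Fewer bytes per record means less data to write, shuffle, and parse, which is where an I/O-bound job spends its time.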
Suppose we need to write many small matrices to disk.
Code
Code:
git clone git://github.com/icme/mapreduce-workshop.git
cd mapreduce-workshop/arbenson
Files:
speed_test.py (tester)
small_matrix_test.py (driver)
Algorithm
Cholesky QR: R = chol(A^T A, 'upper')
If A = QR, then A^T A = R^T R, so R is the upper-triangular Cholesky factor of A^T A.
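A minimal in-memory sketch of this formula in pure Python, assuming A has full column rank so that A^T A is positive definite. The function names `gram` and `cholesky_upper` are illustrative; the 4 x 2 test matrix is made up, because the 4 x 4 example matrix from the storage slides is rank deficient.

```python
import math


def gram(A):
    """Return A^T A for a matrix stored as a list of rows."""
    n = len(A[0])
    G = [[0.0] * n for _ in range(n)]
    for row in A:
        for i in range(n):
            for j in range(n):
                G[i][j] += row[i] * row[j]
    return G


def cholesky_upper(G):
    """Return upper-triangular R with R^T R = G (G symmetric positive definite)."""
    n = len(G)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s = G[i][i] - sum(R[k][i] ** 2 for k in range(i))
        if s <= 0.0:
            raise ValueError("matrix is not numerically positive definite")
        R[i][i] = math.sqrt(s)
        for j in range(i + 1, n):
            partial = sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = (G[i][j] - partial) / R[i][i]
    return R


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # tall and skinny
G = gram(A)
R = cholesky_upper(G)
for row in R:
    print(row)
```

Forming A^T A squares the condition number of A, so this approach is best suited to well-conditioned matrices.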
Implementation for MapReduce
Mapper implementation 
Which of these implementations is better? 
Answer: the one on the left (usually)
Why?
1 Shuffle time
2 Reduce bottleneck
However, the left implementation could run out of memory.
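The two mapper listings did not survive in this transcript, so the sketch below is only a guess at the kind of trade-off being discussed. It assumes one mapper emits the rows of an outer product for every input row, while the other buffers its whole input split and emits only the rows of its local A^T A. All function names are hypothetical. Both variants lead to the same reduced result, but the buffering mapper sends far fewer records through the shuffle and holds more data in memory.

```python
def map_per_row(rows):
    """Emit n records for every input row; nothing is held in memory."""
    for row in rows:
        for i in range(len(row)):
            yield i, [row[i] * x for x in row]


def map_buffered(rows):
    """Buffer the whole split, then emit the n rows of its local A^T A."""
    buffered = [list(row) for row in rows]  # memory grows with the split size
    if not buffered:
        return
    n = len(buffered[0])
    for i in range(n):
        yield i, [sum(r[i] * r[j] for r in buffered) for j in range(n)]


def reduce_sum(records):
    """Sum all values that share a key, giving the rows of A^T A in key order."""
    totals = {}
    for key, values in records:
        if key in totals:
            totals[key] = [a + b for a, b in zip(totals[key], values)]
        else:
            totals[key] = list(values)
    return [totals[k] for k in sorted(totals)]


# Two input splits of a 4 x 2 matrix (made-up data).
splits = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

per_row = [rec for split in splits for rec in map_per_row(split)]
buffered = [rec for split in splits for rec in map_buffered(split)]
print("records shuffled (per row): ", len(per_row))
print("records shuffled (buffered):", len(buffered))
print("same A^T A:", reduce_sum(per_row) == reduce_sum(buffered))
```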
Mapper implementation 
Can we do better? Yes

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo (ICME MR 2012)

Questions?
Austin R. Benson
https://github.com/arbenson/mrtsqr