Data Structures and Performance for Scientific Computing with Hadoop and Dumbo (ICME MR 2012)

This document discusses matrix storage and data serialization techniques for scientific computing with Hadoop and Dumbo. It provides examples of storing matrices in HDFS using different approaches like storing each row separately, storing two rows per record, or flattening the matrix into a single list. It also discusses optimizing data serialization and switching programming languages. The document then presents an example of outputting many small matrices to disk and compares two MapReduce implementations for computing the Cholesky QR decomposition, identifying which approach is usually better and why.

Matrix storage Data Example: outputting many small matrices Example: Cholesky QR
Data Structures and Performance for Scientific Computing with Hadoop and Dumbo
Austin R. Benson
Computer Sciences Division, UC-Berkeley
ICME, Stanford University
May 15, 2012

1 Matrix storage
2 Data
3 Example: outputting many small matrices
4 Example: Cholesky QR

Dense matrix storage
\[
A = \begin{pmatrix}
11 & 12 & 13 & 14 \\
21 & 22 & 23 & 24 \\
31 & 32 & 33 & 34 \\
41 & 42 & 43 & 44
\end{pmatrix}
\]
How do we store the matrix in HDFS?
In HDFS:
⟨1, [11, 12, 13, 14]⟩
⟨2, [21, 22, 23, 24]⟩
⟨3, [31, 32, 33, 34]⟩
⟨4, [41, 42, 43, 44]⟩

Two rows per record
or we might use:
⟨1, [[11, 12, 13, 14], [21, 22, 23, 24]]⟩
⟨3, [[31, 32, 33, 34], [41, 42, 43, 44]]⟩

Flattened list
or maybe
⟨1, [11, 12, 13, 14, 21, 22, 23, 24]⟩
⟨3, [31, 32, 33, 34, 41, 42, 43, 44]⟩
... but we do lose information here (maybe it's not important)
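
What the flattened layout drops is the row boundaries: a reader can rebuild the rows only if it learns the column count from somewhere else. A minimal sketch in plain Python, where the helper names `flatten` and `unflatten` are illustrative and not taken from the workshop code:

```python
def flatten(rows):
    """Concatenate a list of equal-length rows into one flat list."""
    return [x for row in rows for x in row]


def unflatten(flat, ncols):
    """Rebuild rows from a flat list; the caller must supply ncols."""
    if len(flat) % ncols != 0:
        raise ValueError("flat list length is not a multiple of ncols")
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]


rows = [[11, 12, 13, 14], [21, 22, 23, 24]]
flat = flatten(rows)
print(flat)
print(unflatten(flat, 4))   # the original two rows
print(unflatten(flat, 2))   # also succeeds, but with the wrong shape
```

If every consumer of the data already knows the number of columns, the lost information does not matter, which is the case the slide alludes to.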

Full matrix
or maybe
⟨1, [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34], [41, 42, 43, 44]]⟩
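
The nested layouts on these slides differ only in how many rows go into each record. As a sketch, a single hypothetical helper `to_records` (not part of Dumbo or the workshop code) can produce all of them, using the 1-based index of the first row as the key:

```python
def to_records(matrix, rows_per_record):
    """Group consecutive rows into (key, value) records.

    The key is the 1-based index of the first row in the record,
    matching the keys 1 and 3 used on the slides.
    """
    records = []
    for start in range(0, len(matrix), rows_per_record):
        block = matrix[start:start + rows_per_record]
        # A single row is stored as a plain list, not a list of one row.
        value = block[0] if rows_per_record == 1 else block
        records.append((start + 1, value))
    return records


A = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34],
     [41, 42, 43, 44]]

for rows_per_record in (1, 2, 4):
    print(rows_per_record, to_records(A, rows_per_record))
```

Fewer, larger records mean less per-record overhead, at the cost of larger values that each task must hold at once.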

What is the "best" way?
Depends on the application... we will look at an example later.

Data Serialization
Small optimizations → 2.5x speedup!
*all data from the NERSC Magellan cluster

Data Serialization
Same experiment but different matrix size (200 columns):
Again, 2.5x speedup!

Languages
Switching from Python to C++...
same general trend

More speedups
Algorithm performance isn't the only place where we see speedups

Why can we expect these speedups?
These are not high-performance implementations. We care about
I/O performance.
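
The slides do not spell out which optimizations were measured. As one illustration of why serialization matters when I/O dominates, the sketch below encodes the same 200-entry row as text and as fixed-width binary using Python's standard `struct` module. The choice of these two encodings is an assumption made for illustration, not a description of the experiments.

```python
import struct


def pack_row(row):
    """Pack a row of floats as little-endian 8-byte doubles."""
    return struct.pack("<%dd" % len(row), *row)


def unpack_row(data):
    """Inverse of pack_row: recover the list of floats."""
    n = len(data) // 8
    return list(struct.unpack("<%dd" % n, data))


row = [0.1 * i for i in range(200)]
as_text = repr(row).encode("ascii")   # size depends on the digits printed
as_binary = pack_row(row)             # always 8 bytes per entry

print(len(as_text), len(as_binary))
print(unpack_row(as_binary) == row)   # binary packing round-trips exactly
```

The binary form has a predictable size and needs no number parsing on read, which is the kind of saving that shows up when the job is bound by I/O rather than arithmetic.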

Suppose we need to write many small matrices to disk.

Code
Code:
git clone git://github.com/icme/mapreduce-workshop.git
cd mapreduce-workshop/arbenson
Files:
speed_test.py (tester)
small_matrix_test.py (driver)

Algorithm
Cholesky QR: R = chol(A^T A, 'upper')
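
As a small worked instance of this formula, the sketch below forms A^T A and takes its upper-triangular Cholesky factor in plain Python. The function names `gram` and `chol_upper` are illustrative; in practice a library routine such as `numpy.linalg.cholesky` (which returns the lower-triangular factor) would normally be used instead.

```python
import math


def gram(A):
    """Return A^T A for a matrix given as a list of rows."""
    n = len(A[0])
    G = [[0.0] * n for _ in range(n)]
    for row in A:
        for i in range(n):
            for j in range(n):
                G[i][j] += row[i] * row[j]
    return G


def chol_upper(G):
    """Return upper-triangular R with R^T R = G (G symmetric positive definite)."""
    n = len(G)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s = G[i][i] - sum(R[k][i] ** 2 for k in range(i))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        R[i][i] = math.sqrt(s)
        for j in range(i + 1, n):
            t = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = t / R[i][i]
    return R


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
R = chol_upper(gram(A))
print(R)
```

Only the small n x n matrix A^T A is ever factored, so the work on the tall matrix A reduces to a sum over its rows, which is what makes the method a natural fit for MapReduce.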

Implementation for MapReduce

Mapper implementation
Which of these implementations is better?
Answer: the one on the left (usually)

Why?
1 Shuffle time
2 Reduce bottleneck
However, the left implementation could run out of memory.
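
The two mapper implementations were shown as images and are not reproduced in this text, so the following is not the author's code. It is a sketch of one pair of alternatives under which the listed trade-offs arise, assuming one mapper emits a contribution to A^T A for every row while the other buffers its input split and emits a single local sum. All function names are hypothetical, and the mappers are simulated in plain Python rather than written against the Dumbo API.

```python
def outer(row):
    """Outer product row^T row as a list of lists."""
    return [[a * b for b in row] for a in row]


def add_into(total, m):
    """Add matrix m into total in place."""
    for i, r in enumerate(m):
        for j, v in enumerate(r):
            total[i][j] += v


def map_per_row(rows):
    """Emit one n x n record per input row; nothing is buffered."""
    for row in rows:
        yield "ATA", outer(row)


def map_buffered(rows):
    """Buffer the whole split, then emit a single n x n record."""
    buffered = list(rows)              # memory grows with the split size
    n = len(buffered[0])
    total = [[0.0] * n for _ in range(n)]
    for row in buffered:
        add_into(total, outer(row))
    yield "ATA", total


def reduce_sum(values):
    """Sum all n x n records received for one key."""
    values = list(values)
    n = len(values[0])
    total = [[0.0] * n for _ in range(n)]
    for m in values:
        add_into(total, m)
    return total


splits = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]   # inputs of two mappers
per_row = [v for s in splits for _, v in map_per_row(s)]
buffered = [v for s in splits for _, v in map_buffered(s)]

print(len(per_row), len(buffered))                  # records sent to the shuffle
print(reduce_sum(per_row) == reduce_sum(buffered))  # same A^T A either way
```

Under these assumptions the buffered mapper sends one record per split instead of one per row, which shortens the shuffle and leaves the reducer less to sum, while the buffer itself is what can exhaust memory on a large split.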

Mapper implementation
Can we do better? Yes

Questions?
Austin R. Benson
https://github.com/arbenson/mrtsqr