1
Binary Storage Formats for Sparse
Erik Welch
ewelch@anaconda.com
September 21, 2021
2
● To try to improve the status quo
○ Matrix Market, FROSTT, and other text-based formats
○ COO: coordinate triples of rows, columns, and values
■ Columnar storage technology is really good!
○ Bespoke, library-specific binary formats
○ TileDB
○ Domain-specific solutions (genomics, bioimaging, neuroscience, etc.)
● To discuss and create a vision for new binary storage formats for sparse
○ “Open standard” formats that are broadly useful and widely accessible
● To determine what is most important to the people here
○ Which languages, libraries, formats, features, technologies, etc.
○ And what do you wish for? Dream big!
● To find partners and team up for the next steps
○ Prototype, analyze, communicate results, and iterate
○ Goals for 3-6 months and beyond
Why are we here?
3
● Library developers and users
● Anybody with large sparse data or graphs
● Hardware creators and vendors
○ For example, to better optimize running workloads on GPUs or other accelerators
● Researchers
● Companies with graph workloads
● Benchmark organizers
● Maintainers of repositories of sparse data
● Virtually anybody who needs to work with sparse data or graphs
Who is this for?
4
● Data written to long term storage
○ Any technically competent person who discovers the file 10 years later should be
able to figure out how to load and use it (even if they don’t recognize the format)
● The logical model of a file format
○ Metadata about metadata
■ File extension (such as .zip or .pdf)
■ “Magic number” at beginning of file (NumPy’s .npy is x93NUMPY)
■ Version number of format
■ Any additional extensions needed to understand the metadata or data
○ Metadata
■ Data type, endianness of data, shape of sparse array, etc.
■ Info about compression
○ Data
■ Contiguous arrays of raw bytes
■ May be chunked or compressed
What is a file format?
5
An “open” file format is also about community
● More likely to succeed with more partners
● Right now, a small group (or just me) can
make a lot of progress with prototyping
and documenting to bootstrap the effort
○ Along with feedback from LAGraph group
and others
● We’ll want more partners eventually
● Please think about how/when you can help
○ Student projects
○ Volunteer to be beta users, give feedback
○ Participate in this session
● We’ll need a governance model eventually
○ Hopefully! This is a good “problem” to have
○ We’re happy to have this discussion
“If you want to go quickly,
go alone. If you want to
go far, go together.”
African Proverb
6
● Fast enough
● Small enough
● Flexible enough
● Portable enough
● Future-proof enough
● Standardized enough
● “Cloud native” enough
● Simple enough to implement
● Easy enough to use
● Scalable enough
● Able to evolve
● And, of course, able to support
whatever optimized format
Tim Davis can dream up!
What are our high level goals?
“Pinky, are you pondering
what I’m pondering?”
Brain
7
Essential sparse formats to support: rank 1 and 2
Sparse vector indices values
COO (ijv triples) rows columns values
CSR indptr col_indices values
CSC indptr row_indices values
DCSR (HyperCSR) rows indptr col_indices values
DCSC (HyperCSC) columns indptr row_indices values
Let’s start designing!
Many others: BSR, BSRX, DIA, SCS, Skyline, ELL (ITPACK/ELLPACK), hash map, CSB, CSR5, extended row scheme, etc.
8
Essential formats for rank N: COO and CSF
http://shaden.io/pub-files/smith2017knl.pdf
CSF is like DCSR/HyperCSR generalized to higher dimensions
9
● Metadata
○ Let’s use JSON
○ Versatile
○ Human-readable
○ No need to think about endianness
of metadata values
Logical model of a file
● Data
○ Like a key-value store with binary data
○ This is just a logical model
○ We are free to pack the data into bytes
however we wish
10
Unified Sparse Storage
Rank 1 Rank 2
key CSR CSC DCSR DCSC COO
double- fiber_ind0
compressed fiber_ptr0
compressed indptr
sparse coord0
coord1
metadata fiber_axes [0] [0]
compressed_axes [0] [0]
coord_axes [0] [1] [1] [1] [1] [0, 1]
axis_order [0] [0, 1] [1, 0] [0, 1] [1, 0] [0,1], [1,0]
optional num_sorted_coords 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0, 1, or 2
optional has_duplicates
Minimal, cohesive set of metadata and storage keys
11
Unified Sparse Storage: Rank 3
key CSF CSF (alt) CSR+extra COO DCSR+extra
double- fiber_ind0
compressed fiber_ptr0
fiber_ind1
fiber_ptr1
compressed indptr
sparse coord0
coord1
coord2
metadata fiber_axes [0, 1] [1] [0]
compressed_axes [0] [0]
coord_axes [2] [2] [1, 2] [0, 1, 2] [1, 2]
axis_order
optional num_sorted_coords 0 or 1 0 or 1 0 - 2 0 - 3 0 - 2
optional has_duplicates
12
Unified Sparse Storage: Rank 2 combinations
Rank 2 Combos
key
CSR
/COO
DCSR
/CSR
DCSR
/COO
DCSR/CSR
/COO
CSC
/COO
DCSC
/CSC
DCSC
/COO
DCSC/CSC
/COO
double- fiber_ind0
compressed fiber_ptr0
compressed indptr
sparse coord0
coord1
metadata fiber_axes [0] [0] [0] [0] [0] [0]
compressed_axes [0] [0] [0] [0] [0] [0]
coord_axes [0, 1] [1] [0, 1] [0, 1] [0, 1] [1] [0, 1] [0, 1]
axis_order [0, 1] [0, 1] [0, 1] [0, 1] [1, 0] [1, 0] [1, 0] [1, 0]
optional num_sorted_coords 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1
optional has_duplicates
13
● No values?
○ Index-only
● Single values array
○ Typical usage
● Multiple value arrays!
○ Different data types
○ Give them different names?
○ Allows bitmaps
○ Want ability to read only one values array
● Dense array (may be rank N) for each element
○ For example, from embeddings
● Can a fill value be given for the missing elements?
Unified Sparse Storage: Values
14
● Straw man proposal: use asar!
○ “asar is a simple extensive archive format, it works
like tar that concatenates all files together without
compression, while having random access support”
● Support random access
● Use JSON to store files' information
● Very easy to write a parser
How should we pack bytes into a file?
Other options
● ZIP file
● LMDB
● HDF4, HDF5, CDF5, NetCDF
● Directory of files on file system
● Directory of files in cloud
○ S3, GCS, Azure, Ceph, etc.
● Zarr
● n5
● ASDF
| UInt32: header_size | String: header | Bytes: file1 | ... | Bytes: file42 |
15
● Arbitrary metadata via JSON
○ Able to evolve and support extensions
● Metadata scheme to specify CSR, DCSR, COO, CSF, etc.
○ Indicates what data is available and how to read it
● Single file archive that supports random access
○ Logically behaves like a key-value store
● Great, but what about extensions?
○ Support other sparse formats
○ Specify compression
○ Split single array into chunks
○ Additional data types (such as datetime)
○ Indicate variant of or option for existing format
Quick recap: what do we have so far?
Design goal: easily show via feature matrix which formats
and extensions are supported by each language and library.
16
● CSR
○ “Padded” CSR: add indptr_end so column indices and values can have gaps
■ Cheaper to add or remove values
○ Column indices in a given row can be:
■ unsorted
■ sorted
■ binary tree as a binary heap
○ CSR5
● COO
○ One array per coordinate
■ May have different data types (uint8, uint16, etc.)
○ One array for all coordinates (N x D)
○ Has duplicates or not
○ Lexicographic sort depth
File format variations
17
File format variations
https://arxiv.org/pdf/1904.03329.pdf
B-CSF
18
Graph properties
● directed / undirected
● weighted / unweighted
● is adjacency matrix
● is incidence matrix
● is bipartite graph
● is multigraph
● is hypergraph
● has self-edges
● has node attributes
Matrix properties
● is upper triangular
● is lower triangular
● is diagonal
● is symmetric
● is symmetric (only upper triangle)
● is symmetric (only lower triangle)
● is iso-valued
● ndiag
What about other properties?
A graph is more than just a sparse matrix
Thank You!
Erik Welch
ewelch@anaconda.com
About Anaconda
With more than 20 million users, Anaconda is the world’s most
popular data science platform and the foundation of modern
machine learning. We pioneered the use of Python for data
science, champion its vibrant community, and continue to
steward open-source projects that make tomorrow’s
innovations possible. Our enterprise-grade solutions enable
corporate, research, and academic institutions around the world
to harness the power of open-source for competitive advantage,
groundbreaking research, and a better world.
Visit https://www.anaconda.com to learn more.

More Related Content

PPTX
Open Data Mashups: linking fragments into mosaics
PDF
Demystifying datastores
PDF
Linked Open Data: A simple how-to
PPTX
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
PDF
Indexing, searching, and aggregation with redi search and .net
PDF
Using NLP to Explore Entity Relationships in COVID-19 Literature
PDF
Brett Ragozzine - Graph Databases and Neo4j
PDF
Python Blaze Overview
Open Data Mashups: linking fragments into mosaics
Demystifying datastores
Linked Open Data: A simple how-to
LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)
Indexing, searching, and aggregation with redi search and .net
Using NLP to Explore Entity Relationships in COVID-19 Literature
Brett Ragozzine - Graph Databases and Neo4j
Python Blaze Overview

What's hot (17)

PPTX
Flow control, variable types
PDF
Europe PubMed Central and Linked Data
PPTX
Structures
PDF
Linking knowledge spaces
PPTX
Dsa unit 1
PPTX
Postgre sql data types
PDF
Crawling the Web for Structured Documents
PPTX
Learning R - Handling NetCDF files
PPTX
Introduction to Data Structures
PDF
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
PDF
Core Data
PPT
Introduction to trees
PPTX
Day 4( magic camp)
PPTX
Bdam presentation on parquet
PDF
Shilpa shukla processing_text
PPTX
Over view of data structures
Flow control, variable types
Europe PubMed Central and Linked Data
Structures
Linking knowledge spaces
Dsa unit 1
Postgre sql data types
Crawling the Web for Structured Documents
Learning R - Handling NetCDF files
Introduction to Data Structures
Big Data Day LA 2015 - How to model anything in Redis by Josiah Carlson of Ze...
Core Data
Introduction to trees
Day 4( magic camp)
Bdam presentation on parquet
Shilpa shukla processing_text
Over view of data structures
Ad

Similar to HPEC 2021 sparse binary format (20)

KEY
Numpy Talk at SIAM
PDF
The Joy of SciPy
PPTX
Format Wars: from VHS and Beta to Avro and Parquet
PDF
Apache Arrow Workshop at VLDB 2019 / BOSS Session
PDF
FP Days: Down the Clojure Rabbit Hole
PPTX
PDF
Migrating from matlab to python
PDF
PyData Paris 2015 - Closing keynote Francesc Alted
PPTX
DConf 2016: Bitpacking Like a Madman by Amaury Sechet
PDF
Column and hadoop
PPTX
Data management principles
PDF
The TileDB Embedded Storage Engine
PPTX
Introducing Apache Carbon Data - Hadoop Native Columnar Data Format
PPTX
Bsc cs ii dfs u-1 introduction to data structure
PPSX
Anton Dorfman - Reversing data formats what data can reveal
PDF
Hdf5 is for Lovers (PyData SV 2013)
PPT
Primitive data types
PDF
Portable Lucene Index Format & Applications - Andrzej Bialecki
PPTX
Bca ii dfs u-1 introduction to data structure
Numpy Talk at SIAM
The Joy of SciPy
Format Wars: from VHS and Beta to Avro and Parquet
Apache Arrow Workshop at VLDB 2019 / BOSS Session
FP Days: Down the Clojure Rabbit Hole
Migrating from matlab to python
PyData Paris 2015 - Closing keynote Francesc Alted
DConf 2016: Bitpacking Like a Madman by Amaury Sechet
Column and hadoop
Data management principles
The TileDB Embedded Storage Engine
Introducing Apache Carbon Data - Hadoop Native Columnar Data Format
Bsc cs ii dfs u-1 introduction to data structure
Anton Dorfman - Reversing data formats what data can reveal
Hdf5 is for Lovers (PyData SV 2013)
Primitive data types
Portable Lucene Index Format & Applications - Andrzej Bialecki
Bca ii dfs u-1 introduction to data structure
Ad

Recently uploaded (20)

PDF
IT-ITes Industry bjjbnkmkhkhknbmhkhmjhjkhj
PPTX
AI-driven Assurance Across Your End-to-end Network With ThousandEyes
PPTX
Presentation - Principles of Instructional Design.pptx
PDF
zbrain.ai-Scope Key Metrics Configuration and Best Practices.pdf
PDF
Transform-Your-Factory-with-AI-Driven-Quality-Engineering.pdf
PDF
Advancing precision in air quality forecasting through machine learning integ...
PDF
Data Virtualization in Action: Scaling APIs and Apps with FME
PDF
Human Computer Interaction Miterm Lesson
PDF
Decision Optimization - From Theory to Practice
PDF
The AI Revolution in Customer Service - 2025
PDF
NewMind AI Weekly Chronicles – August ’25 Week IV
PDF
SaaS reusability assessment using machine learning techniques
PDF
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
PDF
Transform-Your-Streaming-Platform-with-AI-Driven-Quality-Engineering.pdf
PDF
Early detection and classification of bone marrow changes in lumbar vertebrae...
PDF
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
PDF
CCUS-as-the-Missing-Link-to-Net-Zero_AksCurious.pdf
PDF
substrate PowerPoint Presentation basic one
PDF
Examining Bias in AI Generated News Content.pdf
PDF
NewMind AI Journal Monthly Chronicles - August 2025
IT-ITes Industry bjjbnkmkhkhknbmhkhmjhjkhj
AI-driven Assurance Across Your End-to-end Network With ThousandEyes
Presentation - Principles of Instructional Design.pptx
zbrain.ai-Scope Key Metrics Configuration and Best Practices.pdf
Transform-Your-Factory-with-AI-Driven-Quality-Engineering.pdf
Advancing precision in air quality forecasting through machine learning integ...
Data Virtualization in Action: Scaling APIs and Apps with FME
Human Computer Interaction Miterm Lesson
Decision Optimization - From Theory to Practice
The AI Revolution in Customer Service - 2025
NewMind AI Weekly Chronicles – August ’25 Week IV
SaaS reusability assessment using machine learning techniques
CXOs-Are-you-still-doing-manual-DevOps-in-the-age-of-AI.pdf
Transform-Your-Streaming-Platform-with-AI-Driven-Quality-Engineering.pdf
Early detection and classification of bone marrow changes in lumbar vertebrae...
Transform-Quality-Engineering-with-AI-A-60-Day-Blueprint-for-Digital-Success.pdf
CCUS-as-the-Missing-Link-to-Net-Zero_AksCurious.pdf
substrate PowerPoint Presentation basic one
Examining Bias in AI Generated News Content.pdf
NewMind AI Journal Monthly Chronicles - August 2025

HPEC 2021 sparse binary format

  • 1. 1 Binary Storage Formats for Sparse Erik Welch [email protected] September 21, 2021
  • 2. 2 ● To try to improve the status quo ○ Matrix Market, FROSTT, and other text-based formats ○ COO: coordinate triples of rows, columns, and values ■ Columnar storage technology is really good! ○ Bespoke, library-specific binary formats ○ TileDB ○ Domain-specific solutions (genomics, bioimaging, neuroscience, etc.) ● To discuss and create a vision for new binary storage formats for sparse ○ “Open standard” formats that are broadly useful and widely accessible ● To determine what is most important to the people here ○ Which languages, libraries, formats, features, technologies, etc. ○ And what do you wish for? Dream big! ● To find partners and team up for the next steps ○ Prototype, analyze, communicate results, and iterate ○ Goals for 3-6 months and beyond Why are we here?
  • 3. 3 ● Library developers and users ● Anybody with large sparse data or graphs ● Hardware creators and vendors ○ For example, to better optimize running workloads on GPUs or other accelerators ● Researchers ● Companies with graph workloads ● Benchmark organizers ● Maintainers of repositories of sparse data ● Virtually anybody who needs to work with sparse data or graphs Who is this for?
  • 4. 4 ● Data written to long term storage ○ Any technically competent person who discovers the file 10 years later should be able to figure out how to load and use it (even if they don’t recognize the format) ● The logical model of a file format ○ Metadata about metadata ■ File extension (such as .zip or .pdf) ■ “Magic number” at beginning of file (NumPy’s .npy is x93NUMPY) ■ Version number of format ■ Any additional extensions needed to understand the metadata or data ○ Metadata ■ Data type, endianness of data, shape of sparse array, etc. ■ Info about compression ○ Data ■ Contiguous arrays of raw bytes ■ May be chunked or compressed What is a file format?
  • 5. 5 An “open” file format is also about community ● More likely to succeed with more partners ● Right now, a small group (or just me) can make a lot of progress with prototyping and documenting to bootstrap the effort ○ Along with feedback from LAGraph group and others ● We’ll want more partners eventually ● Please think about how/when you can help ○ Student projects ○ Volunteer to be beta users, give feedback ○ Participate in this session ● We’ll need a governance model eventually ○ Hopefully! This is a good “problem” to have ○ We’re happy to have this discussion “If you want to go quickly, go alone. If you want to go far, go together.” African Proverb
  • 6. 6 ● Fast enough ● Small enough ● Flexible enough ● Portable enough ● Future-proof enough ● Standardized enough ● “Cloud native” enough ● Simple enough to implement ● Easy enough to use ● Scalable enough ● Able to evolve ● And, of course, able to support whatever optimized format Tim Davis can dream up! What are our high level goals? “Pinky, are you pondering what I’m pondering?” Brain
  • 7. 7 Essential sparse formats to support: rank 1 and 2 Sparse vector indices values COO (ijv triples) rows columns values CSR indptr col_indices values CSC indptr row_indices values DCSR (HyperCSR) rows indptr col_indices values DCSC (HyperCSC) columns indptr row_indices values Let’s start designing! Many others: BSR, BSRX, DIA, SCS, Skyline, ELL (ITPACK/ELLPACK), hash map, CSB, CSR5, extended row scheme, etc.
  • 8. 8 Essential formats for rank N: COO and CSF http://shaden.io/pub-files/smith2017knl.pdf CSF is like DCSR/HyperCSR generalized to higher dimensions
  • 9. 9 ● Metadata ○ Let’s use JSON ○ Versatile ○ Human-readable ○ No need to think about endianness of metadata values Logical model of a file ● Data ○ Like a key-value store with binary data ○ This is just a logical model ○ We are free to pack the data into bytes however we wish
  • 10. 10 Unified Sparse Storage Rank 1 Rank 2 key CSR CSC DCSR DCSC COO double- fiber_ind0 compressed fiber_ptr0 compressed indptr sparse coord0 coord1 metadata fiber_axes [0] [0] compressed_axes [0] [0] coord_axes [0] [1] [1] [1] [1] [0, 1] axis_order [0] [0, 1] [1, 0] [0, 1] [1, 0] [0,1], [1,0] optional num_sorted_coords 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0, 1, or 2 optional has_duplicates Minimal, cohesive set of metadata and storage keys
  • 11. 11 Unified Sparse Storage: Rank 3 key CSF CSF (alt) CSR+extra COO DCSR+extra double- fiber_ind0 compressed fiber_ptr0 fiber_ind1 fiber_ptr1 compressed indptr sparse coord0 coord1 coord2 metadata fiber_axes [0, 1] [1] [0] compressed_axes [0] [0] coord_axes [2] [2] [1, 2] [0, 1, 2] [1, 2] axis_order optional num_sorted_coords 0 or 1 0 or 1 0 - 2 0 - 3 0 - 2 optional has_duplicates
  • 12. 12 Unified Sparse Storage: Rank 2 combinations Rank 2 Combos key CSR /COO DCSR /CSR DCSR /COO DCSR/CSR /COO CSC /COO DCSC /CSC DCSC /COO DCSC/CSC /COO double- fiber_ind0 compressed fiber_ptr0 compressed indptr sparse coord0 coord1 metadata fiber_axes [0] [0] [0] [0] [0] [0] compressed_axes [0] [0] [0] [0] [0] [0] coord_axes [0, 1] [1] [0, 1] [0, 1] [0, 1] [1] [0, 1] [0, 1] axis_order [0, 1] [0, 1] [0, 1] [0, 1] [1, 0] [1, 0] [1, 0] [1, 0] optional num_sorted_coords 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 optional has_duplicates
  • 13. 13 ● No values? ○ Index-only ● Single values array ○ Typical usage ● Multiple value arrays! ○ Different data types ○ Give them different names? ○ Allows bitmaps ○ Want ability to read only one values array ● Dense array (may be rank N) for each element ○ For example, from embeddings ● Can a fill value be given for the missing elements? Unified Sparse Storage: Values
  • 14. 14 ● Straw man proposal: use asar! ○ “asar is a simple extensive archive format, it works like tar that concatenates all files together without compression, while having random access support” ● Support random access ● Use JSON to store files' information ● Very easy to write a parser How should we pack bytes into a file? Other options ● ZIP file ● LMDB ● HDF4, HDF5, CDF5, NetCDF ● Directory of files on file system ● Directory of files in cloud ○ S3, GCS, Azure, Ceph, etc. ● Zarr ● n5 ● ASDF | UInt32: header_size | String: header | Bytes: file1 | ... | Bytes: file42 |
  • 15. 15 ● Arbitrary metadata via JSON ○ Able to evolve and support extensions ● Metadata scheme to specify CSR, DCSR, COO, CSF, etc. ○ Indicates what data is available and how to read it ● Single file archive that supports random access ○ Logically behaves like a key-value store ● Great, but what about extensions? ○ Support other sparse formats ○ Specify compression ○ Split single array into chunks ○ Additional data types (such as datetime) ○ Indicate variant of or option for existing format Quick recap: what do we have so far? Design goal: easily show via feature matrix which formats and extensions are supported by each language and library.
  • 16. 16 ● CSR ○ “Padded” CSR: add indptr_end so column indices and values can have gaps ■ Cheaper to add or remove values ○ Column indices in a given row can be: ■ unsorted ■ sorted ■ binary tree as a binary heap ○ CSR5 ● COO ○ One array per coordinate ■ May have different data types (uint8, uint16, etc.) ○ One array for all coordinates (N x D) ○ Has duplicates or not ○ Lexicographic sort depth File format variations
  • 18. 18 Graph properties ● directed / undirected ● weighted / unweighted ● is adjacency matrix ● is incidence matrix ● is bipartite graph ● is multigraph ● is hypergraph ● has self-edges ● has node attributes Matrix properties ● is upper triangular ● is lower triangular ● is diagonal ● is symmetric ● is symmetric (only upper triangle) ● is symmetric (only lower triangle) ● is iso-valued ● ndiag What about other properties? A graph is more than just a sparse matrix
  • 20. About Anaconda With more than 20 million users, Anaconda is the world’s most popular data science platform and the foundation of modern machine learning. We pioneered the use of Python for data science, champion its vibrant community, and continue to steward open-source projects that make tomorrow’s innovations possible. Our enterprise-grade solutions enable corporate, research, and academic institutions around the world to harness the power of open-source for competitive advantage, groundbreaking research, and a better world. Visit https://www.anaconda.com to learn more.