HPEC 2021 sparse binary format

1
Binary Storage Formats for Sparse
Erik Welch
ewelch@anaconda.com
September 21, 2021

2
● To try to improve the status quo
○ Matrix Market, FROSTT, and other text-based formats
○ COO: coordinate triples of rows, columns, and values
■ Columnar storage technology is really good!
○ Bespoke, library-specific binary formats
○ TileDB
○ Domain-specific solutions (genomics, bioimaging, neuroscience, etc.)
● To discuss and create a vision for new binary storage formats for sparse
○ “Open standard” formats that are broadly useful and widely accessible
● To determine what is most important to the people here
○ Which languages, libraries, formats, features, technologies, etc.
○ And what do you wish for? Dream big!
● To find partners and team up for the next steps
○ Prototype, analyze, communicate results, and iterate
○ Goals for 3-6 months and beyond
Why are we here?

3
● Library developers and users
● Anybody with large sparse data or graphs
● Hardware creators and vendors
○ For example, to better optimize running workloads on GPUs or other accelerators
● Researchers
● Companies with graph workloads
● Benchmark organizers
● Maintainers of repositories of sparse data
● Virtually anybody who needs to work with sparse data or graphs
Who is this for?

4
● Data written to long term storage
○ Any technically competent person who discovers the file 10 years later should be
able to figure out how to load and use it (even if they don’t recognize the format)
● The logical model of a file format
○ Metadata about metadata
■ File extension (such as .zip or .pdf)
■ “Magic number” at beginning of file (NumPy’s .npy is x93NUMPY)
■ Version number of format
■ Any additional extensions needed to understand the metadata or data
○ Metadata
■ Data type, endianness of data, shape of sparse array, etc.
■ Info about compression
○ Data
■ Contiguous arrays of raw bytes
■ May be chunked or compressed
What is a file format?

5
An “open” ﬁle format is also about community
● More likely to succeed with more partners
● Right now, a small group (or just me) can
make a lot of progress with prototyping
and documenting to bootstrap the effort
○ Along with feedback from LAGraph group
and others
● We’ll want more partners eventually
● Please think about how/when you can help
○ Student projects
○ Volunteer to be beta users, give feedback
○ Participate in this session
● We’ll need a governance model eventually
○ Hopefully! This is a good “problem” to have
○ We’re happy to have this discussion
“If you want to go quickly,
go alone. If you want to
go far, go together.”
African Proverb

6
● Fast enough
● Small enough
● Flexible enough
● Portable enough
● Future-proof enough
● Standardized enough
● “Cloud native” enough
● Simple enough to implement
● Easy enough to use
● Scalable enough
● Able to evolve
● And, of course, able to support
whatever optimized format
Tim Davis can dream up!
What are our high level goals?
“Pinky, are you pondering
what I’m pondering?”
Brain

7
Essential sparse formats to support: rank 1 and 2
Sparse vector indices values
COO (ijv triples) rows columns values
CSR indptr col_indices values
CSC indptr row_indices values
DCSR (HyperCSR) rows indptr col_indices values
DCSC (HyperCSC) columns indptr row_indices values
Let’s start designing!
Many others: BSR, BSRX, DIA, SCS, Skyline, ELL (ITPACK/ELLPACK), hash map, CSB, CSR5, extended row scheme, etc.

8
Essential formats for rank N: COO and CSF
http://shaden.io/pub-ﬁles/smith2017knl.pdf
CSF is like DCSR/HyperCSR generalized to higher dimensions

9
● Metadata
○ Let’s use JSON
○ Versatile
○ Human-readable
○ No need to think about endianness
of metadata values
Logical model of a ﬁle
● Data
○ Like a key-value store with binary data
○ This is just a logical model
○ We are free to pack the data into bytes
however we wish

10
Uniﬁed Sparse Storage
Rank 1 Rank 2
key CSR CSC DCSR DCSC COO
double- fiber_ind0
compressed fiber_ptr0
compressed indptr
sparse coord0
coord1
metadata fiber_axes [0] [0]
compressed_axes [0] [0]
coord_axes [0] [1] [1] [1] [1] [0, 1]
axis_order [0] [0, 1] [1, 0] [0, 1] [1, 0] [0,1], [1,0]
optional num_sorted_coords 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0, 1, or 2
optional has_duplicates
Minimal, cohesive set of metadata and storage keys

11
Uniﬁed Sparse Storage: Rank 3
key CSF CSF (alt) CSR+extra COO DCSR+extra
double- fiber_ind0
fiber_ind1
fiber_ptr1
compressed indptr
sparse coord0
coord1
coord2
metadata fiber_axes [0, 1] [1] [0]
compressed_axes [0] [0]
coord_axes [2] [2] [1, 2] [0, 1, 2] [1, 2]
axis_order
optional num_sorted_coords 0 or 1 0 or 1 0 - 2 0 - 3 0 - 2

12
Uniﬁed Sparse Storage: Rank 2 combinations
Rank 2 Combos
key
CSR
/COO
DCSR
/CSR
DCSR
/COO
DCSR/CSR
/COO
CSC
/COO
DCSC
/CSC
DCSC
/COO
DCSC/CSC
/COO
double- fiber_ind0
compressed indptr
sparse coord0
coord1
metadata fiber_axes [0] [0] [0] [0] [0] [0]
compressed_axes [0] [0] [0] [0] [0] [0]
coord_axes [0, 1] [1] [0, 1] [0, 1] [0, 1] [1] [0, 1] [0, 1]
axis_order [0, 1] [0, 1] [0, 1] [0, 1] [1, 0] [1, 0] [1, 0] [1, 0]
optional num_sorted_coords 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1

13
● No values?
○ Index-only
● Single values array
○ Typical usage
● Multiple value arrays!
○ Different data types
○ Give them different names?
○ Allows bitmaps
○ Want ability to read only one values array
● Dense array (may be rank N) for each element
○ For example, from embeddings
● Can a ﬁll value be given for the missing elements?
Uniﬁed Sparse Storage: Values

14
● Straw man proposal: use asar!
○ “asar is a simple extensive archive format, it works
like tar that concatenates all files together without
compression, while having random access support”
● Support random access
● Use JSON to store files' information
● Very easy to write a parser
How should we pack bytes into a file?
Other options
● ZIP file
● LMDB
● HDF4, HDF5, CDF5, NetCDF
● Directory of files on file system
● Directory of files in cloud
○ S3, GCS, Azure, Ceph, etc.
● Zarr
● n5
● ASDF
| UInt32: header_size | String: header | Bytes: file1 | ... | Bytes: file42 |

15
● Arbitrary metadata via JSON
○ Able to evolve and support extensions
● Metadata scheme to specify CSR, DCSR, COO, CSF, etc.
○ Indicates what data is available and how to read it
● Single ﬁle archive that supports random access
○ Logically behaves like a key-value store
● Great, but what about extensions?
○ Support other sparse formats
○ Specify compression
○ Split single array into chunks
○ Additional data types (such as datetime)
○ Indicate variant of or option for existing format
Quick recap: what do we have so far?
Design goal: easily show via feature matrix which formats
and extensions are supported by each language and library.

16
● CSR
○ “Padded” CSR: add indptr_end so column indices and values can have gaps
■ Cheaper to add or remove values
○ Column indices in a given row can be:
■ unsorted
■ sorted
■ binary tree as a binary heap
○ CSR5
● COO
○ One array per coordinate
■ May have different data types (uint8, uint16, etc.)
○ One array for all coordinates (N x D)
○ Has duplicates or not
○ Lexicographic sort depth
File format variations

17
File format variations
https://arxiv.org/pdf/1904.03329.pdf
B-CSF

18
Graph properties
● directed / undirected
● weighted / unweighted
● is adjacency matrix
● is incidence matrix
● is bipartite graph
● is multigraph
● is hypergraph
● has self-edges
● has node attributes
Matrix properties
● is upper triangular
● is lower triangular
● is diagonal
● is symmetric
● is symmetric (only upper triangle)
● is symmetric (only lower triangle)
● is iso-valued
● ndiag
What about other properties?
A graph is more than just a sparse matrix

Thank You!
Erik Welch
ewelch@anaconda.com

About Anaconda
With more than 20 million users, Anaconda is the world’s most
popular data science platform and the foundation of modern
machine learning. We pioneered the use of Python for data
science, champion its vibrant community, and continue to
steward open-source projects that make tomorrow’s
innovations possible. Our enterprise-grade solutions enable
corporate, research, and academic institutions around the world
to harness the power of open-source for competitive advantage,
groundbreaking research, and a better world.
Visit https://www.anaconda.com to learn more.

HPEC 2021 sparse binary format

More Related Content

What's hot (17)

Similar to HPEC 2021 sparse binary format (20)

Recently uploaded (20)

HPEC 2021 sparse binary format