Wrapping C++ Arrow - Why and How?
2 Sep 2019
Yoni Davidson
TG-17
About me!
Generalist working at TG-17, a stealth-mode startup.
Currently I work in Kotlin, JavaScript and Python.
Before:
Sears Israel - Mobile/Backend team.
Eyesight mobile - Mobile, Platform and IoT teams.
Alvarion - WiMAX and Wi-Fi teams.
Motivation
Data is getting bigger (Hadoop, S3) - Parquet for efficient storage.
Data scientists need a way to work without running out of memory.
Big data infrastructure is based on the JVM. Other languages want to work on the same data, but serialization is expensive - Python is the best example.
Moving data around is expensive (serialization and deserialization) - IO between services, GPU -> CPU.
Building all this for each framework and each language (Java + Python) is a lot of work and blocks innovation.
What is Apache Arrow?
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
https://arrow.apache.org/
Moving data in memory between languages and between services
What is Apache Arrow?
Performance Advantage of Columnar In-Memory
Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs (multi-core) and GPUs.
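To make the columnar argument concrete, here is a minimal Go sketch. It is not Arrow code: the Row and Columns types and both sum functions are hypothetical names invented for illustration. It only shows why an analytic scan over one field touches less memory when that field is stored contiguously.

```go
package main

import "fmt"

// Row is a hypothetical row-oriented record: all fields of one record sit together.
type Row struct {
	ID    int64
	Price float64
	Name  string
}

// Columns is a hypothetical column-oriented table: each field is one contiguous slice.
type Columns struct {
	ID    []int64
	Price []float64
	Name  []string
}

// SumPriceRows walks whole records even though it needs a single field.
func SumPriceRows(rows []Row) float64 {
	var total float64
	for _, r := range rows {
		total += r.Price
	}
	return total
}

// SumPriceColumns scans one contiguous slice and never touches ID or Name.
func SumPriceColumns(c Columns) float64 {
	var total float64
	for _, p := range c.Price {
		total += p
	}
	return total
}

func main() {
	rows := []Row{{1, 1.5, "a"}, {2, 2.5, "b"}, {3, 4.0, "c"}}
	cols := Columns{
		ID:    []int64{1, 2, 3},
		Price: []float64{1.5, 2.5, 4.0},
		Name:  []string{"a", "b", "c"},
	}
	// Both layouts hold the same data and give the same answer.
	fmt.Println(SumPriceRows(rows), SumPriceColumns(cols))
}
```

Both functions return the same sum. The difference is only in how much unrelated data the scan has to pull through the cache, which is the effect the slide refers to.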
What is Apache Arrow?
Advantages of a Common Data Layer
Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs.
Without Arrow:
● Each system has its own internal memory format
● 70-80% computation wasted on serialization and deserialization
● Similar functionality implemented in multiple projects
With Arrow:
● All systems utilize the same memory format
● No overhead for cross-system communication
● Projects can share functionality (e.g., Parquet-to-Arrow reader)
Who is leading the work on Apache Arrow?
How fast is it?
● https://github.com/apache/spark/pull/22954 - Enables Arrow optimization from R DataFrame to Spark DataFrame
How fast is it?
● https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Where is Apache Arrow going?
Using Arrow to let TensorFlow work natively with local and remote datasets:
https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f
pandas2 will be based on Apache Arrow - native work with pandas on other platforms.
Language bindings status:
Apache Arrow Python bindings
Based on the C++ project.
Built with Cython.
Allows integration with the massive Python ecosystem, e.g. pandas.
What is Language Binding?
In the context of software libraries, bindings are wrapper libraries that bridge two programming languages, so that a library written for one language can be used in another language (Wikipedia).
Where do we find language bindings?
What do we want to do with our Go implementation?
Share a table with Python in the same memory space, with zero serialization.
First approach - implement spec in pure Go
Pros:
1. It's a well-bounded problem - read the spec, write tests and implement.
2. It gives you all the advantages of Arrow (up to the implementation date).
3. Go lets us improve the implementation with better tools for concurrent work (easier than C++).
First approach - implement spec in pure Go
Cons:
1. Every improvement made on the main branch needs to be reimplemented in Go. Especially when it is not an "API" change, you need to understand the C++ code and then rewrite it in Go.
2. Point 1 makes the project harder to maintain.
3. If the Go version adds improvements, it is harder to export them back to the C++ project (and to Python, which is bound to it), since the Go project is not the core native one.
First approach - implement spec in pure Go
Python project also enjoys C++ improvements.
carrow - Go bindings to Apache Arrow via C++-API
https://github.com/353solutions/carrow
carrow - Go bindings to Apache Arrow via C++-API
Pros:
1. This project gets all the improvements made on the C++ main branch.
2. Anything we add in the Go project can be exported back to the Python/C++ project (we ran an experiment reading pandas data from our Go project).
carrow - Go bindings to Apache Arrow via C++-API
Cons:
1. It's much harder to build (compared to a pure native Go implementation).
Challenge 1 - Go and CPP - don’t link
C++ compilers mangle symbol names (to support C++ features such as overloading and namespaces).
cgo cannot link against mangled symbols, so a C wrapper is needed.
Challenge 1 - Go and CPP - don’t link - example
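// Creates an arrow::Table and returns it behind an opaque pointer that C (and cgo) can hold.
void *table_new(void *sp, void *cp) {
    auto schema = (Schema *)sp;
    auto columns = (std::vector<std::shared_ptr<arrow::Column>> *)cp;
    auto table = arrow::Table::Make(schema->ptr, *columns);
    if (table == nullptr) {
        return nullptr;
    }
    // The wrapper struct keeps the shared_ptr alive on the C++ side.
    auto wrapper = new Table;
    wrapper->table = table;
    return wrapper;
}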
Challenge 1 - Go and CPP - don’t link - example
#ifndef _CARROW_H_
#define _CARROW_H_
#ifdef __cplusplus
extern "C" {
#endif
void *table_new(void *sp, void *cp);
#ifdef __cplusplus
} // extern "C"
#endif
#endif // _CARROW_H_
Challenge 2 - Building a CPP/Go project
C++ libraries and headers are required, so the development environment is more complex than for a pure Go project.
The solution is a Dockerfile that contains the native C++ libraries plus the Python bindings for end-to-end tests.
Challenge 2 - Building a CPP/Go project - Dockerfile
FROM ubuntu:18.04
# Tools
RUN apt-get update && apt-get install -y \
    gdb \
    git \
    make \
    vim \
    wget \
    && rm -rf /var/lib/apt/lists/*
# Go installation
RUN cd /tmp && \
    wget https://dl.google.com/go/go1.12.9.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.12.9.linux-amd64.tar.gz && \
    rm go1.12.9.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
Challenge 2 - Building a CPP/Go project - Dockerfile
# Python bindings
RUN cd /tmp && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /miniconda && \
    rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH="/miniconda/bin:${PATH}"
RUN conda install -y \
    Cython \
    conda-forge::compilers \
    conda-forge::pyarrow=0.14 \
    ipython \
    numpy \
    pkg-config
ENV LD_LIBRARY_PATH=/miniconda/lib
WORKDIR /src/carrow
Challenge 3 - Wrapper for each type
Since this is a wrapper library, wrapping each type requires a lot of "copy pasta" code.
The solution was to use Go templates to generate some of the code.
Challenge 3 - Wrapper for each type - example
func main() {
	arrowTypes := []string{"Bool", "Float64", "Integer64", "String", "Timestamp"}
	.
	.
	.
// Supported data types
var (
{{- range $val := .ArrowTypes}}
	{{$val}}Type = DType(C.{{$val | ToUpper }}_DTYPE)
{{- end}}
)
Challenge 4 - Logger
Do we send all errors upstream to the Go package for logging?
Alternatively, we can create a Go logger and pass it down to the C++ code for logging.
Challenge 5 - Error handling
Where are errors handled?
Where is the best place to log and handle them?
For now, every call returns this result_t:
typedef struct {
    const char *err;
    void *ptr;
    int64_t i;
} result_t;
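On the Go side, a struct like this is usually turned into the idiomatic (value, error) pair right at the boundary. The sketch below is pure Go and only illustrates that conversion: the result type is a stand-in for result_t, and intResult and fakeLength are hypothetical names. In a real cgo binding the err field would be a C string, which would have to be copied (for example with C.GoString) and released according to whatever ownership rule the C wrapper defines.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a pure-Go stand-in for the C result_t struct (assumption for this sketch).
type result struct {
	err string  // empty string plays the role of a NULL err pointer
	ptr uintptr // opaque handle, unused in this example
	i   int64   // integer payload
}

// intResult converts a result into the usual Go (value, error) pair.
func intResult(r result) (int64, error) {
	if r.err != "" {
		return 0, errors.New(r.err)
	}
	return r.i, nil
}

// fakeLength is a hypothetical stand-in for a wrapped C call.
func fakeLength(fail bool) result {
	if fail {
		return result{err: "table is null"}
	}
	return result{i: 3}
}

func main() {
	n, err := intResult(fakeLength(false))
	fmt.Println(n, err)

	_, err = intResult(fakeLength(true))
	fmt.Println(err)
}
```

Doing the conversion in one helper keeps the decision about where errors are logged on the Go side, which is one possible answer to the question the slide raises.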
Challenge 666 - Memory management
Two memory managers:
1. Go runtime - automatic memory management.
2. C++ runtime - Apache Arrow uses std::shared_ptr extensively:
std::shared_ptr is a smart pointer that retains shared ownership of an object through a pointer. Several shared_ptr objects may own the same object. The object is destroyed and its memory deallocated when either of the following happens:
■ the last remaining shared_ptr owning the object is destroyed;
■ the last remaining shared_ptr owning the object is assigned another pointer via operator= or reset().
Challenge 666 - Memory management - solution
Wrap std::shared_ptr with a struct - so we know who owns the memory.
struct Table {
    std::shared_ptr<arrow::Table> table;
};
Challenge 666 - Memory management - solution
Use finalizer to free memory.
// NewSchema creates a new schema
func NewSchema(fields []*Field) (*Schema, error) {
	fieldsList, err := NewFieldList()
	if err != nil {
		return nil, fmt.Errorf("can't create schema, failed creating fields list")
	}
	.
	.
	.
	schema := &Schema{ptr}
	// Free the C++ object when the Go wrapper is garbage collected.
	runtime.SetFinalizer(schema, func(s *Schema) {
		C.schema_free(s.ptr)
	})
	return schema, nil
}
Challenge 7 - cgo is FFI
FFI - Foreign function interface
https://github.com/dyu/ffi-overhead
Results (500M calls):
c: 1182, 1182
cpp: 1182, 1183
Go: 37975 (about 32x slower)
Challenge 7 - cgo is FFI
Try to reduce unneeded cgo calls:
Use the Builder pattern for appending data to an array.
func TestAppendInt64(t *testing.T) {
	require := require.New(t) // testify assertions bound to t
	bld := NewInteger64ArrayBuilder()
	const size = 20913
	for i := int64(0); i < size; i++ {
		err := bld.Append(i)
		require.NoErrorf(err, "append %d", i)
	}
	arr, err := bld.Finish()
	require.NoError(err, "finish")
	require.NotNil(arr)
}
Our benchmarks show that this implementation is 7 times faster than calling a cgo function for each appended value.
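The general idea behind reducing cgo calls is to amortize the cost of crossing the language boundary over many values. The sketch below illustrates that idea in pure Go and is not how carrow implements its builders: BatchBuilder, appendForeign and the crossings counter are hypothetical, with appendForeign standing in for a cgo function that accepts a whole chunk at once.

```go
package main

import "fmt"

// crossings counts simulated foreign calls.
var crossings int

// appendForeign stands in for a cgo function that takes a whole chunk of values.
func appendForeign(dst *[]int64, chunk []int64) {
	crossings++
	*dst = append(*dst, chunk...)
}

// BatchBuilder buffers values in Go memory and crosses the boundary once per chunk.
type BatchBuilder struct {
	buf  []int64
	sink []int64 // stands in for the native-side builder
	size int
}

// NewBatchBuilder creates a builder that flushes every chunkSize values.
func NewBatchBuilder(chunkSize int) *BatchBuilder {
	return &BatchBuilder{buf: make([]int64, 0, chunkSize), size: chunkSize}
}

// Append stores the value locally and flushes when the buffer is full.
func (b *BatchBuilder) Append(v int64) {
	b.buf = append(b.buf, v)
	if len(b.buf) >= b.size {
		b.flush()
	}
}

// flush hands the buffered values over in a single foreign call.
func (b *BatchBuilder) flush() {
	if len(b.buf) == 0 {
		return
	}
	appendForeign(&b.sink, b.buf)
	b.buf = b.buf[:0]
}

// Finish flushes the remainder and returns everything that was appended.
func (b *BatchBuilder) Finish() []int64 {
	b.flush()
	return b.sink
}

func main() {
	bld := NewBatchBuilder(1024)
	const size = 20913
	for i := int64(0); i < size; i++ {
		bld.Append(i)
	}
	arr := bld.Finish()
	fmt.Println(len(arr), crossings) // 20913 values, 21 simulated foreign calls
}
```

With a chunk size of 1024, the 20913 appends from the test above need 21 boundary crossings instead of 20913. The right chunk size is a trade-off between memory held on the Go side and call overhead.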
Challenge 8 - Making the package go-gettable
This library is linked against a specific Arrow version on a specific OS (Linux AMD64, for example).
Do we precompile for each OS?
Do we list in the README which packages need to be installed alongside?
carrow status
Adding more features (more data types).
Building good use cases: where and how should we use this?
Adding our project to the main Apache Arrow repo.
Questions?
Thank you
https://github.com/yonidavidson
https://www.linkedin.com/in/yoni-davidson-35b53222/
https://twitter.com/yonidavidson


Editor's Notes

  • #11: This blog post introduces the Pandas UDFs (a.k.a. Vectorized UDFs) feature in the upcoming Apache Spark 2.3 release that substantially improves the performance and usability of user-defined functions (UDFs) in Python. Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodel, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard in processing big data. To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala and then invoke them from Python. Pandas UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high-performance UDFs entirely in Python. In Spark 2.3, there will be two types of Pandas UDFs: scalar and grouped map. Next, we illustrate their usage using four example programs: Plus One, Cumulative Probability, Subtract Mean, Ordinary Least Squares Linear Regression.