Cobrix – A COBOL data
source for Spark
Ruslan Iushchenko (ABSA), Felipe Melo (ABSA)
• Who we are
• Motivation
• Mainframe files and copybooks
• Loading simple files
• Loading hierarchical databases
• Performance and results
• Cobrix in ABSA Big Data space
Outline
2/40
About us
• ABSA is a Pan-African financial services provider
  – with Apache Spark at the core of its data engineering
• We fill gaps in the Hadoop ecosystem when we find them
• Contributions to Apache Spark
• Spark-related open-source projects (https://github.com/AbsaOSS)
  – Spline – a data lineage tracking and visualization tool
  – ABRiS – Avro SerDe for structured APIs
  – Atum – a data quality library for Spark
  – Enceladus – a dynamic data conformance engine
  – Cobrix – a COBOL library for Spark (the focus of this presentation)
3/40
The market for Mainframes is strong, with no signs of cooling down.
Mainframes
Are used by 71% of Fortune 500 companies
Are responsible for 87% of all credit card transactions in the world
Are part of the IT infrastructure of 92 out of the 100 biggest banks in the world
Handle 68% of the world’s production IT workloads, while accounting for only 6%
of IT costs.
For companies relying on Mainframes, becoming data-centric can be
prohibitively expensive
High cost of hardware
Expensive business model for data science related activities
Source: http://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/
Business Motivation
4/40
Technical Motivation
• The legacy process shown below takes 11 days for a 600 GB file
• Legacy data models (hierarchical)
• Need for performance, scalability, flexibility, etc.
• SPOILER alert: we brought it down to 1.1 hours 5/40
[Diagram: the legacy pipeline — 1. Extract from mainframes to fixed-length text files using proprietary tools, 2.–3. Transform to CSV on a PC, 4. Load into HDFS]
• Run analytics / Spark on mainframes
• Message Brokers (e.g. MQ)
• Sqoop
• Proprietary solutions
• But ...
• Pricey
• Slow
• Complex (especially for legacy systems)
• Require human involvement
What can you do?
6/40
How Cobrix can help
• Decreasing human involvement
• Simplifying the manipulation of hierarchical structures
• Providing scalability
• Open-source
7/40
[Diagram: a mainframe file (EBCDIC) and its schema (copybook) are read by Cobrix inside an Apache Spark application; the resulting dataframes go through transformations and a writer produces the output (Parquet, JSON, CSV, …)]
Cobrix – a custom Spark data source
8/40
A copybook is a schema definition
A data file is a collection of binary records
Name: █ █ █ █ Age: █ █
Company: █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █ █
Zip: █ █ █ █ █
Name: J O H N Age: 3 2
Company: F O O . C O M
Phone #: + 2 3 1 1 - 3 2 7
Zip: 1 2 0 0 0
A * N J O H N G A 3 2 S H
K D K S I A S S A S K A S
A L , S D F O O . C O M X
L Q O K ( G A } S N B W E
S < N J X I C W L D H J P
A S B C + 2 3 1 1 - 3 2 7
C = D 1 2 0 0 0 F H 0 D .
9/40
Similar to IDLs of Avro, Thrift, Protocol Buffers, etc.

Thrift
struct Company {
  1: required i64 id,
  2: required string name,
  3: optional list<string> contactPeople
}

Protocol Buffers
message Company {
  required int64 id = 1;
  required string name = 2;
  repeated string contact_people = 3;
}

COBOL
10 COMPANY.
   15 ID PIC 9(12) COMP.
   15 NAME PIC X(40).
   15 CONTACT-PEOPLE PIC X(20)
      OCCURS 10.

record Company {
  int64 id;
  string name;
  array<string> contactPeople;
}
10/40
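For illustration, a rough sketch of the Spark schema such a copybook maps to (field names and types are indicative only; the exact mapping Cobrix produces may differ):

import org.apache.spark.sql.types._

// Indicative Spark schema for the COMPANY copybook above.
val companySchema = StructType(Seq(
  StructField("ID", LongType),                          // PIC 9(12) COMP - binary integral
  StructField("NAME", StringType),                      // PIC X(40) - fixed-width text
  StructField("CONTACT_PEOPLE", ArrayType(StringType))  // OCCURS 10 - repeated field becomes an array
))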
val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
01 RECORD.
05 COMPANY-ID PIC 9(10).
05 COMPANY-NAME PIC X(40).
05 ADDRESS PIC X(60).
05 REG-NUM PIC X(8).
05 ZIP PIC X(6).
A * N J O H N G A 3 2 S H K D K S I
A S S A S K A S A L , S D F O O . C
O M X L Q O K ( G A } S N B W E S <
N J X I C W L D H J P A S B C + 2 3
COMPANY_ID COMPANY_NAME ADDRESS REG_NUM ZIP
100 ABCD Ltd. 10 Garden st. 8791237 03120
101 ZjkLPj 11 Park ave. 1233971 23111
102 Robotrd Inc. 12 Forest st. 0382979 12000
103 Xingzhoug 8 Mountst. 2389012 31222
104 Example.co 123 Tech str. 3129001 19000
Loading Mainframe Data
11/40
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Example").getOrCreate()
val df = spark
.read
.format("cobol")
.option("copybook", "data/example.cob")
.load("data/example")
// ...Business logic goes here...
df.write.parquet("data/output")
This App is
● Distributed
● Scalable
● Resilient
EBCDIC to Parquet examples
12/40
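Since Cobrix is a regular Spark data source, the only extra requirement is having the library on the classpath. A sketch of the dependency, assuming an sbt build (coordinates and version are illustrative; check the Cobrix project page for the current ones):

// build.sbt - illustrative coordinates, verify against the Cobrix project page
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "<version>"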
A * N J O H N G A 3 2 S H K D K S I A
S S A S K A S A L , S D F O O . C O M
X L Q O K ( G A } S N B W E S < N J X
I C W L D H J P A S B C + 2 3 1 1 - 3
2 7 C = D 1 2 0 0 0 F H 0 D . A * N J
O H N G A 3 2 S H K D K S I A S S A S
K A S A L , S D F O O . C O M X L Q O
K ( G A } S N B W E S < N J X I C W L
D H J P A S B C + 2 B W E S < N J X P
FIRST-NAME: █ █ █ █ █ █ LAST-NAME: █ █ █ █ █COMPANY-NAME: █ █ █ █ █ █ █ █ █ █ █ █ █ █
• Redefined fields AKA
• Unchecked unions
• Untagged unions
• Variant type fields
• Several fields occupy the same space
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
Redefined Fields
13/40
01 RECORD.
05 IS-COMPANY PIC 9(1).
05 COMPANY.
10 COMPANY-NAME PIC X(40).
05 PERSON REDEFINES COMPANY.
10 FIRST-NAME PIC X(20).
10 LAST-NAME PIC X(20).
05 ADDRESS PIC X(50).
05 ZIP PIC X(6).
• Cobrix applies all redefines for each
record
• Some fields can clash
• It’s up to the user to apply business logic to separate valid from invalid data
IS_COMPANY COMPANY PERSON ADDRESS ZIP
1 {"COMPANY_NAME": "September Ltd."} {"FIRST_NAME": "Septem", "LAST_NAME": "ber Ltd."} 74 Lawn ave., Denver 39023
0 {"COMPANY_NAME": "Beatrice Gagliano"} {"FIRST_NAME": "Beatrice", "LAST_NAME": "Gagliano"} 10 Garden str. 33113
1 {"COMPANY_NAME": "January Inc."} {"FIRST_NAME": "Januar", "LAST_NAME": "y Inc."} 122/1 Park ave. 31234
Redefined Fields
14/40
df.select($"IS_COMPANY",
  when($"IS_COMPANY" === 1, $"COMPANY.COMPANY_NAME")
    .otherwise(null).as("COMPANY_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.FIRST_NAME")
    .otherwise(null).as("FIRST_NAME"),
  ...
IS_COMPANY COMPANY_NAME FIRST_NAME LAST_NAME ADDRESS ZIP
1 September Ltd. 74 Lawn ave., Denver 39023
0 Beatrice Gagliano 10 Garden str. 33113
1 January Inc. 122/1 Park ave. 31234
Clean Up Redefined Fields + flatten structs
15/40
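Spelled out in full, with both structs flattened, the cleanup might look like the sketch below (it assumes IS_COMPANY holds 1 for companies and 0 for people, as in the sample data):

import spark.implicits._
import org.apache.spark.sql.functions.when

// Keep only the valid redefine per record and flatten both structs.
val dfClean = df.select(
  $"IS_COMPANY",
  when($"IS_COMPANY" === 1, $"COMPANY.COMPANY_NAME").as("COMPANY_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.FIRST_NAME").as("FIRST_NAME"),
  when($"IS_COMPANY" === 0, $"PERSON.LAST_NAME").as("LAST_NAME"),
  $"ADDRESS",
  $"ZIP")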
Hierarchical DBs
• Several record types
• AKA segments
• Each segment type has its
own schema
• Parent-child relationships
between segments
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
Root segment
Child segment
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
…
Child segment
16/40
• The combined copybook has to contain all the segments as redefined
fields:
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY-ID PIC X(10).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
common data
segment 1
segment 2
COMPANY
Name: █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
Reg-Num: █ █ █ █ █ █
CONTACT
Phone #: █ █ █ █ █ █ █
Contact Person: █ █ █ █
█ █ █ █ █ █ █ █ █ █ █
Defining a Copybook
17/40
• The code snippet for reading the data:
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.load("examples/multisegment_data")
Reading all the segments
18/40
• The dataset for the whole copybook:
• Invalid redefines are marked as [ invalid ]
SEGMENT_ID COMPANY_ID COMPANY CONTACT
C 1005918818 [ ABCD Ltd. ] [ invalid ]
P 1005918818 [ invalid ] [ Cliff Wallingford ]
C 1036146222 [ DEFG Ltd. ] [ invalid ]
P 1036146222 [ invalid ] [ Beatrice Gagliano ]
C 1045855294 [ Robotrd Inc. ] [ invalid ]
P 1045855294 [ invalid ] [ Doretha Wallingford ]
P 1045855294 [ invalid ] [ Deshawn Benally ]
P 1045855294 [ invalid ] [ Willis Tumlin ]
C 1057751949 [ Xingzhoug ] [ invalid ]
P 1057751949 [ invalid ] [ Mindy Boettcher ]
Reading all the segments
19/40
[Illustration: the raw EBCDIC byte stream is parsed into interleaved COMPANY records (ID, Name, Address) and PERSON records (Name, Phone #)]
Id Name Address Reg_Num
100 Example.com 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 ABCD Ltd. 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Jane +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Separate segments by dataframes
20/40
• Filter segment #1 (companies)
val dfCompanies = df
  .filter($"SEGMENT_ID" === "C")
  .select($"COMPANY_ID",
    $"COMPANY.NAME".as("COMPANY_NAME"),
    $"COMPANY.ADDRESS",
    $"COMPANY.REG_NUM")
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Reading root segments
21/40
• Filter segment #2 (people)
val dfContacts = df
.filter($"SEGMENT_ID"==="P")
.select($"COMPANY_ID",
$"CONTACT.CONTACT_PERSON",
$"CONTACT.PHONE_NUMBER")
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Reading child segments
22/40
Company_Id Company_Name Address Reg_Num
100 ABCD Ltd. 10 Garden st. 8791237
101 ZjkLPj 11 Park ave. 1233971
102 Robotrd Inc. 12 Forest st. 0382979
103 Xingzhoug 8 Mountst. 2389012
104 Example.co 123 Tech str. 3129001
Company_Id Contact_Person Phone_Number
100 Marry +32186331
100 Colyn +23769123
102 Robert +12389679
102 Teresa +32187912
102 Laura +42198723
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
The two segments can now be joined by Company_Id
23/40
• Joining segments 1 and 2
val dfJoined =
  dfCompanies.join(dfContacts, Seq("COMPANY_ID"))
Results:
Company_Id Company_Name Address Reg_Num Contact_Person Phone_Number
100 ABCD Ltd. 10 Garden st. 8791237 Marry +32186331
100 ABCD Ltd. 10 Garden st. 8791237 Colyn +23769123
102 Robotrd Inc. 12 Forest st. 0382979 Robert +12389679
102 Robotrd Inc. 12 Forest st. 0382979 Teresa +32187912
102 Robotrd Inc. 12 Forest st. 0382979 Laura +42198723
Joining in Spark is easy
24/40
• The joined table can also be denormalized for document storage
val dfCombined =
dfJoined
.groupBy($"COMPANY_ID",
$"COMPANY_NAME",
$"ADDRESS",
$"REG_NUM")
.agg(
collect_list(
struct($"CONTACT_PERSON",
$"PHONE_NUMBER"))
.as("CONTACTS"))
{
"COMPANY_ID": "8216281722",
"COMPANY_NAME": "ABCD Ltd.",
"ADDRESS": "74 Lawn ave., New York",
"REG_NUM": "33718594",
"CONTACTS": [
{
"CONTACT_PERSON": "Cassey Norgard",
"PHONE_NUMBER": "+(595) 641 62 32"
},
{
"CONTACT_PERSON": "Verdie Deveau",
"PHONE_NUMBER": "+(721) 636 72 35"
},
{
"CONTACT_PERSON": "Otelia Batman",
"PHONE_NUMBER": "+(813) 342 66 28"
}
]
}
Denormalize data
25/40
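The denormalized dataframe can then be written out directly, for example as one JSON document per company (the output path below is illustrative):

// Write the denormalized company documents as JSON.
dfCombined.write.json("data/companies_json")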
Restore parent-child relationships
• In our example the COMPANY_ID field was present in all segments
• In real copybooks this is often not the case
• What can we do?
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
Root segment
Child segment
CONTACT-PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
…
Child segment
26/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• If COMPANY_ID is not part
of all segments
Cobrix can generate it for you
val df = spark
.read
.format("cobol")
.option("copybook", "/path/to/copybook.cpy")
.option("is_record_sequence", "true")
.option("segment_field", "SEGMENT-ID")
.option("segment_id_level0", "C")
.option("segment_id_prefix", "ID")
.load("examples/multisegment_data")
No COMPANY-ID
Id Generation
27/40
01 COMPANY-DETAILS.
05 SEGMENT-ID PIC X(5).
05 COMPANY.
10 NAME PIC X(15).
10 ADDRESS PIC X(25).
10 REG-NUM PIC 9(8) COMP.
05 CONTACT REDEFINES COMPANY.
10 PHONE-NUMBER PIC X(17).
10 CONTACT-PERSON PIC X(28).
• Seg0_Id can be used to restore
parent-child relationship
between segments
No COMPANY-ID
SEGMENT_ID Seg0_Id COMPANY CONTACT
C ID_0_0 [ ABCD Ltd. ] [ invalid ]
P ID_0_0 [ invalid ] [ Cliff Wallingford ]
C ID_0_2 [ DEFG Ltd. ] [ invalid ]
P ID_0_2 [ invalid ] [ Beatrice Gagliano ]
C ID_0_4 [ Robotrd Inc. ] [ invalid ]
P ID_0_4 [ invalid ] [ Doretha Wallingford ]
P ID_0_4 [ invalid ] [ Deshawn Benally ]
Id Generation
28/40
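A sketch of how the generated Seg0_Id can then take the place of COMPANY_ID when re-linking the segments (column names as in the table above; variable names are illustrative):

// Split the segments and join them back on the generated Seg0_Id.
val dfCompaniesGen = df.filter($"SEGMENT_ID" === "C")
  .select($"Seg0_Id", $"COMPANY.NAME".as("COMPANY_NAME"), $"COMPANY.ADDRESS", $"COMPANY.REG_NUM")

val dfContactsGen = df.filter($"SEGMENT_ID" === "P")
  .select($"Seg0_Id", $"CONTACT.CONTACT_PERSON", $"CONTACT.PHONE_NUMBER")

val dfLinked = dfCompaniesGen.join(dfContactsGen, Seq("Seg0_Id"))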
• When transferred from a mainframe, a hierarchical database becomes
• A sequence of records
• To read the next record, the previous record must be read first
• A sequential format by its nature
• How to make it scalable?
A * N J O H N G A 3 2 S H K D K S I A S S A S K A S A L , S D F O O . C O M X L Q O K ( G A } S N B W E S < N J
X I C W L D H J P A S B C + 2 3 1 1 - 3 2 7 C = D 1 2 0 0 0 F H 0 D . K A I O D A P D F C J S C D C D C W E P 1
9 1 2 3 – 3 1 2 2 1 . 3 1 F A D F L 1 7
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
COMPANY
ID: █ █ █ █ █ █ █ █ █
Name: █ █ █ █ █ █ █ █ █ █ █ █
Address: █ █ █ █ █ █ █ █ █
PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
A data file
PERSON
Name: J A N E █ █ █ █
R O B E R T S █ █
Phone #: + 9 3 5 2 8 0
PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
PERSON
Name: █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
Phone #: █ █ █ █ █ █ █
COMPANY
ID: 2 8 0 0 0 3 9 4 1
Name: E x a m p l e . c o m █
Address: 1 0 █ G a r d e n
Variable Length Records (VLR)
29/40
Performance challenge of VLRs
• Naturally sequential files
• To read the next record, the prior record needs to be read first
• Each record has a length field
• It acts as a pointer to the next record
• There is no record delimiter, so a file cannot be read from the middle
VLR structure
30/40
[Chart: throughput (MB/s) vs. number of Spark cores, variable record length — sequential processing stays flat at about 10 MB/s]
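To make the sequential nature concrete, here is a minimal single-threaded reader sketch, assuming each record is prefixed by a 4-byte length header (the real RDW/BDW layouts differ in detail):

import java.io.{BufferedInputStream, DataInputStream, FileInputStream}

// Offsets are only known after reading every preceding record,
// which forces a single sequential pass over the file.
def readRecords(path: String): Iterator[Array[Byte]] = {
  val in = new DataInputStream(new BufferedInputStream(new FileInputStream(path)))
  Iterator
    .continually {
      if (in.available() <= 0) { in.close(); None }
      else {
        val length = in.readInt()            // the length field points to the next record
        val record = new Array[Byte](length)
        in.readFully(record)
        Some(record)
      }
    }
    .takeWhile(_.isDefined)
    .map(_.get)
}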
Spark on HDFS
[Diagram: the Spark driver and the HDFS namenode coordinate several data nodes; each data node hosts HDFS blocks and a Spark executor that maps those blocks to partitions]
31/40
[Diagram: HDFS data nodes 1–3 and the Spark cluster]
1. Cobrix reads the record headers, producing a list of (offset, length) pairs
2. The offsets and lengths are parallelized across the Spark cluster
3. Spark parses the records in parallel from the parallelized offsets and lengths
Parallelizing Sequential Reads
32/40
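A rough sketch of the idea follows; readHeaders and readRecordAt are hypothetical helpers standing in for the header scan and the per-record read, and the HDFS path is illustrative:

// 1. One quick sequential pass over the headers builds the sparse index of (offset, length) pairs.
val index: Seq[(Long, Int)] = readHeaders("hdfs:///data/vlr_file")   // hypothetical helper

// 2. Parallelizing the index lets each Spark task own a slice of the file.
val records = spark.sparkContext
  .parallelize(index)
  .map { case (offset, length) => readRecordAt("hdfs:///data/vlr_file", offset, length) }  // hypothetical helper

// 3. Each binary record is then parsed with the copybook schema into Spark rows.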
[Diagram: HDFS data nodes 1–3, the namenode, Cobrix, and the Spark cluster]
1. Cobrix reads the header of VLR 1: offset = 3000, length = 100
2. It asks the namenode where offsets 3000 to 3100 are stored
3. The namenode answers: nodes 3, 18 and 41
4. Spark loads VLR 1 with node 3 as the preferred location
5. The task is launched on an executor hosted on node 3
Enabling Data Locality for VLRs
33/40
Throughput when sparse indexes are used
• Experiments were run on our lab cluster
• 4 nodes
• 380 cores
• 10 Gbit network
• Scalable – for bigger files, throughput keeps increasing as more executors are used
34/40
[Chart: throughput (MB/s) vs. number of Spark cores for 10 GB, 20 GB and 40 GB files read with sparse indexes, compared to sequential reading]
Comparison versus fixed length record performance
● Distribution and locality is
handled completely by Spark
● Parallelism is achieved using
sparse indexes
35/40
[Charts: throughput (MB/s) vs. number of Spark cores — fixed-length records reach about 150 MB/s on a 40 GB file, while variable-length records with sparse indexes reach about 145 MB/s, versus about 10 MB/s for sequential reading]
Cobrix in ABSA Data Infrastructure - Batch
[Diagram: 1. 180+ mainframe sources are ingested into HDFS, 2. Cobrix parses the data with Spark, 3. Enceladus conforms it, 4. Spline tracks lineage, 5. the conformed data is re-ingested into HDFS, 6. consumers read it]
36/40
Cobrix in ABSA Data Infrastructure - Stream
[Diagram: 1. 180+ mainframe sources are ingested, 2. the Cobrix parser decodes the records, 3. ABRiS converts them to Avro and publishes to Kafka, 4. Enceladus conforms the data, 5. Spline tracks lineage, 6. consumers read from Kafka]
37/40
Cobrix in ABSA Data Infrastructure
[Diagram: both paths combined — Batch: 1. Ingest into HDFS, 2. Parse with Cobrix on Spark, 3. Conform with Enceladus, 4. Track lineage with Spline, 5. Re-ingest, 6. Consume; Stream: 1. Ingest, 2. Parse with the Cobrix parser, 3. Convert to Avro with ABRiS and publish to Kafka, 4. Conform, 5. Track lineage, 6. Consume]
38/40
● Thanks to the following people, who made this project possible and helped all along the way:
○ Andrew Baker, Francois Cillers, Adam Smyczek,
Jan Scherbaum, Peter Moon, Clifford Lategan,
Rekha Gorantla, Mohit Suryavanshi, Niel Steyn
• Thanks to the authors of the original COBOL parser:
○ Ian De Beer, Rikus de Milander
(https://github.com/zenaptix-lab/copybookStreams)
Acknowledgment
39/40
• Let's combine expertise to make accessing mainframe data in Hadoop seamless
• Our goal is to support the widest range of use cases possible
• Report a bug
• Request a new feature
• Create a pull request
Our home: https://github.com/AbsaOSS/cobrix
Your Contribution is Welcome
40/40