Trillium Software System:
New Features and
Big Data Matching
Paige Roberts, Product Marketing Manager
Steve Shissler, Director, Sales Engineering
Agenda
1 Syncsort
2 New Features in TSS
3 Big Data Matching Principles
4 Big Data Matching Case Study
5 Demo
6 Questions
Who is
Syncsort?
>7,000 customers
84 of the Fortune 100
Customers in >100 countries
Headquarters: Pearl River, NY
U.S. LOCATIONS
• Burlington, MA; Irvine, CA;
Oakbrook Terrace, IL; Rochester, MN
GLOBAL PRESENCE
• U.K., France, Germany, Netherlands,
Israel, Hong Kong & Japan
Big Iron to Big Data is a fast-growing
market segment composed of solutions
that optimize traditional data systems
and deliver mission-critical data from
these systems to next-generation
analytic environments.
Global leader in
Big Iron to Big Data
Syncsort’s Trillium Software System:
New Features
Collibra Integration
Collibra can define and manage data quality
rules, but cannot enforce the rules on the
data or measure compliance with them.
Goal:
• Make data accessible, traceable and
meaningful to business users.
• Automatically pass Collibra rules into Trillium
Discovery and get rule compliance data passed
back to Collibra
Requirements:
• Bi-directional near real-time integration
between Trillium Discovery and Collibra DGC
for quality measurement and monitoring
• Trillium business rule analysis results / data
quality metrics shown in Collibra dashboards.
• Data Stewards can quickly identify issues and
take corrective action when data quality
standards are not met.
Closing the Loop
Collibra Data Governance Center
• Enables non-technical users to define
business policies and data quality rules
in plain language
• Makes data quality performance
available to all users
Trillium Discovery
• Imports Collibra DGC business rules so technical users
can convert them to executable data quality rules
• Continuously runs data quality metrics on a near
real-time basis and passes results back to
Collibra dashboards
Bi-directional connectivity with constant sync: rulebooks become
rules in one direction; quality test results flow back in the other.
A metric falling below its threshold can trigger a
case in Collibra Issue Management.
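To make the closed loop concrete, here is a minimal Python sketch of the threshold-to-issue hand-off. Everything in it, the endpoint URL, payload fields, and threshold, is hypothetical; the actual integration is configured between Trillium Discovery and Collibra DGC, not hand-coded like this.

import requests

THRESHOLD = 0.95  # assumed pass-rate threshold; set per rule in practice

# Hypothetical metric as it might arrive from a Trillium Discovery rule run
metric = {"rule": "customer_email_populated", "pass_rate": 0.91}

if metric["pass_rate"] < THRESHOLD:
    # Placeholder URL and payload: not the real Collibra Issue Management API
    requests.post(
        "https://collibra.example.com/issues",
        json={
            "summary": f"Data quality rule '{metric['rule']}' below threshold",
            "passRate": metric["pass_rate"],
        },
    )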
Trillium Quality for Big Data
Trillium Quality =
Best-of-breed data quality
solution.
Leader in Gartner Data
Quality Tools MQ 12 years
running.
Intelligent Execution =
Artificially intelligent
dynamic performance
optimizer for cluster
execution in MapReduce,
Amazon EMR, or Spark.
Trillium Quality +
Intelligent Execution =
High performance
industry-leading data
quality on Big Data and
Cloud platforms.
• Build data quality processes that
ensure high-quality data that
meets such key business needs as:
o Single customer view (SCV)
o Standardized product data
o Standardization for fraud detection
Trillium Quality – Powerful Data Cleansing
• Consolidate data sources on input
• Match on party, household, business, etc.
• Develop workflows to transform, parse,
standardize, match and survive best record
• Manage “householding” issues associated with
multiple physical addresses under a single account
KEY FUNCTIONALITY:
• Global address validation with individual country postal rules
• Enrich missing postal information, latitude/longitude and other reference data
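As a flavor of what standardization involves, here is a toy Python sketch of one narrow piece, address-abbreviation expansion. It is illustrative only and assumes a tiny lookup table; Trillium's actual rules cover country-specific postal conventions, reference-data enrichment, and much more.

# Toy standardization pass; ABBREVIATIONS is an assumed, minimal lookup table
ABBREVIATIONS = {"dr": "Drive", "st": "Street", "rd": "Road", "ln": "Lane"}

def standardize_address(line: str) -> str:
    words = line.replace(".", "").split()
    # Note the ambiguity a real engine must resolve: "DR" means "Drive" in an
    # address line but "Doctor" in a name line (see the Bob Smith example later)
    return " ".join(ABBREVIATIONS.get(w.lower(), w.title()) for w in words)

print(standardize_address("3 DAVY DR"))  # -> 3 Davy Drive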
Design Once, Deploy Anywhere
Intelligent Execution insulates your organization from the underlying complexities of Hadoop.
Get excellent performance every time
without tuning, load balancing, etc.
No re-design, no re-compile, no re-work, ever
• Future-proof job designs for emerging
compute frameworks, e.g. Spark 2.x
• Move from dev to test to production
• Move from on-premise to Cloud
• Move from one Cloud to another
Use existing ETL skills
No parallel programming – Java, MapReduce, Spark …
No worries about:
• Mappers, Reducers
• Big side or small side of joins …
Design Once
in visual GUI
Deploy Anywhere!
On-Premise,
Cloud
MapReduce, Spark,
Future Platforms
Windows, Unix,
Linux
Batch,
Streaming
Single Node,
Cluster
Trillium Quality for Big Data
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
Two Ways to Get Postal Updates
Trillium Postal Download Web Service
Trillium Postal Download Web Service is an
automated download service introduced in
TSS v15.7. The download service allows you
to check the status of your postal license and
download the postal directories from a
browser-based application.
TSS Download Center (File Portal) FTP website
TSS Download Center allows you to manually download
postal directories through Trillium Software’s secure
website. See the Trillium Software System Installation
Guide for procedures on downloading postal directories
through this website.
And more …
• Trillium Discovery REST APIs installed with TSS
server, documentation in Help file for easy
integration with other applications like ASG Data
Intelligence
• Unique ID (UUID) Function
• Trillium Language Pack Locale Setting
• Apache Tomcat Upgrade to v8.5.32
• Australian (AU) Postal Directories and AU Postal
Matcher changes in accordance with Australia Post
licensing terms
• And more …
Example:
German locale setting in config.txt
key rest_api {
  value locale "de"
}
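On the REST APIs mentioned above: the endpoints are documented in the Help file installed with the TSS server. Purely to illustrate the call shape, here is a hypothetical Python example; the URL, path, and credentials are invented, not the real Trillium Discovery API.

import requests

# Hypothetical example only: consult the installed Help file for the real
# Trillium Discovery REST endpoints and authentication scheme
resp = requests.get(
    "https://tss-server.example.com/rest/rules",  # placeholder URL
    auth=("tss_user", "tss_password"),            # placeholder credentials
)
resp.raise_for_status()
for rule in resp.json():
    print(rule)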
Big Data Matching
Finding Similar Needles in a Really Big Haystack
Nobody wants a data swamp instead of a data lake!
“This sure looked a lot nicer on the
whiteboard…”
Only 35% of senior
executives have a high
level of trust in the
accuracy of their Big
Data Analytics
92% of executives are
concerned about the
negative impact of data
and analytics on
corporate reputation
Cost of poor data quality
rose by 50% in 2017
(Gartner)
84% of CEOs
are concerned about
the quality of the data
they’re basing
decisions on
The importance of data
quality in the enterprise:
• Decision making – Trust the data
that drives your business
• Customer centricity – Get a
single, complete and accurate
view of your customer for better
sales, marketing and customer
service
• Compliance – Know your data,
and ensure its accuracy to meet
industry and government
regulations
• Machine learning & AI – Train
your models on accurate data
The Data Lake
Needs Data
Quality
The magic of machine learning is that you build a
statistical model based on the most valid dataset for
the domain of interest.
If the data is junk, then you’ll be building a junk
model that will not be able to do its job.
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
De-Bugging Your Data
Incorrect, Incomplete, Mis-Formatted “Dirty Data” –
Mistakes and errors are almost never the patterns you’re
looking for in a data set. Correcting and standardizing will
tend to boost the signal.
Multiple copies – If your data comes from many sources, as it
often does, it may contain multiple records of information
about the same person, company, product or other entity.
Removing duplicates and enhancing the overall depth and
accuracy of knowledge about a single entity can make a huge
difference.
Enrichment – Enriching data with other data sets, such as
geospatial, demographics, or firmographics data can provide
new depths of analysis. For example, adding latitude and
longitude may enable identification of geospatial patterns.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
Traditional data quality processes are
an effective method to remove these defects.
However, traditional data quality software
is designed to work on smaller data sets, not at Hadoop scale.
Data Quality Challenges of Enabling Machine Learning
1. Data Cleansing at Scale
• Data quality cleansing and preparation routines have to be reproduced at scale, both to get the data ready to train
machine learning models, and to comply with business regulations.
• Other data quality tools are not designed to work on that scale of data.
• Programming data cleansing workflows from scratch in Java MapReduce or Scala for Spark requires specialized skills
and takes at least twice as long as designing the same workflows in graphical point and click tools.
• Tuning those MapReduce or Spark workflows to get decent performance on a cluster takes even longer, and will
have to be re-done if the job is moved to a bigger or smaller cluster, or from an on-premise data center to the Cloud.
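For scale, here is a minimal PySpark sketch of the kind of hand-written cleansing the bullets above contrast with visual design tools. The file paths and column names are assumptions for illustration; a production workflow would add far more rules, plus the cluster tuning work described above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-sketch").getOrCreate()

# Assumed input layout: name and postal_code columns in a CSV extract
df = spark.read.csv("hdfs:///data/customers.csv", header=True)

cleaned = (
    df.withColumn("name", F.upper(F.trim(F.col("name"))))                 # case/whitespace noise
      .withColumn("postal_code", F.regexp_replace("postal_code", r"\s+", ""))
      .dropDuplicates(["name", "postal_code"])                            # crude de-duplication
)
cleaned.write.mode("overwrite").parquet("hdfs:///data/customers_clean")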
“If your data is bad, your machine
learning tools are useless.”
– Harvard Business Review, 2018
“Garbage in, garbage out.”
– Anonymous computer scientist, 1957
Common Data Quality Problems
• Many data records with different
layouts
• Lack of standardization of the
different fields
• Misspellings
• Data sourced from third parties does
not contain all the necessary fields
• Inconsistent data formats
(measurements, languages, postal
conventions and dates)
• Names spelled differently
• Different number formatting
Data Quality is Critical for GDPR Compliance
“But I have a lot of data…” is not an excuse for non-compliance.
To comply with GDPR, companies must know the
answers to the following questions:
• What do we know about a given customer?
• Where is our customer data?
• Is our customer contact information current?
• How are we processing customer data?
And they must supply those answers in the form of business
processes that provide evidence of compliance.
Data Quality Challenges of Enabling Machine Learning
1. Data Cleansing at Scale (as described above)
2. Entity Resolution
• Distinguishing matches that relate to a single specific entity (a person, a company, a part, etc.) requires sophisticated
multi-field matching algorithms
• Distinguishing matches across massive datasets requires a lot of compute power. Essentially everything has to be
compared to everything else, multiple times in multiple ways.
• Other data quality tools cannot find and combine records of the same entity at that scale.
Entity Resolution at Scale
I have billions of records. How do I identify the same entity?
Are these two businesses owned by the same person?
Are these two accounts in the same building?
Is that you, Bob?
Five touches of the same person across different systems
(Name / Address1 / Address2 / City / Postal Code / Phone / Email):
• Customer Service: ROB SMITH / 3 DAVY DRIVE / – / – / S66 7EN / 01189407600 / bob.smith@hotmail.com
• Web Login: Dr Bob Smith / – / – / – / – / – / bob.smith@hotmail.com
• Transfer: Mr Robert Smith / 3 Davey Drive / # 16 / Rotherham / S667EN / 01189 407 600 / –
• Purchase: Bob Smith DR / 3 Davy Dr #16 / – / Rotherham / S667EN / 01189 407 600 / –
• ATM Transaction: Dr. B. Smith / 3 Davy Dryve 16 / – / MALtby / S66 7EN / 01189 407 600 / bsmith@gmail.com
Matching capabilities:
• Exact match + 36 different fuzzy matching
comparison algorithms
• Weighted decision trees
• Match scoring for confidence thresholds
• Multi-field matching, multi-pass and array
matching
• Transitive matching with multiple
different match criteria:
A=B and B=C, therefore A=C
• High-performance everything-to-everything
comparison across any cluster in MapReduce
or Spark
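To make “transitive matching” and “everything-to-everything comparison” concrete, here is a toy single-machine Python sketch: one crude string-similarity test standing in for Trillium's 36 algorithms, plus union-find to group records transitively. The sample records and the 0.85 threshold are assumptions for illustration; at billions of records, this pairwise loop is exactly the compute problem the slide describes, which is why the real comparison runs in MapReduce or Spark.

from collections import defaultdict
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    # One crude fuzzy comparison; a real engine combines many specialized ones
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Union-find: if A matches B and B matches C, then A, B, and C form one group
parent: dict[str, str] = {}

def find(x: str) -> str:
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    parent[find(a)] = find(b)

records = ["Dr Bob Smith", "Bob Smith DR", "Mr Robert Smith", "Dr. B. Smith"]
for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:          # O(n^2) pairs: the scale problem
        if similar(r1, r2):
            union(r1, r2)

groups = defaultdict(list)
for r in records:
    groups[find(r)].append(r)
print(dict(groups))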
Anti-Money Laundering on Hadoop at Global Bank
Challenge: Meet AML transaction monitoring and Financial Conduct
Authority (FCA) compliance demands
• Data too large and too widely scattered to analyze
• Disparate data sources – Mainframe, RDBMS, Cloud, etc.
Requirements:
• Consolidate, clean, and verify data for all analytics and
reporting.
• MUST be secure: Kerberos and LDAP integration
required
• Need unmodified copy of
mainframe data stored on
Hadoop for backup, and
compliance archive
• MUST have complete, detailed data
lineage from origin to end point
Impact of Entity Resolution
Anti-Money Laundering on Hadoop at Global Bank
The bank must monitor transactions to detect money
laundering for FCA compliance. Machine learning can
detect the patterns, but it requires large amounts of
current, clean data.
Requirements recap:
• Massive data volumes
• Scattered data – Mainframe,
RDBMS, Cloud, …
• Must be secure – Kerberos,
LDAP
• Must have lineage – data
origin to end point
• Must archive unaltered
mainframe data
Solution:
• Syncsort DMX-h
• Syncsort’s Trillium Quality for Big Data
• Syncsort DMX Change Data Capture
• Hortonworks HDP
• Cluster-native data
verification, enrichment, and
demanding multi-field entity
resolution on Spark
• Full end-to-end data lineage
supplied to Apache Atlas and
ASG Data Intelligence
• Unmodified mainframe
“Golden Records” stored on
Hadoop
Result: full Anti-Money Laundering
regulatory compliance with a
financial crimes data lake –
high-performance results at massive scale.
For want of a nail, the kingdom was lost.
For want of a data cleansing and integration tool,
the whole AI superstructure can fall down.
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Demo: Big Data Matching
With Trillium Quality for Big Data
Trillium Quality for Big Data – Data Cleansing at Scale
Boost the effectiveness of machine learning and AI with complete, standardized data.
1. Visually create and test data
quality processes locally
2. Execute in MapReduce or Spark,
on premise or in the Cloud
Identity management
Name Address City State Zip DOB
Nicholas Saunders 22 Shady Lane Mystic CT 06355 04/12/1971
N.M Saunders Jnr Crooked Trail Trenton NJ 08604 12/04/1971
Nick Saunders 22 Shady Street Mystic CT 06355 12/04/1971
Saunders, Nicholas M. 22 Shady Lane Mystic CT 06355 n/a
Nicholas Sanders Crooked Road Trenton NJ 08604 04/12/1971
Nicholas Saunders 22 Shady Street Mystic NJ 08604 12/04/1971
CUSTOMERS VENDORS ACCOUNTS
360º View
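Note the DOB column above: 04/12/1971 and 12/04/1971 may be the same birthday written under US and UK conventions. A quick Python illustration of that ambiguity, using a date taken from the table:

from datetime import datetime

raw = "04/12/1971"  # as it appears in the first row above
us = datetime.strptime(raw, "%m/%d/%Y")  # April 12, 1971
uk = datetime.strptime(raw, "%d/%m/%Y")  # 4 December 1971
print(us.date(), uk.date())  # 1971-04-12 1971-12-04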
Questions?
Editor's Notes

  • #7: For Collibra users: We are the only data quality solution with out-of-the-box bi-directional integration with Collibra Governance Center to give you “closed loop” data governance. If Trillium Discovery metrics fall below thresholds, customers can configure a case to be triggered in Collibra Issue Management. Data stewards are alerted, enabling them to take corrective action.
  • #10: Intelligent execution – artificially intelligent dynamic performance optimizer: Visually design your jobs once, and deploy them anywhere – MapReduce, Spark, Linux, Unix, Windows – on premise or in the cloud. No changes or tuning required. Easily move applications from standalone server environments, from MapReduce to Spark, from on premise to cloud – as easy as clicking on a drop-down menu. Future-proof job designs for emerging compute frameworks. Avoid tuning – Intelligent Execution dynamically plans for applications at run-time based on the chosen compute framework. Insulate your users from the underlying complexities of Hadoop and use existing data quality skills. Cut development time in half.
  • #15: Traditional data quality software is not designed to work at Hadoop scale.
  • #16: https://www.zdnet.com/article/most-executives-dont-trust-their-organizations-data-analytics-and-ai/
  • #17: Data is the new source code for AI.
  • #28: Match scoring for confidence thresholds – in a user-friendly scoring map that you can easily tune. Multi-pass matching for different combinations of fields. Array matching – cross-check multi-word or multi-field information; for example, “3 Davy Dr #16” all in Address1 compared to “3 Davey Drive” in Address1 and “#16” in Address2. Even without intentionally trying to conceal identity, it can be difficult to resolve a single person or business from multiple touches across multiple data systems, each with its own data quality issues. Without good entity resolution, money laundering is much easier to get away with. You could hide who you are from a computer as easily as calling yourself Dr. Robert Smith in one place and Bob Smith in another. Data cleansing and standardization at scale, the previous step, will significantly increase the number of matches found, but doing an everything-to-everything comparison across a cluster is still a big challenge. Data scientists should be focused on perfecting anti-money-laundering models, not on perfecting windowing functions in Spark for Levenshtein-distance matching on a cluster. Examples of multi-field matching: Name + email; Name + phone; Name + physical address; Email + phone. Multi-pass matching means you go over the data multiple times comparing different combinations of fields. Fuzzy matching algorithm examples: keystroke distance, Levenshtein distance, distance comparison of geo-locations, and specialized date, name, and street comparison algorithms.
  • #29: The Financial Conduct Authority (FCA) is a financial regulatory body in the United Kingdom, but operates independently of the UK Government and is financed by charging fees to members of the financial services industry. The FCA regulates financial firms providing services to consumers and maintains the integrity of the financial markets in the United Kingdom.
  • #30: Overall, a good entity resolution solution makes AML teams 81% more productive