Trillium Software System:
New Features and
Big Data Matching
Paige Roberts, Product Marketing Manager
Steve Shissler, Director, Sales Engineering
Agenda
1 Syncsort
2 New Features in TSS
3 Big Data Matching Principles
4 Big Data Matching Case Study
5 Demo
6 Questions
Who is
Syncsort?
>7,000 customers
84 of the Fortune 100
Customers in >100 countries
Headquarters: Pearl River, NY
U.S. LOCATIONS
• Burlington, MA; Irvine, CA;
Oakbrook Terrace, IL; Rochester, MN
GLOBAL PRESENCE
• U.K., France, Germany, Netherlands,
Israel, Hong Kong & Japan
Big Iron to Big Data is a fast-growing
market segment composed of solutions
that optimize traditional data systems
and deliver mission-critical data from
these systems to next-generation
analytic environments.
Global leader in
Big Iron to Big Data
Syncsort’s Trillium Software System:
New Features
Collibra Integration
Collibra can define and manage data quality
rules, but cannot enforce the rules on the
data or measure compliance with them.
Goal:
• Make data accessible, traceable and
meaningful to business users.
• Automatically pass Collibra rules into Trillium
Discovery and get rule compliance data passed
back to Collibra
Requirements:
• Bi-directional near real-time integration
between Trillium Discovery and Collibra DGC
for quality measurement and monitoring
• Trillium business rule analysis results / data
quality metrics shown in Collibra dashboards.
• Data Stewards can quickly identify issues and
take corrective action when data quality
standards are not met.
Closing the Loop
Collibra Data Governance Center
• Enables non-technical users to define
business policies and data quality rules
in plain language
• Makes data quality performance
available to all users
Trillium Discovery
• Imports Collibra DGC business rules so technical users
can convert them to executable data quality rules
• Continuously runs data quality metrics on a near
real-time basis and passes results back to
Collibra dashboards
Bi-directional connectivity with constant sync: rulebooks become
rules in one direction; quality test results flow back in the other.
A metric falling below its threshold can trigger a
case in Collibra Issue Management.
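To make the closed loop concrete, here is a minimal Python sketch of the threshold-to-issue hand-off. Everything in it, the endpoint URL, payload fields, and threshold, is hypothetical; the actual integration is configured between Trillium Discovery and Collibra DGC, not hand-coded like this.

import requests

THRESHOLD = 0.95  # assumed pass-rate threshold; set per rule in practice

# Hypothetical metric as it might arrive from a Trillium Discovery rule run
metric = {"rule": "customer_email_populated", "pass_rate": 0.91}

if metric["pass_rate"] < THRESHOLD:
    # Placeholder URL and payload: not the real Collibra Issue Management API
    requests.post(
        "https://collibra.example.com/issues",
        json={
            "summary": f"Data quality rule '{metric['rule']}' below threshold",
            "passRate": metric["pass_rate"],
        },
    )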
Trillium Quality for Big Data
Trillium Quality =
Best-of-breed data quality
solution.
Leader in Gartner Data
Quality Tools MQ 12 years
running.
Intelligent Execution =
Artificially intelligent
dynamic performance
optimizer for cluster
execution in MapReduce,
Amazon EMR, or Spark.
Trillium Quality +
Intelligent Execution =
High performance
industry-leading data
quality on Big Data and
Cloud platforms.
• Build data quality processes that
ensure high-quality data that
meets such key business needs as:
o Single customer view (SCV)
o Standardized product data
o Standardization for fraud detection
Trillium Quality – Powerful Data Cleansing
• Consolidate data sources on input
• Match on party, household, business, etc.
• Develop workflows to transform, parse,
standardize, match and survive best record
• Manage “householding” issues associated with
multiple physical addresses under a single account
KEY FUNCTIONALITY:
• Global address validation with individual country postal rules
• Enrich missing postal information, latitude/longitude and other reference data
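As a flavor of what standardization involves, here is a toy Python sketch of one narrow piece, address-abbreviation expansion. It is illustrative only and assumes a tiny lookup table; Trillium's actual rules cover country-specific postal conventions, reference-data enrichment, and much more.

# Toy standardization pass; ABBREVIATIONS is an assumed, minimal lookup table
ABBREVIATIONS = {"dr": "Drive", "st": "Street", "rd": "Road", "ln": "Lane"}

def standardize_address(line: str) -> str:
    words = line.replace(".", "").split()
    # Note the ambiguity a real engine must resolve: "DR" means "Drive" in an
    # address line but "Doctor" in a name line (see the Bob Smith example later)
    return " ".join(ABBREVIATIONS.get(w.lower(), w.title()) for w in words)

print(standardize_address("3 DAVY DR"))  # -> 3 Davy Drive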
Design Once, Deploy Anywhere
Intelligent Execution insulates your organization from the underlying complexities of Hadoop.
Get excellent performance every time
without tuning, load balancing, etc.
No re-design, no re-compile, no re-work, ever
• Future-proof job designs for emerging
compute frameworks, e.g. Spark 2.x
• Move from dev to test to production
• Move from on-premise to Cloud
• Move from one Cloud to another
Use existing ETL skills
No parallel programming – Java, MapReduce, Spark …
No worries about:
• Mappers, Reducers
• Big side or small side of joins …
Design Once
in visual GUI
Deploy Anywhere!
On-Premise,
Cloud
MapReduce, Spark,
Future Platforms
Windows, Unix,
Linux
Batch,
Streaming
Single Node,
Cluster
Trillium Quality for Big Data
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
Two Ways to Get Postal Updates
Trillium Postal Download Web Service
Trillium Postal Download Web Service is an
automated download service introduced in
TSS v15.7. The download service allows you
to check the status of your postal license and
download the postal directories from a
browser-based application.
TSS Download Center (File Portal) FTP website
TSS Download Center allows you to manually download
postal directories through Trillium Software’s secure
website. See the Trillium Software System Installation
Guide for procedures on downloading postal directories
through this website.
And more …
• Trillium Discovery REST APIs installed with TSS
server, documentation in Help file for easy
integration with other applications like ASG Data
Intelligence
• Unique ID (UUID) Function
• Trillium Language Pack Locale Setting
• Apache Tomcat Upgrade to v8.5.32
• Australian (AU) Postal Directories and AU Postal
Matcher changes in accordance with Australia Post
licensing terms
• And more …
Example:
German locale setting in config.txt
key rest_api {
  value locale "de"
}
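On the REST APIs mentioned above: the endpoints are documented in the Help file installed with the TSS server. Purely to illustrate the call shape, here is a hypothetical Python example; the URL, path, and credentials are invented, not the real Trillium Discovery API.

import requests

# Hypothetical example only: consult the installed Help file for the real
# Trillium Discovery REST endpoints and authentication scheme
resp = requests.get(
    "https://tss-server.example.com/rest/rules",  # placeholder URL
    auth=("tss_user", "tss_password"),            # placeholder credentials
)
resp.raise_for_status()
for rule in resp.json():
    print(rule)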
Big Data Matching
Finding Similar Needles in a Really Big Haystack
Nobody wants a data swamp instead of a data lake!
“This sure looked a lot nicer on the
whiteboard…”
Only 35% of senior
executives have a high
level of trust in the
accuracy of their Big
Data Analytics
92% of executives are
concerned about the
negative impact of data
and analytics on
corporate reputation
Cost of poor data quality
rose by 50% in 2017
(Gartner)
84% of CEOs
are concerned about
the quality of the data
they’re basing
decisions on
The importance of data
quality in the enterprise:
• Decision making – Trust the data
that drives your business
• Customer centricity – Get a
single, complete and accurate
view of your customer for better
sales, marketing and customer
service
• Compliance – Know your data,
and ensure its accuracy to meet
industry and government
regulations
• Machine learning & AI – Train
your models on accurate data
The Data Lake
Needs Data
Quality
The magic of machine learning is that you build a
statistical model based on the most valid dataset for
the domain of interest.
If the data is junk, then you’ll be building a junk
model that will not be able to do its job.
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Common Machine Learning Applications
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
De-Bugging Your Data
Incorrect, Incomplete, Mis-Formatted “Dirty Data” –
Mistakes and errors are almost never the patterns you’re
looking for in a data set. Correcting and standardizing will
tend to boost the signal.
Multiple copies – If your data comes from many sources, as it
often does, it may contain multiple records of information
about the same person, company, product or other entity.
Removing duplicates and enhancing the overall depth and
accuracy of knowledge about a single entity can make a huge
difference.
Enrichment – Enriching data with other data sets, such as
geospatial, demographics, or firmographics data can provide
new depths of analysis. For example, adding latitude and
longitude may enable identification of geospatial patterns.
Correcting data problems vastly increases a data set’s usefulness for machine learning.
Traditional data quality processes are
an effective method to remove these defects.
However, traditional data quality software
is designed to work on smaller data sets, not at Hadoop scale.
Data Quality Challenges of Enabling Machine Learning
1. Data Cleansing at Scale
• Data quality cleansing and preparation routines have to be reproduced at scale, both to get the data ready to train
machine learning models, and to comply with business regulations.
• Other data quality tools are not designed to work on that scale of data.
• Programming data cleansing workflows from scratch in Java MapReduce or Scala for Spark requires specialized skills
and takes at least twice as long as designing the same workflows in graphical point and click tools.
• Tuning those MapReduce or Spark workflows to get decent performance on a cluster takes even longer, and will
have to be re-done if the job is moved to a bigger or smaller cluster, or from an on-premise data center to the Cloud.
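For scale, here is a minimal PySpark sketch of the kind of hand-written cleansing the bullets above contrast with visual design tools. The file paths and column names are assumptions for illustration; a production workflow would add far more rules, plus the cluster tuning work described above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-sketch").getOrCreate()

# Assumed input layout: name and postal_code columns in a CSV extract
df = spark.read.csv("hdfs:///data/customers.csv", header=True)

cleaned = (
    df.withColumn("name", F.upper(F.trim(F.col("name"))))                 # case/whitespace noise
      .withColumn("postal_code", F.regexp_replace("postal_code", r"\s+", ""))
      .dropDuplicates(["name", "postal_code"])                            # crude de-duplication
)
cleaned.write.mode("overwrite").parquet("hdfs:///data/customers_clean")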
“If your data is bad, your machine
learning tools are useless.”
– Harvard Business Review, 2018
“Garbage in, garbage out.”
– Anonymous computer scientist, 1957
Common Data Quality Problems
• Many data records with different
layouts
• Lack of standardization of the
different fields
• Misspellings
• Data sourced from third parties does
not contain all the necessary fields
• Inconsistent data formats
(measurements, languages, postal
conventions and dates)
• Names spelled differently
• Different number formatting
Data Quality is Critical for GDPR Compliance
“But I have a lot of data…” is not an excuse for non-compliance.
To comply with GDPR, companies must know the
answers to the following questions:
• What do we know about a given customer?
• Where is our customer data?
• Is our customer contact information current?
• How are we processing customer data?
And they must supply those answers in the form of business
processes that provide evidence of compliance.
Data Quality Challenges of Enabling Machine Learning
1. Data Cleansing at Scale (as described above)
2. Entity Resolution
• Distinguishing matches that relate to a single specific entity (a person, a company, a part, etc.) requires sophisticated
multi-field matching algorithms
• Distinguishing matches across massive datasets requires a lot of compute power. Essentially everything has to be
compared to everything else, multiple times in multiple ways.
• Other data quality tools cannot find and combine records of the same entity at that scale.
Entity Resolution at Scale
I have billions of records. How do I identify the same entity?
Are these two businesses owned by the same person?
Are these two accounts in the same building?
Is that you, Bob?
Five touches of the same person across different systems
(Name / Address1 / Address2 / City / Postal Code / Phone / Email):
• Customer Service: ROB SMITH / 3 DAVY DRIVE / – / – / S66 7EN / 01189407600 / bob.smith@hotmail.com
• Web Login: Dr Bob Smith / – / – / – / – / – / bob.smith@hotmail.com
• Transfer: Mr Robert Smith / 3 Davey Drive / # 16 / Rotherham / S667EN / 01189 407 600 / –
• Purchase: Bob Smith DR / 3 Davy Dr #16 / – / Rotherham / S667EN / 01189 407 600 / –
• ATM Transaction: Dr. B. Smith / 3 Davy Dryve 16 / – / MALtby / S66 7EN / 01189 407 600 / bsmith@gmail.com
Matching capabilities:
• Exact match + 36 different fuzzy matching
comparison algorithms
• Weighted decision trees
• Match scoring for confidence thresholds
• Multi-field matching, multi-pass and array
matching
• Transitive matching with multiple
different match criteria:
A=B and B=C, therefore A=C
• High-performance everything-to-everything
comparison across any cluster in MapReduce
or Spark
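To make “transitive matching” and “everything-to-everything comparison” concrete, here is a toy single-machine Python sketch: one crude string-similarity test standing in for Trillium's 36 algorithms, plus union-find to group records transitively. The sample records and the 0.85 threshold are assumptions for illustration; at billions of records, this pairwise loop is exactly the compute problem the slide describes, which is why the real comparison runs in MapReduce or Spark.

from collections import defaultdict
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    # One crude fuzzy comparison; a real engine combines many specialized ones
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Union-find: if A matches B and B matches C, then A, B, and C form one group
parent: dict[str, str] = {}

def find(x: str) -> str:
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    parent[find(a)] = find(b)

records = ["Dr Bob Smith", "Bob Smith DR", "Mr Robert Smith", "Dr. B. Smith"]
for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:          # O(n^2) pairs: the scale problem
        if similar(r1, r2):
            union(r1, r2)

groups = defaultdict(list)
for r in records:
    groups[find(r)].append(r)
print(dict(groups))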
Anti-Money Laundering on Hadoop at Global Bank
Challenge: Meet AML transaction monitoring and Financial Conduct
Authority (FCA) compliance demands
• Data too large and too widely scattered to analyze
• Disparate data sources – Mainframe, RDBMS, Cloud, etc.
Requirements:
• Consolidate, clean, and verify data for all analytics and
reporting.
• MUST be secure: Kerberos and LDAP integration
required
• Need unmodified copy of
mainframe data stored on
Hadoop for backup, and
compliance archive
• MUST have complete, detailed data
lineage from origin to end point
Impact of Entity Resolution
Anti-Money Laundering on Hadoop at Global Bank
The bank must monitor transactions to detect money
laundering for FCA compliance. Machine learning can
detect the patterns, but it requires large amounts of
current, clean data.
Requirements recap:
• Massive data volumes
• Scattered data – Mainframe,
RDBMS, Cloud, …
• Must be secure – Kerberos,
LDAP
• Must have lineage – data
origin to end point
• Must archive unaltered
mainframe data
Solution:
• Syncsort DMX-h
• Syncsort’s Trillium Quality for Big Data
• Syncsort DMX Change Data Capture
• Hortonworks HDP
• Cluster-native data
verification, enrichment, and
demanding multi-field entity
resolution on Spark
• Full end-to-end data lineage
supplied to Apache Atlas and
ASG Data Intelligence
• Unmodified mainframe
“Golden Records” stored on
Hadoop
Result: full Anti-Money Laundering
regulatory compliance with a
financial crimes data lake –
high-performance results at massive scale.
For want of a nail, the kingdom was lost.
For want of a data cleansing and integration tool,
the whole AI superstructure can fall down.
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
Demo: Big Data Matching
With Trillium Quality for Big Data
Trillium Quality for Big Data – Data Cleansing at Scale
Boost the effectiveness of machine learning and AI with complete, standardized data.
1. Visually create and test data
quality processes locally
2. Execute in MapReduce or Spark,
on premise or in the Cloud
Identity management
Name Address City State Zip DOB
Nicholas Saunders 22 Shady Lane Mystic CT 06355 04/12/1971
N.M Saunders Jnr Crooked Trail Trenton NJ 08604 12/04/1971
Nick Saunders 22 Shady Street Mystic CT 06355 12/04/1971
Saunders, Nicholas M. 22 Shady Lane Mystic CT 06355 n/a
Nicholas Sanders Crooked Road Trenton NJ 08604 04/12/1971
Nicholas Saunders 22 Shady Street Mystic NJ 08604 12/04/1971
CUSTOMERS VENDORS ACCOUNTS
360º View
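Note the DOB column above: 04/12/1971 and 12/04/1971 may be the same birthday written under US and UK conventions. A quick Python illustration of that ambiguity, using a date taken from the table:

from datetime import datetime

raw = "04/12/1971"  # as it appears in the first row above
us = datetime.strptime(raw, "%m/%d/%Y")  # April 12, 1971
uk = datetime.strptime(raw, "%d/%m/%Y")  # 4 December 1971
print(us.date(), uk.date())  # 1971-04-12 1971-12-04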
Questions?
Editor's Notes

  • #7: For Collibra users: We are the only data quality solution with out-of-the-box bi-directional integration with Collibra Governance Center to give you “closed loop” data governance. If Trillium Discovery metrics fall below thresholds, customers can configure a case to be triggered in Collibra Issue Management. Data stewards are alerted, enabling them to take corrective action.
  • #10: Intelligent execution – artificially intelligent dynamic performance optimizer: Visually design your jobs once, and deploy them anywhere – MapReduce, Spark, Linux, Unix, Windows – on premise or in the cloud. No changes or tuning required. Easily move applications from standalone server environments, from MapReduce to Spark, from on premise to cloud – as easy as clicking on a drop-down menu. Future-proof job designs for emerging compute frameworks. Avoid tuning – Intelligent Execution dynamically plans for applications at run-time based on the chosen compute framework. Insulate your users from the underlying complexities of Hadoop and use existing data quality skills. Cut development time in half.
  • #15: Traditional data quality software is not designed to work at Hadoop scale.
  • #16: https://www.zdnet.com/article/most-executives-dont-trust-their-organizations-data-analytics-and-ai/
  • #17: Data is the new source code for AI.
  • #28: Match scoring for confidence thresholds – in a user-friendly scoring map that you can easily tune. Multi-pass matching for different combinations of fields. Array matching – cross-check multi-word or multi-field information; for example, “3 Davy Dr #16” all in Address1 compared to “3 Davey Drive” in Address1 and “#16” in Address2. Even without intentionally trying to conceal identity, it can be difficult to resolve a single person or business from multiple touches across multiple data systems, each with its own data quality issues. Without good entity resolution, money laundering is much easier to get away with. You could hide who you are from a computer as easily as calling yourself Dr. Robert Smith in one place and Bob Smith in another. Data cleansing and standardization at scale, the previous step, will significantly increase the number of matches found, but doing an everything-to-everything comparison across a cluster is still a big challenge. Data scientists should be focused on perfecting anti-money-laundering models, not on perfecting windowing functions in Spark for Levenshtein-distance matching on a cluster. Examples of multi-field matching: Name + email; Name + phone; Name + physical address; Email + phone. Multi-pass matching means you go over the data multiple times comparing different combinations of fields. Fuzzy matching algorithm examples: keystroke distance, Levenshtein distance, distance comparison of geo-locations, and specialized date, name, and street comparison algorithms.
  • #29: The Financial Conduct Authority (FCA) is a financial regulatory body in the United Kingdom, but operates independently of the UK Government and is financed by charging fees to members of the financial services industry. The FCA regulates financial firms providing services to consumers and maintains the integrity of the financial markets in the United Kingdom.
  • #30: Overall, a good entity resolution solution makes AML teams 81% more productive