Introducing:
Trillium DQ for Big Data
Harald Smith, Director Product Marketing
Housekeeping
Webcast Audio
• Today’s webcast audio is streamed through your computer speakers.
• If you need technical assistance with the web interface or audio,
please reach out to us using the chat window.
Questions Welcome
• Submit your questions at any time during the presentation
using the chat window.
• We will answer them during our Q&A session following the
presentation.
Recording and slides
• This webcast is being recorded. You will receive an
email following the webcast with a link to download
both the recording and the slides.
Speaker
Harald Smith
• Director of Product Marketing, Syncsort
• 20+ years in Information Management with a focus
on data quality, integration, and governance
• Co-author of Patterns of Information Management
• Author of two Redbooks on Information Governance
and Data Integration
• Blog on InfoWorld: “Data Democratized”
Data challenges across the business
Business Leaders
Lack trust in data needed to
make rapid, accurate
decisions that grow business
Business Analysts
Can’t access or understand
data and spend excessive
time on investigating
Information Leaders
Must facilitate business
collaboration and data
transparency and governance
Chief Data Officers
Must make data a strategic
business asset using skills that
range from basic spreadsheet
knowledge to data science
Only 35% of senior
executives have a high
level of trust in the
accuracy of their Big
Data Analytics
92% of executives are
concerned about the
negative impact of data
and analytics on
corporate reputation
New survey indicates
nearly 80% of AI/ML
projects stalling due to
poor data quality
84% of CEOs
are concerned about
the quality of the data
they’re basing
decisions on
Big Data Needs
Data Quality
Data Quality Challenges of Big Data
Profiling Data
• Organizations are storing vast amounts of data from many different sources in
data lakes and the Cloud, but that data isn’t usable unless it is understood.
To understand it, the business users who work with the data must be able to
access and profile it without constant IT help
Matching Entities Accurately
• Distinguishing matches that indicate a single specific entity across so much data
requires sophisticated multi-field matching algorithms, and those algorithms must be
understandable by business users to be meaningful
Scalability
• Distinguishing matches across massive datasets requires a lot of compute
power: everything has to be compared to everything else, multiple
times in multiple ways
• Taking advantage of Big Data processing for scalability requires specialized skills
and takes a long time, and it requires tuning and rewriting as technology changes
• Traditional data quality tools are not designed to work on that scale of data
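To give a rough sense of why comparing everything to everything else is so expensive, the sketch below counts candidate record pairs with and without first grouping records by a blocking key. Blocking is a common general technique and is used here only as an illustration; it is not a description of how Trillium DQ works internally. The function names and the even spread of records over 1,000 blocks are hypothetical.

```python
from collections import Counter


def all_pairs(n: int) -> int:
    """Number of unordered record pairs when every record is compared to every other."""
    return n * (n - 1) // 2


def blocked_pairs(block_keys) -> int:
    """Pairs compared when records are only compared within the same block key."""
    sizes = Counter(block_keys).values()
    return sum(all_pairs(size) for size in sizes)


if __name__ == "__main__":
    # 1 million records compared exhaustively
    print(all_pairs(1_000_000))  # 499999500000
    # Hypothetical: the same records spread evenly over 1,000 blocks
    keys = [i % 1000 for i in range(1_000_000)]
    print(blocked_pairs(keys))  # 499500000
```

Even in this idealized case the comparison count drops by a factor of about 1,000, which is why candidate selection matters as much as raw compute power.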
Trillium DQ for Big Data
Understand, Evaluate, and Resolve Big Data Quality Problems
Trillium Discovery for Big Data
Data Profiling
Gain a complete picture of your data before
use
• Understand the data
• Analyze the data
• Find data quality problems
• Build and evaluate data quality rules
Trillium DQ for Big Data
On Premises or via Trillium Cloud
Deploy any or all products to the cloud - Completely managed SaaS in AWS or Azure
Trillium Quality for Big Data
Data Cleansing and Matching
Cleanse, standardize, and connect
data in accordance with your predefined
standards
• Entity matching and resolution
• Data cleansing and correction
• Data record enrichment
Feature-rich data profiling and data quality processing engines
• Leveraging over two decades of data quality expertise
An efficient orchestration of this engine in Big Data distributed
frameworks
• Powered by an architecture that has been in production with very large
(2000+ node) environments running natively across the cluster
• Close partnership with Cloudera and Hortonworks, with native integration with their stacks
• Syncsort has been a major contributor to the Apache Hadoop open source project
• With efficient orchestration, we can process any number of attributes with a handful
of MapReduce jobs
• Same architecture is used for Apache Spark
“Design once, deploy anywhere” architecture
• Native connectivity providing breadth and performance
• “Intelligent Execution” to optimize process execution at run-time
(MapReduce, Spark 1.x, Spark 2.x)
• On premises and in the cloud (e.g. Amazon EMR)
Data Quality for Your Big Data Needs
Key Outcomes
• Reduce the time for business analysts to discover and understand
data on Big Data platforms
• Allow business analysts who understand the data but have little
technical expertise to quickly find data and run data profiling in
three steps
• Let analysts explore results and drill down to details within 2-5
seconds per view to review and then report on data issues to
business leaders
• Scale to large volumes of data sources & attributes so that business
analysts can understand the contents of any data source needed for
business decisions
• Data is always secured in process and at rest and only available to
authorized users to comply with regulations and avoid fines
Trillium Discovery for Big Data
Trillium Discovery for Big Data
• Delivers enterprise-trusted Trillium Discovery on distributed big data
platforms (e.g. Hadoop, Spark) for high-volume, scalable data profiling
• Provides complete Trillium Discovery data profiling for analysis & review
• Attribute metadata, value & pattern frequencies, key & dependency analysis,
cross-source join analysis, drill down to any outlier or issue, and more…
• Provides easily configured native connectivity for Big Data sources
• Provides management and monitoring of task execution
• Integrates with the security frameworks (Kerberos, AD, LDAP) of
Big Data platforms
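As a small illustration of what attribute metadata and value & pattern frequencies mean in practice, here is a minimal single-machine sketch. It is an assumption-laden toy, not Trillium Discovery's implementation: the names `pattern_of` and `profile_column` and the pattern alphabet (A for letters, 9 for digits) are hypothetical.

```python
from collections import Counter


def pattern_of(value: str) -> str:
    """Map each character to a class: A for letters, 9 for digits, others kept as-is."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Return basic attribute statistics plus value and pattern frequencies."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "value_freq": Counter(non_null),
        "pattern_freq": Counter(pattern_of(v) for v in non_null),
    }


if __name__ == "__main__":
    # Hypothetical postal-code column with a short value and a placeholder
    postal_codes = ["01821", "01821", "1821", "N/A", None, "02110"]
    profile = profile_column(postal_codes)
    print(profile["pattern_freq"].most_common())
```

Rare patterns such as `9999` or `A/A` in a column dominated by `99999` are the kind of outlier an analyst would then drill down into.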
Trillium Discovery for Big Data – Data Profiling at Scale
[Diagram: profiling workflow]
• Select Source
• Run Profiling (1 … n)
• Explore Profiles
• Evaluate Business Rules
• Drilldown to Issues
Native Connectors
▪ HDFS source directories
▪ …
Stored Profiling Results
▪ Metadata & Statistics
▪ Frequency Distributions
▪ Drilldown Indices
Share & Govern Results
▪ Integration (APIs)
▪ Notification
▪ Collaboration
Key Outcomes
• Match and link any data entity – customers, suppliers, products, etc. –
into a trusted single view to support a broad array of business-critical
use cases (e.g. Customer 360, fraud, AML)
• Parse and standardize complex multi-domain data, extended with
enrichment and verification of critical address and geolocation data –
all leveraging out-of-the-box templates
• Utilize “design once, deploy anywhere” approach to speed time-to-
value and focus on building data quality business logic while letting the
product handle the technical aspects of framework execution with no
coding or tuning required
• Leverage the high-performance compute power of distributed Big Data
frameworks including Hadoop MapReduce and Spark to process high
volumes within targeted time windows to meet critical Service Level
Agreements (SLAs)
Trillium Quality for Big Data
Trillium Quality for Big Data
• Integrate, parse, standardize, and match new and legacy customer data
from multiple disparate sources.
• Provide high-quality entity resolution through multi-domain deduplication
and matching with the most comprehensive set of match comparisons
available, including fuzzy matching, distance comparisons, and more.
• Standardize, enhance, and match international data sets with postal and
country-code validation.
• Deploy data quality workflows as native, parallel MapReduce or Spark
processes for optimal efficiency.
• Process hundreds of millions of records of data.
• Increase processing efficiency.
• Support failover through Hadoop’s fault-tolerant design; during a node
failure, processing is redirected to another node.
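To make multi-field fuzzy matching concrete, the following sketch scores two records field by field and combines the scores with weights. It uses the Python standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the field names, weights, and threshold are hypothetical and are not Trillium's match comparisons.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after trimming and lower-casing both values."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Hypothetical field weights; a real match rule set would be tuned per domain.
WEIGHTS = {"name": 0.5, "street": 0.3, "postal_code": 0.2}


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities (weights add up to 1)."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Treat two records as the same entity when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold


if __name__ == "__main__":
    a = {"name": "Acme Holdings Ltd", "street": "12 High Street", "postal_code": "CF10 1AA"}
    b = {"name": "ACME Holdings Ltd.", "street": "12 High St", "postal_code": "CF10 1AA"}
    print(round(match_score(a, b), 3), is_match(a, b))
```

Keeping the per-field scores and weights visible is one way to make a match decision explainable to the business users who review it.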
Trillium Quality for Big Data – Data Cleansing at Scale
Boost effectiveness of machine learning and AI with complete, standardized, matched data.
1. Visually create and test data
quality processes locally
2. Execute in MapReduce or Spark
On premises or in the Cloud
Big Data Platform
Syncsort Trillium Delivers Data You Can Trust
• Data Profiling
• Business Rules & Data Quality Assessment
• Data Validation, Standardization, Enrichment & more
• Matching, Entity Resolution & Verification
• Customer 360
• AI/ML
• Operational Integrations
• Analytics & Reporting
• Data Governance
Trillium Discovery for Big Data
Trillium Quality for Big Data
+ Global Address Verification
Trillium DQ for Big Data
Trillium DQ for Big Data
Use Cases
• Turn your Big Data into a trusted view of your customers, products and more
• Power machine learning and advanced analytics with reliable, fit-for-purpose data
• Gain actionable business insights from high-volume disparate data sets from
across the enterprise
• Deploy industry-leading data quality processes at massive scale, with no coding
or Big Data skills required
Trillium DQ for Big Data evaluates & transforms your Big Data for trusted
business insights
Anti-Money Laundering on Hadoop at Global Bank
CHALLENGE
Bank must monitor transactions to detect Money Laundering for FCA compliance.
Machine learning can detect patterns, but … requires large amounts of
current, clean data.
• Must provide highly accurate entity resolution
• Must be secure – Kerberos, LDAP
• Must have lineage – data origin to end point
• Massive data volumes
• Scattered data – Mainframe, RDBMS, Cloud, …
• Must archive unaltered mainframe data
SOLUTION
• Trillium DQ for Big Data
• Connect CDC
• Connect for Big Data
Full Anti-Money Laundering regulatory compliance with financial crimes data lake –
high performance results at massive scale.
• Full end-to-end data lineage supplied to Apache Atlas and ASG Data Intelligence
• Cluster-native data verification, enrichment, and demanding multi-field
entity resolution on Spark
• Unmodified mainframe “Golden Records” stored on Hadoop
Trillium DQ for
Big Data Cleanses
Credit Data for
Creditsafe
CHALLENGE
Ensure ALL DATA on each company is
analyzed – and NO DATA from another
company is accidentally included –
to get accurate corporate credit ratings.
• Need to profile, cleanse and enhance
data to evaluate credit ratings for
80 million companies in the U.S. alone
• Existing solution lacked flexible
de-dupe matching rules and scalability
• Millions of records to analyze per
company, in multiple inconsistent
data sources, about 800 million/day
total and growing
• Solution must scale!
SOLUTION
• Amazon EMR Cloud
• Trillium DQ for Big Data cleansed,
standardized and matched over
130 million records/hour on a basic
10-node test cluster and met the
business SLA with room to grow
96% Address Matching Accuracy
after Trillium cleansing and
standardization
Saved software costs – Replaced
multiple solutions and tools
Saved Amazon cluster costs and
left room for company growth
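The "room to grow" claim can be sanity-checked with simple arithmetic on the two figures quoted on this slide. The sketch below assumes, hypothetically, that the test-cluster rate holds steady across a full day's load; the slide does not state this.

```python
# Figures taken from the slide; sustained throughput is an assumption.
DAILY_RECORDS = 800_000_000      # about 800 million records per day
RECORDS_PER_HOUR = 130_000_000   # over 130 million records per hour on the 10-node test cluster

hours_needed = DAILY_RECORDS / RECORDS_PER_HOUR
daily_capacity = RECORDS_PER_HOUR * 24

print(f"Hours to process one day's records: {hours_needed:.1f}")   # 6.2
print(f"Records processable in 24 hours: {daily_capacity:,}")      # 3,120,000,000
print(f"Headroom factor: {daily_capacity / DAILY_RECORDS:.1f}x")    # 3.9x
```

Under that assumption, one day's volume fits in roughly a quarter of the day, which is consistent with the slide's statement about meeting the SLA with room to grow.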
“We can’t afford to miss
information, or mix up information
about businesses with similar
names. Companies count on our
highly accurate predictive scoring
to provide fast, accurate ratings
for their potential customers
and vendors.”
Next Steps
For more information on Trillium DQ for Big Data and our other
Syncsort Trillium data quality solutions, please visit:
https://www.syncsort.com/en/products/trillium-dq-for-big-data
And:
https://www.syncsort.com/en/integrate
Q & A
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for the Data Lake
