Big Data Pitfalls
April 8, 2015
2
Big Data Introduction
3
So What is it?
● Misnomer and marketing speak
● “Unstructured” data
– Text heavy
– Without obvious/clear structure
● Comes from many places, in many styles
4
5
Where It Comes From
6
Building Your Data Lake
7
A Common Evolution
8
A Common Evolution
9
Hadoop to the Rescue!
10
You Have a Data Lake!
11
Hadoop to the Rescue
● Cross system analytics?
● Data quality confidence?
● Source of truth?
● Tool chain support?
● Giant yellow elephants?
12
If any are ignored...
13
You have a Data Swamp!
14
Don't worry, even the Jedi had a Data Swamp...
15
Goal is to build a Data Reservoir
16
Reservoirs...
● Contain data that is...
– Managed
– Transformed
– Filtered
– Secured
– Portable
– Fit for purpose
Source: Gartner
17
Pitfalls
18
Data Warehouse Models
● Traditional models don't cover semi-structured data
● Modern models are hybrids that cross the structured/semi-structured boundary
19
Data Vault
20
Data Vault
● Developed by Dan Linstedt
● Tie technical keys across structured and semi-structured data sources
● Semi-structured data can be made more structured and loaded into a relational data vault
● Tools have to support crossing sources
● More details: http://www.tdan.com/view-articles/5054/
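A minimal sketch of what "tying technical keys across sources" could look like. It assumes the technical key is a hash of a normalized business key, which is one common approach and not something the slides prescribe. The field names (customer_no, customer_hk) and both sample records are hypothetical.

```python
import hashlib


def hub_key(business_key: str) -> str:
    """Derive a deterministic technical key from a business key.

    Assumption: trimming and upper-casing is enough to normalize the key.
    Real sources usually need business-specific cleanup rules.
    """
    normalized = business_key.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Structured source: a row from a relational customer table (hypothetical).
crm_row = {"customer_no": "C-1001", "name": "Ada Smith"}

# Semi-structured source: a JSON-like support ticket (hypothetical).
ticket = {"customer": {"id": " c-1001 "}, "body": "The car is hot."}

# Hub row built from the structured source.
hub_customer = {
    "customer_hk": hub_key(crm_row["customer_no"]),
    "customer_no": crm_row["customer_no"],
}

# Satellite-style row built from the semi-structured source.
sat_ticket = {
    "customer_hk": hub_key(ticket["customer"]["id"]),
    "ticket_text": ticket["body"],
}

# Both rows carry the same technical key, so they can be joined.
same_customer = hub_customer["customer_hk"] == sat_ticket["customer_hk"]
```

Because the key is computed from the data itself, either source can be loaded first and the rows still line up.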
21
Anchor
22
Anchor
● Developed by Lars Rönnbäck
● 6th normal form data warehouse
● Have to transform semi-structured data to match the anchor model
● Provides flexible model that should be able to have marts built upon it
● More details: http://www.anchormodeling.com/
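A rough sketch of the transformation step, assuming a flat record has already been extracted from the semi-structured source. It splits the record into one identity row plus one row per attribute, which is the general shape of a 6th normal form layout. The table names (anchor, attr_*) are illustrative only and do not follow the Anchor Modeling naming convention; the sample record is hypothetical.

```python
def to_anchor_rows(record: dict, anchor_id: int) -> dict:
    """Split one flat record into an anchor row plus one row per attribute."""
    # The anchor table holds only the identity of the entity.
    rows = {"anchor": [(anchor_id,)]}
    for attribute, value in record.items():
        if value is None:
            # A missing attribute simply has no row, so no NULLs are stored.
            continue
        # Each attribute lives in its own table, keyed by the anchor id.
        rows[f"attr_{attribute}"] = [(anchor_id, value)]
    return rows


# Hypothetical record pulled from a semi-structured source.
record = {"name": "Ada Smith", "city": "Raleigh", "phone": None}
rows = to_anchor_rows(record, anchor_id=1)
```

A new attribute appearing in the source later only adds a new attribute table; the existing tables stay unchanged, which is where the flexibility comes from.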
23
Textual Disambiguation
● Developed by Bill Inmon
● Breaking semi-structured data down by context
● Converts the data into structured format, consumable by tools
● Store data within the data warehouse – 8th/9th normal form
● White papers and more details are on Bill's website:
http://www.forestrimtech.com/
24
Source: http://www.slideshare.net/Roenbaeck/anchor-modeling-8140128
25
Working With “Unstructured” Data
● Most data tools require structure (a database schema, clear-cut data formatting)
● Business and technical knowledge required
– Business to provide the pattern “the grammar or syntax”
– Technical to provide the “how”
26
Working With “Unstructured” Data
“The car is hot.”
27
Identifying Context
● It's a really nice car.
● Its internal temperature requires adjustment
● It's hot to the touch
● It's on fire
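A toy sketch of how the four readings might be told apart, assuming the business has supplied cue words for each meaning. The rule table and the cue words are invented for illustration; a real rule set would come from the business and would be far richer than keyword matching.

```python
import re

# Hypothetical business rules: cue words that select one meaning of "hot".
# Rules are checked in order, so the most urgent meaning comes first.
CONTEXT_RULES = [
    ("on fire", {"fire", "smoke", "flames", "burning"}),
    ("internal temperature requires adjustment", {"ac", "vents", "cabin", "inside"}),
    ("hot to the touch", {"hood", "touch", "surface", "sun"}),
    ("really nice car", {"nice", "love", "gorgeous", "want"}),
]


def identify_context(text: str) -> str:
    """Return the first meaning whose cue words appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    for meaning, cues in CONTEXT_RULES:
        if words & cues:
            return meaning
    # Without surrounding cues the sentence cannot be resolved.
    return "ambiguous"


bare = identify_context("The car is hot.")
burning = identify_context("The car is hot, I can see flames!")
admired = identify_context("The car is hot. I love it.")
```

The bare sentence stays "ambiguous", which is the point of the slide: the meaning comes from the surrounding text, not from the sentence itself.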
28
29
How to Implement
● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend)
● Have to create the grammar/syntax rules for the particular business
● MDM is _not_ the solution
● Best to have a data warehouse based on subject/relationships
– Data Vault
– Anchor
– Textual Disambiguation
30
Data Symbiosis
● Data in the data lake can't stand on its own
– Ties back to rest of the structured data
– Requires firm understanding of business rules/logic
● Provides richer data sets
● Difficult to do before data lakes; after adding a data lake, the problems magnify
– But so do the rewards!
31
Data Quality
● Not just a problem for Data Warehouses!
● Measuring “fit for purpose”
● Same rules used for data warehouses apply to big data
32
Principles of Data Quality
● Consistency
● Correctness
● Timeliness
● Precision
● Unambiguous
● Completeness
● Reliability
● Accuracy
● Objectivity
● Conciseness
● Usefulness
● Usability
● Relevance
● Quantity
Source: Data Quality Fundamentals, The Data Warehousing Institute
33
Why Data Quality?
● Main way to control/tame your data problems
● Carries the most hidden costs because it's the hardest to fix
● Target upstream for problem solutions
34
How to Implement
● Data integration tools
● Custom coding (Map/Reduce, etc.)
● Data Profiling
● MDM (as central “dictionary”/“grammar” handler)
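As an illustration of the data profiling item, here is a small sketch that measures two of the principles listed earlier: completeness (share of records with a required field filled in) and a crude consistency signal (count of exact duplicate records). The function name, field names and sample rows are all hypothetical, and treating empty strings as missing is an assumption.

```python
def profile(rows: list, required: list) -> dict:
    """Report completeness per required field and count exact duplicates."""
    total = len(rows)
    completeness = {}
    for field in required:
        # Assumption: None and the empty string both count as missing.
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        completeness[field] = filled / total if total else 0.0

    seen = set()
    duplicates = 0
    for row in rows:
        # Sort items so that key order does not affect the comparison.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)

    return {"rows": total, "completeness": completeness, "duplicates": duplicates}


# Hypothetical records landed in the data lake.
sample = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": ""},
    {"id": "1", "email": "a@example.com"},
    {"id": "3", "email": None},
]
report = profile(sample, required=["id", "email"])
```

Running a check like this as data lands, rather than after it reaches the warehouse, is one way to target problems upstream.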
35
Tooling
36
Does Your Tool Chain...
● Support Hadoop?
● Interface with non-traditional database solutions (i.e. not an RDBMS)?
● Allow for integration across disparate sources?
● Support data quality?
37
If Not...
38
Hadoop Ecosystem
● Bridges some of the gaps
– Hive – SQL to Hadoop interface (JDBC support)
● Provides even more power
https://hadoopecosystemtable.github.io/
Plus dozens of others... and growing
39
Sources
● http://en.wikipedia.org/wiki/File:Pitfall!_Coverart.png
● http://www.networkcomputing.com/big-data-defined/d/d-id/1204588
● http://www.appliedi.net/
● http://imgbuddy.com/internet-of-things-icon.asp
● http://www.smashingapps.com/, et al.
● http://www.colleenkerriganphotographs.com/p663330184/h217016CE#h217016ce
