Copyright © Capgemini 2016. All Rights Reserved
Bigdata Architecture Overview
Gartner Hype Cycle – Emerging Technologies
Benefits
Big Data and its Dimensions
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible
Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Velocity: Streaming data and large volume data movement
Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)
Veracity: Having a lot of data in different volumes coming in at high speed is worthless if that data is incorrect. Organizations need to ensure that both the data and the analyses performed on it are correct.
Value: Discovering value from multichannel datasets
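The Volume scale above (1 PB = 1K TB, 1 ZB = 1B TB) can be checked with a few lines of arithmetic. This is only an illustrative sketch using decimal (SI) units; the function name `to_terabytes` and the extra exabyte entry are assumptions added here, not part of the slides.

```python
# Decimal (SI) storage units expressed in terabytes.
# The exabyte entry is added for completeness; the slide only names TB, PB and ZB.
TERABYTES_PER_UNIT = {
    "TB": 1,
    "PB": 1_000,            # 1 petabyte  = 1K TB
    "EB": 1_000_000,        # 1 exabyte   = 1M TB
    "ZB": 1_000_000_000,    # 1 zettabyte = 1B TB
}


def to_terabytes(value: float, unit: str) -> float:
    """Convert a size in the given unit to terabytes (hypothetical helper)."""
    return value * TERABYTES_PER_UNIT[unit]
```

For instance, `to_terabytes(1, "ZB")` gives one billion terabytes, matching the figure quoted for Volume.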
Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn
General reference architecture for Big Data Analytics
Stages: Value, Act, Insight, Analyze, Information, Process, Source data
Manage
• Data governance and security
• Data privacy
• Compliance
• Collaboration
• Value generation
• Program delivery
• Data-driven culture
• Information strategy
• Skill development
• Master data mgmt
• Metadata mgmt
• Data quality mgmt
• Operations, SLAs
• Orchestration
Value
• Customer profitability
• Operational cost cutting
• Risk prevention
• Market share increase
Act
• Business Applications: customer campaign, trigger activity
• Business Processes: trigger event, adjust process
• Decision makers: approve/reject business opportunities, develop new business models and products
Insight
• Customer Experience
• Operational Process Optimization
• Risk, Fraud
• Disruptive Business Model
Analyze
• Search: What is relevant?
• Explorative: How does it work?
• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What will happen?
• Prescriptive: How to act next?
Information
• Data asset descriptions
• Processed data: measures, KPIs, dimensions, master data
• Granular data: events, context information
Process
• Ingest
• Catalog
• Stream
• Store
• Prepare
• Refine, blend
• Manage lifecycle
Source data (the diagram distinguishes Internal data and External Data)
• IT managed applications (ERP, SCM, CRM)
• Master and reference data
• Business owned informal data
• Documents, mail, images, voice, video
• Web and mobile apps
• B2B
• Internet, Social, Internet of Things (machine, sensor)
• Third party data: market, weather, climate, geolocation
• Open data
Other diagram labels: Business performance, Performance improvement, Mask
The Business Data Lake (BDL) is also aligned with our principles
• Unleash Data and Insights as-a-service
• Make Insight-driven Value a Crucial Business KPI
• Empower your People with Insights at the Point of Action
• Develop an Enterprise Data Science Culture
• Master Governance, Security and Privacy of your Data Assets
• Enable your Data Landscape for the Flood coming from Connected People and Things
• Embark on the Journey to Insights within your Business and Technology Context
About the BDL:
• It concerns both Business and (disruptive) Technology
• It works with high volumes of all kinds of data
• It integrates Unified Data Management capabilities to manage governance, security, privacy, MDM, RDM, etc.
• It also comes with a new, specific mindset that has to be addressed at the Enterprise level
• We (Capgemini) intend to offer the BDL as-a-Service
• Bringing Business Value by delivering Insights at the Point of Action is the motto of the BDL
Business Data Lake Reference Architecture - Conceptual
Characteristics
• Store-anything; analyze everything
• Blend traditional data elements with new data types
• Manage centrally, govern locally
• Future-proof design
• Highly scalable and available
Diagram labels (numbered callouts 1 to 11 omitted):
• Sources: Structured data sources, Transactional Systems (RES/CRM), Un/Semi-Structured Data Sources
• Data Ingestion Layer: Extract & Load, Streams
• Data Lake Layer: Landing, ODS
• Distributed Storage Layer; Distributed Compute Layer / Services
• Data Distillation Layer: Data Quality Governance Framework (Business Rules, Transformation, Aggregation), Customer Master (CRM)
• Data Provisioning Layer and Data Dissemination Layer: Sandbox, SQL-on-Hadoop, In-Memory Grid, Data Virtualization or Blending, Marts (HR Mart 1, HR Mart 2)
• Data Access Layer: Self-service, Data Visualization and Reporting, Advanced Analytics
• API Layer
• Across layers: Data Governance (Audit, Lineage), Data Governance Integration, Metadata Management, Data Security (Authentication, Authorization, Kerberos)
Business Data Lake Reference Architecture - Logical
The diagram reuses the layer labels of the conceptual architecture (sources, ingestion, data lake, distillation, provisioning, dissemination, access, API, governance, metadata and security) and adds these technology labels:
• Talend 6.3 or latest
• Hortonworks HDP 2.5 or latest
• Kafka
• Spark Streaming / Storm
• Spark
• HBase / Hive
• Datamarts
• Redshift
• Zeppelin
• RESTful Service
• Ranger, Knox
• Atlas
Detailed layer breakup
Reference architecture for data ingestion - Indicative
Functionality: Ingest data from a variety of sources, with varying latency, into the Data Lake
Data Sources (Structured, Semi-Structured, Unstructured)
Data State
• Data at Rest (ETL pushdown, batch using standard DI tools or Sqoop)
• Data in Motion (fast data, processed via tools like Flume, Storm, Spark, etc.)
Data Sourcing
• Data Integration Services
• S/FTP based push (logs, text, other file based)
• Changed Data Management (delta extracts, event mgmt)
• Source Extraction Services (XML, relational, other extracts)
Data Transformation
• Transformation Services
• Fast Data Manipulation: sorting, file merges, joins, file splitting, others
• Transform Routines: aggregation, mappings, lookups, calculations, others
• Big Data Transformations: user-defined functions / custom MR code (Java, Python etc.) for complex logic
• ETL Pushdown Processing (execute mapping jobs on Hadoop cluster on HDFS/Hive/Spark…)
Common Services
• Metadata Management
• Automation Services
• Deployment (job & others)
• Error Handling
• Clustering & Capacity
Data Persistence
Characteristics
• The Data Ingestion design principles are based on integrating raw data characterized by extreme scale and variability, and making provisions for both ‘data at rest’ (batch) and ‘data in motion’ (low latency)
• The framework combines traditional data integration methodologies leveraging the Extract-Transform-Load approach and extends it to also process semi-structured and unstructured data elements.
• The classical model of tracking data elements through their lifecycle and providing for lineage can be added in this framework.
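The ingestion characteristics above can be sketched in a few lines: one landing routine accepts both a structured batch extract and a semi-structured stream, and tags each record with simple lineage fields. This is a minimal, assumption-laden illustration using only the Python standard library; the function names, the `mode` labels and the in-memory `landing` list are hypothetical stand-ins for the real tools the slide names (Sqoop, Flume, Storm, Spark).

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, List


def extract_csv(text: str) -> Iterator[Dict[str, str]]:
    """Structured source: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def extract_json_lines(text: str) -> Iterator[Dict[str, object]]:
    """Semi-structured source: yield one dict per non-empty JSON line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


def ingest(source: str, mode: str, records: Iterable[Dict], landing: List[Dict]) -> int:
    """Land records unchanged, wrapped with lineage fields (source and mode).

    mode is a hypothetical label: "batch" for data at rest,
    "stream" for data in motion.
    """
    count = 0
    for record in records:
        landing.append({"source": source, "mode": mode, "payload": dict(record)})
        count += 1
    return count
```

Keeping the payload untouched and adding lineage alongside it is one way to leave room for the lifecycle tracking mentioned in the last bullet.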
Data Acquisition and Reconciliation
Data Reconciliation (Optional)
Data reconciliation is part of data quality and ensures data integrity in the data lake. The reconciliation process checks whether the data has been loaded properly, to ensure the accuracy and completeness of the data.
Master Data: This is a fairly simple process, as master data is not subject to frequent changes. The granularity of the data remains the same in the source and the target.
Transactional Data: Reconciliation of transactional data is instrumental to the success of big data systems. Reconciliation can happen on the entire data set or on the incremental data, depending on the method by which the data is ingested.
Separate metadata tables / files are designed specifically for reconciliation. These tables / files are populated with reconciliation queries, and reconciliation reports are generated after data is loaded into the data lake.
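One way to picture the reconciliation check is to compare a row count and an order-independent checksum between source and target after a load. The sketch below is an assumption about how such a check could look, not the method used in the deck; the function names and the choice of SHA-256 are illustrative.

```python
import hashlib
from typing import Dict, Iterable, List


def fingerprint(rows: Iterable[Dict[str, object]], columns: List[str]) -> Dict[str, int]:
    """Return the row count and an order-independent checksum over the given columns."""
    count = 0
    checksum = 0
    for row in rows:
        canonical = "|".join(str(row[c]) for c in columns)
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        # Summing (mod 2**64) keeps the result independent of row order
        # while still being sensitive to duplicated rows.
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (2 ** 64)
        count += 1
    return {"row_count": count, "checksum": checksum}


def reconcile(source_rows, target_rows, columns: List[str]) -> Dict[str, object]:
    """Build a small reconciliation report for one load."""
    src = fingerprint(source_rows, columns)
    tgt = fingerprint(target_rows, columns)
    return {"source": src, "target": tgt, "matched": src == tgt}
```

The same function can be run over the full data set or only over the increment that was just ingested, which mirrors the two options described for transactional data.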
Data Acquisition
Data acquisition can be described as a combination of Landing Zone, Data Validation, Delta Detection and Data Enrichment.
Landing Zone: The area where data from all source systems across the client’s landscape lands for consumption by downstream systems.
Data Validation: The first checkpoint or zone, where MDM-based checks are applied to the incoming source data files.
Delta Detection: Applies to data feeds from source systems that are able to provide incremental (delta) data for the regular ongoing processing into the data lake solution.
Data Enrichment: Processes used to enhance, refine or otherwise improve raw data. Data from various enrichment sources is pushed to the data lake via the Landing Zone to enrich existing data.
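The validation and delta steps described for data acquisition can be illustrated with a high-water-mark filter: records that fail a basic check are set aside, and only records newer than the last processed timestamp are passed on. The field names (`customer_id`, `updated_at`) and the watermark technique are assumptions chosen for this sketch; the slides do not prescribe a specific mechanism.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical mandatory fields; a real check would come from MDM rules.
REQUIRED_FIELDS = ("customer_id", "updated_at")


def validate(record: Dict[str, object]) -> bool:
    """First checkpoint: every required field must be present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def take_delta(
    feed: List[Dict[str, object]], last_watermark: datetime
) -> Tuple[List[Dict[str, object]], datetime]:
    """Keep valid records changed after the watermark and return the new watermark."""
    valid = [r for r in feed if validate(r)]
    fresh = [r for r in valid if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=last_watermark)
    return fresh, new_watermark
```

Persisting the returned watermark between runs is what makes the next run incremental; running the same feed again with the new watermark yields nothing.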
Data Distillation in the Data Lake: approach to provisioning for data consumption
Characteristics
• Uniform approach for distillation of information from the data lake
• A centralized Data Quality engine for application of uniform data quality rules across the enterprise
• An integrated Data Quality function to cleanse, standardize, enrich and de-duplicate data
• Console for design, development and validation of rules
• Data Quality Services for integration with operational systems and MDM
• An Exception Management solution for resolving data issues and errors
• Data quality processes running on the data will be translated into MapReduce for faster processing
Diagram labels:
• Data Persistence Layer
• Distillation Layer: Extract, Transform, Aggregation (Σ), Secure
• Data Quality Store
• Data Quality Console
• Data Quality Engine: Data Profiling, Data Cleansing, Match & Merge, Data Enrichment
• Rule Manager
• DQ Meta-data
• Data Dashboard
• Exception Management
• Data Quality Configurator
• Exception Repository
• DQ Mart
Functionality: Take data from the storage tier and convert it to structured data for easier analysis by downstream applications.
This is done by extracting, transforming and aggregating high-quality data from the Data Lake and making it available to analytical and reporting applications. Transformation also involves data quality checks and corrections, such as profiling, validating and cleansing structured and unstructured data based on business rules. Data is distilled (or prepared) on a per-function basis and made available for consumption. This is consistent with the design practice of ‘manage data centrally and provision locally’.
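The distillation flow above (cleanse, standardize, validate, de-duplicate, route failures to an exception store) can be sketched as a tiny rule pipeline. Everything here is hypothetical: the record fields `email` and `country`, the validity rule and the function names are assumptions made for illustration, and a real Data Quality engine would run far richer rules.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def cleanse(record: Record) -> Record:
    """Trim stray whitespace from every field."""
    return {key: value.strip() for key, value in record.items()}


def standardize(record: Record) -> Record:
    """Apply a uniform representation (here: upper-case country codes)."""
    out = dict(record)
    out["country"] = out["country"].upper()
    return out


def is_valid(record: Record) -> bool:
    """Hypothetical business rule: an email must contain '@'."""
    return "@" in record["email"]


def distill(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (curated records, exception records) after applying all rules."""
    curated: List[Record] = []
    exceptions: List[Record] = []
    seen = set()
    for raw in records:
        record = standardize(cleanse(raw))
        if not is_valid(record):
            exceptions.append(record)   # goes to the exception repository
            continue
        match_key = record["email"].lower()
        if match_key in seen:           # de-duplicate on the match key
            continue
        seen.add(match_key)
        curated.append(record)
    return curated, exceptions
```

Keeping the rules as small separate functions reflects the idea of a central rule set that is applied uniformly, while the exceptions list stands in for the exception repository shown in the diagram.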
Data Persistence Layer : Schema on Read & Distill on Demand
Diagram labels: Hadoop Distributed File System (HDFS), Namenode, Datanodes, Replication, Job / Task Tracker, Storage Cluster / Rack
Characteristics
• Deliver a single, comprehensive view of all data, across functional areas, to conduct deep analysis
• Multi-tiered Data Lake that serves distinct functionalities, e.g., landing, staging and curated stores
• A landing area containing both traditional and non-traditional data, characterized by attributes of value, veracity, volume, velocity and variety
• Eliminate the need for upfront schema design and rigid pre-configured models
• Easy and cost-effective configuration for scale up and scale down
• Store everything, distill on demand
Diagram labels: Data Ingestion; Data Lake (Landing, Staging, Curated); Audit, Metadata, Search
Functionality: Create a single repository for information and deliver a single, silo-less store to handle all types of data for all reporting,
analysis and discovery requirements
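"Schema on read" can be illustrated by storing raw records untouched and applying a schema only when a consumer reads them. The sketch below assumes raw data arrives as JSON lines and that a schema is just a mapping from field name to a casting function; both are simplifications introduced here, not details from the slides.

```python
import json
from typing import Callable, Dict, Iterator, List

Schema = Dict[str, Callable[[object], object]]


def land(raw_store: List[str], line: str) -> None:
    """Store the record exactly as received; no schema is enforced on write."""
    raw_store.append(line)


def read_with_schema(raw_store: List[str], schema: Schema) -> Iterator[Dict[str, object]]:
    """Project and cast each raw record according to the reader's schema.

    Fields missing from a record come back as None; extra fields are ignored.
    """
    for line in raw_store:
        document = json.loads(line)
        yield {
            field: cast(document[field]) if field in document else None
            for field, cast in schema.items()
        }
```

Two consumers can read the same raw store with different schemas, for example `{"id": int, "amount": float}` for finance and `{"id": str, "note": str}` for text analysis, which is the practical meaning of "store everything, distill on demand".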
Approach to Data Provisioning
Diagram labels:
• Data Access Layer
• Data Provisioning: Discovery Platform / Sandboxes, Analytical Views, Data Virtualization
• Data Dissemination: HR Mart 1, HR Mart 2, HR Mart 3, HR Mart 4
Characteristics
• The Data Marts & Aggregate Structures layer will include subject-specific data mart structures which can be used by various tools to retrieve data and information. This layer will also support user-specific sandboxes for power users to perform activities such as data mining, identifying data patterns, and running analytical and statistical models using various tools
• If required, there will be multiple versions of the subject areas for different production streams
• Data marts and aggregate structures such as summary tables will be created based on business and performance requirements. As far as possible, database-managed aggregates such as computed views and indexes will be created to reduce ETL-based data movement
• Data Virtualization will address combining datasets from multiple data stores across various layers in the data lake stack.
Functionality: Provision data-sets to create various combinations of custom views, by specific functions/departments and also for cross-functional access
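The preference for database-managed aggregates over ETL copies can be shown with a view: the mart is defined once as a query, and it reflects new detail rows without any further data movement. The example uses SQLite from the Python standard library purely for illustration; the table, view and column names are hypothetical, and the deck itself names Hive, HBase and Redshift rather than SQLite.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[str, str, float]


def build_hr_mart(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Create a detail table and an aggregate view that acts as a data mart."""
    conn.execute("CREATE TABLE hr_detail (dept TEXT, employee TEXT, salary REAL)")
    conn.executemany("INSERT INTO hr_detail VALUES (?, ?, ?)", list(rows))
    # The view is computed by the database on read, so no ETL job
    # has to copy or refresh the aggregated data.
    conn.execute(
        "CREATE VIEW hr_mart_summary AS "
        "SELECT dept, COUNT(*) AS headcount, AVG(salary) AS avg_salary "
        "FROM hr_detail GROUP BY dept"
    )


def read_hr_mart(conn: sqlite3.Connection) -> List[Tuple[str, int, float]]:
    """Query the mart the way a reporting tool might."""
    cursor = conn.execute(
        "SELECT dept, headcount, avg_salary FROM hr_mart_summary ORDER BY dept"
    )
    return cursor.fetchall()
```

A plain view trades query-time cost for freshness; where performance requirements dominate, a summary table or materialized structure would be the alternative the characteristics mention.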
© David Feinleib

CWIN17 India / Bigdata architecture yashowardhan sowale

  • 1. 1Copyright © Capgemini 2016. All Rights Reserved Bigdata Architecture Overview
  • 2. 2Copyright © Capgemini 2016. All Rights Reserved Gartner Hype Cycle – Emerging Technologies
  • 3. 3Copyright © Capgemini 2016. All Rights Reserved Benefits
  • 4. 4Copyright © Capgemini 2016. All Rights Reserved Big Data and its Dimensions Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text Streaming data and large volume data movement Scale from Terabytes to Petabytes (1K TBs) to Zetabytes (1B TBs) Having a lot of data in different volumes coming in at high speed is worthless if that data is incorrect. Organizations need to ensure that the data is correct as well as the analyses performed on the data are correct. Discovering value from multichannel datasets Variety: Velocity: Volume: Veracity: Value:
  • 5. 5Copyright © Capgemini 2016. All Rights Reserved Applications for Big Data Analytics Homeland Security FinanceSmarter Healthcare Multi-channel sales Telecom Manufacturing Traffic Control Trading Analytics Fraud and Risk Log Analysis Search Quality Retail: Churn
  • 6. 6Copyright © Capgemini 2016. All Rights Reserved Manage  Data governance and security  Data privacy  Compliance  Collaboration  Value generation  Program delivery  Data-driven culture  Information strategy  Skill development  Master data mgmt  Metadata mgmt  Data quality mgmt  Operations, SLA’s  Orchestration General reference architecture for Big Data Analytics ValueActInsightAnalyzeInformationProcessSource data Customer profitability Operational cost cutting Risk prevention Market share increase Business Applications  Customer campaign  Trigger activity Business Processes  Trigger event  Adjust process Decision makers  Approve/reject business opportunities  Develop new business models and products Customer Experience Operational Process Optimization Risk, Fraud Disruptive Business Model Search What is relevant? Explorative How does it work? Descriptive What happened? Diagnostic Why did it happen? Predictive What will happen? Prescriptive How to act next? Data asset descriptions Processed data  Measures, KPI’s  Dimensions, Master data Granular data  Events  Context information Ingest Catalog Stream Store Prepare Refine, blend Manage lifecycle Internal data  IT managed applications (ERP, SCM, CRM)  Master and reference data  Business owned informal data  Documents, mail, images, voice, video  Web and mobile apps  B2B  Internet, Social, Internet of Things (machine, sensor)  Third party data: market, weather, climate, geolocation  Open data External Data Business performance Performance improvement Mask
  • 7. 7Copyright © Capgemini 2016. All Rights Reserved The BDL is also aligned with our principles  Unleash Data and Insights as-a-service Make Insight-driven Value a Crucial Business KPI Empower your People with Insights at the Point of Action Develop an Enterprise Data Science Culture Master Governance, Security and Privacy of your Data Assets Enable your Data Landscape for the Flood coming from Connected People and Things Embark on the Journey to Insights within your Business and Technology Context 1 2 3 7654 It concerns both Business and (disruptive) Technology It works with high volumes of all kinds of data It integrates Unified Data Management capabilities to manage governance, security, privacy, MDM, RDM, etc it also comes with a new, specific mindset that has to be addressed at the Enterprise level We (Capgemini) intend to offer the BDL as-a-Service Bringing Business Value by delivering Insights at the Point of Action is the motto of the BDL 1 2 3 7 654
  • 8. 8Copyright © Capgemini 2016. All Rights Reserved Business Data Lake Reference Architecture - Conceptual Characteristics  Store-anything; analyze everything  Blend traditional data elements with new data types  Manage centrally, govern locally  Future-proof design  Highly scalable and available Data Access Layer Data Distillation Layer Data Quality Governance Framework (Business Rules, Transformation, Aggregation) Customer Master (CRM) Data Lake Layer Landing Self-service 4 Data Ingestion LayerExtract & Load Streams 3 Structured data Sources 2 1 ODS SandboxSQL-on-Hadoop In-Memory Grid Data Visualization and Reporting Advanced Analytics Data Virtualization Or Blending Marts DataGovernance(Audit,Lineage) 7 MetadataManagement Transactional Systems(RES/CRM) Un/Semi-Structured Data Sources Data Dissemination Layer Data Provisioning Layer HR Mart 1 HR Mart 2 Distributed Compute Layer / Services Distributed Storage Layer Data Governance Integration APILayer 11 6 5 DataSecurity(Authentication,Authorization,Kerberos) 8 9 10
  • 9. 9Copyright © Capgemini 2016. All Rights Reserved Business Data Lake Reference Architecture - Logical Talend 6.3 or latest Data Access Layer Data Distillation Layer Data Quality Governance Framework (Business Rules, Transformation, Aggregation) Customer Master (CRM) Data Lake Layer Landing 4 Data Ingestion LayerExtract & Load Streams 3 Structured data Sources 2 1 ODS SandboxSQL-on-Hadoop In-Memory Grid Data Virtualization Or Blending Marts DataGovernance(Audit,Lineage) 7 MetadataManagement Transactional Systems(RES/CRM) Un/Semi-Structured Data Sources Data Dissemination Layer Data Provisioning Layer HR Mart 1 HR Mart 2 APILayer 11 6 5 DataSecurity(Authentication,Authorization,Kerberos) 8 9 10 Ranger, Knox Atlas Hortonworks HDP 2.5 or latest Spark HBASE Hive HBASE / Hive Datamarts Redshift Zeppellin RESTful Service Self-serviceData Visualization and Reporting Advanced Analytics Spark Streaming/Storm Kafka
  • 10. 10Copyright © Capgemini 2016. All Rights Reserved Detailed layer breakup
  • 11. 11Copyright © Capgemini 2016. All Rights Reserved Reference architecture for data ingestion - Indicative Functionality: Ingest Data from a variety of sources and with varying latency, into the Data Lake Data Integration Services S/FTP based push (Logs, text, other file based) Changed Data Management (Delta extracts, event mgmt) Data Sourcing Source Extraction Services (XML, Relational, Other extracts) DataTransformation Transformation Services Fast Data Manipulation • Sorting • File Merges • Joins • File Splitting • Others Transform Routines • Aggregation • Mappings • Lookups • Calculations • others Metadata Management Automation Services Deployment (Job & others) Error Handling Clustering & Capacity Common Services Data Sources (Structured, Semi-Structured, Unstructured) DataState Data at Rest (ETL pushdown, batch using standard DI tools or Sqoop) Data in Motion (Fast data, processed via tools like Flume, Storm, Spark, etc) Data Persistence Big Data Transformations • User-defined functions / custom MR code (Java, Python etc.) for complex logic ETL Pushdown Processing (Execute mapping jobs on Hadoop cluster on HDFS/Hive/Spark….) Characteristics  The Data Ingestion design principles are based on integrating raw data characterized by extreme scale and variability, and making provisions for both ‘data at rest’ (batch) and ‘data in motion’ (low latency)  The framework combines traditional data integration methodologies leveraging the Extract-Transform-Load approach and extends it to also process semi-structured and unstructured data elements.  The classical model of tracking data elements through their lifecycle and providing for lineage can be added in this framework.
  • 12. Data Acquisition and Reconciliation
    Data Acquisition
    Data Acquisition can be described as a combination of Landing Zone, Data Validation, Delta Detection and Data Enrichment.
     Landing Zone: the area where data from all source systems across the client’s landscape lands for consumption by downstream systems.
     Data Validation: the first checkpoint, where MDM-based checks are applied to the incoming source data files.
     Delta Detection: applies to data feeds from source systems that can provide incremental (delta) data for regular, ongoing processing into the data lake.
     Data Enrichment: processes used to enhance, refine or otherwise improve raw data. Data from various enrichment sources is pushed to the data lake via the Landing Zone to enrich existing data.
    Data Reconciliation (Optional)
    Data Reconciliation is part of data quality and ensures data integrity in the data lake. The reconciliation process checks whether the data has been loaded properly, to ensure the accuracy and completeness of the data.
     Master Data: a fairly simple process, as Master Data is not subject to frequent changes. The granularity of the data remains the same in the source and the target.
     Transactional Data: reconciliation of Transactional Data is instrumental to the success of big data systems. Reconciliation can happen on the entire data set or on the incremental data, depending on the method by which the data is ingested.
    Separate metadata tables/files are designed specifically for reconciliation. These tables/files are populated with reconciliation queries, and reconciliation reports are generated after data is loaded into the data lake.
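Delta detection and reconciliation as described on the slide above can be sketched in a few lines. This is a simplified illustration, not the deck's implementation: the function names, the `id` and `amount` fields and the sample rows are assumptions, and the checks (row counts, a column total, missing keys) are only one common choice of reconciliation measures.

```python
def detect_delta(previous, current, key="id"):
    """Compare two full snapshots and return inserted, updated and deleted rows."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    return {
        "inserted": [r for k, r in curr_by_key.items() if k not in prev_by_key],
        "updated": [r for k, r in curr_by_key.items()
                    if k in prev_by_key and prev_by_key[k] != r],
        "deleted": [r for k, r in prev_by_key.items() if k not in curr_by_key],
    }


def reconcile(source, target, key="id", amount_field="amount"):
    """Build a reconciliation report comparing source rows with loaded rows."""
    source_keys = {row[key] for row in source}
    target_keys = {row[key] for row in target}
    report = {
        "source_count": len(source),
        "target_count": len(target),
        "source_total": sum(row[amount_field] for row in source),
        "target_total": sum(row[amount_field] for row in target),
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
    }
    # The load is considered reconciled only if every check agrees.
    report["matched"] = (
        report["source_count"] == report["target_count"]
        and report["source_total"] == report["target_total"]
        and not report["missing_in_target"]
        and not report["unexpected_in_target"]
    )
    return report


# Hypothetical snapshots of a transactional feed.
yesterday = [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}]
today = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}, {"id": 3, "amount": 75}]
delta = detect_delta(yesterday, today)

# Hypothetical load in which one row failed to reach the data lake.
loaded = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
report = reconcile(today, loaded)
```

In this sketch the report would be written to the dedicated reconciliation tables/files after each load; here it is simply returned as a dictionary.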
  • 13. Data Distillation in the Data Lake: approach to provisioning for data consumption
    Functionality: Ability to ingest data from the storage tier and convert it to structured data for easier analysis by downstream applications. This is done through a combination of extraction, transformation and aggregation of high-quality data from the Data Lake, making it available for analytical and reporting applications. Transformation also involves data quality checks and corrections, such as profiling, validating and cleansing structured and unstructured data based on business rules. Data is distilled (or prepared) on a per-function basis and made available for consumption. This is consistent with the design practice of ‘manage data centrally and provision locally’.
    Characteristics
     Uniform approach for distillation of information from the data lake
     A centralized Data Quality engine for application of uniform data quality rules across the enterprise
     An integrated Data Quality function to cleanse, standardize, enrich and de-duplicate data
     Console for design, development and validation of rules
     Data Quality Services for integration with operational systems and MDM
     An Exception Management solution for resolving data issues and errors
     Data quality processes running on the data will be translated into MapReduce for faster processing.
    [Diagram labels] Data Persistence Layer; Distillation Layer: Extract, Transform, Aggregation (Σ), Secure; Data Quality Store; Data Quality Console; Data Quality Engine: Data Profiling, Data Cleansing, Match & Merge, Data Enrichment; Rule Manager; DQ Meta-data; Data Dashboard; Exception Management; Data Quality Configurator; Exception Repository; DQ Mart
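The data quality steps listed above (cleanse, standardize, de-duplicate, exception management) might look like the following in miniature. The rules, the field names and the choice of e-mail address as the match key are hypothetical simplifications; a real Data Quality engine would load its rules from a rule manager and use far richer matching.

```python
def standardize(record):
    """Cleanse and standardize one raw customer record."""
    return {
        "customer_id": str(record.get("customer_id", "")).strip(),
        # Collapse repeated whitespace and normalize capitalization.
        "name": " ".join(str(record.get("name", "")).split()).title(),
        "email": str(record.get("email", "")).strip().lower(),
    }


def validate(record):
    """Return the list of business-rule violations for a standardized record."""
    errors = []
    if not record["customer_id"]:
        errors.append("missing customer_id")
    if "@" not in record["email"]:
        errors.append("invalid email")
    return errors


def run_quality_rules(records):
    """Split records into clean output and an exception list for review."""
    clean, exceptions, seen_emails = [], [], set()
    for raw in records:
        rec = standardize(raw)
        errors = validate(rec)
        if errors:
            # Failed records go to the exception repository, not the mart.
            exceptions.append({"record": raw, "errors": errors})
            continue
        if rec["email"] in seen_emails:
            # Simplified match & merge: keep the first record per e-mail.
            continue
        seen_emails.add(rec["email"])
        clean.append(rec)
    return clean, exceptions


# Hypothetical input with one duplicate and two rule violations.
incoming = [
    {"customer_id": " 1 ", "name": "  ada   lovelace ", "email": "ADA@Example.com "},
    {"customer_id": "2", "name": "Ada Lovelace", "email": "ada@example.com"},
    {"customer_id": "", "name": "No Id", "email": "noid@example.com"},
    {"customer_id": "4", "name": "bad mail", "email": "not-an-email"},
]
clean, exceptions = run_quality_rules(incoming)
```

Keeping rejected records together with their violation messages mirrors the exception repository on the slide: nothing is silently discarded, and errors can be resolved and replayed later.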
  • 14. Data Persistence Layer: Schema on Read & Distill on Demand
    Functionality: Create a single repository for information and deliver a single, silo-less store to handle all types of data for all reporting, analysis and discovery requirements.
    Characteristics
     Deliver a single, comprehensive view of all data, across functional areas, to conduct deep analysis
     Multi-tiered Data Lake that serves distinct functionalities, e.g., landing, staging and curated stores
     A landing area containing both traditional and non-traditional data, characterized by attributes of value, veracity, volume, velocity and variety
     Eliminate the need for upfront schema design and rigid pre-configured models
     Easy and cost-effective configuration for scale up and scale down
     Store everything, distill on demand
    [Diagram labels] Hadoop Distributed File System (HDFS): Namenode, Datanodes, Replication, Job / Task Tracker, Storage, Cluster/Rack; Data Ingestion; Data Lake: Landing, Staging, Curated; Audit; Metadata; Search
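The ‘schema on read’ idea from the slide above can be shown with a small sketch: raw records are stored untouched, and each consumer applies its own schema only when reading. The sample JSON lines, the `read_with_schema` helper and the two consumer schemas are hypothetical and stand in for what a SQL-on-Hadoop engine would do over files in the lake.

```python
import json

# Raw landing-zone content, stored exactly as received (no upfront schema).
RAW_LANDING = [
    '{"id": 1, "amount": "19.90", "channel": "web"}',
    '{"id": 2, "amount": "5.00"}',
    '{"id": 3, "amount": "7.25", "channel": "store", "coupon": "X1"}',
]


def read_with_schema(raw_lines, schema):
    """Project raw JSON lines onto a schema of field -> (cast, default)."""
    for line in raw_lines:
        doc = json.loads(line)
        yield {
            field: (cast(doc[field]) if field in doc else default)
            for field, (cast, default) in schema.items()
        }


# Two consumers read the same raw data with different schemas.
FINANCE_SCHEMA = {"id": (int, None), "amount": (float, 0.0)}
MARKETING_SCHEMA = {"id": (int, None), "channel": (str, "unknown")}

finance_view = list(read_with_schema(RAW_LANDING, FINANCE_SCHEMA))
marketing_view = list(read_with_schema(RAW_LANDING, MARKETING_SCHEMA))
```

Fields that a schema does not mention (such as `coupon`) stay in the raw store and can be picked up later by a new schema without reloading anything, which is the point of ‘store everything, distill on demand’.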
  • 15. Approach to Data Provisioning
    Functionality: Provision data sets to create various combinations of custom views, by specific functions/departments and also for cross-functional access.
    Characteristics
     The Data Marts & Aggregate Structures layer will include subject-specific data mart structures that various tools can use to retrieve data and information. This layer will also support user-specific sandboxes in which power users can perform activities such as data mining, identifying data patterns, and running analytical and statistical models using various tools.
     If required, there will be multiple versions of the subject areas for different production streams.
     Data marts and aggregate structures such as summary tables will be created based on business and performance requirements. As far as possible, database-managed aggregates such as computed views and indexes will be created to reduce ETL-based data movement.
     Data Virtualization will address combining data sets from multiple data stores across various layers in the data lake stack.
    [Diagram labels] Data Access Layer; Data Provisioning: Discovery Platform / Sandboxes, Analytical Views, Data Virtualization; Data Dissemination: HR Mart 1, HR Mart 2, HR Mart 3, HR Mart 4
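The preference for database-managed aggregates over ETL-based data movement can be illustrated with a view defined directly on a curated table. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for the lake's SQL engine; the table, view and column names are hypothetical.

```python
import sqlite3

# In-memory database standing in for the curated store (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE curated_employee (
        emp_id     INTEGER PRIMARY KEY,
        department TEXT,
        salary     INTEGER
    );
    INSERT INTO curated_employee VALUES
        (1, 'Sales', 50000),
        (2, 'Sales', 70000),
        (3, 'IT',    90000);

    -- The "mart" is a view: the database computes the aggregate on demand,
    -- so no separate ETL job has to copy summary rows around.
    CREATE VIEW hr_mart_headcount AS
    SELECT department,
           COUNT(*)    AS headcount,
           AVG(salary) AS avg_salary
    FROM curated_employee
    GROUP BY department;
""")

rows = conn.execute(
    "SELECT department, headcount, avg_salary "
    "FROM hr_mart_headcount ORDER BY department"
).fetchall()
```

Because the view reads the curated table at query time, new rows show up in the mart immediately; a materialized summary table would trade that freshness for faster reads.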
  • 16. © David Feinleib