Data Ingestion, Extraction, and Preparation for Hadoop

Sanjay Kaluskar, Sr. Architect, Informatica
David Teniente, Data Architect, Rackspace

1
Safe Harbor Statement
•   The information being provided today is for informational purposes only. The
    development, release and timing of any Informatica product or functionality described
    today remain at the sole discretion of Informatica and should not be relied upon in
    making a purchasing decision. Statements made today are based on
    currently available information, which is subject to change. Such statements should
    not be relied upon as a representation, warranty or commitment to deliver specific
    products or functionality in the future.
•   Some of the comments we will make today are forward-looking statements including
    statements concerning our product portfolio, our growth and operational
    strategies, our opportunities, customer adoption of and demand for our products and
    services, the use and expected benefits of our products and services by
    customers, the expected benefit from our partnerships and our expectations
    regarding future industry trends and macroeconomic development.
•   All forward-looking statements are based upon current expectations and beliefs.
    However, actual results could differ materially. There are many reasons why actual
    results may differ from our current expectations. These forward-looking statements
    should not be relied upon as representing our views as of any subsequent date and
    Informatica undertakes no obligation to update forward-looking statements to reflect
    events or circumstances after the date that they are made.
•   Please refer to our recent SEC filings including the Form 10-Q for the quarter ended
    September 30th, 2011 for a detailed discussion of the risk factors that may affect our
    results. Copies of these documents may be obtained from the SEC or by contacting
    our Investor Relations department.




                                                                                             2
The Hadoop Data Processing Pipeline
Informatica PowerCenter + PowerExchange
Available Today | 1H / 2012

1. Ingest Data into Hadoop (PowerCenter + PowerExchange)
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop (PowerCenter + PowerExchange)

Sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data mart, Customer Service Portal

3
Options

                    Ingest/Extract Data        Parse & Prepare Data   Transform & Cleanse Data
Structured          Informatica PowerCenter +  N/A                    Hive, PIG, MR;
(e.g. OLTP, OLAP)   PowerExchange, Sqoop                              Future: Informatica Roadmap

Unstructured,       Informatica PowerCenter +  Informatica HParser,   Hive, PIG, MR;
semi-structured     PowerExchange, copy        PIG/Hive UDFs, MR      Future: Informatica Roadmap
(e.g. web logs,     files, Flume, Scribe,
JSON)               Kafka

4
Unleash the Power of Hadoop
With High Performance Universal Data Access

Messaging and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI,
Web Services, TIBCO, webMethods
Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business,
PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase,
Informix, Teradata, Netezza, ODBC, JDBC
SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt,
SAP By Design, Oracle OnDemand
Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM,
Binary Flat Files, Tape Formats…
Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX,
Cargo IMP, MVR
Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect,
Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP
XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
MPP Appliances: EMC/Greenplum, Vertica, AsterData
Social Media: Facebook, Twitter, LinkedIn

5
Ingest Data

Access Data → Pre-Process → Ingest Data

Sources: Web server; Databases, Data Warehouse; Message Queues, Email,
Social Media; ERP, CRM; Mainframe
Access with PowerExchange: Batch, CDC, Real-time
Pre-process with PowerCenter (e.g. filter, join, cleanse); reuse
PowerCenter mappings
Ingest into: HDFS, HIVE

6
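The pre-process step above (filter, join, cleanse before the data lands in HDFS or Hive) can be sketched in plain Python. This is an illustrative stand-in for what a PowerCenter mapping would express, not the product's behavior; the field names and cleansing rules are hypothetical.

```python
# Illustrative pre-process step: filter, join, and cleanse records before
# ingest. Field names and cleansing rules are hypothetical.

def cleanse(record):
    """Strip surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def pre_process(transactions, customers):
    """Filter out non-positive transactions, join on customer_id, cleanse."""
    by_id = {c["customer_id"]: c for c in customers}
    out = []
    for t in transactions:
        if t.get("amount", 0) <= 0:        # filter
            continue
        c = by_id.get(t["customer_id"])    # join
        if c is None:
            continue
        out.append(cleanse({**t, **c}))    # cleanse
    return out

txns = [{"customer_id": 1, "amount": 25.0}, {"customer_id": 2, "amount": 0}]
custs = [{"customer_id": 1, "name": " Ada "}]
print(pre_process(txns, custs))
# -> [{'customer_id': 1, 'amount': 25.0, 'name': 'Ada'}]
```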
Extract Data

Extract Data → Post-Process → Deliver Data

Extract from: HDFS
Post-process with PowerCenter (e.g. transform to target schema); reuse
PowerCenter mappings
Deliver with PowerExchange (Batch) to: Web server; Databases,
Data Warehouse; ERP, CRM; Mainframe

7
1. Create Ingest or Extract Mapping
2. Create Hadoop Connection
3. Configure Workflow
4. Create & Load Into Hive Table

8
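Step 4 above ultimately issues HiveQL against the cluster. A minimal sketch of the kind of statements involved, with a hypothetical table, columns, and landing path:

```python
# Sketch of the HiveQL behind "Create & Load Into Hive Table".
# Table name, columns, and HDFS path are hypothetical.

def hive_create_and_load(table, columns, hdfs_path):
    """Return CREATE TABLE and LOAD DATA statements for a landed file."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    create = (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
              "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    load = f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"
    return [create, load]

for stmt in hive_create_and_load(
        "web_logs", [("ts", "STRING"), ("url", "STRING")], "/landing/web_logs"):
    print(stmt)
```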
The Hadoop Data Processing Pipeline
Informatica HParser
Available Today | 1H / 2012

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop (HParser)
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data mart, Customer Service Portal

9
Informatica HParser
Productivity: Data Transformation Studio




                                           11
Informatica HParser
Productivity: Data Transformation Studio

Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML,
BAI – V2.0 Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA,
FIXML, MISMO
Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS,
RosettaNet, OAGI
Other: IATA-PADIS, PLMXML, NEIM

• Out of the box transformations for all messages in all versions
• Easy example based visual enhancements and edits
• Updates and new versions delivered from Informatica
• Definition is done using business (industry) terminology and definitions
• Enhanced validations

12
Informatica HParser
How does it work?

hadoop … dt-hadoop.jar
… My_Parser /input/*/input*.txt

1. Develop an HParser transformation
2. Deploy the transformation to the service repository
3. Run HParser on the Hadoop cluster to produce tabular data in HDFS
4. Analyze the data with HIVE / PIG / MapReduce / Other

13
The Hadoop Data Processing Pipeline
Informatica Roadmap
Available Today | 1H / 2012

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop (Roadmap)
4. Extract Data from Hadoop

Sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data mart, Customer Service Portal

14
Informatica Hadoop Roadmap – 1H 2012

• Process data on Hadoop
   • IDE, administration, monitoring, workflow
   • Data processing flow designed through IDE: Source/Target,
     Filter, Join, Lookup, etc.
   • Execution on Hadoop cluster (pushdown via Hive)

• Flexibility to plug in custom code
   • Hive and PIG UDFs
   • MR scripts

• Productivity with optimal performance
   • Exploit Hive performance characteristics
   • Optimize end-to-end data flow for performance
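The "MR scripts" hook mentioned above is often a Hadoop Streaming-style job. Here is a minimal mapper/reducer pair that can be tested locally; the input layout (tab-separated lines, key in the first field) is an assumption for illustration.

```python
# Hadoop Streaming-style mapper/reducer, runnable locally for testing.
# The input layout (tab-separated lines, key in the first field) is assumed.
import itertools

def mapper(lines):
    """Emit (key, 1) for the first tab-separated field of each line."""
    for line in lines:
        yield line.rstrip("\n").split("\t")[0], 1

def reducer(pairs):
    """Sum counts per key; pairs must be sorted by key, as Streaming sorts them."""
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

pairs = sorted(mapper(["a\tx", "b\ty", "a\tz"]))
print(dict(reducer(pairs)))  # -> {'a': 2, 'b': 1}
```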


                                                                 16
Mapping for Hive execution

• Logical representation of processing steps
• Validate & configure the mapping for Hive translation
• Preview the generated Hive code:

  INSERT INTO STG0 SELECT * FROM StockAnalysis0;
  INSERT INTO STG1 SELECT * FROM StockAnalysis1;
  INSERT INTO STG2 SELECT * FROM StockAnalysis2;

17
Takeaways

• Universal connectivity
   • Completeness and enrichment of raw data for holistic analysis
   • Prevent Hadoop from becoming another silo accessible to a few
     experts

• Maximum productivity
   • Collaborative development environment
      • Right level of abstraction for data processing logic
      • Re-use of algorithms and data flow logic
   • Metadata-driven processing
      • Document data lineage for auditing and impact analysis
      • Deploy on any platform for optimal performance and utilization




                                                                         18
Customer Sentiment - Reaching beyond
NPS (Net Promoter Score) and surveys

Gaining insight into our customers’ sentiment
will improve Rackspace’s ability to provide
Fanatical Support™
Objectives:
• What are “they” saying
• Gauge the level of sentiment
• Fanatical Support™ for the win
   • Increase NPS
   • Increase MRR
   • Decrease churn
   • Provide the right products
   • Keep our promises

19
Customer Sentiment Use Cases
Pulling it all together

Case 1: Match social media posts with Customer. Determine a probable match.
Case 2: Determine the sentiment of a post, searching key words and scoring
the post.
Case 3: Determine correlations between posts, ticket volume and NPS leading
to negative or positive sentiments.
Case 4: Determine correlations in sentiments with products/configurations
which lead to negative or positive sentiments.
Case 5: The ability to trend all inputs over time…

20
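Case 2 above describes keyword search and scoring. A deliberately simple sketch of that idea; the keyword lists and weights are hypothetical, and a production system would handle punctuation, negation, and phrases.

```python
# Keyword-based sentiment scoring for a post (Case 2), deliberately simplified.
# The keyword lists and weights are hypothetical.
POSITIVE = {"fanatical": 2, "great": 1, "fast": 1}
NEGATIVE = {"down": -2, "slow": -1, "outage": -2}

def score_post(text):
    """Sum keyword weights over the lowercased, whitespace-split post."""
    return sum(POSITIVE.get(w, 0) + NEGATIVE.get(w, 0) for w in text.lower().split())

print(score_post("Fanatical support and fast fixes"))  # -> 3
print(score_post("site is down again"))                # -> -2
```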
Rackspace Fanatical Support™
Big Data Environment

Data Sources (DBs, flat files, data streams): Oracle, MySQL, MS SQL,
Postgres, DB2, Excel, CSV, Flat File, XML, EDI, Binary, Sys Logs,
Messaging, APIs

Ingestion via message bus / port listening into Hadoop HDFS

• Indirect analytics over Hadoop: Greenplum DB feeding BI Analytics and
  the BI Stack
• Direct analytics over Hadoop: search, analytics, algorithmic processing

21
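The message bus / port listening stage above implies routing each incoming record type to its own HDFS landing area. A small sketch of that routing; the source types and paths are made up for illustration.

```python
# Hypothetical routing of records arriving on the message bus / port
# listener to HDFS landing directories, partitioned by day. Paths are made up.
ROUTES = {
    "syslog": "/landing/syslogs",
    "twitter": "/landing/social/twitter",
    "csv": "/landing/files/csv",
}

def landing_path(source_type, day):
    """Return the HDFS directory a record of this source type lands in."""
    return f"{ROUTES.get(source_type, '/landing/other')}/dt={day}"

print(landing_path("twitter", "2012-01-15"))  # -> /landing/social/twitter/dt=2012-01-15
print(landing_path("edi", "2012-01-15"))      # -> /landing/other/dt=2012-01-15
```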
Twitter Feed for Rackspace
Using Informatica

Input Data → Output Data

22
23

More Related Content

What's hot (20)

PPTX
Importance & Principles of Modeling from UML Designing
ABHISHEK KUMAR
 
PPTX
AWS Interview Questions Part - 2 | AWS Interview Questions And Answers Part -...
Simplilearn
 
PPTX
computer network OSI layer
Sangeetha Rangarajan
 
PDF
Overview of computing paradigm
Ripal Ranpara
 
PDF
Cloud Computing Architecture
Animesh Chaturvedi
 
PPTX
Integrating Public & Private Clouds
Proact Belgium
 
PDF
Strategies for Training End Users How To Use Salesforce
Shell Black
 
PDF
Introduction to Software Defined Networking (SDN)
Bangladesh Network Operators Group
 
PPTX
Ch 2 types of reqirement
Fish Abe
 
PPTX
Applications of Mealy & Moore Machine
SardarKashifKhan
 
PPTX
Apex code (Salesforce)
Mohammed Safwat Abu Kwaik
 
PDF
Cloud Mashup
Vasco Elvas
 
PPTX
Application layer protocols
FabMinds
 
PPT
Cloud deployment models
Ashok Kumar
 
PPT
logical addressing
Sagar Gor
 
PPT
Cloud computing architectures
Muhammad Aitzaz Ahsan
 
PPTX
Migrating into a cloud
ANUSUYA T K
 
PPSX
Student feedback system
msandbhor
 
PPTX
Introduction to Parallel and Distributed Computing
Sayed Chhattan Shah
 
PDF
Type conversion in Compiler Construction
Muhammad Haroon
 
Importance & Principles of Modeling from UML Designing
ABHISHEK KUMAR
 
AWS Interview Questions Part - 2 | AWS Interview Questions And Answers Part -...
Simplilearn
 
computer network OSI layer
Sangeetha Rangarajan
 
Overview of computing paradigm
Ripal Ranpara
 
Cloud Computing Architecture
Animesh Chaturvedi
 
Integrating Public & Private Clouds
Proact Belgium
 
Strategies for Training End Users How To Use Salesforce
Shell Black
 
Introduction to Software Defined Networking (SDN)
Bangladesh Network Operators Group
 
Ch 2 types of reqirement
Fish Abe
 
Applications of Mealy & Moore Machine
SardarKashifKhan
 
Apex code (Salesforce)
Mohammed Safwat Abu Kwaik
 
Cloud Mashup
Vasco Elvas
 
Application layer protocols
FabMinds
 
Cloud deployment models
Ashok Kumar
 
logical addressing
Sagar Gor
 
Cloud computing architectures
Muhammad Aitzaz Ahsan
 
Migrating into a cloud
ANUSUYA T K
 
Student feedback system
msandbhor
 
Introduction to Parallel and Distributed Computing
Sayed Chhattan Shah
 
Type conversion in Compiler Construction
Muhammad Haroon
 

Viewers also liked (20)

PPTX
Hadoop data ingestion
Vinod Nayal
 
PDF
Open source data ingestion
Treasure Data, Inc.
 
PPTX
High Speed Continuous & Reliable Data Ingest into Hadoop
DataWorks Summit
 
PPTX
Big Data Ingestion @ Flipkart Data Platform
Navneet Gupta
 
PPTX
Gobblin: Unifying Data Ingestion for Hadoop
Yinan Li
 
PDF
XML Parsing with Map Reduce
Edureka!
 
PDF
Designing a Real Time Data Ingestion Pipeline
DataScience
 
PDF
Efficient processing of large and complex XML documents in Hadoop
DataWorks Summit
 
PPTX
Open Source Big Data Ingestion - Without the Heartburn!
Pat Patterson
 
PDF
A poster version of HadoopXML
Kyong-Ha Lee
 
PDF
Map reduce: beyond word count
Jeff Patti
 
PPTX
Big Data Analytics with Hadoop
Philippe Julio
 
PDF
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
Excelerate Systems
 
PPSX
La plateforme OpenData 3.0 pour libérer et valoriser les données
Excelerate Systems
 
PDF
Turning Text Into Insights: An Introduction to Topic Models
DataScience
 
PDF
Scalable Hadoop in the cloud
Treasure Data, Inc.
 
PDF
The Data-Drive Paradigm
Lucidworks
 
PDF
Search in 2020: Presented by Will Hayes, Lucidworks
Lucidworks
 
PPTX
Gobblin' Big Data With Ease @ QConSF 2014
Lin Qiao
 
PPTX
Meson: Building a Machine Learning Orchestration Framework on Mesos
Antony Arokiasamy
 
Hadoop data ingestion
Vinod Nayal
 
Open source data ingestion
Treasure Data, Inc.
 
High Speed Continuous & Reliable Data Ingest into Hadoop
DataWorks Summit
 
Big Data Ingestion @ Flipkart Data Platform
Navneet Gupta
 
Gobblin: Unifying Data Ingestion for Hadoop
Yinan Li
 
XML Parsing with Map Reduce
Edureka!
 
Designing a Real Time Data Ingestion Pipeline
DataScience
 
Efficient processing of large and complex XML documents in Hadoop
DataWorks Summit
 
Open Source Big Data Ingestion - Without the Heartburn!
Pat Patterson
 
A poster version of HadoopXML
Kyong-Ha Lee
 
Map reduce: beyond word count
Jeff Patti
 
Big Data Analytics with Hadoop
Philippe Julio
 
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
Excelerate Systems
 
La plateforme OpenData 3.0 pour libérer et valoriser les données
Excelerate Systems
 
Turning Text Into Insights: An Introduction to Topic Models
DataScience
 
Scalable Hadoop in the cloud
Treasure Data, Inc.
 
The Data-Drive Paradigm
Lucidworks
 
Search in 2020: Presented by Will Hayes, Lucidworks
Lucidworks
 
Gobblin' Big Data With Ease @ QConSF 2014
Lin Qiao
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Antony Arokiasamy
 
Ad

Similar to Data Ingestion, Extraction & Parsing on Hadoop (20)

PPTX
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Cloudera, Inc.
 
PPT
Hadoop India Summit, Feb 2011 - Informatica
Sanjeev Kumar
 
PPTX
OOP 2014
Emil Andreas Siemes
 
PDF
Track B-1 建構新世代的智慧數據平台
Etu Solution
 
DOCX
Informatica
mukharji
 
PDF
SnapLogic corporate presentation
pbridges
 
PDF
Magic quadrant for data warehouse database management systems
divjeev
 
PDF
Talk IT_ Oracle_김태완_110831
Cana Ko
 
PDF
Teradata - Presentation at Hortonworks Booth - Strata 2014
Hortonworks
 
PPTX
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Yahoo Developer Network
 
PPTX
Big Data and HPC
NetApp
 
PDF
Hadoop - Now, Next and Beyond
Teradata Aster
 
PDF
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Cloudera, Inc.
 
PPTX
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
DataWorks Summit/Hadoop Summit
 
PPTX
2012 06 hortonworks paris hug
Modern Data Stack France
 
PDF
Hadoop for shanghai dev meetup
Roby Chen
 
PDF
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
Amr Awadallah
 
PDF
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Cloudera, Inc.
 
PDF
Webinar future dataintegration-datamesh-and-goldengatekafka
Jeffrey T. Pollock
 
PPTX
Hadoop as Data Refinery - Steve Loughran
JAX London
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Cloudera, Inc.
 
Hadoop India Summit, Feb 2011 - Informatica
Sanjeev Kumar
 
Track B-1 建構新世代的智慧數據平台
Etu Solution
 
Informatica
mukharji
 
SnapLogic corporate presentation
pbridges
 
Magic quadrant for data warehouse database management systems
divjeev
 
Talk IT_ Oracle_김태완_110831
Cana Ko
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Hortonworks
 
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Yahoo Developer Network
 
Big Data and HPC
NetApp
 
Hadoop - Now, Next and Beyond
Teradata Aster
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Cloudera, Inc.
 
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
DataWorks Summit/Hadoop Summit
 
2012 06 hortonworks paris hug
Modern Data Stack France
 
Hadoop for shanghai dev meetup
Roby Chen
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
Amr Awadallah
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Cloudera, Inc.
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Jeffrey T. Pollock
 
Hadoop as Data Refinery - Steve Loughran
JAX London
 
Ad

Recently uploaded (20)

PDF
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
PDF
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
PDF
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
PDF
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
PDF
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Apache CloudStack 201: Let's Design & Build an IaaS Cloud
ShapeBlue
 
PPTX
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
PPTX
MSP360 Backup Scheduling and Retention Best Practices.pptx
MSP360
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
PDF
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
PDF
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
PDF
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
PPTX
Top Managed Service Providers in Los Angeles
Captain IT
 
PDF
Empowering Cloud Providers with Apache CloudStack and Stackbill
ShapeBlue
 
PPTX
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
PDF
July Patch Tuesday
Ivanti
 
PDF
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
PPT
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
PDF
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Apache CloudStack 201: Let's Design & Build an IaaS Cloud
ShapeBlue
 
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
MSP360 Backup Scheduling and Retention Best Practices.pptx
MSP360
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
Top Managed Service Providers in Los Angeles
Captain IT
 
Empowering Cloud Providers with Apache CloudStack and Stackbill
ShapeBlue
 
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
July Patch Tuesday
Ivanti
 
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 

Data Ingestion, Extraction & Parsing on Hadoop

  • 1. Data Ingestion, Extraction, and Preparation for Hadoop Sanjay Kaluskar, Sr. Architect, Informatica David Teniente, Data Architect, Rackspace 1
  • 2. Safe Harbor Statement • The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future. • Some of the comments we will make today are forward-looking statements including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships and our expectations regarding future industry trends and macroeconomic development. • All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made. • Please refer to our recent SEC filings including the Form 10-Q for the quarter ended September 30th, 2011 for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department. 2
  • 3. The Hadoop Data Processing Pipeline: Informatica PowerCenter + PowerExchange (legend: available today vs. 1H 2012). Pipeline steps: 1. Ingest data into Hadoop (PowerCenter + PowerExchange); 2. Parse & prepare data on Hadoop; 3. Transform & cleanse data on Hadoop; 4. Extract data from Hadoop. Example sources: product & service offerings, customer service logs & surveys, marketing campaigns, customer profile, account transactions, social media. Example targets: sales & marketing data mart, customer service portal.
  • 4. Options: which tools fit each stage of the pipeline.

    | Data                                                | Ingest/Extract Data                                                       | Parse & Prepare Data                   | Transform & Cleanse Data                   |
    |-----------------------------------------------------|---------------------------------------------------------------------------|----------------------------------------|--------------------------------------------|
    | Structured (e.g. OLTP, OLAP)                        | Informatica PowerCenter + PowerExchange, Sqoop                            | N/A                                    | Hive, PIG, MR; future: Informatica roadmap |
    | Unstructured, semi-structured (e.g. web logs, JSON) | Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka | Informatica HParser, PIG/Hive UDFs, MR | Hive, PIG, MR; future: Informatica roadmap |
  • 5. Unleash the Power of Hadoop with High-Performance Universal Data Access:
    Messaging and Web Services: WebSphere MQ, JMS, MSMQ, TIBCO, webMethods, Web Services;
    Packaged Applications: JD Edwards, SAP NetWeaver, SAP NetWeaver BI, SAP NetWeaver XI, Lotus Notes, Oracle E-Business, SAS, PeopleSoft, Siebel;
    Relational and Flat Files: Oracle, Informix, DB2 UDB, Teradata, DB2/400, Netezza, SQL Server, Sybase, ODBC, JDBC;
    SaaS/BPO: Salesforce CRM, ADP, Hewitt, Force.com, RightNow, SAP By Design, Oracle OnDemand, NetSuite;
    Mainframe and Midrange: ADABAS, VSAM, Datacom, C-ISAM, DB2, binary flat files, IDMS, IMS, tape formats…;
    Industry Standards: EDI-X12, AST, EDI-Fact, FIX, RosettaNet, Cargo IMP, HL7, MVR, HIPAA;
    Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, flat files, ASCII reports, HTML, RPG, ANSI, LDAP;
    XML Standards: XML, ebXML, LegalXML, HL7 v3.0, IFX, ACORD (AL3, XML), cXML;
    MPP Appliances: EMC/Greenplum, AsterData, Vertica;
    Social Media: Facebook, LinkedIn, Twitter
  • 6. Ingest Data: access data, pre-process, then ingest. PowerExchange accesses the sources (web server, databases, data warehouse, message queues, email, social media, ERP, CRM, mainframe) in batch, CDC, or real-time mode; PowerCenter pre-processes the data (e.g. filter, join, cleanse), reusing existing PowerCenter mappings; the data lands in HDFS or Hive.
  • 7. Extract Data: extract, post-process, then deliver. PowerCenter extracts data from HDFS and post-processes it (e.g. transforms it to the target schema), reusing PowerCenter mappings; PowerExchange delivers it in batch to the targets (web server, databases, data warehouse, ERP, CRM, mainframe).
  • 8. Steps: 1. Create an ingest or extract mapping; 2. Create a Hadoop connection; 3. Configure the workflow; 4. Create and load into a Hive table.
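The last step above ("Create & Load Into Hive Table") boils down to a pair of HiveQL statements. Here is a minimal Python sketch of generating them; the table name, columns, and HDFS path are illustrative assumptions, not values from the deck:

```python
# Hypothetical sketch of the HiveQL behind step 4, "Create & Load Into
# Hive Table". Table name, columns, and HDFS path are illustrative.

def create_and_load_hiveql(table, columns, hdfs_path):
    """Build CREATE TABLE + LOAD DATA statements for a file landed in HDFS."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    ddl = (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
           "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    dml = f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"
    return [ddl, dml]

for stmt in create_and_load_hiveql(
    "web_logs",
    [("ts", "STRING"), ("url", "STRING"), ("status", "INT")],
    "/landing/web_logs/2011-11-08",
):
    print(stmt)
```

In a real workflow the generated statements would be submitted through the Hive CLI or a JDBC connection rather than printed.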
  • 9. The Hadoop Data Processing Pipeline: Informatica HParser (legend: available today vs. 1H 2012). Pipeline steps: 1. Ingest data into Hadoop; 2. Parse & prepare data on Hadoop (HParser); 3. Transform & cleanse data on Hadoop; 4. Extract data from Hadoop. Example sources: product & service offerings, customer service logs & surveys, marketing campaigns, customer profile, account transactions, social media. Example targets: sales & marketing data mart, customer service portal.
  • 10. Options: the ingest/parse/transform options matrix from slide 4, repeated.
  • 11. Informatica HParser Productivity: Data Transformation Studio 11
  • 12. Informatica HParser Productivity: Data Transformation Studio. B2B standards supported out of the box, by domain. Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0 Lockbox, CREST, TWIST, UNIFI (ISO 20022), SEPA, FIXML, IFX, MISMO. Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML. B2B/EDI: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI, DEX, IATA-PADIS, PLMXML. Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC. Other: NEIM. Benefits: out-of-the-box transformations for all messages in all versions; easy example-based visual enhancements and edits; updates and new versions delivered from Informatica; definitions use business (industry) terminology; enhanced validations.
  • 13. Informatica HParser: how does it work? 1. Develop an HParser transformation in the studio; 2. Deploy the transformation to the service repository on the Hadoop cluster; 3. Run HParser on Hadoop (e.g. hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt) to produce tabular data in HDFS; 4. Analyze the data with Hive, PIG, MapReduce, or other tools.
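For intuition about what the parse step produces, here is a tiny Python stand-in: it turns semi-structured web-log lines into tab-separated tabular rows of the kind Hive or PIG can then query. The log format and field names are assumptions for illustration; HParser itself is driven by transformations built in the studio, not by code like this.

```python
# Illustrative only: a miniature version of the parse-and-prepare step,
# turning semi-structured web-log lines into tabular rows. The log layout
# (common log format with a trailing byte count) is an assumption.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Return one tab-separated tabular row, or None for unparsable lines."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    f = m.groupdict()
    f["bytes"] = "0" if f["bytes"] == "-" else f["bytes"]
    return "\t".join([f["ip"], f["ts"], f["method"], f["url"],
                      f["status"], f["bytes"]])

row = parse_log_line(
    '10.0.0.1 - - [08/Nov/2011:10:00:00 -0600] "GET /index.html HTTP/1.1" 200 5120'
)
print(row)
```

At scale this per-line logic is exactly what gets distributed across the cluster: parallelism is across input files, each mapper parsing its own files.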
  • 14. The Hadoop Data Processing Pipeline: Informatica roadmap (legend: available today vs. 1H 2012). Pipeline steps: 1. Ingest data into Hadoop; 2. Parse & prepare data on Hadoop; 3. Transform & cleanse data on Hadoop; 4. Extract data from Hadoop. Example sources: product & service offerings, customer service logs & surveys, marketing campaigns, customer profile, account transactions, social media. Example targets: sales & marketing data mart, customer service portal.
  • 15. Options: the ingest/parse/transform options matrix from slide 4, repeated.
  • 16. Informatica Hadoop Roadmap, 1H 2012 • Process data on Hadoop • IDE, administration, monitoring, workflow • Data processing flow designed through the IDE: source/target, filter, join, lookup, etc. • Execution on the Hadoop cluster (pushdown via Hive) • Flexibility to plug in custom code • Hive and PIG UDFs • MR scripts • Productivity with optimal performance • Exploit Hive performance characteristics • Optimize the end-to-end data flow for performance
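As a sketch of the "plug in custom code" bullet: Hive can stream rows through an external script via its TRANSFORM clause, which is one common way to inject custom cleansing logic. The script below and its column conventions are assumptions for illustration, not part of the roadmap being described.

```python
# Sketch of custom cleansing code of the kind Hive's TRANSFORM clause can
# stream rows through, e.g.:
#   ADD FILE cleanse.py;
#   SELECT TRANSFORM(c1, c2) USING 'python cleanse.py' AS (c1, c2) FROM t;
# Column conventions here are assumptions for illustration.
import io

def cleanse(value):
    """Trim whitespace; normalize empty markers to Hive's NULL marker \\N."""
    value = value.strip()
    return "\\N" if value in ("", "-", "null") else value

def transform_stream(stdin, stdout):
    """Read tab-separated rows, emit the same rows with cleansed columns."""
    for line in stdin:
        cols = line.rstrip("\n").split("\t")
        stdout.write("\t".join(cleanse(c) for c in cols) + "\n")

# Exercise the logic in-process instead of over real stdin/stdout:
out = io.StringIO()
transform_stream(io.StringIO("  a \t-\tnull\tok\n"), out)
print(repr(out.getvalue()))
```

In production the same script would read sys.stdin and write sys.stdout so that Hive can spawn it per task.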
  • 17. Mapping for Hive execution: a logical representation of the processing steps; validate and configure the source for Hive translation; preview the generated Hive code, e.g.: INSERT INTO STG0 SELECT * FROM StockAnalysis0; INSERT INTO STG1 SELECT * FROM StockAnalysis1; INSERT INTO STG2 SELECT * FROM StockAnalysis2;
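The translation shown on this slide (logical mapping steps becoming staged Hive INSERT statements) can be illustrated with a toy generator. The mapping representation below is invented for illustration; the product generates Hive code from its own metadata model, not from this structure.

```python
# Toy illustration of pushdown via Hive: a logical mapping, represented
# here simply as ordered (staging_table, source) pairs, is translated into
# the staged INSERT statements previewed on the slide.

def mapping_to_hive(stages):
    """Emit one staged Hive INSERT per logical processing step, in order."""
    return [f"INSERT INTO {stg} SELECT * FROM {src};" for stg, src in stages]

for stmt in mapping_to_hive([
    ("STG0", "StockAnalysis0"),
    ("STG1", "StockAnalysis1"),
    ("STG2", "StockAnalysis2"),
]):
    print(stmt)
```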
  • 18. Takeaways • Universal connectivity: completeness and enrichment of raw data for holistic analysis; prevent Hadoop from becoming another silo accessible to only a few experts • Maximum productivity: a collaborative development environment; the right level of abstraction for data processing logic; re-use of algorithms and data flow logic • Metadata-driven processing: documented data lineage for auditing and impact analysis; deploy on any platform for optimal performance and utilization
  • 19. Customer Sentiment - Reaching beyond NPS (Net Promoter Score) and surveys. Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™. Objectives: learn what "they" are saying; gauge the level of sentiment; Fanatical Support™ for the win; increase NPS; increase MRR; decrease churn; provide the right products; keep our promises.
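For reference, the NPS metric this slide aims to increase is computed from 0-10 survey responses. A minimal sketch (the sample scores are made up):

```python
# Background sketch: Net Promoter Score from 0-10 survey responses.
# Promoters score 9-10, detractors 0-6, passives 7-8 (counted only in
# the denominator). Sample scores are invented.

def net_promoter_score(scores):
    """Percent promoters minus percent detractors, in [-100, 100]."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

print(net_promoter_score([10, 10, 8, 5]))  # -> 25.0
```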
  • 20. Customer Sentiment Use Cases - pulling it all together. Case 1: match social media posts with customers and determine a probable match. Case 2: determine the sentiment of a post by searching key words and scoring the post. Case 3: determine correlations between posts, ticket volume, and NPS that lead to negative or positive sentiment. Case 4: determine correlations between sentiment and products/configurations that lead to negative or positive sentiment. Case 5: the ability to trend all inputs over time.
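Case 2 (keyword search and scoring) can be sketched naively in Python. The keyword lists and weights below are invented for illustration; a production scorer would use a curated lexicon or a trained model.

```python
# Naive keyword scorer for the Case 2 idea: search key words in a post
# and score it. Weights are invented for illustration only.

POSITIVE = {"fanatical": 2, "great": 1, "fast": 1, "love": 2}
NEGATIVE = {"down": -2, "slow": -1, "churn": -2, "broken": -2}

def score_post(text):
    """Sum keyword weights over the post's words; the sign gives the sentiment."""
    words = text.lower().split()
    return sum(POSITIVE.get(w, 0) + NEGATIVE.get(w, 0) for w in words)

def label(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

s = score_post("Love the fanatical support, never slow")
print(s, label(s))  # -> 3 positive
```

A real pipeline would run this scoring step over posts already matched to customers (Case 1), so the scores can be correlated with tickets and NPS (Cases 3-5).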
  • 21. Rackspace Fanatical Support™ Big Data Environment. Data sources (databases, flat files, data streams): Oracle, MySQL, MS SQL, Postgres, DB2, Excel, CSV, flat files, XML, EDI, binary, sys logs, and a message bus / port listening. Indirect analytics over Hadoop: Greenplum DB feeding a BI analytics stack. Direct analytics over Hadoop: Hadoop HDFS with search, messaging APIs, and algorithmic analytics.
  • 22. Twitter Feed for Rackspace Using Informatica: input data and output data.
  • 23. 23

Editor's Notes

  • #4: *EXAMPLE* Some talking points to cover over the next few slides on PowerExchange for Hadoop: access all data sources; the ability to pre-process (e.g. filter) before landing to HDFS and post-process to fit the target schema; load performance via partitioning, native APIs, grid, pushdown to source or target, and process offloading; productivity via a visual designer; different latencies (batch, near real-time).

    One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. All too often, developers resort to reinventing the wheel by building custom adapters and scripts that require expert knowledge of the source systems, applications, data structures, and formats. Once they overcome this hurdle, they need to make sure their custom code will perform and scale as data volumes grow. Along with the need for speed, security and reliability are often overlooked, which increases the risk of non-compliance and system downtime. Needless to say, building a robust custom adapter takes time and can be costly to maintain as software versions change. Sometimes the end result is adapters that lack direct connectivity between the source systems and Hadoop, which means the data must be temporarily staged before it can move into Hadoop, increasing storage costs.

    Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real-time, or near real-time) and deliver all your data directly into Hadoop (see Figure 2). Similarly, Informatica PowerExchange can deliver data from Hadoop to your enterprise applications and information management systems. You can schedule batch loads to move data from multiple source systems directly into Hadoop without any staging. Alternatively, you can move only changed data from relational and mainframe systems directly into Hadoop. For real-time data feeds, you can move data off of message queues and deliver it into Hadoop.

    Informatica PowerExchange accesses data through native APIs to ensure optimal performance and is designed to minimize the impact on source systems through caching and process offloading. To further increase the performance of data flows between the source systems and Hadoop, PowerCenter supports data partitioning to distribute the processing across CPUs. Informatica PowerExchange for Hadoop is integrated with PowerCenter so that you can pre-process data from multiple data sources before it lands in Hadoop. This lets you leverage the source system metadata, since this information is not retained in the Hadoop Distributed File System (HDFS). For example, you can perform lookups, filters, or relational joins based on primary and foreign key relationships before data is delivered to HDFS. You can also push the pre-processing down to the source system to limit data movement and unnecessary data duplication in Hadoop. Common design patterns for data flows into or out of Hadoop can be generated in PowerCenter using parameterized templates built in Microsoft Visio to dramatically increase productivity. To securely and reliably manage the file transfer and collection of very large data files from both inside and outside the firewall, you can use Informatica Managed File Transfer (MFT).
  • #5: Sanjay's notes: Flume and Scribe are options for streaming ingestion of log files; Kafka is for near real-time feeds.
  • #7: See the PowerExchange for Hadoop white paper. Does not require expert knowledge of source systems; delivers data directly to Hadoop without any intermediate staging; accesses data through native APIs for optimal performance; brings in both un-modeled/unstructured data and structured relational data to make the analysis complete. Use an example to illustrate combining the unstructured and structured data needed for analysis.
  • #8: Have lineage of where data came from
  • #10: Informatica announced on Nov 2 the industry's first data parser for Hadoop. The solution is designed to provide a powerful data parsing alternative for organizations seeking to achieve the full potential of Big Data in Hadoop with efficiency and scale. It addresses the industry's growing demand for turning unstructured, complex data into a structured or semi-structured format in Hadoop to drive insights and improve operations. Tapping our industry-leading experience in parsing unstructured data and handling industry formats and documents within and across the enterprise, Informatica pioneered the development of a data parser that exploits the parallelism of the MapReduce framework. Using an engine-based, interactive tool to simplify the data parsing process, Informatica HParser processes complex files and messages in Hadoop with three offerings: Informatica HParser for logs, Omniture, XML and JSON (community edition), free of charge; Informatica HParser for industry standards (commercial edition); and Informatica HParser for documents (commercial edition). With HParser, organizations can: accelerate deployment using out-of-the-box, ready-to-use transformations and industry standards; increase productivity for tackling diverse complex formats, including proprietary log files; speed the development of parsing by exploiting the parallelism inside MapReduce; and optimize parsing performance for large files including logs, XML, JSON, and industry standards. Informatica also provides a free 30-day trial of the commercial edition of HParser for Documents to users interested in learning about the design environment for data transformation.
  • #14: Define the extraction/transformation logic using the designer. Run the parser as a standalone MR job; the command-line arguments are the script, input, and output files. Parallelism is across files, with no support for file splits.
  • #17: Describe each of the future capabilities in the bullets. You can design and specify the entire end-to-end flow of your data processing pipeline, with the flexibility to insert custom code. Choose the right level of abstraction to define your data flow; don't reinvent the wheel. Informatica provides the right level of abstraction for data processing, enabling rapid development (e.g. a metadata-driven development environment) and easy maintenance (e.g. complete specification and lineage of data).