Data Ingestion, Extraction, and Preparation for Hadoop

Sanjay Kaluskar, Sr. Architect, Informatica
David Teniente, Data Architect, Rackspace

1
Safe Harbor Statement
•   The information being provided today is for informational purposes only. The
    development, release and timing of any Informatica product or functionality described
    today remain at the sole discretion of Informatica and should not be relied upon in
    making a purchasing decision. Statements made today are based on
    currently available information, which is subject to change. Such statements should
    not be relied upon as a representation, warranty or commitment to deliver specific
    products or functionality in the future.
•   Some of the comments we will make today are forward-looking statements including
    statements concerning our product portfolio, our growth and operational
    strategies, our opportunities, customer adoption of and demand for our products and
    services, the use and expected benefits of our products and services by
    customers, the expected benefit from our partnerships and our expectations
    regarding future industry trends and macroeconomic development.
•   All forward-looking statements are based upon current expectations and beliefs.
    However, actual results could differ materially. There are many reasons why actual
    results may differ from our current expectations. These forward-looking statements
    should not be relied upon as representing our views as of any subsequent date and
    Informatica undertakes no obligation to update forward-looking statements to reflect
    events or circumstances after the date that they are made.
•   Please refer to our recent SEC filings including the Form 10-Q for the quarter ended
    September 30th, 2011 for a detailed discussion of the risk factors that may affect our
    results. Copies of these documents may be obtained from the SEC or by contacting
    our Investor Relations department.




                                                                                             2
The Hadoop Data Processing Pipeline
Informatica PowerCenter + PowerExchange
Available Today | 1H / 2012

1. Ingest Data into Hadoop (PowerCenter + PowerExchange)
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop (PowerCenter + PowerExchange)

Sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data mart, Customer Service Portal

3
Options

                    Ingest/Extract Data        Parse & Prepare Data   Transform & Cleanse Data
Structured          Informatica PowerCenter +  N/A                    Hive, PIG, MR;
(e.g. OLTP, OLAP)   PowerExchange, Sqoop                              Future: Informatica Roadmap

Unstructured,       Informatica PowerCenter +  Informatica HParser,   Hive, PIG, MR;
semi-structured     PowerExchange, copy        PIG/Hive UDFs, MR      Future: Informatica Roadmap
(e.g. web logs,     files, Flume, Scribe,
JSON)               Kafka

4
Unleash the Power of Hadoop
With High Performance Universal Data Access

Messaging and Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI,
Web Services, TIBCO, webMethods
Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business,
PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
Relational and Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase,
Informix, Teradata, Netezza, ODBC, JDBC
SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt,
SAP By Design, Oracle OnDemand
Mainframe and Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM,
Binary Flat Files, Tape Formats…
Industry Standards: EDI–X12, EDI-Fact, RosettaNet, HL7, HIPAA, AST, FIX,
Cargo IMP, MVR
Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect,
Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP
XML Standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
MPP Appliances: EMC/Greenplum, Vertica, AsterData
Social Media: Facebook, Twitter, LinkedIn

5
Ingest Data

Access Data → Pre-Process → Ingest Data

Sources: Web server; Databases, Data Warehouse; Message Queues, Email,
Social Media; ERP, CRM; Mainframe
Access with PowerExchange: Batch, CDC, Real-time
Pre-process with PowerCenter (e.g. filter, join, cleanse); reuse
PowerCenter mappings
Ingest into: HDFS, HIVE

6
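The pre-process step above (filter, join, cleanse before the data lands in HDFS or Hive) can be sketched in plain Python. This is an illustrative stand-in for what a PowerCenter mapping would express, not the product's behavior; the field names and cleansing rules are hypothetical.

```python
# Illustrative pre-process step: filter, join, and cleanse records before
# ingest. Field names and cleansing rules are hypothetical.

def cleanse(record):
    """Strip surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def pre_process(transactions, customers):
    """Filter out non-positive transactions, join on customer_id, cleanse."""
    by_id = {c["customer_id"]: c for c in customers}
    out = []
    for t in transactions:
        if t.get("amount", 0) <= 0:        # filter
            continue
        c = by_id.get(t["customer_id"])    # join
        if c is None:
            continue
        out.append(cleanse({**t, **c}))    # cleanse
    return out

txns = [{"customer_id": 1, "amount": 25.0}, {"customer_id": 2, "amount": 0}]
custs = [{"customer_id": 1, "name": " Ada "}]
print(pre_process(txns, custs))
# -> [{'customer_id': 1, 'amount': 25.0, 'name': 'Ada'}]
```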
Extract Data

Extract Data → Post-Process → Deliver Data

Extract from: HDFS
Post-process with PowerCenter (e.g. transform to target schema); reuse
PowerCenter mappings
Deliver with PowerExchange (Batch) to: Web server; Databases,
Data Warehouse; ERP, CRM; Mainframe

7
1. Create Ingest or Extract Mapping
2. Create Hadoop Connection
3. Configure Workflow
4. Create & Load Into Hive Table

8
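Step 4 above ultimately issues HiveQL against the cluster. A minimal sketch of the kind of statements involved, with a hypothetical table, columns, and landing path:

```python
# Sketch of the HiveQL behind "Create & Load Into Hive Table".
# Table name, columns, and HDFS path are hypothetical.

def hive_create_and_load(table, columns, hdfs_path):
    """Return CREATE TABLE and LOAD DATA statements for a landed file."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    create = (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
              "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    load = f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"
    return [create, load]

for stmt in hive_create_and_load(
        "web_logs", [("ts", "STRING"), ("url", "STRING")], "/landing/web_logs"):
    print(stmt)
```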
The Hadoop Data Processing Pipeline
Informatica HParser
Available Today | 1H / 2012

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop (HParser)
3. Transform & Cleanse Data on Hadoop
4. Extract Data from Hadoop

Sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data mart, Customer Service Portal

9
Informatica HParser
Productivity: Data Transformation Studio




                                           11
Informatica HParser
Productivity: Data Transformation Studio

Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML,
BAI – V2.0 Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA,
FIXML, MISMO
Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS,
RosettaNet, OAGI
Other: IATA-PADIS, PLMXML, NEIM

• Out of the box transformations for all messages in all versions
• Easy example based visual enhancements and edits
• Updates and new versions delivered from Informatica
• Definition is done using business (industry) terminology and definitions
• Enhanced validations

12
Informatica HParser
How does it work?

hadoop … dt-hadoop.jar
… My_Parser /input/*/input*.txt

1. Develop an HParser transformation
2. Deploy the transformation to the service repository
3. Run HParser on the Hadoop cluster to produce tabular data in HDFS
4. Analyze the data with HIVE / PIG / MapReduce / Other

13
The Hadoop Data Processing Pipeline
Informatica Roadmap
Available Today | 1H / 2012

1. Ingest Data into Hadoop
2. Parse & Prepare Data on Hadoop
3. Transform & Cleanse Data on Hadoop (Roadmap)
4. Extract Data from Hadoop

Sources: Marketing Campaigns, Customer Profile, Account Transactions,
Product & Service Offerings, Social Media, Customer Service Logs & Surveys
Targets: Sales & Marketing Data mart, Customer Service Portal

14
Informatica Hadoop Roadmap – 1H 2012

• Process data on Hadoop
   • IDE, administration, monitoring, workflow
   • Data processing flow designed through IDE: Source/Target,
     Filter, Join, Lookup, etc.
   • Execution on Hadoop cluster (pushdown via Hive)

• Flexibility to plug in custom code
   • Hive and PIG UDFs
   • MR scripts

• Productivity with optimal performance
   • Exploit Hive performance characteristics
   • Optimize end-to-end data flow for performance
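The "MR scripts" hook mentioned above is often a Hadoop Streaming-style job. Here is a minimal mapper/reducer pair that can be tested locally; the input layout (tab-separated lines, key in the first field) is an assumption for illustration.

```python
# Hadoop Streaming-style mapper/reducer, runnable locally for testing.
# The input layout (tab-separated lines, key in the first field) is assumed.
import itertools

def mapper(lines):
    """Emit (key, 1) for the first tab-separated field of each line."""
    for line in lines:
        yield line.rstrip("\n").split("\t")[0], 1

def reducer(pairs):
    """Sum counts per key; pairs must be sorted by key, as Streaming sorts them."""
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

pairs = sorted(mapper(["a\tx", "b\ty", "a\tz"]))
print(dict(reducer(pairs)))  # -> {'a': 2, 'b': 1}
```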


                                                                 16
Mapping for Hive execution

• Logical representation of processing steps
• Validate & configure the mapping for Hive translation
• Preview the generated Hive code:

  INSERT INTO STG0 SELECT * FROM StockAnalysis0;
  INSERT INTO STG1 SELECT * FROM StockAnalysis1;
  INSERT INTO STG2 SELECT * FROM StockAnalysis2;

17
Takeaways

• Universal connectivity
   • Completeness and enrichment of raw data for holistic analysis
   • Prevent Hadoop from becoming another silo accessible to a few
     experts

• Maximum productivity
   • Collaborative development environment
      • Right level of abstraction for data processing logic
      • Re-use of algorithms and data flow logic
   • Metadata-driven processing
      • Document data lineage for auditing and impact analysis
      • Deploy on any platform for optimal performance and utilization




                                                                         18
Customer Sentiment - Reaching beyond
NPS (Net Promoter Score) and surveys

Gaining insight into our customers’ sentiment
will improve Rackspace’s ability to provide
Fanatical Support™
Objectives:
• What are “they” saying
• Gauge the level of sentiment
• Fanatical Support™ for the win
   • Increase NPS
   • Increase MRR
   • Decrease churn
   • Provide the right products
   • Keep our promises

19
Customer Sentiment Use Cases
Pulling it all together

Case 1: Match social media posts with Customer. Determine a probable match.
Case 2: Determine the sentiment of a post, searching key words and scoring
the post.
Case 3: Determine correlations between posts, ticket volume and NPS leading
to negative or positive sentiments.
Case 4: Determine correlations in sentiments with products/configurations
which lead to negative or positive sentiments.
Case 5: The ability to trend all inputs over time…

20
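Case 2 above describes keyword search and scoring. A deliberately simple sketch of that idea; the keyword lists and weights are hypothetical, and a production system would handle punctuation, negation, and phrases.

```python
# Keyword-based sentiment scoring for a post (Case 2), deliberately simplified.
# The keyword lists and weights are hypothetical.
POSITIVE = {"fanatical": 2, "great": 1, "fast": 1}
NEGATIVE = {"down": -2, "slow": -1, "outage": -2}

def score_post(text):
    """Sum keyword weights over the lowercased, whitespace-split post."""
    return sum(POSITIVE.get(w, 0) + NEGATIVE.get(w, 0) for w in text.lower().split())

print(score_post("Fanatical support and fast fixes"))  # -> 3
print(score_post("site is down again"))                # -> -2
```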
Rackspace Fanatical Support™
Big Data Environment

Data Sources (DBs, flat files, data streams): Oracle, MySQL, MS SQL,
Postgres, DB2, Excel, CSV, Flat File, XML, EDI, Binary, Sys Logs,
Messaging, APIs

Ingestion via message bus / port listening into Hadoop HDFS

• Indirect analytics over Hadoop: Greenplum DB feeding BI Analytics and
  the BI Stack
• Direct analytics over Hadoop: search, analytics, algorithmic processing

21
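The message bus / port listening stage above implies routing each incoming record type to its own HDFS landing area. A small sketch of that routing; the source types and paths are made up for illustration.

```python
# Hypothetical routing of records arriving on the message bus / port
# listener to HDFS landing directories, partitioned by day. Paths are made up.
ROUTES = {
    "syslog": "/landing/syslogs",
    "twitter": "/landing/social/twitter",
    "csv": "/landing/files/csv",
}

def landing_path(source_type, day):
    """Return the HDFS directory a record of this source type lands in."""
    return f"{ROUTES.get(source_type, '/landing/other')}/dt={day}"

print(landing_path("twitter", "2012-01-15"))  # -> /landing/social/twitter/dt=2012-01-15
print(landing_path("edi", "2012-01-15"))      # -> /landing/other/dt=2012-01-15
```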
Twitter Feed for Rackspace
Using Informatica

Input Data → Output Data

22
23

More Related Content

What's hot (20)

PPTX
Importance & Principles of Modeling from UML Designing
ABHISHEK KUMAR
 
PPTX
AWS Interview Questions Part - 2 | AWS Interview Questions And Answers Part -...
Simplilearn
 
PPTX
computer network OSI layer
Sangeetha Rangarajan
 
PDF
Overview of computing paradigm
Ripal Ranpara
 
PDF
Cloud Computing Architecture
Animesh Chaturvedi
 
PPTX
Integrating Public & Private Clouds
Proact Belgium
 
PDF
Strategies for Training End Users How To Use Salesforce
Shell Black
 
PDF
Introduction to Software Defined Networking (SDN)
Bangladesh Network Operators Group
 
PPTX
Ch 2 types of reqirement
Fish Abe
 
PPTX
Applications of Mealy & Moore Machine
SardarKashifKhan
 
PPTX
Apex code (Salesforce)
Mohammed Safwat Abu Kwaik
 
PDF
Cloud Mashup
Vasco Elvas
 
PPTX
Application layer protocols
FabMinds
 
PPT
Cloud deployment models
Ashok Kumar
 
PPT
logical addressing
Sagar Gor
 
PPT
Cloud computing architectures
Muhammad Aitzaz Ahsan
 
PPTX
Migrating into a cloud
ANUSUYA T K
 
PPSX
Student feedback system
msandbhor
 
PPTX
Introduction to Parallel and Distributed Computing
Sayed Chhattan Shah
 
PDF
Type conversion in Compiler Construction
Muhammad Haroon
 
Importance & Principles of Modeling from UML Designing
ABHISHEK KUMAR
 
AWS Interview Questions Part - 2 | AWS Interview Questions And Answers Part -...
Simplilearn
 
computer network OSI layer
Sangeetha Rangarajan
 
Overview of computing paradigm
Ripal Ranpara
 
Cloud Computing Architecture
Animesh Chaturvedi
 
Integrating Public & Private Clouds
Proact Belgium
 
Strategies for Training End Users How To Use Salesforce
Shell Black
 
Introduction to Software Defined Networking (SDN)
Bangladesh Network Operators Group
 
Ch 2 types of reqirement
Fish Abe
 
Applications of Mealy & Moore Machine
SardarKashifKhan
 
Apex code (Salesforce)
Mohammed Safwat Abu Kwaik
 
Cloud Mashup
Vasco Elvas
 
Application layer protocols
FabMinds
 
Cloud deployment models
Ashok Kumar
 
logical addressing
Sagar Gor
 
Cloud computing architectures
Muhammad Aitzaz Ahsan
 
Migrating into a cloud
ANUSUYA T K
 
Student feedback system
msandbhor
 
Introduction to Parallel and Distributed Computing
Sayed Chhattan Shah
 
Type conversion in Compiler Construction
Muhammad Haroon
 

Viewers also liked (20)

PPTX
Hadoop data ingestion
Vinod Nayal
 
PDF
Open source data ingestion
Treasure Data, Inc.
 
PPTX
High Speed Continuous & Reliable Data Ingest into Hadoop
DataWorks Summit
 
PPTX
Big Data Ingestion @ Flipkart Data Platform
Navneet Gupta
 
PPTX
Gobblin: Unifying Data Ingestion for Hadoop
Yinan Li
 
PDF
XML Parsing with Map Reduce
Edureka!
 
PDF
Designing a Real Time Data Ingestion Pipeline
DataScience
 
PDF
Efficient processing of large and complex XML documents in Hadoop
DataWorks Summit
 
PPTX
Open Source Big Data Ingestion - Without the Heartburn!
Pat Patterson
 
PDF
A poster version of HadoopXML
Kyong-Ha Lee
 
PDF
Map reduce: beyond word count
Jeff Patti
 
PPTX
Big Data Analytics with Hadoop
Philippe Julio
 
PDF
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
Excelerate Systems
 
PPSX
La plateforme OpenData 3.0 pour libérer et valoriser les données
Excelerate Systems
 
PDF
Turning Text Into Insights: An Introduction to Topic Models
DataScience
 
PDF
Scalable Hadoop in the cloud
Treasure Data, Inc.
 
PDF
The Data-Drive Paradigm
Lucidworks
 
PDF
Search in 2020: Presented by Will Hayes, Lucidworks
Lucidworks
 
PPTX
Gobblin' Big Data With Ease @ QConSF 2014
Lin Qiao
 
PPTX
Meson: Building a Machine Learning Orchestration Framework on Mesos
Antony Arokiasamy
 
Hadoop data ingestion
Vinod Nayal
 
Open source data ingestion
Treasure Data, Inc.
 
High Speed Continuous & Reliable Data Ingest into Hadoop
DataWorks Summit
 
Big Data Ingestion @ Flipkart Data Platform
Navneet Gupta
 
Gobblin: Unifying Data Ingestion for Hadoop
Yinan Li
 
XML Parsing with Map Reduce
Edureka!
 
Designing a Real Time Data Ingestion Pipeline
DataScience
 
Efficient processing of large and complex XML documents in Hadoop
DataWorks Summit
 
Open Source Big Data Ingestion - Without the Heartburn!
Pat Patterson
 
A poster version of HadoopXML
Kyong-Ha Lee
 
Map reduce: beyond word count
Jeff Patti
 
Big Data Analytics with Hadoop
Philippe Julio
 
BigDataBx #1 - Atelier 1 Cloudera Datawarehouse Optimisation
Excelerate Systems
 
La plateforme OpenData 3.0 pour libérer et valoriser les données
Excelerate Systems
 
Turning Text Into Insights: An Introduction to Topic Models
DataScience
 
Scalable Hadoop in the cloud
Treasure Data, Inc.
 
The Data-Drive Paradigm
Lucidworks
 
Search in 2020: Presented by Will Hayes, Lucidworks
Lucidworks
 
Gobblin' Big Data With Ease @ QConSF 2014
Lin Qiao
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Antony Arokiasamy
 
Ad

Similar to Data Ingestion, Extraction & Parsing on Hadoop (20)

PPTX
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Cloudera, Inc.
 
PPT
Hadoop India Summit, Feb 2011 - Informatica
Sanjeev Kumar
 
PPTX
OOP 2014
Emil Andreas Siemes
 
PDF
Track B-1 建構新世代的智慧數據平台
Etu Solution
 
DOCX
Informatica
mukharji
 
PDF
SnapLogic corporate presentation
pbridges
 
PDF
Magic quadrant for data warehouse database management systems
divjeev
 
PDF
Talk IT_ Oracle_김태완_110831
Cana Ko
 
PDF
Teradata - Presentation at Hortonworks Booth - Strata 2014
Hortonworks
 
PPTX
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Yahoo Developer Network
 
PPTX
Big Data and HPC
NetApp
 
PDF
Hadoop - Now, Next and Beyond
Teradata Aster
 
PDF
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Cloudera, Inc.
 
PPTX
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
DataWorks Summit/Hadoop Summit
 
PPTX
2012 06 hortonworks paris hug
Modern Data Stack France
 
PDF
Hadoop for shanghai dev meetup
Roby Chen
 
PDF
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
Amr Awadallah
 
PDF
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Cloudera, Inc.
 
PDF
Webinar future dataintegration-datamesh-and-goldengatekafka
Jeffrey T. Pollock
 
PPTX
Hadoop as Data Refinery - Steve Loughran
JAX London
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Cloudera, Inc.
 
Hadoop India Summit, Feb 2011 - Informatica
Sanjeev Kumar
 
Track B-1 建構新世代的智慧數據平台
Etu Solution
 
Informatica
mukharji
 
SnapLogic corporate presentation
pbridges
 
Magic quadrant for data warehouse database management systems
divjeev
 
Talk IT_ Oracle_김태완_110831
Cana Ko
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Hortonworks
 
Apache Hadoop India Summit 2011 talk "Informatica and Big Data" by Snajeev Kumar
Yahoo Developer Network
 
Big Data and HPC
NetApp
 
Hadoop - Now, Next and Beyond
Teradata Aster
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Cloudera, Inc.
 
Bring your SAP and Enterprise Data to Hadoop, Apache Kafka and the Cloud
DataWorks Summit/Hadoop Summit
 
2012 06 hortonworks paris hug
Modern Data Stack France
 
Hadoop for shanghai dev meetup
Roby Chen
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
Amr Awadallah
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Cloudera, Inc.
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Jeffrey T. Pollock
 
Hadoop as Data Refinery - Steve Loughran
JAX London
 
Ad

Recently uploaded (20)

PDF
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
PDF
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
PDF
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
PDF
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
PDF
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
Apache CloudStack 201: Let's Design & Build an IaaS Cloud
ShapeBlue
 
PPTX
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
PPTX
MSP360 Backup Scheduling and Retention Best Practices.pptx
MSP360
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
PDF
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
PDF
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
PDF
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
PPTX
Top Managed Service Providers in Los Angeles
Captain IT
 
PDF
Empowering Cloud Providers with Apache CloudStack and Stackbill
ShapeBlue
 
PPTX
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
PDF
July Patch Tuesday
Ivanti
 
PDF
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
PPT
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
PDF
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
Apache CloudStack 201: Let's Design & Build an IaaS Cloud
ShapeBlue
 
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
MSP360 Backup Scheduling and Retention Best Practices.pptx
MSP360
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
SWEBOK Guide and Software Services Engineering Education
Hironori Washizaki
 
Top Managed Service Providers in Los Angeles
Captain IT
 
Empowering Cloud Providers with Apache CloudStack and Stackbill
ShapeBlue
 
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
July Patch Tuesday
Ivanti
 
Complete JavaScript Notes: From Basics to Advanced Concepts.pdf
haydendavispro
 
Interview paper part 3, It is based on Interview Prep
SoumyadeepGhosh39
 
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 

Data Ingestion, Extraction & Parsing on Hadoop

  • 1. Data Ingestion, Extraction, and Preparation for Hadoop Sanjay Kaluskar, Sr. Architect, Informatica David Teniente, Data Architect, Rackspace 1
  • 2. Safe Harbor Statement • The information being provided today is for informational purposes only. The development, release and timing of any Informatica product or functionality described today remain at the sole discretion of Informatica and should not be relied upon in making a purchasing decision. Statements made today are based on currently available information, which is subject to change. Such statements should not be relied upon as a representation, warranty or commitment to deliver specific products or functionality in the future. • Some of the comments we will make today are forward-looking statements including statements concerning our product portfolio, our growth and operational strategies, our opportunities, customer adoption of and demand for our products and services, the use and expected benefits of our products and services by customers, the expected benefit from our partnerships and our expectations regarding future industry trends and macroeconomic development. • All forward-looking statements are based upon current expectations and beliefs. However, actual results could differ materially. There are many reasons why actual results may differ from our current expectations. These forward-looking statements should not be relied upon as representing our views as of any subsequent date and Informatica undertakes no obligation to update forward-looking statements to reflect events or circumstances after the date that they are made. • Please refer to our recent SEC filings including the Form 10-Q for the quarter ended September 30th, 2011 for a detailed discussion of the risk factors that may affect our results. Copies of these documents may be obtained from the SEC or by contacting our Investor Relations department. 2
  • 3. The Hadoop Data Processing Pipeline: Informatica PowerCenter + PowerExchange (legend: available today vs. 1H 2012). Pipeline steps: 1. Ingest data into Hadoop (PowerCenter + PowerExchange); 2. Parse & prepare data on Hadoop; 3. Transform & cleanse data on Hadoop; 4. Extract data from Hadoop. Example sources: product & service offerings, customer service logs & surveys, marketing campaigns, customer profile, account transactions, social media. Example targets: sales & marketing data mart, customer service portal.
  • 4. Options: which tools fit each stage of the pipeline.

    | Data                                                | Ingest/Extract Data                                                       | Parse & Prepare Data                   | Transform & Cleanse Data                   |
    |-----------------------------------------------------|---------------------------------------------------------------------------|----------------------------------------|--------------------------------------------|
    | Structured (e.g. OLTP, OLAP)                        | Informatica PowerCenter + PowerExchange, Sqoop                            | N/A                                    | Hive, PIG, MR; future: Informatica roadmap |
    | Unstructured, semi-structured (e.g. web logs, JSON) | Informatica PowerCenter + PowerExchange, copy files, Flume, Scribe, Kafka | Informatica HParser, PIG/Hive UDFs, MR | Hive, PIG, MR; future: Informatica roadmap |
  • 5. Unleash the Power of Hadoop with High-Performance Universal Data Access:
    Messaging and Web Services: WebSphere MQ, JMS, MSMQ, TIBCO, webMethods, Web Services;
    Packaged Applications: JD Edwards, SAP NetWeaver, SAP NetWeaver BI, SAP NetWeaver XI, Lotus Notes, Oracle E-Business, SAS, PeopleSoft, Siebel;
    Relational and Flat Files: Oracle, Informix, DB2 UDB, Teradata, DB2/400, Netezza, SQL Server, Sybase, ODBC, JDBC;
    SaaS/BPO: Salesforce CRM, ADP, Hewitt, Force.com, RightNow, SAP By Design, Oracle OnDemand, NetSuite;
    Mainframe and Midrange: ADABAS, VSAM, Datacom, C-ISAM, DB2, binary flat files, IDMS, IMS, tape formats…;
    Industry Standards: EDI-X12, AST, EDI-Fact, FIX, RosettaNet, Cargo IMP, HL7, MVR, HIPAA;
    Unstructured Data and Files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, flat files, ASCII reports, HTML, RPG, ANSI, LDAP;
    XML Standards: XML, ebXML, LegalXML, HL7 v3.0, IFX, ACORD (AL3, XML), cXML;
    MPP Appliances: EMC/Greenplum, AsterData, Vertica;
    Social Media: Facebook, LinkedIn, Twitter
  • 6. Ingest Data: access data, pre-process, then ingest. PowerExchange accesses the sources (web server, databases, data warehouse, message queues, email, social media, ERP, CRM, mainframe) in batch, CDC, or real-time mode; PowerCenter pre-processes the data (e.g. filter, join, cleanse), reusing existing PowerCenter mappings; the data lands in HDFS or Hive.
  • 7. Extract Data: extract, post-process, then deliver. PowerCenter extracts data from HDFS and post-processes it (e.g. transforms it to the target schema), reusing PowerCenter mappings; PowerExchange delivers it in batch to the targets (web server, databases, data warehouse, ERP, CRM, mainframe).
  • 8. Steps: 1. Create an ingest or extract mapping; 2. Create a Hadoop connection; 3. Configure the workflow; 4. Create and load into a Hive table.
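The last step above ("Create & Load Into Hive Table") boils down to a pair of HiveQL statements. Here is a minimal Python sketch of generating them; the table name, columns, and HDFS path are illustrative assumptions, not values from the deck:

```python
# Hypothetical sketch of the HiveQL behind step 4, "Create & Load Into
# Hive Table". Table name, columns, and HDFS path are illustrative.

def create_and_load_hiveql(table, columns, hdfs_path):
    """Build CREATE TABLE + LOAD DATA statements for a file landed in HDFS."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    ddl = (f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
           "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
    dml = f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"
    return [ddl, dml]

for stmt in create_and_load_hiveql(
    "web_logs",
    [("ts", "STRING"), ("url", "STRING"), ("status", "INT")],
    "/landing/web_logs/2011-11-08",
):
    print(stmt)
```

In a real workflow the generated statements would be submitted through the Hive CLI or a JDBC connection rather than printed.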
  • 9. The Hadoop Data Processing Pipeline: Informatica HParser (legend: available today vs. 1H 2012). Pipeline steps: 1. Ingest data into Hadoop; 2. Parse & prepare data on Hadoop (HParser); 3. Transform & cleanse data on Hadoop; 4. Extract data from Hadoop. Example sources: product & service offerings, customer service logs & surveys, marketing campaigns, customer profile, account transactions, social media. Example targets: sales & marketing data mart, customer service portal.
  • 10. Options: the ingest/parse/transform options matrix from slide 4, repeated.
  • 11. Informatica HParser Productivity: Data Transformation Studio 11
  • 12. Informatica HParser Productivity: Data Transformation Studio. B2B standards supported out of the box, by domain. Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI V2.0 Lockbox, CREST, TWIST, UNIFI (ISO 20022), SEPA, FIXML, IFX, MISMO. Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML. B2B/EDI: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI, DEX, IATA-PADIS, PLMXML. Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC. Other: NEIM. Benefits: out-of-the-box transformations for all messages in all versions; easy example-based visual enhancements and edits; updates and new versions delivered from Informatica; definitions use business (industry) terminology; enhanced validations.
  • 13. Informatica HParser: how does it work? 1. Develop an HParser transformation in the studio; 2. Deploy the transformation to the service repository on the Hadoop cluster; 3. Run HParser on Hadoop (e.g. hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt) to produce tabular data in HDFS; 4. Analyze the data with Hive, PIG, MapReduce, or other tools.
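For intuition about what the parse step produces, here is a tiny Python stand-in: it turns semi-structured web-log lines into tab-separated tabular rows of the kind Hive or PIG can then query. The log format and field names are assumptions for illustration; HParser itself is driven by transformations built in the studio, not by code like this.

```python
# Illustrative only: a miniature version of the parse-and-prepare step,
# turning semi-structured web-log lines into tabular rows. The log layout
# (common log format with a trailing byte count) is an assumption.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line):
    """Return one tab-separated tabular row, or None for unparsable lines."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    f = m.groupdict()
    f["bytes"] = "0" if f["bytes"] == "-" else f["bytes"]
    return "\t".join([f["ip"], f["ts"], f["method"], f["url"],
                      f["status"], f["bytes"]])

row = parse_log_line(
    '10.0.0.1 - - [08/Nov/2011:10:00:00 -0600] "GET /index.html HTTP/1.1" 200 5120'
)
print(row)
```

At scale this per-line logic is exactly what gets distributed across the cluster: parallelism is across input files, each mapper parsing its own files.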
  • 14. The Hadoop Data Processing Pipeline: Informatica roadmap (legend: available today vs. 1H 2012). Pipeline steps: 1. Ingest data into Hadoop; 2. Parse & prepare data on Hadoop; 3. Transform & cleanse data on Hadoop; 4. Extract data from Hadoop. Example sources: product & service offerings, customer service logs & surveys, marketing campaigns, customer profile, account transactions, social media. Example targets: sales & marketing data mart, customer service portal.
  • 15. Options: the ingest/parse/transform options matrix from slide 4, repeated.
  • 16. Informatica Hadoop Roadmap, 1H 2012 • Process data on Hadoop • IDE, administration, monitoring, workflow • Data processing flow designed through the IDE: source/target, filter, join, lookup, etc. • Execution on the Hadoop cluster (pushdown via Hive) • Flexibility to plug in custom code • Hive and PIG UDFs • MR scripts • Productivity with optimal performance • Exploit Hive performance characteristics • Optimize the end-to-end data flow for performance
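As a sketch of the "plug in custom code" bullet: Hive can stream rows through an external script via its TRANSFORM clause, which is one common way to inject custom cleansing logic. The script below and its column conventions are assumptions for illustration, not part of the roadmap being described.

```python
# Sketch of custom cleansing code of the kind Hive's TRANSFORM clause can
# stream rows through, e.g.:
#   ADD FILE cleanse.py;
#   SELECT TRANSFORM(c1, c2) USING 'python cleanse.py' AS (c1, c2) FROM t;
# Column conventions here are assumptions for illustration.
import io

def cleanse(value):
    """Trim whitespace; normalize empty markers to Hive's NULL marker \\N."""
    value = value.strip()
    return "\\N" if value in ("", "-", "null") else value

def transform_stream(stdin, stdout):
    """Read tab-separated rows, emit the same rows with cleansed columns."""
    for line in stdin:
        cols = line.rstrip("\n").split("\t")
        stdout.write("\t".join(cleanse(c) for c in cols) + "\n")

# Exercise the logic in-process instead of over real stdin/stdout:
out = io.StringIO()
transform_stream(io.StringIO("  a \t-\tnull\tok\n"), out)
print(repr(out.getvalue()))
```

In production the same script would read sys.stdin and write sys.stdout so that Hive can spawn it per task.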
  • 17. Mapping for Hive execution: a logical representation of the processing steps; validate and configure the source for Hive translation; preview the generated Hive code, e.g.: INSERT INTO STG0 SELECT * FROM StockAnalysis0; INSERT INTO STG1 SELECT * FROM StockAnalysis1; INSERT INTO STG2 SELECT * FROM StockAnalysis2;
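The translation shown on this slide (logical mapping steps becoming staged Hive INSERT statements) can be illustrated with a toy generator. The mapping representation below is invented for illustration; the product generates Hive code from its own metadata model, not from this structure.

```python
# Toy illustration of pushdown via Hive: a logical mapping, represented
# here simply as ordered (staging_table, source) pairs, is translated into
# the staged INSERT statements previewed on the slide.

def mapping_to_hive(stages):
    """Emit one staged Hive INSERT per logical processing step, in order."""
    return [f"INSERT INTO {stg} SELECT * FROM {src};" for stg, src in stages]

for stmt in mapping_to_hive([
    ("STG0", "StockAnalysis0"),
    ("STG1", "StockAnalysis1"),
    ("STG2", "StockAnalysis2"),
]):
    print(stmt)
```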
  • 18. Takeaways • Universal connectivity: completeness and enrichment of raw data for holistic analysis; prevent Hadoop from becoming another silo accessible to only a few experts • Maximum productivity: a collaborative development environment; the right level of abstraction for data processing logic; re-use of algorithms and data flow logic • Metadata-driven processing: documented data lineage for auditing and impact analysis; deploy on any platform for optimal performance and utilization
  • 19. Customer Sentiment - Reaching beyond NPS (Net Promoter Score) and surveys. Gaining insight into our customers' sentiment will improve Rackspace's ability to provide Fanatical Support™. Objectives: learn what "they" are saying; gauge the level of sentiment; Fanatical Support™ for the win; increase NPS; increase MRR; decrease churn; provide the right products; keep our promises.
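For reference, the NPS metric this slide aims to increase is computed from 0-10 survey responses. A minimal sketch (the sample scores are made up):

```python
# Background sketch: Net Promoter Score from 0-10 survey responses.
# Promoters score 9-10, detractors 0-6, passives 7-8 (counted only in
# the denominator). Sample scores are invented.

def net_promoter_score(scores):
    """Percent promoters minus percent detractors, in [-100, 100]."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

print(net_promoter_score([10, 10, 8, 5]))  # -> 25.0
```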
  • 20. Customer Sentiment Use Cases - pulling it all together. Case 1: match social media posts with customers and determine a probable match. Case 2: determine the sentiment of a post by searching key words and scoring the post. Case 3: determine correlations between posts, ticket volume, and NPS that lead to negative or positive sentiment. Case 4: determine correlations between sentiment and products/configurations that lead to negative or positive sentiment. Case 5: the ability to trend all inputs over time.
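Case 2 (keyword search and scoring) can be sketched naively in Python. The keyword lists and weights below are invented for illustration; a production scorer would use a curated lexicon or a trained model.

```python
# Naive keyword scorer for the Case 2 idea: search key words in a post
# and score it. Weights are invented for illustration only.

POSITIVE = {"fanatical": 2, "great": 1, "fast": 1, "love": 2}
NEGATIVE = {"down": -2, "slow": -1, "churn": -2, "broken": -2}

def score_post(text):
    """Sum keyword weights over the post's words; the sign gives the sentiment."""
    words = text.lower().split()
    return sum(POSITIVE.get(w, 0) + NEGATIVE.get(w, 0) for w in words)

def label(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

s = score_post("Love the fanatical support, never slow")
print(s, label(s))  # -> 3 positive
```

A real pipeline would run this scoring step over posts already matched to customers (Case 1), so the scores can be correlated with tickets and NPS (Cases 3-5).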
  • 21. Rackspace Fanatical Support™ Big Data Environment. Data sources (databases, flat files, data streams): Oracle, MySQL, MS SQL, Postgres, DB2, Excel, CSV, flat files, XML, EDI, binary, sys logs, and a message bus / port listening. Indirect analytics over Hadoop: Greenplum DB feeding a BI analytics stack. Direct analytics over Hadoop: Hadoop HDFS with search, messaging APIs, and algorithmic analytics.
  • 22. Twitter Feed for Rackspace Using Informatica: input data and output data.
  • 23. 23

Editor's Notes

  • #4: *EXAMPLE* Some talking points to cover over the next few slides on PowerExchange for Hadoop: access all data sources; the ability to pre-process (e.g. filter) before landing to HDFS and post-process to fit the target schema; load performance via partitioning, native APIs, grid, pushdown to source or target, and process offloading; productivity via a visual designer; different latencies (batch, near real-time).

    One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. All too often, developers resort to reinventing the wheel by building custom adapters and scripts that require expert knowledge of the source systems, applications, data structures, and formats. Once they overcome this hurdle, they need to make sure their custom code will perform and scale as data volumes grow. Along with the need for speed, security and reliability are often overlooked, which increases the risk of non-compliance and system downtime. Needless to say, building a robust custom adapter takes time and can be costly to maintain as software versions change. Sometimes the end result is adapters that lack direct connectivity between the source systems and Hadoop, which means the data must be temporarily staged before it can move into Hadoop, increasing storage costs.

    Informatica PowerExchange can access data from virtually any data source at any latency (e.g. batch, real-time, or near real-time) and deliver all your data directly into Hadoop (see Figure 2). Similarly, Informatica PowerExchange can deliver data from Hadoop to your enterprise applications and information management systems. You can schedule batch loads to move data from multiple source systems directly into Hadoop without any staging. Alternatively, you can move only changed data from relational and mainframe systems directly into Hadoop. For real-time data feeds, you can move data off of message queues and deliver it into Hadoop.

    Informatica PowerExchange accesses data through native APIs to ensure optimal performance and is designed to minimize the impact on source systems through caching and process offloading. To further increase the performance of data flows between the source systems and Hadoop, PowerCenter supports data partitioning to distribute the processing across CPUs. Informatica PowerExchange for Hadoop is integrated with PowerCenter so that you can pre-process data from multiple data sources before it lands in Hadoop. This lets you leverage the source system metadata, since this information is not retained in the Hadoop Distributed File System (HDFS). For example, you can perform lookups, filters, or relational joins based on primary and foreign key relationships before data is delivered to HDFS. You can also push the pre-processing down to the source system to limit data movement and unnecessary data duplication in Hadoop. Common design patterns for data flows into or out of Hadoop can be generated in PowerCenter using parameterized templates built in Microsoft Visio to dramatically increase productivity. To securely and reliably manage the file transfer and collection of very large data files from both inside and outside the firewall, you can use Informatica Managed File Transfer (MFT).
  • #5: Sanjay's notes: Flume and Scribe are options for streaming ingestion of log files; Kafka is for near real-time feeds.
  • #7: See the PowerExchange for Hadoop white paper. Does not require expert knowledge of source systems; delivers data directly to Hadoop without any intermediate staging; accesses data through native APIs for optimal performance; brings in both un-modeled/unstructured data and structured relational data to make the analysis complete. Use an example to illustrate combining the unstructured and structured data needed for analysis.
  • #8: Have lineage of where data came from
  • #10: Informatica announced on Nov 2 the industry's first data parser for Hadoop. The solution is designed to provide a powerful data parsing alternative for organizations seeking to achieve the full potential of Big Data in Hadoop with efficiency and scale. It addresses the industry's growing demand for turning unstructured, complex data into a structured or semi-structured format in Hadoop to drive insights and improve operations. Tapping our industry-leading experience in parsing unstructured data and handling industry formats and documents within and across the enterprise, Informatica pioneered the development of a data parser that exploits the parallelism of the MapReduce framework. Using an engine-based, interactive tool to simplify the data parsing process, Informatica HParser processes complex files and messages in Hadoop with three offerings: Informatica HParser for logs, Omniture, XML and JSON (community edition), free of charge; Informatica HParser for industry standards (commercial edition); and Informatica HParser for documents (commercial edition). With HParser, organizations can: accelerate deployment using out-of-the-box, ready-to-use transformations and industry standards; increase productivity for tackling diverse complex formats, including proprietary log files; speed the development of parsing by exploiting the parallelism inside MapReduce; and optimize parsing performance for large files including logs, XML, JSON, and industry standards. Informatica also provides a free 30-day trial of the commercial edition of HParser for Documents to users interested in learning about the design environment for data transformation.
  • #14: Define the extraction/transformation logic using the designer. Run the parser as a standalone MR job; the command-line arguments are the script, input, and output files. Parallelism is across files, with no support for file splits.
  • #17: Describe each of the future capabilities in the bullets. You can design and specify the entire end-to-end flow of your data processing pipeline, with the flexibility to insert custom code. Choose the right level of abstraction to define your data flow; don't reinvent the wheel. Informatica provides the right level of abstraction for data processing, enabling rapid development (e.g. a metadata-driven development environment) and easy maintenance (e.g. complete specification and lineage of data).