1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Data & Analytics in Insurance
Can you have one without the other?
2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
P&C Insurance trends in big data/analytics
Use of Predictive Models in P&C New applications, New Methods
• Source: Willis Towers Watson 2016 Predictive Modeling Benchmark Survey (U.S.)
• The survey was fielded from September 7 to October 24, 2016. Respondents comprise 14% of U.S. personal lines carriers
and 20% of commercial lines carriers.
3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
P&C Insurance trends in big data/analytics
“Big data, notably from vehicle
telematics and the IoT, are opening up
many new potential avenues for
investigation and improvement. These
opportunities apply as much to carriers
that have invested recently in
improved policy administration and
quote systems as it does to others.
Whatever the available level of
hardware and software within a
business, a lack of accompanying
investment in data and analytics is
rather like driving a sports car
without fully revving up the engine.”
Uses of Big Data
• Source: Willis Towers Watson 2016 Predictive Modeling Benchmark Survey (U.S.)
• The survey was fielded from September 7 to October 24, 2016. Respondents comprise 14% of U.S. personal lines carriers
and 20% of commercial lines carriers.
The Liberty Mutual
Insurance Data Lake
One small Hadoop footprint …
One giant leap to understanding
#TechAtLiberty
5Liberty Mutual Insurance
Empower Liberty Mutual to leverage the vast data and
amazing talent that we have
Make analytics as easy as it can be
Allow data to be free and secure
Foster a culture of quick iterative experiments, failing
and learning as fast as possible
Remove the separation between IT and business
Our North Star: What we strive for
6Liberty Mutual Insurance
75th
7Liberty Mutual Insurance
Agenda
• How do we think about analytics?
• How do we work as a team?
• Who/what is a data scientist?
• How does a data lake help us?
8Liberty Mutual Insurance
How we think about analytics and machine learning (ML)
Obtaining the data
from source systems
and devices
Storing the data in a
format and location so
that it can be studied
Studying the data to
gain insight and
business value
GET LAND STUDY
• ML is an extension of STUDY
• ML programs need to access
data that’s in “LAND”
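A minimal sketch of the three phases, assuming a toy in-memory source and a temporary directory standing in for the lake. The function names, the `claims.csv` file name, and the `state`/`paid` fields are hypothetical and are not taken from the slides:

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def get(source_rows):
    """GET: obtain raw records from a source system (here, an in-memory list)."""
    return list(source_rows)


def land(rows, lake_dir):
    """LAND: store the records unchanged in a format and location that can be studied."""
    path = Path(lake_dir) / "claims.csv"  # hypothetical file name
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["state", "paid"])
        writer.writeheader()
        writer.writerows(rows)
    return path


def study(path):
    """STUDY: read the landed file and total the paid amount per state."""
    totals = defaultdict(float)
    with path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            totals[row["state"]] += float(row["paid"])
    return dict(totals)


def run_pipeline(source_rows):
    """Run GET, LAND and STUDY in order against a throwaway 'lake' directory."""
    with tempfile.TemporaryDirectory() as lake_dir:
        return study(land(get(source_rows), lake_dir))


sample = [
    {"state": "MA", "paid": "100.0"},
    {"state": "NH", "paid": "40.5"},
    {"state": "MA", "paid": "25.0"},
]
print(run_pipeline(sample))
```

The point of the split is that STUDY (and any ML built on it) only ever reads what LAND has stored, so new analyses can be tried without going back to the source systems.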
9Liberty Mutual Insurance
Who/what is a data scientist?
How do we work as a team?
10Liberty Mutual Insurance
What makes up a data scientist?
True data scientists are extremely rare
because of the unique combination of
skills required.
We believe in investing in data science
teams made up of energized engineers
with various roles:
• Software developers
• Data engineers
• Data analysts
• Data scientists
You don’t need a PhD to be a data
scientist!!!
Business analyst
Engineer/
Developer
Mathematician
Data
Scientist
11Liberty Mutual Insurance
We heard common frustrations
Analytics is hard!
Tools are too hard to
use; Requires many
types of skills
Security and Analytics
have competing goals
IT/business collaboration
needs to improve
12Liberty Mutual Insurance
Information Technology Business
Source system Data
scientists/analysts
MS SQL
Teradata
DB2
MySQL
MS SQL
Oracle
Mongo
Postgres
DATA
Mart
Information management (IM)
Ent. Data
Warehouse
DATA
Mart
2
EDW
Cognos
Tableau
SAS
OBIEE
MicroStrategy
SharePoint
PowerBI
Sybase
13Liberty Mutual Insurance
PYTHON
R
SAS
H2O
R Shiny
Excel
PowerBI
Source system Data
scientists/analysts
IM evolving into Data analytics
MS SQL
Teradata
DB2
MySQL
Oracle
Mongo
EDW
Sybase
Iterate and learn
Information technology Business
Unstructured Data
14Liberty Mutual Insurance
Text
Analytics
Streaming
Analytics
Predictive
Analytics
Data Engineer
Data Engineer
IT Data
Scientist
Software
Developer
Software
Developer
Form one team with business and IT together
Data Scientist
Data Scientist
Data Scientist
15Liberty Mutual Insurance
How does a data lake help us?
16Liberty Mutual Insurance
HORTONWORKS DATA PLATFORM (HDP®)
17Liberty Mutual Insurance
Enterprise data lake security
Security: Centrify / AD / Kerberos / Ranger / HDFS Encryption / SSL
Kerberos HDP Data Lake on-Premises
AD Server as KDC
Secured Zone
HDFS
Secured Zone
HDFS
Secured Zone
HDFS
/Legal
| user:grp
| __1
| __2
/HR
| user:grp
| __1
| __2
/Finance
| user:grp
| __1
| __2
Ranger Policies & Plugins
HDFS Permission & ACL
System Admins
Power BI Users
Data Scientists
ETL Developers
Ambari Server
Spark Thrift Server
HDP Edge Node
Kerberos
Kerberos
NAS/Local HDD
SSL
ODBC
SSL
SSL
RDBMS on-Premises
Sqoop
Security Options Available:
1. Kerberos
2. SSL enabled in connection string
3. Encryption=true on database
Zeppelin / Livy Server
Layers of Defense
Perimeter-level security: Apache Knox for REST API
Authentication: Kerberos
Authorization: Ranger
OS security: HDFS permissions, encryption on HDFS
Apache Knox
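The layering on this slide can be sketched as an ordered chain of checks in which the first failing layer denies the request. This is a toy model only: it does not call Knox, Kerberos, Ranger, or HDFS, and the `Request` fields, group names, and zone policy are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str
    groups: set = field(default_factory=set)
    via_gateway: bool = False   # arrived through the perimeter (the role Knox plays on the slide)
    has_ticket: bool = False    # holds a valid ticket (the role Kerberos plays)
    path: str = "/"


# Hypothetical policy: which groups may read each secured zone.
ZONE_GROUPS = {
    "/Legal": {"legal-analysts"},
    "/HR": {"hr-analysts"},
    "/Finance": {"finance-analysts", "etl-developers"},
}


def check_perimeter(req):
    return req.via_gateway


def check_authentication(req):
    return req.has_ticket


def check_authorization(req):
    for zone, allowed in ZONE_GROUPS.items():
        if req.path == zone or req.path.startswith(zone + "/"):
            return bool(req.groups & allowed)
    return False  # paths outside any known zone are denied by default


LAYERS = [
    ("perimeter", check_perimeter),
    ("authentication", check_authentication),
    ("authorization", check_authorization),
]


def evaluate(req):
    """Return (allowed, name of the layer that denied the request or None)."""
    for name, check in LAYERS:
        if not check(req):
            return False, name
    return True, None


analyst = Request("ana", {"finance-analysts"}, True, True, "/Finance/2017/q1.csv")
print(evaluate(analyst))
```

Reporting which layer denied a request is a deliberate choice here: when several mechanisms are combined, knowing where a request stopped makes testing less painful.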
18Liberty Mutual Insurance
Security challenges and alternatives
• Implementing security requires reconfiguring existing tools
• The security mechanisms need to be used in combination
• Testing is painful, and something inevitably doesn’t work
• Not all BI tools’ built-in drivers support Kerberos
• Spark Security
⎻ Kerberos for Authentication
⎻ AD Groups for HDFS ACLs
⎻ SparkSQL, Ranger, and LLAP via Spark Thrift Server for Authorization
19Liberty Mutual Insurance
Data lake BI & analytics example
User’s Desktop / Laptop / VDE
Applications & Databases
PowerBI Desktop
Dashboard
(data embedded)
Sources of Cost
Information
PowerBI
Hive/Data
Transformation
Kerberos / ODBC
S3: csv Files
Centrify / AD / Kerberos / Ranger / Encryption
Publish
Text Files / API
License
Counts from
Office 365
Daily
HDP Cluster
Pull Data from Hadoop
Report & Data
AWS Keys
Upload Data
PowerBI Services
Data Automation
PowerBI Gateway
Report Developers
Report Consumers
ETL Developers
Other Data Sources
on-premises
Sqoop
Data Lake on-Premises
AD Server
REST API
20Liberty Mutual Insurance
Integrate Elasticsearch and Spark in data lake
Enterprise Data Lake
Master & Data Nodes
HDP Edge/ES Node 1 HDP Edge/ES Node 2 HDP Edge/ES Node 3
ES Repo
/experian
| index
| __1
| __2
ES Repo
/experian
| index
| __3
| __4
ES Repo
/experian
| index
| __5
| __6
ElasticSearch
Hadoop Plugin
ElasticSearch
Hadoop Plugin
ElasticSearch
Hadoop Plugin
REST API – Elasticsearch Queries
End Users
NAS
spark-submit --master yarn --num-executors 4 --executor-memory 1G --executor-cores 1 esspark-assembly-1.0.jar hdfs:///data/BRICK_2016_Q3_masked.csv
curl -XPOST "http://localhost:9200/gs/_search" -d '{"query": {"match": {"CITY": {"query": "Yiqing", "fuzziness": "AUTO"}}}}'
Data Volume: 1 data brick
100GB csv file
Fuzzy Match: company name,
street address, city, state
Results: match score and all
500+ attributes
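To make the fuzzy match concrete, here is a small sketch that builds the same request body as the curl command and approximates what `"fuzziness": "AUTO"` allows. Elasticsearch's default AUTO setting permits 0 edits for terms of 1 to 2 characters, 1 edit for 3 to 5 characters, and 2 edits for longer terms. The helper names are hypothetical, and the sketch uses plain Levenshtein distance on whole strings, whereas Elasticsearch works per analyzed term and by default also counts a transposition as a single edit.

```python
import json


def build_fuzzy_match(field, text, fuzziness="AUTO"):
    """Build the same match-query body that the curl command sends."""
    return {"query": {"match": {field: {"query": text, "fuzziness": fuzziness}}}}


def auto_max_edits(term):
    """Edits allowed by the default AUTO fuzziness for a term of this length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def would_match(query_term, indexed_term):
    """Approximate whether an indexed term falls within the AUTO edit budget."""
    distance = edit_distance(query_term.lower(), indexed_term.lower())
    return distance <= auto_max_edits(query_term)


print(json.dumps(build_fuzzy_match("CITY", "Yiqing")))
print(would_match("Boston", "Bostan"))
```

The edit budget grows with term length, which is why fuzzy matching tolerates typos in long company names and street addresses while leaving short values such as two-letter state codes exact.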
IT Developer
21Liberty Mutual Insurance
Integrate Elasticsearch and Spark in data lake (cont.)
22Liberty Mutual Insurance
Data archiving example
Apache Flume
Syslog Server 1
Syslog Server 3
Syslog Server 2
Apache Flume
Apache Flume
Virtual Index
Enterprise Data Lake (5 data nodes total 120TB)
Analytics, trend
Hot Data Storage
One Month
• 1TB uncompressed data
• 100GB compressed data
SharePoint Logs
HDP Edge Node
SharePoint Logs
IT Developers
Data Analysts
Kafka
Warm Data Storage
One Year
• 60TB uncompressed data
• 6TB compressed data
SharePoint Logs
Kafka Kafka
SIEM, Alerts, Real Time Monitoring
Kerberos
NiFi MergeContent: Holds data
until the flow file reaches a
suitable size to be loaded to HDFS
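The batching idea in this note can be sketched as follows. This is a toy model of size-based merging, not NiFi's MergeContent implementation; the `SizeBatcher` class and its `min_bytes` parameter are hypothetical names.

```python
class SizeBatcher:
    """Accumulate records and release them as one bundle once a size threshold is met."""

    def __init__(self, min_bytes):
        self.min_bytes = min_bytes
        self._pending = []
        self._pending_bytes = 0

    def add(self, record: bytes):
        """Add a record; return a merged bundle when the threshold is reached, else None."""
        self._pending.append(record)
        self._pending_bytes += len(record)
        if self._pending_bytes >= self.min_bytes:
            return self.flush()
        return None

    def flush(self):
        """Release whatever is pending (for example at shutdown), or None if empty."""
        if not self._pending:
            return None
        bundle = b"".join(self._pending)
        self._pending = []
        self._pending_bytes = 0
        return bundle


batcher = SizeBatcher(min_bytes=10)
outputs = [batcher.add(line) for line in (b"abc\n", b"defg\n", b"hi\n")]
print(outputs)
```

Holding small log events back until they form a larger file suits HDFS, which handles a few large files better than many small ones.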
Logs
23Liberty Mutual Insurance
Sample DataFlow
24Liberty Mutual Insurance
Conclusion
Just get started!
Don’t be afraid to fail!
Invite your “business” partners into the process
A small lake is still very beneficial!
25Liberty Mutual Insurance
Thank you

Security, ETL, BI & Analytics, and Software Integration


Editor's Notes

  • #3: Two-thirds of P&C insurers surveyed currently use predictive models for underwriting and risk selection, an increase of over 10 percentage points compared to the 2015 survey. The reasons behind such an increase are clear. There is unanimous agreement from personal lines insurers about the fundamental importance of using more sophisticated predictive techniques to drive success in today’s market. Equally, many commercial lines carriers are recognizing that the traditional barrier of the relative paucity of homogenous risk data in commercial portfolios can be overcome, enabling models to contribute significantly in more unique underwriting environments. Eighty-six percent of small- to mid-market carriers rate more sophisticated risk selection as essential or very important to future success. Over half (56%) of large account or specialty lines carriers share that view.
  • #5: https://dataworkssummit.com/san-jose-2017/sessions/from-big-data-to-data-discovery-one-small-footprint-one-giant-leap-to-understanding/
  • #7: Most people don’t know that Liberty Mutual has over 4,000 technical employees who create our solutions. In order to keep up with the demands of our customers, we are changing the way our company works. We are moving to a faster-paced, customer-centric model. We want to offer innovative products and services in order to provide best-in-class experiences for our customers. We are basically operating like a startup backed by the strength of a Fortune 100 company.
  • #9: Our group is involved in the entire lifecycle of analytics, from Get to Study. We think about “Analytics” in three phases: Get, Land, Study. That covers everything from originally obtaining the data, to landing it somewhere, and then studying it.
  • #12: The necessary tools are often not to scale or not available. The majority of people don’t have the training or understanding of how to use the tools. In some areas we are relying on 3rd-party vendors to solve our problems rather than building expertise. Is this really an issue outside of Hadoop? But data scientists want better performance from R and Python; want the freedom to use downloaded data science libraries; want to use Spark, TensorFlow, H2O, etc.; want to be able to pull data directly from Liberty databases; want to be able to deploy models without IT involvement; and want to be able to work with large datasets. We could certainly create opportunities for people to expand their skills with R and Python, and increase our knowledge and level of support on the IT side. But you need to pick a tool. Security and analytics seem to be opposing forces. There is a bureaucratic and/or autocratic view controlling data and its flow. Data scientists generally don’t need NPPI; they want to analyze inputs and predict outcomes. There is no understanding of the risk, or lack of risk, associated with using business data for analytics. It is unclear how to traverse governance and approval processes. No resources are available to assist with data requests or to scrub data to prepare it for analytics. We need a place to persist prepared data, refresh it as new data becomes available, and make scrubbed data available to multiple projects.
  • #13: This works great for operational reporting… but not for data analytics. Some of those frustrations came from environments like this: way too many data sources, very complex, and a wall between IT and Business. Why does it take that long? For one, the data is everywhere. There are operational reasons for these EDWs and Data Marts. I’m not saying they’re “not useful”; however, for an analyst or data scientist they don’t meet the need on their own. What did the original data look like? Who do I talk to? Is there more data I don’t see? What about using R or Spark? Can I use open source?
  • #14: Cleansing and cleaning is now shifting more towards the business side, moving away from the hard wall. UPDATE: IM -> Data Analytics. Excellent! I’m not bound by data storage or PC capacity. I can access/see all the data available to me. I can “fail” and try again quickly!
  • #15: How we work… we work together as one team with our business partners. We have Data Scientists and Engineers on our team, along with the software developers. Next, Yiqing will talk about how this team tackled various “big data” problems and how we used our lake in practice.
  • #18: Remember our frustrations: security and access for our users. If you don’t set up security, you have a lake that nobody can use!
  • #20: This is an example of our Data Lake in action. GET: We’re taking usage/billing data from various cloud providers. LAND: We’re landing it in the LAKE. STUDY: We’re leveraging PowerBI to surface that data to our end users. Remember the OLD WAY: everyone talks about it forever in meetings, agrees on a schema, then an ETL developer starts the work.
  • #21: Another example: we leveraged the SAME LAKE to LAND that large amount of Experian data in HDFS. Then we used SPARK to perform ETL (convert data) and write text documents to ELASTICSEARCH. In this example we used the same lake, but extended our capabilities with Elasticsearch. REMEMBER THE OLD WAY: we would have loaded it into a standard RDBMS, with slow performance, and had to write our own queries and fuzzy matching. Large table scans. We would only have looked at a subset of the data because of its size. LONG time from idea to UNDERSTANDING!!!
  • #23: Another example of how we use the SAME LAKE: Streaming Analytics for Security and Operational logs – Splunk cost containment
  • #25: Get started. Check back to North Stars… Be mindful of transitions. SPEAKER: MAKE SURE YOU HAVE TRANSITION STATEMENTS - Add more….