Taewook Eom
Data Infrastructure Team
SK planet
2014-01-28
Taewook Eom
Data Programmer
Plaster(Planet Master)
of Big Data Infra
Pre-Assessor of Hiring Programmers
Mentor of 101 Startup Korea

Twitter: @taewooke
LinkedIn: http://kr.linkedin.com/in/taewookeom
http://www.flickr.com/photos/oreillyconf/10616622085/
Santa Clara
: Technical

New York
with Cloudera

: Financial, Business

Europe

: Privacy, Government

Boston
: Medical

http://strataconf.com/

by O’Reilly
Web 2.0

: Open, Sharing, Participation

Big Data

: Making Data Work
Change the World with Data.
Data
When hardware became commoditized,
software was valuable.
Now that software is being commoditized,
data is valuable.
– Tim O’Reilly, 2011

Data is like the blood of the enterprise.
– Amr Awadallah, CTO at Cloudera, 2013
What is Big Data?
All data that is not a fit for a traditional RDBMS,
whether used for OLTP or Analytics purposes

Big Data Architectural Patterns
http://strataconf.com/stratany2013/public/schedule/detail/30397
Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data
- Gartner, 2011

http://blog.vitria.com/Portals/47881/images/3values-resized-600.png
http://image-store.slidesharecdn.com/ae63030a-3d9b-11e3-9cff-22000a970267-original.jpg
Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS
http://strataconf.com/stratany2013/public/schedule/detail/29968
Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png
Big Data

http://mappingignorance.org/fx/media/2013/07/Figura-11.jpg

Open Mind!
Big Data

Gartner's 2013 Hype Cycle for Emerging Technologies (2013-08-19)
more than half of
technical sessions
are presented by
Chinese or Indian

39 of 125 sessions are
sponsored sessions
Big Data: 4 Approaches
Hadoop-based

RDB-based

Search-based

NoSQL
Real-time Processing

Real-time Recommendations for Retail: Architecture, Algorithms, and Design
http://strataconf.com/stratany2013/public/schedule/detail/30217
Real-time Stream Processing
Apache Kafka
: Gathering

Apache Storm
: Processing
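
The gather-then-process split above can be sketched with the Python standard library alone. This is a toy illustration, not the Kafka or Storm API: the in-memory `queue.Queue` only stands in for a Kafka topic, `process` stands in for a Storm bolt, and every name and sample event here is hypothetical.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # marks the end of the stream in this toy example


def gather(events, topic):
    """Producer side: push raw events onto the queue (stand-in for a Kafka topic)."""
    for event in events:
        topic.put(event)
    topic.put(SENTINEL)


def process(topic):
    """Consumer side: count events per item as they arrive (stand-in for a Storm bolt)."""
    counts = Counter()
    while True:
        event = topic.get()
        if event is SENTINEL:
            break
        counts[event["item"]] += 1
    return counts


def run_pipeline(events):
    """Run the producer in a thread and consume on the calling thread."""
    topic = queue.Queue()
    producer = threading.Thread(target=gather, args=(events, topic))
    producer.start()
    counts = process(topic)
    producer.join()
    return counts


# Invented click events.
clicks = [
    {"user": "a", "item": "shoes"},
    {"user": "b", "item": "hat"},
    {"user": "a", "item": "shoes"},
]
result = run_pipeline(clicks)
```

The point of the split is that the gathering side and the processing side only share the queue, so either one could be scaled or replaced independently.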
Querying

Streaming
Search-based
NoSQL
SQL

Stringer/Tez

Shark
… not yet Graph Processing
Big Data Space
No one tool is the right fit for all Big Data problems
Do not be afraid to recommend the right solution
for the problem over the popular solution
To do this, you must be aware of the entire ecosystem

Big Data Architectural Patterns
http://strataconf.com/stratany2013/public/schedule/detail/30397
Practical Performance Analysis and Tuning for Cloudera Impala
http://strataconf.com/stratany2013/public/schedule/detail/30551
Hadoop and the Relational Data Warehouse – When to Use Which?
http://strataconf.com/stratany2013/public/schedule/detail/30964
Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS
http://strataconf.com/stratany2013/public/schedule/detail/29968
Ignite
Signal Detection Theory: Man vs Machine
Co-Founder @VividCortex
Kyle Redinger
http://www.youtube.com/watch?v=Fg6mN-jevds
(5 minutes 6 seconds)
http://www.slideshare.net/realkyleredinger/man-vs-machine-signal-detection-theory-and-big-data
Signal Detection Theory: Man vs Machine

Remove the obvious and look at what is important
Remember: Less is more.
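
For readers unfamiliar with the term, signal detection theory commonly summarizes how well a detector separates signal from noise with the sensitivity index d' = z(hit rate) - z(false-alarm rate). The sketch below uses that textbook definition, which is not taken from the talk itself, and the rates passed to it are invented.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: distance between signal and noise in z-score units."""
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal distribution
    return z(hit_rate) - z(false_alarm_rate)


# Invented rates: a sharper detector versus a weaker one.
sharp = d_prime(0.9, 0.1)
weak = d_prime(0.6, 0.4)
```

A detector that raises alarms at the same rate whether or not a signal is present scores zero, however many alerts it produces.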
Keynote
Towards Strata 2014
Director of market research at O’Reilly Media
Roger Magoulas
http://www.youtube.com/watch?v=Ytd5VkEgQf8
(5 minutes 26 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31935

http://www.oreilly.com/data/free/files/stratasurvey.pdf
Towards Strata 2014
Science is fundamentally about data,
but data is not fundamentally about science
Beyond R and Ph.D.s: The Mythology of Data Science Debunked
Douglas Merrill (ZestFinance)
http://www.youtube.com/watch?v=J2sgObXbIWY (8 minutes 9 seconds)
People

A data scientist is a data analyst who lives in California.

– George Roumeliotis (Intuit)
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
Data Businessperson: Business person, Leader, Entrepreneur
Data Creative: Artist, Jack-of-All-Trades, Hacker
Data Researcher: Scientist, Researcher, Statistician
Data Engineer: Engineer, Developer

http://datacommunitydc.org/blog/2012/08/data-scientists-survey-results-teaser/

http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf
Scientists think they can code,
software engineers think they are scientists.
Team them up so they collaborate.

– Scott Sorenson (Ancestry.com)

Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce
http://strataconf.com/stratany2013/public/schedule/detail/30707
Data scientists spend their lives as data janitors
instead of leveraging their skills

– Wes McKinney (DataPad)

Building More Productive Data Science and Analytics Workflows
Keynote
Is Bigger Really Better?
Predictive Analytics
with Fine-grained Behavior Data
Professor at the NYU Stern School of Business
Foster Provost
http://www.youtube.com/watch?v=1jzMiAfLH2c
(10 minutes 16 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31685
Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data

Predictive does not mean actionable.

– Scott Sorenson (Ancestry.com)

Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
More data gives you more precision, not more prediction.
Using multiple datasets to reduce errors when measuring values.
- Ravi Iyer (Ranker.com)
Using Graphs of Data to Understand your Customers, Users, and Employees
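
The precision point can be illustrated with a small simulation: averaging several independent noisy measurements of the same value shrinks the measurement error, but it adds no information about future values. This is a sketch under invented assumptions (a true value of 10.0, Gaussian noise with standard deviation 2.0, and a hypothetical helper name), not anything shown in the talk.

```python
import random
import statistics


def mean_abs_error(n_sources, trials=2000, true_value=10.0, noise_sd=2.0, seed=0):
    """Average absolute error of the mean of `n_sources` noisy measurements."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        readings = [rng.gauss(true_value, noise_sd) for _ in range(n_sources)]
        errors.append(abs(statistics.fmean(readings) - true_value))
    return statistics.fmean(errors)


error_one = mean_abs_error(1)       # a single dataset
error_sixteen = mean_abs_error(16)  # sixteen independent datasets combined
```

With independent noise the error falls roughly with the square root of the number of sources, so sixteen sources give about a quarter of the single-source error.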
Keynote
Big Impact from Big Data
Head of Analytics at Facebook
Ken Rudin
http://www.youtube.com/watch?v=RJFwsZwTBgg
(11 minutes 57 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31903
Hadoop is a hammer,
but you need other tools along with it.

Designing Your Data-Centric Organization
Josh Klahr (Pivotal)

http://www.youtube.com/watch?v=D86udfrVzrI (12 minutes)
Big Impact from Big Data

The way you organize information
depends on the question
you intend to ask of it.

- Richard Saul Wurman
Building a Data Platform
HaDump

: Loading data into Hadoop
for no reason.

Data Science Without a Scientist
http://strataconf.com/stratany2013/public/schedule/detail/31801
Big Impact from Big Data

Technical people still don't understand the business needs of business people!
Business people don't know what a table is.

- Anurag Tandon (MicroStrategy)

Inject Big Data into your Corporate DNA: Enable Every Employee to Make Data Driven Decisions
Ask the Right Questions
Organizations already have people who know their own data
better than mystical data scientists.
Learning Hadoop is easier than learning the company’s business.
- Gartner, 2012

Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS
http://strataconf.com/stratany2013/public/schedule/detail/29968
Non-linear Storytelling: Towards New Methods and Aesthetics for Data Narrative
http://strataconf.com/stratany2013/public/schedule/detail/30207
Every Soldier is a Sensor: Countering Corruption in Afghanistan
http://strataconf.com/stratany2013/public/schedule/detail/30828
Big Impact from Big Data
Value of Data
Usable < Useful < Actionable
with Impact

If you can't answer "so what?",
you only have facts, not insight
- Baron Schwartz (VividCortex Inc)
Making Big Data Small

Descriptive (Easy)
Predictive (Medium)
Prescriptive (Hard)

What happened?
What will happen?
What should we do about it?

Hadoop & Data Science for the Enterprise
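
The three levels can be made concrete on a toy series. The sketch below is an illustration only: the weekly sales figures, the stock level, and the reorder rule are all invented, and the forecast is a plain least-squares line rather than any method from the session.

```python
from statistics import fmean

weekly_sales = [100, 110, 120, 130]  # invented numbers


def describe(sales):
    """Descriptive: what happened?"""
    return {"mean": fmean(sales), "last": sales[-1]}


def predict_next(sales):
    """Predictive: what will happen? Least-squares line extrapolated one step."""
    n = len(sales)
    xs = range(n)
    x_mean, y_mean = fmean(xs), fmean(sales)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * n


def prescribe(forecast, stock):
    """Prescriptive: what should we do about it? A hypothetical reorder rule."""
    shortfall = forecast - stock
    return f"order {shortfall:.0f} units" if shortfall > 0 else "no order needed"


summary = describe(weekly_sales)
forecast = predict_next(weekly_sales)
action = prescribe(forecast, stock=120)
```

Each step needs more than the one before it: the description needs only the data, the prediction needs a model, and the prescription also needs a business rule.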
The Future of Hadoop
: What Happened
& What's Possible?
Co-Founder of Hadoop
Doug Cutting
http://www.youtube.com/watch?v=_WwuZI6AhN8
(14 minutes 41 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31591

Big Data is the first industry that was created
by open source.

- Jack Norris (MapR Technologies)
Separating Hadoop Myths from Reality

Hadoop is the kernel of the OS for data.
Hadoop's Impact on the Future of Data Management
Mike Olson (Cloudera)

http://www.youtube.com/watch?v=puHS2JNKgRM
http://strataconf.com/stratany2013/public/schedule/detail/31380
Single
: S/W & H/W system
: security model
: management model
: metadata model
: audit model
: resource management model

Common
: storage & schema
http://www.slideshare.net/cloudera/enterprise-data-hub-the-next-big-thing-in-big-data
Last generation of data management is not sufficient
More copies, representations, transformations increase risk
Index once and reuse across workloads, lifecycle
NoSQL: indexing and updates for interactive apps
Hadoop: staging, persistence, and analytics

Data Governance for Regulated Industries Using Hadoop
http://strataconf.com/stratany2013/public/schedule/detail/30738
Data Intelligence
Rethink How You See Data

Sharmila Shahani-Mulligan (ClearStory Data)

http://www.youtube.com/watch?v=07hGulTOZGk (9 minutes 6 seconds)
http://strataconf.com/stratany2013/public/schedule/detail/31742
The Data Availability Problem

?

Access

Question
Sampling

Analysis & Discovery
Modeling

Loading
Insight

Data Prep – too slow!

Information Supply Chain
Introducing a New Way to Interact with Insight
http://strataconf.com/stratany2013/public/schedule/detail/31743

Presentation
Running Non-MapReduce Big Data applications on Apache Hadoop
http://strataconf.com/stratany2013/public/schedule/detail/30755
Apache HBase for Architects
http://strataconf.com/stratany2013/public/schedule/detail/30619
What’s Next for Apache HBase: Multi-tenancy, Predictability, and Extensions.
http://strataconf.com/stratany2013/public/schedule/detail/30857
Securing the Apache Hadoop Ecosystem
http://strataconf.com/stratany2013/public/schedule/detail/30302
An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB
http://strataconf.com/stratany2013/public/schedule/detail/30959
Schema
Information does not exist until a schema is defined
and data is stored in a relational database

- anonymous

Building a Data Platform
http://strataconf.com/stratany2013/public/schedule/detail/31400
Lessons Learned From A Decade’s Worth of Big Data At The U.S. National Security Agency (NSA)
http://strataconf.com/stratany2013/public/schedule/detail/30913
Managing a Rapidly Evolving Analytics Pipeline
http://strataconf.com/stratany2013/public/schedule/detail/30635
Stringer/Tez

Shark

SQL on/in Hadoop/HBase Solutions

Perception is Key: Telescopes, Microscopes and Data
http://strataconf.com/strataeu2013/public/schedule/detail/32351
All SQL on Hadoop Solutions are
Missing the Point of Hadoop
Every Solution makes you define a schema

- SQL(Structured Query Language) is expressed over an assumed schema

Major reasons why Hadoop has taken off include:

- Ability to load data without defining a schema
- Process data using schema-on-read instead of first defining a schema

Hadoop contains a lot of:

- Raw, granular data sets with potentially inconsistent schemas
- Data sets in JSON, key-value, and other self-describing (non-relational) models
designed for schema-on-read processing

SQL on Hadoop solutions that make you first define a schema are missing
a major part of Hadoop’s usage patterns

Flexible Schema and the End of ETL
http://strataconf.com/stratany2013/public/schedule/detail/31868
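
Schema-on-read, as described above, can be sketched in a few lines: raw self-describing records are stored as they arrive, and a schema is applied only when a particular question is asked. The sample records, field names, and the `read_plays` helper are invented for illustration and do not come from the session.

```python
import json

# Raw, self-describing records with inconsistent fields, stored as-is (invented data).
raw_lines = [
    '{"user": "a", "event": "play", "track_id": 42}',
    '{"user": "b", "event": "search", "query": "jazz"}',
    '{"user": "a", "event": "play", "track": {"id": 7}}',
]


def read_plays(lines):
    """Apply a schema at read time: yield (user, track_id) for play events only."""
    for line in lines:
        record = json.loads(line)
        if record.get("event") != "play":
            continue
        # Tolerate both layouts seen in the raw data.
        track_id = record.get("track_id")
        if track_id is None:
            track_id = record.get("track", {}).get("id")
        yield record["user"], track_id


plays = list(read_plays(raw_lines))
```

Nothing had to be declared before loading; a different question, such as search queries per user, would simply read the same raw lines through a different function.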
Lessons Learned
Hadoop Adventures At Spotify
http://strataconf.com/stratany2013/public/schedule/detail/30570
Quick prototyping is the fastest way to internal advocacy. Ship It!
Cloud == Speed
We don’t always need a complicated solution. KISS
Play to your differentiating strengths. Experience >> Data
Bias towards impact.
It Takes a Village
EASE!! (Emulate, Analyze, Scale, Evaluate)
How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce
http://strataconf.com/stratany2013/public/schedule/detail/30707

Prototyping is key to overcoming resistance to change
Technical architecture is heavily influenced by people organization
Developing a team of experienced Hadoop users can often be done
using internal employees
A culture of experimentation and innovation yields the best results
Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
http://strataconf.com/stratany2013/public/schedule/detail/30499
Strata Conference NYC 2013
Questions?
SELECT questions FROM audience;
References
Strata Conference + Hadoop World 2013 Keynotes & Interviews

http://www.youtube.com/playlist?list=PL055Epbe6d5ZtziVAooUC04i1hL_Z9Xvk

Slides & Video

http://strataconf.com/stratany2013/public/schedule/proceedings

Tweets

https://twitter.com/search?q=%23strataconf #strataconf

Strata Conference NYC 2013

  • 1. Taewook Eom Data Infrastructure Team SK planet 2014-01-28
  • 2. Taewook Eom Data Programmer Plaster(Planet Master) of Big Data Infra Pre-Assessor of Hiring Programmers Mentor of 101 Startup Korea Twitter: @taewooke LinkedIn: http://kr.linkedin.com/in/taewookeom http://www.flickr.com/photos/oreillyconf/10616622085/
  • 3. Santa Clara : Technical New York with Cloudera : Financial, Business Europe : Privacy, Government Boston : Medical http://strataconf.com/ by O’Reilly Web 2.0 : Open, Sharing, Participation Big Data : Making Data Work Change the World with Data.
  • 4. Data When hardware became commoditized, software was valuable. Now software being commoditized, data is valuable. – Tim O’Reilly, 2011 Data is like the blood of the enterprise. – Amr Awadallah, CTO at Cloudera, 2013
  • 5. What is Big Data? All data that is not a fit for a traditional RDBMS, whether used for OLTP or Analytics purposes Big Data Architectural Patterns http://strataconf.com/stratany2013/public/schedule/detail/30397
  • 6. Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data - Gartner, 2011 http://blog.vitria.com/Portals/47881/images/3values-resized-600.png
  • 8. Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968
  • 11. Big Data Gartner's 2013 Hype Cycle for Emerging Technologies (2013-08-19)
  • 12. more than half of technical sessions are presented by Chinese or Indian 39 of 125 sessions are sponsored sessions
  • 13. Big Data: 4 Approaches Hadoop-based RDB-based Search-based NoSQL
  • 14. Real-time Processing Real-time Recommendations for Retail: Architecture, Algorithms, and Design http://strataconf.com/stratany2013/public/schedule/detail/30217
  • 16. … not yet Graph Processing
  • 17. Big Data Space No one tools is the right fit for all Big Data problem Do not be afraid to recommend the right solution for the problem over the popular solution To do this, you must be aware of the entire ecosystem Big Data Architectural Patterns http://strataconf.com/stratany2013/public/schedule/detail/30397
  • 18. Practical Performance Analysis and Tuning for Cloudera Impala http://strataconf.com/stratany2013/public/schedule/detail/30551
  • 19. Hadoop and the Relational Data Warehouse – When to Use Which? http://strataconf.com/stratany2013/public/schedule/detail/30964
  • 20. Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968
  • 21. Ignite Signal Detection Theory: Man vs Machine Co-Founder @VividCortex Kyle Redinger http://www.youtube.com/watch?v=Fg6mN-jevds (5 minutes 6 seconds) http://www.slideshare.net/realkyleredinger/man-vs-machine-signal-detection-theory-and-big-data
  • 22. Signal Detection Theory: Man vs Machine Remove the obvious and look at what is important Remember: Less is more.
  • 23. Keynote Towards Strata 2014 Director of market research at O’Reilly Media Roger Magoulas http://www.youtube.com/watch?v=Ytd5VkEgQf8 (5 minutes 26 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31935 http://www.oreilly.com/data/free/files/stratasurvey.pdf
  • 28. Science is fundamentally about data, but data is not fundamentally about science Beyond R and Ph.D.s: The Mythology of Data Science Debunked Douglas Merrill (ZestFinance) http://www.youtube.com/watch?v=J2sgObXbIWY (8 minutes 9 seconds)
  • 29. People A data scientist is a data analyst who lives in California. – George Roumeliotis, (Intuit)
  • 31. Data Data Data Data Businessperson: Business person, Leader, Entrepreneur Creative: Artist, Jack-of-All-Trades, Hacker Researcher: Scientist, Researcher, Statistician Engineer: Engineer, Developer http://datacommunitydc.org/blog/2012/08/data-scientists-survey-results-teaser/ http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf
  • 32. Scientists think they can code, software engineers think they are scientists. Team them up so they collaborate. – Scott Sorenson (Ancestry.com) Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
  • 33. How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce http://strataconf.com/stratany2013/public/schedule/detail/30707
  • 34. Data scientists spend their lives as data janitors instead of leveraging their skills – Wes McKinney (DataPad) Building More Productive Data Science and Analytics Workflows
  • 35. Keynote Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data Professor at the NYU Stern School of Business Foster Provost http://www.youtube.com/watch?v=1jzMiAfLH2c (10 minutes 16 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31685
  • 36. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  • 37. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  • 38. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data Predictive does not mean actionable. – Scott Sorenson (Ancestry.com) Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
  • 39. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data More data gives you more precision, not more prediction. Using multiple datasets to reduce errors when measuring values. - Ravi Iyer (Ranker.com) Using Graphs of Data to Understand your Customers, Users, and Employees
  • 40. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  • 41. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  • 42. Keynote Big Impact from Big Data Head of Analytics at Facebook Ken Rudin http://www.youtube.com/watch?v=RJFwsZwTBgg (11 minutes 57 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31903
  • 43. Big Impact from Big Data
  • 44. Hadoop is a hammer, but you need other tools along with it. Designing Your Data-Centric Organization Josh Klahr (Pivotal) http://www.youtube.com/watch?v=D86udfrVzrI (12 minutes)
  • 45. Big Impact from Big Data The way you organize information depends on the question you intend to ask of it. - Richard Saul Wurman Building a Data Platform
  • 46. HaDump: Loading data into Hadoop for no reason. Data Science Without a Scientist http://strataconf.com/stratany2013/public/schedule/detail/31801
  • 47. Big Impact from Big Data Technical people still don't understand the business needs of business people! Business people don't know what a table is. - Anurag Tandon (MicroStrategy) Inject Big Data into your Corporate DNA: Enable Every Employee to Make Data Driven Decisions
  • 48. Ask the Right Questions Organizations already have people who know their own data better than mystical data scientists. Learning Hadoop is easier than learning the company’s business. - Gartner, 2012 Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968
  • 49. Non-linear Storytelling: Towards New Methods and Aesthetics for Data Narrative http://strataconf.com/stratany2013/public/schedule/detail/30207
  • 50. Every Soldier is a Sensor: Countering Corruption in Afghanistan http://strataconf.com/stratany2013/public/schedule/detail/30828
  • 51. Big Impact from Big Data
  • 52. Big Impact from Big Data
  • 53. Big Impact from Big Data
  • 54. Value of Data Usable < Useful < Actionable with Impact If you can't answer "so what?", you only have facts, not insight. - Baron Schwartz (VividCortex Inc) Making Big Data Small Descriptive (Easy): What happened? Predictive (Medium): What will happen? Prescriptive (Hard): What should we do about it? Hadoop & Data Science for the Enterprise
  • 55. The Future of Hadoop: What Happened & What's Possible? Co-Founder of Hadoop Doug Cutting http://www.youtube.com/watch?v=_WwuZI6AhN8 (14 minutes 41 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31591 Big Data is the first industry that was created by open source. - Jack Norris (MapR Technologies) Separating Hadoop Myths from Reality Hadoop: the kernel of the OS for data.
  • 56. Hadoop's Impact on the Future of Data Management Mike Olson (Cloudera) http://www.youtube.com/watch?v=puHS2JNKgRM http://strataconf.com/stratany2013/public/schedule/detail/31380
  • 57. Single: S/W & H/W system, security model, management model, metadata model, audit model, resource management model. Common: storage & schema. http://www.slideshare.net/cloudera/enterprise-data-hub-the-next-big-thing-in-big-data
  • 58. Last generation of data management is not sufficient. More copies, representations, transformations increase risk. Index once and reuse across workloads, lifecycle. NoSQL: indexing and updates for interactive apps. Hadoop: staging, persistence, and analytics. Data Governance for Regulated Industries Using Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30738
  • 59. Data Intelligence Rethink How You See Data Sharmila Shahani-Mulligan (ClearStory Data) http://www.youtube.com/watch?v=07hGulTOZGk (9 minutes 6 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31742
  • 60. The Data Availability Problem [Figure: Information Supply Chain, with labels Question, Access, Sampling, Loading, Modeling, Analysis & Discovery, Insight; annotation: Data Prep – too slow!] Introducing a New Way to Interact with Insight http://strataconf.com/stratany2013/public/schedule/detail/31743
  • 61. Running Non-MapReduce Big Data applications on Apache Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30755
  • 62. Apache HBase for Architects http://strataconf.com/stratany2013/public/schedule/detail/30619 What’s Next for Apache HBase: Multi-tenancy, Predictability, and Extensions. http://strataconf.com/stratany2013/public/schedule/detail/30857
  • 63. Securing the Apache Hadoop Ecosystem http://strataconf.com/stratany2013/public/schedule/detail/30302
  • 64. An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB http://strataconf.com/stratany2013/public/schedule/detail/30959
  • 65. Schema Information does not exist until a schema is defined and data is stored in a relational database - anonymous Building a Data Platform http://strataconf.com/stratany2013/public/schedule/detail/31400
  • 66. Lessons Learned From A Decade’s Worth of Big Data At The U.S. National Security Agency (NSA) http://strataconf.com/stratany2013/public/schedule/detail/30913
  • 67. Managing a Rapidly Evolving Analytics Pipeline http://strataconf.com/stratany2013/public/schedule/detail/30635
  • 68. SQL on/in Hadoop/HBase Solutions: Stinger/Tez, Shark. Perception is Key: Telescopes, Microscopes and Data http://strataconf.com/strataeu2013/public/schedule/detail/32351
  • 69. All SQL on Hadoop Solutions are Missing the Point of Hadoop. Every solution makes you define a schema: SQL (Structured Query Language) is expressed over an assumed schema. Major reasons why Hadoop has taken off include: the ability to load data without defining a schema; processing data using schema-on-read instead of first defining a schema. Hadoop contains a lot of: raw, granular data sets with potentially inconsistent schemas; data sets in JSON, key-value, and other self-describing (non-relational) models designed for schema-on-read processing. SQL on Hadoop solutions that make you first define a schema are missing a major part of Hadoop’s usage patterns. Flexible Schema and the End of ETL http://strataconf.com/stratany2013/public/schedule/detail/31868
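The schema-on-read idea on slide 69 can be illustrated with a minimal sketch in plain Python. This is not Hadoop code and not taken from the talk: the sample records, the field names, and the helpers `read_with_schema` and `count_plays_per_user` are all hypothetical. The raw lines are stored as-is, and each query decides at read time which fields it needs.

```python
import json

# Hypothetical raw event log: one self-describing JSON object per line.
# The records do not share a single fixed schema.
RAW_LINES = [
    '{"user": "a", "event": "play", "track_id": 17}',
    '{"user": "b", "event": "search", "query": "jazz"}',
    '{"user": "a", "event": "play", "track_id": 42, "device": "mobile"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: project each raw record onto `fields`.

    Fields missing from a record come back as None instead of causing the
    load to fail, so no schema has to be declared before storing the data.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


def count_plays_per_user(lines):
    """One query's view of the data: it only needs `user` and `event`."""
    counts = {}
    for row in read_with_schema(lines, ["user", "event"]):
        if row["event"] == "play":
            counts[row["user"]] = counts.get(row["user"], 0) + 1
    return counts


# Two different "schemas" read from the same untouched raw data.
plays = count_plays_per_user(RAW_LINES)
devices = [row["device"] for row in read_with_schema(RAW_LINES, ["device"])]
```

A schema-on-write system would instead reject or reshape the second and third records at load time, because they carry fields the first record does not.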
  • 71. Hadoop Adventures At Spotify http://strataconf.com/stratany2013/public/schedule/detail/30570
  • 72. Hadoop Adventures At Spotify http://strataconf.com/stratany2013/public/schedule/detail/30570
  • 73. Quick prototyping is the fastest way to internal advocacy. Ship It! Cloud == Speed. We don’t always need a complicated solution. KISS. Play to your differentiating strengths. Experience >> Data. Bias towards impact. It Takes a Village. EASE!! (Emulate, Analyze, Scale, Evaluate). How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce http://strataconf.com/stratany2013/public/schedule/detail/30707 Prototyping is key to overcoming resistance to change. Technical architecture is heavily influenced by people organization. Developing a team of experienced Hadoop users can often be done using internal employees. A culture of experimentation and innovation yields the best result. Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30499
  • 76. References Strata Conference + Hadoop World 2013 Keynotes & Interviews http://www.youtube.com/playlist?list=PL055Epbe6d5ZtziVAooUC04i1hL_Z9Xvk Slides & Video http://strataconf.com/stratany2013/public/schedule/proceedings Tweets https://twitter.com/search?q=%23strataconf #strataconf