Processamento Intensivo de Dados
Intensive Data Processing (Big Data)
Nelson F. F. Ebecken
NTT/COPPE/UFRJ
Your Big Data Is Worthless if You Don’t Bring It Into the Real World 
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the real--world/
Big Data 
Big Data refers to data that is too big to fit on a 
single server, too unstructured to fit into a 
row-and-column database, or too 
continuously flowing to fit into a static data 
warehouse (Thomas H. Davenport)
Big Data and traditional analytics

| | Big Data | Traditional analytics |
|---|---|---|
| Type of data | Unstructured formats | Formatted in rows and columns |
| Volume of data | 100 terabytes to petabytes | Tens of terabytes or less |
| Flow of data | Constant flow of data | Static pool of data |
| Analysis methods | Machine learning | Hypothesis-based |
| Primary purpose | Data-based products | Internal decision support and services |
A menu of big data possibilities

| Style of data | Source of data | Industry affected | Function affected |
|---|---|---|---|
| Large volume | Online | Financial services | Marketing |
| Unstructured | Video | Health care | Supply chain |
| Continuous flow | Sensor | Manufacturing | Human resources |
| Multiple formats | Genomic | Travel/transport | Finance |
Terminology for using and analyzing data

| Term | Time frame | Specific meaning |
|---|---|---|
| Decision support | 1970-1985 | Use of data analysis to support decision making |
| Executive support | 1980-1990 | Focus on data analysis for decisions by senior executives |
| Online analytical processing (OLAP) | 1990-2000 | Software for analysing multidimensional data tables |
| Business intelligence | 1989-2005 | Tools to support data-driven decisions, with emphasis on reporting |
| Analytics | 2005-2010 | Focus on statistical and mathematical analysis for decisions |
| Big Data | 2010-present | Focus on very large, unstructured, fast-moving data |
How important is Big Data to You and Your Organization?
• Has your management team considered some of the new types of data that may affect your business and industry, both now and in the next several years?
• Have you discussed the term big data and whether it's a good description of what your organization is doing with data and analytics?
• Are you beginning to change your decision-making processes toward a more continuous approach driven by the continuous availability of data?
• Has your organization adopted faster and more agile approaches to analyzing and acting on important data and analysis?
• Are you beginning to focus more on external information about business and market environments?
• Have you made a big bet on big data?
Big data is going to reshape a lot of different businesses and industries
• Every industry that moves things
• Every industry that sells to consumers
• Every industry that employs machinery
• Every industry that sells or uses content
• Every industry that provides service
• Every industry that has physical facilities
• Every industry that involves money
Responsibility locus for big data projects

| | Discovery | Production |
|---|---|---|
| Cost savings | IT innovation group | IT architecture and operations |
| Faster decisions | Business unit or function analytics group | Business unit or function executive |
| Better decisions | Business unit or function analytics group | Business unit or function executive |
| Product/service innovation | R&D or product development group | Product development or product management |
Overview of technologies for big data

| Technology | Definition |
|---|---|
| Hadoop | Open source software for processing big data across multiple parallel servers |
| MapReduce | The architectural framework on which Hadoop is based |
| Scripting languages | Programming languages that work well with big data (Python, Pig, Hive...) |
| Machine learning | Algorithms for rapidly finding the model that best fits a data set |
| Visual analytics | Display of analytical results in visual or graphic formats |
| Natural language processing (NLP) | Algorithms for analyzing text, frequencies, meanings,... |
| In-memory analytics | Processing big data in computer memory for greater speed |
MapReduce
MapReduce is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers.
• It was originally developed by Google
• In 2003, Google published the paper describing its distributed file system, called GFS
• In 2004, Google published the paper that introduced MapReduce
MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, an Apache project whose development was led by Yahoo.
Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input key/value pair
• Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Produces a set of merged output values (usually just one)
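As a concrete illustration of the two functions above, here is a minimal single-process sketch of a word count in Python. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and not part of Hadoop or any other framework. The in-memory grouping step only stands in for the shuffle that a real cluster would perform.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(in_key: str, in_value: str) -> Iterator[Tuple[str, int]]:
    """map(in_key, in_value) -> list(out_key, intermediate_value)."""
    for word in in_value.lower().split():
        yield word, 1


def reduce_fn(out_key: str, values: Iterable[int]) -> List[int]:
    """reduce(out_key, list(intermediate_value)) -> list(out_value)."""
    return [sum(values)]


def run_mapreduce(records, mapper, reducer) -> Dict[str, List[int]]:
    """Hypothetical local driver: group intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate in mapper(in_key, in_value):
            groups[out_key].append(intermediate)
    # Keys are visited in sorted order, as a reducer would see them.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


# Illustrative input: (document id, document text) pairs.
documents = [("doc1", "big data is big"), ("doc2", "data flows")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

Each reducer call returns a one-element list, matching the "usually just one" merged output value noted above.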
Map-Reduce
• Parallel programming for large masses of data
[Figure: MapReduce data flow in three stages: Map/Combine/Partition, Shuffle, Sort/Reduce. Three inputs each feed a Map task that emits key/val pairs; the pairs are shuffled to three Reduce tasks, each of which writes an output.]
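The stages named above (Map/Combine/Partition, Shuffle, Sort/Reduce) can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not how Hadoop implements them. The function names, `NUM_REDUCERS`, and the CRC32-based partitioner are all hypothetical choices made for the example.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # assumed for illustration


def map_words(line):
    """Map: emit (word, 1) for each word of one input split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combine: pre-aggregate on the mapper side to reduce shuffle traffic."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partition(key, num_reducers):
    """Partition: deterministic choice of reducer for a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


splits = ["a b a", "b c", "a c c"]

# Map, combine and partition; appending to a reducer's bucket models the shuffle.
buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for split in splits:
    for key, value in combine(map_words(split)):
        buckets[partition(key, NUM_REDUCERS)][key].append(value)

# Sort and reduce: each reducer visits its keys in sorted order and sums them.
outputs = [[(key, sum(bucket[key])) for key in sorted(bucket)] for bucket in buckets]

totals = dict(pair for part in outputs for pair in part)
print(totals)
```

Because the partitioner depends only on the key, every value for a given key reaches the same reducer, which is what lets each reducer produce a final count on its own.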
Why learn models in MapReduce?
• High data throughput
  – Stream about 100 Tb per hour using 500 mappers
• Framework provides fault tolerance
  – Monitors mappers and reducers and restarts tasks on other machines should one of the machines fail
• Excels in counting patterns over data records
• Built on relatively cheap, commodity hardware
  – No special-purpose computing hardware
• Large volumes of data are increasingly being stored on Grid clusters running MapReduce
  – Especially in the internet domain
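To put the throughput figure above in per-mapper terms, the arithmetic can be worked through as follows. Two assumptions are not stated in the slides: "Tb" is read as decimal terabytes (10^12 bytes), and the load is spread evenly across mappers. If terabits were meant, the results would be eight times smaller.

```python
# Assumptions (not stated in the slides): "Tb" is read as decimal
# terabytes, and the load is spread evenly over all mappers.
TOTAL_BYTES_PER_HOUR = 100 * 10**12
NUM_MAPPERS = 500
SECONDS_PER_HOUR = 3600

bytes_per_mapper_per_hour = TOTAL_BYTES_PER_HOUR / NUM_MAPPERS
mb_per_mapper_per_second = bytes_per_mapper_per_hour / SECONDS_PER_HOUR / 10**6

print(f"{bytes_per_mapper_per_hour / 10**9:.0f} GB per mapper per hour")
print(f"{mb_per_mapper_per_second:.1f} MB per mapper per second")
```

Under these assumptions each mapper streams about 200 GB per hour, or roughly 56 MB per second.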
Why learn models in MapReduce?
• Learning can become limited by computation time and not data volume
  – With large enough data and number of machines
  – Reduces the need to down-sample data
  – More accurate parameter estimates compared to learning on a single machine for the same amount of time
Learning models in MapReduce
• A primer for learning models in MapReduce (MR)
  – Illustrate techniques for distributing the learning algorithm in a MapReduce framework
  – Focus on the mapper and reducer computations
• Data-parallel algorithms are most appropriate for MapReduce implementations
• Not necessarily the optimal implementation for a specific algorithm
  – Other specialized non-MapReduce implementations exist for some algorithms, which may be better
• MR may not be the appropriate framework for exact solutions of non-data-parallel/sequential algorithms
  – Approximate solutions using MapReduce may be good enough
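One way to see why data-parallel algorithms fit MapReduce well is simple least-squares regression, which is not taken from the slides but is a standard example. Each mapper can summarise its shard as five partial sums, and a reducer only has to add them up and solve for the line. The sketch below runs in a single process, and all names in it (`map_shard`, `combine_stats`, `solve`) are hypothetical.

```python
from functools import reduce
from typing import List, Tuple

Point = Tuple[float, float]
# Sufficient statistics: n, sum_x, sum_y, sum_xx, sum_xy
Stats = Tuple[float, float, float, float, float]


def map_shard(shard: List[Point]) -> Stats:
    """Mapper: summarise one shard of (x, y) points as five partial sums."""
    n = float(len(shard))
    sum_x = sum(x for x, _ in shard)
    sum_y = sum(y for _, y in shard)
    sum_xx = sum(x * x for x, _ in shard)
    sum_xy = sum(x * y for x, y in shard)
    return (n, sum_x, sum_y, sum_xx, sum_xy)


def combine_stats(a: Stats, b: Stats) -> Stats:
    """Reducer step: partial sums from different shards simply add."""
    return tuple(p + q for p, q in zip(a, b))


def solve(stats: Stats) -> Tuple[float, float]:
    """Closed-form least-squares slope and intercept from the totals."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1, split across three shards.
shards = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]
partials = [map_shard(shard) for shard in shards]
slope, intercept = solve(reduce(combine_stats, partials))
print(slope, intercept)
```

The result is the same as fitting on the full data set in one place, because the sums do not depend on how the points are sharded. Algorithms whose updates depend on the order of the records do not decompose this way.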
Types of learning in MapReduce 
• Three common types of learning models using 
MapReduce framework 
1. Parallel training of multiple models 
– Train either in mappers or reducers 
2. Ensemble training methods 
– Train multiple models and combine them 
3. Distributed learning algorithms 
– Learn using both mappers and reducers 
Use the Grid as a large cluster of independent machines (with fault tolerance)
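The second type above, ensemble training, can be sketched as follows: each mapper trains a small model on its own partition, and the models are then combined by majority vote. The threshold classifier used here is a deliberately simple stand-in chosen for the example, not a method from the slides. The function names are hypothetical, and each partition is assumed to contain examples of both classes.

```python
from statistics import mean
from typing import List, Tuple

Example = Tuple[float, int]  # (feature value, class label 0 or 1)


def train_stump(partition: List[Example]) -> float:
    """Mapper: fit a threshold halfway between the two class means.

    Assumes the partition contains at least one example of each class.
    """
    mean0 = mean(x for x, label in partition if label == 0)
    mean1 = mean(x for x, label in partition if label == 1)
    return (mean0 + mean1) / 2


def predict(thresholds: List[float], x: float) -> int:
    """Combine the ensemble: predict class 1 if most thresholds are below x."""
    votes = sum(1 for threshold in thresholds if x > threshold)
    return 1 if votes * 2 > len(thresholds) else 0


# Illustrative partitions, one per mapper.
partitions = [
    [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)],
    [(0.0, 0), (1.0, 0), (5.0, 1), (6.0, 1)],
    [(2.0, 0), (4.0, 0), (8.0, 1), (10.0, 1)],
]
thresholds = [train_stump(partition) for partition in partitions]
print(thresholds, predict(thresholds, 5.0), predict(thresholds, 3.5))
```

No model sees the whole data set, so training needs no communication between machines. Only the small trained models have to be collected for the vote.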
Parallel training of multiple models
• Train multiple models simultaneously using a learning algorithm that can be learnt in memory
• Useful when individual models are trained using a subset, a filtered version, or a modification of the raw data
• Can train thousands of models simultaneously
• Essentially, treat the Grid as a large cluster of machines
  – Leverage fault tolerance of Hadoop
• Train 1 model in each reducer
  – Map:
    • Input: All data
    • Filters subset of data relevant for each model training
    • Output: <model_index, subset of data for training this model>
  – Reduce:
    • Train model on data corresponding to that model_index
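The map and reduce roles described above can be sketched in a single process as follows. The record fields, the `MODEL_FILTERS` table, and the mean predictor standing in for "a learning algorithm that can be learnt in memory" are all hypothetical and chosen only to keep the example short.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical model definitions: model_index -> filter selecting its training data.
MODEL_FILTERS = {
    "all_days": lambda record: True,
    "weekend": lambda record: record["day"] in ("sat", "sun"),
    "weekday": lambda record: record["day"] not in ("sat", "sun"),
}


def map_record(record):
    """Map: emit <model_index, record> for every model the record is relevant to."""
    for model_index, keep in MODEL_FILTERS.items():
        if keep(record):
            yield model_index, record


def reduce_train(model_index, subset):
    """Reduce: train one model on its subset (here, a mean predictor)."""
    return mean(record["sales"] for record in subset)


# Illustrative input: all data is fed to the mappers.
records = [
    {"day": "mon", "sales": 10.0},
    {"day": "sat", "sales": 20.0},
    {"day": "sun", "sales": 30.0},
    {"day": "tue", "sales": 20.0},
]

# Grouping by model_index stands in for the shuffle between map and reduce.
grouped = defaultdict(list)
for record in records:
    for model_index, relevant in map_record(record):
        grouped[model_index].append(relevant)

models = {index: reduce_train(index, subset) for index, subset in grouped.items()}
print(models)
```

A record can be emitted under several model indices, so the same raw data can feed many models without being filtered in advance.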
Apache Mahout 
Scalable to large data sets. Our core algorithms for clustering, classification and 
collaborative filtering are implemented on top of scalable, distributed systems. 
However, contributions that run on a single machine are welcome as well. 
Scalable to support your business case. Mahout is distributed under a 
commercially friendly Apache Software license. 
Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse 
community to facilitate discussions not only on the project itself but also on potential 
use cases. Come to the mailing lists to find out more. 
Currently Mahout supports mainly three use cases: Recommendation mining takes 
users' behavior and from that tries to find items users might like. Clustering takes 
e.g. text documents and groups them into groups of topically related documents. 
Classification learns from existing categorized documents what documents of a 
specific category look like and is able to assign unlabelled documents to the 
(hopefully) correct category. 
25 April 2014 - Goodbye MapReduce 
The Mahout community decided to move its codebase onto modern data processing systems that offer a richer 
programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new 
MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce 
algorithms in the codebase and maintain them. 
We are building our future implementations on top of a DSL for linear algebraic operations which has been 
developed over the last months. Programs written in this DSL are automatically optimized and executed in 
parallel on Apache Spark. 
Furthermore, there is an experimental contribution underway which aims to integrate the H2O platform into
Mahout.
Apache Spark™ is a fast and general engine for large-scale data processing. 
H2O is the open source in memory solution from 0xdata for predictive analytics on big data.
Matrix Methods with Hadoop
Slides: bit.ly/10SIe1A
Code: github.com/dgleich/matrix-hadoop-tutorial
David F. Gleich, Assistant Professor, Computer Science, Purdue University
ACM KDD 2014 
24-27/08 
New environments: Microsoft Azure ML Studio, Google 
Prediction API,… 
2 Research Sessions + Industry & Government 
Statistical Techniques for Big Data 
Scaling-up Methods for Big Data 
Topic Modeling
Big data & machine learning 
This is a huge field, growing very fast 
Many algorithms and techniques: 
can be seen as a giant toolbox with wide-ranging applications 
Ranging from the very simple to the extremely sophisticated 
Difficult to see the big picture 
Huge range of applications 
Math skills are crucial

Big data: Knowledge discovery in big data and cloud computing environments - Nelson Favilla
