Trends in Data Science
July 07, 2016
Albert Gavino
Talas Data Scientist
CSB_community
Trends in Data Science Domains
Data Science Domain                          Status
Statistics                                   traditional
Natural Language Processing (NLP)            entered the market
Predictive Analytics / Machine Learning      entered the market
Visualization / Dashboards                   entered the market
Image Processing (OpenCV)                    exploration
Internet of Things (IoT)                     exploration
Artificial Intelligence / Deep Learning      exploration
3 V’s of Big Data: Volume, Variety, Velocity
Trends in Programming Languages
Comparing R, Python and SQL
What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across
clusters of computers using simple programming models. It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability,
the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available
service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
● Hadoop Common: The common utilities that support the other Hadoop modules.
● Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access
to application data.
● Hadoop YARN: A framework for job scheduling and cluster resource management.
● Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
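The MapReduce module listed above follows a map, shuffle, reduce pattern. The sketch below illustrates that programming model in a single Python process using a word count; it does not call any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all counts emitted for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input only; a real job would read its input from HDFS.
    sample = ["big data big clusters", "data science"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions would run in parallel on different machines, with the framework handling the grouping step and any node failures.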
Hadoop Framework
Sample methodology
● Problem statement of the client: e.g. "We want to know how many clusters we have"
● Choose a machine learning algorithm appropriate for the problem (e.g. K-means)
● Visualization using Tableau
● Big data (Hadoop)
● Running R libraries and packages on your machine/server
● Insights
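The K-means step in the methodology above can be sketched as follows. This is a minimal pure-Python illustration, not the workflow shown in the talk (which runs R packages); the helper names, the `seed` parameter, and the sample points are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignments


def inertia(points, centroids, assignments):
    """Total squared distance from each point to its assigned centroid."""
    return sum(
        squared_distance(p, centroids[a]) for p, a in zip(points, assignments)
    )


if __name__ == "__main__":
    # Made-up data: two visibly separate groups of points.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for k in (1, 2, 3):
        cents, labels = kmeans(data, k)
        print(k, inertia(data, cents, labels))
```

K-means needs the number of clusters as an input, so for a question like "how many clusters do we have" one common heuristic is to compare the inertia across several values of k and look for the point where it stops dropping sharply.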
Things to get started
● Install R (open source)
● Install RStudio on top of R (open source, AGPL license)
● Install the packages you need (randomForest, ggplot2, etc.)
● Create a Kaggle account
● Practice on the competition use cases (see image processing)
● Share what you learn with other data science users and data scientists (meetups)
