DSA – 105 Introduction to
Data Science
Week 4 – Tools and Technologies in Data Science
Ferdin Joe John Joseph, PhD
Faculty of Information Technology
Thai-Nichi Institute of Technology
Week 4
Agenda
• Tools and Technologies in Data Science
Faculty of Information Technology, Thai - Nichi Institute of
Technology
Tools and Technologies in 2017
Programming Languages
• Python
• R
• Java
• C++
• Perl
• Matlab/Octave
Databases
• MySQL
• NoSQL
• Microsoft SQL Server
• Oracle
Data Analytics Tools
• SAS
• Tableau
• IBM SPSS Statistics
• Microsoft Excel
• Statistica
• RapidMiner
• SAP
API
• Scala
• TensorFlow
• Amazon Web Services
Servers and Application Frameworks
• Hadoop
• Spark
• Microsoft Azure
• Jupyter
Tools and Technologies in 2018
Recruiters’ Requirements 2018
Tools - Summary
R: The Most Popular Language for Data
Science
Once the data scientist has completed the often time-consuming process of “cleaning” and preparing the data
for analysis, R is a popular software package for doing the math and visualizing the results. An open-source
statistical modeling language, R has traditionally been popular in the academic community, which
means that many data scientists are already familiar with it.
R has thousands of extension packages that allow statisticians to undertake specialized tasks, including
text analysis, speech analysis, and genomic analysis. The center of a thriving open-source ecosystem,
R has become increasingly popular as programmers have created add-on packages for handling the big
datasets and parallel processing techniques that have come to dominate statistical modeling today.
• parallel helps R take advantage of parallel processing on both multicore Windows machines and clusters of
POSIX (OS X, Linux, UNIX) machines.
• snow helps divide R calculations across a cluster of computers, which is useful for computationally intensive
processes like simulations or machine learning.
• RHadoop and RHIPE allow programmers to interface with Hadoop from R, which is particularly important for
“MapReduce”: dividing a computing problem among the machines of a cluster (“map”) and then
recombining, or “reducing,” the partial results into a single answer.
R is used in industries like finance, health care, marketing, business, pharmaceutical development, and more.
Industry leaders like Bank of America, Bing, Facebook, and Foursquare use R to analyze their data, make
marketing campaigns more effective, and produce reports.
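The divide-then-combine idea behind MapReduce can be sketched in a few lines. The example below is only an illustration in plain Python, not Hadoop or RHadoop code: the function names (map_chunk, reduce_counts, word_count) and the sample sentences are invented for this sketch, and the "machines" are simulated by slices of a list.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, n_chunks=3):
    """Split the input, map each chunk, then reduce the partial counts."""
    # Each chunk stands in for the share of data held by one machine.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())


# Hypothetical input, invented for illustration.
sample_lines = [
    "big data needs big tools",
    "R and Hadoop handle big data",
    "tools for data science",
]
totals = word_count(sample_lines)
print(totals.most_common(3))
```

Because each chunk is counted independently, the map step could run on separate machines; only the small partial counts need to be brought together in the reduce step.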
Java & the Java Virtual Machine
Organizations that seek to write custom analytics tools from scratch
increasingly use the venerable language Java, as well as other
languages that run on the Java Virtual Machine (JVM). Java is an
object-oriented language whose syntax is modeled on C++, and because Java runs
on a platform-agnostic virtual machine, programs can be compiled
once and run anywhere a JVM is available.
The upside of using the JVM over a language compiled to run directly on
the processor is the reduction in development time. This simpler
development process has been a draw for data analytics, making JVM-based
data mining tools very popular. Also, Hadoop, the popular
open-source, distributed big data storage and analysis software, is
written in Java.
Java has rich open-source libraries for data mining, including Mahout and Weka, and the
JVM provides robust memory management and exception handling. Other programming
languages that run on the JVM include:
• Scala: This language offers efficiency comparable to Java’s because it compiles to bytecode that runs on the
JVM. It has also become increasingly popular in data mining because it lets
developers combine object-oriented programming (OOP) with functional
programming. Users of Scala include The Guardian, LinkedIn, Foursquare, Novell,
Siemens, Twitter, and the Spark data processing project that began at the UC Berkeley AMPLab.
• Clojure: A dialect of Lisp, a language family that dates to the late 1950s and has long been
associated with artificial intelligence research, Clojure is a
primarily (although not 100%) functional language that also runs on the JVM. Clojure
keeps data immutable by default and was designed for running concurrent processes. These features are
important because, in contrast, code that shares mutable state across concurrent processes will
sometimes attempt to write to the same variable simultaneously. Keeping data
structures immutable avoids this problem. Clojure has access to Java libraries and offers the
same development efficiencies as Java. Clojure can use the Lisp macro facility to
integrate with Hadoop and SQL. Users of Clojure include Netflix, Zendesk, Citibank,
Walmart Labs, and Spotify.
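Why immutable data helps with concurrency can be shown with a small sketch. It is written in Python rather than Clojure, and the names (READINGS, partial_sum, concurrent_total) are invented for illustration. Each worker only reads an immutable tuple and returns a new value, so no two workers ever write to the same variable.

```python
from concurrent.futures import ThreadPoolExecutor

# An immutable snapshot of the data: a tuple cannot be modified in place,
# so every worker can read it safely without locks.
READINGS = tuple(range(1, 101))


def partial_sum(chunk):
    """Compute a result from a read-only chunk and return a new value."""
    return sum(chunk)


def concurrent_total(n_workers=4):
    """Run the workers concurrently and combine their results afterwards."""
    # Slicing a tuple yields new tuples, which are immutable as well.
    chunks = [READINGS[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # No worker wrote to a shared variable; combining happens in one place.
    return sum(partials)


total = concurrent_total()
print(total)
```

The alternative design, in which every worker adds to one shared counter, would need a lock around each update to stay correct; returning fresh values sidesteps that problem.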
Python: A High-Level Programming Language
with Excellent Data Libraries
Python is a high-level language, meaning that its creators automated
certain housekeeping processes, such as memory management, in order to make code easier to write.
Python has robust libraries that support statistical modeling (SciPy and
NumPy), data mining (Orange and Pattern), and visualization
(Matplotlib).
Scikit-learn, a library of machine learning techniques very useful to data
scientists, has attracted developers from Spotify, OkCupid, and
Evernote, but can be challenging to master.
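As a small illustration of what "high-level" means in practice, the sketch below summarizes a list of numbers without any manual memory management or type declarations. It uses only Python's standard statistics module so that it runs without extra installs; libraries such as NumPy and SciPy provide similar operations for much larger datasets. The scores are invented for illustration.

```python
import statistics

# Hypothetical exam scores, invented for illustration.
scores = [72, 85, 90, 66, 85, 78, 94]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```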
Excel: Powerful Data Analytics on a Smaller
Scale
Excel can accomplish a lot of sophisticated analysis, and it’s easy to use and widely available.
It’s not the best choice for analyzing truly massive, unstructured datasets, for example,
some 30 million healthcare records distributed via Hadoop across dozens of servers.
However, it is surprisingly powerful for a variety of data analytics
projects at a small scale. These can include clustering, optimization,
and predictive modeling using supervised machine learning or forecasting
techniques.
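To make "forecasting at a small scale" concrete, here is a minimal sketch of a linear trend forecast, the kind of calculation a spreadsheet can do on a few rows of data. It is written in Python rather than as spreadsheet formulas; the function name fit_line and the sales figures are assumptions made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical monthly sales figures, invented for illustration.
months = [1, 2, 3, 4, 5, 6]
sales = [110, 120, 130, 140, 150, 160]

intercept, slope = fit_line(months, sales)
forecast_month_7 = intercept + slope * 7  # extend the fitted trend one month
print(round(forecast_month_7, 2))
```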
SAS (Statistical Analysis System): Data Mining
Software Suite
Used for advanced analytics, data management, and social media
analytics, SAS is a robust suite that’s popular for business intelligence
analysis of large and unstructured datasets. In 2015, SAS topped
the Gartner Magic Quadrant list in terms of “ability to execute” in the
category of advanced analytics platforms due to the breadth and
quality of its predictive modeling and data mining techniques. SAS offers a
well-regarded visualization tool and integrates with open-source tools
like R, Hadoop, and Python. It also puts significant effort into making its
tools backwards compatible, an important feature when looking at
older historical datasets.
Say, for example, a company’s sales records were prepared for use by
SAS in 1998. With backwards compatibility, they can still be read today.
In large organizations, employee turnover over the years puts a
premium on the continuity of tools. So, when a data scientist retires,
you won’t lose the ability to access their work if they preferred older
software that no one new to the position knows how to use.
SAS can be costly, has a complicated licensing structure that some
customers have found frustrating, and has a steep learning curve.
Even so, it’s a very popular option,
with more than 65,000 customers.
IBM: SPSS Modeler and SPSS Analytics
Forrester Research’s Wave report ranks IBM’s advanced data analytics platform
as the top offering in the advanced analytics category for its breadth of
tools that handle all elements of big data modeling: loading, “cleaning,”
preparing, and then predictive modeling, whether using statistical or
machine learning techniques.
Other makers of highly rated commercial tools for advanced data
analytics include SAP, KNIME, RapidMiner, Oracle, and Alteryx.
SPSS Modeler and SPSS Statistics, acquired by IBM in 2009, have
a loyal following among statisticians. These tools integrate with
Hadoop to facilitate file-system computing on big datasets. The
Social Media Analytics product helps data scientists harvest data from
Twitter, Facebook, and other platforms to perform customer sentiment
analysis. Gartner reports that the IBM advanced analytics platform has
lower customer satisfaction ratings than average, largely due to weak
customer support, inadequate documentation, and a challenging
installation process.
SQL vs. NoSQL Databases: Tackling the
“Messiness” of Big Data
Another important distinction in the world of data is SQL databases vs.
NoSQL databases, each of which is well suited to different types of
datasets. Here’s a quick look at what makes them different in the context of
data analysis.
The traditional “relational” database was designed for an era in which data
was far more expensive to collect and to store, and much more carefully
organized. Structured Query Language (SQL) has been the means by which
programmers move data into and out of those neatly categorized rows and
columns.
Only 5% of the world’s information is structured data; the rest
consists of articles, photos, videos, social media posts, machine-to-machine
communication, product inventory, and technical documents. To store this
unstructured data, data scientists turned to a different approach called “NoSQL.”
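The contrast can be sketched with a few lines of Python. The relational half uses the sqlite3 module from the standard library. The second half only imitates the document style of storage with plain dictionaries and is not a real NoSQL database (NoSQL also covers key-value, column, and graph stores). The table, field names, and records are invented for illustration.

```python
import json
import sqlite3

# Relational style: a fixed schema of rows and columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Bangkok"), (2, "Ben", "Chiang Mai")],
)
bangkok_names = [
    row[0]
    for row in conn.execute("SELECT name FROM customers WHERE city = ?", ("Bangkok",))
]
conn.close()

# Document style: each record describes itself, and records need not
# share the same fields or nesting.
documents = [
    {"id": 1, "name": "Ann", "city": "Bangkok", "tags": ["vip"]},
    {"id": 2, "name": "Ben", "posts": [{"text": "Hello", "likes": 3}]},
]
with_posts = [doc["name"] for doc in documents if "posts" in doc]

print(bangkok_names)
print(with_posts)
print(json.dumps(documents[1]))
```

In the table, every row must fit the declared columns; in the document list, the second record carries a nested list of posts that the first record simply does not have.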
Databases
MySQL: Open-Source RDBMS
MySQL is a widely used RDBMS (relational database management system) and one part of the LAMP software
stack. It came under Oracle’s ownership through Oracle’s acquisition of Sun Microsystems, announced in 2009
and completed in 2010. This free, open-source database management system is used by web
applications like WordPress and Drupal and by sites like Facebook, Twitter, and YouTube.
MongoDB
The most popular NoSQL database system available on the market is the open-source MongoDB, which has
been used by MetLife, The Weather Channel, Bosch, and Expedia. MongoDB has well-regarded customer
service, and the tool is particularly popular with startups.
One of the fastest-growing big data projects used alongside MongoDB is Apache Spark, a distributed computing
framework from the Apache Software Foundation that’s designed to operationalize real-time analytics. Paired up
with MongoDB, Spark allows organizations to put real-time analytics reporting to use.
Other commonly used open-source NoSQL databases include HBase and Cassandra. (MariaDB, sometimes listed
with them, is a relational database forked from MySQL.)
Oracle
Oracle has nearly 50% of the traditional relational database market, with products such as Oracle Database and
Oracle TimesTen. The database behemoth has also entered the market for unstructured data storage with
Oracle NoSQL, and the market for open-source SQL databases that compete with its proprietary offerings. While
Oracle’s proprietary products are popular and considered to be top-notch by many, they’re expensive.
SQL Server DBMS: Enterprise-Level Database Management
Microsoft SQL Server DBMS is a competitive enterprise-level database
management system that includes support for SQL and NoSQL
architectures, in-memory computing, the cloud, and analytics on
transactions. Existing customers are generally impressed with its
performance.
Other strong performers in the market include SAP, IBM, EnterpriseDB,
InterSystems, and MarkLogic.
File System Computing
Hadoop: File System Computing
What is “file system computing”? It’s a way to store and analyze truly massive datasets. For
example, 2 billion data points from sensors on an auto assembly line, stored on a cluster of
servers that are each connected to multiple drives, would be enormous. Because this kind of dataset is too large
to extract from the drives to a place where it can be analyzed, software like Hadoop was created.
Hadoop is an open-source software tool specially designed to help data scientists manage the unwieldiness of
big data. It eliminates the need to extract data from the storage devices altogether, bringing the analytics to
the data so it can be processed in place. It has increasingly become the industry standard for file system
computing projects involving big data, with prominent users including Facebook, Yahoo, and The New York
Times.
There are other platforms that do file system computing, such as SciDB, but Hadoop has risen to the top,
helped by its built-in MapReduce engine and by projects that extend its functionality, like Hive, Pig, and Spark.
Even software giants like Microsoft and IBM have created their own Hadoop tools, rather than reinventing the wheel.
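The idea of "bringing the analytics to the data" can be sketched as follows. This is plain Python, not Hadoop code: each inner list stands in for the readings stored on one server, and the names (PARTITIONS, local_summary, global_mean) and values are invented for illustration. Each server computes a tiny summary of its own data, and only those summaries are combined.

```python
# Each inner list stands in for the sensor readings held on one server.
PARTITIONS = [
    [20.5, 21.0, 19.5],
    [22.0, 23.5],
    [18.0, 19.0, 20.0, 21.5],
]


def local_summary(partition):
    """Runs where the data lives; returns only a small (sum, count) pair."""
    return sum(partition), len(partition)


def global_mean(partitions):
    """Combine the per-server summaries into one overall mean."""
    summaries = [local_summary(p) for p in partitions]
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count


mean_reading = global_mean(PARTITIONS)
print(round(mean_reading, 3))
```

However large each partition grows, only two numbers per server travel over the network, which is why the raw data never has to be extracted from the drives.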
R installation procedure
Follow the procedure at the link below to install R.
https://bit.ly/2MKNB4j
This will help you learn R along with basic mathematics and statistics
over the next month. The concepts you have learned so far in Java are enough
to accomplish this.
Next Week…
• Basic Mathematics I

2019 DSA 105 Introduction to Data Science Week 4

  • 1. DSA – 105 Introduction to Data Science Week 4 – Tools and Technologies in Data Science Ferdin Joe John Joseph, PhD Faculty of Information Technology Thai-Nichi Institute of Technology
  • 2. Week 4 Agenda • Tools and Technologies in Data Science Faculty of Information Technology, Thai - Nichi Institute of Technology 2
  • 3. Tools and Technologies in 2017 Faculty of Information Technology, Thai - Nichi Institute of Technology 3
  • 4. Programming Languages • Python • R • Java • C++ • Perl • Matlab/Octave Faculty of Information Technology, Thai - Nichi Institute of Technology 4
  • 5. Data Bases • MySQL • No SQL • Microsoft SQL server • Oracle Faculty of Information Technology, Thai - Nichi Institute of Technology 5
  • 6. Data Analytics Tools • SAS • Tableau • IBM SPSS Statistics • Microsoft Excel • Statistica • Rapid Miner • SAP Faculty of Information Technology, Thai - Nichi Institute of Technology 6
  • 7. API • Scala • Tensor Flow • Amazon Web Services Faculty of Information Technology, Thai - Nichi Institute of Technology 7
  • 8. Servers and Application Frameworks • Hadoop • Spark • Microsoft Azure • Jupyter Faculty of Information Technology, Thai - Nichi Institute of Technology 8
  • 9. Tools and Technologies in 2018 Faculty of Information Technology, Thai - Nichi Institute of Technology 9
  • 10. Recruiters Requirements 2018 Faculty of Information Technology, Thai - Nichi Institute of Technology 10
  • 11. Tools - Summary Faculty of Information Technology, Thai - Nichi Institute of Technology 11
  • 12. R: The Most Popular Language for Data Science Once the data scientist has completed the often time-consuming process of “cleaning” and preparing the data for analysis, R is a popular software package for actually doing the math and visualizing the results. An open- source statistical modeling language, R has traditionally been popular in the academic community, which means that lots of data scientists will be familiar with it. R has literally thousands of extension packages that allow statisticians to undertake specialized tasks, including text analysis, speech analysis, and tools for genomic sciences. The center of a thriving open-source ecosystem, R has become increasingly popular as programmers have created additional add-on packages for handling big datasets and parallel processing techniques that have come to dominate statistical modeling today. • Parallel helps R take advantage of parallel processing for both multicore Windows machines and clusters of POSIX (OS X, Linux, UNIX) machines. • Snow helps divvy up R calculations on a cluster of computers, which is useful for computationally intensive processes like simulations or AI learning processes. • Rhadoop and Rhipe allow programmers to interface with Hadoop from R, which is particularly important for the “MapReduce” function of dividing the computing problem among separate clusters and then re- combining or “reducing” all of the varying results into a single answer. R is used in industries like finance, health care, marketing, business, pharmaceutical development, and more. Industry leaders like Bank of America, Bing, Facebook, and Foursquare use R to analyze their data, make marketing campaigns more effective, and reporting. Faculty of Information Technology, Thai - Nichi Institute of Technology 12
  • 13. Java & the Java Virtual Machine Organizations that seek to write custom analytics tools from scratch increasingly use the venerable language Java, as well as other languages that run on the Java Virtual Machine (JVM). Java is a variation of the object-oriented C++ language, and because Java runs on a platform-agnostic virtual machine, programs can be compiled once and run anywhere. The upside of using the JVM over a language written to run directly on the processor is the reduction in development time. This simpler development process has been a draw for data analytics, making JVM- based data mining tools very popular. Also, Hadoop—the popular open-source, distributed big data storage and analysis software—is written in Java. Faculty of Information Technology, Thai - Nichi Institute of Technology 13
  • 14. Java & the Java Virtual Machine Java has rich open-source libraries for data mining, including Mahout and Weka, and the JVM provides robust memory management and exception handling. Other programming languages that can be used with the JVM include: • Scala: This programming language has the same efficiency as Java because it’s run on the JVM. However, it’s also become increasingly popular in data mining because it permits developers to use object-oriented programming (OOP) as well as functional programming. Users of Scala include The Guardian, LinkedIn, Foursquare, Novell, Siemens, Twitter, and the SPARK data mining environment at the UC Berkeley AMP Lab. • Clojure: A dialect of the 1980s-era artificial intelligence language LISP, Clojure is a primarily (although not 100%) functional language that also runs on the JVM. Clojure keeps data static and was designed for running concurrent processes. These features are important because, in contrast, object-oriented code executing concurrent processes will sometimes attempt to write to the same variable simultaneously. Keeping data structures immutable avoids this problem. Clojure has access to Java libraries, and the same development efficiencies as Java. Clojure can use the LISP macro facility to integrate with Hadoop and SQL. Users of Clojure include Netflix, Zendesk, Citibank, WalMart Labs, and Spotify. Faculty of Information Technology, Thai - Nichi Institute of Technology 14
  • 15. Python: A High-Level Programming Language with Excellent Data Libraries Python is a high-level language, meaning that it automates housekeeping tasks such as memory management so that code is easier to write. Python has robust libraries that support statistical modeling (SciPy and NumPy), data mining (Orange and Pattern), and visualization (Matplotlib). Scikit-learn, a library of machine learning techniques that is very useful to data scientists, has attracted developers from Spotify, OKCupid, and Evernote, but it can be challenging to master.
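As a small taste of statistical work in Python, the sketch below summarizes a list of measurements. It deliberately uses only the standard library's `statistics` module so that it runs without installing anything; NumPy and SciPy provide array-based equivalents that scale to much larger data. The response times are made-up sample values.

```python
import statistics

# Hypothetical response times in milliseconds; 610 is a deliberate outlier.
response_ms = [120, 135, 128, 142, 610, 131, 125]

mean_ms = statistics.mean(response_ms)      # pulled upward by the outlier
median_ms = statistics.median(response_ms)  # middle value, robust to outliers
stdev_ms = statistics.stdev(response_ms)    # sample standard deviation

print(f"mean={mean_ms:.1f} ms, median={median_ms} ms, stdev={stdev_ms:.1f} ms")
```

Comparing the mean with the median is a quick way to notice that a dataset contains outliers before moving on to modeling.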
  • 16. Excel: Powerful Data Analytics on a Smaller Scale Excel can accomplish a lot of sophisticated analysis, and it is easy to use and widely available. It is not the best choice for analyzing truly massive, unstructured datasets (for example, some 30 million healthcare records distributed via Hadoop across dozens of servers), but it is surprisingly powerful for a variety of data analytics projects at a small scale. These can include clustering, optimization, and predictive modeling using supervised learning or forecasting techniques.
  • 17. SAS (Statistical Analysis System): Data Mining Software Suite Used for advanced analytics, data management, and social media analytics, SAS is a robust suite that's popular for business intelligence analysis of large and unstructured datasets. In 2015, SAS topped the Gartner Magic Quadrant list in terms of “ability to execute” in the category of advanced analytics platforms due to the breadth and quality of its predictive modeling and data mining techniques. SAS offers a well-regarded visualization tool and integrates with open-source tools like R, Hadoop, and Python. It also puts significant effort into making its tools backwards compatible, an important feature when working with older historical datasets.
  • 18. SAS (Statistical Analysis System): Data Mining Software Suite Say, for example, that a company's sales records were prepared for use by SAS in 1998. With backwards compatibility, they can still be read today. In large organizations, employee turnover over the years puts a premium on the continuity of tools: when a data scientist retires, the organization does not lose access to their work, even if they preferred older software that no one new to the position knows how to use. SAS can be costly, has a complicated licensing structure that some customers find frustrating, and has a steep learning curve. Despite the expense and complexity, it's a very popular option, with more than 65,000 customers.
  • 19. IBM: SPSS Modeler and SPSS Analytics The Forrester Wave report ranks IBM's advanced data analytics platform as the top offering in the advanced analytics category for its breadth of tools, which handle all elements of big data modeling: loading, “cleaning,” preparing, and then predictive modeling, whether using statistical or machine learning techniques. Other makers of highly rated commercial tools for advanced data analytics include SAP, KNIME, RapidMiner, Oracle, and Alteryx.
  • 20. IBM: SPSS Modeler and SPSS Analytics SPSS Modeler and SPSS Statistics became IBM products when IBM acquired SPSS Inc. in 2009, and they have a loyal following among statisticians. These tools integrate with Hadoop to facilitate file-system computing on big datasets. The Social Media Analytics product helps data scientists harvest data from Twitter, Facebook, and other platforms to perform customer sentiment analysis. Gartner reports that the IBM advanced analytics platform has lower customer satisfaction ratings than average, largely due to weak customer support, inadequate documentation, and a challenging installation process.
  • 21. SQL vs. NoSQL Databases: Tackling the “Messiness” of Big Data Another important distinction in the world of data is SQL databases vs. NoSQL databases, each of which is well suited to different types of datasets. Here's a quick look at what makes them different in the context of data analysis. The traditional “relational” database was designed for an era in which data was far more expensive to collect and to store, and much more carefully organized. Structured Query Language (SQL) has been the means by which programmers store and retrieve data in those neatly categorized rows and columns. However, only about 5% of the world's information is structured data. The rest consists of articles, photos, videos, social media posts, machine-to-machine communication, product inventory, and technical documents, none of which fits neatly into rows and columns. So data scientists turned to a different approach to data storage called “NoSQL.”
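The contrast between the two storage styles can be sketched in a few lines. The first half uses Python's built-in `sqlite3` module as a stand-in for a relational database; the second half uses plain JSON documents as a stand-in for a document-oriented NoSQL store (it does not use any real NoSQL product's API). The `orders` table, the field names, and the sample records are hypothetical.

```python
import json
import sqlite3

# --- Relational style: a fixed schema of rows and columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)
totals = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

# --- Document style: each record is self-describing and may carry fields
# that other records lack (tags, a nested review, and so on).
documents = [
    {"customer": "alice", "total": 30.0, "tags": ["gift"]},
    {"customer": "bob", "total": 12.5,
     "review": {"stars": 4, "text": "Fast delivery"}},
]
stored = [json.dumps(doc) for doc in documents]  # what a document store keeps

# Queries inspect whatever structure each document happens to have.
loaded = [json.loads(raw) for raw in stored]
reviewed = [doc["customer"] for doc in loaded if "review" in doc]

print(totals)
print(reviewed)
```

The relational half rejects data that does not fit the declared columns, which is a strength for carefully organized records. The document half accepts irregular records, which is why this style suits the unstructured data described above.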
  • 22. Databases MySQL: Open-Source RDBMS MySQL is a widely used RDBMS (relational database management system) and one part of the LAMP software stack. It came under Oracle's ownership through Oracle's acquisition of Sun Microsystems, which was announced in 2009 and completed in 2010. This free, open-source database management system is used by web applications like WordPress, Drupal, Facebook, Twitter, and YouTube. MongoDB: The most popular NoSQL database system on the market is the open-source MongoDB, which has been used by MetLife, The Weather Channel, Bosch, and Expedia. MongoDB has well-regarded customer service, and the tool is particularly popular with startups. One of the fastest-growing big data projects used alongside MongoDB is Apache Spark, a distributed computing framework from the Apache Software Foundation that's designed to operationalize real-time analytics. Paired with MongoDB, Spark allows organizations to put real-time analytics reporting to use. Other commonly used open-source NoSQL databases include HBase and Cassandra. (MariaDB, sometimes listed alongside them, is an open-source relational database forked from MySQL.) Oracle: Oracle has nearly 50% of the traditional relational database market, with products such as Oracle Database and Oracle TimesTen. The database behemoth has also entered the market for unstructured data storage with Oracle NoSQL, and it owns open-source SQL databases that compete with its proprietary offerings. Oracle's products are popular and considered top-notch by many, but they are expensive.
  • 23. Databases SQL Server DBMS: Enterprise-Level Database Management Microsoft SQL Server DBMS is a competitive enterprise-level database management system that includes support for SQL or NoSQL architectures, in-memory computing, the cloud, and analytics on transactions. Existing customers are generally impressed with its performance. Other strong performers in the market include SAP, IBM, EnterpriseDB, InterSystems, and MarkLogic.
  • 24. File System Computing Hadoop: File System Computing What is “file system computing”? It's a way to store and analyze truly massive datasets. For example, 2 billion data points from sensors on an auto assembly line, stored on a cluster of servers that are each connected to multiple drives, would be enormous. Because this kind of dataset is too large to extract from the drives to a place where it can be analyzed, software like Hadoop was created. Hadoop is an open-source software tool specially designed to help data scientists manage the unwieldiness of big data. It eliminates the need to extract data from the storage devices altogether, bringing the analytics to the data so it can be processed in place. It has increasingly become the industry standard for file system computing projects involving big data, with prominent users including Facebook, Yahoo, and The New York Times. There are many other platforms that do file system computing, such as SciDB, but Hadoop has risen to the top thanks to its built-in MapReduce processing model and an ecosystem of projects that extend or build on it, like Hive, Pig, and Spark. Even software giants like Microsoft and IBM have created their own Hadoop tools, rather than reinventing the wheel.
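The idea of bringing the analytics to the data can be sketched as a tiny map-and-reduce computation. This is plain Python, not Hadoop's API: the `partitions` list stands in for blocks of sensor records held on different servers, and `map_partition` and `combine` are hypothetical helper names. Each partition is summarized where it lives, and only the small summaries are merged.

```python
from collections import Counter
from functools import reduce

# Each inner list stands in for the records stored on one server.
partitions = [
    ["ok", "ok", "fault"],
    ["ok", "fault", "fault"],
    ["ok"],
]


def map_partition(records):
    """Map step: summarize one partition locally, without moving its records."""
    return Counter(records)


def combine(left, right):
    """Reduce step: merge two partial summaries into one."""
    return left + right


# In a real cluster the map step would run on the machine that holds each
# partition; here it simply runs in a loop.
partial_counts = [map_partition(p) for p in partitions]

# Only the compact per-partition counts travel to the reducer.
total = reduce(combine, partial_counts, Counter())

print(dict(total))
```

The raw records never leave their partition; what moves across the network is a summary whose size depends on the number of distinct keys, not on the number of records.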
  • 25. R installation procedure Follow the procedure at the link below to install the R software. https://bit.ly/2MKNB4j This will help you learn R along with basic mathematics and statistics over the next month. The concepts you have learned so far in Java are enough to accomplish this.
  • 26. Next Week… • Basic Mathematics I