CH 4_TYBSC(CS)_Data Science_Visualisation

Chapter-4
Data Visualization
By Prof. Sangeeta Borde

Visualization:
• Definition: A graphical representation of data that makes information easier to analyze and
understand.
• Advantages:
1. Easier to analyze
2. Easier to detect trends, patterns, and outliers
Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is a crucial initial step in data science projects.
It involves analyzing and visualising data sets to understand their key characteristics,
uncover patterns, locate outliers, and identify relationships between variables.
EDA is normally carried out as a preliminary step before undertaking more
formal statistical analyses or modelling.

The following methods are involved in EDA:
1. Univariate analysis: Summary statistics and visualization for each
field in the raw dataset.
Univariate analysis focuses on a single variable to understand its
internal structure. It is primarily concerned with describing the data and
finding patterns existing in a single feature.
• Common techniques include:
• Histograms: Used to visualize the distribution of a variable.
• Box plots: Useful for detecting outliers and understanding the spread
and skewness of the data.
• Bar charts: Employed for categorical data to show the frequency of
each category.
• Summary statistics: Calculations like mean, median, mode, variance,
and standard deviation that describe the central tendency and
dispersion of the data.
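
The summary statistics listed above can be sketched with Python's standard `statistics` module. The `marks` list is hypothetical data made up for illustration, and the variance and standard deviation shown are the sample versions (dividing by n - 1).

```python
import statistics

# Hypothetical marks for a single variable (illustrative values only)
marks = [32, 45, 45, 67, 76, 28, 79, 62, 43, 81]

summary = {
    "mean": statistics.mean(marks),          # central tendency
    "median": statistics.median(marks),      # middle value of the sorted data
    "mode": statistics.mode(marks),          # most frequent value
    "variance": statistics.variance(marks),  # sample variance (n - 1)
    "std_dev": statistics.stdev(marks),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For data already held in a pandas data frame, `DataFrame.describe()` reports a similar summary for every numeric column.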

2. Bivariate analysis:
• Bivariate analysis explores the relationship between two variables. It helps find
associations, correlations, and dependencies between pairs of variables.
• Scatter Plots: These are one of the most common tools used in bivariate analysis. A
scatter plot helps visualize the relationship between two continuous variables.
• Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient
for linear relationships) quantifies the degree to which two variables are related.
• Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze
the relationship between two categorical variables. It shows the frequency distribution of
categories of one variable in rows and the other in columns, which helps in understanding
the relationship between the two variables.
• Line Graphs: In the context of time series data, line graphs can be used to compare two
variables over time. This helps in identifying trends, cycles, or patterns that emerge in the
interaction of the variables over the specified period.
• Covariance: Covariance is a measure used to determine how much two random variables
change together.
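
Covariance and Pearson's correlation coefficient, both described above, can be computed by hand as a minimal sketch. The helper names `covariance` and `pearson_r` and the `hours`/`score` data are hypothetical; the sample (n - 1) form of covariance is assumed.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: hours studied and exam score for five students
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

print("covariance:", covariance(hours, score))
print("pearson r:", pearson_r(hours, score))
```

Covariance depends on the units of the two variables, while the correlation coefficient always lies between -1 and 1, which makes it easier to compare across pairs of variables.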

3. Multivariate analysis:
• Multivariate analysis examines the relationships between
two or more variables in the dataset. It aims to understand
how variables interact with one another, which is crucial for
most statistical modeling techniques. Techniques include:
• Pair plots: Visualize relationships across several variables
simultaneously to capture a comprehensive view of
potential interactions.
• Principal Component Analysis (PCA): A dimensionality
reduction technique used to reduce the dimensionality of
large datasets, while preserving as much variance as
possible.
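
A minimal sketch of PCA, assuming NumPy is available: the data is centred, the covariance matrix is decomposed into eigenvectors, and the rows are projected onto the directions with the largest variance. The function name `pca` and the sample data are hypothetical; in practice a library implementation such as scikit-learn's is more commonly used.

```python
import numpy as np

def pca(data, n_components):
    """Project the rows of `data` onto the top principal components."""
    X = np.asarray(data, dtype=float)
    centered = X - X.mean(axis=0)            # centre each column at zero
    cov = np.cov(centered, rowvar=False)     # covariance between features
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    components = eigvecs[:, order]
    explained = eigvals[order] / eigvals.sum()        # share of total variance
    return centered @ components, explained

# Hypothetical dataset: six observations of three variables
data = [
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.9],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 1.0],
    [3.1, 3.0, 1.2],
    [2.3, 2.7, 0.8],
]

projected, explained = pca(data, n_components=2)
print(projected.shape)   # three variables reduced to two
print(explained)         # fraction of variance kept by each component
```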

Key aspects of EDA include:
• Distribution of Data: Examine the distribution of data points to understand their range, central
tendencies (mean, median), and dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and
bar charts to visualize relationships within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can
influence statistical analyses and might indicate data entry errors or unique cases. Outliers may
occur due to several reasons such as measurement error, data entry error, sampling error etc.
• Correlation Analysis: Checking the relationships between variables to understand how they
might affect each other. This includes computing correlation coefficients and creating
correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data points, whether
by imputation or removal, depending on their impact and the amount of missing data.
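
Missing-value handling and outlier detection can be sketched as follows. Two assumptions are made that the notes above do not prescribe: missing values are imputed with the median, and outliers are flagged with the common 1.5 × IQR rule (the rule a box plot uses for its whiskers). The function names and the `raw` data are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the present values."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings with two missing entries and one suspicious value
raw = [23, 25, None, 27, 24, 26, 120, None, 25]

clean = impute_median(raw)
print(clean)
print(iqr_outliers(clean))
```

Whether a flagged value is an error or a genuine unique case still has to be decided by inspecting the data.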

Data visualization can help in:
• Making business analysis easier:
Decision making: for example sales prediction, product promotion, and
customer behavior analysis.
Improved response time: insights are available at a quick glance.
Greater simplicity.

Visual Encoding:
• Encoding in data visualization means translating data into visual elements
on a chart or map.
• The attribute values signify important data characteristics such as
numerical, categorical, or ordinal data.
• Choosing an appropriate chart type is a challenging task.
• Roles of data visualization and the corresponding charts:
• 1. Distribution: Scatter chart, 3D area chart, Histogram
• 2. Relationship: Bubble chart, Scatter plot
• 3. Comparison: Bar, Line, Column, Area

Visual Encoding (continued):
• 4. Composition: Pie chart, Waterfall chart, Stacked column chart
• 5. Location: Bubble map
• 6. Connection: Matrix chart, Word cloud, Tube map
These charts are used to show the data in the dataset accurately.
To represent data that involves three or more variables, the following visual attributes are used:
1. Shape 2. Size 3. Color 4. Orientation
5. Texture 6. Length 7. Angle
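
As an illustration of encoding extra variables with size and color, here is a minimal Matplotlib bubble chart. All data values and labels are hypothetical, and the `Agg` backend line is an assumption so the sketch runs without a display; remove it for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive windows
import matplotlib.pyplot as plt

# Hypothetical data for five stores (four variables per store)
ad_spend = [10, 20, 30, 40, 50]         # x position
sales = [15, 28, 33, 48, 60]            # y position
store_size = [100, 300, 200, 500, 400]  # encoded as marker size
rating = [3.1, 4.0, 3.6, 4.8, 4.4]      # encoded as marker color

fig, ax = plt.subplots()
points = ax.scatter(ad_spend, sales, s=store_size, c=rating,
                    cmap="viridis", alpha=0.7)
fig.colorbar(points, ax=ax, label="Customer rating")
ax.set_xlabel("Advertising spend")
ax.set_ylabel("Sales")
ax.set_title("Four variables in one chart: position, size, color")
```

Calling `plt.show()` (with an interactive backend) or `fig.savefig(...)` displays or stores the chart.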

The choice of visualization tool depends on the type of data.
• The following software is used for data visualization:
• 1. Tableau: Database integration, email integration, dashboard creation
• 2. Looker: Business intelligence platform
• 3. QlikView: Personalized data search, role-based access
• 4. MS Excel
• 5. Domo: Dashboard creation
• 6. Power BI: Affordability, web publishing
• 7. Plotly: Image storage

Tools for Performing Exploratory Data Analysis
• 1. Python Libraries
• Pandas: Provides extensive functions for data manipulation and analysis, including data
structure handling and time series functionality.
• Matplotlib: A plotting library for creating static, interactive, and animated visualisations
in Python.
• Seaborn: Built on Matplotlib, it provides a high-level interface for drawing attractive and
informative statistical graphics.
• Plotly: An interactive graphing library that offers more
sophisticated visualization capabilities.
• 2. R Packages
• ggplot2: A powerful tool for making complex plots from data in a data frame.
• dplyr: A grammar of data manipulation, providing a consistent set of verbs that help you
solve the most common data manipulation challenges.
• tidyr: Helps to tidy your data. Tidying your data means storing it in a consistent form that
matches the semantics of the dataset with the way it is stored.

Libraries:
• Matplotlib is a data visualization and 2-D plotting library for Python. It was
initially released in 2003 and is one of the most popular and widely used plotting libraries in
the Python community. It comes with an interactive environment across multiple
platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the
Jupyter Notebook, web application servers, etc.
• Plotly is a free, open-source graphing library that can be used to create data
visualizations. Plotly (plotly.py) is built on top of the Plotly JavaScript library
(plotly.js) and can be used to create web-based data visualizations that can be
displayed in Jupyter notebooks or web applications using Dash or saved as individual
HTML files. Plotly provides more than 40 unique chart types like scatter plots,
histograms, line charts, bar charts, pie charts, error bars, box plots, multiple axes,
sparklines, dendrograms, 3-D charts, etc.

Libraries:
• Seaborn is a Python data visualization library that is based on
Matplotlib and closely integrated with NumPy and pandas data
structures. Seaborn has various dataset-oriented plotting functions that
operate on data frames and arrays containing whole datasets.
It internally performs the necessary statistical aggregation
and mapping to create informative plots.
It is a high-level interface for creating attractive and informative
statistical graphics that are integral to exploring and understanding
data. Seaborn graphics include bar charts, histograms, scatter plots,
box plots, plots with error bars, etc. Seaborn has no built-in pie chart
function; pie charts are drawn with Matplotlib. Seaborn also has various
tools for choosing colour palettes that can reveal patterns in the data.

GGplot
• ggplot is a Python data visualization library that is based on
ggplot2, which was created for the programming
language R. ggplot can create data visualizations such as bar charts,
pie charts, histograms, scatter plots, error charts, etc. using a high-level
API. It also allows you to add different types of data visualization
components or layers in a single visualization.
• geoplotlib: Most of the data visualization libraries don’t provide
much support for creating maps or using geographical data and that is
why geoplotlib is such an important Python library. It supports the
creation of geographical maps in particular with many different types
of maps available such as dot-density maps, choropleths, symbol
maps, etc.

Basic Data visualization Tools:
• 1. Histogram
• 2. Bar chart/graph
• 3. Line plot
• 4. Scatter plot

Histogram:
• A histogram is a visual depiction of a grouped frequency
distribution with continuous classes. It is an area diagram
made up of a series of rectangles whose bases equal the
distances between class boundaries and whose areas are
proportional to the frequencies of the corresponding
classes.

Histogram:
import matplotlib.pyplot as plt
# create data
data = [32, 96, 45, 67, 76, 28, 79, 62, 43, 81, 70,
61, 95, 44, 60, 69, 71, 23, 69, 54, 76, 67,
82, 97, 26, 34, 18, 16, 59, 88, 29, 30, 66,
23, 65, 72, 20, 78, 49, 73, 62, 87, 37, 68,
81, 80, 77, 92, 81, 52, 43, 68, 71, 86]
# create histogram
plt.hist(data)
# display histogram
plt.show()

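import random
import matplotlib.pyplot as plt
# build a list of 50 random integers between 0 and 100
arr1 = []
for i in range(50):
    arr1.append(random.randint(0, 100))
print(arr1)
# line plot with a circle marker at each data point
plt.plot(arr1, marker='o')
plt.show()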

Bar Chart/Graph:
• A bar graph or bar chart is a graph that represents categorical data
with rectangular bars, which may be horizontal or vertical. A bar chart
with vertical bars is also called a column chart. The length of each bar
is proportional to the value it represents.

Components of Bar Chart:
• Chart Title: The name of the bar chart. It states what the chart
represents.
• Grid Lines: The vertical and horizontal gray lines are called grid lines.
• Bars: Each bar corresponds to a value. Bars may be horizontal or vertical. The longest bar represents
the largest value.
• Axis Title: A bar graph has two axis titles, one vertical and one horizontal. The two axes are
related to each other, and the titles make the chart easier to understand. Suppose the vertical axis
represents expenses; we can write Expenses (in rupees) on the vertical axis. The expenses may
be of different types, so we can write Types of expenses on the horizontal axis.
• Labels: The horizontal axis can be divided into categories. For example, types of expenses can be
categorized into medical, transport, office, etc.
• Legends: A legend specifies what a bar represents. It is also known as the key of a chart.
For example, if we write 2019 in place of Series 1, the blue bars in the
graph represent the data for the year 2019.
• Scale: The scale shows the range of values on the value axis. Its unit may be rupees, population, size, etc.
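
The components described above map onto Matplotlib calls roughly as follows. This is a minimal sketch: the expense figures are hypothetical, and the `Agg` backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive windows
import matplotlib.pyplot as plt

# Hypothetical expenses for the year 2019
categories = ["Medical", "Transport", "Office"]   # labels on the horizontal axis
expenses_2019 = [1200, 800, 1500]                 # one bar per value

fig, ax = plt.subplots()
bars = ax.bar(categories, expenses_2019, color="tab:blue", label="2019")  # bars
ax.set_title("Expenses by type")                  # chart title
ax.set_xlabel("Types of expenses")                # horizontal axis title
ax.set_ylabel("Expenses (in rupees)")             # vertical axis title
ax.grid(axis="y", color="gray", linestyle="--", alpha=0.5)  # grid lines
ax.legend()                                       # legend (key of the chart)
```

The scale on the vertical axis is chosen automatically from the data; `ax.set_ylim(...)` can override it if needed.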

TYPES:
• Vertical
• Horizontal
• Stacked
• 3D Bar
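
Three of the bar chart types above can be sketched side by side with Matplotlib. The quarterly figures are hypothetical, the `Agg` backend line is an assumption for running without a display, and the 3D bar type is left out to keep the sketch short.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive windows
import matplotlib.pyplot as plt

# Hypothetical quarterly sales of two products
quarters = ["Q1", "Q2", "Q3", "Q4"]
product_a = [30, 45, 40, 60]
product_b = [20, 25, 35, 30]

fig, (ax_v, ax_h, ax_s) = plt.subplots(1, 3, figsize=(12, 4))

ax_v.bar(quarters, product_a)            # vertical bars (column chart)
ax_v.set_title("Vertical")

ax_h.barh(quarters, product_a)           # horizontal bars
ax_h.set_title("Horizontal")

# stacked bars: the second series starts where the first one ends
ax_s.bar(quarters, product_a, label="Product A")
ax_s.bar(quarters, product_b, bottom=product_a, label="Product B")
ax_s.set_title("Stacked")
ax_s.legend()

fig.tight_layout()
```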

Line Plot, Scatter Plot, Area Plot/Chart
• Python's Matplotlib library is used for data visualization. Its pyplot
submodule provides a set of functions that help in creating
several kinds of charts. A line plot shows the relationship between two
sets of data, X and Y, with each set plotted on its own axis.
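
A minimal sketch of the three plot types named in this heading, drawn from the same hypothetical X and Y data. The `Agg` backend line is an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive windows
import matplotlib.pyplot as plt

# Hypothetical data: month number (X) and units sold (Y)
x = [1, 2, 3, 4, 5, 6]
y = [12, 18, 15, 24, 30, 27]

fig, (ax_line, ax_scatter, ax_area) = plt.subplots(1, 3, figsize=(12, 4))

ax_line.plot(x, y, marker="o")        # line plot: points joined in order
ax_line.set_title("Line plot")

ax_scatter.scatter(x, y)              # scatter plot: points only
ax_scatter.set_title("Scatter plot")

ax_area.fill_between(x, y, alpha=0.4) # area plot: region under the line filled
ax_area.set_title("Area plot")

for ax in (ax_line, ax_scatter, ax_area):
    ax.set_xlabel("Month")
    ax.set_ylabel("Units sold")

fig.tight_layout()
```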

Specialized Data Visualization Tools:
• BOX PLOT
• BUBBLE PLOT
• HEAT MAP
• VENN DIAGRAM
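
Two of the specialized tools listed above, the box plot and the heat map, can be sketched with plain Matplotlib. The marks and the correlation matrix are hypothetical values, and the `Agg` backend line is an assumption for running without a display. Bubble plots use `scatter` with a size argument, and Venn diagrams need a separate package, so neither is shown here.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove for interactive windows
import matplotlib.pyplot as plt

# Hypothetical marks for two classes (class B contains one extreme value)
class_a = [45, 52, 58, 60, 63, 67, 70, 74]
class_b = [38, 50, 55, 57, 59, 62, 66, 98]

# Hypothetical correlation matrix for three variables
variables = ["height", "weight", "age"]
matrix = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]

fig, (ax_box, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))

box = ax_box.boxplot([class_a, class_b])   # one box per dataset
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(["Class A", "Class B"])
ax_box.set_title("Box plot")

image = ax_heat.imshow(matrix, cmap="coolwarm", vmin=-1, vmax=1)  # color per cell
ax_heat.set_xticks(range(3))
ax_heat.set_xticklabels(variables)
ax_heat.set_yticks(range(3))
ax_heat.set_yticklabels(variables)
ax_heat.set_title("Heat map")
fig.colorbar(image, ax=ax_heat)
```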
