Summarizing data

Dr Lipilekha Patnaik
Professor, Community Medicine
Institute of Medical Sciences & SUM Hospital
Siksha ‘O’ Anusandhan deemed to be University
Bhubaneswar, Odisha, India
Email: drlipilekha@yahoo.co.in

Session Objectives
•Measures of central tendency – Mean, median, mode
•Measures of dispersion – Range, standard deviation, standard error

Descriptive Measures for continuous data
•Central tendency measures – They are
computed to give a “center” around
which the measurements in the data
are distributed.
•Variation or variability measures –
They describe data spread or how far
away the measurements are from the
center.

Statistics related to continuous variables
• Mean
• Median
• Mode
• Range
• Standard Deviation
• Standard Error

Measures of Central Tendency

Central tendency measures
•Mean – The average value
Affected by extreme values
•Median – The middle value
Not affected by extremes
•Mode – Most frequently occurring
observation; there may be
more than one mode.

Mean
•Average
•Arithmetic mean ($\bar{x}$)
= sum of individual values / number of observations
= $\dfrac{\sum x}{n}$

Exercise
• The diastolic blood pressure of 10 individuals was
83, 75, 81, 79, 71, 95, 75, 77, 84, 90.
• Arithmetic mean = (83+75+81+79+71+95+75+77+84+90) / 10
= 810 / 10
= 81
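The arithmetic in this exercise can be checked with a few lines of Python. This is only an illustrative sketch; the function name `arithmetic_mean` is an assumption made for the example and does not come from the slides.

```python
# Diastolic blood pressure readings from the exercise.
readings = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]


def arithmetic_mean(values):
    """Return the sum of the values divided by the number of observations."""
    return sum(values) / len(values)


print(sum(readings))              # 810
print(arithmetic_mean(readings))  # 81.0
```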

Median
§ The data are first arranged in an ascending or
descending order of magnitude
§ Middle observation is located, which is called
median.
§If the number of values is odd,
Median = middle value
§If the number of values is even,
Median = average of the two middle values

Median divides the data into two equal parts
with 50% of the observations above the median
and 50% below it.

Exercise
• Exercise 1: odd number (11) of observations
• Unsorted: 11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10
• Sorted in ascending order: 8, 9, 10, 10, 11, 11, 12, 12, 12, 13, 15
• Median = middle (6th) value = 11
• Exercise 2: even number (12) of observations
• Unsorted: 11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10, 12
• Sorted in ascending order: 8, 9, 10, 10, 11, 11, 12, 12, 12, 12, 13, 15
• Median = (11 + 12) / 2 = 11.5
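The sort-then-pick-the-middle rule can be sketched in Python as follows. The function name `median` is chosen for illustration, and the odd-sized data set is taken with the values that match the sorted list shown in the exercise.

```python
def median(values):
    """Sort the values, then return the middle one (odd count) or the
    average of the two middle ones (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_set = [11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10]  # 11 observations
even_set = odd_set + [12]                              # 12 observations

print(median(odd_set))   # 11
print(median(even_set))  # 11.5
```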

Mode
•Most frequent observation.
•The value that appears most frequently in the data set.

Exercise
11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10
Mode = 12
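A minimal Python sketch of finding the mode by counting how often each value occurs. It returns a list because a data set may have more than one mode; the helper name `modes` and the second, two-mode data set are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


data = [11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10]
print(modes(data))             # [12]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a data set with two modes
```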

Number of seizures/month:
3, 3, 1, 2, 4, 7, 9
•Mean? 4.1
•Median? 3
•Mode? 3
[Figure: chart of the 7 observations; axis label "No of seizures"]

What’s wrong with a mean?
•Mean is sensitive to outliers (values far from the
middle of the distribution)
–Provides a falsely high or low measure of central
tendency when outliers exist.
–In such cases (look at your data), use the median
as the preferred measure of central tendency.

Number of seizures/month: 100, 2, 3, 3, 4, 7, 1
•Mean? 17.14
•Median? 3
•Mode? 3
[Figure: chart of the 7 observations (vertical axis 0 to 120); the value 100 is labelled "Outlier"]
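The effect of a single outlier on the mean, but not on the median, can be reproduced with Python's standard `statistics` module. This is a sketch using the two seizure data sets from the slides; the variable names are illustrative.

```python
from statistics import mean, median

typical = [3, 3, 1, 2, 4, 7, 9]
with_outlier = [100, 2, 3, 3, 4, 7, 1]

for label, data in (("typical", typical), ("with outlier", with_outlier)):
    # The mean is pulled towards the extreme value; the median is not.
    print(label, round(mean(data), 2), median(data))
# typical 4.14 3
# with outlier 17.14 3
```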

Measures of dispersion
•Dispersion is also called variability, scatter or spread.
•It measures how spread out a set of data is.
•Dispersion is the scatteredness of the data series
around its average.

Measures of dispersion
•Range
•Standard deviation
•Variance
•Interquartile range

Range
•The difference between the values of
the two extreme items of a series.
•That is, the difference between the maximum
and minimum values in a set of
observations.

Exercise
•For example, from the following record of diastolic
blood pressure of 10 individuals:
93, 75, 81, 79, 71, 90, 75, 95, 77, 94.
• Highest value = 95
• Lowest value = 71
•The range is expressed as 95 – 71 = 24,
or as 71 to 95.
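A short Python sketch of the range calculation. The readings are taken with 71 as the lowest value, as stated in the exercise; the variable names are illustrative.

```python
readings = [93, 75, 81, 79, 71, 90, 75, 95, 77, 94]

lowest, highest = min(readings), max(readings)
value_range = highest - lowest  # difference between the two extremes

print(f"{lowest} to {highest}, range = {value_range}")  # 71 to 95, range = 24
```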

Characteristics of Range
•Simplest and crudest measure of dispersion.
•Affected by the extreme values.
•Gives an idea of the variability very quickly.

Standard deviation
•Tells us how far the individual values in the sample
deviate from the mean.
•Provides an index of variability.

Characteristics of Standard Deviation
• Very satisfactory and most widely used measure of
dispersion.
•If the SD is small, there is a high probability of getting a
value close to the mean.
•If it is large, values tend to lie farther away from the mean.
•It is less affected by fluctuations of sampling.

How to determine an SD
1. Calculate the mean.
2. Calculate the difference between each value
and the mean.
3. Square each of the differences and sum them.
4. Divide the sum by one less than the number of
observations (if n < 30) or by the number of observations
(if n > 30).
5. Take the square root of the result.

Standard deviation
• The diastolic blood pressure of 10 individuals was as follows: 83, 75, 81, 79, 71, 95, 75, 77,
84, 90.
$x$     $x - \bar{x}$     $(x - \bar{x})^2$
83        2                 4
75       -6                36
81        0                 0
79       -2                 4
71      -10               100
95       14               196
75       -6                36
77       -4                16
84        3                 9
90        9                81
$\sum x = 810$
$\sum (x - \bar{x})^2 = 482$
n = 10
Mean = 810 / 10 = 81
SD = $\sqrt{482 / (10 - 1)} \approx 7.32$
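The calculation steps can be followed in Python as a rough check of the worked example. This sketch divides by n - 1 because n = 10 is below 30; the function names are assumptions made for illustration.

```python
from math import sqrt


def sum_of_squared_deviations(values):
    """Steps 1 to 3: mean, differences from the mean, squares, sum."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_sd(values):
    """Steps 4 and 5: divide by (n - 1), then take the square root."""
    return sqrt(sum_of_squared_deviations(values) / (len(values) - 1))


readings = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]
print(sum_of_squared_deviations(readings))  # 482.0
print(round(sample_sd(readings), 2))        # 7.32
```

The standard library function `statistics.stdev` uses the same n - 1 divisor and should agree with this result.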

Uses of the standard deviation
•The standard deviation enables us to determine,
with a great deal of accuracy, where the values
of a frequency distribution are located in relation
to the mean.

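Standard Deviation (SD) – for ‘Normal
distribution’
Mean birth weight = 3.5 kg
Std Dev. = 1.0 kg
Mean ± 1 SD = 3.5 ± 1 kg = 2.5 – 4.5 kg, covering 68% of observations
Mean ± 2 SD = 3.5 ± 2 kg = 1.5 – 5.5 kg, covering 95% of observations
[Figure: normal curve of birth weight (kg) against frequency [N], with the 2.5 – 4.5 kg and 1.5 – 5.5 kg bands marked around the mean of 3.5 kg]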
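The 68% and 95% figures can be checked against an exact normal curve with `statistics.NormalDist` (Python 3.8 or later). This is a sketch that assumes birth weight follows a perfectly normal distribution with the mean and SD quoted on the slide; the helper name `share_between` is illustrative.

```python
from statistics import NormalDist

# Assumed model: birth weight ~ Normal(mean 3.5 kg, SD 1.0 kg).
birth_weight = NormalDist(mu=3.5, sigma=1.0)


def share_between(dist, low, high):
    """Proportion of the distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_1sd = share_between(birth_weight, 2.5, 4.5)
within_2sd = share_between(birth_weight, 1.5, 5.5)

print(round(within_1sd * 100, 1))  # 68.3
print(round(within_2sd * 100, 1))  # 95.4
```

The slide's 68% and 95% are the usual rounded versions of these values.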

Variance
• Variance = $(\mathrm{SD})^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
• Indicates the degree of variability among the
observations for a given variable.

Percentiles
• The pth percentile is a number such that at most p%
of the measurements are below it and at most
(100 – p)% of the measurements are above it.
• Example: if the 85th percentile of a data set is
520, then 85% of the measurements are below 520
and 15% are above it.

Percentiles - for
non-normally distributed data
A percentile gives the % of data that fall below a specific value.
The 50th percentile is the MEDIAN.
The 25th to the 75th percentile is the INTERQUARTILE RANGE (IQR).
[Figure: distribution of diastolic BP (50 to 120) against frequency [N], divided into four 25% segments by the 25th, 50th and 75th percentiles]

INTERQUARTILE RANGE
[Figure: data divided into four 25% segments by Q1, Q2 and Q3]
The interquartile range runs from Q1 to Q3.
Interquartile range = Q3 – Q1

To calculate it, just subtract quartile 1
from quartile 3.
Example: 5, 8, 4, 4, 6, 3, 8.
• First put the list of numbers in order.
• Then cut the list into 4 equal parts.
• The quartiles are the cuts.
3, 4, 4, 5, 6, 8, 8
Lower quartile (Q1) = 4
Middle quartile (Q2) = median = 5
Upper quartile (Q3) = 8
Interquartile range is Q3 – Q1 = 8 – 4 = 4
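The same quartiles can be obtained with `statistics.quantiles` (Python 3.8 or later). This is a sketch, not the only valid approach: several quartile conventions exist and they can give slightly different cut points on small data sets. The default "exclusive" method used here happens to reproduce the values in the example.

```python
from statistics import quantiles

data = [5, 8, 4, 4, 6, 3, 8]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 4.0 5.0 8.0
print(iqr)         # 4.0
```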

Standard Error
•If we take a random sample (n) from the population,
and similar samples over and over again, we will find
that every sample has a different mean ($\bar{x}$).
•If we make a frequency distribution of all the sample
means drawn from the same population, we will find
that the distribution of the means is nearly a normal
distribution and the mean of the sample means is
practically the same as the population mean ($\mu$).

•This is a very important observation: the sample
means are distributed normally about the population
mean ($\mu$).
•The standard deviation of the means is a measure of
the sampling error and is given by the formula $\sigma / \sqrt{n}$,
which is called the standard error or the standard
error of the mean.
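Repeated sampling can be imitated with a small simulation. The sketch below is illustrative only: the population mean of 150, SD of 30, sample size of 25 and 2000 repetitions are all assumed values, and the seed is fixed so that the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so that the run is repeatable

POP_MEAN, POP_SD = 150, 30  # hypothetical population values
SAMPLE_SIZE = 25            # observations per sample
N_SAMPLES = 2000            # number of repeated samples

# Draw many samples from a normal population and record each sample mean.
sample_means = [
    mean([random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)])
    for _ in range(N_SAMPLES)
]

observed_se = stdev(sample_means)            # SD of the sample means
theoretical_se = POP_SD / sqrt(SAMPLE_SIZE)  # sigma / sqrt(n) = 6.0

print(round(mean(sample_means), 1))  # close to the population mean
print(round(observed_se, 2), theoretical_se)
```

The mean of the sample means should land near 150 and their standard deviation near 6, in line with the formula for the standard error.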

95% confidence interval
•Approximately 2 standard errors above and below
the estimate
•The range within which 95% of estimates from
multiple samples would be expected to lie
•Regarded as the range within which the “true
population” value probably lies (with 95% certainty)

95% confidence interval of the mean
The SEM is used to describe a 95% confidence interval for an observed
mean (95% CI = Mean ± 2 SEM).
This confidence interval narrows with larger sample size,
since $SE = \dfrac{SD}{\sqrt{n}}$.

95% CI of the mean
Mean = 150
S.D. = 30
If based on 4 values,
95% CI is mean ± 2 SE
150 ± 2 x 30/√4
150 ± 2 x 15
120 – 180
If based on 100 values,
95% CI is mean ± 2 SE
150 ± 2 x 30/√100
150 ± 2 x 3
144 – 156
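The two intervals can be reproduced with a short Python function. This sketch uses the slide's approximation of 2 standard errors on either side of the mean; the function name `ci95` is an assumption made for illustration.

```python
from math import sqrt


def ci95(mean, sd, n):
    """Approximate 95% CI of the mean: mean ± 2 x SD / sqrt(n)."""
    se = sd / sqrt(n)
    return (mean - 2 * se, mean + 2 * se)


print(ci95(150, 30, 4))    # (120.0, 180.0)
print(ci95(150, 30, 100))  # (144.0, 156.0)
```

With the same mean and SD, the larger sample gives the narrower interval.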

Interpreting Estimates with Confidence
Intervals
•Confident that 95% of all sample
means based on the given sample size
will fall within the range of the CI.

Categorical data
• For categorical data
Compare groups
Use proportions

Example
• In a prevalence study of hypertension (HTN), we found
that:
               Hypertension    No hypertension
Non-smokers    10 (10%)        90
Smokers        26 (26%)        74
• The table shows that the proportion with
HTN was higher among smokers. The question
that arises is whether HTN was really higher
among smokers or the difference was merely due
to chance.
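The proportions in the table can be recomputed from the counts. This sketch covers only the descriptive step; it does not test whether the difference could be due to chance. The dictionary layout and the function name are assumptions made for illustration.

```python
# Counts from the prevalence study table.
table = {
    "Non-smokers": {"htn": 10, "no_htn": 90},
    "Smokers": {"htn": 26, "no_htn": 74},
}


def proportion_with_htn(row):
    """Number with hypertension divided by the group total."""
    return row["htn"] / (row["htn"] + row["no_htn"])


for group, row in table.items():
    print(f"{group}: {proportion_with_htn(row):.0%}")
# Non-smokers: 10%
# Smokers: 26%
```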

Take – home messages:
§Look at your data
§For continuous data, summarize with mean (for
central tendency) and SD (for dispersion) only
for normal bell – shaped distributions
(otherwise, use median and percentiles)
§Interpret mean with confidence interval while
inferring to population
§For categorical data, use proportions.
