SUMMARIZING DATA
Dr Lipilekha Patnaik
Professor, Community Medicine
Institute of Medical Sciences & SUM Hospital
Siksha ‘O’ Anusandhan deemed to be University
Bhubaneswar, Odisha, India
Email: drlipilekha@yahoo.co.in
1
Session Objectives
•Measures of central tendency – Mean, median, mode
•Measures of dispersion – Range, standard deviation, standard error
2
Descriptive Measures for continuous data
•Central tendency measures – They are
computed to give a “center” around
which the measurements in the data
are distributed.
•Variation or variability measures –
They describe data spread or how far
away the measurements are from the
center.
3
Statistics related to continuous variables
• Mean
• Median
• Mode
• Range
• Standard Deviation
• Standard Error
4
Measures of Central Tendency
5
Central tendency measures
•Mean – The average value
Affected by extreme values
•Median – The middle value
Not affected by extremes
•Mode – Most frequently occurring observation;
there may be more than one mode.
6
Mean
•Average
•Arithmetic mean: $\bar{x} = \frac{\text{sum of individual values}}{\text{number of observations}} = \frac{\sum x}{n}$
7
Exercise
• The diastolic blood pressure of 10 individuals was
83, 75, 81, 79, 71, 95, 75, 77, 84, 90.
• Arithmetic mean = (83+75+81+79+71+95+75+77+84+90) / 10 = 810 / 10 = 81
8
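The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the names `diastolic_bp` and `arithmetic_mean` are illustrative and not part of the slides.

```python
# Diastolic blood pressure readings of the 10 individuals in the exercise.
diastolic_bp = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]


def arithmetic_mean(values):
    """Sum of the individual values divided by the number of observations."""
    return sum(values) / len(values)


total = sum(diastolic_bp)                 # 810
mean_bp = arithmetic_mean(diastolic_bp)   # 810 / 10
print(total, mean_bp)
```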
Median
§ The data are first arranged in an ascending or
descending order of magnitude
§ Middle observation is located, which is called
median.
§If the number of values is odd,
Median = middle value
§If the number of values is even,
Median = average of the two middle values
9
Median divides the data into two equal parts
with 50% of the observations above the median
and 50% below it.
10
[Figure: a data set shown unsorted and then sorted in ascending order]
Exercise
• Exercise 1: odd number (11) of observations
• Unsorted: 11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10
• Arranged in ascending order: 8, 9, 10, 10, 11, 11, 12, 12, 12, 13, 15
• Median = 6th (middle) value = 11
• Exercise 2: even number (12) of observations
• Unsorted: 11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10, 12
• Arranged in ascending order: 8, 9, 10, 10, 11, 11, 12, 12, 12, 12, 13, 15
• Median = (11 + 12) / 2 = 11.5
11
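The odd/even rule for the median can be sketched in Python as follows. The function name `median` and the variable names are illustrative; the data are the two exercise sets, using the values that appear in the sorted lists.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10]        # 11 observations
even_data = [11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10, 12]   # 12 observations

median_odd = median(odd_data)
median_even = median(even_data)
print(median_odd, median_even)
```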
Mode
•Most frequent observation.
•The value that appears most frequently in the data set.
12
Exercise
11, 13, 15, 12, 10, 9, 12, 8, 12, 11, 10
Mode = 12
13
Number of seizures/month:
3, 3, 1, 2, 4, 7, 9
•Mean? 4.1
•Median? 3
•Mode? 3
[Figure: bar chart of the number of seizures for each of the 7 observations]
14
What’s wrong with a mean?
•Mean is sensitive to outliers (values far from the middle of the distribution)
–Provides a falsely high or low measure of central tendency when outliers exist.
–In such cases (look at your data), use the median as the preferred measure of central tendency.
15
Number of seizures/month: 100, 2, 3, 3, 4, 7, 1
•Mean? 17.14
•Median? 3
•Mode? 3
[Figure: bar chart of the 7 observations; the value 100 is labelled "Outlier"]
16
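The effect of a single outlier on the three measures can be reproduced with Python's standard `statistics` module. This is a sketch only; the helper `summarize` and the variable names are illustrative, and `multimode` requires Python 3.8 or later.

```python
import statistics

seizures = [3, 3, 1, 2, 4, 7, 9]          # first seizure data set
with_outlier = [100, 2, 3, 3, 4, 7, 1]    # second data set, containing the value 100


def summarize(data):
    """Return mean (2 decimals), median and list of modes for a data set."""
    return {
        "mean": round(statistics.mean(data), 2),
        "median": statistics.median(data),
        "modes": statistics.multimode(data),  # a list, since there may be more than one mode
    }


plain = summarize(seizures)
skewed = summarize(with_outlier)
print(plain)
print(skewed)
```

The mean moves from about 4.1 to about 17.1 once the value 100 is present, while the median and mode both stay at 3.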
Measures of Dispersion
17
Measures of dispersion
• “Dispersion” (also called variability, scatter, spread)
•Measure how spread out a set of data is.
•Dispersion is the scatteredness of the data series
around its average.
18
Measures of dispersion
•Range
•Standard deviation
•Variance
•Interquartile range
19
Range
•The difference between the values of
the two extreme items of a series.
•i.e., the difference between the maximum
and minimum values in a set of
observations.
20
Exercise
•For example, from the following record of diastolic
blood pressure of 10 individuals -
93, 75, 81, 79, 71, 90, 75, 95, 77, 94.
• Highest value = 95
• Lowest value = 71
•The range is expressed as 95 – 71 = 24, or as 71 to 95.
21
Characteristics of Range
•Simplest and most crude measure of
dispersion.
• Affected by the extreme values.
•Gives an idea of the variability very quickly.
22
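A short Python sketch of the range calculation is given below. It assumes the fifth reading in the exercise is 71, which is consistent with the stated lowest value; the variable names are illustrative.

```python
# Diastolic blood pressure of 10 individuals (fifth reading assumed to be 71).
bp = [93, 75, 81, 79, 71, 90, 75, 95, 77, 94]

lowest = min(bp)
highest = max(bp)
value_range = highest - lowest  # difference between the two extreme values

print(f"Range = {highest} - {lowest} = {value_range} ({lowest} to {highest})")
```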
Standard deviation
•Tells us how far individual values in the sample deviate
from the mean.
•Provides an index of variability.
23
Characteristics of Standard Deviation
• Very satisfactory and most widely used measure of
dispersion.
•If SD is small, there is a high probability of getting a
value close to the mean.
• If it is large, values tend to lie farther away from the mean.
• It is less affected by fluctuations of sampling.
24
How to determine a SD
1. Calculate the mean
2. Calculate the difference between each value
and the mean
3. Square each of the differences and sum them
4. Divide the sum by one less than the number of
observations (if n < 30) or by the number of observations
(if n > 30); this gives the variance.
5. Take the square root of the result to obtain the SD.
25
Standard deviation
26
Standard deviation
• The diastolic blood pressure of 10 individuals was as follows: 83, 75, 81, 79, 71, 95, 75, 77,
84, 90.

x      (x – x̄)    (x – x̄)²
83        2           4
75       -6          36
81        0           0
79       -2           4
71      -10         100
95       14         196
75       -6          36
77       -4          16
84        3           9
90        9          81

Σx = 810          Σ(x – x̄)² = 482
n = 10
Mean = 810 / 10 = 81
SD = √(482 / (10 – 1)) ≈ 7.32
27
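The steps for determining an SD can be followed one by one in Python. This is a minimal sketch using the n – 1 divisor (the slides' rule for n < 30); the variable names are illustrative, and `statistics.stdev` is used only as a cross-check.

```python
import math
import statistics

bp = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]
n = len(bp)

# Step 1: calculate the mean.
mean = sum(bp) / n

# Step 2: difference between each value and the mean.
deviations = [x - mean for x in bp]

# Step 3: square each difference and sum them.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by one less than the number of observations (sample variance).
variance = sum_of_squares / (n - 1)

# Step 5: the SD is the square root of the variance.
sd = math.sqrt(variance)

print(mean, sum_of_squares, round(variance, 2), round(sd, 2))
print(round(statistics.stdev(bp), 2))  # library value, also uses n - 1
```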
Uses of the standard deviation
•The standard deviation enables us to determine,
with a great deal of accuracy, where the values
of a frequency distribution are located in relation
to the mean.
28
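Standard Deviation (SD) – for ‘Normal distribution’
[Figure: normal curve of birth weight (kg), with N on the vertical axis, marked at 1.5, 2.5, 3.5, 4.5 and 5.5 kg]
Mean birth weight = 3.5 kg
Std Dev. = 1.0 kg
Mean ± 1 SD = 3.5 ± 1 kg = 2.5 – 4.5 kg, covering 68% of values
Mean ± 2 SD = 3.5 ± 2 kg = 1.5 – 5.5 kg, covering 95% of values
29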
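The 68% and 95% figures can be checked against an ideal normal distribution with the stated mean and SD. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8+); the helper `share_between` is an illustrative name, and real birth-weight data will only approximately follow this curve.

```python
from statistics import NormalDist

# Idealized birth-weight distribution from the slide: mean 3.5 kg, SD 1.0 kg.
birth_weight = NormalDist(mu=3.5, sigma=1.0)


def share_between(dist, low, high):
    """Proportion of a normal distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_1sd = share_between(birth_weight, 2.5, 4.5)  # mean ± 1 SD
within_2sd = share_between(birth_weight, 1.5, 5.5)  # mean ± 2 SD

print(f"{within_1sd:.1%} within 1 SD, {within_2sd:.1%} within 2 SD")
```

The exact proportions are about 68.3% and 95.4%, which the slide rounds to 68% and 95%.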
Variance
• $\text{Variance} = (SD)^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
• Indicates the degree of variability among the
observations for a given variable.
30
Percentiles
• The pth percentile is a number such that at most p%
of the measurements are below it and at most
(100 – p)% of the measurements are above it.
• Ex – if the 85th percentile of a data set is 520,
then 85% of the measurements in the data are
below 520 and 15% of the measurements are
above 520.
31
Percentiles - for
non-normally distributed data
Percentile: the % of data that fall below a specific value.
[Figure: distribution of diastolic BP (axis from 50 to 120, N on the vertical axis) divided into four 25% segments at the 25th, 50th and 75th percentiles]
50th percentile is the MEDIAN.
The 25th to the 75th percentile is the INTERQUARTILE RANGE (IQR).
32
INTERQUARTILE RANGE
[Figure: data divided into four 25% segments by Q1, Q2 and Q3]
“Interquartile range” is from Q1 to Q3.
Interquartile range = Q3 – Q1
33
To calculate it, just subtract quartile 1
from quartile 3.
Example: 5, 8, 4, 4, 6, 3, 8.
• First put the list of numbers in order.
• Then cut the list into 4 equal parts.
• The quartiles are the cuts.
3, 4, 4, 5, 6, 8, 8
Q1 (lower quartile) = 4
Q2 (middle quartile, median) = 5
Q3 (upper quartile) = 8
Interquartile range is Q3 – Q1 = 8 – 4 = 4
34
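The quartile example can be reproduced with `statistics.quantiles` from the Python standard library (3.8+). This is a sketch under one assumption worth stating: several conventions exist for locating quartiles, and the library's default ("exclusive") method happens to agree with the cuts in this example; other data sets or methods may give slightly different values.

```python
import statistics

data = [5, 8, 4, 4, 6, 3, 8]

# quantiles() sorts the data internally; n=4 returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(sorted(data))      # [3, 4, 4, 5, 6, 8, 8]
print(q1, q2, q3, iqr)
```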
Standard Error
•If we take a random sample of size n from the population,
and similar samples over and over again, we will find
that every sample has a different mean (x̄).
•If we make a frequency distribution of all the sample
means drawn from the same population, we will find
that the distribution of the means is nearly a normal
distribution and that the mean of the sample means is
practically the same as the population mean (μ).
35
•This is a very important observation: the sample
means are distributed normally about the population
mean (μ).
•The standard deviation of the means is a measure of
the sampling error and is given by the formula σ/√n,
which is called the standard error or the standard
error of the mean.
36
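The idea of repeated sampling can be illustrated with a small simulation. Everything numeric here is a hypothetical assumption chosen for illustration (a normal population with mean 50 and SD 10, samples of size 25, 2000 repetitions); none of it comes from the slides.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

POP_MEAN, POP_SD = 50.0, 10.0        # hypothetical population parameters
SAMPLE_SIZE, N_SAMPLES = 25, 2000    # hypothetical sampling design

# Draw many random samples and record the mean of each one.
sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)       # close to the population mean
observed_se = statistics.stdev(sample_means)        # SD of the sample means
theoretical_se = POP_SD / math.sqrt(SAMPLE_SIZE)    # sigma / sqrt(n) = 2.0

print(round(mean_of_means, 2), round(observed_se, 2), theoretical_se)
```

The SD of the simulated sample means should land close to σ/√n, and their average close to the population mean.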
95% confidence interval
•Approximately 2 standard errors above and below
the estimate
•The range within which 95% of estimates from
multiple samples would be expected to lie
•Regarded as the range within which the “true
population” value probably lies (with 95% certainty)
37
95% confidence interval of the mean
The SEM is used to describe a 95% confidence interval for an observed
mean (95% CI = Mean ± 2 SEM).
This confidence interval narrows with larger sample size,
since $SE = \frac{SD}{\sqrt{n}}$.
38
95% CI of the mean
Mean = 150
S.D. = 30
If based on 4 values,
95% CI is mean ± 2 SE
150 ± 2 x 30/√4
150 ± 2 x 15
120 – 180
If based on 100 values,
95% CI is mean ± 2 SE
150 ± 2 x 30/√100
150 ± 2 x 3
144 – 156
39
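The two intervals can be recomputed with a short Python function. This is a sketch of the slides' approximate rule (mean ± 2 SE); the function name `approx_ci95` is illustrative, and an exact interval would use a multiplier from the normal or t distribution rather than 2.

```python
import math


def approx_ci95(mean, sd, n):
    """Approximate 95% CI of the mean: mean ± 2 × SE, with SE = SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    return (mean - 2 * se, mean + 2 * se)


ci_small = approx_ci95(mean=150, sd=30, n=4)     # SE = 15
ci_large = approx_ci95(mean=150, sd=30, n=100)   # SE = 3

print(ci_small, ci_large)
```

With the same mean and SD, the interval shrinks from 120–180 to 144–156 as the sample size grows from 4 to 100.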
Interpreting Estimates with Confidence
Intervals
•Confident that 95% of all sample
means based on the given sample size
will fall within the range of the CI.
40
Categorical data
• For categorical data
Compare groups
Use proportions
41
Example
• In a prevalence study of hypertension (HTN), we found
that

               Hypertension    No Hypertension
Non smokers    10 (10%)        90
Smokers        26 (26%)        74

• It is visible from the table that the proportion of
HTN was higher among smokers. The question
that arises is whether HTN was really higher
among smokers or the difference was merely due
to chance.
42
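The proportions in the table can be computed per group as below. This is only a sketch of the descriptive step; the dictionary layout and the function name `prevalence` are illustrative, and it does not address whether the difference is due to chance.

```python
# Counts from the table: (with hypertension, without hypertension) per group.
counts = {
    "Non smokers": (10, 90),
    "Smokers": (26, 74),
}


def prevalence(with_htn, without_htn):
    """Proportion of a group that has hypertension."""
    return with_htn / (with_htn + without_htn)


proportions = {group: prevalence(*pair) for group, pair in counts.items()}

for group, p in proportions.items():
    print(f"{group}: {p:.0%}")
```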
Take – home messages:
§Look at your data
§For continuous data, summarize with mean (for
central tendency) and SD (for dispersion) only
for normal bell – shaped distributions
(otherwise, use median and percentiles)
§Interpret mean with confidence interval while
inferring to population
§For categorical data, use proportions.
43