Data Analysis Course
Testing of Hypothesis
Venkat Reddy
Data Analysis Course
•   Data analysis design document
•   Introduction to statistical data analysis
•   Descriptive statistics
•   Data exploration, validation & sanitization
•   Probability distributions examples and applications
•   Simple correlation and regression analysis
•   Multiple linear regression analysis
•   Logistic regression analysis

• Testing of hypothesis
•   Clustering and decision trees
•   Time series analysis and forecasting
•   Credit Risk Model building-1
•   Credit Risk Model building-2
Note
• This presentation is just class notes. The course notes for Data
  Analysis Training were written by me, as an aid for myself.
• The best way to treat this is as a high-level summary; the
  actual session went more in depth and contained other
  information.
• Most of this material was written as informal notes, not
  intended for publication
• Please send questions/comments/corrections to
  venkat@TrendwiseAnalytics.com or 21.venkat@gmail.com
• Please check my website for the latest version of this document
                                         -Venkat Reddy
Contents
• What is the need of testing
• Recap of sampling distribution
• Hypothesis testing
   • Five main steps in testing
       •   Assumptions
       •   Hypotheses
       •   Test Statistic
       •   P-value (P)
       •   Conclusion
• Testing example
• Types of errors
• Testing for Means
   • Z test
   • T test
• Testing for Proportions
• Test of independence
Inference
• A cake weighs 20 kg. How do you decide about its taste: by tasting a
  piece, or only after eating it completely?
• A truck is half filled with oranges and the rest with apples. How would
  you verify that? By counting them all?
• Apple & Samsung sell mobiles all over the world. If you want to find
  which one people prefer, do you take an opinion poll of all the
  users around the world?
• A product manager claims that girls are buying their product more
  than boys. Is there any association between gender and buying or
  not buying?
Inference
• A cake weighs 20 kg. How do you decide about its taste: by tasting a piece, or
  only after eating it completely?
   • You took a piece of cake and were not completely satisfied with the taste. Would
     you recommend the cake?
   • You took a 10 gram piece of cake and didn't like the taste. What would you say?
   • You took a 1 milligram piece of cake and liked the taste. What would you say?
• A truck is half filled with oranges and the rest with apples (the owner claims 10,000
  oranges & 10,000 apples). How would you verify that? By counting them all?
   • You randomly picked 200 fruits; 120 are apples and 80 are oranges. What would
     you say?
   • You randomly picked 200 fruits; 190 are apples and 10 are oranges. What would
     you say?

   How bad must the sample be before we say the population is also bad?
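One way to put a number on "how bad is bad enough" for the truck example is sketched below. It assumes the 200 draws behave like independent draws from a 50/50 truck (a binomial approximation, reasonable because 200 is small next to 20,000 fruits); the function name `upper_tail_prob` is made up for this illustration.

```python
from math import comb

def upper_tail_prob(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more apples in n draws."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of a sample at least this lopsided if the owner's 50/50 claim is true
p_120 = upper_tail_prob(200, 120)
p_190 = upper_tail_prob(200, 190)

print(f"P(at least 120 apples out of 200) = {p_120:.4g}")
print(f"P(at least 190 apples out of 200) = {p_190:.4g}")
```

Both probabilities are small, but the second is smaller by many orders of magnitude, which is the sense in which 190 apples is "much worse" evidence for the claim than 120.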
Statistical Inference
• Inferences about a population are made on the basis of results
  obtained from a sample drawn from that population
• We want to talk about the larger population from which the
  subjects are drawn, not the particular subjects!
• A hypothesis test is a process that uses sample statistics to
  test a claim about the value of a population parameter.
• A verbal statement, or claim, about a population parameter is
  called a statistical hypothesis.
• Hypothesis: Proportion of apples = 0.5
Applications of testing
• Law and Forensics
  • Testing for discrimination (in admission, hiring, pay, promotion practices, etc.) –
    test of proportions
  • Paternity testing
  • Testing whether evidence found on a suspect came from the crime scene (blood,
    fiber, fingerprints, ...)
  • Indeed, testing whether the defendant is guilty or not
• Medicine and Health
  • Testing if a new drug is effective or ineffective (test of association)
  • Testing if particulate matter in air pollution causes lung cancer
  • Testing if a particular gene is responsible for hemophilia
• Industry/Business/Economics
  • Testing if a production machine is ‘in control’ or not (test of means)
  • Testing if a silicon wafer is good for use
• Science and Engineering
  • Testing which theory of gravitation is correct, based on dark matter
  • Testing whether men and women differ according to a psychological trait
  • Testing if two species have a common ancestor
Hypothesis Testing – An example
The CEO of SBI claims that the mean age of employees in SBI bank is 35
(292,215 employees). How can we prove or disprove it?
1. Take a random sample (500) and find their ages
2. If the sample average is near 35, then we say there is no evidence to
   reject that hypothesis
3. What if the sample average age is lower or higher than 35?
4. How far is really far? We want to quantify the severity
   of the deviation
5. Let's find the probability of this occurrence
6. If it is really low, then we reject the null
   hypothesis
Hypothesis testing process
[Figure: hypothesis testing process]
 1. Assume the population mean age is 35 (null hypothesis).
 2. Draw a sample of 500 from the population (100,000).
 3. The sample mean is 40.
 4. Is X̄ = 40 ≈ μ = 35? No, not likely!
 5. REJECT the null hypothesis.
Reason for Rejecting H0
[Figure: sampling distribution of the sample mean under H0, centered at
μ = 35, with the observed sample mean of 40 far out in the tail]

It is unlikely that we would get a sample mean of this value (40) if in fact
μ = 35 were the population mean. Therefore, we reject the null hypothesis
that μ = 35.
Sampling distribution -Recap
   • A sampling distribution is the probability distribution of a sample
     statistic that is formed when samples of size n are repeatedly taken
     from a population.
   • If the sample statistic is the sample mean, then the distribution is
     the sampling distribution of sample means.

[Figure: repeated samples drawn from one population]

The sampling distribution consists of the values of the sample
means of all the samples.
Central Limit theorem -Recap
 If a sample of size n (n ≥ 30) is taken from a population with
 any type of distribution that has mean μ and standard deviation σ,
 the sample means will have an approximately normal distribution
 with mean μ and standard deviation

 $$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
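The recap can be checked with a small simulation. The sketch below assumes a skewed population (exponential with mean 1 and standard deviation 1, chosen only for illustration), draws many samples of size 30, and compares the spread of the sample means with σ/√n.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30              # size of each sample
num_samples = 5000  # number of repeated samples

# Skewed population: exponential with mean 1 and standard deviation 1
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = 1.0 / n ** 0.5  # sigma / sqrt(n) with sigma = 1

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {expected_se:.3f})")
```

A histogram of `sample_means` would look close to a bell curve even though the population itself is far from normal.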
Five Steps in Testing of Hypothesis
1.   Make assumptions and meet test requirements.
2.   State the null hypothesis.
3.   Select the sampling distribution and establish the critical
     region.
4.   Compute the test statistic.
5.   Make a decision and interpret results.
Step 1: Make Assumptions and Meet Test
Requirements
• Random sampling
  • Hypothesis testing assumes samples were selected using random
    sampling.
  • In this case, the sample of 500 cases was randomly selected from
    all major branches.
• Level of measurement is interval-ratio.
  • Yes: age is a numeric variable, not a categorical one
• Sampling distribution is normal in shape.
  • What is the sampling distribution of the mean age?
• This is a “large” sample (n ≥ 100).
Step 2: State the Null Hypothesis
• H0: μ = 35
• In other words, H0: no difference between the sample mean
  and the population parameter
• In other words, the sample mean of 40 is really the same as
  the population mean of 35; the difference is not real but is
  due to chance.
• In other words, the sample of 500 comes from a population
  that has an average age of 35
• In other words, the difference between 35 and 40 is trivial
  and caused by random chance.
Step 2 (cont.): State the Alternate Hypothesis

• H1: μ ≠ 35
• Or H1: There is a difference between the sample mean and
  the population parameter
• Or the sample of 500 comes from a population that does not have
  an average age of 35. In reality, it comes from a different population.
• Or the difference between 40 and 35 reflects an actual
  difference
• Or, for a one-sided alternative (H1: μ > 35), the average age of the
  population is more than 35
Step 3: Select Sampling Distribution and
Establish the Critical Region
• What is the sampling distribution of the sample mean?
• What is alpha (α)?
   • The probability of rejecting H0 when it is true
• α is the indicator of “rare” events.
• Any difference with a probability less than α is rare and will cause us
  to reject H0.
• We started by assuming H0 is true, but we still want to reject the null if the
  statistic is beyond a certain value.
• We already know that some values of the test statistic are unlikely when the
  null hypothesis is true.
• For example, if the average age of the sample is 60, we definitely
  want to reject the null.

Details later
Step 4: Use Formula to Compute the Test
Statistic - Z for large samples (≥ 100)
 • We got a sample average of 40. The age according to the null hypothesis
   is 35, so there is a difference of 5. Is it due to chance?
 • How bad is this difference of 5?

 $$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{N}}$$

 When the population σ is not known, use the following formula:

 $$Z = \frac{\bar{X} - \mu}{s / \sqrt{N - 1}} = \frac{40 - 35}{7.86 / \sqrt{500 - 1}} \approx 14.2$$
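The arithmetic for the SBI example can be reproduced in a few lines. This is a minimal sketch using the slide's numbers (sample mean 40, null mean 35, s = 7.86, N = 500); the function name `z_statistic` is only illustrative.

```python
from math import sqrt

def z_statistic(sample_mean, null_mean, sample_sd, n):
    """Z score using the slide's formula for unknown sigma: s / sqrt(n - 1)."""
    standard_error = sample_sd / sqrt(n - 1)
    return (sample_mean - null_mean) / standard_error

z = z_statistic(sample_mean=40, null_mean=35, sample_sd=7.86, n=500)
print(f"Z = {z:.2f}")
```

The difference of 5 years is about 14 standard errors away from the hypothesized mean, which is why it is treated as far too large to be chance.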
Step 5: Make a Decision and Interpret
 Results
  • The obtained Z score fell in the critical region, so we reject H0.
     • If H0 were true, a Z score of 14 would be very unlikely.
     • Therefore, we reject H0.

  [Figure: right-tailed test with H0: μ ≤ 35 and H1: μ > 35; the rejection
  region of area α lies in the right tail, and the p-value is the area
  beyond the observed Z]

What does a z of 14 mean? The probability of z being more than 14 is less than
0.000000001
• It is like getting more than 25 heads in a row when you toss a coin
• It is like drawing the same card more than 6 times from a shuffled deck
• If the average age of 40 is just by chance, then compare that chance with the above
   examples
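The size of that tail probability can be computed directly. The sketch below uses the standard identity P(Z > z) = ½·erfc(z/√2) for a standard normal variable; the helper name `right_tail_p` is an assumption made for this example.

```python
from math import erfc, sqrt

def right_tail_p(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

p_z14 = right_tail_p(14.0)
p_25_heads = 0.5 ** 25  # chance of 25 heads in 25 fair tosses, for comparison

print(f"P(Z > 14)      = {p_z14:.3g}")
print(f"P(25 heads)    = {p_25_heads:.3g}")
print(f"P(Z > 1.96)    = {right_tail_p(1.96):.4f}")
```

Using `erfc` rather than `1 - cdf` keeps precision for such extreme values, where subtracting from 1 would round to zero.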
What is P-Value
• If the observed statistic arose just by chance, the p-value tells
  us the probability of that chance
• The P-value answers the question: What is the probability of the
  observed test statistic, or one more extreme, when H0 is true?
• Given H0, the probability of the current value or one more extreme than this
• Given H0 is true, the probability of obtaining a result as extreme or more
  extreme than the actual sample
• The observed significance level, or p-value, of a test of hypothesis is
  the probability of obtaining the observed value of the sample
  statistic, or one which is even more supportive of the alternative
  hypothesis, under the assumption that the null hypothesis is true.
• The smallest α at which the observed sample would reject H0
P -value
• If the alternative hypothesis contains the greater-than symbol (>), the
  hypothesis test is a right-tailed test.


   H0: μ = k
   Ha: μ > k

   [Figure: standard normal curve with z axis from -3 to 3; P is the area to
   the right of the test statistic.]
One tail & two tailed tests
• Two-tailed Test
  • If the alternative hypothesis contains the not-equal-to symbol (≠),
    the hypothesis test is a two-tailed test. In a two-tailed test, each
    tail has an area of 0.5P.
        • H0: μ = k
        • Ha: μ ≠ k
• Left-tailed Test
  • If the alternative hypothesis contains the less-than inequality
    symbol (<), the hypothesis test is a left-tailed test.
        • H0: μ ≥ k
        • Ha: μ < k
• Right-tailed Test
  • If the alternative hypothesis contains the greater-than inequality
    symbol (>), the hypothesis test is a right-tailed test.
        • H0: μ ≤ k
        • Ha: μ > k
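The three cases differ only in which tail area is reported as P. A minimal sketch, assuming a z test statistic and using `statistics.NormalDist` from the Python standard library (the function name `p_value` and the `tail` labels are made up for this example):

```python
from statistics import NormalDist

def p_value(z, tail):
    """p-value for a z statistic; tail is 'left', 'right' or 'two'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)            # area to the left of z
    if tail == "right":
        return 1 - std_normal.cdf(z)        # area to the right of z
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))  # both tails beyond |z|
    raise ValueError("tail must be 'left', 'right' or 'two'")

for tail in ("left", "right", "two"):
    print(f"z = 1.96, {tail}-tailed: P = {p_value(1.96, tail):.4f}")
```

For the same z, the two-tailed P is twice the smaller one-tailed P, which matches the slide's statement that each tail holds 0.5P.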
Types of errors
• No matter which hypothesis represents the claim, always
  begin the hypothesis test assuming that the null hypothesis is
  true.
• At the end of the test, one of two decisions will be made:
  • reject the null hypothesis, or
  • fail to reject the null hypothesis.
• A type I error occurs if the null hypothesis is rejected when it
  is true.
• A type II error occurs if the null hypothesis is not rejected
  when it is false.
Error Types
• Type I Error: Reject H0 when it is true
• Type II Error: Do not reject H0 when it is false


                       Test Result
     Reality           Reject H0        Don’t Reject H0
     H0 True           Type I Error     Correct
     H0 False          Correct          Type II Error
Level of Significance α
• Defines unlikely values of the sample statistic if the null hypothesis is
  true
  • Called the rejection region of the sampling distribution
• Designated α (alpha)
• Typical values are 0.01, 0.05, 0.10
• Selected by the researcher at the start
• Provides the critical value(s) of the test
• α = P(Type I error)
• Think of the analogy with the SBI average age
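Level of Significance, α and the Rejection
  Region

[Figure: rejection regions marked by the critical value(s) on the sampling
distribution]
• H0: μ ≥ 35, H1: μ < 35: rejection region of area α in the left tail
• H0: μ ≤ 35, H1: μ > 35: rejection region of area α in the right tail
• H0: μ = 35, H1: μ ≠ 35: rejection regions of area α/2 in each tail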
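The critical values that bound these rejection regions come from the inverse of the standard normal CDF. The sketch below assumes a z test and uses `NormalDist.inv_cdf` from the standard library; the function name `critical_values` is illustrative only.

```python
from statistics import NormalDist

def critical_values(alpha, tail):
    """Critical z value(s) for a test at level alpha; tail is 'left', 'right' or 'two'."""
    std_normal = NormalDist()
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)       # reject if z is below this
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)   # reject if z is above this
    if tail == "two":
        c = std_normal.inv_cdf(1 - alpha / 2)     # alpha/2 in each tail
        return (-c, c)
    raise ValueError("tail must be 'left', 'right' or 'two'")

for tail in ("left", "right", "two"):
    values = ", ".join(f"{v:.3f}" for v in critical_values(0.05, tail))
    print(f"alpha = 0.05, {tail}-tailed: critical z = {values}")
```

In the SBI example the observed Z of about 14 is far beyond any of these cutoffs, so the decision does not depend on which of the typical α values is chosen.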
Power of test 1-β
• P(Type II error) is β
• P(Type II error) = β depends on the true value of the
  parameter (from the range of values in Ha).
• The farther the true parameter value falls from the null value,
  the easier it is to reject the null, and P(Type II error) goes down.
• Power of test = 1 - β = P(reject null, given it is false)
• In practice, you want a large enough n for your study so that
  P(Type II error) is small for the size of effect you expect.
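How power grows with the effect size and with n can be sketched for a right-tailed z test. The example assumes a known σ (7.86 is borrowed from the SBI example purely as an illustrative value), and the true means and sample sizes tried are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power = P(reject H0: mu <= mu0) when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # How many standard errors the true mean sits above the null mean
    shift = (mu_true - mu0) / (sigma / sqrt(n))
    return 1 - std_normal.cdf(z_crit - shift)

for mu_true in (35.5, 36.0):
    for n in (100, 500):
        pw = power_right_tailed_z(35, mu_true, 7.86, n)
        print(f"true mean {mu_true}, n = {n}: power = {pw:.3f}, beta = {1 - pw:.3f}")
```

When the true mean equals the null mean, the function returns α itself, since rejecting a true null is exactly the Type I error.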
Which error is bad?
• False negative (Type II error)
  • Miss what could be important
      • Testing whether a metal is gold or not
      • Are these samples going to be looked at again?
• False positive (Type I error)
  • Waste resources following dead ends
      • Testing whether a drug is deadly or not
Confidence Intervals
• Hypothesis testing focuses on where the sample mean is
  located
• Confidence intervals focus on plausible values for the
  population mean
• General formula for a 100(1-α)% CI for μ

$$\left( \bar{X} - Z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$$

• Construct an interval around the point estimate
• Look to see if the population/null mean is inside
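Applied to the SBI example, the interval can be computed as below. This is a sketch that assumes the sample standard deviation 7.86 can stand in for σ (reasonable only because n = 500 is large); the function name is made up for the illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided z interval: sample_mean +/- Z(1 - alpha/2) * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=40, sigma=7.86, n=500)
print(f"95% CI for the mean age: ({low:.2f}, {high:.2f})")
print("null mean 35 inside interval:", low <= 35 <= high)
```

The null value 35 falls well outside the interval, which agrees with the rejection reached by the Z test.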
Significance Test for Mean
  For large samples:

  $$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{N}}$$

  For small samples:

  $$t = \frac{\bar{y} - \mu_0}{se}, \quad \text{where } se = s / \sqrt{n}$$

• Sampling distribution for small samples is t
• The curve of the t distribution varies with sample size (the smaller the size, the
  flatter the curve)
• In using the t-table, we use “degrees of freedom” based on the sample size.
• For a one-sample test, df = n – 1.
• When looking at the table, find the t-value for the appropriate df = n-1. This
  will be the cutoff point for your critical region.
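A small-sample version of the test might look like the sketch below. The ten ages are made-up data for illustration, and the cutoff 2.262 is the standard t-table value for a two-sided test at α = 0.05 with df = 9.

```python
from math import sqrt
from statistics import mean, stdev

ages = [38, 41, 35, 44, 39, 36, 42, 40, 37, 43]  # hypothetical small sample
mu0 = 35                                          # mean under the null hypothesis

n = len(ages)
se = stdev(ages) / sqrt(n)      # se = s / sqrt(n), with s the sample sd
t = (mean(ages) - mu0) / se
df = n - 1

t_critical = 2.262  # t-table value: two-sided, alpha = 0.05, df = 9
print(f"t = {t:.2f} with df = {df}")
print("reject H0" if abs(t) > t_critical else "fail to reject H0")
```

With a different sample size the cutoff has to be looked up again for the new df, which is the main practical difference from the Z test.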
Lab: Significance Test for Mean
• It is known that the mean cholesterol level for the nation is 190. We
  test 100 only children and find that the sample average cholesterol level is
  198. Suppose we know the population standard deviation σ = 15. Does
  that signify that only children have a higher average cholesterol level than
  the national average?
• Given this sample, what are the reasonable values for the population mean?
• 50 smokers were questioned about the number of hours they sleep each
  day. We want to test the hypothesis that the smokers need less sleep than
  the general public which needs an average of 7.7 hours of sleep. If the
  sample mean is 7.5 and the population standard deviation is 0.5, what can
  you conclude?
• Given this sample what are the 95% confidence limits for population mean?
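One possible way to work through both lab questions is sketched below, assuming a right-tailed z test for cholesterol and a left-tailed z test for sleep, with the stated population standard deviations taken as known.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Cholesterol: H0: mu <= 190 vs H1: mu > 190 (right-tailed)
z_chol = (198 - 190) / (15 / sqrt(100))
p_chol = 1 - std_normal.cdf(z_chol)

# Sleep: H0: mu >= 7.7 vs H1: mu < 7.7 (left-tailed)
se_sleep = 0.5 / sqrt(50)
z_sleep = (7.5 - 7.7) / se_sleep
p_sleep = std_normal.cdf(z_sleep)

# 95% confidence limits for the smokers' mean sleep
z_95 = std_normal.inv_cdf(0.975)
ci_sleep = (7.5 - z_95 * se_sleep, 7.5 + z_95 * se_sleep)

print(f"cholesterol: z = {z_chol:.2f}, P = {p_chol:.2g}")
print(f"sleep:       z = {z_sleep:.2f}, P = {p_sleep:.4f}")
print(f"sleep 95% CI: ({ci_sleep[0]:.3f}, {ci_sleep[1]:.3f})")
```

In both cases the p-value is below the typical α levels, and the confidence interval for sleep does not contain 7.7.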

Test of Proportion
• Assumptions:
  • Categorical variable
  • Randomization
  • Large sample (but two-sided ok for nearly all n)
• Hypotheses:
   • Null hypothesis: H0: p = p0
   • Alternative hypothesis: Ha: p ≠ p0 (2-sided)
   • Ha: p > p0 or Ha: p < p0 (1-sided)
   • Set up hypotheses before getting the data
• Test statistic:

$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$$
Lab: Test of Proportion
• Suppose a coin toss turns up 12 heads out of 20 trials. At the .05
  significance level, can one reject the null hypothesis that the
  coin is fair?
• Suppose that you interview 1000 exiting voters about who
  they voted for as PM. Of the 1000 voters, 550 reported that they
  voted for Rahul Gandhi. Is there sufficient evidence to suggest
  that Rahul Gandhi will win the election at the .01 level?
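Both lab questions can be approached with the z statistic for a proportion. The sketch below makes two assumptions worth stating: the coin test is two-sided with the normal approximation used despite n = 20 being small, and "win" is read as a vote share above 0.5 in a two-way race.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(successes, n, p0):
    """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

std_normal = NormalDist()

# Coin: H0: p = 0.5 vs Ha: p != 0.5 (two-sided)
z_coin = proportion_z(12, 20, 0.5)
p_coin = 2 * (1 - std_normal.cdf(abs(z_coin)))

# Voters: H0: p <= 0.5 vs Ha: p > 0.5 (right-tailed)
z_vote = proportion_z(550, 1000, 0.5)
p_vote = 1 - std_normal.cdf(z_vote)

print(f"coin:   z = {z_coin:.3f}, P = {p_coin:.3f}, reject at .05: {p_coin < 0.05}")
print(f"voters: z = {z_vote:.3f}, P = {p_vote:.5f}, reject at .01: {p_vote < 0.01}")
```

The coin result is well within what a fair coin produces, while the exit poll gives a p-value below .01 under these assumptions.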
Chi square test for Independence
• Chi square test of independence
• Is happiness independent of family income? A sample of 2955
  families is studied

                          Happiness
        Income      Very      Pretty    Not too   Total
        Above       272       294       49        615
        Average     454       835       131       1420
        Below       185       527       208       920
Chi-square statistic
• Summarize closeness of {fo} and {fe} by

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

 where the sum is taken over all cells in the table.

• When H0 is true, sampling distribution of this statistic is
  approximately (for large n) the chi-squared probability
  distribution.
Chi-square calculation
• In happiness and family income

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(272 - 189.6)^2}{189.6} + \dots = 172.3$$

• df = (3 – 1)(3 – 1) = 4. P-value = 0.000 (rounded, often reported as P <
  0.001). Chi-squared percentile values for various right-tail probabilities
  are in table on text p. 594.
• There is very strong evidence against H0: independence (If H0 were
  true, prob. would be < 0.001 of getting this large a χ² test statistic or
  even larger).
• For significance level α = 0.05 (or α = 0.01 or α = 0.001), we reject H0
  and conclude that an association exists between happiness and
  income.
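The calculation above can be reproduced from the observed table. In this sketch each expected count is (row total × column total) / grand total, and the p-value uses a closed form of the chi-squared right-tail probability that is valid only for df = 4; the function name `chi_square` is illustrative.

```python
from math import exp

# Rows: income Above, Average, Below; columns: Very, Pretty, Not too happy
observed = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]

def chi_square(table):
    """Return (chi-squared statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / total  # expected under independence
            stat += (f_o - f_e) ** 2 / f_e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

chi2, df = chi_square(observed)
# Right-tail probability of chi-squared; this closed form holds only for df = 4
p = exp(-chi2 / 2) * (1 + chi2 / 2)
print(f"chi-squared = {chi2:.1f}, df = {df}, P = {p:.3g}")
```

The first expected count, 615 × 911 / 2955, is the 189.6 that appears in the formula above.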
Lab Chi-square distribution
• Suppose that 125 children are shown three television commercials for
  breakfast cereal and are asked to pick which they liked best. The results are
  shown in the table below. You would like to know if the choice of favorite
  commercial was related to whether the child was a boy or a girl, or if these
  two variables are independent.

             A           B         C         Totals
    Boys     30          29        16        75
    Girls    12          33        5         50
    Totals   42          62        21        125

• Suppose you conducted a drug trial on a group of animals and you
  hypothesized that the animals receiving the drug would show increased
  heart rates compared to those that did not receive the drug. You conduct
  the study and collect the following data.

                  Heart Rate Increased   No Heart Rate Increase   Total
  Treated         36                     14                       50
  Not treated     30                     25                       55
  Total           66                     39                       105
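A possible starting point for both lab tables is sketched below. It uses the chi-squared test of independence for each, with closed-form right-tail probabilities that hold only for df = 2 and df = 1 respectively. Note the assumption that a two-sided chi-squared test is acceptable for the drug trial, even though the stated hypothesis is one-directional.

```python
from math import erfc, exp, sqrt

def chi_square(table):
    """Chi-squared statistic and df; table holds observed counts without totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = sum(
        (f_o - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i, row in enumerate(table)
        for j, f_o in enumerate(row)
    )
    return stat, (len(table) - 1) * (len(table[0]) - 1)

commercials = [[30, 29, 16], [12, 33, 5]]   # boys, girls by commercial A, B, C
drug_trial = [[36, 14], [30, 25]]           # treated, not treated

chi2_ads, df_ads = chi_square(commercials)
p_ads = exp(-chi2_ads / 2)                  # right tail, valid for df = 2 only

chi2_drug, df_drug = chi_square(drug_trial)
p_drug = erfc(sqrt(chi2_drug / 2))          # right tail, valid for df = 1 only

print(f"commercials: chi-squared = {chi2_ads:.2f}, df = {df_ads}, P = {p_ads:.4f}")
print(f"drug trial:  chi-squared = {chi2_drug:.2f}, df = {df_drug}, P = {p_drug:.4f}")
```

Comparing each P with the chosen α then gives the decision for that table.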
Further Reading
• Test of sample means for two populations
• Test of sample proportions for two populations
• Odds ratio for test of association
Venkat Reddy Konasani
Manager at Trendwise Analytics
venkat@TrendwiseAnalytics.com
21.venkat@gmail.com
+91 9886 768879

More Related Content

PDF
Inferential Statistics
PPTX
Presentation chi-square test & Anova
PDF
Multiple regression
PPT
hypothesis test
PDF
Correlation and Simple Regression
PPTX
hypothesis testing-tests of proportions and variances in six sigma
PPT
Regression and Co-Relation
PPTX
Covariance vs Correlation
Inferential Statistics
Presentation chi-square test & Anova
Multiple regression
hypothesis test
Correlation and Simple Regression
hypothesis testing-tests of proportions and variances in six sigma
Regression and Co-Relation
Covariance vs Correlation

Testing of hypothesis

  • 1. Data Analysis Course Testing of Hypothesis Venkat Reddy
  • 2. Data Analysis Course
      • Data analysis design document
      • Introduction to statistical data analysis
      • Descriptive statistics
      • Data exploration, validation & sanitization
      • Probability distributions examples and applications
      • Simple correlation and regression analysis
      • Multiple linear regression analysis
      • Logistic regression analysis
      • Testing of hypothesis
      • Clustering and decision trees
      • Time series analysis and forecasting
      • Credit Risk Model building-1
      • Credit Risk Model building-2
  • 3. Note
      • This presentation is just class notes. The course notes for Data Analysis Training were written by me, as an aid for myself.
      • The best way to treat this is as a high-level summary; the actual session went more in depth and contained other information.
      • Most of this material was written as informal notes, not intended for publication.
      • Please send questions/comments/corrections to venkat@trenwiseanalytics.com or 21.venkat@gmail.com
      • Please check my website for the latest version of this document. -Venkat Reddy
  • 4. Contents
      • What is the need of testing
      • Recap of sampling distribution
      • Hypothesis testing
          • Five main steps in testing
              • Assumptions
              • Hypotheses
              • Test Statistic
              • P-value (P)
              • Conclusion
      • Testing example
      • Types of errors
      • Testing for Means
          • Z test
          • T test
      • Testing for Proportions
      • Test of independence
  • 5. Inference
      • A cake weighs 20 kg. How do you decide about its taste: by tasting a piece, or only after eating it completely?
      • A truck is half filled with oranges and the rest with apples. How would you verify that? By counting them?
      • Apple and Samsung sell mobiles all over the world. If you want to find which one people prefer, do you take an opinion poll of all the users around the world?
      • A product manager claims that girls are buying their product more than boys. Is there any association between gender and buying or not buying?
  • 6. Inference
      • A cake weighs 20 kg. How do you decide about its taste: by tasting a piece, or only after eating it completely?
          • You took a piece of cake and you are not completely satisfied with the taste. Would you recommend the cake?
          • You took a 10 gram piece of cake and you didn't like the taste. What would you say?
          • You took a 1 milligram piece of cake and you liked the taste. What would you say?
      • A truck is half filled with oranges and the rest with apples (the owner claims 10,000 oranges and 10,000 apples). How would you verify that? By counting them?
          • You randomly picked 200 fruits; 120 are apples and 80 are oranges. What would you say?
          • You randomly picked 200 fruits; 190 are apples and 10 are oranges. What would you say?
      • How bad does the sample have to be before we can say the population is also bad?
  • 7. Statistical Inference
      • Inferences about a population are made on the basis of results obtained from a sample drawn from that population
      • We want to talk about the larger population from which the subjects are drawn, not the particular subjects!
      • A hypothesis test is a process that uses sample statistics to test a claim about the value of a population parameter.
      • A verbal statement, or claim, about a population parameter is called a statistical hypothesis.
      • Hypothesis: Proportion of apples = 0.5
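One way to put a number on "how bad is bad enough" for the fruit-truck example is an exact binomial tail probability under the hypothesis that the proportion of apples is 0.5. The sketch below assumes each of the 200 picks behaves like an independent draw with P(apple) = 0.5 (a simplification, since fruits are picked without replacement from about 20,000); the function name `binom_upper_tail` is illustrative only.

```python
from math import comb

def binom_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of seeing at least this many apples if the truck really is half apples
p_120 = binom_upper_tail(120, 200)
p_190 = binom_upper_tail(190, 200)
print(f"P(>= 120 apples out of 200) = {p_120:.4g}")
print(f"P(>= 190 apples out of 200) = {p_190:.4g}")
```

A sample with 120 apples is already fairly unusual under the claim, while 190 apples is practically impossible, which is the intuition the hypothesis test formalizes.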
  • 8. Applications of testing
      • Law and Forensics
          • Testing for discrimination (in admission, hiring, pay, promotion practices, etc.): test of proportions
          • Paternity testing
          • Testing whether evidence found on a suspect came from the crime scene (blood, fiber, fingerprints, ...)
          • Indeed, testing whether the defendant is guilty or not
      • Medicine and Health
          • Testing if a new drug is effective or ineffective: test of association
          • Testing if particulate matter in air pollution causes lung cancer
          • Testing if a particular gene is responsible for hemophilia
      • Industry/Business/Economics
          • Testing if a production machine is 'in control' or not: test of means
          • Testing if a silicon wafer is good for use
      • Science and Engineering
          • Testing which theory of gravitation is correct, based on dark matter
          • Testing whether men and women differ according to a psychological trait
          • Testing if two species have a common ancestor
  • 9. Hypothesis Testing: An example
      • The CEO of SBI claims that the mean age of employees in SBI bank is 35 (292,215 employees). How can we prove or disprove it?
      1. Take a random sample (500) and find their ages
      2. If the sample mean is near 35, we say there is no evidence to reject that hypothesis
      3. What if the sample average age is lower or higher than 35?
      4. How far is really far? We want to quantify the severity of the deviation
      5. Let's find the probability of this occurrence
      6. If it is really low, we reject the null hypothesis
  • 10. Hypothesis testing process
      • Assume the population mean age is 35 (null hypothesis)
      • [Diagram: a sample of 500 is drawn from a population of 100,000; the sample mean is 40]
      • Is $\bar{X} = 40$ consistent with $\mu = 35$? No, not likely!
      • REJECT the null hypothesis
  • 11. Reason for Rejecting H0
      • [Figure: sampling distribution of the sample mean under H0, centered at μ = 35, with the observed sample mean of 40 far out in the tail]
      • It is unlikely that we would get a sample mean of this value (40) if in fact μ = 35 were the population mean
      • Therefore, we reject the null hypothesis that μ = 35
  • 12. Sampling distribution: Recap
      • A sampling distribution is the probability distribution of a sample statistic that is formed when samples of size n are repeatedly taken from a population.
      • If the sample statistic is the sample mean, then the distribution is the sampling distribution of sample means.
      • [Diagram: many samples drawn repeatedly from one population]
      • The sampling distribution consists of the values of the sample means, $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$
  • 13. Central Limit theorem: Recap
      • If samples of size n (n ≥ 30) are taken from a population with any type of distribution that has mean μ and standard deviation σ, the sample means will have an approximately normal distribution with mean $\mu_{\bar{x}} = \mu$ and standard deviation $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
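The Central Limit Theorem recap can be checked with a small simulation. This is a minimal sketch, not part of the original deck: the population is assumed to be exponential with mean 10 (a clearly non-normal, skewed distribution whose standard deviation is also 10), and the sample size of 30 and the 5000 repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Assumed skewed population: exponential with mean 10 (so sigma is also 10)
POP_MEAN, POP_SD, N, REPS = 10.0, 10.0, 30, 5000

# Draw REPS samples of size N and keep each sample's mean
sample_means = [
    statistics.fmean([random.expovariate(1 / POP_MEAN) for _ in range(N)])
    for _ in range(REPS)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theory_sd = POP_SD / N ** 0.5   # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.2f} (theory {POP_MEAN:.2f})")
print(f"sd of sample means:   {sd_of_means:.2f} (theory {theory_sd:.2f})")
```

The mean of the sample means should land close to the population mean, and their spread close to σ/√n, even though the population itself is far from normal.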
  • 14. Five Steps in Testing of Hypothesis
      1. Make assumptions and meet test requirements.
      2. State the null hypothesis.
      3. Select the sampling distribution and establish the critical region.
      4. Compute the test statistic.
      5. Make a decision and interpret results.
  • 15. Step 1: Make Assumptions and Meet Test Requirements
      • Random sampling
          • Hypothesis testing assumes samples were selected using random sampling.
          • In this case, the sample of 500 cases was randomly selected from all major branches.
      • Level of measurement is interval-ratio.
          • Yes: age is not a categorical variable
      • Sampling distribution is normal in shape.
          • What is the sampling distribution of age?
          • This is a "large" sample (n ≥ 100).
  • 16. Step 2: State the Null Hypothesis
      • H0: μ = 35
      • In other words, H0: there is no difference between the sample mean and the population parameter
      • In other words, the sample mean of 40 is really the same as the population mean of 35; the difference is not real but is due to chance
      • In other words, the sample of 500 comes from a population that has an average age of 35
      • In other words, the difference between 35 and 40 is trivial and caused by random chance
  • 17. Step 2 (cont.): State the Alternate Hypothesis
      • H1: μ ≠ 35
      • Or H1: there is a difference between the sample mean and the population parameter
      • Or the sample of 500 comes from a population that does not have an average age of 35; in reality, it comes from a different population
      • Or the difference between 40 and 35 reflects an actual difference
      • Or the average age of the population is more than 35 (a one-sided alternative, H1: μ > 35)
  • 18. Step 3: Select Sampling Distribution and Establish the Critical Region
      • What is the sampling distribution of the sample mean?
      • What is alpha?
          • The probability of rejecting H0 when it is true
          • α is the indicator of "rare" events.
          • Any difference with a probability less than α is rare and will cause us to reject H0.
      • We started with H0 as true, but we still want to reject the null if the statistic is beyond a certain value
      • We already know about some unlikely values of the test statistic when the null hypothesis is true
          • For example, if the average age of the sample is 60, we definitely want to reject the null
      • Details later
  • 19. Step 4: Use Formula to Compute the Test Statistic - Z for large samples (≥ 100)
      • We got a sample average of 40, and the age according to the null hypothesis is 35. There is a difference of 5; is it due to chance?
      • How bad is this difference of 5?
      • $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{N}}$
      • When the population σ is not known, use the sample standard deviation s in the following formula:
      • $Z = \frac{\bar{X} - \mu}{s / \sqrt{n - 1}} = \frac{40 - 35}{7.86 / \sqrt{500 - 1}} \approx 14.2$
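The arithmetic of Step 4 can be reproduced in a few lines. This is only a sketch using the numbers on the slide (sample mean 40, hypothesized mean 35, s = 7.86, n = 500); the variable names are illustrative.

```python
from math import sqrt

sample_mean = 40.0   # observed average age in the sample
mu_0 = 35.0          # mean age claimed under H0
s = 7.86             # sample standard deviation
n = 500              # sample size

# The slide's standard error when sigma is unknown: s / sqrt(n - 1)
standard_error = s / sqrt(n - 1)
z = (sample_mean - mu_0) / standard_error

print(f"standard error = {standard_error:.4f}")
print(f"z = {z:.2f}")
```

The difference of 5 years is about 14 standard errors away from the hypothesized mean, which is the value interpreted in Step 5.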
  • 20. Step 5: Make a Decision and Interpret Results
      • The obtained Z score fell in the critical region, so we reject H0.
      • If H0 were true, a Z score of 14 would be very unlikely.
      • Therefore, we reject H0.
      • [Figure: right-tailed test, H0: μ ≤ 35, H1: μ > 35, with the rejection region α and the p-value marked in the right tail]
      • What does a z of 14 mean? The probability of z being more than 14 is less than 0.000000001
          • It is like getting more than 25 heads in a row when you toss a coin
          • It is like drawing the same card more than 6 times from a shuffled deck
          • If the average age of 40 is just by chance, compare that chance with the examples above
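To see how small these probabilities are, the right-tail area beyond z = 14 can be compared with the coin and card analogies. The sketch below uses the complementary error function from the standard library; it assumes a fair coin and, for the card analogy, that one named card is drawn 6 times with the deck reshuffled each time.

```python
from math import erfc, sqrt

def right_tail_p(z):
    """P(Z > z) for a standard normal Z, accurate far out in the tail."""
    return 0.5 * erfc(z / sqrt(2))

p_value = right_tail_p(14.0)
heads_25 = 0.5 ** 25          # 25 heads in a row with a fair coin
same_card_6 = (1 / 52) ** 6   # one named card drawn 6 times, reshuffling each time

print(f"P(Z > 14)            = {p_value:.3g}")
print(f"P(25 heads in a row) = {heads_25:.3g}")
print(f"P(same card 6 times) = {same_card_6:.3g}")
```

Both analogies are, if anything, generous: the tail probability beyond z = 14 is far smaller than either of them.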
  • 21. What is P-Value
      • If the observed statistic happened just by chance, the p-value tells us the probability of that chance
      • The p-value answers the question: what is the probability of the observed test statistic, or one more extreme, when H0 is true?
      • Given that H0 is true, it is the probability of obtaining a result as extreme as or more extreme than the actual sample
      • The observed significance level, or p-value, of a test of hypothesis is the probability of obtaining the observed value of the sample statistic, or one which is even more supportive of the alternative hypothesis, under the assumption that the null hypothesis is true.
      • It is the smallest α at which the observed sample would reject H0
  • 22. P-value
      • If the alternative hypothesis contains the greater-than symbol (>), the hypothesis test is a right-tailed test.
          • H0: μ = k
          • Ha: μ > k
      • P is the area to the right of the test statistic.
      • [Figure: standard normal curve with the z axis from -3 to 3; the area to the right of the test statistic is shaded]
  • 23. One-tailed & two-tailed tests
      • Two-tailed test
          • If the alternative hypothesis contains the not-equal-to symbol (≠), the hypothesis test is a two-tailed test. In a two-tailed test, each tail has an area of 0.5P.
          • H0: μ = k
          • Ha: μ ≠ k
      • Left-tailed test
          • If the alternative hypothesis contains the less-than inequality symbol (<), the hypothesis test is a left-tailed test.
          • H0: μ ≥ k
          • Ha: μ < k
      • Right-tailed test
          • If the alternative hypothesis contains the greater-than inequality symbol (>), the hypothesis test is a right-tailed test.
          • H0: μ ≤ k
          • Ha: μ > k
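The three kinds of test differ only in which tail area is reported as the p-value. A minimal sketch for a z statistic follows; the function name `z_test_p_value` and the `tail` labels are illustrative and not from the slides.

```python
from statistics import NormalDist

def z_test_p_value(z, tail):
    """p-value for a z statistic; tail is 'left', 'right', or 'two'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":         # Ha: mu < k
        return std_normal.cdf(z)
    if tail == "right":        # Ha: mu > k
        return 1 - std_normal.cdf(z)
    if tail == "two":          # Ha: mu != k, both tails counted
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

for tail in ("left", "right", "two"):
    print(tail, round(z_test_p_value(1.96, tail), 4))
```

For the same z, the two-tailed p-value is twice the smaller one-tailed value, which is why each tail holds 0.5P.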
  • 24. Types of errors
      • No matter which hypothesis represents the claim, always begin the hypothesis test assuming that the null hypothesis is true.
      • At the end of the test, one of two decisions will be made:
          • reject the null hypothesis, or
          • fail to reject the null hypothesis.
      • A type I error occurs if the null hypothesis is rejected when it is true.
      • A type II error occurs if the null hypothesis is not rejected when it is false.
  • 25. Error Types
      • Type I Error: Reject H0 when it is true
      • Type II Error: Do not reject H0 when it is false

                              Test Result
        Reality       Reject H0         Don't Reject H0
        H0 True       Type I Error      Correct
        H0 False      Correct           Type II Error
  • 26. Level of Significance α
      • Defines the unlikely values of the sample statistic if the null hypothesis is true
          • Called the rejection region of the sampling distribution
      • Designated α (alpha)
          • Typical values are 0.01, 0.05, 0.10
      • Selected by the researcher at the start
      • Provides the critical value(s) of the test
      • α = P(Type I error)
      • Think of the analogy with the SBI average age
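That α is the probability of a Type I error can be illustrated by simulation: generate many samples for which H0 really is true and count how often a two-tailed z test rejects it. This is a sketch under assumed values (population mean 35, σ = 8, n = 50, 4000 trials), none of which come from the slides.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

# Assumed values for illustration only
MU_0, SIGMA, N, ALPHA, TRIALS = 35.0, 8.0, 50, 0.05, 4000
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)   # two-tailed critical value

rejections = 0
for _ in range(TRIALS):
    # H0 is true here: every sample comes from a population with mean MU_0
    sample = [random.gauss(MU_0, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU_0) / (SIGMA / sqrt(N))
    if abs(z) > z_crit:
        rejections += 1    # rejecting a true H0 is a Type I error

type_1_rate = rejections / TRIALS
print(f"observed Type I error rate: {type_1_rate:.3f} (alpha = {ALPHA})")
```

The observed rejection rate should sit close to the chosen α, since every rejection in this setup is a false alarm.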
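  • 27. Level of Significance α and the Rejection Region
      • [Figure: three sampling distributions centered at 0, with the rejection regions shaded and the critical value(s) marked]
      • H0: μ ≥ 35, H1: μ < 35: rejection region of area α in the left tail
      • H0: μ ≤ 35, H1: μ > 35: rejection region of area α in the right tail
      • H0: μ = 35, H1: μ ≠ 35: rejection regions of area α/2 in each tail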
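For a z test, the critical values that bound these rejection regions are quantiles of the standard normal distribution. The sketch below prints them for the typical α values; it uses `statistics.NormalDist` from the Python standard library rather than a printed z-table.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

for alpha in (0.10, 0.05, 0.01):
    left = std_normal.inv_cdf(alpha)          # left-tailed: reject if z < left
    right = std_normal.inv_cdf(1 - alpha)     # right-tailed: reject if z > right
    two = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed: reject if |z| > two
    print(f"alpha={alpha:.2f}  left={left:.3f}  right={right:.3f}  two-tailed=+/-{two:.3f}")
```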
  • 28. Power of test 1-β
      • P(Type II error) is β
      • P(Type II error) = β depends on the true value of the parameter (from the range of values in Ha).
      • The farther the true parameter value falls from the null value, the easier it is to reject the null, and P(Type II error) goes down.
      • Power of test = 1 - β = P(reject null, given it is false)
      • In practice, you want a large enough n for your study so that P(Type II error) is small for the size of effect you expect.
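The dependence of power on the true parameter value can be sketched for a right-tailed z test with known σ. The test rejects when z exceeds the critical value, so power is the probability of that event when the true mean is shifted away from the null. The values σ = 8 and n = 50 below are assumptions for illustration, and `power_right_tailed` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(mu_0, mu_true, sigma, n, alpha=0.05):
    """P(reject H0: mu <= mu_0) when the true mean is mu_true (known sigma)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)          # rejection cutoff for z
    shift = (mu_true - mu_0) / (sigma / sqrt(n))    # true mean in standard-error units
    return 1 - std_normal.cdf(z_crit - shift)

# Assumed setting: H0 mean 35, sigma 8, n 50
for mu_true in (35, 36, 38, 40):
    power = power_right_tailed(35, mu_true, sigma=8, n=50)
    print(f"true mean {mu_true}: power = {power:.3f}, beta = {1 - power:.3f}")
```

When the true mean equals the null value the rejection probability is just α; as the true mean moves away from 35, power rises toward 1 and β falls.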
  • 29. Which error is bad?
      • False negative
          • Miss what could be important
          • Testing a metal whether it is gold or not
          • Are these samples going to be looked at again?
      • False positive
          • Waste resources following dead ends
          • Test whether a drug is deadly or not
  • 30. Confidence Intervals
      • Hypothesis testing focuses on where the sample mean is located
      • Confidence intervals focus on plausible values for the population mean
      • General formula for a 100(1-α)% CI for μ:
        $\left( \bar{X} - Z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$
      • Construct an interval around the point estimate
      • Look to see if the population/null mean is inside
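As a worked instance of the interval formula, the sketch below builds a 95% interval for the SBI example (sample mean 40, s = 7.86, n = 500) and checks whether the null mean of 35 falls inside. It assumes the same standard error used in Step 4, s/√(n-1), in place of the unknown σ/√n; the helper name `z_confidence_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, std_error, confidence=0.95):
    """Return (low, high) for a z-based confidence interval."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{1 - alpha/2}
    return sample_mean - z * std_error, sample_mean + z * std_error

se = 7.86 / sqrt(500 - 1)   # standard error from Step 4 (s replaces the unknown sigma)
low, high = z_confidence_interval(40.0, se)

print(f"95% CI for the mean age: ({low:.2f}, {high:.2f})")
print(f"null mean 35 inside the interval? {low <= 35 <= high}")
```

The null value lying outside the interval matches the decision to reject H0 in the two-sided test at the same α.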
  • 31. Significance Test for Mean
      • For large samples: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{N}}$
      • For small samples: $t = \frac{\bar{y} - \mu_0}{se}$, where $se = s / \sqrt{n}$
      • The sampling distribution for small samples is t
      • The curve of the t distribution varies with sample size (the smaller the size, the flatter the curve)
      • In using the t-table, we use "degrees of freedom" based on the sample size.
      • For a one-sample test, df = n - 1.
      • When looking at the table, find the t-value for the appropriate df = n - 1. This will be the cutoff point for your critical region.
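A small-sample t test follows the same pattern as the z test, with the cutoff read from a t-table. The ten ages below are made up for illustration (they are not from the course), and the critical value 2.262 is the standard table entry for a two-tailed test at α = 0.05 with df = 9.

```python
from math import sqrt
from statistics import mean, stdev

ages = [34, 38, 41, 36, 39, 42, 37, 40, 43, 35]   # hypothetical sample, n = 10
mu_0 = 35                                          # H0: mu = 35

n = len(ages)
se = stdev(ages) / sqrt(n)        # stdev divides by n - 1, so se = s / sqrt(n)
t = (mean(ages) - mu_0) / se
df = n - 1

T_CRIT = 2.262                    # t-table: two-tailed, alpha = 0.05, df = 9
print(f"t = {t:.3f}, df = {df}")
print(f"reject H0 at alpha = 0.05 (two-tailed)? {abs(t) > T_CRIT}")
```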
  • 32. Lab: Significance Test for Mean
      • It is known that the mean cholesterol level for the nation is 190. We test 100 only children and find that the sample average cholesterol level is 198. Suppose we know the population standard deviation σ = 15. Does that signify that only children have a higher average cholesterol level than the national average?
          • Given this sample, what are the reasonable values for the population mean?
      • 50 smokers were questioned about the number of hours they sleep each day. We want to test the hypothesis that smokers need less sleep than the general public, which needs an average of 7.7 hours of sleep. If the sample mean is 7.5 and the population standard deviation is 0.5, what can you conclude?
          • Given this sample, what are the 95% confidence limits for the population mean?
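One possible way to work the two lab questions is sketched below, treating both as z tests with known σ. The choice of a right-tailed test for cholesterol and a left-tailed test for sleep follows the wording of the questions; variable and function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def z_stat(sample_mean, mu_0, sigma, n):
    """z statistic for a sample mean with known population sigma."""
    return (sample_mean - mu_0) / (sigma / sqrt(n))

# Cholesterol: H0: mu = 190 vs Ha: mu > 190 (right-tailed)
z_chol = z_stat(198, 190, 15, 100)
p_chol = 1 - std_normal.cdf(z_chol)

# Sleep: H0: mu = 7.7 vs Ha: mu < 7.7 (left-tailed)
z_sleep = z_stat(7.5, 7.7, 0.5, 50)
p_sleep = std_normal.cdf(z_sleep)

# 95% confidence limits for the smokers' mean hours of sleep
margin = std_normal.inv_cdf(0.975) * 0.5 / sqrt(50)
ci_sleep = (7.5 - margin, 7.5 + margin)

print(f"cholesterol: z = {z_chol:.2f}, p = {p_chol:.2g}")
print(f"sleep:       z = {z_sleep:.2f}, p = {p_sleep:.4f}")
print(f"sleep 95% CI: ({ci_sleep[0]:.2f}, {ci_sleep[1]:.2f})")
```

In both cases the p-value is well below 0.05, so under these assumptions both null hypotheses would be rejected.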
  • 33. Test of Proportion
      • Assumptions:
          • Categorical variable
          • Randomization
          • Large sample (but two-sided ok for nearly all n)
      • Hypotheses:
          • Null hypothesis: H0: p = p0
          • Alternative hypothesis: Ha: p ≠ p0 (2-sided)
          • Ha: p > p0 or Ha: p < p0 (1-sided)
          • Set up hypotheses before getting the data
      • Test statistic: $z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$
  • 34. Lab: Test of Proportion
      • Suppose a coin toss turns up 12 heads out of 20 trials. At the .05 significance level, can one reject the null hypothesis that the coin toss is fair?
      • Suppose that you interview 1000 exiting voters about who they voted for as PM. Of the 1000 voters, 550 reported that they voted for Rahul Gandhi. Is there sufficient evidence to suggest that Rahul Gandhi will win the election at the .01 level?
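The two proportion labs can be worked with the z statistic for a proportion. The sketch below assumes a two-sided test for the coin (fair means p = 0.5) and a right-tailed test for the exit poll, where "win" is taken to mean more than 50% of the vote (an assumption that only fits a two-candidate race). With only 20 tosses the normal approximation is rough, so the coin result is indicative rather than exact.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def proportion_z(successes, n, p_0):
    """z statistic for a sample proportion against the null value p_0."""
    p_hat = successes / n
    return (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n)

# Coin: H0: p = 0.5 vs Ha: p != 0.5 (two-sided), alpha = 0.05
z_coin = proportion_z(12, 20, 0.5)
p_coin = 2 * (1 - std_normal.cdf(abs(z_coin)))

# Exit poll: H0: p = 0.5 vs Ha: p > 0.5 (right-tailed), alpha = 0.01
z_vote = proportion_z(550, 1000, 0.5)
p_vote = 1 - std_normal.cdf(z_vote)

print(f"coin: z = {z_coin:.3f}, p = {p_coin:.3f}, reject at .05? {p_coin < 0.05}")
print(f"poll: z = {z_vote:.3f}, p = {p_vote:.5f}, reject at .01? {p_vote < 0.01}")
```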
  • 35. Chi-square test for Independence
      • Is happiness independent of family income? A sample of 2955 families is studied

                          Happiness
        Income      Very    Pretty    Not too    Total
        Above        272       294         49      615
        Average      454       835        131     1420
        Below        185       527        208      920
  • 36. Chi-square statistic
      • Summarize the closeness of the observed counts {f_o} and expected counts {f_e} by
        $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
        where the sum is taken over all cells in the table.
      • When H0 is true, the sampling distribution of this statistic is approximately (for large n) the chi-squared probability distribution.
  • 37. Chi-square calculation
      • In happiness and family income:
        $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(272 - 189.6)^2}{189.6} + \ldots = 172.3$
      • df = (3 - 1)(3 - 1) = 4. P-value = 0.000 (rounded, often reported as P < 0.001). Chi-squared percentile values for various right-tail probabilities are in the table on text p. 594.
      • There is very strong evidence against H0: independence (if H0 were true, the probability would be < 0.001 of getting a $\chi^2$ test statistic this large or even larger).
      • For significance level α = 0.05 (or α = 0.01 or α = 0.001), we reject H0 and conclude that an association exists between happiness and income.
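The calculation for the happiness and income table can be reproduced from the observed counts alone, since each expected count is (row total × column total) / n. This is a minimal sketch in plain Python; `chi_square_independence` is an illustrative name, and 9.488 is the standard table value for α = 0.05 with df = 4.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

happiness = [
    [272, 294, 49],    # above-average income: very, pretty, not too happy
    [454, 835, 131],   # average income
    [185, 527, 208],   # below-average income
]
chi2, df, expected = chi_square_independence(happiness)

CRIT_05_DF4 = 9.488    # chi-square table: alpha = 0.05, df = 4
print(f"expected count for (Above, Very): {expected[0][0]:.1f}")
print(f"chi2 = {chi2:.1f}, df = {df}, reject H0 at 0.05? {chi2 > CRIT_05_DF4}")
```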
  • 38. Lab: Chi-square distribution
      • Suppose that 125 children are shown three television commercials for breakfast cereal and are asked to pick which they liked best. The results are shown in the table below. You would like to know if the choice of favorite commercial was related to whether the child was a boy or a girl, or if these two variables are independent.

                       A     B     C    Totals
        Boys          30    29    16        75
        Girls         12    33     5        50
        Totals        42    62    21       125

      • Suppose you conducted a drug trial on a group of animals and you hypothesized that the animals receiving the drug would show increased heart rates compared to those that did not receive the drug. You conduct the study and collect the following data.

                       Heart Rate Increased    No Heart Rate Increase    Total
        Treated                          36                        14       50
        Not treated                      30                        25       55
        Total                            66                        39      105
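Both lab tables can be run through the same chi-square calculation. The sketch below computes the statistic and degrees of freedom and leaves the comparison to the usual table values at α = 0.05 (5.991 for df = 2, 3.841 for df = 1). It applies no continuity correction, and because the chi-square test is non-directional it only checks for association in the drug trial, not specifically for an increase.

```python
def chi_square_stat(table):
    """Chi-square statistic and df for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

commercials = [[30, 29, 16],   # boys: commercials A, B, C
               [12, 33, 5]]    # girls
drug_trial = [[36, 14],        # treated: increased, not increased
              [30, 25]]        # not treated

chi2_tv, df_tv = chi_square_stat(commercials)
chi2_drug, df_drug = chi_square_stat(drug_trial)

print(f"commercials: chi2 = {chi2_tv:.3f}, df = {df_tv}, exceeds 5.991? {chi2_tv > 5.991}")
print(f"drug trial:  chi2 = {chi2_drug:.3f}, df = {df_drug}, exceeds 3.841? {chi2_drug > 3.841}")
```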
  • 39. Further Reading
      • Test of sample means for two populations
      • Test of sample proportions for two populations
      • Odds ratio for test of association
  • 40. Venkat Reddy Konasani, Manager at Trendwise Analytics, +91 9886 768879