
DATA SCIENCE Question Bank For ESA 20 123 .Odt

1. The document contains multiple choice questions related to statistics concepts such as mean, standard deviation, z-scores, hypothesis testing, correlation, and linear regression. 2. Questions assess understanding of key terms like type 1 error, null hypothesis, standard error, t-statistic, and coefficient of determination. 3. Several questions provide examples to test comprehension of applying statistical tests and interpreting their results in contexts like measuring effects of teaching methods and diet on test scores and blood sugar.

Uploaded by

Ferekkan

1. Five numbers are given: (5, 10, 15, 5, 15). Now, what would be the sum of deviations of individual data points from their mean?

A) 10

B) 25

C) 50

D) 0

E) None of the above
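
A quick numerical check of question 1, as a minimal sketch (the variable names are illustrative and not part of the question bank): deviations from the mean cancel out, so their sum should come out as zero.

```python
# Deviations from the mean for the five numbers in question 1.
data = [5, 10, 15, 5, 15]
mean = sum(data) / len(data)             # 50 / 5 = 10.0
deviations = [x - mean for x in data]    # [-5.0, 0.0, 5.0, -5.0, 5.0]
total_deviation = sum(deviations)
print(total_deviation)                   # 0.0
```

The same cancellation holds for any dataset, which is why squared deviations are used when measuring spread.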

2. A test is administered annually. The test has a mean score of 150 and a standard deviation of 20. If Ravi’s z-score is 1.50, what was his score on the test?

A) 180
B) 130
C) 30
D) 150
E) None of the above
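
The arithmetic behind question 2 can be sketched as follows, assuming the usual definition z = (score - mean) / standard deviation, rearranged to solve for the score. The variable names are illustrative only.

```python
# Recover a raw score from a z-score: score = mean + z * sd.
mean_score = 150
std_dev = 20
z = 1.50
raw_score = mean_score + z * std_dev
print(raw_score)   # 180.0
```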

3. If the variance of a dataset is correctly computed with the formula using (n – 1) in the denominator, which
of the following options is true?

A) Dataset is a sample
B) Dataset is a population
C) Dataset could be either a sample or a population
D) Dataset is from a census
E) None of the above

4. Standard deviation is robust to outliers?

A) True

B) False

The formula for standard deviation is built from each value's squared distance from the mean, so a very high or a very low
value increases the standard deviation because it lies far from the mean. Hence outliers affect the standard deviation.
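
To see the effect numerically, here is a small sketch using made-up scores (the data are hypothetical, chosen only for illustration). It compares the population standard deviation with and without a single extreme value.

```python
import statistics

scores = [10, 12, 11, 13, 12]       # made-up values clustered together
with_outlier = scores + [100]       # the same data plus one extreme value

sd_plain = statistics.pstdev(scores)
sd_outlier = statistics.pstdev(with_outlier)
print(sd_plain, sd_outlier)         # the second value is far larger
```

One extreme point is enough to inflate the standard deviation many times over, which is what "not robust" means here.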

5. Studies show that listening to music while studying can improve your memory. To demonstrate this, a
researcher obtains a sample of 36 college students and gives them a standard memory test while they listen
to some background music. Under normal circumstances (without music), the mean score obtained was 25
and the standard deviation was 6. The mean score for the sample after the experiment (i.e., with music) is 28.

What is the null hypothesis in this case?

A) Listening to music while studying will not impact memory.


B) Listening to music while studying may worsen memory.
C) Listening to music while studying may improve memory.
D) Listening to music while studying will not improve memory but can make it worse.

NOTE:

MCQ on Data Science. Compiled by Prof. A.B. Chowdhury, HOD, CA, TIU, W.B., India
The null hypothesis is generally the assumed statement that there is no relationship in the measured phenomena.
Here the null hypothesis would be that there is no relationship between listening to music and improvement in
memory.

What would be the Type I error?

A) Concluding that listening to music while studying improves memory, and it’s right.
B) Concluding that listening to music while studying improves memory when it actually doesn’t.
C) Concluding that listening to music while studying does not improve memory but it does.

6. Type 1 error means that we reject the null hypothesis when it is actually true. Here the null hypothesis is that
music does not improve memory. Type 1 error would be that we reject it and say that music does improve
memory when it actually doesn’t.

After performing the Z-test, what can we conclude ____ ?

A) Listening to music does not improve memory.

B) Listening to music significantly improves memory at p

C) The information is insufficient for any conclusion.

D) None of the above

7. Let’s perform the Z test on the given case. We know that the null hypothesis is that listening to music does
not improve memory.

Alternate hypothesis is that listening to music does improve memory.

In this case the standard error is σ/√n = 6/√36 = 1.

The Z score for a sample mean of 28 from this population is (28 - 25)/1 = 3.

Z critical value for α = 0.05 (one tailed) would be 1.65 as seen from the z table.

Therefore since the Z value observed is greater than the Z critical value, we can reject the null hypothesis and
say that listening to music does improve the memory with 95% confidence.
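
The Z-test described above can be sketched in a few lines, assuming a one-tailed test at α = 0.05 and a known population standard deviation. The variable names are illustrative; the critical value is computed from the standard normal distribution rather than read from a table.

```python
import math
from statistics import NormalDist

pop_mean, pop_sd = 25, 6       # scores without music
n, sample_mean = 36, 28        # sample tested with music

standard_error = pop_sd / math.sqrt(n)             # 6 / 6 = 1.0
z = (sample_mean - pop_mean) / standard_error      # 3.0
z_critical = NormalDist().inv_cdf(0.95)            # about 1.645 (one-tailed)
p_value = 1 - NormalDist().cdf(z)                  # one-tailed p-value

reject_null = z > z_critical
print(standard_error, z, round(z_critical, 3), reject_null)
```

Since the observed Z exceeds the critical value, the sketch reaches the same conclusion as the note: the null hypothesis is rejected.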

8. A researcher concludes from his analysis that a placebo cures AIDS. What type of error is he making?

A) Type 1 error

B) Type 2 error

C) None of these. The researcher is not making an error.

D) Cannot be determined

By definition, a type 1 error is rejecting the null hypothesis when it is actually true, and a type 2 error is accepting
the null hypothesis when it is actually false. In this case, to define the error we first need to define the null and
alternate hypotheses.

9. What happens to the confidence interval when we introduce some outliers to the data?

A) Confidence interval is robust to outliers

B) Confidence interval will increase with the introduction of outliers.

C) Confidence interval will decrease with the introduction of outliers.

D) We cannot determine the confidence interval in this case.

For questions 10-12, consider the following case:

A medical doctor wants to reduce the blood sugar level of all his patients by altering their diet. He finds that the
mean sugar level of all patients is 180 with a standard deviation of 18. Nine of his patients start dieting and
the mean of the sample is observed to be 175. Now, he is considering recommending that all his patients go on a
diet.

Note: He calculates 99% confidence interval.

10. What is the standard error of the mean?

A) 9
B) 6
C) 7.5
D) 18

The standard error of the mean is the standard deviation divided by the square root of the number of values, i.e.

Standard error = σ/√n = 18/√9 = 6

11. What is the probability of getting a mean of 175 or less after all the patients start dieting?

A) 20%
B) 25%
C) 15%
D) 12%
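
Questions 10 and 11 can be worked through together in a short sketch. It assumes the sample mean is normally distributed around the population mean of 180 with the standard error from question 10. The variable names are illustrative only.

```python
import math
from statistics import NormalDist

pop_mean, pop_sd, n = 180, 18, 9
sample_mean = 175

standard_error = pop_sd / math.sqrt(n)           # 18 / 3 = 6.0
z = (sample_mean - pop_mean) / standard_error    # about -0.83
p_at_most_175 = NormalDist().cdf(z)              # about 0.20
print(standard_error, round(z, 2), round(p_at_most_175, 2))
```

A probability of roughly 20% of seeing a mean this low by chance is far above the 1% implied by a 99% confidence level.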

12. Which of the following statements is correct?

A) The doctor has valid evidence that dieting reduces blood sugar level.

B) The doctor does not have enough evidence that dieting reduces blood sugar level.
C) If the doctor makes all future patients diet in a similar way, the mean blood pressure will fall below 160.

Question Context 13-15

A researcher is trying to examine the effects of two different teaching methods. He divides 20 students into
two groups of 10 each. For group 1, the teaching method uses fun examples, whereas for group 2 the
teaching method uses software to help students learn. After a 20-minute lecture for both groups, a test is
conducted for all the students.

We want to calculate if there is a significant difference in the scores of both the groups.

It is given that:

● Alpha=0.05, two tailed.


● Mean test score for group 1 = 10
● Mean test score for group 2 = 7
● Standard error = 0.94

13) What is the value of t-statistic?

A) 3.191
B) 3.395
C) Cannot be determined.
D) None of the above

The t statistic here is simply the difference between the group means divided by the standard error:

t = (10-7)/0.94 = 3.191

14) Is there a significant difference in the scores of the two groups?

A) Yes
B) No

15) What percentage of variability in scores is explained by the method of teaching?

A) 36.13
B) 45.21
C) 40.33
D) 32.97
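
Questions 13 to 15 can be sketched together. Two things here are assumptions not stated in the question bank: the degrees of freedom are taken as n1 + n2 - 2 (an independent two-sample t-test), and the share of explained variability is computed with the standard identity r² = t² / (t² + df). The critical value 2.101 is the usual t-table entry for df = 18, two-tailed, α = 0.05.

```python
mean_1, mean_2 = 10, 7
standard_error = 0.94
n_1 = n_2 = 10

t_stat = (mean_1 - mean_2) / standard_error      # about 3.19
df = n_1 + n_2 - 2                                # 18
t_critical = 2.101                                # t-table value (assumed, see above)
significant = abs(t_stat) > t_critical

# Share of variability in scores explained by teaching method.
r_squared = t_stat ** 2 / (t_stat ** 2 + df)      # about 0.36
print(round(t_stat, 3), significant, round(100 * r_squared, 1))
```

The small gap between this result and the 36.13 in the options comes from rounding t to 3.191 before squaring.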

16. The correlation between two variables (Var1 and Var2) is 0.65. Now, after adding 2 to all the values of
Var1, the correlation coefficient will _______?

A) Increase
B) Decrease
C) None of the above
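
A small experiment illustrates what happens to the correlation when a constant is added. The data below are made up, and `pearson_r` is a hypothetical helper written for this sketch. Adding a constant shifts the mean by the same amount, so every deviation from the mean is unchanged.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

var1 = [1, 2, 3, 4, 5, 6]     # made-up data
var2 = [2, 1, 4, 3, 7, 5]

r_before = pearson_r(var1, var2)
r_after = pearson_r([x + 2 for x in var1], var2)   # add 2 to every Var1 value
print(r_before, r_after)                           # the two values agree
```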

16. It is observed that there is a very high correlation between math test scores and amount of physical
exercise done by a student on the test day. What can you infer from this?

1. High correlation implies that after exercise the test scores are high.
2. Correlation does not imply causation.
3. Correlation measures the strength of linear relationship between amount of exercise and test scores.

A) Only 1
B) 1 and 3
C) 2 and 3
D) All the statements are true

17. If the correlation coefficient (r) between scores in a math test and amount of physical exercise by a
student is 0.86, what percentage of variability in math test is explained by the amount of exercise?

A) 86%
B) 74%
C) 14%
D) 26%
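
The calculation for question 17 is a one-liner, sketched here on the assumption of a simple linear regression with one predictor, where the explained share of variability is the square of r.

```python
r = 0.86
r_squared = r ** 2                         # about 0.7396
explained_pct = round(100 * r_squared)     # 74
unexplained_pct = 100 - explained_pct      # 26
print(explained_pct, unexplained_pct)
```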

18. Consider a regression line y=ax+b, where a is the slope and b is the intercept. If we know the value of the
slope then by using which option can we always find the value of the intercept?

A) Put the value (0,0) in the regression line False

B) Put any value from the points used to fit the regression line and compute the value of b False

C) Put the mean values of x & y in the equation along with the value a to get b True

D) None of the above can be used False
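
The reason the means can always be used is that a least squares line with an intercept passes through the point (mean of x, mean of y). The sketch below uses made-up data to show this: it computes the slope, derives b from the means, and prints both.

```python
xs = [1, 2, 3, 4, 5]                  # made-up data
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares slope: a = Sxy / Sxx.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
a = s_xy / s_xx

# The fitted line passes through (x_bar, y_bar), so b follows from a and the means.
b = y_bar - a * x_bar
print(a, b)
```

Here b is not zero, so (0,0) does not lie on the line, and individual data points generally do not lie on it either.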

19. What happens when we introduce more variables to a linear regression model?

A) The r squared value may increase or remain constant, the adjusted r squared may increase or decrease.

B) The r squared may increase or decrease while the adjusted r squared always increases.

C) Both r square and adjusted r square always increase on the introduction of new variables in the model.

D) Both might increase or decrease depending on the variables introduced.

20. In univariate linear least squares regression, relationship between correlation coefficient and coefficient
of determination is ______ ?

A) Both are unrelated False

B) The coefficient of determination is the coefficient of correlation squared True

C) The coefficient of determination is the square root of the coefficient of correlation False

D) Both are the same False

21. What is the relationship between significance level and confidence level?

A) Significance level = Confidence level


B) Significance level = 1- Confidence level
C) Significance level = 1/Confidence level
D) Significance level = sqrt (1 – Confidence level)

22. Suppose the coefficient of determination is computed to be 0.39 in a problem involving one independent variable
and one dependent variable. This result means that

A. The relationship between the two variables is negative

B. The correlation coefficient is 0.39 also

C. 39% of the total variation is explained by the independent variable

D. 39% of the total variation is explained by the dependent variable

23. What test statistic is used for a global test of significance?

A. Z-test

B. Z-test

C. Chi-square test

D. F-test

24. A coefficient of correlation computed to be -0.95 means that

A. The relationship between the two variables is weak

B. The relationship between the two variables is strong and positive

C. The relationship between the two variables is strong but negative

D. The correlation coefficient cannot have this value

25. If “time” is used as the independent variable in a simple linear regression analysis, then which of the
following assumptions could be violated?

A. There is a linear relationship between the independent and dependent variables

B. The residual variation is the same for all fitted values of Y

C. The residuals are normally distributed

D. Successive observations of the dependent variable are uncorrelated

26. In multiple regression, when the global test of significance is rejected, we can conclude that

A. All of the net sample regression coefficients are equal to zero

B. All of the sample regression coefficients are not equal to zero

C. At least one sample regression coefficient is not equal to zero

D. The regression equation intersects the Y-axis at zero.

27. Multicollinearity exists when

A. Independent variables are correlated less than -0.70 or more than 0.70

B. An independent variable is strongly correlated with a dependent variable


C. There is only one independent variable

D. The relationship between the dependent and independent variable is non-linear

28. The strength (degree) of the correlation between a set of independent variables X and a dependent
variable Y is measured by
A. Coefficient of Correlation
B. Coefficient of Determination
C. Standard error of estimate
D. All of the mentioned

29. The estimate of β in the regression equation Y = α + βX + e by the method of least squares is:
A. Biased
B. Unbiased
C. Consistent
D. Efficient

30. An investigator reports that the arithmetic mean of two regression coefficients of a regression line is 0.7 and
the correlation coefficient is 0.75. Are the results

A. Valid

B. Invalid

C. Inconclusive

D. None of mentioned options

31. The property that the average of the two regression coefficients is always greater than or equal to the correlation
coefficient is called:

A. Fundamental property

B. Signature property

C. Magnitude property

D. Mean property
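
The property in questions 30 and 31 follows from the fact that the product of the two regression coefficients equals r², so their arithmetic mean cannot be smaller than their geometric mean, which is |r|. The sketch below checks the figures reported in question 30 and then illustrates the property with hypothetical standard deviations `s_x` and `s_y` (these values are assumptions made only for the example).

```python
r = 0.75
mean_of_coefficients = 0.7

# The mean of the two regression coefficients must be at least |r|.
results_valid = mean_of_coefficients >= abs(r)
print(results_valid)   # False

# Illustration with hypothetical standard deviations.
s_x, s_y = 2.0, 3.0
b_yx = r * s_y / s_x          # regression coefficient of y on x
b_xy = r * s_x / s_y          # regression coefficient of x on y
print(b_yx * b_xy, (b_yx + b_xy) / 2)
```

In the illustration the product of the coefficients equals r² and their mean is above r, as the property requires.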

32. The lines of regression intersect at the point

A. (X, Y)
B. (X̄, Ȳ)
C. (0, 0)

D. (1, 1)

33. Regression coefficient is independent of

A. Origin
B. Scale

C. Both origin and scale

D. Neither origin nor scale

34. If ρ=0, the lines of regression are:


A. Coincident
B. Parallel
C. None of the mentioned options
D. Perpendicular to each other

35. If the two lines of regression are perpendicular to each other, the correlation coefficient r is:
A. 0
B. -1
C. 1
D. Nothing can be said

36. If ρ is the correlation coefficient, the quantity √(1−ρ²) is termed as


A. Coefficient of determination
B. Coefficient of Non-determination
C. Coefficient of Alienation
D. All of the mentioned options

37. The correlation coefficient is used to determine:


A. A specific value of the y-variable given a specific value of the x-variable
B. A specific value of the x-variable given a specific value of the y-variable
C. The strength of the relationship between the x and y variables
D. None of these

38. If there is a very strong correlation between two variables then the correlation coefficient must be
A. any value larger than 1
B. much smaller than 0, if the correlation is negative
C. much larger than 0, regardless of whether the correlation is negative or positive
D. None of these alternatives is correct.

39. In regression, the equation that describes how the response variable (y) is related to the
explanatory variable (x) is:
A. the correlation model
B. the regression model
C. used to compute the correlation coefficient
D. None of these alternatives is correct.

40. The relationship between number of beers consumed (x) and blood alcohol content (y) was studied in
16 male college students by using least squares regression. The following regression equation was

obtained from this study:

ŷ = -0.0127 + 0.0180x

The above equation implies that:


A. each beer consumed increases blood alcohol by 1.27%
B. on average it takes 1.8 beers to increase blood alcohol content by 1%
C. each beer consumed increases blood alcohol by an average amount of 1.8%
D. each beer consumed increases blood alcohol by exactly 0.018

41. Regression modeling is a statistical framework for developing a mathematical equation that describes how
A. one explanatory and one or more response variables are related
B. several explanatory and several response variables are related
C. one response and one or more explanatory variables are related
D. All of these are correct.

42. In regression analysis, the variable that is being predicted is the


A. response, or dependent, variable
B. independent variable
C. intervening variable
D. is usually x

43. Regression analysis was applied to return rates of sparrowhawk colonies. Regression analysis was used to study the
relationship between return rate (x: % of birds that return to the colony in a given year) and immigration rate (y: % of
new adults that join the colony per year). The following regression equation was obtained.

ŷ = 31.9 – 0.34x

Based on the above estimated regression equation, if the return rate were to decrease by 10% the rate of
immigration to the colony would:
A. increase by 34%
B. increase by 3.4%
C. decrease by 0.34%
D. decrease by 3.4%
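
The effect of a change in x on the prediction is just the slope times the change. A minimal sketch for question 43, treating the "10%" decrease as 10 percentage points of return rate (an interpretation assumed here); the function name is illustrative.

```python
intercept, slope = 31.9, -0.34

def predicted_immigration(return_rate):
    """Predicted immigration rate (%) for a given return rate (%)."""
    return intercept + slope * return_rate

change_in_x = -10                        # return rate drops by 10 points
change_in_y = slope * change_in_x        # about +3.4
print(round(change_in_y, 1))
```

Because the slope is negative, a decrease in the return rate corresponds to an increase in predicted immigration.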

44. In least squares regression, which of the following is not a required assumption about the error term?
A. The expected value of the error term is one.
B. The variance of the error term is the same for all values of x.
C. The values of the error term are independent.
D. The error term is normally distributed.

45. Larger values of r² (R²) imply that the observations are more closely grouped about the
A. average value of the independent variables
B. average value of the dependent variable
C. least squares line
D. origin

46. In a regression analysis if r² = 1, then


A. SSE must also be equal to one
B. SSE must be equal to zero
C. SSE can be any positive value
D. SSE must be negative

47. The coefficient of correlation


A. is the square of the coefficient of determination

B. is the square root of the coefficient of determination
C. is the same as r-square
D. can never be negative

48. In regression analysis, the variable that is used to explain the change in the outcome of an experiment, or some
natural process, is called
A. the x-variable
B. the independent variable
C. the predictor variable
D. the explanatory variable
E. all of the above (a-d) are correct
F. none are correct

49. In the case of an algebraic model for a straight line, if a value for the x variable is specified, then
A. the exact value of the response variable can be computed
B. the computed response to the independent value will always give a minimal residual
C. the computed value of y will always be the best estimate of the mean response
D. None of these alternatives is correct.

50. A regression analysis between sales (in $1000) and price (in dollars) resulted in the following equation:

Ŷ = 50,000 - 8X

The above equation implies that an


A. increase of $1 in price is associated with a decrease of $8 in sales
B. increase of $8 in price is associated with an increase of $8,000 in sales
C. increase of $1 in price is associated with a decrease of $42,000 in sales
D. increase of $1 in price is associated with a decrease of $8000 in sales
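
The point of question 50 is the unit of the response variable: sales are recorded in thousands of dollars, so the slope has to be scaled before it is read in dollars. A short sketch, with illustrative variable names:

```python
slope_in_thousands = -8      # change in sales (in $1000) per $1 increase in price
price_increase = 1           # dollars

change_in_sales_dollars = slope_in_thousands * price_increase * 1000
print(change_in_sales_dollars)   # -8000
```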

51. If the coefficient of determination is a positive value, then the regression equation
A. must have a positive slope
B. must have a negative slope
C. could have either a positive or a negative slope
D. must have a positive y intercept

52. If two variables, x and y, have a very strong linear relationship, then
A. there is evidence that x causes a change in y
B. there is evidence that y causes a change in x
C. there might not be any causal relationship between x and y
D. None of these alternatives is correct.

53. If the coefficient of determination is equal to 1, then the correlation coefficient


A. must also be equal to 1
B. can be either -1 or +1
C. can be any value between -1 to +1
D. must be -1

54. In regression analysis, if the independent variable is measured in kilograms, the dependent variable
A. must also be in kilograms
B. must be in some unit of weight
C. cannot be in kilograms
D. can be in any units

55. The relationship between number of beers consumed (x) and blood alcohol content (y) was studied in 16 male college
students by using least squares regression. The following regression equation was obtained from this study:

ŷ = -0.0127 + 0.0180x

Suppose that the legal limit to drive is a blood alcohol content of 0.08. If Ricky consumed 5 beers the model would
predict that he would be:
A. 0.09 above the legal limit
B. 0.0027 below the legal limit
C. 0.0027 above the legal limit
D. 0.0733 above the legal limit
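
Plugging x = 5 into the fitted equation gives the prediction for question 55. The sketch below does this and compares the result with the legal limit; the function name is illustrative.

```python
def predicted_bac(beers):
    """Blood alcohol content predicted by the fitted line."""
    return -0.0127 + 0.0180 * beers

legal_limit = 0.08
bac = predicted_bac(5)             # about 0.0773
margin = legal_limit - bac         # positive means below the limit
print(round(bac, 4), round(margin, 4))
```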

56. If the correlation coefficient is 0.8, the percentage of variation in the response variable explained by the variation in the
explanatory variable is
a. 0.80%
b. 80%
c. 0.64%
d. 64%
57. If the correlation coefficient is a positive value, then the slope of the regression line
A. must also be positive
B. can be either negative or positive
C. can be zero
D. cannot be zero

58. If the coefficient of determination is 0.81, the correlation coefficient

A. is 0.6561
B. could be either + 0.9 or - 0.9
C. must be positive
D. must be negative

59. A fitted least squares regression line


A. may be used to predict a value of y if the corresponding x value is given
B. is evidence for a cause-effect relationship between x and y
C. can only be computed if a strong linear relationship exists between x and y
D. None of these alternatives is correct.

60. Regression analysis was applied between $ sales (y) and $ advertising (x) across all the branches of a major
international corporation. The following regression function was obtained.

ŷ = 5000 + 7.25x

If the advertising budgets of two branches of the corporation differ by $30,000, then what will be the predicted
difference in their sales?
A. $217,500
B. $222,500
C. $5000
D. $7.25

61. Suppose the correlation coefficient between height (as measured in feet) versus weight (as measured in
pounds) is 0.40. What is the correlation coefficient of height measured in inches versus weight measured in
ounces? [12 inches = one foot; 16 ounces = one pound]
A. 0.40
B. 0.30
C. 0.533
D. cannot be determined from information given

E. none of these
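
A unit conversion multiplies every value by a positive constant, which scales the covariance and the standard deviations by the same factors, so they cancel in the correlation. The sketch below checks this on made-up heights and weights; `pearson_r` is a hypothetical helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

height_ft = [5.0, 5.5, 5.8, 6.0, 6.2]      # made-up data
weight_lb = [120, 150, 145, 180, 170]

r_original = pearson_r(height_ft, weight_lb)
r_converted = pearson_r([h * 12 for h in height_ft],    # feet to inches
                        [w * 16 for w in weight_lb])    # pounds to ounces
print(r_original, r_converted)                          # the two values agree
```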

62. Suppose height is measured in feet and weight is measured in pounds, and the units of both variables are then
converted to metric (meters and kilograms). The impact on the slope is:
A. the sign of the slope will change
B. the magnitude of the slope will change
C. both a and b are correct
D. neither a nor b are correct

63. You have carried out a regression analysis; but, after thinking about the relationship between variables, you have
decided that you have to swap the explanatory and the response variables. After refitting the regression model to
the data you expect that:
A. the value of the correlation coefficient will change
B. the value of SSE will change
C. the value of the coefficient of determination will change
D. the sign of the slope will change
E. nothing changes

64. Suppose you use regression to predict the height of a woman’s current boyfriend by using her own height as the
explanatory variable. Height was measured in feet from a sample of 100 women undergraduates, and their
boyfriends at Techno India University. Now, suppose that the height of both the women and the men are
converted to centimeters. The impact of this conversion on the slope is:
A. the sign of the slope will change
B. the magnitude of the slope will change
C. both the options are correct
D. neither of the options is correct

65. You studied the impact of the dose of a new drug treatment for high blood pressure. You think that the drug might be
more effective in people with very high blood pressure. Because you expect a bigger change in those patients who start
the treatment with high blood pressure, you use regression to analyze the relationship between the initial blood
pressure of a patient (x) and the change in blood pressure after treatment with the new drug (y). If you find a very
strong positive association between these variables, then:
A. there is evidence that the higher the patient’s initial blood pressure, the bigger the impact of the new
drug.

B. there is evidence that the higher the patient’s initial blood pressure, the smaller the impact of the new
drug.
C. there is evidence for an association of some kind between the patient’s initial blood pressure and
the impact of the new drug on the patient’s blood pressure
D. none of these are correct, this is a case of regression fallacy

66. Discrete probability distribution depends on the properties of ___________


A. data
B. machine
C. discrete variables
D. probability function

67. If the outcomes of a discrete random variable follow a Poisson distribution, then which of the following is true?
A. The mean equals the variance.
B. The mean equals the standard deviation.
C. The median equals the variance.
D. The median equals the standard deviation

68. Which of the following is not a characteristic of a binomial probability distribution?

A. Each trial has a finite number of possible outcomes.
B. There is a fixed number of identical trials.
C. The process must be consistent in generating successes and failures
D. The trials must be independent of each other.

69. The number of arrivals of delivery trucks per hour at a loading station is an example of which of the
following processes?
A. Binomial

B. Uniform
C. Poisson
D. Normal

70. Discrete probability distribution depends on the properties of ___________


A. data
B. machine
C. discrete variables
D. probability function

71. Which of the following is not a true statement about the binomial probability distribution?
A. Each outcome is independent of each other.
B. Each outcome can be classified as either success or failure.
C. The probability of success must be constant from trial to trial.
D. The random variable of interest is continuous.

72. A random variable x has a binomial distribution with n=4 and p=1/6. What is the probability that x is 1?

A. 0.3458
B. 0.4158
C. 0.4358
D. 0.3858
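
Question 72 can be checked directly from the binomial probability mass function, P(x = k) = C(n, k) p^k (1 - p)^(n - k). The helper name `binomial_pmf` is introduced here only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

prob = binomial_pmf(1, 4, 1 / 6)
print(round(prob, 4))   # 0.3858
```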

73. A random variable x has a binomial distribution with n=64 and p=0.65. What is the probability that x is 47 or less?

A. 0.9417
B. 0.9717
C. 0.8817
D. 0.9017

74. A random variable x has a binomial distribution with n=100 and p=0.35. What is the probability x falls in the range from 26
to 34, inclusive?

A. 0.3813
B. 0.5413

C. 0.4413
D. 0.4913

75. A random variable x has a binomial distribution with n =28 and p=0.55. What is the probability that x will be greater than
18?

A. 0.1787
B. 0.1187
C. 0.2256
D. 0.0887

76. Suppose x is a Poisson-distributed random variable with an expected value of 5 occurrences per interval. What is p(x=3)?

A. 0.2004
B. 0.1404
C. 0.1704
D. 0.0904

77. Suppose x is a Poisson-distributed random variable with an expected value of 12 occurrences per interval. What is p(x<10)?

A. 0.2424
B. 0.2124
C. 0.2824
D. 0.2624
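
Questions 76 and 77 can be verified from the Poisson probability mass function, P(x = k) = e^(-λ) λ^k / k!. The sketch below defines two small helpers (names chosen for this example) and evaluates both cases. Note that p(x<10) for a discrete variable means p(x ≤ 9).

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

p_exactly_3 = poisson_pmf(3, 5)      # question 76, about 0.1404
p_below_10 = poisson_cdf(9, 12)      # question 77, about 0.2424
print(round(p_exactly_3, 4), round(p_below_10, 4))
```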

78. Suppose x is a Poisson-distributed random variable with an expected value of 55 occurrences per interval. What is
p(45<x<60)?

A. 0.6055
B. 0.6755
C. 0.6355
D. 0.6955

79. Suppose x is a Poisson-distributed random variable with an expected value of 105 occurrences per interval. What is
p(x>90)?

A. 0.9641
B. 0.8741
C. 0.8341
D. 0.9241

80. An urn has 10 marbles: 3 red, 7 black. If we draw a random sample of 4, what is the probability we will end up with 2 red
and 2 black?

A. 0.1787
B. 0.1187
C. 0.2256
D. 0.0887

81. Suppose 10 cards are drawn from a deck of 52 cards consisting of 13 hearts, 13 diamonds, 13 clubs, and 13 spades. What is
the probability that the hand of 10 cards will include 3 hearts, 3 diamonds, 2 clubs, and 2 spades?

A. 0.0315

B. 0.0515
C. 0.0815
D. 0.0215
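
The card question is a multivariate hypergeometric probability: count the hands with the required number of cards from each suit and divide by the number of all 10-card hands. A minimal sketch using `math.comb`, with illustrative variable names:

```python
from math import comb

# Ways to pick 3 hearts, 3 diamonds, 2 clubs and 2 spades (13 cards per suit).
favourable_hands = comb(13, 3) * comb(13, 3) * comb(13, 2) * comb(13, 2)
total_hands = comb(52, 10)

prob = favourable_hands / total_hands
print(round(prob, 4))   # 0.0315
```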

82. Which of the following is not an application of Business Analytics?


A. critical product analysis
B. improved customer service
C. up-selling opportunities
D. None of the mentioned ones
E. All of the mentioned ones.

113. Which of the following is not a business analytics tool?

A. Business intelligence reporting software


B. Statistical analysis tools
C. Big data platforms
D. None of the mentioned
E. All are BA tools
114. Roll-up implies
A. Consolidation
B. Drill-up
C. Summarization along the dimension
D. All of the mentioned
E. None of the mentioned.
115. Which of the following is not a part of activities in data blending?
A. It is a process of combining data
B. It procures data from multiple sources
C. It creates a functional actionable dataset
D. All of the mentioned
E. None of the mentioned
116. Data blending provides
A. Deeper intelligence
B. Accurate actionable data
C. Decision-making skills
D. All of the mentioned
E. None of the mentioned
117. Which of the following is not correct?

A. Datawatch Monarch is used for data blending


B. Joining is a type of Data Blending
C. Data analysis is very fast with Tableau
D. All of the mentioned
E. None of the mentioned
118. dir(path='.') in R shows character(0). It implies:
A. It is an invalid command
B. It is an invalid path
C. It is an invalid directory
D. It is the current directory

E. None of the mentioned
119. The term JSON is an acronym of

A. JavaScript Object Notation


B. JavaScript Object Narration
C. John Smith Object Notation
D. None of the mentioned.

120. ODBC

A. Is an API
B. permits application programmers to access SQL- based DBMSs
C. for R is a package named RODBC
D. Each of the mentioned one is correct
E. None of the mentioned one is correct

121. The incorrect option about DSN is:

A. It stands for Data Source Name


B. It is a connection to a specific database.
C. It stands for Data Science Nomenclature
D. It can be a ‘User DSN’ and a ‘System DSN’
E. All options are correct
122. To Connect to SQL Server, we need to know

A. IP address of SQL Server(I)


B. User name to connect to SQL Server(U)
C. Password for the user to connect(P)
D. Only the option (U)
E. All of the options (I), (U) and (P)
123. sqldf package can be used for

A. Writing SQL queries


B. Using SQLite database.
C. H2 Java database
D. PostgreSQL database
E. All of the mentioned
124. Which of the following is incorrect?

A. Mean-Mode=3(Mean-Median)
B. Harmonic mean≤Geometric mean ≤Arithmetic mean
C. Geometric Mean =
D. Harmonic mean≥Geometric mean ≤Arithmetic mean
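
The ordering of the three means in question 124 can be checked numerically. The sketch below uses made-up positive values and the standard library's `statistics` module; it illustrates the inequality for one dataset and is not a proof.

```python
import statistics

values = [2, 4, 8]                          # made-up positive values
am = statistics.mean(values)                # arithmetic mean, about 4.67
gm = statistics.geometric_mean(values)      # geometric mean, about 4.0
hm = statistics.harmonic_mean(values)       # harmonic mean, about 3.43
print(hm <= gm <= am)                       # True
```

The three means coincide only when all values are equal.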

125. Quartile deviation is defined by

A. Upper quartile – lower quartile


B. ½ × (Q3 – Q1)
C. IQR
D. None of the mentioned
E. All of the mentioned

126. A confidence interval consists of

A. Confidence level(C)
B. Statistic(S)
C. Margin of error(M)
D. All of (C ), (S) and (M)
E. None of (C ), (S) and (M)

127. The formula for the standard error of the mean is

A. σ/√n
B. σ
C. σ^2/n
D. None of the mentioned ones
128. Which one of the following is called the distribution of rare events?
A. Bernoulli distribution
B. Binomial distribution
C. Normal distribution
D. Poisson distribution
E. None of the mentioned
129. A standard normal deviate is a normally distributed random variable

A. With mean and sd


B. With mean 0 and sd =1
C. With mean and variance
D. None of the mentioned

130. The probability distribution of the sum of squared standard normal deviates is called

A. Chi square distribution


B. t distribution
C. F distribution
D. None of the mentioned
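
The idea behind question 130 can be explored by simulation. The sketch below, in base R, repeatedly draws k standard normal deviates, sums their squares, and compares the simulated mean and variance with k and 2k, the mean and variance of a chi-square distribution with k degrees of freedom. The names k, n_sims and sum_sq and the seed are illustrative assumptions, not part of the question.

```r
# Simulate the sum of k squared standard normal deviates (illustrative values).
set.seed(42)                     # arbitrary seed, only for reproducibility
k <- 3                           # number of deviates summed (assumed)
n_sims <- 10000                  # number of simulated sums (assumed)

# Each row holds k independent standard normal deviates.
z <- matrix(rnorm(k * n_sims), nrow = n_sims, ncol = k)
sum_sq <- rowSums(z^2)           # one sum of squares per row

# A chi-square variable with k degrees of freedom has mean k and variance 2k.
mean(sum_sq)
var(sum_sq)
```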
131.A random variable is a variable whose value is

A. A numerical outcome
B. The result of a random phenomenon
C. One of the values of a sample space
D. All of the mentioned ones
E. None of the mentioned ones

132. A probability distribution

A. Is a relationship between each possible outcome of a random variable and their probabilities
B. summarizes the relationship between possible values and their probability for a random variable
C. can have variable structure and type based on the properties of the random variable
D. All the mentioned points are correct
E. None of the mentioned points is correct

133. A random variable can be

A. Discrete (D)
B. Continuous(C )
C. Hybrid (H)
D. Both (D) and (C)
E. All of (D), (C ) and (H)
134. Which of the following is incorrect?
A. P.M.F. is related to a discrete R.V.
B. P.D.F. is related to a continuous R.V.
C. C.D.F. is related to a continuous R.V. only.
D. C.D.F. is related to both discrete and continuous R.V.
135. A Bernoulli distribution is related to

A. Binary outcome
B. Binomial distribution
C. Any one of the mentioned ones
D. Both of the mentioned ones

136. Which one of the following is not a Gaussian distribution?


A. Chi-Square Distribution
B. t distribution
C. F distribution
D. All of the mentioned ones
E. None of the mentioned ones

137. A binary random variable is a

A. Continuous random variable


B. Discrete random variable
C. Hybrid random variable
D. None of the mentioned

138. A random variable that has a finite or countably infinite set of possible values is a

A. Continuous Random Variable


B. Discrete Random Variable
C. None of the mentioned
D. Either of the mentioned

139. A random variable that has an interval for its set of possible values is a

A. Continuous Random Variable


B. Discrete Random Variable
C. None of the mentioned
D. Either of the mentioned

140. The function that defines the probability distribution for a discrete random variable is called

A. Probability density function


B. Probability mass function
C. Probability distribution function
D. None of the mentioned

141. A function that assigns a probability that a discrete random variable will have a value of less than or equal to a specific
discrete value is called

A. Cumulative distribution function
B. Discrete distribution function
C. Probability distribution function
D. None of the mentioned

142. A discrete probability distribution that covers a case where an event will have a binary outcome is called

A. Bernoulli distribution
B. Binomial distribution
C. Binary distribution
D. None of the mentioned

143. The single flip of a coin that may have a head (0) or a tail (1) outcome is an example of a

A. Binomial Trial
B. Bernoulli Trial
C. Poisson Trial
D. None of the mentioned

144. When there are exactly two mutually exclusive outcomes of a trial, we use

A. Bernoulli distribution
B. Binomial distribution
C. Uniform distribution
D. Poisson distribution
E. None of the mentioned

145. The distribution of rare events is:

A. Binomial distribution
B. Uniform distribution
C. Poisson distribution
D. Normal distribution
E. None of the mentioned

146. If a model is meant for a series of discrete events where the average time between events is known, but the exact timing of
events is random, then it is called

A. Binomial Process
B. Bernoulli Process
C. Poisson process
D. Normal process
E. None of the mentioned

147. When some discrete event occurs in a continuous, but finite interval of time or sample space in S, we say that

A. A Bernoulli process has happened


B. A binomial process has happened
C. A Poisson process has happened
D. None of the mentioned.

148. Customers calling a help centre are said to form a


A. Poisson process
B. Binomial process
C. Bernoulli process
D. Normal process

149. The radioactive decay in atoms is said to be a

A. Poisson process
B. Binomial process
C. Bernoulli process
D. Normal process

150. The distribution of the sum of squared standard normal deviates is called

A. Chi square distribution


B. F distribution
C. t distribution
D. None of the mentioned
151. The number of degrees of freedom of a probability distribution generally refers to

A. The number of independent observations in a sample


B. The number of population parameters that must be estimated from sample data
C. The number of independent observations in a sample minus the number of population parameters that must be
estimated from sample data.

D. None of the mentioned


152. The t distribution (Student’s t-distribution) is a probability distribution that is used to estimate population parameters

A. When the sample size is small


B. When the population variance is unknown.
C. When the sample size is small and/or when the population variance is unknown.
D. None of the mentioned.
153. The sampling distribution of a statistic (like a sample mean) will follow a normal distribution, as long as the sample size is

A. Sufficiently small
B. Sufficiently large
C. Insignificant
D. None of the mentioned.
154. We use the t statistic (also known as the t score) when
A. Sample sizes are small [case 1]
B. We do not know the standard deviation of the population [case 2]
C. Either case 1 or case 2 is true
D. None of the mentioned.
155. The F distribution takes
A. One parameter
B. Two parameters

C. More than two parameters
D. None of these
156. Let X be a vector of values and G be a vector of labels. Which of the following commands will correctly draw a colored
pie chart?
A. pie(X,labels=G,radius=0.7,clockwise=FALSE,col=cols,main="Presentation of BCA2,TIU,WB Results of ESA on R")
B. pie(labels=G,X,radius=0.7,clockwise=FALSE,col=cols,main="Presentation of BCA2,TIU,WB Results of ESA on R")
C. pie(labels=G,X,radius=0.7,clockwise=True,col=cols,main="Presentation of BCA2,TIU,WB Results of ESA on R")
D. pie(labels=G,X,clockwise=False,col=cols,main="Presentation of BCA2,TIU,WB Results of ESA on R")
E. None of the mentioned
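
For reference when reading question 156, here is a minimal sketch of a pie() call in base R. The data in X and G and the palette in cols are made-up illustrative values. Note that R spells its logical constants in upper case (TRUE and FALSE), and that named arguments are matched before positional ones.

```r
# Hypothetical data: counts per grade and their labels.
X <- c(12, 25, 8, 5)
G <- c("Grade A", "Grade B", "Grade C", "Grade D")
cols <- rainbow(length(X))       # one colour per slice

# Draw the chart; clockwise must be TRUE or FALSE (upper case in R).
pie(X, labels = G, radius = 0.7, clockwise = FALSE, col = cols,
    main = "Results by grade (illustrative data)")
```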

157. Let a represent the number of admitted students in four different streams and s represent the names of the streams
corresponding to the admission counts. Then, which of the following commands is appropriate for drawing a bar chart?
A. barplot(names.arg=s,a, xlab="Streams",ylab="No. of students got admitted",col="green",main="Student Strength in 2018",border="red")

B. barplot(a,names.arg=s,xlab="Streams",ylab="No. of students got admitted",col="green",main="Student Strength in 2018",border="red")
C. barplot(xlab="Streams",ylab="No. of students got admitted",col="green",main="Student Strength in 2018",border="red", a,names.arg=s)
D. None of the mentioned
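
A minimal sketch of a barplot() call in base R, for trying out the commands in question 157. The admission counts in a and the stream names in s are made-up values. barplot() returns the x-coordinates of the bar midpoints, which is captured here in the hypothetical variable bar_mid.

```r
# Hypothetical admission counts and the matching stream names.
a <- c(120, 85, 60, 40)
s <- c("BCA", "BBA", "B.Tech", "MCA")

# Named arguments are matched first, so the height vector can appear
# anywhere among them; here it is given first for readability.
bar_mid <- barplot(a, names.arg = s,
                   xlab = "Streams", ylab = "No. of students admitted",
                   col = "green", border = "red",
                   main = "Student Strength in 2018")
```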

158. A histogram differs from a bargraph in that

A. A histogram groups the values in continuous ranges whereas a bargraph groups in discrete values.

B. A histogram groups the values in discrete ranges whereas a bargraph groups in continuous values

C. The data type does not matter

D. None of the mentioned.

159. Which one of the following is not true about a scatter plot ?

A. It is also known as a correlation plot

B. It is a two-dimensional graphical presentation of data

C. It always shows a linear relationship

D. None of the mentioned

160. Which one of the following is not true about a line chart:

A. A line chart connects a series of data points with a straight line

B. Line charts are most often used to visualize data that changes over time

C. Line charts are usually used in identifying the trends in data.


D. plot() function is used to draw a line graph

E. None of the mentioned

161. The median of the data values is pointed out by which of the following graphical representation?

A. A line graph
B. A bar graph

C. A box and whisker plot

D. None of the mentioned

162. Which of the following is not true about a box plot?

A. Spread of all the data is represented on a boxplot by the horizontal distance between the smallest value and the largest
value, including any outliers.

B. The Interquartile range is represented by the width of the box

C. Spread of all the data is represented on a boxplot by the horizontal distance between the smallest value and the largest value,
excluding the outliers.

D. None of the mentioned

163. Text Mining is also known as

A. Natural Language Processing (NLP)

B. Text Analytics

C. A machine-supported analysis of text with a view to extract interesting and useful patterns and information

D. None of the mentioned

E. All of the mentioned

164. Which of the following is not a task in Text Preprocessing?

A. Building a term-document matrix

B. Exploring frequent terms and their associations


C. Cleaning the text

D. Creating a corpus of the text.

165. Which of the following packages is not required in text mining?

A. NLP

B. tm

C. wordcloud

D. None of the packages mentioned is required

E. All the mentioned packages are required.

166. A corpus is created in text mining by using which of the following expressions?

A. corpus(VectorSource(text_file_name))

B. Corpus(VectorSource(text_file_name))

C. Vectorsource(Corpus(text_file_name))

D. None of the mentioned

167. A corpus is a format for

A. storing textual data that is used throughout linguistics and text analysis

B. containing each document, along with some Meta attributes that help describe that document

C. making a database of the text files to be used in the data analysis.

D. None of the options about corpus is true

E. Each of the options about corpus is true.

168. Which of the following is not true about text cleaning?

A. Its purpose is to remove unnecessary special characters, white spaces and common stopwords, and then to convert the text to lower
case.
B. The tm_map() function is used for text cleaning
C. The tm_clean() function is used for text cleaning
D. When the text cleaning function is used, the warning messages tell us that the desired cleaning has been done.

E. None of the mentioned is true.

169. Which of the following is not true about the stemDocument() function?

A. It enables us to get to a word's root


B. The function either takes in a character vector and returns a character vector or takes in a PlainTextDocument and
returns a PlainTextDocument.
C. The SnowballC package supports the stemDocument process
D. The tm package provides the stemDocument() function
E. All mentioned points are true.

170. Which is not true about a Term Document Matrix/Document Term Matrix?

A. It is used to hold a Sparse matrix

B. The system shows the percentage of sparsity after the creation of a TDM/DTM.

C. The sparsity represents the proportion of entries that are zero in TDM/DTM

D. None of the mentioned is true.

E. All of the mentioned points are true.

171. Which of the following packages are required for sentiment analysis?

A. SentimentAnalysis

B. dplyr

C. zealot

D. All of the mentioned


E. None of the mentioned

172. For using the pipe operator ‘%>%’ in queries, we need to install

A. zealot package

B. dplyr

C. Both the mentioned packages

D. None of the mentioned ones.

173. Which of the following packages is used to pull out emotions from the Comments?

A. syuzhet

B. ggplot2

C. SentimentAnalysis

D. None of the mentioned ones

174. Which function is used for creating new variables?

A. mutate()

B. extract()

C. kable()

D. None of the mentioned ones.

175. The visual representation of text data is known as

A. Word cloud

B. Tag cloud

C. None of the mentioned names

D. Both of the Mentioned Names.

176. For data mining from the social media platform Facebook using R,

A. We need to have a Facebook account

B. We need to register for a Facebook developer account

C. Neither of the mentioned points

D. Either of the mentioned points

E. Both the mentioned points.

177. Which of the following packages has no role in data mining from the social media platform Facebook using R?

A. httpuv

B. Rfacebook

C. RCurl

D. httpuv and RCurl

E. All the mentioned points have a role.

178. Which of the following has no role in Data Science?

A. informatics

B. operations research

C. mathematics

D. None of the mentioned

E. All of the mentioned

179. A statement made about a population for testing purpose is called?


a) Statistic
b) Hypothesis
c) Level of Significance
d) Test-Statistic

180. The hypothesis that is tested for rejection considering it to be true is called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis

180. The probability of rejecting the Null Hypothesis when it is true is called the:
a) Level of Confidence
b) Level of Significance
c) Level of Margin
d) Level of Rejection

181. The point where the Null Hypothesis gets rejected is called ?
a) Significant Value
b) Rejection Value
c) Acceptance Value
d) Critical Value

182. If the Critical region is evenly distributed then the test is referred to as?
a) Two tailed
b) one tailed
c) Three tailed

d) Zero tailed

183. Which of the following is defined as the rule or formula to test a Null Hypothesis?
a) Test statistic
b) Population statistic
c) Variance statistic
d) Null statistic

184. Consider a hypothesis H0 where ϕ0 = 5 against H1 where ϕ1 > 5. The test is?
a) Right tailed
b) Left tailed
c) Center tailed
d) Cross tailed

185. Consider a hypothesis H0 where ϕ0 = 23 against H1 where ϕ1 < 23. The test is?
a) Right tailed
b) Left tailed
c) Center tailed
d) Cross tailed

186. Type 1 error occurs when?


a) We reject H0 if it is True
b) We reject H0 if it is False
c) We accept H0 if it is True
d) We accept H0 if it is False

187. The probability of Type 1 error is referred to as?


a) 1-α
b) β
c) α
d) 1-β

188. The Alternative Hypothesis is also called?


a) Composite hypothesis
b) Research Hypothesis
c) Simple Hypothesis
d) Null Hypothesis

189. Two types of errors associated with hypothesis testing are Type I and Type II. Type II error is committed when

A. We reject the null hypothesis whilst the alternative hypothesis is true

B. We reject a null hypothesis when it is true

C. We accept a null hypothesis when it is not true

D. None of the mentioned

190. One or two tail test will determine

A. If the two extreme values (min or max) of the sample need to be rejected

B. If the hypothesis has one or possible two conclusions

C. If the region of rejection is located in one or two tails of the distribution

190. By taking a level of significance of 5% it is the same as saying

A. We are 5% confident the results have not occurred by chance

B. We are 95% confident that the results have not occurred by chance
C. We are 95% confident that the results have occurred by chance

191. The level of significance can be viewed as the amount of risk that an analyst will accept when making a decision

A. True

B. False

C. Nothing can be stated

192. Parametric tests, unlike non-parametric tests, make certain assumptions about

A. The population size

B. The underlying distribution

C. The sample size

193. Rejection of the null hypothesis is a conclusive proof that the alternative hypothesis is

A. True

B. False

C. Neither

194. A 99% t-based confidence interval for the mean price for a gallon of gasoline (dollars) is calculated using a simple
random sample of gallon gasoline prices for 50 gas stations. Given that the 99% confidence interval is $3.32 < μ < $3.98,
what is the sample mean price for a gallon of gasoline (dollars)?

A. $0.33
B. $3.65
C. Not Enough Information; we would need to know the variation in the sample of gallon gasoline prices
D. Not Enough Information; we would need to know the variation in the population of gallon gasoline prices
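
A t-based confidence interval for a mean is symmetric about the sample mean, so the arithmetic behind question 194 can be sketched in base R as below. The variable names lower, upper, sample_mean and margin_of_error are illustrative.

```r
# Endpoints of the reported 99% confidence interval (dollars).
lower <- 3.32
upper <- 3.98

# The interval is sample mean +/- margin of error, so the sample mean is
# the midpoint and the margin of error is half the width.
sample_mean <- (lower + upper) / 2
margin_of_error <- (upper - lower) / 2

sample_mean
margin_of_error
```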

195. Green sea turtles have normally distributed weights, measured in kilograms, with a mean of 134.5 and a variance of 49.0. A
particular green sea turtle’s weight has a z-score of -2.4. What is the weight of this green sea turtle rounded to the nearest
whole number?

A. 17 kg
B. 151 kg
C. 118 kg
D. 252 kg
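
Question 195 inverts the z-score formula z = (x - μ)/σ to recover x. A minimal base R sketch of that arithmetic follows; the variable names are illustrative. Note that the question gives the variance, so the standard deviation is its square root.

```r
mu <- 134.5               # population mean weight (kg)
sigma <- sqrt(49.0)       # standard deviation = square root of the variance
z <- -2.4                 # z-score of the particular turtle

# Solve z = (x - mu) / sigma for x.
weight <- mu + z * sigma
round(weight)             # rounded to the nearest whole number
```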
196. Which of the following exam scores is better relative to other students enrolled in the course?

● A psychology exam grade of 85; the mean grade for the psychology exam is 92 with a
standard deviation of 3.5
● An economics exam grade of 67; the mean grade for the economics exam is 79 with a
standard deviation of 8
● A chemistry exam grade of 62; the mean grade for the chemistry exam is 62 with a
standard deviation of 5

A. The psychology exam score is relatively better

B. The economics exam score is relatively better

C. The chemistry exam score is relatively better

D. All of the exam scores are relatively equivalent
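
Scores from different exams can be compared by converting each to a z-score, z = (grade - mean)/sd. The base R sketch below does this for the three exams in question 196; the vector names are illustrative.

```r
# Grades, exam means and exam standard deviations, in the same order.
grade      <- c(psychology = 85, economics = 67, chemistry = 62)
mean_grade <- c(92, 79, 62)
sd_grade   <- c(3.5, 8, 5)

# Standardize each grade relative to its own exam.
z <- (grade - mean_grade) / sd_grade
z

# The largest z-score marks the best score relative to the other students.
best <- names(which.max(z))
best
```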

197. The statement “If there is sufficient evidence to reject a null hypothesis at the 10% significance level, then there is
sufficient evidence to reject it at the 5% significance level” is: Please select the best answer of those provided below.

A. Always True
B. Never True
C. Sometimes True; the p-value for the statistical test needs to be provided for a conclusion
D. Not Enough Information; this would depend on the type of statistical test used

198. A randomly selected sample of 1,000 college students was asked whether they had ever used the drug Ecstasy. Sixteen
percent (16% or 0.16) of the 1,000 students surveyed said they had. Which one of the following statements about the number
0.16 is correct?

A. It is a sample proportion.
B. It is a population proportion.
C. It is a margin of error.
D. It is a randomly chosen number.

199. In a random sample of 1000 students, p̂ = 0.80 (or 80%) were in favor of longer hours at the school library. The
standard error of p̂ (the sample proportion) is

A. .013
B. .160
C. .640
D. .800
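
The standard error of a sample proportion is sqrt(p̂(1 - p̂)/n). A minimal base R sketch with the numbers from question 199 follows; the variable names are illustrative.

```r
p_hat <- 0.80    # sample proportion
n <- 1000        # sample size

# Standard error of the sample proportion.
se_p_hat <- sqrt(p_hat * (1 - p_hat) / n)
round(se_p_hat, 3)
```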

200. For a random sample of 9 women, the average resting pulse rate is x̄ = 76 beats per minute, and the sample standard
deviation is s = 5. The standard error of the sample mean is

A. 0.557
B. 0.745
C. 1.667
D. 2.778
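
The standard error of a sample mean is s/√n. A minimal base R sketch with the values from question 200 follows; the variable names are illustrative.

```r
s <- 5     # sample standard deviation
n <- 9     # sample size

# Standard error of the sample mean.
se_mean <- s / sqrt(n)
round(se_mean, 3)
```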

2. Assume the cholesterol levels in a certain population have mean μ = 200 and standard deviation σ = 24.
The cholesterol levels for a random sample of n = 9 individuals are measured and the sample mean x̄ is determined.
What is the z-score for a sample mean x̄ = 180?
A. –3.75

B. –2.50
C. 0.83
D. 2.50
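
For a sample mean, the z-score uses the standard deviation of the sampling distribution, σ/√n, rather than σ itself. A minimal base R sketch of the cholesterol calculation above follows; the variable names are illustrative.

```r
mu <- 200        # population mean
sigma <- 24      # population standard deviation
n <- 9           # sample size
x_bar <- 180     # observed sample mean

# Standard deviation of the sample mean, then the z-score.
sd_x_bar <- sigma / sqrt(n)
z <- (x_bar - mu) / sd_x_bar
z
```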

201. In a past General Social Survey, a random sample of men and women answered the question “Are you a member of any
sports clubs?” Based on the sample data, 95% confidence intervals for the population proportion who would answer “yes” are
.13 to .19 for women and .247 to .33 for men. Based on these results, you can reasonably conclude that

A. At least 25% of American men and American women belong to sports clubs.
B. At least 16% of American women belong to sports clubs.
C. There is a difference between the proportions of American men and American women who belong to sports
clubs.
D. There is no conclusive evidence of a gender difference in the proportion belonging to sports clubs.

202. Suppose a 95% confidence interval for the proportion of Americans who exercise regularly is 0.29 to 0.37.
Which one of the following statements is FALSE?


A. It is reasonable to say that more than 25% of Americans exercise regularly.
B. It is reasonable to say that more than 40% of Americans exercise regularly.
C. The hypothesis that 33% of Americans exercise regularly cannot be rejected.
D. It is reasonable to say that fewer than 40% of Americans exercise regularly.

203.In hypothesis testing, a Type 2 error occurs when

A. The null hypothesis is not rejected when the null hypothesis is true.
B. The null hypothesis is rejected when the null hypothesis is true.
C. The null hypothesis is not rejected when the alternative hypothesis is true.
D. The null hypothesis is rejected when the alternative hypothesis is true.

204.Null and alternative hypotheses are statements about:

A. population parameters.
B. sample parameters.

C. sample statistics.

D. it depends - sometimes population parameters and sometimes sample statistics.

205. A hypothesis test is done in which the alternative hypothesis is that more than 10% of a population is left-handed. The
p-value for the test is calculated to be 0.25. Which statement is correct?

A. We can conclude that more than 10% of the population is left-handed.

B. We can conclude that more than 25% of the population is left-handed.
C. We can conclude that exactly 25% of the population is left-handed.
D. We cannot conclude that more than 10% of the population is left-handed.

206.Which of the following is NOT true about the standard error of a statistic?

A. The standard error measures, roughly, the average difference between the statistic
and the population parameter.
B. The standard error is the estimated standard deviation of the sampling distribution for the statistic.

C. The standard error can never be a negative number.


D. The standard error increases as the sample size(s) increases.

207. A prospective observational study on the relationship between sleep deprivation and heart disease was done by Ayas et al.
(Arch Intern Med 2003). Women who slept at most 5 hours a night were compared to women who slept for 8 hours a night
(reference group). After adjusting for potential confounding variables like smoking, a 95% confidence interval for the relative
risk of heart disease was (1.10, 1.92). Based on this confidence interval, a consistent conclusion would be

A. Sleep deprivation is associated with a modestly increased risk of heart disease.
B. Sleep deprivation is associated with a modestly decreased risk of heart disease.
C. There was no evidence of an association between sleep deprivation and heart disease.
D. Lack of sleep causes the risk of heart disease to increase by 10% to 92%.

208. Consider a random sample of 100 females and 100 males. Suppose 15 of the females are left-handed and 12 of the
males are left-handed. What is the estimated difference between population proportions of females and males who are
left-handed (females - males)? Select the choice with the correct notation and numerical value.

A. p1 - p2 = 3
B. p1 - p2 = 0.03
C. p̂1 - p̂2 = 3
D. p̂1 - p̂2 = 0.03
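
An estimated difference between two population proportions is computed from the sample proportions, which are conventionally written with hats. A minimal base R sketch with the counts from question 208 follows; the variable names are illustrative.

```r
# Left-handed counts and sample sizes for each group.
x_female <- 15; n_female <- 100
x_male   <- 12; n_male   <- 100

# Sample proportions (estimates of the population proportions).
p_hat_female <- x_female / n_female
p_hat_male   <- x_male / n_male

# Estimated difference, females minus males.
diff_hat <- p_hat_female - p_hat_male
diff_hat
```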

209. A result is called “statistically significant” whenever

A. The null hypothesis is true.

B. The alternative hypothesis is true.

C. The p-value is less or equal to the significance level.


D. The p-value is larger than the significance level.

210. The confidence level for a confidence interval for a mean is

A. the probability the procedure provides an interval that covers the sample mean.
B. the probability of making a Type 1 error if the interval is used to test a null hypothesis about the population mean.
C. the probability that individuals in the population have values that fall into the interval.
D. the probability the procedure provides an interval that covers the population mean.

For the next two questions: It is known that for right-handed people, the dominant (right) hand tends to be stronger. For left-
handed people who live in a world designed for right-handed people, the same may not be true. To test this, muscle strength was
measured on the right and left hands of a random sample of 15 left-handed men and the difference (left - right) was found. The
alternative hypothesis is one-sided (left hand stronger). The resulting t-statistic was 1.80.

211. This is an example of:

A. A two-sample t-test.
B. A paired t-test.
C. A pooled t-test.
D. An unpooled t-test.

212. Assuming the conditions are met, based on the t-statistic of 1.80, the appropriate conclusion for this test using α = .05 is: (Table
would be provided with exam.)

A. Df = 14, so p-value < .05 and the null hypothesis can be rejected.
B. Df = 14, so p-value > .05 and the null hypothesis cannot be rejected.

C. Df = 28, so p-value < .05 and the null hypothesis can be rejected.
D. Df = 28, so p-value > .05 and the null hypothesis cannot be rejected.
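
Where a t table is not to hand, the one-sided p-value for question 212 can be computed with pt() in base R. The sketch below assumes a paired design with 15 differences, so the degrees of freedom are 15 - 1; the variable names are illustrative.

```r
t_stat <- 1.80       # observed t-statistic
n_pairs <- 15        # number of paired differences
df <- n_pairs - 1    # degrees of freedom for a paired t-test
alpha <- 0.05        # significance level

# One-sided (upper-tail) p-value: P(T >= t_stat) under the null hypothesis.
p_value <- pt(t_stat, df = df, lower.tail = FALSE)
p_value

# Decision rule: reject the null hypothesis when the p-value is at most alpha.
reject_null <- p_value <= alpha
reject_null
```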

213. A test of H0: μ = 0 versus Ha: μ > 0 is conducted on the same population independently by two different researchers. They both
use the same sample size and the same value of α = 0.05. Which of the following will be the same for both researchers?

A. The p-value of the test.


B. The power of the test if the true μ = 6.
C. The value of the test statistic.
D. The decision about whether or not to reject the null hypothesis.

215.A test to screen for a serious but curable disease is similar to hypothesis testing, with a null hypothesis of no disease, and an
alternative hypothesis of disease. If the null hypothesis is rejected treatment will be given. Otherwise, it will not. Assuming the
treatment does not have serious side effects, in this scenario it is better to increase the probability of:

A. making a Type 1 error, providing treatment when it is not needed.


B. making a Type 1 error, not providing treatment when it is needed.
C. making a Type 2 error, providing treatment when it is not needed.
D. making a Type 2 error, not providing treatment when it is needed.
216. A random sample of 25 college males was obtained and each was asked to report their actual height and what they wished as
their ideal height. A 95% confidence interval for μd = average difference between their ideal and actual heights was 0.8" to 2.2".
Based on this interval, which one of the null hypotheses below (versus a two-sided alternative) can be rejected?

A. H0: μd = 0.5
B. H0: μd = 1.0

C. H0: μd = 1.5

D. H0: μd = 2.0

217.The average time in years to get an undergraduate degree in computer science was compared for men and women. Random
samples of 100 male computer science majors and 100 female computer science majors were taken. Choose the appropriate
parameter(s) for this situation.

A. One population proportion p.


B. Difference between two population proportions p1 - p2.
C. One population mean µ.

D. Difference between two population means µ1 - µ2.

218.If the word significant is used to describe a result in a news article reporting on a study,

A. the p-value for the test must have been very large.
B. the effect size must have been very large.
C. the sample size must have been very small.

D. it may be significant in the statistical sense, but not in the everyday sense.

219. A random sample of 5000 students were asked whether they prefer a 10 week quarter system or a 15 week semester system.
Of the 5000 students asked, 500 students responded. The results of this survey

A. can be generalized to the entire student body because the sampling was random.
B. can be generalized to the entire student body because the margin of error was 4.5%.
C. should not be generalized to the entire student body because the non-response rate was 90%.
D. should not be generalized to the entire student body because the margin of error was 4.5%.

220.In a report by ABC News, the headlines read “City Living Increases Men’s Death Risk” The headlines were based on a study of
3,617 adults who lived in the United States and were more than 25 years old. One researcher said, “Elevated levels of tumor deaths
suggest the influence of physical, chemical and biological exposures in urban areas… Living in cities also involves potentially stressful
levels of noise, sensory stimulation and overload, interpersonal relations and conflict, and vigilance against hazards ranging from crime
to accidents.” Is a conclusion that living in an urban environment causes an increased risk of death justified?

A. Yes, because the study was a randomized study.


B. Yes, because many of the men in the study were under stress.
C. No, because the study was a retrospective study.
D. No, because the study was an observational study.

221.A significance test based on a small sample may not produce a statistically significant result even if the true value differs
substantially from the null value. This type of result is known as

A. the significance level of the test.


B. the power of the study.
C. a Type 1 error.
D. a Type 2 error.
For the next two questions: An observational study found a statistically significant relationship between regular consumption of
tomato products (yes, no) and development of prostate cancer (yes, no), with lower risk for those consuming tomato products.

222.Which of the following is not a possible explanation for this finding?

A. Something in tomato products causes lower risk of prostate cancer.


B. There is a confounding variable that causes lower risk of prostate cancer, such as eating vegetables in general, that is
also related to eating tomato products.
C. A large number of food products were measured to test for a relationship, and tomato products happened to
show a relationship just by chance.
D. A large sample size was used, so even if there were no relationship, one would almost certainly be detected.

223.Which of the following is a valid conclusion from this finding?

A. Something in tomato products causes lower risk of prostate cancer.


B. Based on this study, the relative risk of prostate cancer, for those who do not consume tomato products regularly compared
with those who do, is greater than one.
C. If a new observational study were to be done using the same sample size and measuring the same variables, it would find the
same relationship.
D. Prostate cancer can be prevented by eating the right diet.

224.The best way to determine whether a statistically significant difference in two means is of practical importance is to

A. find a 95% confidence interval and notice the magnitude of the difference.
B. repeat the study with the same sample size and see if the difference is statistically significant again.
C. see if the p-value is extremely small.
D. see if the p-value is extremely large.

225.A large company examines the annual salaries for all of the men and women performing a certain job and finds that the means
and standard deviations are $32,120 and $3,240, respectively, for the men and $34,093 and $3,521, respectively, for the women. The
best way to determine if there is a difference in mean salaries for the population of men and women performing this job in this
company is

A. to compute a 95% confidence interval for the difference.


B. to subtract the two sample means.
C. to test the hypothesis that the population means are the same versus that they are different.
D. to test the hypothesis that the population means are the same versus that the mean for men is higher.

219. One problem with hypothesis testing is that a real effect may not be detected. This problem is most likely to
occur when
A. the effect is small and the sample size is small.
B. the effect is large and the sample size is small.
C. the effect is small and the sample size is large.
D. the effect is large and the sample size is large.

226. A Type I error occurs when we:

A. reject a false null hypothesis

B. reject a true null hypothesis

C. do not reject a false null hypothesis

D. do not reject a true null hypothesis

E. fail to make a decision regarding whether to reject a hypothesis or not

227.In a criminal trial, a Type I error is made when:


A. a guilty defendant is acquitted (set free)

B. an innocent person is convicted (sent to jail)

C. a guilty defendant is convicted

D. an innocent person is acquitted

E. no decision is made about whether to acquit or convict the defendant

228. A Type II error occurs when we:

a. reject a false null hypothesis

b. reject a true null hypothesis

c. do not reject a false null hypothesis

d. do not reject a true null hypothesis

e. fail to make a decision regarding whether to reject a hypothesis or not

229. If a hypothesis is rejected at the 0.025 level of significance, it:

A. must be rejected at any level

B. must be rejected at the 0.01 level

C. must not be rejected at the 0.01 level

D. must not be rejected at any other level

E. may or may not be rejected at the 0.01 level

230. In a criminal trial, a Type II error is made when:

A. a guilty defendant is acquitted (set free)

B. an innocent person is convicted (sent to jail)

C. a guilty defendant is convicted

D. an innocent person is acquitted

E. no decision is made about whether to acquit or convict the defendant

231. In a two-tail test for the population mean, if the null hypothesis is rejected when the alternative is true, then:

A. a Type I error is committed

B. a Type II error is committed

C. a correct decision is made

D. a one-tail test should be used instead of a two-tail test

E. it is unclear whether a correct or incorrect decision has been made

232. In a one-tail test for the population mean, if the null hypothesis is not rejected when the alternative hypothesis is true, then:


a. a Type I error is committed

b. a Type II error is committed

c. a correct decision is made

d. a two-tail test should be used instead of a one-tail test

e. it is unclear whether a correct or incorrect decision has been made

233. In a one-tail test for the population mean, if the null hypothesis is rejected when the alternative hypothesis is not true, then:

A. a Type I error is committed

B. a Type II error is committed

C. a correct decision is made

D. a two-tail test should be used instead of a one-tail test

E. it is unclear whether a correct or incorrect decision has been made

234. If we reject the null hypothesis, we conclude that:

A. there is enough statistical evidence to infer that the alternative hypothesis is true

B. there is not enough statistical evidence to infer that the alternative hypothesis is true

C. there is enough statistical evidence to infer that the null hypothesis is true

D. the test is statistically insignificant at whatever level of significance the test was conducted at

E. further tests need to be carried out to determine for sure whether the null hypothesis should be rejected or not

235. If we do not reject the null hypothesis, we conclude that:

A. there is enough statistical evidence to infer that the alternative hypothesis is true

B. there is not enough statistical evidence to infer that the alternative hypothesis is true

C. there is enough statistical evidence to infer that the null hypothesis is true

D. the test is statistically insignificant at whatever level of significance the test was conducted at

E. further tests need to be carried out to determine for sure whether the null hypothesis should be rejected or not

236. The p-value of a test is the:

A. smallest significance level at which the null hypothesis cannot be rejected

B. largest significance level at which the null hypothesis cannot be rejected

C. smallest significance level at which the null hypothesis can be rejected

D. largest significance level at which the null hypothesis can be rejected


E. probability that no errors have been made in rejecting or not rejecting the null hypothesis

237. In order to determine the p-value of a hypothesis test, which of the following is not needed?

A. whether the test is one-tail or two-tail

B. the value of the test statistic

C. the form of the null and alternate hypotheses

D. the level of significance

E. all of the above are needed to determine the p-value

238. Which of the following p-values will lead us to reject the null hypothesis if the significance level of the test is 5%?

A. 0.15

B. 0.10

C. 0.06

D. 0.20

E. 0.025
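
The decision rule that compares a p-value with the significance level can be written as a one-line check. This is a small sketch, assuming the common convention of rejecting when the p-value is at or below the significance level; the helper name `reject_null` is hypothetical.

```python
def reject_null(p_value, alpha):
    """Return True when the p-value is at or below the significance level."""
    return p_value <= alpha

candidate_p_values = [0.15, 0.10, 0.06, 0.20, 0.025]
rejected = [p for p in candidate_p_values if reject_null(p, alpha=0.05)]
print(rejected)  # [0.025]
```

The same rule shows why a rejection at one level carries over to every larger level: a p-value at or below 0.05 is also at or below 0.06.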

239. Suppose that we reject a null hypothesis at the 5% level of significance. For which of the following levels of significance do we also reject the null hypothesis?

A. 6%

B. 2.5%

C. 4%

D. 3%

E. 2%

240. Which of the following statements about hypothesis testing is true?

A. If the p-value is greater than the significance level, we fail to reject Ho.

B. A Type II error is rejecting the null when it is actually true.

C. If the alternative hypothesis is that the population mean is greater than a specified value, then the test is a two-tailed test.

D. The significance level equals one minus the probability of a Type I error.

E. None of the above statements are true.

241. The purpose of hypothesis testing is to:

A. test how far the mean of a sample is from zero


B. determine whether a statistical result is significant

C. determine the appropriate value of the significance level

D. derive the standard error of the data

E. determine the appropriate value of the null hypothesis

242. In hypothesis testing, what level of significance would be most appropriate to choose if you knew that making a Type I error would
be more costly than making a Type II error?

A. 0.005

B. 0.025

C. 0.050

D. 0.100

E. 0.028

243. The p-value obtained from a classical hypothesis test is:

A. the probability that the null hypothesis is true given the data

B. the probability that the null hypothesis is false given the data

C. the probability of observing the data or more extreme values if the null hypothesis is true

D. the probability of observing the data or more extreme values if the alternative hypothesis is true

E. the probability that the observed data were obtained due to chance
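
The idea of "the data or more extreme values, assuming the null hypothesis is true" can be made concrete for a z statistic. The sketch below assumes a standard normal null distribution; the function name `p_value_from_z` is hypothetical, and the one-tail case assumes the statistic points in the direction of the alternative.

```python
from statistics import NormalDist

def p_value_from_z(z, tails=2):
    """Tail probability of a result at least as extreme as z under H0."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return tails * upper_tail

print(round(p_value_from_z(1.96), 3))           # two-tailed
print(round(p_value_from_z(1.96, tails=1), 3))  # one-tailed
```

Note that the calculation needs the test statistic and the number of tails, but not the significance level.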

244. To test a hypothesis involving proportions, both np and n(1-p) should

A. Be at least 30

B. Be greater than 5

C. Lie in the range from 0 to 1

D. Be greater than 50

E. There are no specific conditions surrounding the values of n and p
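
The sample-size condition for a proportion test can be checked mechanically. This is a minimal sketch using the threshold of 5 that appears in the options above; the function name `normal_approximation_ok` is hypothetical, and some textbooks use a stricter threshold such as 10.

```python
def normal_approximation_ok(n, p, threshold=5):
    """Check that both np and n(1 - p) exceed the chosen threshold."""
    return n * p > threshold and n * (1 - p) > threshold

print(normal_approximation_ok(100, 0.5))  # np = 50, n(1 - p) = 50
print(normal_approximation_ok(20, 0.1))   # np = 2, too small
```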

245. What assumption is being made when we use the t-distribution to perform a hypothesis test?

A. That the underlying distribution has more than one modal class

B. That the underlying population has a constant variance

C. That the underlying population has a non-symmetrical distribution

D. That the underlying population follows an approximately Normal distribution

E. None of the above


246. Researchers would like the probability of which of the following research decision outcomes to be the greatest?

A. the probability of Type II (Beta)

B. the probability of Type I

C. Statistical Power (1.00 minus Beta)

D. None of the mentioned

247. The probability that the test statistic will fall inside the region of rejection due to chance alone is equal to which of the following
probabilities?

266. ‘Children can learn a second language faster before the age of 7’. Is this statement:

A. A non-scientific statement

B. A one-tailed hypothesis

C. A two-tailed hypothesis

D. A null hypothesis
267. If my experimental hypothesis were ‘Eating cheese before bed affects the number of nightmares you have’, what
would the null hypothesis be?

A. Eating cheese before bed gives you more nightmares.

B. Eating cheese before bed gives you fewer nightmares.

C. Eating cheese is linearly related to the number of nightmares you have.

D. The number of nightmares you have is not affected by eating cheese before bed.

268. If my null hypothesis is ‘Dutch people do not differ from English people in height’, what is my alternative
hypothesis?

A. All of the statements are plausible alternative hypotheses.

B. Dutch people are taller than English people.

C. English people are taller than Dutch people.

D. Dutch people differ in height from English people.

269. Of what is p the probability if the null hypothesis were true? (Hint: NHST relies on fitting a ‘model’ to the data and
then evaluating the probability of this ‘model’ given the assumption that no effect exists.)

A. p is the probability that the results are due to chance, the probability that the null hypothesis (H0) is true.

B. p is the probability of observing a test statistic at least as big as the one we have if there were no effect in the
population (i.e., the null hypothesis were true).

C. p is the probability that the results are not due to chance, the probability that the null hypothesis (H0) is false.

D. p is the probability that the results would be replicated if the experiment was conducted a second time.

270. A Type I error occurs when: (Hint: When we use test statistics to tell us about the true state of the world, we’re
trying to see whether there is an effect in our population.)

A. We conclude that there is not an effect in the population when in fact there is.

B. We conclude that the test statistic is significant when in fact it is not.

C. The data we have typed into SPSS is different from the data collected.

D. We conclude that there is an effect in the population when in fact there is not.

271. A statement about a population developed for the purpose of testing is called:
A. Hypothesis
B. Hypothesis testing
C. Level of significance
D. Test-statistic

272. Any hypothesis which is tested for the purpose of rejection under the assumption that it is true is called:
A. Null hypothesis

B. Alternative hypothesis

C. Statistical hypothesis
D. Composite hypothesis

273. A statement about the value of a population parameter is called:


A. Null hypothesis

B. Alternative hypothesis

C. Simple hypothesis

D. Composite hypothesis

274. Any statement whose validity is tested on the basis of a sample is called:
A. Null hypothesis

B. Alternative hypothesis

C. Statistical hypothesis

D. Simple hypothesis

275. A quantitative statement about a population is called:


A. Research hypothesis

B. Composite hypothesis

C. Simple hypothesis

D. Statistical hypothesis

276. A statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false is called:
A. Simple hypothesis

B. Composite hypothesis
C. Statistical hypothesis
D. Alternative hypothesis

277. The alternative hypothesis is also called:


A. Null hypothesis

B. Statistical hypothesis

C. Research hypothesis

D. Simple hypothesis

278. A hypothesis that specifies all the values of parameter is called:


A. Simple hypothesis
B. Composite hypothesis
C. Statistical hypothesis
D. None of the above

279. The hypothesis µ ≤ 10 is a:
A. Simple hypothesis
B. Composite hypothesis
C. Alternative hypothesis
D. Difficult to tell

280. If a hypothesis specifies the population distribution, it is called:


A. Simple hypothesis

B. Composite hypothesis

C. Alternative hypothesis

D. None of the above

281. A hypothesis may be classified as:

A. Simple

B. Composite

C. Null

D. All of the above

282.The probability of rejecting the null hypothesis when it is true is called:


A. Level of confidence

B. Level of significance

C. Power of the test

D. Difficult to tell
283. The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected is said
to be:
A. Critical region

B. Critical value

C. Acceptance region

D. Significant region

284. If the critical region is located equally in both sides of the sampling distribution of test-statistic, the test is called:
A. One tailed

B. Two tailed

C. Right tailed

D. Left tailed

285.The choice of one-tailed test and two-tailed test depends upon:


A. Null hypothesis

B. Alternative hypothesis

C. None of these

D. Composite hypotheses

286. Test of hypothesis Ho: µ = 50 against H1: µ > 50 leads to:


A. Left-tailed test

B. Right-tailed test

C. Two-tailed test

D. Difficult to tell

287. Test of hypothesis Ho: µ = 20 against H1: µ < 20 leads to:


A. Right one-sided test

B. Left one-sided test

C. Two-sided test

D. All of the above

288. Testing Ho: µ = 25 against H1: µ ≠ 25 leads to:


A. Two-tailed test

B. Left-tailed test

C. Right-tailed test

D. Neither of the mentioned ones


289. A rule or formula that provides a basis for testing a null hypothesis is called:
A. Test-statistic

B. Population statistic

C. Both of these

D. None of the mentioned ones

290. The range of test statistic-Z is:


A. 0 to 1

B. -1 to +1

C. 0 to ∞

D. -∞ to +∞

291.The range of test statistic-t is:


A. 0 to ∞

B. 0 to 1

C. -∞ to +∞

D. -1 to +1

292. If Ho is true and we reject it, this is called:


A. Type-I error

B. Type-II error

C. Standard error

D. Sampling error

293. The probability associated with committing type-I error is:


A. β
B. α
C. 1 – β
D. 1 – α
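
The link between the significance level and the Type I error rate can be checked by simulation. The sketch below assumes a two-sided z-test with known σ = 1 and a true null hypothesis µ = 0; the function name, sample size, trial count, and seed are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist

def simulated_type_i_rate(alpha=0.05, n=25, trials=2000, seed=0):
    """Fraction of true-null experiments in which H0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # (mean - 0) / (1 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = simulated_type_i_rate()
print(f"observed Type I error rate: {rate:.3f}")
```

With enough trials the observed rejection rate should settle close to the chosen α, since H0 is true in every simulated experiment.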

294. A failing student is passed by an examiner. This is an example of:


A. Type-I error

B. Type-II error

C. Unbiased decision

D. Difficult to tell
296. A passing student is failed by an examiner. This is an example of:
A. Type-I error

B. Type-II error

C. Best decision

D. All of the mentioned

297. 1 – α is also called:


A. Confidence coefficient

B. Power of the test

C. Size of the test

D. Level of significance

298. 1 – α is the probability associated with:


A. Type-I error

B. Type-II error

C. Level of confidence

D. Level of significance

299.Area of the rejection region depends on:


A. Size of α
B. Size of β
C. Test-statistic
D. Number of values

300. Size of critical region is known as:


A. β

B. 1 - β

C. Critical value

D. Size of the test

301. A null hypothesis is rejected if the value of a test statistic lies in the:
A. Rejection region

B. Acceptance region

C. Both the mentioned ones

D. Neither of the mentioned ones


302. Level of significance is also called:
A. Power of the test

B. Size of the test

C. Level of confidence

D. Confidence coefficient

303. Level of significance α lies between:


A. -1 and +1

B. 0 and 1

C. 0 and n

D. -∞ to +∞

304. Critical region is also called:


A. Acceptance region

B. Rejection region

C. Confidence region

D. Statistical region

305. The probability of rejecting Ho when it is false is called:


A. Power of the test

B. Size of the test

C. Level of confidence

D. Confidence coefficient

306. Power of a test is related to:


A. Type-I error

B. Type-II error

C. Both the options are correct

D. Neither of the mentioned

307. In testing hypothesis α + β is always equal to:

A. One

B. Zero

C. Two
D. Difficult to tell

308. The significance level is the risk of:


A. Rejecting Ho when Ho is correct

B. Rejecting Ho when H1 is correct

C. Rejecting H1 when H1 is correct


D. Accepting Ho when Ho is correct.

309. An example of a two-sided alternative hypothesis is:


A. H1: µ < 0

B. H1: µ > 0

C. H1: µ ≥ 0

D. H1: µ ≠ 0

310. If the magnitude of calculated value of t is less than the tabulated value of t and H1 is two-sided, we
should:
A. Reject Ho

B. Accept H1

C. Not reject Ho

D. Difficult to tell

311. Accepting a null hypothesis Ho:


A. Proves that Ho is true
B. Proves that Ho is false
C. Implies that Ho is likely to be true

D. Proves that µ ≤ 0

312. The chance of rejecting a true hypothesis decreases when sample size is:
A. Decreased

B. Increased

C. Constant

D. None of the mentioned

313. The equality condition always appears in:


A. Null hypothesis

B. Simple hypothesis

C. Alternative hypothesis
D. Both (A) and (B)

314. Which hypothesis is always in an inequality form?


A. Null hypothesis

B. Alternative hypothesis

C. Simple hypothesis

D. Composite hypothesis

315. Which of the following is composite hypothesis?


A. µ ≥ µo

B. µ ≤ µo

C. µ = µo

D. µ ≠ µo

316. P (Type I error) is equal to:


A. 1 – α
B. 1 – β
C. α
D. β

317. P (Type II error) is equal to:


A. α
B. β
C. 1 – α
D. 1 – β

318.The power of the test is equal to:


A. α

B. β

C. 1 – α

D. 1 – β

319. The degree of confidence is equal to:

A. α

B. β

C. 1 – α

D. 1 – β

320. α / 2 is called:
A. One tailed significance level

B. Two tailed significance level

C. Left tailed significance level


D. Right tailed significance level

321. Student’s t-test is applicable only when:


A. n≤30 and σ is known

B. n>30 and σ is unknown

C. n=30 and σ is known

D. All of the mentioned

321. Student’s t-statistic is applicable in case of:

A. Equal number of samples

B. Unequal number of samples

C. Small samples

D. All of the above

322. Paired t-test is applicable when the observations in the two samples are:
A. Equal in number

B. Paired

C. Correlation

D. All of the above

323. The degree of freedom for paired t-test based on n pairs of observations is:
A. 2n - 1
B. n - 2
C. 2(n - 1)
D. n – 1
324. The test-statistic : t= with = has d.f = :

A. n

B. m-1

C. n - 2

D. m+n - 2

325. In an unpaired samples t-test with sample sizes n1 = 11 and n2 = 11, the value of tabulated t should be obtained
for:
A. 10 degrees of freedom
B. 21 degrees of freedom
C. 22 degrees of freedom
D. 20 degrees of freedom

326. In analyzing the results of an experiment involving seven paired samples, tabulated t should be obtained
for:
A. 13 degrees of freedom

B. 6 degrees of freedom

C. 12 degrees of freedom
D. 14 degrees of freedom
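
The degrees of freedom used to look up tabulated t follow directly from the sample sizes. This is a small sketch, assuming the pooled (equal-variance) form for the unpaired test; the helper names are hypothetical, and a Welch-type test would use a different formula.

```python
def df_unpaired(n1, n2):
    """Degrees of freedom for a pooled two-sample t-test."""
    return n1 + n2 - 2

def df_paired(n_pairs):
    """Degrees of freedom for a paired t-test on n_pairs differences."""
    return n_pairs - 1

print(df_unpaired(11, 11))  # 20
print(df_paired(7))         # 6
```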

327.The mean difference between 16 paired observations is 25 and the standard deviation of differences is
10. The value of statistic-t is:
A. 4

B. 10

C. 16

D. 25
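
The paired t statistic is the mean difference divided by its standard error. The sketch below works through the numbers in the question, assuming the hypothesized mean difference is zero; the function name `paired_t_statistic` is hypothetical.

```python
import math

def paired_t_statistic(mean_diff, sd_diff, n_pairs):
    """t = mean difference / (sd of differences / sqrt(number of pairs))."""
    standard_error = sd_diff / math.sqrt(n_pairs)
    return mean_diff / standard_error

t = paired_t_statistic(mean_diff=25, sd_diff=10, n_pairs=16)
print(t)  # 10.0
```

Here the standard error is 10 / 4 = 2.5, so t = 25 / 2.5 = 10.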

328. Statistic-t is defined as deviation of sample mean from population mean µ expressed in terms of:
A. Standard deviation

B. Standard error

C. Coefficient of standard deviation


D. Coefficient of variation

329. Student’s t-distribution has (n-1) d.f. when all the n observations in the sample are:
A. Dependent

B. Independent

C. Maximum

D. Minimum

330. The number of independent values in a set of values is called:


A. Test-statistic

B. Degree of freedom

C. Level of significance

D. Level of confidence

331. The purpose of statistical inference is:


A. To collect sample data and use them to formulate hypotheses about a population
B. To draw conclusion about populations and then collect sample data to support the conclusions
C. To draw conclusions about populations from sample data
D. To draw conclusions about the known value of population parameter

332. Suppose that the null hypothesis is true and it is rejected. This is known as:
A. A type-I error, and its probability is β
B. A type-I error, and its probability is α
C. A type-II error, and its probability is α
D. A type-II error, and its probability is β

333. An advertising agency wants to test the hypothesis that the proportion of adults in Pakistan who read a Sunday
Magazine is 25 percent. The null hypothesis is that the proportion reading the Sunday Magazine is:
A. Different from 25%

B. Equal to 25%

C. Less than 25 %

D. More than 25 %

334. If the mean of a particular population is µ=0 and s.d.=1


Then the population is distributed:
A. As a standard normal variable, if the population is non-normal
B. As a standard normal variable, if the sample is large
C. As a standard normal variable, if the population is normal
D. As the t-distribution with v = n - 1 degrees of freedom

335. If µ1 and µ2 are means of two populations and two samples of sizes n1 and n2 are drawn from them then the
pooled sample is distributed:
A. As a standard normal variable, if both samples are independently drawn and n1 + n2 is less than 30

B. As a standard normal variable, if both populations are normal


C. As the t-distribution with n1 + n2 - 2 degrees of freedom
D. None of the mentioned

336. If the population proportion equals p, then

Z = (p̂ - p) / sqrt{ p * (1 - p) / n } is distributed for a sample of size n:

A. As a standard normal variable, if n > 30


B. As a Poisson variable
C. As the t-distribution with v = n - 1 degrees of freedom
D. As a distribution with v degrees of freedom

337. When σ is known, the hypothesis about population mean is tested by:
A. t-test
B. Z-test

C. χ2-test

D. F-test

338. Given µo = 130, x̄ = 150, σ = 25 and n = 4; which test statistic is appropriate?


A. t

B. Z

C. χ2

D. F

339. Given Ho: µ = µo, H1: µ ≠ µo, α = 0.05 and we reject Ho; the absolute value of the Z-statistic must have equalled or
been beyond what value?
(a) 1.96 (b) 1.65 (c) 2.58 (d) 2.33
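
The critical value for a two-sided z-test can be recovered from the standard normal quantile function rather than from a table. This is a minimal sketch using Python's standard library; the function name `two_sided_z_critical` is hypothetical.

```python
from statistics import NormalDist

def two_sided_z_critical(alpha):
    """Value that |Z| must reach to reject H0 in a two-sided test."""
    return NormalDist().inv_cdf(1 - alpha / 2)

print(round(two_sided_z_critical(0.05), 2))  # 1.96
print(round(two_sided_z_critical(0.01), 2))  # 2.58
```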

339. If p1 and p2 are not identical, then standard error of the difference of proportions (p1 – p2) is:
A. cannot be determined
B. sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }, where p is the pooled sample proportion, n1 is the size of sample
1, and n2 is the size of sample 2.


C. p * ( 1 - p ) * [ (1/n1) + (1/n2) ], where p is the pooled sample proportion, n1 is the size of sample 1, and
n2 is the size of sample 2.
D. None of the mentioned points.

340. Under the hypothesis Ho: p1 = p2, the formula for the pooled sample proportion is:
A. p = (p1 * n1 + p2 * n2) / (n1 + n2), where p1 is the sample proportion from population 1, p2 is
the sample proportion from population 2, n1 is the size of sample 1, and n2 is the size of
sample 2.
B. p = (p1 * n2 + p2 * n1) / (n1 + n2)
C. None of the mentioned
D. Any one can be used.
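
The pooled sample proportion, and the standard error built from it, can be computed in a few lines. This is a sketch assuming the size-weighted form of the pooled proportion under Ho: p1 = p2; the function names and the sample values (0.40 from 100 observations, 0.50 from 300) are hypothetical.

```python
import math

def pooled_proportion(p1, n1, p2, n2):
    """Weight each sample proportion by its own sample size."""
    return (p1 * n1 + p2 * n2) / (n1 + n2)

def pooled_standard_error(p1, n1, p2, n2):
    """Standard error of (p1 - p2) using the pooled proportion."""
    p = pooled_proportion(p1, n1, p2, n2)
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

p = pooled_proportion(0.40, 100, 0.50, 300)
se = pooled_standard_error(0.40, 100, 0.50, 300)
print(round(p, 3), round(se, 4))
```

Weighting each proportion by its own sample size is equivalent to dividing the total number of successes by the total number of observations.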

341.A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible
consequences, including chance event outcomes, resource costs, and utility.
A. Decision tree
B. Graphs
C. Trees
D. Neural Networks

342. Decision Tree is a display of an algorithm.


A. True
B. False

C. None of the options.

343. What is Decision Tree?


A. Flow-Chart
B. Structure in which internal node represents test on an attribute, each branch represents outcome of test and each
leaf node represents class label
C. Flow-Chart & Structure in which internal node represents test on an attribute, each branch represents outcome of
test and each leaf node represents class label
D. None of the mentioned
ANSWER: C

344. Decision Trees can be used for Classification Tasks.


A. True
B. False

345. Choose from the following that are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
d) All of the mentioned

346. Decision Nodes are represented by ____________


A. Disks
B. Squares
C. Circles
D. Triangles

347. Chance Nodes are represented by __________


A. Disks
B. Squares
C. Circles
D. Triangles

348. End Nodes are represented by __________


A. Disks
B. Squares
C. Circles
D. Triangles

349. Which of the following is/are the advantage/s of a Decision Trees?


A. Possible Scenarios can be added
B. Use a white box model, if given result is provided by a model
C. Worst, best and expected values can be determined for different scenarios
D. All of the mentioned

350. What are tree based classifiers?

A. Classifiers which form a tree with each attribute at one level


B. Classifiers which perform series of condition checking with one attribute at a time
C. Both options except none
D. None of the options Ans: c

351. What is Gini index?


A. It is a type of index structure

B. It is a measure of purity

C. Both options except none

D. None of the options Ans: b
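
The Gini index as a purity measure can be illustrated with class counts. The sketch below assumes the Gini impurity form used by CART-style trees, 1 minus the sum of squared class proportions; the function name `gini_impurity` and the example counts are hypothetical.

```python
def gini_impurity(class_counts):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

print(round(gini_impurity([9, 5]), 3))  # mixed node
print(gini_impurity([14, 0]))           # pure node
print(gini_impurity([7, 7]))            # evenly mixed two-class node
```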

352. Tree/Rule based classification algorithms generate ... rule to perform the classification.
A. if-then.
B. while.
C. do while.
D. switch.
353. Which of the following sentences are correct in reference to Information gain?
A. It is biased towards single-valued attributes
B. It is biased towards multi-valued attributes(X)
C. ID3 makes use of information gain(Y)
D. The approach used by ID3 is greedy (Z)
E. All of X, Y and Z
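
The bias of information gain towards attributes with many values can be seen on a toy example. The sketch below compares a unique record identifier with a two-valued attribute; the data, attribute names, and helper functions are hypothetical and chosen only to show the effect.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["yes", "no", "yes", "no", "yes", "no"]
record_id = [1, 2, 3, 4, 5, 6]         # one distinct value per record
coin = ["h", "h", "h", "t", "t", "t"]  # only two distinct values

print(round(information_gain(record_id, labels), 3))
print(round(information_gain(coin, labels), 3))
```

The identifier splits the data into single-record partitions that are all pure, so it attains the maximum possible gain even though it has no predictive value.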

354. Cost complexity pruning algorithm is used in?


A. CART
B. C4.5
C. ID3
D. All
355. Multivariate split is where the partitioning of tuples is based on a combination of attributes rather than on a
single attribute.
A. True
B. False
C. None

356. CART system cannot find multivariate splits.

A. True

B. False

C. None

357.Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the other.

A. True

B. False

C. None

358. The Gini index is not biased towards multivalued attributes.


Ans: False

359. Gini index does not favour equal sized partitions.

A. True

B. False
C. Cannot be determined

360. When the number of classes is large Gini index is not a good choice

A. True

B. False

C. Cannot be decided

361.Attribute selection measures are also known as splitting rules.

A. True

B. False

C. None

362. Which one of these is not a tree based learner?


A. CART
B. ID3
C. Bayesian classifier
D. Random Forest Ans: c
363. Which one of these is a tree based learner?
A. Rule based
B. Bayesian Belief Network
C. Bayesian classifier
D. Random Forest
364. What is the approach of basic algorithm for decision tree induction?
A. Greedy
B. Top Down
C. Procedural
D. Step by Step
365. Which of the following classifications would best suit the student performance classification systems?
A. If...then... analysis
B. Market-basket analysis
C. Regression analysis
D. Cluster analysis

Consider the following dataset in Table 1 for questions 366-373, where each record describes the weather conditions
(sky appearance, temperature, humidity, windiness) and the resulting playing decision (Yes/No).

Sky_Appearance  Temperature  Humidity  Windiness  Playing_Decision
Sunny           Hot          High      True       No
Sunny           Hot          High      False      No
Cloudy          Hot          High      False      Yes
Rainy           Mild         High      False      Yes
Rainy           Cool         Normal    False      Yes
Rainy           Cool         Normal    True       No
Cloudy          Cool         Normal    True       Yes
Sunny           Mild         High      False      No
Sunny           Cool         Normal    False      Yes
Rainy           Mild         Normal    False      Yes
Sunny           Mild         Normal    True       Yes
Cloudy          Mild         High      True       Yes
Cloudy          Hot          Normal    False      Yes
Rainy           Mild         High      True       No

Table 1. Playing_condition

366. How many records will be in the root initially?


A. 14
B. 9
C. 5
D. 10
367. In the root node how many classes will be there?
A. 3
B. 2
C. 4
D. 14

368. The records at the root node will be divided into how many nodes in the next level based on
Sky_Appearance?
A. 4
B. 3
C. 14
D. 2

369. Using ID3 algorithm what is the information gain for each possible split if you split based on
Sky_Appearance?
A. 0.30
B. 0.40
C. 0.247
D. 0.1
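
The information gain for a split on Sky_Appearance can be checked against the 14 records of Table 1. This is a sketch of the ID3-style calculation using base-2 entropy; the variable and function names are hypothetical, and only the Sky_Appearance and Playing_Decision columns are copied from the table.

```python
import math
from collections import Counter

# (Sky_Appearance, Playing_Decision) pairs, in the row order of Table 1
records = [
    ("Sunny", "No"), ("Sunny", "No"), ("Cloudy", "Yes"), ("Rainy", "Yes"),
    ("Rainy", "Yes"), ("Rainy", "No"), ("Cloudy", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Cloudy", "Yes"),
    ("Cloudy", "Yes"), ("Rainy", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

labels = [decision for _, decision in records]
groups = {}
for sky, decision in records:
    groups.setdefault(sky, []).append(decision)

# Weighted entropy of the three partitions produced by the split
remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
gain = entropy(labels) - remainder

print(round(entropy(labels), 3))  # entropy at the root (9 Yes, 5 No)
print(round(gain, 3))             # 0.247
```

The Cloudy partition is pure (4 Yes, 0 No), which is what makes this split informative.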

370. Calculate the information gain(Sky_Appearance)?

A.0.1
B. 0.3
C. 0.029
D. 0.35

371. Calculate the information gain(student)?

A. 0.35
B. 0.248
C. 0.151
D. 0.43

372. Calculate the gain ratio(income)?

A. 0.024
B. 0.019
C. 0.34
D. 0.112
373. Calculate the gini(income)?
A. 0.34
B. 0.443
C. 0.56
D. 0.123
374. How will you counter over-fitting in decision tree?
A. By pruning the longer rules
B. By creating new rules
C. Both ‘By pruning the longer rules’ and ‘By creating new rules’
D. None of the options
375. What are the two approaches to tree pruning?
A. Pessimistic pruning and Optimistic pruning
B. Postpruning and Prepruning
C. Cost complexity pruning and time complexity pruning
D. None of the options Ans: b
376. Which of the following sentences are true?
A. In pre-pruning a tree is 'pruned' by halting its construction early.
B. A pruning set of class labelled tuples is used to estimate cost complexity.
C. The best pruned tree is the one that minimizes the number of encoding bits.
D. All of the above
377.The most widely used metrics and tools to assess a classification model are:
A. Confusion matrix
B. Cost-sensitive accuracy
C. Area under the ROC curve
D. All of the above
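
A confusion matrix for a binary classifier is just a tally of the four outcome types. The sketch below assumes "Yes" is the positive class; the function name `confusion_counts` and the actual/predicted labels are hypothetical data for illustration.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual = ["Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "Yes", "Yes", "No"]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)
print(counts, round(accuracy, 3))
```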
378.Which of the following is a good test dataset characteristic?
A. Large enough to yield meaningful results
B. Is representative of the dataset as a whole

C. Both A and B - answer


D. None of the above
379.Which of the following is a disadvantage of decision trees?
A. Factor analysis
B. Decision trees are robust to outliers

C. Decision trees are prone to be overfit - answer


D. None of the above
380.How do you handle missing or corrupted data in a dataset?
A. Drop missing rows or columns
B. Replace missing values with mean/median/mode
C. Assign a unique category to missing values

D. All of the above - answer


381.What is the purpose of performing cross-validation?
A. To assess the predictive performance of the models
B. To judge how the trained model performs outside the sample on test data

C. Both A and B - answer
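
The mechanics of cross-validation come down to rotating which part of the data is held out. This is a minimal sketch of k-fold index splitting without shuffling; the function name `k_fold_indices` is hypothetical, and a real workflow would usually shuffle the records first.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(10, 3)
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each record is held out exactly once, so every prediction used to score the model is made on data the model did not see during training.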


382.Why is second order differencing in time series needed?
A. To remove non-stationarity
B. To find the maxima or minima at the local point
C. Both A and B - answer
D. None of the above
383.When performing regression or classification, which of the following is the correct way to preprocess the
data?

A. Normalize the data → PCA → training - answer


B. PCA → normalize PCA output → training
C. Normalize the data → PCA → normalize PCA output → training
D. None of the above
384.Which of the following is an example of feature extraction?
A. Constructing bag of words vector from an email
B. Applying PCA to project large high-dimensional data
C. Removing stopwords in a sentence

D. All of the above - answer


385.What is pca.components_ in Sklearn?

A. Set of all eigen vectors for the projection space - answer


B. Matrix of principal components
C. Result of the multiplication matrix
D. None of the above options
386.Which of the following is true about Naive Bayes ?
A. Assumes that all the features in a dataset are equally important
B. Assumes that all the features in a dataset are independent

C. Both A and B - answer


D. None of the above options
387.Which of the following statements about regularization is not correct?
A. Using too large a value of lambda can cause your hypothesis to underfit the data.
B. Using too large a value of lambda can cause your hypothesis to overfit the data.
C. Using a very large value of lambda cannot hurt the performance of your hypothesis.

D. None of the above - answer


388.How can you prevent a clustering algorithm from getting stuck in bad local optima?
A. Set the same seed value for each run

B. Use multiple random initializations - answer


C. Both A and B
D. None of the above
389.Which of the following techniques can be used for normalization in text mining?
A. Stemming
B. Lemmatization
C. Stop Word Removal

D. Both A and B - answer


390.In which of the following cases will K-means clustering fail to give good results? 1) Data points with
outliers 2) Data points with different densities 3) Data points with nonconvex shapes
A. 1 and 2
B. 2 and 3

C. 1, 2, and 3 - answer
D. 1 and 3
391.Which of the following is a reasonable way to select the number of principal components "k"?

A. Choose k to be the smallest value so that at least 99% of the variance is retained. - answer
B. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
C. Choose k to be the largest value so that 99% of the variance is retained.
D. Use the elbow method
392. You run gradient descent for 15 iterations with α=0.3 and compute J(θ) after each iteration. You find
that the value of J(θ) decreases quickly and then levels off. Based on this, which of the following
conclusions seems most plausible?
A. Rather than using the current value of α, use a larger value of α (say α=1.0)
B. Rather than using the current value of α, use a smaller value of α (say α=0.1)
C. α=0.3 is an effective choice of learning rate - answer
D. None of the above
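
The pattern of a cost that drops quickly and then levels off can be reproduced on a toy problem. The sketch below assumes the cost J(θ) = θ², which is not the cost from the question; the function name and the starting value θ = 5 are hypothetical.

```python
def gradient_descent(theta0, alpha, iterations):
    """Minimise the toy cost J(theta) = theta**2, recording J after each step."""
    theta = theta0
    costs = []
    for _ in range(iterations):
        gradient = 2 * theta        # dJ/dtheta for J = theta**2
        theta -= alpha * gradient   # standard gradient descent update
        costs.append(theta ** 2)
    return costs

costs = gradient_descent(theta0=5.0, alpha=0.3, iterations=15)
for step, cost in enumerate(costs, start=1):
    print(step, f"{cost:.6g}")
```

On this toy cost the values fall steadily towards zero, which is the behaviour usually read as a sign that the learning rate is working.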
393.What is a sentence parser typically used for?
A. It is used to parse sentences to check if they are utf-8 compliant.

B. It is used to parse sentences to derive their most likely syntax tree structures. - answer
C. It is used to parse sentences to assign POS tags to all tokens.
D. It is used to check if sentences can be parsed into meaningful tokens.
394. Suppose you have trained a logistic regression classifier and, for a new example x, it outputs a prediction
hθ(x) = 0.2. This means:
A. Our estimate for P(y=1 | x)

B. Our estimate for P(y=0 | x) - answer


C. Our estimate for P(y=1 | x)
D. Our estimate for P(y=0 | x)
395. A Type I error occurs when we:
A. reject a false null hypothesis
B. reject a true null hypothesis
C. do not reject a false null hypothesis
D. do not reject a true null hypothesis
E. fail to make a decision regarding whether to reject a hypothesis or not

396. In a criminal trial, a Type I error is made when:


A. a guilty defendant is acquitted (set free)
B. an innocent person is convicted (sent to jail)
C. a guilty defendant is convicted
D. an innocent person is acquitted
E. no decision is made about whether to acquit or convict the defendant

397. A Type II error occurs when we:


A. reject a false null hypothesis
B. reject a true null hypothesis
C. do not reject a false null hypothesis
D. do not reject a true null hypothesis
E. fail to make a decision regarding whether to reject a hypothesis or not

398. If a hypothesis is rejected at the 0.025 level of significance, it:


A. must be rejected at any level
B. must be rejected at the 0.01 level
C. must not be rejected at the 0.01 level
D. must not be rejected at any other level
E. may or may not be rejected at the 0.01 level

399. In a two-tail test for the population mean, if the null hypothesis is rejected when the
alternative is true, then:
A. a Type I error is committed
B. a Type II error is committed
C. a correct decision is made
D. a one-tail test should be used instead of a two-tail test
E. it is unclear whether a correct or incorrect decision has been made

400. In a one-tail test for the population mean, if the null hypothesis is not rejected when the
alternative hypothesis is true, then:
A. a Type I error is committed
B. a Type II error is committed
C. a correct decision is made
D. a two-tail test should be used instead of a one-tail test
E. it is unclear whether a correct or incorrect decision has been made

401. In a one-tail test for the population mean, if the null hypothesis is rejected when the
alternative hypothesis is not true, then:
A. a Type I error is committed
B. a Type II error is committed
C. a correct decision is made
D. a two-tail test should be used instead of a one-tail test
E. it is unclear whether a correct or incorrect decision has been made
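The four possible outcomes behind the preceding questions on Type I and Type II errors can be tabulated in a few lines. This is a minimal sketch; the function name classify_decision and its boolean arguments are hypothetical, and it assumes the truth of the null hypothesis is known, which it never is in practice.

```python
def classify_decision(null_is_true, rejected):
    """Label the outcome of a test given the (normally unknown) truth and the decision taken."""
    if rejected and null_is_true:
        return "Type I error"      # rejected a true null hypothesis
    if not rejected and not null_is_true:
        return "Type II error"     # failed to reject a false null hypothesis
    return "correct decision"


for null_is_true in (True, False):
    for rejected in (True, False):
        print(null_is_true, rejected, classify_decision(null_is_true, rejected))
```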

402. If we reject the null hypothesis, we conclude that:


A. there is enough statistical evidence to infer that the alternative hypothesis is true
B. there is not enough statistical evidence to infer that the alternative hypothesis is true
C. there is enough statistical evidence to infer that the null hypothesis is true
D. the test is statistically insignificant at whatever level of significance the test was conducted at
E. further tests need to be carried out to determine for sure whether the null hypothesis should
be rejected or not

403. If we do not reject the null hypothesis, we conclude that:


A. there is enough statistical evidence to infer that the alternative hypothesis is true
B. there is not enough statistical evidence to infer that the alternative hypothesis is true
C. there is enough statistical evidence to infer that the null hypothesis is true
D. the test is statistically insignificant at whatever level of significance the test was conducted at
E. further tests need to be carried out to determine for sure whether the null hypothesis should
be rejected or not

403. The p-value of a test is the:


A. smallest significance level at which the null hypothesis cannot be rejected
B. largest significance level at which the null hypothesis cannot be rejected
C. smallest significance level at which the null hypothesis can be rejected
D. largest significance level at which the null hypothesis can be rejected
E. probability that no errors have been made in rejecting or not rejecting the null hypothesis

404. In order to determine the p-value of a hypothesis test, which of the following is not needed?
A. whether the test is one-tail or two-tail
B. the value of the test statistic
C. the form of the null and alternate hypotheses
D. the level of significance
E. all of the above are needed to determine the p-value

405. Which of the following p-values will lead us to reject the null hypothesis if the significance
level of the test is 5%?
A. 0.15
B. 0.10
C. 0.06
D. 0.20
E. 0.025
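The decision rule behind this question can be written out directly. The sketch below assumes the common convention of rejecting when the p-value is less than or equal to the significance level (some texts use a strict inequality); the function name reject_null is hypothetical.

```python
def reject_null(p_value, alpha):
    """Reject the null hypothesis when the p-value does not exceed the significance level."""
    return p_value <= alpha


# The candidate p-values listed in the options, tested at the 5% level.
candidates = [0.15, 0.10, 0.06, 0.20, 0.025]
rejected = [p for p in candidates if reject_null(p, alpha=0.05)]
print(rejected)
```

The same rule shows why a result that is rejected at one level is also rejected at every larger level: raising alpha can only turn a False into a True.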

406. Suppose that we reject a null hypothesis at the 5% level of significance. For which of the
following levels of significance do we also reject the null hypothesis?
A. 6%
B. 2.5%
C. 4%
D. 3%
E. 2%

407. Which of the following statements about hypothesis testing is true?


A. If the p-value is greater than the significance level, we fail to reject H0.
B. A Type II error is rejecting the null when it is actually true.
C. If the alternative hypothesis is that the population mean is greater than a specified value, then
the test is a two-tailed test.
D. The significance level equals one minus the probability of a Type I error.
E. None of the above statements are true.

408. The purpose of hypothesis testing is to:


A. test how far the mean of a sample is from zero
B. determine whether a statistical result is significant
C. determine the appropriate value of the significance level
D. derive the standard error of the data
E. determine the appropriate value of the null hypothesis

409. In hypothesis testing, what level of significance would be most appropriate to choose if you
knew that making a Type I error would be more costly than making a Type II error?
A. 0.005
B. 0.025
C. 0.050
D. 0.100
E. 0.028

410. The p-value obtained from a classical hypothesis test is:


A. the probability that the null hypothesis is true given the data
B. the probability that the null hypothesis is false given the data
C. the probability of observing the data or more extreme values if the null hypothesis is true
D. the probability of observing the data or more extreme values if the alternative hypothesis is
true
E. the probability that the observed data were obtained due to chance
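As an illustration of how a p-value is obtained, the sketch below computes it for a z-test, assuming the test statistic follows a standard normal distribution under the null hypothesis. The function names normal_sf and z_test_p_value are hypothetical. The inputs are only the test statistic and the direction of the alternative; no significance level appears in the computation.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def z_test_p_value(z, tail="two"):
    """Probability, under the null, of a statistic at least as extreme as z."""
    if tail == "two":
        return 2.0 * normal_sf(abs(z))
    if tail == "upper":
        return normal_sf(z)
    if tail == "lower":
        return normal_sf(-z)
    raise ValueError("tail must be 'two', 'upper' or 'lower'")


print(z_test_p_value(1.96))            # two-tail p-value, close to 0.05
print(z_test_p_value(1.96, "upper"))   # one-tail p-value, close to 0.025
```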

411. To test a hypothesis involving proportions, both np and n(1-p) should


A. Be at least 30
B. Be greater than 5
C. Lie in the range from 0 to 1
D. Be greater than 50
E. There are no specific conditions surrounding the values of n and p
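The sample-size condition for a proportion test can be checked in one line. This sketch assumes the threshold of 5 from option B; some textbooks use 10 instead, so the threshold is left as a parameter. The function name normal_approx_ok is hypothetical.

```python
def normal_approx_ok(n, p, threshold=5):
    """Check that both expected counts, n*p and n*(1-p), exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold


print(normal_approx_ok(100, 0.5))   # expected counts are 50 and 50
print(normal_approx_ok(20, 0.1))    # expected counts are 2 and 18
```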

412. What assumption is being made when we use the t-distribution to perform a hypothesis
test?
A. That the underlying distribution has more than one modal class
B. That the underlying population has a constant variance
C. That the underlying population has a non-symmetrical distribution
D. That the underlying population follows an approximately Normal distribution
E. None of the above

413. In data mining, the acronym ETL stands for:

A. Elevation, Transport and Leveling

B. Extraction, Transmission and Loading

C. Extraction, Transformation and Loading

D. None of the mentioned

414. Which of the following is correct?

A. 1=

B. = -

C. 1=-

D. All options are equivalent

E. Only 1= is correct
