DATA SCIENCE Question Bank For ESA 20 123 .Odt
A) 10
B)25
C) 50
D) 0
2. A test is administered annually. The test has a mean score of 150 and a standard deviation of 20. If Ravi’s
z-score is 1.50, what was his score on the test?
A) 180
B) 130
C) 30
D) 150
E) None of the above
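As a quick check of the arithmetic behind question 2, a z-score can be converted back to a raw score with x = mean + z × sd. A minimal Python sketch (the function name score_from_z is an illustrative assumption, not part of the question bank):

```python
def score_from_z(mean: float, sd: float, z: float) -> float:
    """Convert a z-score back to a raw score: x = mean + z * sd."""
    return mean + z * sd

# Values from question 2: mean 150, standard deviation 20, z = 1.50.
print(score_from_z(150, 20, 1.50))  # 180.0
```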
3. If the variance of a dataset is correctly computed with the formula using (n – 1) in the denominator, which
of the following options is true?
A) Dataset is a sample
B) Dataset is a population
C) Dataset could be either a sample or a population
D) Dataset is from a census
E) None of the above
A) True
B) False
In the formula for standard deviation, a very high or a very low value increases the standard
deviation because it lies far from the mean. Hence outliers will affect the standard deviation.
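The effect described above can be seen on a small example. The sketch below uses made-up values (an assumption for illustration only) and compares the sample standard deviation before and after one extreme value is added:

```python
import statistics

data = [10, 12, 11, 13, 12]     # hypothetical values with no outlier
with_outlier = data + [60]      # same values plus one extreme observation

sd_before = statistics.stdev(data)          # sample SD, (n - 1) denominator
sd_after = statistics.stdev(with_outlier)

print(sd_before, sd_after)      # the second value is much larger
```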
5. Studies show that listening to music while studying can improve your memory. To demonstrate this, a
researcher obtains a sample of 36 college students and gives them a standard memory test while they listen
to some background music. Under normal circumstances (without music), the mean score obtained was 25
and the standard deviation was 6. The mean score for the sample after the experiment (i.e., with music) is 28.
NOTE:
MCQ on Data Science. Compiled by Prof. A.B. Chowdhury, HOD, CA, TIU, W.B., India
The null hypothesis is the generally assumed statement that there is no relationship in the measured phenomena.
Here the null hypothesis would be that there is no relationship between listening to music and improvement in
memory.
A) Concluding that listening to music while studying improves memory, and it’s right.
B) Concluding that listening to music while studying improves memory when it actually doesn’t.
C) Concluding that listening to music while studying does not improve memory but it does.
6. Type 1 error means that we reject the null hypothesis when it is actually true. Here the null hypothesis is that
music does not improve memory. Type 1 error would be that we reject it and say that music does improve
memory when it actually doesn’t.
7. Let’s perform the Z test on the given case. We know that the null hypothesis is that listening to music does
not improve memory.
The observed Z value is (28 – 25)/(6/√36) = 3.
Z critical value for α = 0.05 (one tailed) would be 1.65 as seen from the z table.
Therefore since the Z value observed is greater than the Z critical value, we can reject the null hypothesis and
say that listening to music does improve memory with 95% confidence.
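The Z test for the music experiment can be reproduced in a few lines. This is a minimal sketch using the figures from question 5 (population mean 25, standard deviation 6, n = 36, sample mean 28); the variable names are illustrative assumptions:

```python
import math

mu0, sigma, n, sample_mean = 25, 6, 36, 28

# Z statistic for a sample mean: (x_bar - mu0) / (sigma / sqrt(n))
z = (sample_mean - mu0) / (sigma / math.sqrt(n))
z_critical = 1.65   # one-tailed critical value at alpha = 0.05, as quoted in the text

print(z, z > z_critical)  # 3.0 True
```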
8. A researcher concludes from his analysis that a placebo cures AIDS. What type of error is he making?
A) Type 1 error
B) Type 2 error
D) Cannot be determined
By definition, a type 1 error is rejecting the null hypothesis when it is actually true, and a type 2 error is accepting
the null hypothesis when it is actually false. In this case, to define the error, we first need to define the null and
alternative hypotheses.
9. What happens to the confidence interval when we introduce some outliers to the data?
A medical doctor wants to reduce the blood sugar level of all his patients by altering their diet. He finds that the
mean sugar level of all patients is 180 with a standard deviation of 18. Nine of his patients start dieting and
the mean of the sample is observed to be 175. Now, he is considering recommending that all his patients go on a
diet.
A) 9
B) 6
C) 7.5
D) 18
The standard error of the mean is the standard deviation divided by the square root of the number of values, i.e.
Standard error = σ/√n = 18/√9 = 6
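The same calculation as a short Python sketch, using the doctor's figures (standard deviation 18, sample of 9 patients); the variable names are assumptions for illustration:

```python
import math

sigma, n = 18, 9

# Standard error of the mean: sigma / sqrt(n)
standard_error = sigma / math.sqrt(n)
print(standard_error)  # 6.0
```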
11. What is the probability of getting a mean of 175 or less after all the patients start dieting?
A) 20%
B) 25%
C) 15%
D) 12%
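One way to approach question 11 is to treat the sample mean as normally distributed around 180 with the standard error of 6 found earlier, then read off the cumulative probability at 175. A hedged sketch using Python's standard library (the normality assumption and the variable names are mine, not stated in the question):

```python
from statistics import NormalDist

population_mean = 180
standard_error = 18 / 9 ** 0.5   # sigma / sqrt(n) = 6

# Sampling distribution of the mean for n = 9 if dieting had no effect
sampling_dist = NormalDist(mu=population_mean, sigma=standard_error)
p = sampling_dist.cdf(175)       # P(sample mean <= 175)

print(round(p, 2))  # 0.2
```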
A) The doctor has valid evidence that dieting reduces blood sugar level.
B) The doctor does not have enough evidence that dieting reduces blood sugar level.
C) If the doctor makes all future patients diet in a similar way, the mean blood pressure will fall below 160.
A researcher is trying to examine the effects of two different teaching methods. He divides 20 students into
two groups of 10 each. For group 1, the teaching method uses fun examples, whereas for group 2 the
teaching method uses software to help students learn. After a 20-minute lecture for both groups, a test is
conducted for all the students.
We want to calculate if there is a significant difference in the scores of both the groups.
It is given that:
A) 3.191
B) 3.395
C) Cannot be determined.
D) None of the above
The t statistic is the difference between the group means divided by the standard error:
t = (10 - 7)/0.94 = 3.191
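A minimal sketch of that division, taking the group means (10 and 7) and the standard error (0.94) from the worked line above as given:

```python
mean_group1, mean_group2 = 10, 7
standard_error = 0.94   # taken from the worked solution, not recomputed here

# t statistic: difference of means divided by its standard error
t_statistic = (mean_group1 - mean_group2) / standard_error
print(round(t_statistic, 3))  # 3.191
```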
A) Yes
B) No
A) 36.13
B) 45.21
C) 40.33
D) 32.97
16. Correlation between two variables (Var1 and Var2) is 0.65. Now, after adding numeric 2 to all the values of
Var1, the correlation coefficient will _______?
A) Increase
B) Decrease
C) None of the above
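The effect of adding a constant can be checked numerically. The sketch below uses made-up data (an assumption; the question gives no values) and a hand-written Pearson correlation so that it depends only on the standard library:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

var1 = [1.0, 2.0, 4.0, 7.0, 9.0]    # hypothetical data
var2 = [2.0, 1.0, 5.0, 6.0, 10.0]
shifted = [x + 2 for x in var1]     # add 2 to every value of Var1

r_before = pearson_r(var1, var2)
r_after = pearson_r(shifted, var2)
print(math.isclose(r_before, r_after))  # True
```

Shifting every value moves the mean by the same amount, so all deviations from the mean, and therefore the correlation, are unchanged.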
16. It is observed that there is a very high correlation between math test scores and amount of physical
exercise done by a student on the test day. What can you infer from this?
1. High correlation implies that after exercise the test scores are high.
2. Correlation does not imply causation.
3. Correlation measures the strength of linear relationship between amount of exercise and test scores.
A) Only 1
B) 1 and 3
C) 2 and 3
D) All the statements are true
17. If the correlation coefficient (r) between scores in a math test and amount of physical exercise by a
student is 0.86, what percentage of variability in math test is explained by the amount of exercise?
A) 86%
B) 74%
C) 14%
D) 26%
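In simple linear regression the share of variability explained is the square of the correlation coefficient. A short sketch with r = 0.86 from question 17:

```python
r = 0.86

# Coefficient of determination for one predictor: r squared
r_squared = r ** 2
print(round(r_squared * 100))  # 74
```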
18. Consider a regression line y=ax+b, where a is the slope and b is the intercept. If we know the value of the
slope then by using which option can we always find the value of the intercept?
B) Put any value from the points used to fit the regression line and compute the value of b
C) Put the mean values of x & y in the equation along with the value a to get b
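A least squares line always passes through the point of means, which is why the intercept can be recovered as b = ȳ - a·x̄. The sketch below illustrates this on made-up data (an assumption for illustration):

```python
xs = [1, 2, 3, 4, 5]                 # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical responses

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least squares slope: S_xy / S_xx
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)

# Intercept from the means: b = mean_y - a * mean_x
b = mean_y - a * mean_x

# The fitted line evaluated at mean_x returns mean_y
print(abs(a * mean_x + b - mean_y) < 1e-12)  # True
```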
19. What happens when we introduce more variables to a linear regression model?
A) The r squared value may increase or remain constant, the adjusted r squared may increase or decrease.
B) The r squared may increase or decrease while the adjusted r squared always increases.
C) Both r square and adjusted r square always increase on the introduction of new variables in the model.
20. In univariate linear least squares regression, relationship between correlation coefficient and coefficient
of determination is ______ ?
C) The coefficient of determination is the square root of the coefficient of correlation
21. What is the relationship between significance level and confidence level?
22. Let the coefficient of determination be computed as 0.39 in a problem involving one independent variable
and one dependent variable. This result means that
A. Z-test
B. Z-test
C. Chi-square test
D. F-test
C. The relationship between the two variables is strong but negative
25. If “time” is used as the independent variable in a simple linear regression analysis, then which of the
following assumptions could be violated?
26. In multiple regression, when the global test of significance is rejected, we can conclude that
A. Independent variables are correlated less than -0.70 or more than 0.70
28. The strength (degree) of the correlation between a set of independent variables X and a dependent
variable Y is measured by
A. Coefficient of Correlation
B. Coefficient of Determination
C. Standard error of estimate
D. All of the mentioned
29. The estimate of β in the regression equation Y = α + βX + e by the method of least squares is:
A. Biased
B. Unbiased
C. Consistent
D. Efficient
30. An investigator reports that the arithmetic mean of two regression coefficients of a regression line is 0.7 and
the correlation coefficient is 0.75. Are the results
A. Valid
B. Invalid
C. Inconclusive
31. The property that the average of the two regression coefficients is always greater than or equal to the correlation
coefficient is called:
A. Fundamental property
B. Signature property
C. Magnitude property
D. Mean property
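The property in questions 30 and 31 follows from the fact that the product of the two regression coefficients equals r², so their arithmetic mean cannot be smaller than r when both are positive. A hedged sketch on made-up data (the data and the helper name regression_coefficients are illustrative assumptions):

```python
import math

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) for paired data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / syy, sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
ys = [2.0, 3.0, 5.0, 4.0, 8.0]

b_yx, b_xy, r = regression_coefficients(xs, ys)
print((b_yx + b_xy) / 2 >= r)    # True

# Question 30: a reported mean of 0.7 with r = 0.75 would break the property
print(0.7 >= 0.75)               # False
```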
A. (X, Y)
B. (, )
C. (0, 0)
D. (1, 1)
A. Origin
B. Scale
35. If the two lines of regression are perpendicular to each other, the correlation coefficient r is:
A. 0
B. -1
C. 1
D. Nothing can be said
38. If there is a very strong correlation between two variables then the correlation coefficient must be
A. any value larger than 1
B. much smaller than 0, if the correlation is negative
C. much larger than 0, regardless of whether the correlation is negative or positive
D. None of these alternatives is correct.
39. In regression, the equation that describes how the response variable (y) is related to the
explanatory variable (x) is:
A. the correlation model
B. the regression model
C. used to compute the correlation coefficient
D. None of these alternatives is correct.
40. The relationship between number of beers consumed (x) and blood alcohol content (y) was studied in
16 male college students by using least squares regression. The following regression equation was
obtained from this study:
ŷ = -0.0127 + 0.0180x
41. Regression modeling is a statistical framework for developing a mathematical equation that describes how
A. one explanatory and one or more response variables are related
B. several explanatory and several response variables are related
C. one response and one or more explanatory variables are related
D. All of these are correct.
43. Regression analysis was applied to return rates of sparrowhawk colonies. Regression analysis was used to study the
relationship between return rate (x: % of birds that return to the colony in a given year) and immigration rate (y: % of
new adults that join the colony per year). The following regression equation was obtained.
ŷ = 31.9 – 0.34x
Based on the above estimated regression equation, if the return rate were to decrease by 10% the rate of
immigration to the colony would:
A. increase by 34%
B. increase by 3.4%
C. decrease by 0.34%
D. decrease by 3.4%
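The slope of a regression line gives the change in the predicted response per unit change in x. The sketch below applies the slope from question 43, reading "decrease by 10%" as a drop of 10 percentage points in the return rate (that reading is an assumption):

```python
slope = -0.34                 # from the estimated equation in question 43
change_in_return_rate = -10   # return rate falls by 10 percentage points

# Predicted change in immigration rate = slope * change in x
change_in_immigration = slope * change_in_return_rate
print(round(change_in_immigration, 2))  # 3.4
```

A positive result means the predicted immigration rate rises.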
44. In least squares regression, which of the following is not a required assumption about the error term?
A. The expected value of the error term is one.
B. The variance of the error term is the same for all values of x.
C. The values of the error term are independent.
D. The error term is normally distributed.
45. Larger values of r² (R²) imply that the observations are more closely grouped about the
A. average value of the independent variables
B. average value of the dependent variable
C. least squares line
D. origin
B. is the square root of the coefficient of determination
C. is the same as r-square
D. can never be negative
48. In regression analysis, the variable that is used to explain the change in the outcome of an experiment, or some
natural process, is called
A. the x-variable
B. the independent variable
C. the predictor variable
D. the explanatory variable
E. all of the above (a-d) are correct
F. none are correct
49. In the case of an algebraic model for a straight line, if a value for the x variable is specified, then
A. the exact value of the response variable can be computed
B. the computed response to the independent value will always give a minimal residual
C. the computed value of y will always be the best estimate of the mean response
D. None of these alternatives is correct.
50. A regression analysis between sales (in $1000) and price (in dollars) resulted in the following equation:
Ŷ = 50,000 - 8X
51. If the coefficient of determination is a positive value, then the regression equation
A. must have a positive slope
B. must have a negative slope
C. could have either a positive or a negative slope
D. must have a positive y intercept
52. If two variables, x and y, have a very strong linear relationship, then
A. there is evidence that x causes a change in y
B. there is evidence that y causes a change in x
C. there might not be any causal relationship between x and y
D. None of these alternatives is correct.
54. In regression analysis, if the independent variable is measured in kilograms, the dependent variable
A. must also be in kilograms
B. must be in some unit of weight
C. cannot be in kilograms
D. can be in any units
55. The relationship between number of beers consumed (x) and blood alcohol content (y) was studied in 16 male college
students by using least squares regression. The following regression equation was obtained from this study:
ŷ = -0.0127 + 0.0180x
Suppose that the legal limit to drive is a blood alcohol content of 0.08. If Ricky consumed 5 beers the model would
predict that he would be:
A. 0.09 above the legal limit
B. 0.0027 below the legal limit
C. 0.0027 above the legal limit
D. 0.0733 above the legal limit
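The prediction in question 55 is a direct substitution into the fitted equation. A minimal sketch (the function name predicted_bac is an illustrative assumption):

```python
def predicted_bac(beers: float) -> float:
    """Predicted blood alcohol content from the fitted line in question 55."""
    return -0.0127 + 0.0180 * beers

legal_limit = 0.08
bac = predicted_bac(5)

# Predicted value and its distance below the legal limit
print(round(bac, 4), round(legal_limit - bac, 4))  # 0.0773 0.0027
```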
56. If the correlation coefficient is 0.8, the percentage of variation in the response variable explained by the variation in the
explanatory variable is
a. 0.80%
b. 80%
c. 0.64%
d. 64%
57. If the correlation coefficient is a positive value, then the slope of the regression line
A. must also be positive
B. can be either negative or positive
C. can be zero
D. cannot be zero
A. is 0.6561
B. could be either + 0.9 or - 0.9
C. must be positive
D. must be negative
60. Regression analysis was applied between $ sales (y) and $ advertising (x) across all the branches of a major
international corporation. The following regression function was obtained.
ŷ = 5000 + 7.25x
If the advertising budgets of two branches of the corporation differ by $30,000, then what will be the predicted
difference in their sales?
A. $217,500
B. $222,500
C. $5000
D. $7.25
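When two x values are compared on the same fitted line, the intercept cancels and only the slope matters. A short sketch with the figures from question 60:

```python
slope = 7.25                 # predicted sales dollars per advertising dollar
budget_difference = 30_000   # difference in advertising budgets, in dollars

# (5000 + 7.25 * x1) - (5000 + 7.25 * x2) = 7.25 * (x1 - x2)
predicted_sales_difference = slope * budget_difference
print(predicted_sales_difference)  # 217500.0
```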
61. Suppose the correlation coefficient between height (as measured in feet) versus weight (as measured in
pounds) is 0.40. What is the correlation coefficient of height measured in inches versus weight measured in
ounces? [12 inches = one foot; 16 ounces = one pound]
A. 0.40
B. 0.30
C. 0.533
D. cannot be determined from information given
E. none of these
62. Suppose height is measured in feet and weight is measured in pounds. Now suppose that the units of both variables are
converted to metric (meters and kilograms). The impact on the slope is:
A. the sign of the slope will change
B. the magnitude of the slope will change
C. both a and b are correct
D. neither a nor b are correct
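Unit conversions rescale x and y by positive constants, which rescales the slope without flipping its sign. The sketch below fits a slope on made-up height and weight data (an assumption for illustration) in both unit systems:

```python
FT_TO_M = 0.3048        # 1 foot in meters
LB_TO_KG = 0.45359237   # 1 pound in kilograms

def slope(xs, ys):
    """Least squares slope of y on x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

height_ft = [5.0, 5.4, 5.8, 6.0, 6.2]             # hypothetical heights
weight_lb = [110.0, 130.0, 150.0, 165.0, 180.0]   # hypothetical weights

slope_imperial = slope(height_ft, weight_lb)
slope_metric = slope(
    [h * FT_TO_M for h in height_ft],
    [w * LB_TO_KG for w in weight_lb],
)

# Same sign, different magnitude: the metric slope is the imperial slope
# multiplied by LB_TO_KG / FT_TO_M.
print(slope_imperial, slope_metric)
```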
63. You have carried out a regression analysis; but, after thinking about the relationship between variables, you have
decided that you have to swap the explanatory and the response variables. After refitting the regression model to
the data you expect that:
A. the value of the correlation coefficient will change
B. the value of SSE will change
C. the value of the coefficient of determination will change
D. the sign of the slope will change
E. nothing changes
64. Suppose you use regression to predict the height of a woman’s current boyfriend by using her own height as the
explanatory variable. Height was measured in feet from a sample of 100 women undergraduates and their
boyfriends at Techno India University. Now, suppose that the heights of both the women and the men are
converted to centimeters. The impact of this conversion on the slope is:
A. the sign of the slope will change
B. the magnitude of the slope will change
C. both the options are correct
D. neither of the options is correct
65. You studied the impact of the dose of a new drug treatment for high blood pressure. You think that the drug might be
more effective in people with very high blood pressure. Because you expect a bigger change in those patients who start
the treatment with high blood pressure, you use regression to analyze the relationship between the initial blood
pressure of a patient (x) and the change in blood pressure after treatment with the new drug (y). If you find a very
strong positive association between these variables, then:
A. there is evidence that the higher the patient’s initial blood pressure, the bigger the impact of the new
drug.
B. there is evidence that the higher the patient’s initial blood pressure, the smaller the impact of the new
drug.
C. there is evidence for an association of some kind between the patient’s initial blood pressure and
the impact of the new drug on the patient’s blood pressure
D. none of these are correct, this is a case of regression fallacy
67. If the outcomes of a discrete random variable follow a Poisson distribution, then which of the following is true?
A. The mean equals the variance.
B. The mean equals the standard deviation.
C. The median equals the variance.
D. The median equals the standard deviation
69. The number of arrivals of delivery trucks per hour at a loading station is an example of which of the
following processes?
A. Binomial
B. Uniform
C. Poisson
D. Normal
72. A random variable x has a binomial distribution with n=4 and p=1/6. What is the probability that x is 1?
A. 0.3458
B. 0.4158
C. 0.4358
D. 0.3858
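Binomial probabilities such as the one in question 72 can be computed directly from the formula C(n, k) p^k (1 - p)^(n - k). A minimal sketch using only the standard library (the helper names binom_pmf and binom_cdf are illustrative assumptions):

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k), summed term by term."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Question 72: n = 4, p = 1/6, P(X = 1)
print(round(binom_pmf(1, 4, 1 / 6), 4))  # 0.3858
```

The cumulative helper can be applied in the same way to questions that ask for a range of values, for example P(x ≤ 47) with n = 64 and p = 0.65.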
73. A random variable x has a binomial distribution with n=64 and p=0.65. What is the probability that x is 47 or less?
A. 0.9417
B. 0.9717
C. 0.8817
D. 0.9017
74. A random variable x has a binomial distribution with n=100 and p=0.35. What is the probability x falls in the range from 26
to 34, inclusive?
A. 0.3813
B. 0.5413
C. 0.4413
D. 0.4913
75. A random variable x has a binomial distribution with n =28 and p=0.55. What is the probability that x will be greater than
18?
A. 0.1787
B. 0.1187
C. 0.2256
D. 0.0887
76. Suppose x is a Poisson-distributed random variable with an expected value of 5 occurrences per interval. What is p(x=3)?
A. 0.2004
B. 0.1404
C. 0.1704
D. 0.0904
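For a Poisson variable with mean λ, P(X = k) = e^(-λ) λ^k / k!. The sketch below evaluates this for question 76 (the helper name poisson_pmf is an illustrative assumption):

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Question 76: expected value 5, P(X = 3)
print(round(poisson_pmf(3, 5), 4))  # 0.1404
```

Cumulative questions can be handled by summing poisson_pmf over the relevant values of k.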
77. Suppose x is a Poisson-distributed random variable with an expected value of 12 occurrences per interval. What is p(x<10)?
A. 0.2424
B. 0.2124
C. 0.2824
D. 0.2624
78. Suppose x is a Poisson-distributed random variable with an expected value of 55 occurrences per interval. What is
p(45<x<60)?
A. 0.6055
B. 0.6755
C. 0.6355
D. 0.6955
79. Suppose x is a Poisson-distributed random variable with an expected value of 105 occurrences per interval. What is
p(x>90)?
A. 0.9641
B. 0.8741
C. 0.8341
D. 0.9241
80. An urn has 10 marbles: 3 red, 7 black. If we draw a random sample of 4, what is the probability we will end up with 2 red
and 2 black?
A. 0.1787
B. 0.1187
C. 0.2256
D. 0.0887
81. Suppose 10 cards are drawn from a deck of 52 cards consisting of 13 hearts, 13 diamonds, 13 clubs, and 13 spades. What is
the probability that the hand of 10 cards will include 3 hearts, 3 diamonds, 2 clubs, and 2 spades?
A. 0.0315
B. 0.0515
C. 0.0815
D. 0.0215
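Sampling without replacement from several groups leads to a multivariate hypergeometric probability: multiply the number of ways to pick the required count from each group and divide by the number of ways to pick the whole hand. A hedged sketch for question 81 (the helper name multi_hypergeom is an illustrative assumption):

```python
import math

def multi_hypergeom(draws, sizes):
    """Probability of drawing draws[i] items from a group of sizes[i],
    for every group, when sampling without replacement."""
    ways = math.prod(math.comb(size, d) for size, d in zip(sizes, draws))
    return ways / math.comb(sum(sizes), sum(draws))

# Question 81: 3 hearts, 3 diamonds, 2 clubs, 2 spades from a 52-card deck
print(round(multi_hypergeom([3, 3, 2, 2], [13, 13, 13, 13]), 4))  # 0.0315
```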
E. None of the mentioned
119. The term JSON is an acronym of
120. ODBC
A. Is an API
B. permits application programmers to access SQL- based DBMSs
C. for R is a package named RODBC
D. Each of the mentioned ones is correct
E. None of the mentioned ones is correct
A. Mean-Mode=3(Mean-Median)
B. Harmonic mean≤Geometric mean ≤Arithmetic mean
C. Geometric Mean =
D. Harmonic mean≥Geometric mean ≤Arithmetic mean
126. A confidence interval consists of
A. Confidence level(C)
B. Statistic(S)
C. Margin of error(M)
D. All of (C), (S) and (M)
E. None of (C), (S) and (M)
A. σ/√n
B. σ
C. σ^2/n
D. None of the mentioned ones
128. Which one of the following is called the distribution of rare events?
A. Bernoulli distribution
B. Binomial distribution
C. Normal distribution
D. Poisson distribution
E. None of the mentioned
129. A standard normal deviate is a normally distributed random variable
130. The probability distribution of the sum of squared standard normal deviates is called
A. A numerical outcome
B. The result of a random phenomenon
C. One of the values of a sample space
D. All of the mentioned ones
E. None of the mentioned ones
A. Is a relationship between each possible outcome of a random variable and their probabilities
B. summarizes the relationship between possible values and their probability for a random variable
C. can have variable structure and type based on the properties of the random variable
D. All the mentioned points are correct
E. None of the mentioned points is correct
A. Discrete (D)
B. Continuous (C)
C. Hybrid (H)
D. Both (D) and (C)
E. All of (D), (C) and (H)
134. Which of the following is incorrect?
A. P.M.F. is related to a discrete R.V.
B. P.D.F. is related to a continuous R.V.
C. C.D.F. is related to a continuous R.V. only.
D. C.D.F. is related to both discrete and continuous R.V.
135. A Bernoulli distribution is related to
A. Binary outcome
B. Binomial distribution
C. Any one of the mentioned ones
D. Both of the mentioned ones
138. A random variable that has a finite or countably infinite set of possible values is a
139. A random variable that has an interval for its set of possible values is a
140. The function that defines the probability distribution for a discrete random variable is called
141. A function that assigns a probability that a discrete random variable will have a value of less than or equal to a specific
discrete value is called
A. Cumulative distribution function
B. Discrete distribution function
C. Probability distribution function
D. None of the mentioned
142. A discrete probability distribution that covers a case where an event will have a binary outcome is called
A. Bernoulli distribution
B. Binomial distribution
C. Binary distribution
D. None of the mentioned
143. The single flip of a coin that may have a head (0) or a tail (1) outcome is an example of a
A. Binomial Trial
B. Bernoulli Trial
C. Poisson Trial
D. None of the mentioned
144. When there are exactly two mutually exclusive outcomes of a trial, we use
A. Bernoulli distribution
B. Binomial distribution
C. Uniform distribution
D. Poisson distribution
E. None of the mentioned
A. Binomial distribution
B. Uniform distribution
C. Poisson distribution
D. Normal distribution
E. None of the mentioned
146. If a model is meant for a series of discrete events where the average time between events is known, but the exact timing of
events is random, then it is called
A. Binomial Process
B. Bernoulli Process
C. Poisson process
D. Normal process
E. None of the mentioned
147. When some discrete event occurs in a continuous, but finite interval of time or sample space in S, we say that
A. Poisson process
B. Binomial process
C. Bernoulli process
D. Normal process
150. The distribution of the sum of squared standard normal deviates is called
A. Sufficiently small
B. Sufficiently large
C. Insignificant
D. None of the mentioned.
154. We use the t statistic (also known as the t score) when
A. Sample sizes are small [case 1]
B. We do not know the standard deviation of the population [case 2]
C. Either case 1 or case 2 is true
D. None of the mentioned.
155. The F distribution takes
A. One parameter
B. Two parameters
C. More than two parameters
D. None of these
156. Let X be a vector of values and G be a vector of labels. Which of the following commands is correct for drawing a colored
pie chart?
A. pie(X,labels=G,radius=0.7,clockwise=FALSE,col=cols,main="Presention of BCA2,TIU,WB Results of ESA on R")
B. pie(labels=G,X,radius=0.7,clockwise=FALSE,col=cols,main="Presention of BCA2,TIU,WB Results of ESA on R")
C. pie(labels=G,X,radius=0.7,clockwise=True,col=cols,main="Presention of BCA2,TIU,WB Results of ESA on R")
D. pie(labels=G,X,clockwise=False,col=cols,main="Presention of BCA2,TIU,WB Results of ESA on R")
E. None of the mentioned
157. Let a represent the number of admitted students in four different streams and s represent the names of the streams
corresponding to the admission counts. Then, which of the following commands is appropriate for drawing a bar chart?
A. barplot(names.arg=s,a, xlab="Streams",ylab="No. of students got
admitted",col="green",main="Student Strength in 2018",border="red")
A. A histogram groups the values in continuous ranges whereas a bar graph groups in discrete values.
B. A histogram groups the values in discrete ranges whereas a bar graph groups in continuous values.
159. Which one of the following is not true about a scatter plot ?
160. Which one of the following is not true about a line chart:
B. Line charts are most often used to visualize data that changes over time
161. The median of the data values is pointed out by which of the following graphical representation?
A. A line graph
B. A bar graph
A. Spread of all the data is represented on a boxplot by the horizontal distance between the smallest value and the largest
value, including any outliers.
C. Spread of all the data is represented on a boxplot by the horizontal distance between the smallest value and the largest value,
excluding the outliers.
B. Text Analytics
C. A machine-supported analysis of text with a view to extract interesting and useful patterns and information
A. NLP
B. tm
C. wordcloud
166. A corpus is created in text mining by using which of the following expressions?
A. corpus(VectorSource(text_file_name))
B. Corpus(VectorSource(text_file_name))
C. Vectorsource(Corpus(text_file_name))
A. storing textual data that is used throughout linguistics and text analysis
B. containing each document, along with some Meta attributes that help describe that document
A. Its purpose is to remove unnecessary special characters, white spaces, common stopwords then to convert the text to lower
case form.
B. The tm_map() function is used for text cleaning
C. The tm_clean() function is used for text cleaning
D. When the text cleaning function is used, the warning messages tell us that the desired cleaning has been done.
169. Which of the following is not true about the stemDocument() function?
170. Which is not true about a Term Document Matrix/Document Term Matrix?
B. The system shows the percentage of sparsity after the creation of a TDM/DTM.
C. The sparsity represents the proportion of entries that are zero in TDM/DTM
171. Which of the following packages are required for sentiment analysis?
A. SentimentAnalysis
B. dplyr
C. zealot
172. For using the pipelining symbol ‘%>%’ in queries, we need to install
A. zealot package
B. dplyr
173. Which of the following packages is used to pull out emotions from the Comments?
A. syuzhet
B. ggplot2
C. SentimentAnalysis
A. mutate()
B. extract()
C. kable()
D. None
A. Word cloud
B. Tag cloud
176. For data mining from the social media platform Facebook using R,
E. Both the mentioned points.
177. Which of the following packages has no role in data mining from the social media platform Facebook using R?
A. httpuv
B. Rfacebook
C. RCurl
A. informatics
B. operations research
C. mathematics
180. The hypothesis that is tested for rejection considering it to be true is called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
180. The probability of rejecting the Null Hypothesis when it is true is called the:
a) Level of Confidence
b) Level of Significance
c) Level of Margin
d) Level of Rejection
181. The point where the Null Hypothesis gets rejected is called ?
a) Significant Value
b) Rejection Value
c) Acceptance Value
d) Critical Value
182. If the Critical region is evenly distributed then the test is referred to as?
a) Two tailed
b) one tailed
c) Three tailed
d) Zero tailed
183. Which of the following is defined as the rule or formula to test a Null Hypothesis?
a) Test statistic
b) Population statistic
c) Variance statistic
d) Null statistic
184. Consider a hypothesis H0 where ϕ0 = 5 against H1 where ϕ1 > 5. The test is?
a) Right tailed
b) Left tailed
c) Center tailed
d) Cross tailed
185. Consider a hypothesis H0 where ϕ0 = 23 against H1 where ϕ1 < 23. The test is?
a) Right tailed
b) Left tailed
c) Center tailed
d) Cross tailed
189. Two types of errors associated with hypothesis testing are Type I and Type II. Type II error is committed when
A. If the two extreme values (min or max) of the sample need to be rejected
C. If the region of rejection is located in one or two tails of the distribution
B. We are 95% confident that the results have not occurred by chance
C. We are 95% confident that the results have occurred by chance
191. The level of significance can be viewed as the amount of risk that an analyst will accept when making a decision
A. True
B. False
192. Parametric tests, unlike non-parametric tests, make certain assumptions about
193. Rejection of the null hypothesis is a conclusive proof that the alternative hypothesis is
A. True
B. False
C. Neither
194. A 99% t-based confidence interval for the mean price for a gallon of gasoline (dollars) is calculated using a simple
random sample of gallon gasoline prices for 50 gas stations. Given that the 99% confidence interval is $3.32 < μ < $3.98,
what is the sample mean price for a gallon of gasoline (dollars)?
A. $0.33
B. $3.65
C. Not Enough Information; we would need to know the variation in the sample of gallon gasoline prices
D. Not Enough Information; we would need to know the variation in the population of gallon gasoline prices
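A t-based confidence interval for a mean is symmetric about the sample mean, so the sample mean is the midpoint of the interval and the margin of error is half its width. A short sketch for question 194:

```python
lower, upper = 3.32, 3.98   # interval endpoints from question 194

sample_mean = (lower + upper) / 2       # midpoint of the interval
margin_of_error = (upper - lower) / 2   # half the interval width

print(round(sample_mean, 2), round(margin_of_error, 2))  # 3.65 0.33
```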
195. Green sea turtles have normally distributed weights, measured in kilograms, with a mean of 134.5 and a variance of 49.0. A
particular green sea turtle’s weight has a z-score of -2.4. What is the weight of this green sea turtle rounded to the nearest
whole number?
A. 17 kg
B. 151 kg
C. 118 kg
D. 252 kg
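Question 195 gives the variance rather than the standard deviation, so the square root has to be taken before the z-score is converted back to a weight. A minimal sketch:

```python
import math

mean, variance, z = 134.5, 49.0, -2.4

sd = math.sqrt(variance)   # standard deviation is the square root of the variance
weight = mean + z * sd     # x = mean + z * sd

print(round(weight))  # 118
```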
196. Which of the following exam scores is better relative to other students enrolled in the course?
● A psychology exam grade of 85; the mean grade for the psychology exam is 92 with a
standard deviation of 3.5
● An economics exam grade of 67; the mean grade for the economics exam is 79 with a
standard deviation of 8
● A chemistry exam grade of 62; the mean grade for the chemistry exam is 62 with a
standard deviation of 5
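Scores from different exams can be compared by converting each to a z-score, (grade - mean) / sd; the largest z-score is the best result relative to classmates. A sketch with the three exams from question 196 (the dictionary layout is an illustrative assumption):

```python
# exam name -> (grade, class mean, class standard deviation)
exams = {
    "psychology": (85, 92, 3.5),
    "economics": (67, 79, 8),
    "chemistry": (62, 62, 5),
}

# z-score of each grade relative to its own class
z_scores = {name: (grade - mean) / sd for name, (grade, mean, sd) in exams.items()}
best = max(z_scores, key=z_scores.get)

print(z_scores)
print(best)
```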
197. The statement “If there is sufficient evidence to reject a null hypothesis at the 10% significance level, then there is
sufficient evidence to reject it at the 5% significance level” is: Please select the best answer of those provided below.
198. A randomly selected sample of 1,000 college students was asked whether they had ever used the drug Ecstasy. Sixteen
percent (16% or 0.16) of the 1,000 students surveyed said they had. Which one of the following statements about the number
0.16 is correct?
A. It is a sample proportion.
B. It is a population proportion.
C. It is a margin of error.
D. It is a randomly chosen number.
199. In a random sample of 1000 students, p̂ = 0.80 (or 80%) were in favor of longer hours at the school library. The
standard error of p̂ (the sample proportion) is
A. .013
B. .160
C. .640
D. .800
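The standard error of a sample proportion is sqrt(p̂(1 − p̂)/n). A minimal Python sketch with the values from the question (names are illustrative):

```python
import math

p_hat = 0.80   # sample proportion
n = 1000       # sample size

# Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n).
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

print(f"{standard_error:.3f}")
```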
200. For a random sample of 9 women, the average resting pulse rate is x̄ = 76 beats per minute, and the sample standard
deviation is s = 5. The standard error of the sample mean is
A. 0.557
B. 0.745
C. 1.667
D. 2.778
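The standard error of a sample mean is s/√n, and it does not depend on the value of the mean itself. A short Python sketch with the numbers from the question:

```python
import math

s = 5   # sample standard deviation
n = 9   # sample size

# Standard error of the sample mean: s / sqrt(n).
standard_error = s / math.sqrt(n)

print(f"{standard_error:.3f}")
```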
2. Assume the cholesterol levels in a certain population have mean μ = 200 and standard deviation σ = 24.
The cholesterol levels for a random sample of n = 9 individuals are measured and the sample mean x̄ is determined.
What is the z-score for a sample mean x̄ = 180?
A. –3.75
B. –2.50
C. 0.83
D. 2.50
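For a sample mean, the z-score divides by the standard deviation of the sampling distribution (σ/√n), not by σ itself. A minimal Python sketch of the calculation with the values given in the question:

```python
import math

mu = 200        # population mean
sigma = 24      # population standard deviation
n = 9           # sample size
x_bar = 180     # observed sample mean

# Standard deviation of the sampling distribution of the mean.
sd_of_mean = sigma / math.sqrt(n)

# z-score of the sample mean.
z = (x_bar - mu) / sd_of_mean

print(z)
```

Dividing by σ = 24 instead would give about −0.83, which is the z-score of a single observation rather than of a mean of nine.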
201. In a past General Social Survey, a random sample of men and women answered the question “Are you a member of any
sports clubs?” Based on the sample data, 95% confidence intervals for the population proportion who would answer “yes” are
.13 to .19 for women and .247 to .33 for men. Based on these results, you can reasonably conclude that
A. At least 25% of American men and American women belong to sports clubs.
B. At least 16% of American women belong to sports clubs.
C. There is a difference between the proportions of American men and American women who belong to sports
clubs.
D. There is no conclusive evidence of a gender difference in the proportion belonging to sports clubs.
202.Suppose a 95% confidence interval for the proportion of Americans who exercise regularly is 0.29 to
203.In hypothesis testing, a Type 2 error occurs when
A. The null hypothesis is not rejected when the null hypothesis is true.
B. The null hypothesis is rejected when the null hypothesis is true.
C. The null hypothesis is not rejected when the alternative hypothesis is true.
D. The null hypothesis is rejected when the alternative hypothesis is true.
A.population parameters.
B.sample parameters.
C.sample statistics.
205.A hypothesis test is done in which the alternative hypothesis is that more than 10% of a population is left-handed. The p-
value for the test is calculated to be 0.25. Which statement is correct?
B. We can conclude that more than 25% of the population is left-handed.
C. We can conclude that exactly 25% of the population is left-handed.
D. We cannot conclude that more than 10% of the population is left-handed.
206.Which of the following is NOT true about the standard error of a statistic?
A. The standard error measures, roughly, the average difference between the statistic
and the population parameter.
B.The standard error is the estimated standard deviation of the sampling distribution for the statistic.
207. A prospective observational study on the relationship between sleep deprivation and heart disease was done by Ayas et al.
(Arch Intern Med 2003). Women who slept at most 5 hours a night were compared to women who slept for 8 hours a night
(reference group). After adjusting for potential confounding variables like smoking, a 95% confidence interval for the relative
risk of heart disease was (1.10, 1.92). Based on this confidence interval, a consistent conclusion would be
A. Sleep deprivation is associated with a modestly increased risk of heart disease.
B. Sleep deprivation is associated with a modestly decreased risk of heart disease.
C. There was no evidence of an association between sleep deprivation and heart disease.
D. Lack of sleep causes the risk of heart disease to increase by 10% to 92%.
208. Consider a random sample of 100 females and 100 males. Suppose 15 of the females are left-handed and 12 of the
males are left-handed. What is the estimated difference between population proportions of females and males who are
left-handed (females − males)? Select the choice with the correct notation and numerical value.
A. p1 - p2 = 3
B. p1 - p2 = 0.03
C. p̂1 - p̂2 = 3
D. p̂1 - p̂2 = 0.03
209.A result is called “statistically significant” whenever
• the probability the procedure provides an interval that covers the sample mean.
• the probability of making a Type 1 error if the interval is used to test a null hypothesis about the population mean.
• the probability that individuals in the population have values that fall into the interval.
• the probability the procedure provides an interval that covers the population mean.
For the next two questions: It is known that for right-handed people, the dominant (right) hand tends to be stronger. For left-
handed people who live in a world designed for right-handed people, the same may not be true. To test this, muscle strength was
measured on the right and left hands of a random sample of 15 left-handed men and the difference (left - right) was found. The
alternative hypothesis is one-sided (left hand stronger). The resulting t-statistic was 1.80.
A. A two-sample t-test.
B. A paired t-test.
C. A pooled t-test.
D. An unpooled t-test.
212. Assuming the conditions are met, based on the t-statistic of 1.80, the appropriate conclusion for this test using α = .05 is: (Table
would be provided with exam.)
A. Df = 14, so p-value < .05 and the null hypothesis can be rejected.
B. Df = 14, so p-value > .05 and the null hypothesis cannot be rejected.
E. Df = 28, so p-value < .05 and the null hypothesis can be rejected.
F. Df = 28, so p-value > .05 and the null hypothesis cannot be rejected.
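The exam would supply a t-table for this question. As a stand-in for the table, the Python sketch below approximates the one-sided p-value by numerically integrating the Student's t density with Simpson's rule. The helper names `t_pdf` and `upper_tail` are illustrative and not part of any library; the degrees of freedom follow from the paired design (one difference per man, so df = 15 − 1).

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t_stat, df, steps=2000):
    """Approximate P(T > t_stat) for t_stat >= 0 using Simpson's rule on [0, t_stat]."""
    h = t_stat / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The t distribution is symmetric, so P(T > 0) = 0.5.
    return 0.5 - total * h / 3

n_pairs = 15          # one (left - right) difference per man
df = n_pairs - 1      # degrees of freedom for a paired t-test
p_value = upper_tail(1.80, df)

print(f"df = {df}, one-sided p-value is about {p_value:.3f}")
```

With a printed table the same conclusion comes from comparing 1.80 with the one-sided 5% critical value for 14 degrees of freedom.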
213. A test of H0: μ = 0 versus Ha: μ > 0 is conducted on the same population independently by two different researchers. They both
use the same sample size and the same value of α = 0.05. Which of the following will be the same for both researchers?
215.A test to screen for a serious but curable disease is similar to hypothesis testing, with a null hypothesis of no disease, and an
alternative hypothesis of disease. If the null hypothesis is rejected treatment will be given. Otherwise, it will not. Assuming the
treatment does not have serious side effects, in this scenario it is better to increase the probability of:
A. H0: d = 0.5
B. H0: d = 1.0
C. H0: d = 1.5
D. H0: d = 2.0
217.The average time in years to get an undergraduate degree in computer science was compared for men and women. Random
samples of 100 male computer science majors and 100 female computer science majors were taken. Choose the appropriate
parameter(s) for this situation.
218.If the word significant is used to describe a result in a news article reporting on a study,
A. the p-value for the test must have been very large.
B. the effect size must have been very large.
C. the sample size must have been very small.
D. it may be significant in the statistical sense, but not in the everyday sense.
219. A random sample of 5000 students were asked whether they prefer a 10 week quarter system or a 15 week semester system.
Of the 5000 students asked, 500 students responded. The results of this survey
A. can be generalized to the entire student body because the sampling was random.
B. can be generalized to the entire student body because the margin of error was 4.5%.
C. should not be generalized to the entire student body because the non-response rate was 90%.
D. should not be generalized to the entire student body because the margin of error was 4.5%.
220.In a report by ABC News, the headlines read “City Living Increases Men’s Death Risk” The headlines were based on a study of
3,617 adults who lived in the United States and were more than 25 years old. One researcher said, “Elevated levels of tumor deaths
suggest the influence of physical, chemical and biological exposures in urban areas… Living in cities also involves potentially stressful
levels of noise, sensory stimulation and overload, interpersonal relations and conflict, and vigilance against hazards ranging from crime
to accidents.” Is a conclusion that living in an urban environment causes an increased risk of death justified?
221.A significance test based on a small sample may not produce a statistically significant result even if the true value differs
substantially from the null value. This type of result is known as
224.The best way to determine whether a statistically significant difference in two means is of practical importance is to
A. find a 95% confidence interval and notice the magnitude of the difference.
B. repeat the study with the same sample size and see if the difference is statistically significant again.
C. see if the p-value is extremely small.
D. see if the p-value is extremely large.
225.A large company examines the annual salaries for all of the men and women performing a certain job and finds that the means
and standard deviations are $32,120 and $3,240, respectively, for the men and $34,093 and $3521, respectively, for the women. The
best way to determine if there is a difference in mean salaries for the population of men and women performing this job in this
company is
219. One problem with hypothesis testing is that a real effect may not be detected. This problem is most likely to
occur when
A. the effect is small and the sample size is small.
B. the effect is large and the sample size is small.
C. the effect is small and the sample size is large.
D. the effect is large and the sample size is large.
231. In a two-tail test for the population mean, if the null hypothesis is rejected when the alternative is true, then:
232. In a one-tail test for the population mean, if the null hypothesis is not rejected when the alternative hypothesis is true, then:
233. In a one-tail test for the population mean, if the null hypothesis is rejected when the alternative hypothesis is not true, then:
A. there is enough statistical evidence to infer that the alternative hypothesis is true
B. there is not enough statistical evidence to infer that the alternative hypothesis is true
C. there is enough statistical evidence to infer that the null hypothesis is true
D. the test is statistically insignificant at whatever level of significance the test was conducted at
E. further tests need to be carried out to determine for sure whether the null hypothesis should
be rejected or not
A. there is enough statistical evidence to infer that the alternative hypothesis is true
B. there is not enough statistical evidence to infer that the alternative hypothesis is true
C. there is enough statistical evidence to infer that the null hypothesis is true
D. the test is statistically insignificant at whatever level of significance the test was conducted at
E. further tests need to be carried out to determine for sure whether the null hypothesis should
be rejected or not
237. In order to determine the p-value of a hypothesis test, which of the following is not needed?
238. Which of the following p-values will lead us to reject the null hypothesis if the significance level of the test is 5%?
A. 0.15
B. 0.10
C. 0.06
D. 0.20
E. 0.025
239. Suppose that we reject a null hypothesis at the 5% level of significance. For which of the following levels of significance do we also reject the null hypothesis?
A. 6%
B. 2.5%
C. 4%
D. 3%
E. 2%
A. If the p-value is greater than the significance level, we fail to reject Ho.
C. If the alternative hypothesis is that the population mean is greater than a specified value, then
D. The significance level equals one minus the probability of a Type I error.
242. In hypothesis testing, what level of significance would be most appropriate to choose if you knew that making a Type I error would
be more costly than making a Type II error?
A. 0.005
B. 0.025
C. 0.050
D. 0.100
E. 0.028
A. the probability that the null hypothesis is true given the data
B. the probability that the null hypothesis is false given the data
C. the probability of observing the data or more extreme values if the null hypothesis is true
D. the probability of observing the data or more extreme values if the alternative hypothesis is
true
E. the probability that the observed data were obtained due to chance
A. Be at least 30
B. Be greater than 5
D. Be greater than 50
245. What assumption is being made when we use the t-distribution to perform a hypothesis
test?
A. That the underlying distribution has more than one modal class
247. The probability that the test statistic will fall inside the region of rejection due to chance alone is equal to which of the following
probabilities?
266. ‘Children can learn a second language faster before the age of 7’. Is this statement:
A. A non-scientific statement
B. A one-tailed hypothesis
C. A two-tailed hypothesis
D. A null hypothesis
267. If my experimental hypothesis were ‘Eating cheese before bed affects the number of nightmares you have’, what
would the null hypothesis be?
D. The number of nightmares you have is not affected by eating cheese before bed.
268. If my null hypothesis is ‘Dutch people do not differ from English people in height’, what is my alternative
hypothesis?
269. Of what is p the probability if the null hypothesis were true? (Hint: NHST relies on fitting a ‘model’ to the data and
then evaluating the probability of this ‘model’ given the assumption that no effect exists.)
A. p is the probability that the results are due to chance, the probability that the null hypothesis (H0) is true.
B. p is the probability of observing a test statistic at least as big as the one we have if there were no effect in the
population (i.e., the null hypothesis were true).
C. p is the probability that the results are not due to chance, the probability that the null hypothesis (H0) is false.
D. p is the probability that the results would be replicated if the experiment was conducted a second time.
270. A Type I error occurs when: (Hint: When we use test statistics to tell us about the true state of the world, we’re
trying to see whether there is an effect in our population.)
A. We conclude that there is not an effect in the population when in fact there is.
C. The data we have typed into SPSS is different from the data collected.
D. We conclude that there is an effect in the population when in fact there is not.
271. A statement about a population developed for the purpose of testing is called:
A. Hypothesis
B. Hypothesis testing
C. Level of significance
D. Test-statistic
272. Any hypothesis which is tested for the purpose of rejection under the assumption that it is true is called:
A. Null hypothesis
B. Alternative hypothesis
C. Statistical hypothesis
D. Composite hypothesis
B. Alternative hypothesis
C. Simple hypothesis
D. Composite hypothesis
274. Any statement whose validity is tested on the basis of a sample is called:
A. Null hypothesis
B. Alternative hypothesis
C. Statistical hypothesis
D. Simple hypothesis
B. Composite hypothesis
C. Simple hypothesis
D. Statistical hypothesis
276. A statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false is called:
A. Simple hypothesis
B. Composite hypothesis
C. Statistical hypothesis
D. Alternative hypothesis
B. Statistical hypothesis
C. Research hypothesis
D. Simple hypothesis
B. Composite hypothesis
C. Alternative hypothesis
A. Simple
B. Composite
C. Null
B. Level of significance
D. Difficult to tell
283. The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected is said
to be:
A. Critical region
B. Critical value
C. Acceptance region
D. Significant region
284. If the critical region is located equally in both sides of the sampling distribution of test-statistic, the test is called:
A. One tailed
B. Two tailed
C. Right tailed
D. Left tailed
B. Alternative hypothesis
C. None of these
D. Composite hypotheses
B. Right-tailed test
C. Two-tailed test
D. Difficult to tell
C. Two-sided test
B. Left-tailed test
C. Right-tailed test
B. Population statistic
C. Both of these
B. -1 to +1
C. 0 to ∞
D. -∞ to +∞
B. 0 to 1
C. -∞ to +∞
D. -1 to +1
B. Type-II error
C. Standard error
D. Sampling error
B. Type-II error
C. Unbiased decision
D. Difficult to tell
296. If a passing student is failed by an examiner, it is an example of:
A. Type-I error
B. Type-II error
C. Best decision
D. Level of significance
B. Type-II error
C. Level of confidence
D. Level of significance
B. 1 - β
C. Critical value
301. A null hypothesis is rejected if the value of a test statistic lies in the:
A. Rejection region
B. Acceptance region
C. Level of confidence
D. Confidence coefficient
B. 0 and 1
C. 0 and n
D. -∞ to +∞
B. Rejection region
C. Confidence region
D. Statistical region
C. Level of confidence
D. Confidence coefficient
B. Type-II error
A. One
B. Zero
C. Two
D. Difficult to tell
B. H1: µ > 0
C. H1: µ ≥ 0
D. H1: µ ≠ 0
310. If the magnitude of calculated value of t is less than the tabulated value of t and H1 is two-sided, we
should:
A. Reject Ho
B. Accept H1
C. Not reject Ho
D. Difficult to tell
D. Proves that µ ≤ 0
312. The chance of rejecting a true hypothesis decreases when sample size is:
A. Decreased
B. Increased
C. Constant
B. Simple hypothesis
C. Alternative hypothesis
D. Both (a) and (b)
B. Alternative hypothesis
C. Simple hypothesis
D. Composite hypothesis
B. µ ≤ µo
C. µ = µo
D. µ ≠ µo
B. β
C. 1 – α
D. 1 – β
A. α
B. β
C. 1 – α
D. 1 – β
320. α / 2 is called:
A. One tailed significance level
C. Small samples
322. Paired t-test is applicable when the observations in the two samples are:
A. Equal in number
B. Paired
C. Correlation
323. The degree of freedom for paired t-test based on n pairs of observations is:
A. 2n - 1
B. n - 2
C. 2(n - 1)
D. n – 1
324. The test-statistic : t= with = has d.f = :
A. n
B. m-1
C. n - 2
D. m+n - 2
325. In an unpaired samples t-test with sample sizes n1 = 11 and n2 = 11, the value of tabulated t should be obtained
for:
A. 10 degrees of freedom
B. 21 degrees of freedom
C. 22 degrees of freedom
D. 20 degrees of freedom
326. In analyzing the results of an experiment involving seven paired samples, tabulated t should be obtained
for:
A. 13 degrees of freedom
B. 6 degrees of freedom
C. 12 degrees of freedom
D. 14 degrees of freedom
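The two preceding questions differ only in how the degrees of freedom are counted. The Python sketch below contrasts the two rules, assuming the unpaired test is the pooled (equal-variance) version; the function names are illustrative.

```python
def df_pooled_two_sample(n1, n2):
    """Degrees of freedom for the pooled (equal-variance) two-sample t-test."""
    return n1 + n2 - 2

def df_paired(n_pairs):
    """Degrees of freedom for a paired t-test on n_pairs differences."""
    return n_pairs - 1

df_unpaired = df_pooled_two_sample(11, 11)   # two independent samples of 11
df_seven_pairs = df_paired(7)                # seven paired observations

print(df_unpaired, df_seven_pairs)
```

A paired design loses one degree of freedom in total because the test is run on the single column of differences, while the pooled test loses one per sample.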
327.The mean difference between 16 paired observations is 25 and the standard deviation of differences is
10. The value of statistic-t is:
A. 4
B. 10
C. 16
D. 25
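For paired data, the t-statistic is the mean difference divided by its standard error, s_d/√n, when the null hypothesis is a mean difference of zero. A minimal Python sketch with the numbers from the question, taking that null value as an assumption:

```python
import math

n_pairs = 16
mean_diff = 25      # mean of the paired differences
sd_diff = 10        # standard deviation of the differences
null_value = 0      # assumed null hypothesis: mean difference is zero

# Standard error of the mean difference.
se_diff = sd_diff / math.sqrt(n_pairs)

# Paired t-statistic.
t_stat = (mean_diff - null_value) / se_diff

print(t_stat)
```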
328. Statistic-t is defined as deviation of sample mean from population mean µ expressed in terms of:
A. Standard deviation
B. Standard error
329. Student’s t-distribution has (n-1) d.f. when all the n observations in the sample are:
A. Dependent
B. Independent
C. Maximum
D. Minimum
B. Degree of freedom
C. Level of significance
D. Level of confidence
332. Rejecting a null hypothesis that is actually true is known as:
A. A type-I error, and its probability is β
B. A type-I error, and its probability is α
C. A type-II error, and its probability is α
D. A type-II error, and its probability is β
333. An advertising agency wants to test the hypothesis that the proportion of adults in Pakistan who read a Sunday
Magazine is 25 percent. The null hypothesis is that the proportion reading the Sunday Magazine is:
A. Different from 25%
B. Equal to 25%
C. Less than 25 %
D. More than 25 %
335. If µ1 and µ2 are means of two populations and two samples of sizes n1 and n2 are drawn from them then the
pooled sample is distributed:
A. As a standard normal variable, if both samples are independently drawn and n1 + n2 is less than 30
337. When σ is known, the hypothesis about population mean is tested by:
A. t-test
B. Z-test
C. χ2-test
D. F-test
B. Z
C. χ2
D. F
339. Given Ho: µ = µo, H1: µ ≠ µo, α = 0.05 and we reject Ho; the absolute value of the Z-statistic must have equalled or
been beyond what value?
A. 1.96
B. 1.65
C. 2.58
D. 2.33
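For a two-sided Z-test, α is split equally between the two tails, so the critical value is the standard normal quantile at 1 − α/2. The Python sketch below looks it up with the standard library's `statistics.NormalDist`:

```python
from statistics import NormalDist

alpha = 0.05

# Two-sided test: each tail holds alpha / 2 of the probability.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

print(f"{z_critical:.2f}")
```

Using `1 - alpha` instead of `1 - alpha / 2` would give the one-sided critical value.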
339. If p1 and p2 are not identical, then the standard error of the difference of proportions (p1 – p2) is:
A. cannot be determined
B. sqrt{ p * ( 1 - p ) * [ (1/n1) + (1/n2) ] }, where p is the pooled sample proportion, n1 is the size of sample 1, and n2 is the size of sample 2
340. Under the hypothesis Ho: p1 = p2, the formula for the pooled sample proportion is:
A. p = (p1 * n1 + p2 * n2) / (n1 + n2), where p1 is the sample proportion from population 1, p2 is
the sample proportion from population 2, n1 is the size of sample 1, and n2 is the size of
sample 2.
B. p = (p1 * n2 + p2 * n1) / (n1 + n2)
C. None of the mentioned
D. Any one can be used.
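The pooled proportion weights each sample proportion by its own sample size, and it feeds the standard error used under Ho: p1 = p2. The Python sketch below illustrates both formulas. The input values (15% of 100 and 12% of 100) are hypothetical numbers chosen for illustration, and the function names are not from any library.

```python
import math

def pooled_proportion(p1, n1, p2, n2):
    """Pooled sample proportion: each proportion weighted by its own sample size."""
    return (p1 * n1 + p2 * n2) / (n1 + n2)

def pooled_standard_error(p, n1, n2):
    """Standard error of (p1 - p2) under Ho: p1 = p2, using the pooled proportion p."""
    return math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Hypothetical inputs, for illustration only.
p1, n1 = 0.15, 100
p2, n2 = 0.12, 100

p = pooled_proportion(p1, n1, p2, n2)
se = pooled_standard_error(p, n1, n2)

print(f"pooled p = {p:.3f}, standard error = {se:.4f}")
```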
341.A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible
consequences, including chance event outcomes, resource costs, and utility.
A. Decision tree
B. Graphs
C. Trees
D. Neural Networks
345. Which of the following are Decision Tree nodes?
a) Decision Nodes
b) End Nodes
c) Chance Nodes
d) All of the mentioned
B. It is a measure of purity
352. Tree/Rule based classification algorithms generate ... rule to perform the classification.
A. if-then.
B. while.
C. do while.
D. switch.
353. Which of the following sentences are correct in reference to Information gain?
A. It is biased towards single-valued attributes
B. It is biased towards multi-valued attributes(X)
C. ID3 makes use of information gain(Y)
D. The approach used by ID3 is greedy (Z)
E. All of X, Y and Z
A. True
B. False
C. None
357.Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the other.
A. True
B. False
C. None
A. True
B. False
C. Cannot be determined
360. When the number of classes is large Gini index is not a good choice
A. True
B. False
C. Cannot be decided
A. True
B. False
C. None
Consider the following dataset in table 1 for question 5-10 where each record represents the age, income
and is a student or not. And we need to classify it as buyers/non-buyers of computer.
Table 1. Playing_condition
368. The records at the root node will be divided into how many nodes in the next level based on
Sky_Appearance?
A. 4
B. 3
C. 14
D. 2
369. Using ID3 algorithm what is the information gain for each possible split if you split based on
Sky_Appearance?
A. 0.30
B. 0.40
C. 0.247
D. 0.1
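Table 1 did not survive in this copy, so the split cannot be recomputed from the source data. The Python sketch below shows how ID3's information gain is calculated from class counts. The counts used are ASSUMED: they are the widely used 14-record weather example (9 positive, 5 negative, split three ways) and are not taken from the missing table.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# ASSUMED (yes, no) counts, not from the source table.
parent = (9, 5)
children = [(2, 3), (4, 0), (3, 2)]   # one tuple per attribute value

gain = information_gain(parent, children)
print(f"{gain:.3f}")
```

The number of child tuples equals the number of distinct attribute values, which is also the number of branches created at the root.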
A.0.1
B. 0.3
C. 0.029
D. 0.35
A. 0.35
B. 0.248
C. 0.151
D. 0.43
A. 0.024
B. 0.019
C. 0.34
D. 0.112
373. Calculate the gini(income)?
A. 0.34
B. 0.443
C. 0.56
D. 0.123
374. How will you counter over-fitting in decision tree?
A. By pruning the longer rules
B. By creating new rules
C. Both ‘By pruning the longer rules’ and ‘By creating new rules’
D. None of the options
375. What are the two approaches to tree pruning?
A. Pessimistic pruning and Optimistic pruning
B. Postpruning and Prepruning
C. Cost complexity pruning and time complexity pruning
D. None of the options Ans: b
376. Which of the following sentences are true?
A. In pre-pruning a tree is 'pruned' by halting its construction early.
B. A pruning set of class labelled tuples is used to estimate cost complexity.
C. The best pruned tree is the one that minimizes the number of encoding bits.
D. All of the above
377.The most widely used metrics and tools to assess a classification model are:
A. Confusion matrix
B. Cost-sensitive accuracy
C. Area under the ROC curve
D. All of the above
378.Which of the following is a good test dataset characteristic?
A. Large enough to yield meaningful results
B. Is representative of the dataset as a whole
C. 1, 2, and 3 - answer
D. 1 and 3
391.Which of the following is a reasonable way to select the number of principal components "k"?
A. Choose k to be the smallest value so that at least 99% of the variance is retained. - answer
B. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
C. Choose k to be the largest value so that 99% of the variance is retained.
D. Use the elbow method
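Choosing k by retained variance means accumulating the largest eigenvalues of the covariance matrix until their share of the total reaches the target. The Python sketch below shows the rule on a hypothetical eigenvalue list; both the data and the function name `smallest_k` are illustrative assumptions.

```python
def smallest_k(eigenvalues, retained=0.99):
    """Smallest k whose top-k eigenvalues keep at least `retained` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= retained:
            return k
    return len(ordered)

# Hypothetical eigenvalues, for illustration only.
hypothetical_eigenvalues = [6.0, 2.5, 1.0, 0.42, 0.06, 0.02]

k = smallest_k(hypothetical_eigenvalues)
print(k)
```

Taking the largest k that retains 99% would always return every component, which is why the smallest such k is the useful choice.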
392. You run gradient descent for 15 iterations with α=0.3 and compute J(theta) after each iteration. You find
that the value of J(theta) decreases quickly and then levels off. Based on this, which of the following
conclusions seems most plausible?
A. Rather than using the current value of α, use a larger value of α (say α=1.0)
B. Rather than using the current value of α, use a smaller value of α (say α=0.1)
C. α=0.3 is an effective choice of learning rate - answer
D. None of the above
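The behaviour described in the question (cost falling quickly, then levelling off) can be reproduced on a toy problem. The Python sketch below runs 15 gradient descent steps with a learning rate of 0.3 on the hypothetical cost J(theta) = theta², chosen only because its gradient is simple; the real cost function in the question is not given.

```python
def cost(theta):
    """Hypothetical cost function J(theta) = theta^2."""
    return theta ** 2

def gradient(theta):
    """Derivative of the hypothetical cost: dJ/dtheta = 2 * theta."""
    return 2 * theta

alpha = 0.3      # learning rate from the question
theta = 5.0      # arbitrary starting point (assumption)
costs = []

for _ in range(15):
    theta -= alpha * gradient(theta)   # gradient descent update
    costs.append(cost(theta))

for iteration, value in enumerate(costs, start=1):
    print(iteration, f"{value:.3e}")
```

A cost that oscillates or grows from one iteration to the next would instead suggest that the learning rate is too large.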
393.What is a sentence parser typically used for?
A. It is used to parse sentences to check if they are utf-8 compliant.
B. It is used to parse sentences to derive their most likely syntax tree structures. - answer
C. It is used to parse sentences to assign POS tags to all tokens.
D. It is used to check if sentences can be parsed into meaningful tokens.
394.Suppose you have trained a logistic regression classifier and it outputs a new example x with a prediction
hθ(x) = 0.2. This means
A. Our estimate for P(y=1 | x)
399. In a two-tail test for the population mean, if the null hypothesis is rejected when the
alternative is true, then:
A. a Type I error is committed
B. a Type II error is committed
C. a correct decision is made
D. a one-tail test should be used instead of a two-tail test
E. it is unclear whether a correct or incorrect decision has been made
400. In a one-tail test for the population mean, if the null hypothesis is not rejected when the
alternative hypothesis is true, then:
A. a Type I error is committed
B. a Type II error is committed
C. a correct decision is made
D. a two-tail test should be used instead of a one-tail test
E. it is unclear whether a correct or incorrect decision has been made
401. In a one-tail test for the population mean, if the null hypothesis is rejected when the
alternative hypothesis is not true, then:
A. a Type I error is committed
B. a Type II error is committed
C. a correct decision is made
D. a two-tail test should be used instead of a one-tail test
E. it is unclear whether a correct or incorrect decision has been made
404. In order to determine the p-value of a hypothesis test, which of the following is not needed?
A. whether the test is one-tail or two-tail
B. the value of the test statistic
C. the form of the null and alternate hypotheses
D. the level of significance
E. all of the above are needed to determine the p-value
405. Which of the following p-values will lead us to reject the null hypothesis if the significance
level of the test is 5%?
A. 0.15
B. 0.10
C. 0.06
D. 0.20
E. 0.025
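The decision rule is to reject the null hypothesis when the p-value is no larger than the significance level. The Python sketch below applies that rule to the five options; the dictionary is just a convenient way to hold them.

```python
alpha = 0.05   # significance level of the test

# p-values offered in options A to E.
p_values = {"A": 0.15, "B": 0.10, "C": 0.06, "D": 0.20, "E": 0.025}

# Reject the null hypothesis whenever p <= alpha.
rejecting = [label for label, p in p_values.items() if p <= alpha]

print(rejecting)
```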
406. Suppose that we reject a null hypothesis at the 5% level of significance. For which of the
following levels of significance do we also reject the null hypothesis?
A. 6%
B. 2.5%
C. 4%
D. 3%
E. 2%
409. In hypothesis testing, what level of significance would be most appropriate to choose if you
knew that making a Type I error would be more costly than making a Type II error?
A. 0.005
B. 0.025
C. 0.050
D. 0.100
E. 0.028
412. What assumption is being made when we use the t-distribution to perform a hypothesis
test?
A. That the underlying distribution has more than one modal class
B. That the underlying population has a constant variance
C. That the underlying population has a non-symmetrical distribution
D. That the underlying population follows an approximately Normal distribution
E. None of the above
A. 1=
B. = -
C. 1=-
E. Only 1= is correct