2. Introduction to Regression
• Regression analysis is a supervised learning method for predicting continuous variables.
• It is one of the oldest and most widely used predictive techniques.
• It models the relationship between independent variables (x) and a dependent variable (y).
• Regression represents the relationship as:
y = f(x)
where:
x = independent variable(s)
y = dependent variable
• The feature variable x is also known as an explanatory variable, a predictor variable, an independent variable, a covariate, or a domain point.
• y is the dependent variable. Dependent variables are also called labels, target variables, or response variables.
• Regression analysis determines the change in the response variable when one explanatory variable is varied while all other parameters are held constant. This is used to determine the relationship each explanatory variable exhibits with the response. Thus, regression analysis is used for prediction and forecasting.
3. Thus, the primary concern of regression analysis is to find answers to questions such as:
1. What is the relationship between the variables?
2. What is the strength of the relationship?
3. What is the nature of the relationship, such as linear or non-linear?
4. What is the relevance of each attribute?
5. What is the contribution of each attribute?
4. There are many applications of regression analysis. Some applications include predicting:
1. Sales of goods or services
2. Value of bonds in portfolio management
3. Insurance premiums
4. Yield of crops in agriculture
5. Prices of real estate
5. Introduction to Linearity, Correlation, and Causation
The quality of a regression analysis is determined by factors such as linearity, correlation, and causation.
Regression and Correlation
• Scatter plots show the relationship between two variables:
- X-axis: independent variable
- Y-axis: dependent variable
• The Pearson correlation coefficient (r) measures the strength and direction of the relationship (a short computation sketch follows the list below).
• Types:
- Positive correlation
- Negative correlation
- No correlation
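As a quick illustration, the following minimal sketch (assuming NumPy is available; it reuses the week/hours data that appears later in this deck) computes Pearson's r directly from its definition:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([12, 18, 22, 28, 35], dtype=float)

    # Pearson r = covariance(x, y) / (std(x) * std(y))
    r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
        np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
    )
    print(r)  # close to +1, indicating a strong positive correlation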
6. Regression and Causation
Causation: one variable directly influences another.
Represented as: x → y
Example: increasing study time → higher test scores.
Causation is about the causal relationship among variables, say x and y: knowing whether x causes y to happen or vice versa. "x causes y" is often denoted as x implies y.
Correlation and regression relationships are not the same as causation.
Scenario | Relationship Type
High temperature ↔ more ice cream sales | Correlation (not causation)
Exercise → lower blood pressure (from a controlled study) | Causation
7. Linearity and Non-linearity
A linear relationship between the variables means the relationship between the dependent and independent variables can be visualized as a straight line: a line of the form y = ax + b can be fitted to the data points. By linearity, it is meant that as one variable increases, the other variable changes proportionally, in a linear manner.
A non-linear relationship exists in functions such as the exponential function (e.g., y = ae^(bx)) and the power function (e.g., y = ax^b), where the data points curve away from any straight line.
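To make the distinction concrete, here is a minimal sketch (NumPy assumed; the constants are illustrative) showing that a straight line fits linear data exactly but leaves large residuals on exponential data:

    import numpy as np

    x = np.linspace(1, 5, 20)
    y_lin = 3 * x + 2                # linear relationship
    y_exp = 2 * np.exp(0.8 * x)      # non-linear (exponential) relationship

    for y in (y_lin, y_exp):
        a, b = np.polyfit(x, y, deg=1)               # fit y = ax + b
        print(round(float(np.abs(y - (a * x + b)).max()), 3))
    # prints ~0.0 for the linear data, a large residual for the exponential data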
8. Types of Regression Methods
(Figure: the classification of regression methods.)
9. Linear Regression
A regression method in which a straight line is fitted to the given data to describe the linear relationship between one independent variable and one dependent variable.
Multiple Regression
A regression method in which a line is fitted to describe the linear relationship between two or more independent variables and one dependent variable.
Polynomial Regression
A non-linear regression method in which an Nth-degree polynomial is used to model the relationship between one independent variable and one dependent variable. Polynomial multiple regression is used to model two or more independent variables and one dependent variable.
Logistic Regression
A method for predicting categorical variables that involves one or more independent variables and one dependent variable. It is also known as a binary classifier.
Lasso and Ridge Regression Methods
Special variants of regression in which regularization is used to limit the size of the coefficients of the independent variables (ridge) or to shrink some coefficients to exactly zero, effectively limiting their number (lasso); see the sketch after this list.
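As a brief illustration of the regularized variants, the following sketch uses scikit-learn's Ridge and Lasso estimators (assuming scikit-learn is installed; the toy data is made up for demonstration):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    # Toy data: two features and a noisy linear target (illustrative only)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

    ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks coefficient sizes
    lasso = Lasso(alpha=0.1).fit(X, y)   # can drive small coefficients to zero
    print(ridge.coef_, lasso.coef_)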
10. Limitations of Regression Methods
1. Outliers – Outliers are abnormal data points. They can bias the outcome of the regression model, as outliers pull the regression line towards themselves.
2. Number of cases – The ratio of cases to independent variables should be at least 20:1; for every explanatory variable, there should be at least 20 samples. At least five samples per variable are required in extreme cases.
3. Missing data – Missing data in the training data can make the model unfit for the sampled data.
4. Multicollinearity – If explanatory variables are highly correlated (0.9 and above), the regression is vulnerable to bias. Singularity means a perfect correlation of 1. The remedy is to remove explanatory variables that exhibit such high correlation. If there is a tie, the tolerance (1 − R²) is used: the variable with the smaller tolerance (the more collinear one) is eliminated. A screening sketch follows this list.
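A quick way to screen for multicollinearity is to inspect the pairwise correlation matrix of the explanatory variables; a minimal sketch (NumPy assumed; values illustrative):

    import numpy as np

    # Columns are explanatory variables (illustrative values)
    X = np.array([[12, 8], [18, 12], [22, 16], [28, 36], [35, 42]], dtype=float)

    # np.corrcoef treats rows as variables, so pass the transpose
    corr = np.corrcoef(X.T)
    print(corr)  # off-diagonal entries of 0.9 or above signal multicollinearity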
12. Ordinary Least Squares (OLS)
Introduction to Linear Regression
• The OLS approach fits a straight line through the data points.
• The goal is to minimize the errors (residuals) between the observed values and the values predicted by the line.
• The residual for a data point is:
e_i = y_i − ŷ_i
where ŷ_i = a0 + a1·x_i is the predicted value; OLS chooses a0 and a1 to minimize Σ e_i².
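The closed-form OLS estimates can be computed directly from these definitions; a minimal sketch (NumPy assumed; it uses the week/hours data from the next example):

    import numpy as np

    def ols_fit(x, y):
        # Slope: a1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
        a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        # Intercept: a0 = mean_y - a1 * mean_x
        a0 = y.mean() - a1 * x.mean()
        return a0, a1

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([12, 18, 22, 28, 35], dtype=float)
    print(ols_fit(x, y))  # (6.2, 5.6)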
15. Consider the dataset in the table below, where the week and the number of working hours per week spent by a research scholar in a library are tabulated. Based on the dataset, predict the number of hours that will be spent by the research scholar in the 7th and 9th weeks. Apply the linear regression model.

Week (x): 1 2 3 4 5
Hours (y): 12 18 22 28 35

(Figure: scatter plot of hours against weeks.)
17. The regression equation is given as:
y = 6.2 + 5.6x
Substituting x = 7 and x = 9 gives the predicted hours:
Week 7: y = 6.2 + 5.6 × 7 = 45.4
Week 9: y = 6.2 + 5.6 × 9 = 56.6

(Figure: observed hours for weeks 1–5 with predicted values 45.4 and 56.6 for weeks 7 and 9.)
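The same result can be reproduced programmatically; a minimal sketch (NumPy assumed):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([12, 18, 22, 28, 35], dtype=float)

    a1, a0 = np.polyfit(x, y, deg=1)      # slope 5.6, intercept 6.2
    print(a0 + a1 * np.array([7, 9]))     # [45.4, 56.6]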
18. Consider the five weeks' sales data (in thousands) given in the table below.
Apply the linear regression technique to predict the 7th and 9th week sales.
19. Linear Regression in Matrix Form
The linear regression equation for each data point is given by:
y_i = a0 + a1·x_i, for i = 1, 2, …, n
This can be written as a system of equations:
y_1 = a0 + a1·x_1
y_2 = a0 + a1·x_2
…
y_n = a0 + a1·x_n
Expressed in matrix form:
Y = X·a
where Y = [y_1, …, y_n]^T, X is the n × 2 design matrix whose rows are [1, x_i], and a = [a0, a1]^T.
20. Estimating Coefficients Using Matrix Algebra
To find the best-fit line (the least squares solution), minimize the sum of squared errors:
E = (Y − X·a)^T (Y − X·a)
Setting the derivative with respect to a to zero gives the normal equation:
a = (X^T X)^(−1) X^T Y
22. X (Week): 1 2 3 4 5
Y (Hours): 12 18 22 28 35
Find the linear regression of the data. Use linear regression in matrix form.

Step 1: Create matrices X and Y.
X = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]] (a column of ones for the intercept and a column of week values)
Y = [12, 18, 22, 28, 35]^T

Step 2: Apply the normal equation a = (X^T X)^(−1) X^T Y.
X^T X = [[5, 15], [15, 55]]
X^T Y = [115, 401]^T
Inverse of a 2×2 matrix: for M = [[p, q], [r, s]], M^(−1) = (1/(ps − qr))·[[s, −q], [−r, p]], so
(X^T X)^(−1) = (1/50)·[[55, −15], [−15, 5]]
a = (X^T X)^(−1) X^T Y = [6.2, 5.6]^T

Final Regression Equation:
y = 6.2 + 5.6x
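The matrix computation can be checked in a few lines (NumPy assumed):

    import numpy as np

    X = np.column_stack([np.ones(5), np.arange(1, 6)])   # design matrix [1, x]
    Y = np.array([12, 18, 22, 28, 35], dtype=float)

    # Normal equation: a = (X^T X)^(-1) X^T Y
    a = np.linalg.inv(X.T @ X) @ (X.T @ Y)
    print(a)  # [6.2, 5.6]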
28. Find the linear regression of the data. Use linear regression in matrix form.
Week (x): 1 2 3 4
Sales (y): 1 3 4 8
29. Multiple Linear Regression
Multiple Linear Regression (MLR) is a statistical technique that models the relationship between one
dependent variable and two or more independent variables. It extends simple linear regression, which
involves only one independent variable, to capture more complex real-world scenarios where multiple factors
influence an outcome.
The multiple regression on two variables x1 and x2 is given as:
y = a0 + a1·x1 + a2·x2
In general, for n independent variables this is given as:
y = a0 + a1·x1 + a2·x2 + … + an·xn
30. Using multiple regression, fit a line for the following dataset shown in the table. Here, z is the equity, x is the net sales, and y is the asset; z is the dependent variable and x, y are independent variables. All the data are in million dollars.

z | x | y
4 | 12 | 8
6 | 18 | 12
7 | 22 | 16
8 | 28 | 36
11 | 35 | 42

The design matrix X (with a leading column of ones) and the target vector Z are given as follows:
X = [[1, 12, 8], [1, 18, 12], [1, 22, 16], [1, 28, 36], [1, 35, 42]], Z = [4, 6, 7, 8, 11]^T
The regression coefficients can be found from the normal equation a = (X^T X)^(−1) X^T Z. Substituting the values, one gets:
X^T X = [[5, 115, 114], [115, 2961, 3142], [114, 3142, 3524]]
X^T Z = [36, 919, 966]^T
Solving this system gives a0 ≈ −0.414, a1 ≈ 0.396, a2 ≈ −0.066.
Therefore, the regression line is given as:
z ≈ −0.414 + 0.396x − 0.066y
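The same coefficients can be obtained numerically; a minimal sketch (NumPy assumed):

    import numpy as np

    X = np.array([[1, 12, 8], [1, 18, 12], [1, 22, 16],
                  [1, 28, 36], [1, 35, 42]], dtype=float)
    Z = np.array([4, 6, 7, 8, 11], dtype=float)

    # Least-squares solution of X a = Z (more stable than forming the inverse)
    a, *_ = np.linalg.lstsq(X, Z, rcond=None)
    print(a)  # approximately [-0.414, 0.396, -0.066]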
37. Apply multiple regression for the values given in the table, where weekly sales along with sales for products x1 and x2 are provided. Use the matrix approach for finding the multiple regression.

x₁ (Product One Sales) | x₂ (Product Two Sales) | y (Weekly Sales in Thousands)
1 | 4 | 1
2 | 5 | 6
3 | 8 | 8
4 | 2 | 12
38. Polynomial Regression
When the relationship between the independent and dependent variables is non-linear, standard
linear regression may not accurately model the data, resulting in large errors.
To address this, two main approaches can be used:
1. Transformation of the non-linear data to linear data, so that linear regression can handle it
2. Using polynomial regression
39. Transformation of Non-linear Data to Linear
This approach involves transforming the non-linear equation into a linear form, allowing the use of linear regression techniques. Common transformations include:
• Exponential: y = a·e^(bx) becomes ln y = ln a + b·x (linear in x)
• Power: y = a·x^b becomes log y = log a + b·log x (linear in log x)
• Logarithmic: y = a + b·ln x (linear in ln x)
• Reciprocal: y = a + b·(1/x) (linear in 1/x)
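A minimal sketch of the exponential case (NumPy assumed; the constants are illustrative): fit a straight line to (x, ln y) and recover a and b.

    import numpy as np

    x = np.linspace(1, 5, 10)
    y = 2.0 * np.exp(0.8 * x)        # synthetic exponential data, y = a*e^(bx)

    # Linearize: ln y = ln a + b*x, then fit a straight line
    b, ln_a = np.polyfit(x, np.log(y), deg=1)
    print(np.exp(ln_a), b)           # recovers a = 2.0 and b = 0.8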
41. Polynomial Regression
Polynomial regression is a technique used to model non-linear relationships between the independent
variable x and the dependent variable y by fitting a polynomial equation of degree n. It provides a flexible
approach to capture curvilinear trends in data without transforming the variables.
Polynomial regression provides a non-linear curve, such as a quadratic or cubic.
The second-degree polynomial (called the quadratic transformation) is given as:
y = a0 + a1·x + a2·x²
The third-degree polynomial (called the cubic transformation) is given as:
y = a0 + a1·x + a2·x² + a3·x³
Generally, polynomials of at most degree 4 are used: higher-order polynomials make the curve overly flexible and take on strange shapes, which leads to overfitting and is hence avoided.
42. Polynomial Regression
The polynomial regression system can be written in matrix form. For a quadratic fit over m data points, the normal equations are:
[[m, Σx, Σx²], [Σx, Σx², Σx³], [Σx², Σx³, Σx⁴]] · [a0, a1, a2]^T = [Σy, Σxy, Σx²y]^T
This is of the form:
X·a = B
where:
• X is the matrix of sums of powers of x,
• a is the column vector of coefficients,
• B is the column vector of target sums.
To solve for the coefficients:
a = X^(−1)·B
43. Consider the data provided in the table below and fit it using a second-order polynomial.
x y
1 6
2 11
3 18
4 27
5 38
Find the best-fitting quadratic polynomial of the form:
y = a0 + a1·x + a2·x²
47. Solving the normal equations for this data gives a0 = 3, a1 = 2, a2 = 1, so the best-fitting quadratic is:
y = 3 + 2x + x²
This polynomial reproduces every tabulated value exactly (for example, at x = 3: 3 + 6 + 9 = 18).
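The fit can be verified with a short script (NumPy assumed):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([6, 11, 18, 27, 38], dtype=float)

    # Coefficients are returned highest degree first: [a2, a1, a0]
    print(np.polyfit(x, y, deg=2))  # approximately [1.0, 2.0, 3.0]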
48. Logistic Regression
Linear regression predicts a numerical response but is not suitable for predicting categorical variables.
When categorical variables are involved, the task is called a classification problem. Logistic regression is suitable for binary classification problems, where the output is a categorical variable.
For example, the following scenarios are instances of predicting categorical variables:
1. Is the mail spam or not spam? The answer is yes or no; the categorical dependent variable is a binary response of yes or no.
2. Should the student be admitted or not, based on entrance examination marks? Here, the categorical response is admitted or not admitted.
3. Does the student pass or fail, based on the marks secured?
49. Logistic Regression
Logistic regression is used as a binary classifier and works by predicting the probability of the categorical variable. In general, it takes one or more features x and predicts the response y.
If the probability were predicted via linear regression, it would be given as:
p(x) = a0 + a1·x
Linear regression generates values in the range −∞ to +∞, whereas the probability of the response variable ranges between 0 and 1. Hence, there must be a mapping function to map values from (−∞, +∞) to (0, 1).
The core of the mapping function in the logistic regression method is the sigmoid function: an 'S'-shaped function that yields values between 0 and 1. This is known as the logistic function and is mathematically represented as:
σ(x) = 1 / (1 + e^(−x))
where
• x: independent variable
• e: Euler's number (~2.718)
50. Logistic Regression
The logistic function is given by:
p(x) = 1 / (1 + e^(−(a0 + a1·x)))
This function is S-shaped and maps any real value to the range (0, 1).
Here,
x is the explanatory or predictor variable,
e is the Euler number,
a0, a1 are the regression coefficients.
The coefficients a0, a1 can be learned, and the predictor then predicts p(x) directly using the threshold function:
y = 1 (positive class) if p(x) ≥ 0.5, else y = 0 (negative class)
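A minimal sketch of this prediction rule (Python standard library only; the coefficient values are placeholders that would normally be learned from data):

    import math

    def predict(x, a0, a1, threshold=0.5):
        # Logistic function: p(x) = 1 / (1 + e^(-(a0 + a1*x)))
        p = 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))
        # Threshold rule: classify as 1 if p >= threshold, else 0
        return p, 1 if p >= threshold else 0

    print(predict(x=2.0, a0=-1.0, a1=0.8))  # p ≈ 0.645 -> class 1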
51. Let us assume a binomial logistic regression problem where the classes are pass and fail. The student dataset has entrance marks, based on the historic data of those who were selected or not selected. Based on the logistic regression, the values of the learnt parameters are a0 = 1 and a1 = 8. Assuming marks of x = 60, compute the resultant class.
Given:
a0 = 1
a1 = 8
x = 60
Compute z = a0 + a1·x and apply the sigmoid function p(x) = 1/(1 + e^(−z)); the computed probability is p(x) ≈ 0.44.
Since the threshold value is 0.5 and 0.44 < 0.5, the candidate with marks 60 is classified as fail (not selected).