Simple linear regression quantifies the relationship between two variables by producing an equation for a straight line of the form y = a + bx, which uses the independent variable (x) to predict the dependent variable (y).

The aim is to establish a mathematical formula relating the response variable (Y) to the predictor variable (X).

We can use this formula to predict Y, when only X values are known.

Regression involves estimating the values of the gradient (b) and intercept (a) of the line that best fits the data.

This is defined as the line which minimises the sum of the squared residuals.
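As a minimal sketch of how the least-squares line can be computed, the closed-form estimates are b = Sxy / Sxx and a = mean(y) - b × mean(x). The function name `fit_line` and the data values are made up for illustration and are not taken from the fuel economy data.

```python
def fit_line(x, y):
    """Return (a, b) minimising the sum of squared residuals for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # gradient
    a = mean_y - b * mean_x    # intercept
    return a, b


# Made-up points lying exactly on y = 10 - 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [8.0, 6.0, 4.0, 2.0]
a, b = fit_line(x, y)
print(a, b)
```

Because the example points lie exactly on a line, the fit recovers that line and every residual is zero.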

First, we build a regression model that uses all variables as predictors, with Fuel Economy (FE) as the response variable.

In this full model, several variables are not statistically significant, so we leave them out. The model summary shows that engine displacement, number of cylinders, transmission lockup and variable valve timing are statistically significant, with p-values < 0.05. A model with fewer, significant predictors is simpler and often predicts better than one that includes every variable, so we continue with variable selection and drop the remaining variables.

Let's build another regression model using only those 4 predictors.

In this reduced model, engine displacement, number of cylinders, transmission lockup and variable valve timing are all statistically significant, with p-values < 0.05. The Multiple R-squared of the model is approximately 0.64, meaning it explains about 64% of the variance in Fuel Economy.

Next we build our main simple regression model, with Engine Displacement as the predictor and Fuel Economy as the response variable.

After comparing several models and predictors, taking into account their statistical significance and p-values and applying variable selection, we settle on Engine Displacement as our single predictor.

This is our final linear regression model.

We use Engine Displacement as the predictor because of its higher impact on FE and because its p-value is less than 0.05, which makes it statistically significant.

With FE modelled as a function of engine displacement, the 'Coefficients' part of the model summary has two components:
Intercept: 50.563, engine displacement: -4.521
These are also called the beta coefficients.

y = b0 + b1x
FE = 50.563 - 4.521 × engine displacement

By building the linear regression model, we have established the relationship between the predictor and the response in the form of a mathematical formula.
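The fitted formula can be applied directly to new values. Below is a small Python sketch using the two coefficients reported above. The function name `predict_fe` and the displacement values are hypothetical and chosen only for illustration.

```python
INTERCEPT = 50.563   # b0, from the fitted model in the text
SLOPE = -4.521       # b1, coefficient of engine displacement, from the text


def predict_fe(engine_displacement):
    """Predicted fuel economy for a given engine displacement."""
    return INTERCEPT + SLOPE * engine_displacement


# Hypothetical displacement values, not taken from the data set
for displacement in (1.5, 2.0, 3.5):
    print(displacement, round(predict_fe(displacement), 3))
```

The negative slope means each additional unit of engine displacement lowers the predicted FE by about 4.521.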

ASSUMPTIONS OF MODEL

Assumptions of the linear regression model to test:
Normality
Linearity
Independence of errors
Constant error variance


SELECTING BEST PREDICTOR & PREDICTING FUEL ECONOMY

To find the best input variable for predicting FE, we need a suitable statistical measure. Here we use correlation.

Correlation is a measure of how strongly two random variables are related.

Correlation is the scaled form of covariance. Its value lies between -1 and +1 and is not affected by a change of scale.

The correlation coefficient of two variables in a data set equals their covariance divided by the product of their individual standard deviations. It is a normalized measure of how the two are linearly related. If the correlation coefficient is close to 1, the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope. If it is close to -1, the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope. If it is close to 0, the linear relationship between the variables is weak.

Correlation is dimensionless, i.e. it is a unit-free measure of the relationship between variables.

We will find the correlation of every variable with Fuel Economy.

For example, we compute the correlation between FE and engine displacement in R.
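As an illustration in Python rather than R, the coefficient can be computed straight from its definition: covariance divided by the product of the standard deviations. The function name `pearson_r` and the data values are made up for this sketch and are not the actual FE data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance / (sd of x * sd of y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov / (sd_x * sd_y)


# Made-up values: larger engines paired with lower fuel economy
displacement = [1.6, 2.0, 2.5, 3.5, 5.0]
fuel_economy = [44.0, 41.0, 39.0, 35.0, 28.0]

r = pearson_r(displacement, fuel_economy)
# Rescaling one variable leaves the correlation unchanged
r_scaled = pearson_r([1000 * d for d in displacement], fuel_economy)
print(r, r_scaled)
```

With these values the coefficient is negative and close to -1, and rescaling the displacement does not change it, which is the scale-invariance property described above.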

Engine Displacement is the best predictor because it has the strongest correlation with FE, so we use it in the final model.

Making Prediction

We make predictions with the final model on the FE2011 test data.


GOODNESS OF MODEL

The t-statistic (-42.46), the standard error of the regression coefficient (0.1065) and the p-value (< 2e-16, almost 0) belong to a t-test of the following hypotheses:

Ho: The slope of fuel economy on engine displacement is not significant
Ha: The slope of fuel economy on engine displacement is significant

In mathematical symbols:
Ho: B1 = 0 (there is no linear relationship)
Ha: B1 != 0 (there is a linear relationship)

Now consider Pr(>|t|), the p-value (significance value).
The rule is:
If the p-value is <= 0.05 (5% level of significance, i.e. a 5% chance of a Type I error), REJECT Ho
If the p-value is > 0.05 (5% level of significance), FAIL TO REJECT Ho
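The t-statistic is the estimated slope divided by its standard error. The short Python sketch below recomputes it from the rounded figures quoted in the text and applies the decision rule. The variable names are illustrative, and the p-value is taken as the reported upper bound instead of being recomputed.

```python
slope = -4.521          # estimated coefficient of engine displacement (from the text)
std_error = 0.1065      # standard error of that coefficient (from the text)
p_value_bound = 2e-16   # reported upper bound on Pr(>|t|)
alpha = 0.05            # 5% level of significance

# t-statistic for Ho: B1 = 0
t_stat = slope / std_error

# Decision rule: reject Ho when the p-value is at most alpha
reject_h0 = p_value_bound <= alpha
print(round(t_stat, 2), reject_h0)
```

From the rounded inputs the ratio comes out at about -42.45. The small gap to the reported -42.46 comes from rounding of the coefficient and its standard error.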

As the p-value of the model variable is effectively 0, which is less than 0.05, we REJECT Ho in favour of Ha and conclude that the slope is significant.

MULTIPLE R-SQUARED

Multiple R-squared is 0.62, which means that 62% of the variance in FE can be explained by engine displacement. The remaining 38% is unexplained variance, due to factors outside the model.

We take this 62% as the explanatory power of the model.
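R-squared is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The Python sketch below shows that calculation on made-up observed and fitted values. The function name `r_squared` and the numbers are hypothetical and chosen so the arithmetic is easy to follow.

```python
def r_squared(observed, fitted):
    """Share of the variance in `observed` explained by `fitted`."""
    mean_obs = sum(observed) / len(observed)
    # Unexplained variation: squared residuals
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    # Total variation around the mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up values: each fitted value misses the observation by 0.5
observed = [3.0, 5.0, 7.0, 9.0]
fitted = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(observed, fitted)
print(r2)
```

Here the residual sum of squares is 1 and the total sum of squares is 20, so 95% of the variance is explained and 5% is left unexplained. The 62% and 38% split reported for the FE model has the same meaning.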

F-STATISTIC

F-statistic: 1803 on 1 and 1105 DF, p-value: < 2.2e-16

The F-statistic measures overall goodness of fit: the higher the F-statistic, the stronger the evidence that the model fits better than a model with no predictors.
Ho: The model is not a good fit (no better than a model with no predictors)
Ha: The model is a good fit

As the p-value (significance value) is < 2.2e-16, which is effectively 0, we reject the null hypothesis and conclude that the model is a statistically good fit.
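In a regression with a single predictor, the F-statistic can be recovered from R-squared and the degrees of freedom, and it equals the square of the slope's t-statistic. The Python sketch below checks both relations against the rounded figures quoted in the text, so the agreement is only approximate. The variable names are illustrative.

```python
r_squared = 0.62    # reported Multiple R-squared (rounded)
df_model = 1        # numerator degrees of freedom (one predictor)
df_resid = 1105     # denominator (residual) degrees of freedom
t_stat = -42.46     # reported t-statistic of the slope

# F = (explained variance per model df) / (unexplained variance per residual df)
f_from_r2 = (r_squared / df_model) / ((1 - r_squared) / df_resid)

# With one predictor, F equals the squared t-statistic of the slope
f_from_t = t_stat ** 2

print(round(f_from_r2), round(f_from_t))
```

Both routes land on roughly 1803, matching the reported F-statistic.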

Performance Features of the Model


SIMPLE REGRESSION MODEL FIT
