PREPROCESSING

PREPROCESSING & PREPARATION OF THE DATA

MISSING VALUE TREATMENT

VARIABLE DATA TYPE CONVERSION

FEATURES SELECTION

OUTLIER DETECTION & TREATMENT

MISSING VALUE TREATMENT

In R, missing values are represented by the symbol NA (not available).
Undefined numeric results (e.g., 0/0) are represented by the symbol NaN (not a number).
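
As a small illustration of the two markers, the sketch below builds a made-up vector (x is not part of the original analysis) and checks each element with base R functions.

```r
# NA marks a missing observation; NaN marks an undefined numeric result.
x <- c(12.5, NA, 0 / 0, 1 / 0)

is.na(x)        # TRUE for both NA and NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE only for 1 / 0, which R evaluates to Inf
```

Because is.na() also flags NaN, a single is.na() check is usually enough when counting unusable values.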

Some of the data may be missing. With the help of the summary and describe functions we can see how many NA values each variable has, or simply check whether any values are missing at all.
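
A minimal sketch of such a check in base R is shown here. The data frame and its column names (FE, EngDispl, NumGears) are hypothetical stand-ins for the real dataset, and describe() is left out because it comes from an add-on package.

```r
# Hypothetical stand-in for the real dataset.
df <- data.frame(
  FE       = c(28.0, 25.6, NA, 30.1),
  EngDispl = c(4.7, 4.7, 4.2, NA),
  NumGears = c(6, 6, 6, 5)
)

summary(df)                      # per-column summary; NA counts are listed when present
na_counts <- colSums(is.na(df))  # number of missing values in each column
na_counts
anyNA(df)                        # TRUE if at least one value is missing
```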

This visualization shows that there are no missing values, hence no missing information.

FEATURES SELECTION

Feature selection (also called variable selection) is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Feature selection techniques are used for three reasons:
simplification of models to make them easier to interpret by researchers/users,
shorter training times,
enhanced generalization by reducing overfitting.

To find the best predictor (input variable) for Fuel Economy we need a suitable statistical test; for this we use correlation.

Out of all available predictors, we keep in the final model only those that are statistically significant (p-value < 0.05) and that have the strongest relationship with our response variable, Fuel Economy.

Further variable selection will follow after more analysis of the variables; at that point we will either remove the remaining variables from the data or leave them out of the model.

This correlation plot helps us understand which variables are most strongly related to Fuel Economy.

Engine Displacement has the strongest correlation with Fuel Economy.
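
The correlation screening described above could be sketched as follows. This is an illustration on simulated data, not the original analysis: the column names FE, EngDispl and NumGears, the simulated values and the 0.05 cut-off applied here are all assumptions made for the example.

```r
# Simulated stand-in data; FE is built to depend on EngDispl.
set.seed(1)
n <- 100
eng_displ <- runif(n, 1, 6)
df <- data.frame(
  FE       = 50 - 5 * eng_displ + rnorm(n, sd = 2),
  EngDispl = eng_displ,
  NumGears = sample(4:7, n, replace = TRUE)
)

predictors <- setdiff(names(df), "FE")

# Pearson correlation and its p-value for each predictor against FE.
results <- do.call(rbind, lapply(predictors, function(p) {
  test <- cor.test(df[[p]], df$FE)
  data.frame(predictor = p, r = unname(test$estimate), p_value = test$p.value)
}))

# Keep significant predictors, strongest absolute correlation first.
selected <- results[results$p_value < 0.05, ]
selected <- selected[order(-abs(selected$r)), ]
selected
```

Sorting by the absolute value of r matters because a strong negative correlation is just as informative as a strong positive one.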

VARIABLE DATA TYPE CONVERSION

The data contain categorical variables. We convert them to factors so that R treats them as categories rather than as numbers.

We convert the following variables:

TransLockup, TransCreeperGear, IntakeValvePerCyl, ExhaustValvesPerCyl, VarValveTiming, VarValveLift
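
One way to do this conversion in base R is sketched here. The variable names are the ones listed above; the data frame values and the extra numeric column EngDispl are made up for illustration.

```r
# Hypothetical stand-in for the real dataset.
df <- data.frame(
  TransLockup         = c(1, 0, 1),
  TransCreeperGear    = c(0, 0, 1),
  IntakeValvePerCyl   = c(2, 2, 1),
  ExhaustValvesPerCyl = c(2, 2, 1),
  VarValveTiming      = c(1, 1, 0),
  VarValveLift        = c(0, 1, 0),
  EngDispl            = c(4.7, 4.2, 2.0)
)

cat_vars <- c("TransLockup", "TransCreeperGear", "IntakeValvePerCyl",
              "ExhaustValvesPerCyl", "VarValveTiming", "VarValveLift")

# Convert every listed column to a factor in one step.
df[cat_vars] <- lapply(df[cat_vars], factor)

str(df)  # the listed columns are now factors; EngDispl stays numeric
```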

OUTLIER DETECTION & TREATMENT

Outliers are extreme values that might violate the assumptions of a parametric model.

The presence of outliers in each variable can be detected using a boxplot or a histogram.

For our analysis we have used boxplots to detect the variables with outliers.
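
As a sketch of boxplot-based detection, the code below draws a boxplot and lists the points it would mark as outliers. The values in num_gears are made up; boxplot.stats() applies the same 1.5 x IQR whisker rule that boxplot() uses by default.

```r
# Made-up values standing in for one numeric variable.
num_gears <- c(4, 5, 5, 6, 6, 6, 6, 7, 1, 12)

# Draw the boxplot; points beyond the whiskers appear as individual dots.
boxplot(num_gears, main = "Number of Gears")

# The same points, returned as a vector.
outliers <- boxplot.stats(num_gears)$out
outliers
```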

We have decided to treat the outliers by winsorizing, i.e. replacing each outlier with a chosen benchmark value (such as a percentile of the variable).

The data contain many numeric variables, and we check each of them for outliers with a box plot.

For example, for the variable Number of Gears we used the following R code; the other variables were handled the same way. Here df is the name of the dataset.
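
The original code is not reproduced in this text, so the sketch below only illustrates the idea and should not be read as the code actually used. It assumes the benchmark is the pair of 1.5 x IQR fences; the helper winsorize_iqr, the column name NumGears and the data values are all hypothetical.

```r
# Hypothetical helper: cap values at the 1.5 x IQR fences.
winsorize_iqr <- function(x, coef = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  lower <- q[1] - coef * iqr
  upper <- q[2] + coef * iqr
  pmin(pmax(x, lower), upper)  # values outside the fences are pulled onto them
}

# Made-up stand-in for the real dataset.
df <- data.frame(NumGears = c(4, 5, 5, 6, 6, 6, 6, 7, 1, 12))

boxplot(df$NumGears, main = "Number of Gears (before)")
df$NumGears <- winsorize_iqr(df$NumGears)
boxplot(df$NumGears, main = "Number of Gears (after)")
```

Capping keeps every row in the data, which is the main reason to prefer winsorizing over deleting the outlying observations.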

Box plots of the variables after outlier treatment
