PREPROCESSING

PREPROCESSING & PREPARATION OF THE DATA

MISSING VALUE TREATMENT

VARIABLE DATA TYPE CONVERSION

FEATURES SELECTION

DUPLICACY REMOVAL

OUTLIER DETECTION & TREATMENT

MISSING VALUE TREATMENT

In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., 0/0) are represented by the symbol NaN (not a number).
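As a small illustration of the difference between the two symbols, the sketch below uses only base R. The vector x is a made-up example and is not part of the dataset.

```r
# x is a made-up vector: one ordinary value, one missing value, one impossible value
x <- c(1, NA, 0 / 0)      # 0/0 evaluates to NaN

na_flags  <- is.na(x)     # is.na() flags both NA and NaN
nan_flags <- is.nan(x)    # is.nan() flags only NaN

print(na_flags)
print(nan_flags)
```

Because is.na() also returns TRUE for NaN, a check based on is.na() covers both kinds of values.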

Some values in the data may be missing, so we use the summary and describe functions to see how many NA values each variable has, or simply to check whether any values are missing.

The visualization plot shows that there are no missing values, so no information is missing.
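A check of this kind can be sketched in base R as follows. The data frame and its column names (Eve.Calls, Eve.Mins) are made-up stand-ins for the real dataset, and colSums(is.na(...)) is just one common way to count NA values per variable.

```r
# Toy data frame standing in for the real dataset (values and names are assumptions)
df <- data.frame(
  Eve.Calls = c(99, 103, 110, 88),
  Eve.Mins  = c(197.4, 195.5, 121.2, 61.9)
)

summary(df)                            # per-variable summary; an NA count is listed only if NAs exist

na_per_column <- colSums(is.na(df))    # number of NA values in each variable
total_na      <- sum(is.na(df))        # number of NA values in the whole data frame

print(na_per_column)
print(total_na)
```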

DUPLICACY REMOVAL

The number of unique rows equals the total number of records, which means there are no duplicate rows.
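The comparison of unique rows against total rows could look like the sketch below. The data frame is a made-up example, and the last line shows how duplicates would be dropped if any were found.

```r
# Toy data frame (values and names are assumptions, not the real dataset)
df <- data.frame(
  Account.Length = c(128, 107, 137),
  Eve.Calls      = c(99, 103, 110)
)

n_total  <- nrow(df)              # total number of records
n_unique <- nrow(unique(df))      # number of distinct rows
n_dupes  <- sum(duplicated(df))   # rows that repeat an earlier row

df_clean <- df[!duplicated(df), ] # keeps the first occurrence of every row

print(c(total = n_total, unique = n_unique, duplicates = n_dupes))
```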


FEATURES SELECTION

Variable selection and feature selection
The process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Feature selection techniques are used for three reasons:
simplification of models to make them easier to interpret by researchers/users,
shorter training times,
enhanced generalization by reducing overfitting.

We remove the State and Phone Number variables from our dataset, as they are not relevant to our prediction.

Keeping two highly correlated variables in the model makes no sense. Therefore, from each of our 4 pairs

Data Minutes & Data Charge,

Evening Minutes & Evening Charge,

Night Minutes & Night Charge,

International Minutes & International Charge

we will keep only 1 variable, since the second one adds no new information and can make the model less stable.
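To illustrate the idea for one pair, the sketch below checks the correlation between a minutes variable and its charge variable and then drops one of them. The column names (Eve.Mins, Eve.Charge, Eve.Calls) and the values are assumptions made for the example, and dropping the charge rather than the minutes is an arbitrary choice here.

```r
# Toy data (made-up values): the charge is roughly proportional to the minutes
df <- data.frame(
  Eve.Mins   = c(197.4, 195.5, 121.2, 61.9, 148.3),
  Eve.Charge = c(16.78, 16.62, 10.30, 5.26, 12.61),
  Eve.Calls  = c(99, 103, 110, 88, 122)
)

r <- cor(df$Eve.Mins, df$Eve.Charge)   # Pearson correlation of the pair
print(r)

# Keep only one variable of the pair (here the charge is dropped)
df_reduced <- df[, setdiff(names(df), "Eve.Charge")]
print(names(df_reduced))
```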

We will do further variable selection after more analysis of our variables, and will then either remove them from the data or leave them out of our model.

VARIABLES DATA TYPE CONVERSION

The data contains categorical variables, which we need to convert to factors, as is highly advisable.

We convert the following variables:

Churn, International Plan, Voice Mail Plan, State.
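A conversion of this kind can be sketched as follows. The column names (Churn, Intl.Plan, Eve.Calls) and the values are made up for the example. The real dataset may use different names.

```r
# Toy data frame with categorical variables stored as character strings (assumed names)
df <- data.frame(
  Churn     = c("False.", "True.", "False."),
  Intl.Plan = c("no", "yes", "no"),
  Eve.Calls = c(99, 103, 110),
  stringsAsFactors = FALSE
)

cat_cols <- c("Churn", "Intl.Plan")            # variables to convert
df[cat_cols] <- lapply(df[cat_cols], factor)   # convert each listed column to a factor

str(df)   # shows the new type of every variable
```

Listing the categorical columns once in cat_cols keeps the conversion in a single line, and the numeric variables are left untouched.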


OUTLIER DETECTION & TREATMENT

Outliers are extreme values that may violate the assumptions of parametric models.

The presence of outliers in each variable can be detected using a boxplot or a histogram.

For our analysis, we used boxplots to detect the variables with outliers.
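As a sketch of boxplot-based detection, the example below uses boxplot.stats(), which returns the statistics a boxplot is drawn from, including the points beyond the whiskers. The vector x is made up for the example.

```r
# Made-up values for one numeric variable
x <- c(99, 103, 110, 88, 122, 101, 97, 170, 12)

stats    <- boxplot.stats(x)   # default whiskers reach 1.5 times the box length
outliers <- stats$out          # values lying beyond the whiskers

print(stats$stats)             # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outliers)
```

Calling boxplot(x) would draw the corresponding plot, with the same points shown as outliers.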

We decided to treat the outliers by winsorizing, i.e. replacing the outliers with a certain benchmark value.

We have many numeric variables, and we check each of them for outliers using box plots.

For example, for the variable Evening Calls we used the following R code, and we used the same code for the other variables. Here df is the name of our dataset.
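As a rough sketch only, and not necessarily the code used in this analysis, winsorizing one variable might look like the example below. The helper winsorize_iqr, the column name Eve.Calls, the toy values, and the choice of the 1.5 times IQR fences as the benchmark are all assumptions made for the illustration.

```r
# Hypothetical helper: clamp values to the fences Q1 - coef*IQR and Q3 + coef*IQR
winsorize_iqr <- function(x, coef = 1.5) {
  q     <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr   <- q[2] - q[1]
  lower <- q[1] - coef * iqr
  upper <- q[2] + coef * iqr
  pmin(pmax(x, lower), upper)   # values outside the fences are replaced by the fences
}

# Toy dataset (made-up values); df mirrors the dataset name used in the text
df <- data.frame(Eve.Calls = c(99, 103, 110, 88, 122, 101, 97, 170, 12))

df$Eve.Calls <- winsorize_iqr(df$Eve.Calls)
print(df$Eve.Calls)
```

Values inside the fences are left unchanged, so only the extreme observations are moved. A different benchmark, such as fixed percentiles, would change only the two limits computed inside the helper.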

Box plots of the variables after outlier treatment
