Churn Prediction /Analysis on the given set of datasets The objective of the Churn Prediction /Analysis Project:
To obtain a Logistic Regression Model of the Insurance data which includes various attributes of customers mentioned in the dataset. This model will allow us to predict customers who will stick on with Allianz or in turn, renew their existing service plans (Churn=0) and customers who will not (Churn=1). Therefore, this model will help us to make a decision to retain the customers (churned) who will look for other Insurance companies.
The approach is taken for this Project:
First of all, start, there are around 21 variables in the given dataset. Before building a model we need to identify the most significant variables out of these 21. So, first, we need to do some preprocessing of the data. As part of pre-processing, we need to make sure that the data contains only numeric values and that the data is fully available. Out of these 21 variables, there were 3 variables (Phone number, Area Code, etc.) that can be removed as they are not relevant to our prediction at this moment. Also, there were some data missing in certain rows. So those rows were omitted (using na.omit function in R). Then the significant variables can be found from this data. The following steps were carried out:
- To start with, the data was divided into 2 sets namely the Training dataset and the test dataset with a proportion of 80:20.
- Created fit0 taking “Churn” as the dependent value and rest as the independent values
- Similarly, Fit1 was created
- The iteration continued and created Fit2 where we got the significant variables. This was done using the step function.
- Hence finalized the model Fit 2.
- The next step is to predict the values for the training data using predict function. The predicted values are the probabilities. Using the view () function we can see the probability values
- Converted these predicted values to “0” and “1”.
- The threshold needs to be set and the advisable one was 0.5
- Got the best accuracy rate with this threshold. Hence proceeded
- So based on the predicted values, compared them with the actual values using the Table function.
- Post this, a confusion matrix was created and the accuracy was 86% for a 0.5 threshold
- Plotted ROC Curve to determine the area under the curve.
- Plotted a KS Plot as well (just to understand the percentage of responders).
Also, the aim of this Churn Prediction /Analysis project was to visualize this in Tableau using the “R Integration with Tableau” feature. The following steps were carried out:
- Using the finalized model in R (fit2), created one named Final_model (Prediction) in Tableau. This was created with the help of a confusion matrix
- Created a graph using variables Churn (Actual Churn), Predicted Churn(Final_model), Accuracy, and Customer ID to identify the churn
- For the Odds sheet, created a filter with all the measure values & threshold parameters.
- A model created with R code with all variables (significant ones).
Based on our predicted Model Fit2, We can conclude that 307 customers will not churn and there is a chance of 63 customers to get churned. This is based on an accuracy of 86.13%. Therefore the Telecom Company (Service Provider) should contact these 63 customers to retain them by providing any attractive offers.
Regression Analysis on Fuel Economy Project
The aim of this project is to identify the dependency of different variables with respect to the Fuel economy of a Car. In the attached excel I have done the regression Analysis considering the best suitable variable as an independent variable against the dependent variable FE.
PROCESSES & STEPS CARRIED OUT IN THE PROJECT
- Used Excel “ Data Analytics” to carry out the entire process
- To identify the best suitable variable, performed “correlation” in Excel and concluded that Engine Displacement is the best one
- Post this, carried out regression Analysis using Excel function
- Also, plotted a scatter plot against this to represent the trend line
- Based on the P-value of regression analysis, the hypothesis was accepted /rejected
- Divided the entire data set into 3 parts by random sampling method
- Considered the 3rd set of data as the Test dataset and the first 2 as the Training data set
- Calculated MAPE value based on the predicted value and actual value.
We could understand that the FE is related to Engine Displacement. From the Scatter plot, we could identify that the best Fuel Efficient car has got a very less Engine Displacement.