Predict the Forest Fires Python Project using Machine Learning Techniques

Predict the Forest Fires Python Project using Machine Learning Techniques is a summer internship report on the industrial training workshop entitled “PREDICT THE FOREST FIRES”. It was submitted to the University, Hyderabad, in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science Engineering.

Apart from my effort, the success of this internship largely depends on the encouragement and guidance of many others. I take this opportunity to express my gratitude to the people who have helped me in the successful completion of this internship.

I would like to thank the respected faculty members who helped me make this internship a success.

I would also like to thank my friends, who helped me keep my work organized and well structured until the end.

OBJECTIVE OF THE PROJECT:

This is a regression problem with clear outliers that cannot be predicted using any reasonable method. Three methods are compared:

(a) Random Forest Regressor,
(b) Neural Network,
(c) Linear Regression
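
The comparison above could be set up along the following lines. This is only a sketch under stated assumptions: the data is synthetic, `MLPRegressor` stands in for the unspecified neural network, and all hyperparameter values are illustrative rather than those used in the report.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (assumption): 120 samples, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The three model families named in the report; settings are illustrative
models = {
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "Neural Network": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "Linear Regression": LinearRegression(),
}

# Fit each model on the training set and record its RMSE on the test set
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))
```

On synthetic data the ranking of the models says nothing about the forest fires dataset; the sketch only shows the mechanics of fitting several regressors and scoring them on the same held-out data.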

The output ‘area’ was first transformed with an ln(x+1) function.
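
The ln(x+1) transform can be sketched with NumPy as shown here. The sample values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up burned areas, including a zero, which ln(x) alone could not handle
area = np.array([0.0, 0.5, 10.0, 200.0])

# log1p computes ln(x + 1): it maps 0 to 0 and compresses large values
log_area = np.log1p(area)

# expm1 computes exp(x) - 1, the inverse, to map predictions back to areas
recovered = np.expm1(log_area)
```

A model trained on the transformed target predicts ln(area + 1), so `np.expm1` is needed whenever a prediction has to be reported in the original units.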

Two regression metrics were measured: RMSE and the R2 score. An analysis of the regression error characteristic (REC) curve shows that the RFR model predicts more examples within a lower admitted error. In effect, the RFR model is better at predicting small fires. The R2 score is obtained using Linear Regression.
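
As a minimal sketch of how these two metrics can be computed with scikit-learn, consider the toy values below. They are chosen so the results are easy to verify by hand and have nothing to do with the project's actual predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (made up): only the last prediction is off, by 2
y_true = np.array([0.0, 1.0, 2.0, 3.0])
y_pred = np.array([0.0, 1.0, 2.0, 5.0])

# RMSE is the square root of the mean squared error: sqrt(4 / 4) = 1.0
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R2 = 1 - SS_res / SS_tot = 1 - 4 / 5 = 0.2
r2 = r2_score(y_true, y_pred)
```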

Best Algorithm for the project:

The best model is the Random Forest Regressor, tuned with GridSearchCV, which has an RMSE value of 0.628.

Scikit-learn provides GridSearchCV, a built-in utility that tries a set of hyperparameter combinations and reports which one works best. The CV stands for cross-validation.
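
A grid search over a Random Forest Regressor might look like the sketch below. The report does not state which hyperparameters or values were searched, so the grid, the number of folds, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 60 samples, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

# Hypothetical grid: 2 x 2 = 4 combinations to evaluate
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scores are maximized, hence the negated MSE
    cv=3,  # 3-fold cross-validation for every combination
)
search.fit(X, y)

best_params = search.best_params_
# Convert the best (negated) cross-validated MSE back into an RMSE
best_rmse = float(np.sqrt(-search.best_score_))
```

After fitting, `search.best_estimator_` holds a model refit on all the supplied data with the winning combination, which can then be evaluated on a held-out test set.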

MODEL BUILDING:

PREPROCESSING OF THE DATA:

Preprocessing of the data involves the following steps:

GETTING THE DATASET:

We can get the data from the client or from a database. The dataset used here is available at:
https://archive.ics.uci.edu/ml/datasets/forest+fires

IMPORTING THE LIBRARIES:

We import the libraries required by the algorithms we use.

IMPORTING THE DATA SET:

The pandas library in Python provides a function called read_csv(). It reads the entire dataset from a comma-separated values file into a DataFrame, on which all further operations can be performed. The DataFrame gives us access to every row, column, and individual value. Any missing or NaN values have to be cleaned.
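
Loading the data can be sketched as shown here. To keep the example self-contained, it reads from an in-memory string instead of a file. The column names follow the UCI dataset description, the three rows are made up, and the file name in the comment is an assumption.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; the rows below are made up for illustration
csv_text = """X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
1,2,mar,fri,85.0,25.0,90.0,5.0,8.0,50,6.0,0.0,0.0
3,4,aug,sun,91.0,120.0,650.0,9.0,22.0,35,3.0,0.0,4.5
5,6,sep,tue,92.0,100.0,700.0,7.0,19.0,40,2.5,0.2,0.0
"""

# With a real file this would be: pd.read_csv("forestfires.csv")  (assumed name)
df = pd.read_csv(io.StringIO(csv_text))

# Rows, columns and single values are all reachable through the DataFrame
first_row = df.iloc[0]
month_column = df["month"]
single_value = df.loc[1, "area"]
```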

HANDLING MISSING VALUES:

OBSERVATION:

As we can see, there are no missing values in the given forest fires dataset.
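
A check of this kind can be sketched with pandas as shown here. The small frame is made up; the second frame has one value deliberately removed to show what the cleaning step would do if gaps were present.

```python
import numpy as np
import pandas as pd

# Made-up frame with no gaps, mirroring the observation above
df = pd.DataFrame({
    "temp": [8.0, 22.0, 19.0],
    "month": ["mar", "aug", "sep"],
    "area": [0.0, 4.5, 0.0],
})

# Count missing values per column and check whether any exist at all
missing_per_column = df.isnull().sum()
has_missing = bool(df.isnull().values.any())

# For contrast: remove one value on purpose, then drop the incomplete row
df_with_gap = df.copy()
df_with_gap.loc[1, "temp"] = np.nan
cleaned = df_with_gap.dropna()
```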

DATA VISUALIZATION:

  • Scatterplots and distributions of numerical features to see how they may affect the output ‘area’
  • Boxplot of how the categorical column ‘day’ affects the outcome
  • Boxplot of how the categorical column ‘month’ affects the outcome
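
Plots of this kind could be produced with matplotlib roughly as shown here. The data is made up, the choice of `temp` as the numerical feature is arbitrary, and the non-interactive backend is selected only so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample with one numerical and one categorical feature
df = pd.DataFrame({
    "temp": [8.0, 22.0, 19.0, 25.0, 12.0, 18.0],
    "day": ["fri", "sun", "fri", "sun", "fri", "sun"],
    "area": [0.0, 4.5, 0.0, 30.0, 1.2, 0.0],
})
df["log_area"] = np.log1p(df["area"])

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Scatterplot: numerical feature against the transformed output
ax_scatter.scatter(df["temp"], df["log_area"])
ax_scatter.set_xlabel("temp")
ax_scatter.set_ylabel("ln(area + 1)")

# Boxplot: one box per category of the 'day' column
labels, groups = [], []
for day, group in df.groupby("day"):
    labels.append(day)
    groups.append(group["log_area"].to_numpy())
ax_box.boxplot(groups)
ax_box.set_xticks(range(1, len(labels) + 1))
ax_box.set_xticklabels(labels)
ax_box.set_xlabel("day")
ax_box.set_ylabel("ln(area + 1)")

fig.tight_layout()
```

The same loop works for the ‘month’ column by grouping on that column instead.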

CATEGORICAL DATA:

  • Machine learning models are based on equations, so we need to replace text with numbers that can be included in those equations.
  • Categorical variables are of two types: nominal and ordinal.
  • Nominal: the categories have no ordering between them. Examples: male or female, any color.
  • Ordinal: the categories have a natural ordering. Examples: Graduate is less than Post Graduate, and Post Graduate is less than Ph.D.; a customer satisfaction survey rated low, medium, or high.
  • Categorical data can be handled by using dummy variables, which are also called indicator variables.
  • Handling categorical data using dummies: the pandas library has a function called get_dummies(), which creates dummy variables for categorical data in the form of 0s and 1s.
  • Once the dummies are created, we concatenate the dummy set to our data frame.
  • Categorical column ‘month’
  • Dummy set for column ‘month’
  • Categorical column ‘day’
  • Dummy set for column ‘day’
  • Concatenating the dummy sets to the data frame
  • Encoding categories using LabelEncoder from the scikit-learn package
  • The scikit-learn package provides a class called LabelEncoder. We import it from scikit-learn and then fit and transform each categorical column to turn its categories into numbers.
  • With this approach, each category is replaced by an integer code (0, 1, 2, …) instead of by separate 0/1 dummy columns.
  • Importing LabelEncoder and OneHotEncoder
  • Handling categorical data of column month
  • Handling categorical data of column day
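
Both approaches described above can be sketched on a small made-up frame. The column prefixes and variable names are illustrative choices, not taken from the report.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample with the two categorical columns and one numerical column
df = pd.DataFrame({
    "month": ["mar", "aug", "sep", "aug"],
    "day": ["fri", "sun", "tue", "fri"],
    "temp": [8.0, 22.0, 19.0, 25.0],
})

# Approach 1: dummy (indicator) variables with pandas
month_dummies = pd.get_dummies(df["month"], prefix="month")
day_dummies = pd.get_dummies(df["day"], prefix="day")

# Replace the text columns with their dummy sets by concatenating column-wise
df_dummies = pd.concat(
    [df.drop(columns=["month", "day"]), month_dummies, day_dummies], axis=1
)

# Approach 2: integer codes with LabelEncoder (classes are sorted alphabetically)
df_labels = df.copy()
month_encoder = LabelEncoder()
df_labels["month"] = month_encoder.fit_transform(df_labels["month"])
df_labels["day"] = LabelEncoder().fit_transform(df_labels["day"])
```

Note that the integer codes follow alphabetical order (`aug` = 0, `mar` = 1, `sep` = 2), not calendar order, so a model may read an ordering into them that does not exist. Dummy variables avoid this at the cost of more columns.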

TRAINING THE MODEL:

  • Splitting the data: after preprocessing is done, the data is split into a training set and a test set.
  • In machine learning, in order to assess the performance of a model, you train it on a ‘training set’ and then test its performance on an unseen ‘test set’. An important point is that training uses only the training set. The test set must not be used during training; it is only used when testing the model.
  • Training set: a subset used to train a model. (The model learns patterns between input and output.)
  • Test set: a subset used to test the trained model. (To test whether the model has learned correctly.)
  • The split percentage can be chosen as required (e.g. train data = 75%, test data = 25%, or train data = 80%, test data = 20%).
  • First, we need to identify the input and output variables and separate the input set from the output set.
  • The scikit-learn library has a module called model_selection, which provides the train_test_split function. We need to import this function.
  • This function splits the input and output data into train and test sets based on the percentage specified by the user and returns four results, which we assign to four variables.
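
A 75% / 25% split can be sketched as shown here, using placeholder arrays in place of the real inputs and outputs. The `random_state` value is an arbitrary choice that makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 20 samples with 2 input features each
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# test_size=0.25 keeps 75% of the rows for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```

The four returned objects are, in order, the training inputs, the test inputs, the training outputs, and the test outputs.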

EVALUATING THE CASE STUDY:

Building the model (using splitting):

First, we have to retrieve the input and output sets from the given dataset:

  • Retrieving the input columns
  • Retrieving the output column
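
Separating the inputs from the output can be sketched as shown here. The frame is made up and contains only two input columns for brevity; `X` and `y` are conventional names, not ones taken from the report.

```python
import pandas as pd

# Made-up frame: two input columns and the output column 'area'
df = pd.DataFrame({
    "temp": [8.0, 22.0, 19.0],
    "wind": [6.0, 3.0, 2.5],
    "area": [0.0, 4.5, 0.0],
})

# Input set: every column except the output
X = df.drop(columns=["area"])

# Output set: the 'area' column on its own
y = df["area"]
```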

MODEL BUILDING:

  • Defining Regression Error Characteristic (REC)
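
A regression error characteristic curve plots, for each error tolerance, the fraction of examples predicted within that tolerance. The function below is a minimal sketch of that idea, not the report's own implementation. The function name, the use of absolute error, and the evenly spaced tolerance grid are assumptions.

```python
import numpy as np


def rec_curve(y_true, y_pred, n_points=50):
    """Return (tolerances, accuracy) for a regression error characteristic curve.

    accuracy[i] is the fraction of examples whose absolute error
    is less than or equal to tolerances[i].
    """
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Evenly spaced tolerances from 0 up to the largest observed error
    tolerances = np.linspace(0.0, errors.max(), n_points)
    accuracy = np.array([(errors <= t).mean() for t in tolerances])
    return tolerances, accuracy


# Toy example (made-up values): absolute errors are 0, 0.5, 0 and 2
tolerances, accuracy = rec_curve([0, 1, 2, 3], [0, 1.5, 2, 5], n_points=5)
```

A curve that rises faster means more examples are predicted within a small error, which is the sense in which one model can be said to predict more examples within a lower admitted error than another.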
