Decision Model for Prediction of Movie Success Rate Data Mining J Component Project

ABSTRACT

The purpose of this Movie Success Rate Prediction project is to predict the success of any upcoming movie using Data Mining Tools. For this purpose, we have proposed a method that will analyze the cast and crew of the movie to find the success rate of the film using existing knowledge. Many factors, like the cast (actors, actresses, directors, producers), budget, worldwide gross, and language, will be considered when training and testing the algorithms. Two algorithms will be tested on our dataset and their accuracy will be checked.

LITERATURE REVIEW

  • They developed a model to find the success of upcoming movies based on certain factors. The size of the audience plays a vital role in a movie becoming successful.
  • The Factorization Machines approach was used to predict movie success by predicting IMDb ratings for newly released movies, combining movie metadata with social media data.
  • The gross attribute is used as a training element for the model; the data are converted into .csv files after pre-processing is done.
  • Using S-PLSA to capture sentiment information from online reviews and tweets, the ARSA model predicts the sales performance of movies from sentiment information and past box office performance.
  • A mathematical model is used to predict the success or failure of upcoming movies depending on certain criteria. Their work makes use of historical data in order to successfully predict the ratings of movies to be released.
  • According to them, Twitter is a platform that can provide geographical as well as timely information, making it a perfect source for spatiotemporal models.
  • The data they collected was gathered from Box Office Mojo and Wikipedia, and comprised movies released in 2016.
  • Initially having a dataset of 3,183 movies, they removed movies whose budget could not be found or that were missing key features; after key feature extraction was completed, a dataset of 755 movies was obtained.
  • They performed some useful data mining on the IMDb data and uncovered information that cannot be seen by browsing the regular web front end to the database.
  • According to their conclusion, brand power, actors, or directors aren't strong enough on their own to affect the box office.
  • Their neural network was able to obtain an accuracy of 36.9%, and, when allowing for mistakes within one category, an accuracy of 75.2%.
  • They divided the movies into three classes (rise, stay, and fall), finding that the SMO support vector machine can give up to 60% correct predictions.
  • The data was taken from the Internet Movie Database (IMDb), covering the years 1945 to 2017.
  • A more accurate classifier is also well within the realm of possibility and could even lead to an intelligent system capable of making suggestions for a movie in preproduction, such as a change to a particular director or actor, that would be likely to increase the rating of the resulting film.
  • In this study, they proposed a movie investor assurance system (MIAS) to aid movie investment decisions at the early stage of movie production. MIAS learns from freely available historical data derived from various sources and tries to predict movie success based on profitability.
  • The data they gathered from movie databases was cleaned, integrated, and transformed before the data mining techniques were applied.
  • They used feature extraction techniques and polarity scores to create a list of successful and unsuccessful movies, gathering the data from IMDb and YouTube.

PROBLEM STATEMENT

In this Movie Success Rate Prediction project, the method of using the ratings of films by their cast and crew is an innovative and original way to address a common dilemma of film producers: casting successful actors and directors while still keeping to a budget. Looking at the average ratings of each actor and director, together with all the films they participated in, should give the producer a good idea of who to cast and who not to cast in an upcoming film.

Implementation:

  • Data Preprocessing & Correlation Analysis
  • Application of Decision Tree Algorithm
  • Application of Random Forest Algorithm (a sketch of both classifiers follows below)
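A minimal sketch of the two classifiers compared in this project, assuming a preprocessed DataFrame with illustrative column names (the file name, feature columns, and binary "success" label are assumptions, not the project's actual schema):

```python
# Hedged sketch: train and compare a Decision Tree and a Random Forest
# classifier on a preprocessed movie dataset (column names are illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

movies = pd.read_csv("movies.csv")  # hypothetical preprocessed file
X = movies[["budget", "gross", "actor_rating", "director_rating"]]
y = movies["success"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

for model in (DecisionTreeClassifier(random_state=42),
              RandomForestClassifier(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, f"accuracy: {acc:.3f}")
```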

RESULTS & CONCLUSION

After testing both algorithms, Decision Tree and Random Forest, on the IMDb dataset, we found that the Random Forest algorithm achieved better accuracy (99.6%) on the data than the Decision Tree algorithm, with which we obtained just 60% accuracy.

Predict the Forest Fires Python Project using Machine Learning Techniques

Predict the Forest Fires Python Project using Machine Learning Techniques is a summer internship report submitted in partial fulfillment of the requirements for the undergraduate degree of Bachelor of Technology in Computer Science Engineering. I submit this industrial training workshop entitled "PREDICT THE FOREST FIRES" to the University, Hyderabad, in partial fulfillment of the requirements for the award of that degree.

Apart from my own effort, the success of this internship depends largely on the encouragement and guidance of many others. I take this opportunity to express my gratitude to the people who have helped me in the successful completion of this internship.

I would like to thank the respected faculties who helped me to make this internship a successful accomplishment.

I would also like to thank my friends who helped me to make my work more organized and well-stacked till the end.

OBJECTIVE OF THE PROJECT:

This is a regression problem with clear outliers which cannot be predicted using any reasonable method. A comparison of three methods has been done:

(a) Random Forest Regressor,
(b) Neural Network,
(c) Linear Regression

The output ‘area’ was first transformed with an ln(x+1) function.

Two regression metrics were measured: RMSE and the r2 score. An analysis of the regression error characteristic (REC) curve shows that the RFR model predicts more examples within a lower admitted error. In effect, the RFR model is better at predicting small fires; the r2 score was obtained using Linear Regression.

Best Algorithm for the project:

The best model is the Random Forest Regressor, which achieves an RMSE value of 0.628 and was tuned using GridSearchCV.

Scikit-learn has built-in functionality for trying a set of parameter combinations and seeing what works best: GridSearchCV, where CV stands for cross-validation.
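A hedged sketch of this tuning step, including the ln(x+1) target transform described above; the parameter grid is illustrative, not the report's exact one:

```python
# Sketch: tune a RandomForestRegressor with GridSearchCV on the forest-fires data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

df = pd.read_csv("forestfires.csv")
X = pd.get_dummies(df.drop(columns="area"))   # one-hot encode the month/day columns
y = np.log1p(df["area"])                      # the ln(x + 1) transform described above

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}  # illustrative grid
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)  # best combination and its RMSE
```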

MODEL BUILDING

PREPROCESSING OF THE DATA:

Preprocessing of the data actually involves the following steps:

GETTING THE DATASET:

We can get the data from the client, or we can get it from a database. For this project, the dataset is available from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/forest+fires

IMPORTING THE LIBRARIES:

We have to import the libraries as per the requirement of the algorithm.

IMPORTING THE DATA SET:

Pandas in Python provides a useful method, read_csv(). The read_csv function reads an entire dataset from a comma-separated values file, and we can assign it to a DataFrame on which all operations can be performed. It lets us access each and every row and column, and each and every value can be accessed through the data frame. Any missing or NaN values have to be cleaned.
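For example, loading the UCI file into a DataFrame and checking for missing values might look like this (the file name assumes the UCI download):

```python
# Load the forest-fires dataset and inspect it for missing values.
import pandas as pd

df = pd.read_csv("forestfires.csv")
print(df.head())          # inspect the first rows and columns
print(df.isnull().sum())  # count NaN values per column before cleaning
```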

HANDLING MISSING VALUES:

OBSERVATION:

As we can see, there are no missing values in the given forest fires dataset.

DATA VISUALIZATION:

  • Scatterplots and distributions of numerical features, to see how they may affect the output 'area'
  • Boxplot of how the categorical column day affects the outcome
  • Boxplot of how the categorical column month affects the outcome (illustrative plotting code follows below)
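Illustrative plotting code for these three views, assuming the DataFrame `df` loaded earlier (column names follow the UCI dataset):

```python
# Sketch: scatterplot of a numeric feature and boxplots of the categorical columns.
import matplotlib.pyplot as plt
import seaborn as sns

df.plot.scatter(x="temp", y="area")        # numeric feature vs. output 'area'
plt.show()

sns.boxplot(x="day", y="area", data=df)    # categorical column 'day' vs. outcome
plt.show()

sns.boxplot(x="month", y="area", data=df)  # categorical column 'month' vs. outcome
plt.show()
```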

CATEGORICAL DATA:

  • Machine Learning models are based on equations, so we need to replace text with numbers in order to include the values in the equations.
  • Categorical variables are of two types: nominal and ordinal.
  • Nominal: The categories do not have any numeric ordering between them; there is no ordered relationship between them. Examples: male or female, any color.
  • Ordinal: The categories have a numerical ordering between them. Examples: Graduate is less than Post Graduate, Post Graduate is less than Ph.D.; customer satisfaction survey responses such as low, medium, high.
  • Categorical data can be handled by using dummy variables, which are also called indicator variables.
  • Handling categorical data using dummies: the pandas library has a method called get_dummies(), which creates dummy variables for categorical data in the form of 0s and 1s.
  • Once these dummies are created, we have to concatenate the dummy set to our data frame.
  • Categorical data column 'month'
  • Dummy set for column 'month'
  • Categorical data column 'day'
  • Dummy set for column 'day'
  • Concatenating dummy sets to the data frame
  • Getting dummies using LabelEncoder from the scikit-learn package
  • We have a class called LabelEncoder in the scikit-learn package. We need to import it and then fit and transform the data frame to turn the categorical data into dummies.
  • If we use this method to get dummies, then in place of the categorical data we get numerical values (0, 1, 2, ...).
  • Importing LabelEncoder and OneHotEncoder
  • Handling categorical data of column 'month'
  • Handling categorical data of column 'day' (a sketch of both encoding routes follows below)
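A sketch of both encoding routes applied to the 'month' and 'day' columns of the DataFrame `df` loaded earlier:

```python
# Route 1: dummy/indicator variables with get_dummies(), then concatenation.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

dummies = pd.get_dummies(df[["month", "day"]])
df_dummies = pd.concat([df.drop(columns=["month", "day"]), dummies], axis=1)

# Route 2: LabelEncoder, which replaces each category with an integer (0, 1, 2, ...).
le = LabelEncoder()
df_encoded = df.copy()
df_encoded["month"] = le.fit_transform(df_encoded["month"])
df_encoded["day"] = le.fit_transform(df_encoded["day"])
```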

TRAINING THE MODEL:

  • Splitting the data: after preprocessing is done, the data is split into a train set and a test set.
  • In machine learning, in order to assess the performance of a classifier, you train it using a 'training set' and then test its performance on an unseen 'test set'. An important point to note is that during training the classifier only uses the training set; the test set must not be used during training and is only used when testing the classifier.
  • Training set: a subset used to train the model (the model learns patterns between input and output).
  • Test set: a subset used to test the trained model (to check whether the model has learned correctly).
  • The split percentage can be specified as desired (e.g. train data = 75%, test data = 25%, or train data = 80%, test data = 20%).
  • First we need to identify the input and output variables and separate the input set from the output set.
  • The scikit-learn library has a package called model_selection in which the train_test_split method is available; we need to import this method.
  • This method splits the input and output data into train and test portions based on the percentage specified by the user and assigns them to four variables that we name (a sketch follows below).
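A minimal sketch of this split with a 75/25 ratio, using the encoded frame from the previous step:

```python
# Split the input set X and output set y into train and test portions.
from sklearn.model_selection import train_test_split

X = df_dummies.drop(columns="area")  # input columns
y = df_dummies["area"]               # output column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```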

EVALUATING THE CASE STUDY:

Building the model (using splitting):

First, we have to retrieve the input and output sets from the given dataset

  • Retrieving the input columns
  • Retrieving output column

MODEL BUILDING:

  • Defining Regression Error Characteristic (REC) (a sketch follows below)
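A minimal sketch of how such a curve can be computed, assuming predictions `y_pred` for the test target `y_test` already exist:

```python
# REC curve: for each error tolerance, plot the fraction of samples whose
# absolute prediction error falls within that tolerance.
import numpy as np
import matplotlib.pyplot as plt

def rec_curve(y_true, y_pred, steps=100):
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    tolerances = np.linspace(0, errors.max(), steps)
    accuracy = [(errors <= tol).mean() for tol in tolerances]
    return tolerances, accuracy

tol, acc = rec_curve(y_test, y_pred)
plt.plot(tol, acc)
plt.xlabel("Error tolerance")
plt.ylabel("Fraction of samples within tolerance")
plt.title("Regression Error Characteristic (REC)")
plt.show()
```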

Download the complete project code and report on Predict the Forest Fires Python Project using Machine Learning Techniques

Analysis Of Energy Consumption In India Python Project

Energy is one of the most important resources available to man and it is necessary to keep a check on the growing need for energy day by day.

The Issue of the availability of Energy is getting prominent these days. So to analyze the consumption of energy and production of Energy via available Energy Resources is important.

The project describes the consumption of energy resources of all states of India in the last few years with respect to the population of India state-wise and predicts the future energy requirements for every state.

INTRODUCTION

India is a growing economic superpower. At this point in time, we are sitting at the tip of our economic explosion. The vast reserves of resources in all factors of production have earned us the title of The Land of Potential. But this comes at a cost, with this growth potential comes the need to satisfy the potential through the generation of energy.

Meeting this challenge of growing energy demand is very important for India, and it is even more important to predict the future energy requirements of our country.

If we are able to predict the energy required in the future, it will boost the potential of the country and increase overall growth in every field.

Background and Basics:

The programming language Python is very useful for the analysis of data in every field.
Python has been used to show the analysis of data in diagrammatic formats such as a pie chart, bar chart, and multiple bar chart.
It also shows a map of India with respect to the intensity of energy consumption as well as the population of India state-wise.
Using machine learning, we have predicted the energy requirement for every state with the Linear Regression algorithm, which relates an outcome to a parameter on which the outcome depends.
The population has been used as the parameter on which energy depends.

Future Use

This program gives a clear idea about the energy requirement in the Future.

Software and Hardware Requirements

Details of software

Python
Anaconda (Spyder) IDE
Required Python Libraries:
Numpy
Pandas
Matplotlib
Tkinter
PIL
mpl_toolkits.basemap

Details of Hardware

Working PC

Methodology

The SUBMIT button on the GUI checks the availability of the state, i.e., it checks that a valid state was entered.

The PIE Chart on the GUI plots the energy resources required, percentage-wise (an illustrative sketch follows below).
The BAR Chart on the GUI plots the energy resources required, percentage-wise.
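An illustrative matplotlib sketch of such a pie chart; the resource names and percentages are placeholders, not the project's data:

```python
# Plot each energy resource's share of a state's requirement, percentage-wise.
import matplotlib.pyplot as plt

resources = ["Coal", "Hydro", "Nuclear", "Renewables"]
share = [62, 15, 3, 20]  # illustrative percentage values

plt.pie(share, labels=resources, autopct="%1.1f%%")
plt.title("Energy resources required, percentage-wise")
plt.show()
```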
Flow of Project
Our project takes a dataset of the population from the year 2013 to 2017 and energy requirements in India per state from the year 2013 to 2016.

Data from the years 2013-16 has been used to train the model using Linear Regression, and the 2017 population data has been used to test the model and predict the future energy requirement.

Both the predicted and the actual energy requirements have been represented using a map of India (the greater the intensity on the map, the higher the energy required for that state), a bar chart, and a pie chart.
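A minimal sketch of this per-state idea, with illustrative (population, energy) values standing in for the project's real dataset:

```python
# Fit Linear Regression on 2013-2016 (population, energy) pairs for one state
# and predict the 2017 requirement from the 2017 population (values illustrative).
import numpy as np
from sklearn.linear_model import LinearRegression

population = np.array([[112.4], [113.4], [114.3], [115.2]])  # 2013-2016, in millions
energy     = np.array([128.0, 131.5, 134.8, 138.2])          # energy requirement

model = LinearRegression().fit(population, energy)
predicted_2017 = model.predict([[116.1]])                    # 2017 population
print(predicted_2017)
```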

Results and Discussion

Pie Chart of Energy Resources of Maharashtra Year 2015
Resource-wise Production of energy

Map of India according to energy consumption

Conclusion

We have used python to show the analysis of data in a diagrammatical format like a Pie Chart, Bar Chart, and Multiple Bar Chart.
It also shows a map of India with respect to the intensity of Energy Consumption as well as the Population of India state-wise.
By using Machine Learning, we have predicted the requirement of the amount of energy in the specified year for each state in India.
Technologies used in the project are Python, Machine learning, and Data Analysis.
This program gives a clear idea about the energy requirement in the Future.

MOODIFY – Suggestion of Songs on the basis of Facial Emotion Recognition Project

Moodify is a song suggester that recommends songs to the user according to their mood. 'Moodify' does the job, leaving the user to get carried away with the music.

I/We, student(s) of B.Tech, hereby declare that the project entitled "MOODIFY (Suggestion of Songs on the basis of Facial Emotion Recognition)" is submitted to the Department of CSE in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in CSE. The roles of the teammates involved in the project are listed below:

  • Training the model for facial emotion recognition.
  • Designing the algorithm for image segregation.
  • Algorithm designing for music player.
  • Graphical user interface designing.
  • Testing the model.
  • Collection of data for model and music player.
  • Preprocessing of the data and images.

Dataset:

The dataset we have used is “Cohn-Kanade”. 
This dataset is restricted, so we cannot provide the actual dataset, but the link for you to download it is:
http://www.consortium.ri.cmu.edu/ckagree/index.cgi
And to read more about the dataset you can refer to:
http://www.pitt.edu/~emotion/ck-spread.htm

Feature Extraction and Selection:

1. Lips
2. Eyes
3. Forehead
4. Nose

These features are processed by CNN layers, selected by the algorithm, and converted to a NumPy array; the model is then trained on that array, and the following three classifications are made.
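As a hedged illustration of this pipeline, here is a minimal Keras CNN of the kind described; the input shape, layer sizes, and three-class softmax (matching the happy/excited/sad moods below) are assumptions, not the project's exact architecture:

```python
# Sketch: a small CNN for three-class facial emotion recognition on
# grayscale face images stored as a NumPy array.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(3, activation="softmax"),  # three emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_split=0.1)
```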

How this project works:

  • First, open the application and choose the mode in which you want to listen to the song
  • Then it shows "YOUR MOOD, YOUR MUSIC"
  • Press "OKAY" to capture the image
  • After that, press "c" to capture
  • "You seem Happy, please select your favorite genre"
  • "You seem Excited, please select your favorite genre"
  • "You seem Sad, please select your favorite genre"

CODE DESCRIPTION

  • All libraries are imported.
  • Model initialization and building.
  • Splitting of train and test sets, and testing.
  • Training our model.
  • Model building, splitting of the test and train sets, and training of the model.
  • Saving a model.
  • Loading a saved model.
  • Saving the image with OpenCV after cropping, loading it, and then making the prediction.
  • Suggesting songs in offline mode.
  • Suggesting songs online (YouTube).
  • Rest of the GUI part.
  • Variable Explorer.

IPython Console

  • Importing Libraries
  • Model Training
  • Model Summary
  • Online Mode
  • Offline Mode

GUI

  • Splash Screen
  • Main Screen
  • Selection screen
  • Display songs and then select them, after that they will play

Summary

We successfully built a model for Facial Emotion Recognition (FER) and trained it to an average accuracy of over 75% across various test sets. We then built a desktop application to suggest songs on the basis of the user's facial expression, completing our project. This FER model can be widely used for various purposes such as home automation, social media, and e-commerce, and we are motivated to take this project to the next level.

Download the complete Project code, report on MOODIFY – Suggestion of Songs on the basis of Facial Emotion Recognition Project

Implementation of E-voting Machine Project using Python and Arduino

INTRODUCTION

Our E-voting Machine project is very useful; it was implemented using Python and Arduino. The user is no longer required to check a register in search of records: after the voting procedure is over, the admin can calculate the total number of votes in just one click, since the entire work is done by computer. The user just needs to enter his/her unique voter ID.

In today's world, no one likes to manually analyze the result after the voting procedure is over, because the process is time-consuming and results usually get delayed. Everyone wants the work to be done automatically by a computer, with the result displayed for further manipulation. So this E-voting Machine project is about providing convenience in voting.

OBJECTIVE

  • Our objective for the E-voting Machine project is to make a user-friendly Electronic Voting Machine that makes the current voting process faster, easier, and error-free.
  • We have used Arduino in our project for the implementation of push buttons and Python as a programming language.

PROBLEM STATEMENT 

The problem statement was to design a module:

  • Which is a user-friendly E-voting Machine
  • Which will restrict the user from accessing other users’ data.
  • Which will ease the calculations and storage of data.
  • Which will help the jury to declare the result without any biasing.

FUNCTIONS TO BE PROVIDED:

The E-voting Machine system will be user-friendly and completely secure, so that users have no problem using all options.

  • The system will be efficient and fast in response.
  • The system will be customized according to needs.

FOR e-VOTING SYSTEM

  • Check
  • Store

SYSTEM REQUIREMENTS

  • Programming Language Used: Python, C
  • Hardware Used: Arduino UNO
  • Components Used: Push buttons, Connecting Wires, Resistances(100k ohm), Breadboard
  • Software Used: Anaconda 2.7.x, Python 2.7.x, Arduino IDE
  • Modules Used: Serial, SQLite, Tkinter, tkMessageBox

WORKING

  • The user has to enter his/her ID in the system.
  • After verifying the voter ID against the details stored in the system, the system will show a message saying whether the user is eligible to vote.
  • A message will be displayed accordingly. The user will then have to press the button next to the name of the candidate he/she wants to vote for.
  • The votes are then stored in the database, and the results are announced accordingly (a sketch of this flow follows below).
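A hedged sketch of this flow using the modules listed above (pyserial and sqlite3); the serial port name, baud rate, and message format are assumptions:

```python
# Read a button press from the Arduino over serial and store the vote,
# rejecting voter IDs that have already voted.
import serial
import sqlite3

conn = sqlite3.connect("votes.db")
conn.execute("CREATE TABLE IF NOT EXISTS votes (voter_id TEXT PRIMARY KEY, candidate TEXT)")

arduino = serial.Serial("COM3", 9600, timeout=5)  # port name depends on the machine

voter_id = input("Enter your voter ID: ")
candidate = arduino.readline().decode().strip()   # e.g. "CANDIDATE_2" sent on button press
try:
    conn.execute("INSERT INTO votes VALUES (?, ?)", (voter_id, candidate))
    conn.commit()
    print("Vote recorded.")
except sqlite3.IntegrityError:
    print("This voter ID has already voted.")     # restricts double voting
```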

FUTURE SCOPE OF THE PROJECT

My project "e-VOTING SYSTEM" will be a great help in conducting voting at various organizations. One major modification that can be made to this project is to add more data about the voters, which would allow complete identification of each voter.

CONCLUSION

From this E-voting Machine project, we can conclude that this program is very useful in conducting the voting procedures smoothly. It provides easy methods to analyze the voting result. It helps in conducting faster, more secure, and more efficient voting. The program can be used per the norms of the voting requirements.

Download the complete project code, report, and PPT on E-voting Machine using Python and Arduino.

Competitive Programming Platform for Students Project Synopsis

Introduction

Most of the major IT corporations are leveraging online coding competitions to judge the pressure handling and fundamentals of upcoming software engineers. This has led to a significant increase in the number of online judges and coding competitions. Most students are now confused about which platform they should opt for and how to approach these coding contests on time, every time. This is where the Competitive Programming Platform comes into the picture.
The Competitive Programming Platform is a collection of extensions, APIs, bots, and web apps aimed at simplifying competitive programming. With this project, students can observe, compare, shortlist, and outperform on these online judges, and compare their improvements and achievements with their peers in a healthy environment. The technologies we'll be using in this Competitive Programming Platform project are Python, Javascript, Node, Flask, Selenium, VueJS, and Tailwind.

Objectives

The main objective is to create a platform on which students can easily select and prepare for online coding competitions in the best possible way.
The key objectives of the Competitive Programming Platform are:
1. Looking at all the competitive profiles at a glance.
2. Get updates about the latest programming contests.
3. Getting all the updates through an email newsletter and push notification.
4. Fetching global and local leaderboards.
5. VS code extension to speed up local development.
6. Chrome extension to view upcoming contests on the go.
7. Standalone REST API.

Methodology

In the first step, we will scrape the data from various resources using a crawler built in Python with Selenium. We will store this data in our database and create a pipeline with a cronjob every six hours.
Now we will deliver all the extracted data through our SPA using VueJS. We will use Workbox 6.0 to convert our SPA into a Progressive Web Application and natively support push notifications.

Web scraping: Web scraping is an automatic method of obtaining large amounts of data from websites.
Cronjobs in recurrent pipelines: A cron job is normally used to schedule a job that is executed periodically. In our case, we use a cronjob to run our Python script, which fetches and extracts unstructured HTML data, validates it, and saves it in our database.
User interfaces: Building user-friendly interfaces that bring meaning to our extracted data and visualize it through various tables, charts, and graphs.
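A hedged sketch of the script such a cronjob might run; the URL, CSS selector, and database schema are illustrative assumptions, not the project's real ones:

```python
# Fetch a contest listing page with headless Selenium, extract the rows,
# and save them to a local database for the pipeline to consume.
import sqlite3
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://example-judge.com/contests")  # hypothetical judge URL
rows = driver.find_elements(By.CSS_SELECTOR, "table.contests tr")
contests = [row.text for row in rows]
driver.quit()

conn = sqlite3.connect("contests.db")
conn.execute("CREATE TABLE IF NOT EXISTS contests (raw TEXT)")
conn.executemany("INSERT INTO contests VALUES (?)", [(c,) for c in contests])
conn.commit()
```

A matching crontab entry for the six-hour schedule could look like `0 */6 * * * python3 scrape_contests.py` (script name hypothetical).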

Work Flow

Facilities required

• Vue, Tailwind, ChartJS, Babel, GSAP, Node
• Flask, Postgre, Selenium, Python
• Git, GitHub, CodeQL, VS Code
• NGINX, PM2, Travis, Certbot

Expected Outcome

• Responsive, minimalistic user interface with a clutter-free user experience.
• Powerful REST API that can power other third-party applications.
• Healthy competitive environment with 'friendly competition' among peers, making competitive programming a constructive habit.

Audio Classification on Cats and Dogs Python Project

Our Audio Classification project illustrates a straightforward audio classification model based on deep learning. We address the problem of classifying the type of sound from short audio signals and their generated spectrograms, classifying dog audio versus cat audio during model training. To meet this challenge, we use a model based on a Convolutional Neural Network (CNN). The audio was processed with Mel-frequency Cepstral Coefficients (MFCC) into what are commonly called Mel spectrograms, and hence transformed into images. Our final CNN model achieved 89% accuracy on the testing dataset.

Project Overview:

The input to our model in this project is cat and dog audio recordings in WAV format. The task lies in the supervised machine learning class: a dataset is present along with a target class, and the intention is to classify whether a given input WAV file is that of a cat or a dog. Dog and cat sounds are quite distinguishable, for example in their pitch and frequency level, and different sounds have different sample rates. By default, Librosa mixes all audio to mono and resamples it to 22050 Hz at load time. Librosa is an open-source Python package for music and audio analysis; it provides the audio data and the sampling rate. Audio in its raw form must be pre-processed to extract significant and meaningful features, so we implemented the MFCC (Mel Frequency Cepstral Coefficients) algorithm. After audio feature extraction is done, the data is fed in and the dataset is split into training and test sets. Then, after preprocessing, a Convolutional Neural Network model is designed using TensorFlow; the Keras API on Google Colab was used for all code and model building.
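A minimal sketch of the extraction step described here, using librosa (the file name is hypothetical; averaging the MFCC frames into one fixed-length vector per clip is one common choice, not necessarily the project's exact pipeline):

```python
# Load a WAV file (librosa defaults to mono, 22050 Hz) and compute MFCC features.
import numpy as np
import librosa

signal, sample_rate = librosa.load("cat_1.wav")            # hypothetical file name
mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=40)
features = np.mean(mfcc.T, axis=0)   # one fixed-length vector per clip
print(features.shape)                # (40,) -> one input row for the network
```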

Motivation

Machine learning can be used in image processing, speech understanding, musical instrument recognition, speech-to-text, environmental sound classification, and much more. For our project, we implemented a class of speech processing, i.e., audio classification: converting sound waves into spectrograms, a visual representation of frequencies, with the help of functions provided by machine learning libraries.

There are many techniques to classify images, as many different built-in neural networks under CNN already exist, especially for image-related tasks. It is straightforward to extract features from images because images already come in the form of numbers: an image is a collection of pixels, and pixels are numbers. When we have data as text, we use sequential encoder- and decoder-based techniques to find features. But sound recognition or audio is more difficult than text because it is based on frequency and time. Therefore a proper model must be made to extract the frequency and pitch of the audio so that it is easier to recognize later.

Flow Chart:

Preliminaries and Background 

Related work

Machine learning: Image classification of cats and dogs. A decade ago, many problems in computer vision had saturated in terms of their precision. However, the accuracy on those problems significantly improved with the growth of deep learning techniques. Image classification is defined as predicting the distinct categories an image can belong to. Hence, for the given input image, with the aim of achieving high precision, a state-of-the-art approach was incorporated: a convolutional neural network was built for the image classification task of dogs and cats. A dataset was taken from Kaggle comprising a total of 25,000 images of dogs and cats.

Machine learning: Audio classification of different bird species. Here, the methodology and results of using deep learning to assist in the classification of birds by their sounds are presented. As birds indicate the health of an ecosystem, this topic is of high importance. Random Forest classification and six custom CNN models from the literature were evaluated on a dataset of ten bird species compiled from xeno-canto.org. The highest accuracy achieved was around 65% by the Random Forest and about 58% for the CNN model.

Conclusion and Future Work

In this report, we first briefly explained the overview of this project and referenced some related, already established work. Then we precisely illustrated our task, including the learning task and the performance task. After that, we explained the approach we took in order to classify the datasets. The approach/model we used is a neural network, a trainable deep model by which we were able to classify the dog and cat audio. The highest accuracy we got was 89.6%.

  1. In the future, we will try to implement different high-level models in order to achieve much higher accuracy.
  2. We'll build a system that can directly take in a live raw audio stream.

Fake Disaster Tweet Detection Web-App Python Machine Learning Project

This project, "Fake Disaster Tweet Detection", aims to help predict whether a tweet is fake or real. It uses the Multinomial Naïve Bayes approach to detect fake or real tweets from existing datasets available on Kaggle. The classifier will be trained only on text data. Traditionally, text analysis is performed using Natural Language Processing (NLP), a field that comes under Artificial Intelligence and whose main focus is on letting computers understand and process human language. NLP helps recognize and predict diseases using speech, and it helps in sentiment analysis, cognitive assistants, spam detection, the healthcare industry, etc. In this project, the training data is pre-processed and then sent to the classifier, and the classifier predicts whether the tweet is real or fake.

This project is made on Jupyter Notebook which is a part of Anaconda Navigator. This project ran successfully on Jupyter Notebook. The dataset was successfully loaded into the notebook. All the extra python packages which were required for project completion were also loaded into the notebook. The model is also deployed successfully using HTML, CSS, python, and flask.

The accuracy score on the test data is 77.977%; the average recall value is 0.775 and the average precision score is 0.775. Precision measures the number of correct positive predictions made by the model. Recall measures the number of correct positive predictions made out of all the positive predictions that could have been made.

System Design

System Flowchart

System Flowchart

Problem: To detect whether a disaster tweet is fake or real using a machine learning algorithm. The concept of Natural Language Processing is used.

Identification of data: In this project, I have used a dataset from a Kaggle competition based on Natural Language Processing. This project works only on text data. The dataset has five columns:

  1. Id: It tells the unique identification of each tweet
  2. Text: It tells the tweet in text form
  3. Location: It tells the place from where the tweet was sent and it can be blank
  4. Keyword: It tells a particular word in the tweet and it can be blank
  5. Target: It tells the actual value of the tweet, whether it's a real tweet or fake

Data-preprocessing: First, preprocessing is done on the dataset, which includes removal of punctuation, URLs, digits, non-alphabetic characters, and contractions; then tokenization, removal of stopwords, and removal of Unicode. Then lemmatization is done on the dataset. After preprocessing, CountVectorizer is used to convert the text data into numerical data, as the classifier only works on numerical data. The dataset is then split into 70% training data and 30% test data (an illustrative preprocessing sketch follows below).
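One reasonable implementation of this cleaning chain; the regexes and NLTK resources are assumptions about the approach, not the project's exact code:

```python
# Clean a tweet: strip URLs, digits and punctuation, lowercase, tokenize,
# drop stopwords, and lemmatize each remaining token.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Requires one-time downloads: nltk.download("punkt"), nltk.download("stopwords"),
# nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)    # remove digits, punctuation, non-alphabets
    tokens = word_tokenize(text.lower())       # lowercase and tokenize
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

print(clean_tweet("Forest fire near La Ronge Sask. Canada http://t.co/x"))
```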

Definition of Training Data: The training dataset which contains 70% of the whole dataset is used for training the model.

Algorithm Section: In this project Multinomial Naïve Bayes classifier algorithm is used for detecting disaster tweets whether they are fake or real.

Evaluation with test set: Several text samples are passed through the model to check whether the classification algorithm gives the correct result or not.

Prediction Model

Implementation Work Details

The data-set which is used in this project “Fake disaster tweet detection” is taken from the Kaggle competition “Natural Language Processing with Disaster Tweets”. The data set contains 7613 samples. This project works only on text data. It has five columns:

  • Id: It tells the unique identification of each tweet
  • Text: It tells the tweet in text form
  • Location: It tells the place from where the tweet was sent and it can be blank
  • Keyword: It tells a particular word in the tweet and it can be blank
  • Target: It tells the actual value of the tweet, whether it's a real tweet or fake

Step 2: Data-Preprocessing

  1. Removing punctuation: punctuation is removed with the help of Python code.
  2. Removing URLs, digits, non-alphabets, and _: True means the tweet contains HTTP, and False means it does not.
  3. Removing contractions: this expands words written in short form, e.g. can't is expanded into cannot, I'll is expanded into I will, etc.
  4. Lowercasing the text, tokenizing it, and removing stopwords: tokenizing means splitting the text into a list of tokens; stopwords are words that do not provide additional meaning to the text.
  5. Lemmatizing: this converts any word into its root form, e.g. running and ran become run.
  6. CountVectorizer:

Text cannot be used to train our model; it has to be converted into numbers that the computer can understand. In this project, CountVectorizer is used for this: it counts the number of times each word appears in a document. CountVectorizer works as follows:

Step 1: It first identifies the unique words in the complete dataset.

Step 2: It then creates an array of zeros for each sample, of the same length as the vocabulary from Step 1.

Step 3: It then takes each word in turn and finds its occurrences in each sample of the dataset. The number of times the word appears in the sample replaces the zero at that word's position in the array. This repeats for every word.

Step 3: Model Used:

In this project, the Multinomial Naïve Bayes approach is used to detect fake or real tweets from the existing dataset available on Kaggle. The Naïve Bayes classifier is based on the probability theorem "Bayes' Theorem" and also assumes conditional independence between every pair of features (a minimal sketch follows below).
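A minimal sketch of this step together with the CountVectorizer conversion, assuming lists `tweets` and `labels` built from the preprocessed dataset:

```python
# Vectorize the cleaned tweets into word counts, split 70/30, and fit
# a Multinomial Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)   # counts of each unique word per tweet
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.30, random_state=1)

clf = MultinomialNB()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```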

System Testing


To evaluate the machine learning model, we normally use classification accuracy, which is the number of correct predictions divided by the total number of predictions.

This accuracy measuring technique works well when there is an equal number of samples belonging to each class in the dataset. The accuracy score on the test data is 77.977%; the average recall value is 0.775 and the average precision score is 0.775, where:

  • Precision = True Positives / (True Positives + False Positives)
  • Recall = True Positives / (True Positives + False Negatives)

Conclusion

In this project, only one classification algorithm is used: Multinomial Naïve Bayes. First, preprocessing is done on the dataset, which includes removal of punctuation, URLs, digits, non-alphabetic characters, and contractions; then tokenization, removal of stopwords, and removal of Unicode. Then lemmatization is done on the dataset. After preprocessing, CountVectorizer is used to convert the text data into numerical data, as the classifier only works on numerical data. The dataset is then split into 70% training data and 30% test data. The accuracy score on the test data is 77.977%; the average recall value is 0.775 and the average F1 score is 0.775.

Future Scope

In the future, some other classification algorithms can also be tried on this dataset like KNN, Support vector machine (SVM), Logistic Regression, and even Deep learning algorithms can also be used which give very high accuracy. Vectorizing can be done using other methods like word2vec, Tf-Idf vectorizer, etc.

Download the Complete Project on Fake Disaster Tweet Detection Web Application Python-based Machine Learning Project.

Covid-19 Outbreak Prediction Using Machine Learning Python Project

The aim of this Covid-19 Outbreak Prediction project is to make a model which will forecast the number of confirmed cases of the Covid-19 virus in the upcoming days. Covid-19 is an infectious disease that is affecting a huge number of people all around the world.

This virus was first identified in Wuhan, China, and later spread throughout the world causing a pandemic that forced most countries to go into lockdown.

Various machine learning models and time series forecasting models are used.

The predictive model will be created using machine learning on a dataset obtained from Kaggle. Machine learning automates the building of analytical models; it is a branch of artificial intelligence based on the principle that systems can learn from data, find patterns, and make decisions.

Time series forecasting will be used which is a type of predictive model. Time series forecasting is the use of a model centered on earlier observed values to evaluate future values. 

INTRODUCTION

The aim of this project is to make a predictive model which will predict the trajectory of the outbreak of the covid-19 virus in the upcoming days. Covid-19 is an infectious disease that is affecting a huge number of people all around the world.

It was first identified in Wuhan, China, and then later spread all over the world causing a pandemic.

Since no vaccine has been developed that is available all throughout the world, we have to take preventive measures to stop the spread of the disease. Since a lockdown cannot last forever, we have to know how fast the spread is and how many more people will be infected.


PRESENT SYSTEM

Various work related to the covid-19 problem is being done. Officials all over the world are using several outbreak prediction models for covid-19 to make informed decisions and implement relevant control measures. Among the standard models for covid-19 global pandemic prediction, simple statistical models have received the greatest attention from authorities. One of the works suggests using SEIR models; SEIR stands for the susceptible-exposed-infected-recovered model.

This model aims to forecast factors like the spread of the disease, the total number of infected, and the span of the outbreak, and to estimate epidemiological parameters like the reproduction number. Such models can illustrate how the outcome of the disease can be affected by various public health measures.

PROPOSED SYSTEM 

In this project, we will first collect and evaluate the dataset. We will transform the raw data into an accessible format and visualize it using data preprocessing. Various machine learning algorithms are used, such as linear regression, polynomial regression, SVM, Holt's linear model, Holt's winter model, the AR model, the ARIMA model, and the SARIMA model. The tools used in this project are mainly sklearn for model selection; the NumPy library, which is used to work with arrays; pandas, which uses a key data structure called a data frame that allows us to store and manipulate tabular data in rows of observations and columns of variables; and matplotlib, a plotting library used to plot graphs. After implementing the models, the model with the least mean square error will be considered the best-fit model.

System Design 

The dataset is first preprocessed and visualized so that it is in a usable format for analysis. After this, we model the data using linear regression, polynomial regression, SVM, Holt's linear model, Holt's winter model, the AR model, the ARIMA model, and SARIMA. Then we evaluate the models and choose the best one according to its root mean squared error.

The flowchart depicts the following

Dataset 

The dataset involves the collection of data from various sources.

Data Pre-processing and visualization 

In order to obtain accurate results, data preprocessing is done to check if there is any inconsistency in the data, if there is it is handled accordingly. We then visualize the data to study the pattern and trends in the data.

Model Building 

Various models are used in this project (an illustrative SARIMA sketch follows below):

Linear Regression
Polynomial Regression
SVM
Holt's Linear Model
Holt's Winter Model
Auto Regressive Model (AR)
Moving Average Model (MA)
ARIMA Model
SARIMA Model
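As one hedged illustration, a SARIMA model can be fit with statsmodels and scored by RMSE; the series name, holdout window, and order parameters below are assumptions, not the project's chosen values:

```python
# Fit SARIMA on a daily confirmed-cases series and score the forecast by RMSE
# (assumes `confirmed` is a pandas Series of daily case counts).
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error

train, valid = confirmed[:-14], confirmed[-14:]   # hold out the last two weeks

model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7)).fit(disp=False)
forecast = model.forecast(steps=len(valid))

rmse = np.sqrt(mean_squared_error(valid, forecast))
print("SARIMA RMSE:", rmse)
```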

DATASET

In this project, the dataset is taken from Kaggle which is the Novel Corona Virus 2019 Dataset and the goal is to study the effect and spread of COVID-19 in the coming days, and conduct predictions and time series forecasting.

Hardware and Software Details 

  • Software Details: Python 3.7 (64-bit), Jupyter Notebook

Implementation work details  

First, the data is pre-processed, visualized, and analyzed. Afterward, various models are trained on the data, and the model with the least root mean squared error is selected as the best-fit model. Various machine learning models and time series forecasting models, such as Holt's linear model and the ARIMA model, are used. The dataset is obtained from Kaggle.

Real-life applications 

It can be used by the government to predict the extent of the spread of the infectious disease and take action accordingly.

Data implementation and program execution 

The data is analyzed and then visualized. Different models are trained on the data, and the one with the least mean square error is considered the best-fit model and can be used for forecasting. The program is executed in a Jupyter notebook.

Output Screens 

Fig: Growth of different types of cases in India

Fig: Confirmed cases Linear Regression Prediction

Fig: Polynomial Regression Prediction for confirmed cases

Fig: SVM regressor Prediction for confirmed cases

Fig: Holts Linear Model Prediction for confirmed cases

Fig: Holt’s Winter model prediction for confirmed cases

Fig: AR model prediction for confirmed cases

Fig: SARIMA model Prediction

System Testing 

In this project, the model evaluation part is very important as by the means of it we can identify which model can best fit the problem.

Here the models are evaluated on the basis of their root mean square error(rmse).

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a commonly used measure of the differences between the values predicted by a model or estimator and the values observed (sample or population values).
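For reference, RMSE over n observations, with predictions ŷᵢ and observed values yᵢ, is written as:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
```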

According to the rmse values of all the models tested in the project, the one with the least rmse value was the SARIMA model. So it can be considered the best fit model for this problem.

Conclusion

It is concluded that machine learning models can be used to forecast the spread of infectious diseases like Covid-19. In this project, we used various algorithms to forecast the rise of confirmed cases. It was observed that, among all the algorithms used, SARIMA had the least RMSE, so it was considered the best-fit model for the available data.

Limitations

Covid-19 is a new virus, so only a year's worth of data is available. Generally, the more data we have, the better the accuracy we get, and we have to keep updating the data.

Scope for future work

It can be implemented such that it updates its graphs and predictions according to real-time values.

Download the Complete project on Covid-19 Outbreak Prediction Using Machine Learning Python Project Code & Report.

Prediction of the growth of Corona Virus Python Project

The upsurge of the CORONA VIRUS disease has created a life-and-death situation in the world. The virus is spreading day by day and affecting lives. Machine learning can be used very effectively to trace the disease, predict its growth, and form an effective strategy to manage its effect. This report gives a full overview of the mathematical computations and modeling used for predicting growth.

In this ML-based Corona Virus Prediction project, we come up with various computations and models to predict the growth seen in a particular dataset. Although the concept can be applied to a dynamic dataset that changes day to day, in this report we study a fixed dataset.

Working on the dataset posed various challenges, such as modeling different machine learning algorithms, but we finally worked through them to get the best result. This report is an insight into the working of the project: descriptive information about machine learning, the algorithms, the statistical description, and, most importantly, the programming language used, which is Python.

INTRODUCTION

This deadly disease is caused by the spread of harmful germs and bacteria (pathogens), which transmit from human to human, from animal to animal, and from animal to human. Early diagnoses are curable, while patients who have suffered from it for a maximum number of days are not 100% curable.

There is a need for innovation in predicting the growth of the virus through deep, thorough analysis of the huge volume of global data on its rise.

The Corona Virus Prediction project comprises two main features: first, predicting and analyzing cumulative confirmed cases and representing them visually through data visualization; second, predicting the growth of total, confirmed, and new cases and measuring accuracy.

PRESENT SYSTEM

Many people are working on the same data with the same idea of predicting the growth of the virus by analyzing cases. The COVID crisis has led many colleges and students to work in teams to find a solution against corona.

There is much ongoing research, and many projects have already been developed for predicting growth and creating awareness of the same.

PROPOSED SYSTEM

Working on the dataset posed various challenges, such as modeling different machine learning algorithms, but we finally worked through them to get the best result. The project provides insight into machine learning, the algorithms used, the statistical description, and the programming language used, which is Python.

System Design 

System Flow Chart

Data Dictionary 

Data Pre-Processing: Our dataset needs to be pre-processed before modeling; therefore, data pre-processing is a required step in this project.

Definition of Training Set: The training set is the data that the algorithm will learn from. Learning looks different depending on which algorithm you are using.

Algorithm Selection: Our project has been implemented using various algorithms such as linear regression, random forest, and decision trees.

Decision Tree: In Python, we use a decision tree to model the training data in a tree structure for future predictions. When the target variable takes continuous values, the decision tree is called a regression tree (a minimal sketch follows below).
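A minimal sketch of a regression tree on synthetic day/case counts; the data here is a stand-in, not the project's dataset:

```python
# Fit a regression tree to a cumulative case series indexed by day number
# and predict the next day's count.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

days = np.arange(1, 61).reshape(-1, 1)                   # day number since first case
cases = np.cumsum(np.random.randint(50, 500, size=60))   # stand-in for real counts

tree = DecisionTreeRegressor(max_depth=4).fit(days, cases)
print(tree.predict([[61]]))                              # predicted cases for day 61
```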

Implementation Work Details 

Libraries used

Numpy

It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated broadcasting functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra capabilities, etc.

Pandas

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

  • Benefits:

Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.

More work is still needed to make Python a first-class statistical modeling environment.

Download the Complete Project on Prediction of the growth of Corona Virus Python Project Code and Report