Covid-19 Outbreak Prediction Using Machine Learning Python Project

The aim of this Covid-19 Outbreak Prediction project is to build a model that forecasts the number of confirmed COVID-19 cases in the upcoming days. COVID-19 is an infectious disease that is affecting a huge number of people all around the world.

This virus was first identified in Wuhan, China, and later spread throughout the world causing a pandemic that forced most countries to go into lockdown.

Various machine learning models and time series forecasting models are used to build the predictive model.

The predictive model will be created using machine learning, with the dataset obtained from Kaggle. Machine learning automates the building of analytical models. It is a branch of artificial intelligence based on the principle that systems can learn from data, identify patterns, and make decisions.

Time series forecasting, a type of predictive modelling, will be used. Time series forecasting is the use of a model built on earlier observed values to estimate future values.

INTRODUCTION

The aim of this project is to build a predictive model that predicts the trajectory of the COVID-19 outbreak in the upcoming days. COVID-19 is an infectious disease that is affecting a huge number of people all around the world.

It was first identified in Wuhan, China, and then later spread all over the world causing a pandemic.

Since no vaccine is yet available all throughout the world, we have to take preventive measures that can stop the spread of the disease. Since a lockdown cannot last forever, we have to know how fast the spread is and how many more people will be infected.

The predictive model will be created using machine learning, with the dataset obtained from Kaggle. Machine learning automates the building of analytical models. It is a branch of artificial intelligence based on the principle that systems can learn from data, identify patterns, and make decisions.

Time series forecasting, a type of predictive modelling, will be used. Time series forecasting is the use of a model built on earlier observed values to estimate future values.

PRESENT SYSTEM

A considerable amount of work related to COVID-19 prediction is being done. Officials all over the world are using several outbreak prediction models for COVID-19 to make informed decisions and implement relevant control measures. Among the standard models for COVID-19 global pandemic prediction, simple statistical models have received greater attention from authorities. One line of work suggests using SEIR models, where SEIR stands for the Susceptible-Exposed-Infectious-Recovered model.

This model aims to forecast factors such as the spread of the disease, the total number of infected people, and the span of an outbreak, and to estimate epidemiological parameters such as the reproduction number. Such models can illustrate how various public health measures may affect the outcome of the disease.

PROPOSED SYSTEM 

In this project, we will first collect and evaluate the dataset. We will transform the raw data into an accessible format and visualize it as part of data preprocessing. Various machine learning algorithms such as linear regression, polynomial regression, SVM, Holt's linear model, Holt's winter model, the AR model, the ARIMA model, and the SARIMA model are used. The tools used in this project are mainly sklearn for model selection; the NumPy library, which is used to work with arrays; pandas, whose key data structure, the data frame, lets us store and manipulate tabular data in rows of observations and columns of variables; and matplotlib, a plotting library used to draw graphs. After implementing the models, the model with the least root mean square error will be considered the best-fit model.

System Design 

The dataset is first preprocessed and visualized so that it is in a usable format for analysis. After this, we model the data using linear regression, polynomial regression, SVM, Holt's linear model, Holt's winter model, the AR model, the ARIMA model, and the SARIMA model. Then we evaluate the models and choose the best one according to its root mean square error.

The flowchart depicts the following stages:

Dataset 

The dataset involves the collection of data from various sources.

Data Pre-processing and visualization 

In order to obtain accurate results, data preprocessing is done to check whether there is any inconsistency in the data; if there is, it is handled accordingly. We then visualize the data to study the patterns and trends in it.
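
The exact preprocessing code is not shown in this report, but a minimal sketch of this step might look as follows (the file name covid_19_data.csv and the ObservationDate/Confirmed column names are assumptions based on the Kaggle Novel Corona Virus 2019 Dataset):

import pandas as pd
import matplotlib.pyplot as plt

# Load the Kaggle-style daily report file and parse dates
df = pd.read_csv("covid_19_data.csv", parse_dates=["ObservationDate"])

# Check for inconsistencies / missing values before analysis
print(df.isnull().sum())
df = df.dropna(subset=["Confirmed"])

# Aggregate confirmed cases per day and plot the trend
daily = df.groupby("ObservationDate")["Confirmed"].sum()
daily.plot(title="Cumulative confirmed COVID-19 cases over time")
plt.xlabel("Date")
plt.ylabel("Confirmed cases")
plt.show()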

Model Building 

Various models are used in this project (a short sketch of fitting two of them follows the list):

  • Linear Regression
  • Polynomial Regression
  • SVM
  • Holt's Linear Model
  • Holt's Winter Model
  • Auto Regressive Model (AR)
  • Moving Average Model (MA)
  • ARIMA Model
  • SARIMA Model
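
As an illustration only (not the project's exact code), two of these models could be fitted with statsmodels as below, assuming daily is the pandas Series of cumulative confirmed cases built during preprocessing and using a simple 95%/5% train-validation split; the ARIMA order (2, 1, 2) is an arbitrary placeholder, not a tuned value:

from statsmodels.tsa.holtwinters import Holt
from statsmodels.tsa.arima.model import ARIMA

# Hold out the last 5% of days for validation
split = int(len(daily) * 0.95)
train, valid = daily[:split], daily[split:]

# Holt's linear trend model
holt_fit = Holt(train).fit()
holt_pred = holt_fit.forecast(len(valid))

# ARIMA model with an illustrative (p, d, q) order
arima_fit = ARIMA(train, order=(2, 1, 2)).fit()
arima_pred = arima_fit.forecast(len(valid))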

DATASET

In this project, the dataset is taken from Kaggle, namely the Novel Corona Virus 2019 Dataset. The goal is to study the effect and spread of COVID-19 in the coming days and to conduct predictions and time series forecasting.

Hardware and Software Details 

  •  Software Details: Python 3.7 (64-bit), Jupyter Notebook

Implementation work details  

First, the data is preprocessed, visualized, and analyzed. Afterward, various models are trained on the data, and the model with the least root mean squared error is selected as the best-fit model. Various machine learning models are used, along with time series forecasting models such as Holt's linear model and the ARIMA model. The dataset is obtained from Kaggle.

Real-life applications 

It can be used by the government to predict the extent of the spread of the infectious disease and take action accordingly.

Data implementation and program execution 

The data is analyzed and then visualized. The data is trained on different models, and the one with the least root mean square error is considered the best-fit model and can be used for forecasting. The program is executed in a Jupyter notebook.

Output Screens 

Fig: Growth of different types of cases in India

Fig: Confirmed cases Linear Regression Prediction

Fig: Polynomial Regression Prediction for confirmed cases

Fig: SVM regressor Prediction for confirmed cases

Fig: Holts Linear Model Prediction for confirmed cases

Fig: Holt’s Winter model prediction for confirmed cases

Fig: AR model prediction for confirmed cases

Fig: SARIMA model Prediction

System Testing 

In this project, the model evaluation part is very important, as by means of it we can identify which model best fits the problem.

Here the models are evaluated on the basis of their root mean square error (RMSE).

The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a commonly used measure of the differences between the values predicted by a model or estimator and the values actually observed.

According to the RMSE values of all the models tested in the project, the one with the least RMSE was the SARIMA model, so it can be considered the best-fit model for this problem.
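
A minimal sketch of this ranking step, assuming the holt_pred and arima_pred forecasts and the valid hold-out series from the earlier sketch (forecasts from any other fitted models would be added to the same dictionary):

import numpy as np
from sklearn.metrics import mean_squared_error

predictions = {"Holt's linear": holt_pred, "ARIMA": arima_pred}

# Compute RMSE for each model and pick the smallest
rmse_scores = {
    name: np.sqrt(mean_squared_error(valid, pred))
    for name, pred in predictions.items()
}
best_model = min(rmse_scores, key=rmse_scores.get)
print(rmse_scores)
print("Best fit model:", best_model)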

Conclusion

It is concluded that machine learning models can be used to forecast the spread of infectious diseases like COVID-19. In the project, we used various algorithms to forecast the rise in confirmed cases. It was observed that among all the algorithms used, SARIMA had the least RMSE, so it was considered the best-fit model for the data that was available.

Limitations

Since it is a new virus, only about a year's worth of data is available. Generally, the more data we have, the better accuracy we get, so we have to keep updating the data.

Scope for future work

 It can be implemented such that it can update its graphs or predictions according to real-time values.

Download the Complete project on Covid-19 Outbreak Prediction Using Machine Learning Python Project Code & Report.

Prediction of the growth of Corona Virus Python Project

The upsurge of the coronavirus disease has created a life-and-death situation across the world. The virus is spreading day by day and affecting lives. Machine learning can be applied very effectively to trace the disease, predict its growth, and form an effective strategy to manage the effect of the virus. This report gives a full overview of the mathematical computation and modeling used for predicting the growth.

In this Corona Virus Prediction ML-based project, we use various computations and models to predict the growth of cases from a particular dataset. Although this approach can be applied to a dynamic dataset that changes from day to day, in this report we study a fixed dataset.

Working on the dataset posed various challenges, such as modeling different machine learning algorithms, but we finally worked through them in order to get the best result. This report is an insight into the working of the project, including descriptive information about machine learning, the algorithms, the statistical description, and, most importantly, the programming language used here, which is Python.

INTRODUCTION

This deadly disease is caused by a pathogen, the SARS-CoV-2 virus, which transmits from one human to many humans, from one animal to many, and from animals to humans. Cases diagnosed early are curable, while patients who have been suffering from it for many days are not 100% curable.

There is a need for innovation in predicting the growth, with a deep, thorough analysis of the huge global data on the rise of the virus.

The Corona Virus Prediction project comprises two main features or methods: first, predicting and analyzing cumulative confirmed cases and representing them visually through data visualization; second, predicting the growth of total, confirmed, and new cases and measuring the accuracy.

  • PRESENT SYSTEM

Many people are working on the same data and with the same idea of predicting the growth of the virus by analyzing cases. The COVID crisis has led many colleges and students to work in teams on solutions against corona.

There is much ongoing research, and many projects have already been developed for predicting the spread and creating awareness of the same.

  • PROPOSED SYSTEM

Working on the dataset posed various challenges, such as modeling different machine learning algorithms, but we finally worked through them in order to get the best result. This section is an insight into the working of the project, including descriptive information about machine learning, the algorithms, the statistical description, and, most importantly, the programming language used here, which is Python.

System Design 

System Flow Chart

Data Dictionary 

Data Pre-Processing: The raw dataset needs to be cleaned and transformed into a usable format, so data pre-processing is required in this project.

Definition of Training Set: The training set is the data that the algorithm will learn from. Learning looks different depending on which algorithm you are using.

Algorithm Selection: Our project has been implemented using various algorithms such as linear regression, random forest, and decision trees.

Decision Tree: In Python, we use a decision tree to model the training data in a tree structure that can be reused for future predictions. When the target variable takes continuous values, the decision tree is called a regression tree.
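
A small sketch of a regression tree in scikit-learn, assuming a feature matrix X (for example, the day number) and a target y (case counts) have already been prepared; this is illustrative only, not the project's exact code:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split the prepared data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a regression tree with a limited depth to avoid overfitting
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print("R^2 on test data:", tree.score(X_test, y_test))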

Implementation Work Details 

Libraries used

Numpy

It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Pandas

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

  • Benefits:

Python has long been great for data munging and preparation, but less so for data analysis and modeling. Pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.

More work is still needed to make Python a first-class statistical modeling environment.
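
As a small illustration of the pandas workflow described above (the cases.csv file and its date/country/confirmed columns are assumptions, not the project's actual data):

import numpy as np
import pandas as pd

cases = pd.read_csv("cases.csv", parse_dates=["date"])

# Clean, aggregate, and derive new columns without leaving Python
cases = cases.dropna(subset=["confirmed"])
per_country = cases.groupby("country")["confirmed"].max().sort_values(ascending=False)
cases["log_confirmed"] = np.log1p(cases["confirmed"])

print(per_country.head(10))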

Download the Complete Project on Prediction of the growth of Corona Virus Python Project Code and Report

Corona Virus Prediction and Analysis Machine Learning Project

1.  Introduction 

Background 

Currently, many people are being affected by the coronavirus. It started in China and is now spreading all over the world. Until now, there is no medicine for this virus, and it is killing a huge number of people. So, it is a big question for all of us how many people are going to be affected.

Problem Statement 

Currently, there is no application that can predict the spread of the coronavirus for the next 30 days. So, with this project, we would like to create awareness among people by showing them how corona cases will rise over the next 30 days, so that they can take preventive measures such as staying indoors.

Project Goal 

The main objective of this Corona Virus Prediction project is :

  • Future prediction of the increase/decrease in the number of active Coronavirus cases for the next 30 days – for the whole world as well as for the United States of America. We have chosen the USA among all the countries as it is the most highly affected country due to corona.
  • Future prediction of the increase/decrease in the number of deaths due to Coronavirus for the next 30 days.
  • Future prediction of the increase/decrease in the number of recovered cases due to Coronavirus for the next 30 days.

2.  Literature Review

There was an outbreak of corona in early December 2019, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which belongs to the family of SARS viruses. Many governments all over the world are issuing their own preventive measures to control the spread of the coronavirus. So, we have conducted a literature review regarding this virus, based on the information that is publicly available.

Background of Literature Review:

China alerted the WHO on 31st December 2019 that many people in Wuhan City were reported to be suffering from pneumonia. They reported that it started on Dec 8th, 2019, and that there was an increasing number of patients who were working or living around the Huanan Seafood Wholesale Market.

When we started working on this project at the start of February, the coronavirus was majorly prevalent in China. Initially, at the time of our project proposal, the mortality rate in China among all confirmed cases was around 1.2% as of February 2020, and the mortality rate in all other countries was only around 0.2%. Among the patients who were admitted to hospitals, the mortality rate was around 11%. COVID-19 is spreading with great speed, and now there is a relatively very high mortality rate.

A Way to Further Research :

So, we performed this literature review to analyze the spread of the coronavirus. After analyzing how rapidly it is spreading all over the world, we thought of performing our own predictions regarding this virus, so as to make people aware of its spread; with this awareness, they can take their own preventive measures so that they do not fall prey to this dangerous virus.

We had a very small amount of data when we started this project. It is a very trending topic all over the world, and a huge number of people are losing their lives due to this virus. So, we were very curious to analyze this pandemic, and that is why we took up this project.

We found many sources for collecting data regarding corona cases, including Kaggle, Johns Hopkins, etc. We chose the dataset from Johns Hopkins, as it is updated on a daily basis. So, we collected the data and performed our own future predictions.

3. Methodology 

Approach

 So, basically, we have followed the below approach to kick-start our Corona Virus Prediction project:

  1. Firstly, we started with research on choosing the datasets. After researching various datasets, we finalized the Johns Hopkins dataset, as it gives us live data on the coronavirus.
  2. Secondly, we have collected the data and performed our preprocessing operation, so as to make our data ready for future predictions.
  3. Next, coming to choosing the machine learning algorithm: we have chosen appropriate machine learning algorithms (discussed below).
  4. Finally, we have performed our predictions to analyze the active cases, deaths, and recoveries for the next 30 days, based on the data available from the datasets and the chosen machine learning algorithm (a rough sketch of this step follows the list).
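
Since the report's conclusion mentions SVR for the worldwide predictions, a rough sketch of this prediction step could look as follows; confirmed is assumed to be a 1-D array of daily worldwide confirmed-case totals, and the kernel settings are placeholders rather than the project's tuned values:

import numpy as np
from sklearn.svm import SVR

# Use the day index as the single feature
days = np.arange(len(confirmed)).reshape(-1, 1)

svr = SVR(kernel="poly", degree=3, C=0.1, epsilon=1.0)
svr.fit(days, confirmed)

# Predict the next 30 days
future_days = np.arange(len(confirmed), len(confirmed) + 30).reshape(-1, 1)
forecast = svr.predict(future_days)
print(forecast)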

Figure: Approach

4.  Implications 

Benefits of the Project: 

  • This Corona Virus Prediction project helps in the prediction of coronavirus cases for the next 30 days, all over the world.
  • With this, we can also predict the increase in corona cases in the world.
  • By this, we can know how fast the coronavirus is spreading all over the world.
  • We can create awareness among people.
  • We can also create awareness in government so that they can take preventive measures to stop the spread of corona.

Lessons Learned:

Initially, I had no idea about machine learning algorithms. I started learning about machine learning from scratch. I bought some Udemy tutorials, and through them, I learned everything step by step. At the start of the project, I was not even aware of which machine learning algorithm to use.

It was really an exciting experience doing this project. I am inspired to take up a Machine Learning Course for my next semester to learn deeply about Machine Learning Algorithms.

I tried my level best and contributed my 100% to this project.

Now, I came to know about machine learning, different types of machine learning Algorithms, and the differences between classification and regression algorithms -when to use what, creating test and train sets, building up the model, choosing the appropriate parameters, and performing future predictions. In the future, I would also love to take up a project related to Classification Algorithms.

5.  Conclusion 

  • Finally, to conclude, we have performed prediction using SVR and Polynomial Regression Algorithm.
  • SVR predictions are mainly for predicting the world case scenario, which includes confirmed, death, and recovered cases.
  • Polynomial Regression is used for the prediction of US Cases.
  • Based on the results, we believe that our predictions were almost accurate, with some small differences from the actual values.
  • This project can be further scaled to include predictions for various individual countries.

6.  Appendix 

  • We have used Google Colab for our project. As we are a two-member team, we chose this because it enables us to work on the project simultaneously from different locations.
  • No installation is required.
  • We just need to have a Google account, and we can easily create a Google Colaboratory file in our Google Drive, just like Google Docs.
  • We will provide both the .py file and the .ipynb file along with this report, so that they can be run on Google Colab.
  • The .ipynb file can be uploaded to Google Colab directly, and the results of the project can be easily checked.

Intelligent Customer Help Desk with Smart Document Understanding

INTRODUCTION

Overview

We will be designing an application that leverages multiple Watson AI services (Discovery, Assistant, Cloud Functions, and Node-RED). By the end of the project, we'll learn best practices for combining Watson services and how they can be used to build interactive information retrieval systems with Discovery + Assistant.

  • Project Requirements: Python, IBM Cloud, IBM Watson
  • Functional Requirements: IBM Cloud
  • Technical Requirements: AI, ML, WATSON AI, PYTHON
  • Software Requirements: Watson Assistant, Watson Discovery

Scope of Work

  • Create a customer care dialog skill in Watson Assistant
  • Use Smart Document Understanding to build an enhanced Watson Discovery collection
  • Create an IBM Cloud Functions web action that allows Watson Assistant to post queries to Watson Discovery

Proposed solution

For the above problem, we will put a virtual agent in a chatbot so it can understand the queries posted by customers. The virtual agent should be trained on some insight documents based on the company background so it can answer queries about the product or related to the company. In other words, some style of manual will be used to train the bot using AI. Here Watson Discovery is used as the tool for implementing AI, trained on the owner's manual.
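
A rough sketch of the Cloud Functions web action that forwards a customer query from Watson Assistant to Watson Discovery; it assumes the ibm-watson Python SDK with a Discovery v1 environment and collection, and all credential and ID values are placeholders supplied through the action's parameters:

from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

def main(params):
    # Authenticate against the Discovery instance using values passed in params
    authenticator = IAMAuthenticator(params["discovery_apikey"])
    discovery = DiscoveryV1(version="2019-04-30", authenticator=authenticator)
    discovery.set_service_url(params["discovery_url"])

    # Forward the user's question as a natural-language query
    response = discovery.query(
        environment_id=params["environment_id"],
        collection_id=params["collection_id"],
        natural_language_query=params.get("input", ""),
        count=3,
    ).get_result()

    # Return the top results so Watson Assistant can show them to the user
    return {"results": [r.get("text", "") for r in response.get("results", [])]}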

THEORETICAL ANALYSIS

Block/Flow Diagram

Hardware / Software Designing

  1. Create IBM Cloud services
  2. Configure Watson Discovery
  3. Create IBM Cloud Functions action
  4. Configure Watson Assistant
  5. Create flow and configure the node
  6. Deploy and run Node-Red app

EXPERIMENTAL INVESTIGATIONS

Create IBM Cloud services

Create the following services:

  • Watson Discovery
  • Watson Assistant
  • IBM cloud function
  • Node-Red

Advantages

  • Companies can deploy chatbots to rectify simple and general human queries.
  • Reduces manpower
  • Cost efficient
  • No need to divert calls to customer agents, and customer agents can focus on other, more complex issues.

Disadvantages:

  • Sometimes chatbots can mislead customers
  • Giving the same answer for different sentiments.
  • Sometimes cannot connect to customer sentiments and intentions

APPLICATIONS

  • It can be deployed in popular social media applications like Facebook, Slack, and Telegram.
  • A chatbot can be deployed on any website to clarify the basic doubts of viewers.

CONCLUSION

By following the above procedure, we successfully created an intelligent help desk chatbot using Watson Assistant, Watson Discovery, Node-RED, and Cloud Functions.

FUTURE SCOPE

We can include the Watson Text to Speech and Speech to Text services to access the chatbot hands-free. This is one of the future scopes of this project.

Airbnb User Bookings Prediction Project Synopsis

Airbnb User Bookings Synopsis

1. Objective of work

The main objective of this project is to predict where a new guest will book their first travel experience.

2. Motivation

This project helps Airbnb better predict demand and make informed decisions accordingly. Previously, a new user was overwhelmed by the various choices available for a perfect vacation or stay.

By predicting where a new user will book their first travel experience, the company is better able to inform its users by sharing personalized content with their community. It will drastically decrease the time to first booking, which will increase the company's output and help it gain popularity among its users and an edge over its competitors in the market.

3. Target Specifications if any

Predicting where a new guest books their first travel experience. 

4. Functional Partitioning of the project

4.1 Research and gaining knowledge

Undertaking various courses and familiarizing ourselves with the working process of Data Science problems. Exposure and exploration of the Kaggle website, understanding kernels, and datasets. Learning the prerequisites: programming in Python, and Pandas along with Machine Learning algorithms and data visualization methods.

4.2 Frequent Discussions and Guidance

Frequent discussions with our mentor, along with his guidance, will allow us to work in the right direction and make informed decisions.

 4.3 Applying the knowledge gained

After much exposure to this field and gaining the knowledge, we will now apply our skills to real-life problems and contribute to society.

5. Methodology

5.1 Using the Kaggle platform

In the test set, we will predict all the new users whose first activities are after 7/1/2014. In the sessions dataset, the data only dates back to 1/1/2014, while the users dataset dates back to 2010. We take the help of the Kaggle platform for testing out datasets, as it is not feasible to store a large dataset, say 1 TB, on a local machine.

5.2 Working on the dataset

Using the dataset, we study various patterns in users' first bookings after signing up with Airbnb from different countries, and then plot the observed and collected information. We can then apply various machine learning algorithms and calculate prediction scores. Finally, we choose the algorithm with the highest score to recommend to users from a given country the destinations that have been frequently chosen by travelers belonging to that region.
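
A simplified sketch of this scoring step (not the final solution): it assumes a prepared, numeric feature matrix X and the country_destination labels y from the training users file, and compares two illustrative classifiers by cross-validated accuracy:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Score each candidate model and keep the one with the highest accuracy
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(name, round(score, 4))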

5.3 Submitting our work on the Kaggle platform

The result can now finally be uploaded on the platform and be used by Airbnb to better connect with their users.

6. Tools required

6.1 Kaggle Kernels

Kaggle is a platform for doing and sharing Data Science. Kaggle Kernels are essentially Jupyter notebooks in the browser that can be run right before your eyes, all free of charge. The processing power for the notebook comes from servers in the cloud, not our local machine allowing us to experience Data Science and Machine Learning without burning through the laptop’s battery and space.

6.2 Dataset

Airbnb will be providing us with the dataset, which would contain:

  • csv-the training set of users
  • csv-the test set of users
  • csv-web sessions log for users
  • csv-summary statistics of destination countries in this dataset and their locations
  • csv-summary statistics of users’ age group, gender, and country of destination.
  • csv-correct format for submitting our predictions

7. Work Schedule

(a) January

Enroll and start the course on Machine Learning using Kaggle. Start recapitulating the basics of Python and its various libraries such as NumPy, pandas, etc.

(b) February

End course and start analyzing the dataset

(c) March

Start coding and implementing various algorithms for the prediction

(d) April

Pick the final algorithm by trial and test and finish coding

(e) May

Appropriate documentation and upload our solution

Detection of Currency Notes and Medicine Names for the Blind People Project

OBJECTIVE:-

We have seen blind people in our society facing many problems, such as being cheated with fake currency notes. So, we have come up with solutions for some of the problems they face. As they are blind, they are not able to read a medicine's name and always depend on another person for help. Some people take advantage of their disability and cheat them by taking extra money or by giving them less money. With this Currency Notes Detection project, we are making them independent in these respects.

METHODOLOGY: –

To overcome these problems of blind people, we have come up with an innovative idea that makes use of machine learning, image processing, OpenCV, text-to-speech, and OCR technologies to make their lives more comfortable.
In this Currency Notes Detection project, we use a camera to get the input, where the inputs are pictures of medicines and currency. These images are manipulated using image processing and OpenCV. Once the processed image is obtained, it is cropped and thresholding is done. In the next stage, we extract the name of the medicine and then convert that text into speech using text-to-speech technology.

Similarly, we also take pictures of currency, and then, using image processing and machine learning, we compare the picture with a predefined database of currency that we have already prepared. The next step is to convert the value of the currency into text, and then the text is converted into speech using text-to-speech technology.
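
An illustrative sketch of the medicine-name half of this pipeline, assuming the opencv-python, pytesseract (with the Tesseract engine installed), and pyttsx3 packages; the image file name is a placeholder:

import cv2
import pytesseract
import pyttsx3

# Read the camera image, convert to greyscale and threshold it
image = cv2.imread("medicine.jpg")
grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Extract the printed text (the medicine name) with OCR
text = pytesseract.image_to_string(binary)

# Speak the extracted text back to the user
engine = pyttsx3.init()
engine.say(text if text.strip() else "No text detected")
engine.runAndWait()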

Block Diagram: –

Technology Used:

  • Image Processing: To extract necessary information
  • OpenCV: To threshold images, shift colors, scan and crop, set grey levels, and extract contours
  • Python 3: To set up the environment and interact with devices
  • OCR (Optical Character Recognition)
  • Machine Learning: A classifier is trained on the prepared currency image database so that note denominations can be recognized.

Results

The Detection of Currency Notes and Medicine Names for the Blind People Project can help a blind person detect currency notes and medicine names. With this, a blind person can take care of himself without the help of any caretakers. This would make their life easier and simpler. The talk-back feature used would help them access the application easily without any complications.

  • This project would help blind people to detect the proper currency that they have received or that they need to give, without being cheated by receiving the wrong currency or giving the wrong currency. This would make them economically stable and strong.
  • Not only in currency detection, this project would also help blind people recognize the name of a tablet and know how many doses they need to take as per the name of the tablet.

This Currency Notes Detection project would help blind persons both economically and from the perspective of health. It would make their life easier and make them confident.

Applications

  • Blind persons will be able to recognize the correct currency without getting cheated in any type of money transaction.
  • Blind persons always need not be dependent on others to know which medicines they need to take at a particular time.

Advantages

  • This project works on mobile phones only; there is no need to buy anything extra.
  • This work is implemented using TalkBack for android and Voiceover for iOS which means blind people can easily access the application.
  • Easy to set up.
  • Open-source tools were used for this project.
  • Accessible to all devices irrespective of the OS.
  • Cheap and cost-efficient.

Disadvantages

  • It is very difficult to determine whether the currency is a fake one when it is an exact copy of the real currency.
  • For the medicine part, the image should be taken from any side where the name of the medicine is written.

Conclusion

This work shows how visually impaired people (blind persons) can protect themselves from getting cheated in money transactions, and also how to reduce their dependency on other people to take the right amount of medicine at the right time. Whenever the blind person takes an image using their phone camera, the image is compared with the dataset which has been created.

After comparing the image, if the accuracy is above the threshold value, the system gives spoken feedback to the person stating the value of the currency. Similarly, in the case of medicine detection, it extracts the name of the medicine and gives spoken feedback on how many times that person needs to take the medicine, thus making this work one of the assistants for a blind person.

Future Scope

• By including a dataset of photos that contain people's images, it could also be used to recognize a person whom the blind person meets.
• It can also be used to track the blind person using GPS.

Used Car Price Prediction AI / Machine Learning Project using Python

Abstract

Used car price prediction using AI / machine learning techniques has piqued researchers' interest, since producing a reliable estimate takes a significant amount of work and expertise on the part of a field expert. For a dependable and accurate forecast, a large number of distinct attributes are analyzed. We employed 6 different machine learning approaches to develop a model for forecasting the price of used automobiles.

Problem statement

With the coronavirus impact on the market, we have seen a lot of changes in the vehicle market. Some vehicles are now in demand, making them expensive, while others are not popular and consequently cheaper. With this change in the market due to the COVID-19 effect, people and sellers are facing issues with their previous car price valuation AI / machine learning models, so they are looking for new models built from new data. Here we build the new car price valuation model.

The primary aim of this Used Car Price Prediction AI / Machine Learning Project is to create a dataset with the help of web scraping and to predict the price of a used car given various features.

The objective of the Project:

1. Data Collection: To scrape the data of at least 5000 used cars from various websites like Olx, cardekho, cars24, auto portal, cartrade, etc.
2. Model Building: To build a supervised machine learning model for forecasting the value of a vehicle based on multiple attributes.

Motivation Behind the Project:

There are a few major worldwide multinational participants in the automobile sector, as well as several merchants. By trade, international companies are mostly manufacturers, although the retail industry includes both new and used automobile dealers. The used automobile market has seen a huge increase in value, resulting in a bigger percentage of the entire market. In India, about 3.4 million automobiles are sold each year on the secondhand car market.

Collecting the data

We have scraped the data for over 5000 cars using a Selenium script from 4 different websites covering different locations around the country. The websites are as follows:
1. OLX
2. Cars24
3. CarDekho
4. Autoportal

There are 9 columns:

1. 'Brand & Model': It gives us the brand of the car along with its model name and manufacturing year

2. 'Variant': It gives us the variant of the particular car model

3. 'Fuel Type': It gives us the type of fuel used by the car

4. 'Driven Kilometers': It gives us the total distance in km covered by the car

5. 'Transmission': It tells us whether the gear transmission is manual or automatic

6. 'Owner': It tells us the number of previous owners the car has had

7. 'Location': It gives us the location of the car

8. 'Date of Posting Ad': It tells us when the advertisement for selling that car was posted online

9. 'Price (in ₹)': It gives us the price of the car.

Here ‘Price (in ₹)’ is our target variable.

Reading the dataset

Now we read the dataset into Pandas and since the target column ‘Price’ is of integer data type, we will apply regression algorithms to it.

Data Cleaning

We check for null values and find that there are a few in the column 'Variant'; we treat them with the mode.
Since all the features are categorical, we need not check for outliers and skewness.

Exploratory data analysis

First, we plot the boxplot and distribution plot for the target variable, and find that the few outliers need not be treated and that the data is tightly distributed with an almost normal distribution.
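
A brief sketch of these cleaning and EDA steps (the file name used_cars_scraped.csv and the simplified column names Variant and Price are assumptions; the scraped file's actual column names may differ):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

cars = pd.read_csv("used_cars_scraped.csv")

# Fill the few missing 'Variant' values with the mode
cars["Variant"] = cars["Variant"].fillna(cars["Variant"].mode()[0])

# Boxplot and distribution plot for the target variable
sns.boxplot(x=cars["Price"])
plt.show()
sns.histplot(cars["Price"], kde=True)
plt.show()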

Bar graph

Since Brand, Variant, Driven Kilometers, and Location have a wide range of values, we do not perform bivariate analysis for them, as it would not give us any specific details. By plotting the graphs of Fuel Type, Transmission, and Owner against Price, we conclude that a car that uses diesel, has automatic transmission, and has had only 1 owner is more likely to have a high price.

Model building

The models used on the training and testing datasets are as follows (a brief comparison sketch is given after the list):

  • SVR
  • Linear Regression
  • SGD Regressor
  • KNeighbors Regressor
  • Decision Tree Regressor
  • Random Forest Regressor

Only the Decision Tree Regressor and Random Forest Regressor perform well, giving an accuracy of 80.2% and 87.7%, respectively.
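
A hedged sketch of this comparison step; it assumes an encoded feature matrix X and the price target y have already been prepared from the cleaned dataframe:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit and score the two best-performing regressors
for model in (DecisionTreeRegressor(random_state=42),
              RandomForestRegressor(n_estimators=200, random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "R^2:", round(model.score(X_test, y_test), 3))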

Final model

The accuracy of the final model 'PriceCar' (a Random Forest Regressor) after hyperparameter tuning is found to be 87.79%, and its score is 0.98, which is quite good.
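
The tuning itself could be done with a grid search along these lines; the grid values below are illustrative assumptions, not the parameters actually used for the final model:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

# Search the grid with 5-fold cross-validation and report the best setting
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring="r2", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)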

Conclusion

Here, we can see that all the predicted prices are either equal or nearly equal to the original prices of the cars. Hence we conclude that our model 'PriceCar' is working very well, and we save it for further use.

Limitations of this work and Scope for Future Work

As part of future work, we aim to widen the choice of algorithms used in the project. We could only explore two algorithms in depth, whereas many other algorithms exist and might be more accurate. More specifications could also be added to the system to provide more accuracy in price prediction, i.e.:
1) Horsepower
2) Battery power
3) Suspension
4) Cylinder
5) Torque

As we know, technology is improving day by day and there are also advancements in car technology, so our next upgrade will include hybrid cars, electric cars, and driverless cars.

Download Used Car Price Prediction AI / Machine Learning Project using Python. For more details about the project feel free to contact the developer at github

Predictive Analytics for Retail Banking Machine Learning Python Project

Retail banking

Retail banking is typical mass-market banking in which local clients use local branches of large commercial banks. The services offered include checking and savings accounts, mortgages, personal loans, and debit/credit cards. The focus is on the individual customer.

The main problems in this sector are:

Which product is right to recommend to the customer?
What is the best time to sell a product?
What is the most effective channel to communicate with the customer?

PROBLEM STATEMENT

The data are related to the direct marketing campaigns of a banking institution. The marketing campaigns were based on phone calls. Often multiple contacts with the same customer were required to determine whether the product (a term bank deposit) would be subscribed ('yes') or not ('no'). The goal is to predict in advance whether the customer will sign up for a term deposit.

ABOUT THE INFORMATION

This is the classic bank marketing dataset uploaded to the UCI Machine Learning Repository. The dataset provides information about a financial institution's marketing campaign, which you need to analyze in order to find ways to improve the bank's future marketing campaigns. (A minimal classification sketch is given after the column list below.)

These are the columns in the data set:

Age: age of the client (quantitative)
Job: client's profession – (categorical) (administrator, worker, entrepreneur, housekeeper, manager, retiree, self-employed, service provider, student, technician, unemployed, unknown)
Marital: client's marital status – (categorical) (divorced, married, single, unknown; note: divorced means divorced or widowed)
Education: client's level of education – (categorical)
Default: indicates whether the client has credit in default – (categorical) (no, yes)
Balance: average yearly balance, in euros (quantitative)
Housing: does the client have a housing loan? – (categorical) (no, yes)
Loan: does the client have a personal loan? – (categorical) (no, yes)
Contact: contact type – (categorical) (unknown, mobile, phone)
Date: the date of the last contact with the client
Month: month of the last contact with the client – (categorical) (January-December)
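
A minimal classification sketch on this data, assuming the UCI file bank-full.csv (semicolon-separated, with the subscription outcome in a 'y' column holding 'yes'/'no'); it shows only one of the compared classifiers:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

bank = pd.read_csv("bank-full.csv", sep=";")

# One-hot encode the categorical columns and binarize the target
X = pd.get_dummies(bank.drop(columns=["y"]), drop_first=True)
y = (bank["y"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))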

CONCLUSION

In the real world, most classification problems are imbalanced, and data sets frequently contain missing values. In this project, we cover strategies for handling unbalanced data sets with missing values. We also explore different ways to build ensembles within sklearn. Here are some key observations:

Accuracies compared

  • K-nearest Neighbour: 75.3%
  • Logistic Regression: 80.9%
  • Decision Tree: 78.2%
  • Random Forest Classifier: 78%
  • Support vector Machine: 53%


Sometimes we may be willing to give up some model improvement if the added complexity is much greater than the percentage improvement in the metric.
When building ensemble models, try to use as many different good models as possible to minimize the correlation between base learners. We could improve our stacked ensemble model by adding a dense neural network and other types of base learners, as well as by adding more layers to the stacked model.
Easy Ensemble generally works better than other resampling methods.

Download the complete Predictive Analytics for Retail Banking Machine Learning Python Project Source Code, Report, PPT.

For more details about the project visit this page

 

Liver Patient Analysis Machine Learning Project

Project description

In India, delays in diagnosing diseases are a major problem due to a lack of medical professionals. The typical scenario, mainly in rural and semi-urban areas, is:

1. A patient sees a doctor with certain symptoms.
2. The doctor will perform some tests, such as blood and urine tests, depending on the symptoms.
3. The patient undergoes the above tests in the analytical laboratory.
4. The patient takes the reports back to the hospital, where they are examined and diagnosed.

The goal of this project is to reduce some of the delay caused by unnecessary trips between the hospital and the pathology laboratory. Historically, work has been done to detect the onset of diseases such as heart disease and Parkinson's; here, machine learning algorithms have been developed to predict liver disease in patients based on a variety of characteristics.

Problem statement

The problem report is officially defined as follows:

“Given the data set containing the various attributes of 584 Indian patients, use the features in the data set and determine a supervised classification algorithm to decide whether a person is suffering from liver disease. This data set contains 416 liver patient records and 167 non-liver patient records, collected in northeast Andhra Pradesh. The data set contains records of 441 male patients and 142 female patients. Any patient whose age exceeded 89 is listed as being of age ‘90’.”

Strategy

This appears to be a classic example of supervised learning. We are given a fixed number of features for each data point, and our goal is to train different supervised learning algorithms on this data so that when a new data point appears, our best-performing classifier can be used to label that data point as a positive or negative example. Detailed information on the number and types of algorithms used for training is contained in the “Algorithms and Techniques” part of the “Analysis” section.

Conclusion

Initially, the data set was studied and prepared for input to the classifiers. This was achieved by removing some rows containing null values, transforming some columns that showed skewness, and using appropriate conversion techniques (one-hot encoding) to make the labels more useful for classification purposes. The performance metrics against which the models would be evaluated were decided. The data set was then divided into a training and testing set.

First, a simple predictive base model (logistic regression) was trained on the data set to determine a baseline accuracy. The biggest challenge in implementing this project lay in two areas: choosing the learning algorithms and selecting the appropriate parameters for fine-tuning. Initially, deciding on 3 or 4 methods out of the many choices available in sklearn was very tedious.

The algorithms and techniques used to develop this Liver Patient Analysis Machine Learning Project are listed below; a minimal training sketch follows the list.

1. Random Forest Classifier
2. Gaussian Naive Bayes Classifier
3. Logistic Regression
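
A minimal sketch of training these three classifiers; it assumes the preprocessed feature matrix X and binary labels y (1 = liver disease) described above, and is illustrative rather than the project's exact code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=1)

# Fit each classifier and compare F1 scores on the held-out patients
for clf in (LogisticRegression(max_iter=1000), GaussianNB(),
            RandomForestClassifier(n_estimators=200, random_state=1)):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(type(clf).__name__, "F1:", round(f1_score(y_test, pred), 3))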

Download the complete Liver Patient Analysis Machine Learning Project Python source code, project report, PPT Presentation.

For more details about the project visit this page

Student Grade Analysis & Prediction Machine Learning Project

The main objective of this Python project is to analyze and predict the final grade of Portuguese high school students. This is a machine learning project, and the algorithms used to implement it are Linear Regression, ElasticNet Regression, Random Forest, Extra Trees, SVM, Gradient Boosting, and a Baseline model.

Problem statement

The problem statement can be defined as follows: given the data set containing the attributes of 396 Portuguese students, use the available features of the data set and define classification algorithms to determine whether a student performed well in the final qualifying exam.

Description of the data set

These data describe the performance of secondary school students in two Portuguese schools. The data attributes (student grades, demographic, social, and school-related characteristics) were collected using school reports and questionnaires. There are two data sets for two different subjects: Mathematics (math) and Portuguese (por). In [Cortez and Silva, 2008], the two data sets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with the G2 and G1 attributes.

Methodology

As universities are prestigious institutions, student retention is a matter of great concern. It was found that the majority of students dropped out of university in the first year due to a lack of adequate support in undergraduate courses. For this reason, the first year of a bachelor's degree is called the "make or break" year. Without support for mastering a course and its complexity, a student's motivation can drop, causing them to abandon the course.

There is a great need to develop an appropriate solution to help students stay in higher education. Early grade forecasting is one such solution: it monitors the progress of students in the university's undergraduate courses and leads to improvements in their learning based on the predicted grades.

The use of machine learning together with educational data mining improves the learning process of students. Various models can be developed to estimate a student's grades in enrolled courses, which provides valuable information that makes it easier for students to stay in those courses. This information can be used for the early identification of students at risk, on the basis of which the system can recommend that teachers pay special attention to those students. It will also help predict students' grades in different courses, so their performance can be monitored in a way that improves student retention in universities.

We use different packages, such as cufflinks, seaborn, and matplotlib, to analyze the data set, to predict the final grade (G3), and to display the data graphically along with the various attributes.
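
A short sketch of the prediction step with one of the listed algorithms (Random Forest); it assumes the UCI file student-por.csv, which is semicolon-separated, and predicts the final grade G3 from the remaining attributes:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

students = pd.read_csv("student-por.csv", sep=";")

# One-hot encode categorical attributes; G1 and G2 remain strong predictors of G3
X = pd.get_dummies(students.drop(columns=["G3"]), drop_first=True)
y = students["G3"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out students:", round(model.score(X_test, y_test), 3))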

Download the Complete Student Grade Analysis & Prediction Machine Learning Python Project Source code, Report.

For more details regarding the project – Click here