Spark Streaming Hadoop Project

Agenda:

  • Abstract
  • Existing system
  • Drawbacks of existing system
  • Proposed system
  • Methodology
  • Architecture & flow
  • System requirements

Abstract:

  • This project is built on Spark Streaming, a component of Apache Spark.
  • Spark Streaming is used to analyze both streaming data and batch data.
  • It can read data from HDFS, Flume, Kafka, or Twitter, process the data using Scala, Java, or Python, and analyse the data based on the scenario.
  • The project implements streaming data analysis that extracts error records and analyses the types of errors; a minimal sketch of the pattern follows this list.
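
A minimal sketch of this pattern, assuming a plain TCP text source and a 10-second batch interval (the host, port, and interval are illustrative placeholders, not part of the project specification):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingQuickstart {
      def main(args: Array[String]): Unit = {
        // 10-second micro-batches (interval chosen only for illustration).
        val conf = new SparkConf().setAppName("StreamingQuickstart")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Read lines from a TCP socket; Flume, Kafka, or HDFS sources follow
        // the same DStream pattern with different input utilities.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Count words in each batch and print a sample to stdout.
        lines.flatMap(_.split("\\s+"))
             .map(word => (word, 1))
             .reduceByKey(_ + _)
             .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }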

Existing System:

  • Apache Storm is an open-source engine that processes data in real time.
  • Distributed architecture.
  • Written predominantly in the Clojure and Java programming languages.
  • Stream processing.
  • It processes one incoming event at a time.

Drawbacks of Existing System:

  • One-at-a-time processing
  • Higher network latency: for 10 events processed one at a time,

    Total time = 10 × (network latency + server latency + network latency)
               = 20 × network latency + 10 × server latency

  • “At least once” delivery semantics
  • Lower fault tolerance
  • Possible duplicate data (a consequence of at-least-once delivery)

Proposed System and Advantages:

  • Micro-batch processing
  • Lower network latency: for a micro-batch of 10 events,

    Total time = network latency + 10 × server latency + network latency
               = 2 × network latency + 10 × server latency

    (see the worked example after this list)
  • “Exactly once” delivery semantics
  • Higher fault tolerance
  • No duplicate data
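
To make the two formulas concrete, here is a small Scala sketch with assumed latencies of 5 ms (network) and 2 ms (server); both numbers are illustrative, not measurements:

    // Assumed latencies, for illustration only.
    val networkLatency = 5.0 // ms
    val serverLatency  = 2.0 // ms
    val events         = 10

    // One-at-a-time (Storm-style): every event pays a full round trip.
    val oneAtATime = events * (networkLatency + serverLatency + networkLatency)
    // = 20 * networkLatency + 10 * serverLatency = 120 ms

    // Micro-batch (Spark Streaming): one round trip covers the whole batch.
    val microBatch = networkLatency + events * serverLatency + networkLatency
    // = 2 * networkLatency + 10 * serverLatency = 30 ms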

Methodology:

  • Consider several different types of logs, all collected on one host.
  • This host accumulates a large number of log files, and the useful information required for monitoring purposes must be extracted from them.
  • Flume ships these logs to another host, where they are processed.
  • The solution for streaming the real-time log data is to extract the error logs.
  • A file containing the keywords of each error type drives error identification in the Spark processing logic.
  • The processing logic is written in Spark with Scala or Java, and the results are stored in HDFS/HBase for tracking purposes.
  • Flume sends the streaming data to a port; Spark Streaming receives the data from that port, checks which logs contain error information, extracts those logs, and stores them in HDFS or HBase (see the Flume and Spark sketches after this list).
  • The stored error data is then categorized using Tableau visualisation.
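
A hypothetical Flume agent configuration for this pipeline might look like the following; the agent name, log path, hostname, and port are all assumptions:

    agent.sources  = logsrc
    agent.channels = memch
    agent.sinks    = avrosink

    # Tail an application log on the source host (placeholder path).
    agent.sources.logsrc.type     = exec
    agent.sources.logsrc.command  = tail -F /var/log/app/app.log
    agent.sources.logsrc.channels = memch

    agent.channels.memch.type     = memory
    agent.channels.memch.capacity = 10000

    # Forward events to the host/port where Spark Streaming listens.
    agent.sinks.avrosink.type     = avro
    agent.sinks.avrosink.hostname = spark-host
    agent.sinks.avrosink.port     = 9999
    agent.sinks.avrosink.channel  = memch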
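
On the Spark side, a minimal Scala sketch of the error-extraction logic could be as follows; the keyword-file path, host/port, and HDFS output prefix are placeholders:

    import java.nio.charset.StandardCharsets
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils
    import scala.io.Source

    object LogErrorExtractor {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogErrorExtractor")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Keywords of each error type, loaded from the keyword file.
        val keywords = Source.fromFile("error_keywords.txt").getLines().toSet

        // Receive events on the same host/port the Flume avro sink targets.
        val events = FlumeUtils.createStream(ssc, "spark-host", 9999)
        val lines  = events.map(e =>
          new String(e.event.getBody.array(), StandardCharsets.UTF_8))

        // Keep only log lines mentioning an error keyword and store each
        // batch in HDFS for tracking; HBase would need a foreachRDD writer.
        lines.filter(line => keywords.exists(k => line.contains(k)))
             .saveAsTextFiles("hdfs:///logs/errors/part")

        ssc.start()
        ssc.awaitTermination()
      }
    }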

Architecture & Flow:

  Log sources → Flume (avro sink) → Spark Streaming (error extraction) → HDFS/HBase → Tableau

System Requirements:

  • Java
  • Hadoop environment
  • Apache Spark
  • Apache Flume
  • Tableau Software
  • 8 GB RAM
  • 64-bit processor
