- Existing system
- Drawbacks of existing system
- Proposed system
- System requirements
- In this project, Spark Streaming is developed as part of Apache Spark.
- Spark Streaming is used to analyze streaming data and batch data.
- It can read data from HDFS, Flume, Kafka, Twitter, process the data using Scala, Java or python and analyze the data based on the scenario.
- This basically implements the Streaming Data Analysis for DataError extraction, Analyse the type of errors.
- Apache storm is an open source engine which can process data in real-time.
- Distributed architecture.
- Written predominantly in Clojure and Java programming languages.
- Stream processing.
- It processes one incoming event at a time.
Drawbacks of Existing System:
- One at a time processing
- Higher network latency
Total Time=10*(network latency + server latency + network latency)=
20*(network latency ) + 10*(server latency)
- “At least once” delivery semantics.
- Less fault tolerance
- Duplicate data
Proposed System and Advantages:
- Micro-batch processing.
- Low network latency
- Total time=network latency + 10* server latency +network latency =2*network latency + 10*server latency
- “Exactly once” delivery semantics.
- High fault tolerance
- No duplicate data
- Let us consider different types of logs and store in one host.
- This creates a large number of log files and processes the useful information from these logs which is required for monitoring purposes.
- Using Flume it sends these logs to another host where it needs to be processed.
- The solution providing for streaming real-time log data is to extract the error logs.
- It provides a file which contains the keywords of error types for error identification in the spark processing logic.
- Processing logic is written in spark-scala or spark-java and store in HDFS/HBase for tracking purposes.
- It uses Flume for sending the streaming data into another port Spark-streaming to receive the data from the port and check the logs which contain error information, extract those logs and store into HDFS or HBase.
- On the Stored error data, it categorizes the errors using Tableau Visualisation.
Architecture & Flow:
- Hadoop environment
- Apache Spark
- Apache Flume
- Tableau Software
- 8 GB RAM
- 64- bit processor