Understanding Big Data – In the Context of Internet of Things Data

Executive Summary

This technology review critically examines the Internet of Things (IoT) in the context of big data as a technology solution for business needs. It evaluates how the big data generated by IoT devices can be exploited and managed. The article begins with a literature review of IoT data, defining it in an academic context. It then analyzes the usability of big data in a commercial and business-economics context, and evaluates applications of IoT data that add value to a business. The main objective is to communicate the business intelligence an organization can gain from using big data. The premise of the paper is that IoT data is an emerging science offering organizations a wide range of solutions to exploit in building services and products, or in bridging ‘gaps’ in the delivery of technology solutions. The possibilities of using big data for marketing, healthcare, personal safety, education and many other economic and technology solutions are discussed.

Introduction

Organizational decisions are increasingly based on data generated by the Internet of Things (IoT), in addition to traditional inputs. IoT data empowers organizations to manage assets, strengthen performance and build new business models. According to MacGillivray, Turner, and Lund (2013), the number of installed IoT devices is expected to exceed 212 billion by 2020. Data management therefore becomes a crucial aspect of IoT, since different types of objects interconnect and constantly exchange different types of information. The volume of data generated and the processes for handling it are critical to IoT and require the use of several technologies.

The globally addressable market for IoT is estimated at $1.3 trillion by 2019. New business opportunities are therefore plentiful, allowing organizations to become smarter, enhance their products and services, and improve the customer experience, thereby creating a Quantified Economy. According to Angeles (2016), (1) Internet of Things spending is $669, (2) smart-home connectivity spending is $174 million, and (3) connected-car spending will reach $220 million by 2020. This has led companies to revisit several questions: (1) Are the organization's products or services capable of connecting or transmitting data? (2) Is the organization able to extract optimal value from the data it has? (3) Do the organization's connected devices provide an end-to-end view? (4) Does the organization need to build a full IoT infrastructure, or only parts of a solution to connect devices? Some examples of IoT and business value: (1) a real estate holding company adopts smart-building networking for ‘real-time’ power management and saves substantially on energy expenses; (2) logistics companies incorporate sensors in vehicles to gain real-time input on the environmental and behavioural factors that determine performance; (3) mining companies monitor air quality as a safety measure to protect miners.

Hence, the immediate results of IoT data are tangible and span several organizational fronts: optimized performance, lower risk and increased efficiency. IoT data becomes the vital bridge through which organizations gain insight, strengthen their core business, improve safety and leverage data for business intelligence, without having to become data companies themselves. By using the IoT data management and storage technologies that competing vendors offer, organizations can continue to focus on their deliverables rather than on the back-end work of generating value from data.

Algorithm Marketplaces

Big data is entering an ‘industrial revolution’ stage in which machines generate data faster than people do. Sources include social networks, sensor networks, e-commerce, web logs, call detail records, surveillance, genomics, and internet text and documents. The volume grows exponentially, in line with Moore’s Law, and analytics vendors increasingly share their work. Virtual marketplaces where algorithms (code snippets) are bought and sold are therefore expected to be commonplace by 2020. Gartner expects three vendors to dominate this marketplace, which is set to transform today's software market as analytics takes a dominant role. Simply put, an algorithm marketplace improves on the current app economy by offering “building blocks” that can be tailored to the end-point needs of the organization. The expected benefits are: (1) granular software will sell in greater quantities, since software for a single function or feature will be available cheaply; (2) powerful, cutting-edge algorithms that inventors previously kept in-house become commercially available, widening their application scope and benefiting businesses; (3) reuse of algorithms is optimized; (4) quality assessment is optimized.

Model Factory

Data storage is cheap, so stored data can be mined to generate information. Technologies in use include MPP (massively parallel processing) databases, distributed databases, cloud computing platforms, distributed file systems and scalable storage systems. A data lake built on an open source platform such as Hadoop can be developed for predictive analytics by adopting a model factory principle. With this technology, for which there are several vendors, an organization's data does not have to be handled manually by data scientists, who can instead focus on asking the right questions of predictive models. The technology brings real automation to data science, where work was traditionally moved from one tool to the next so that different data sets could be generated and validated by models. Automating this processing not only removes human error but also allows hundreds of models to be managed in real time. In the model factories of the future, software will pre-manage the data, and scientists will need to concentrate only on how to run models rather than on repeating their work. Such model factories would do what Google and Facebook do today, but instead of an army of number-crunching engineers they would rely on automated software that manages data science processing through tooling and pervasive machine learning. Skytree is one example.

Edge analytics

Business environments create unstructured data stores that can reach petabytes or even zettabytes and demand specific treatment in terms of storage, processing and display. Large data-crunching companies such as Facebook or Google therefore cannot rely on conventional database analytic tools such as those offered by Oracle, because big repositories require agile, robust platforms based on distributed or cloud systems, or on open source systems such as Hadoop. These platforms involve massive data repositories and thousands of nodes, and evolved from tools developed by Google Inc., such as MapReduce, distributed file systems and NoSQL stores. These tools generally do not provide the guarantees of conventional databases: atomicity, consistency, isolation and durability. To overcome this challenge, data scientists collect data and analyze it with automated analytic computation at a sensor, network switch or other device, which does not require the data to be returned to a central data store for processing. Thus, by annotating and interpreting data on network resources, the acquired data can be mined.

Anomaly detection

Data scientists use outlier detection, or anomaly detection, to identify instances or events that do not conform to the expected pattern of the items in a data set. In short, outliers are data points that differ markedly from the remainder of the data. The technique is applied to credit card fraud, fault detection, telecommunication fraud and network intrusion detection. It also draws on statistical tools such as Grubbs’ test, which detects outliers in univariate data (Tan, Steinbach, & Kumar, 2013).
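As a rough illustration of the statistic behind Grubbs’ test, the sketch below computes G = max|x − mean| / s for a small univariate sample and reports the most extreme point. The readings are hypothetical. A complete test would also compare G against a critical value derived from the t-distribution for a chosen significance level; that step is left out here.

```python
import statistics


def grubbs_statistic(values):
    """Return (G, index) for the point farthest from the sample mean.

    G = |x_i - mean| / s, where s is the sample standard deviation.
    """
    if len(values) < 3:
        raise ValueError("Grubbs' test needs at least three observations")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
    if stdev == 0:
        return 0.0, 0  # all values identical: nothing stands out
    index = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[index] - mean) / stdev, index


# Hypothetical sensor readings; the last one looks suspicious.
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 50.0]
g_value, suspect = grubbs_statistic(readings)
print(f"G = {g_value:.3f} for reading #{suspect} ({readings[suspect]})")
```

The larger G is relative to the critical value for the sample size, the stronger the evidence that the flagged reading is an outlier.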

Event streaming processing

Event stream processing (ESP) is the set of technologies designed to aid the construction of event-based information systems. It includes event visualization, event databases, event-driven middleware, event processing languages and complex event processing. Collected data is processed immediately, without a waiting period, and produces output almost instantaneously.
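The idea of processing each event as it arrives can be sketched with a Python generator. This is a simplified, single-process illustration rather than an ESP product; the temperature feed and the window size are assumptions made for the example.

```python
from collections import deque


def rolling_average(events, window=3):
    """Yield (value, average of the last `window` values) as each event arrives."""
    recent = deque(maxlen=window)  # only the most recent events are kept
    for value in events:
        recent.append(value)
        yield value, sum(recent) / len(recent)


def temperature_feed():
    """Hypothetical stand-in for a live sensor stream."""
    yield from [20.0, 22.0, 24.0, 30.0]


results = []
for value, average in rolling_average(temperature_feed()):
    # Output is produced per event, without waiting for the stream to end.
    print(f"reading={value} rolling_average={average:.2f}")
    results.append((value, average))
```

Because the generator keeps only a bounded window of recent events, memory use stays constant however long the stream runs.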

Text analytics

Text analytics, also called text data mining, uses text as the unit for generating and analyzing information. High-quality information is derived from text by identifying patterns and trends through statistical pattern learning. Unstructured text is processed into meaningful data for analysis so that customer opinions, feedback and product reviews can be quantified. Applications include sentiment analysis, entity modelling and decision support.

Data lakes

Data lakes are storage repositories that hold raw data in its native format until it is required. The data is stored in a flat architecture, in contrast to the hierarchical storage of a data warehouse. Data structures are defined only when the data is needed. Vendors include Microsoft Azure, in addition to several open source options.
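The point that structure is defined only when the data is needed (often called schema-on-read) can be sketched as follows. The raw records and field names are hypothetical, and an in-memory list stands in for the lake's storage.

```python
import json

# Raw records kept exactly as they arrived (hypothetical device payloads).
raw_lake = [
    '{"device": "thermostat-1", "temp_c": "21.5", "ts": "2016-01-01T00:00:00"}',
    '{"device": "door-7", "open": true}',
    '{"device": "thermostat-2", "temp_c": "19.0"}',
]


def read_temperatures(raw_records):
    """Apply a schema at read time: keep temperature records and cast types."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if "temp_c" in record:
            rows.append(
                {"device": record["device"], "temp_c": float(record["temp_c"])}
            )
    return rows


temperatures = read_temperatures(raw_lake)
print(temperatures)
```

A different question, such as how often doors were opened, would apply a different schema to the same untouched raw records.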

Spark

Spark is a key platform for IoT data. It simplifies real-time big data integration for advanced analytics and supports real-time use cases that drive business innovation. Such platforms generate native code, which needs further processing for Spark Streaming.

Conclusion

According to Gartner, as many as 43% of organizations are committed to investing in and implementing IoT, which indicates the massive scale of data that organizations will come to generate. For utilities, fleet management or healthcare organizations, the use of IoT data will transform cost savings, operational infrastructure and asset utilization, as well as safety, risk mitigation and efficiency-building capabilities. The right technologies deliver on the promise of big data analytics for IoT data repositories.

References

Angeles, R. (2016). SteadyServ Beer: IoT-enabled product monitoring using RFID. IADIS International Journal on Computer Science & Information Systems, 11(2).

Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165-1188.

Fredriksson, C. (2015, November). Knowledge management with Big Data Creating new possibilities for organizations. In The XXIVth Nordic Local Government Research Conference (NORKOM).

MacGillivray, C., Turner, V., & Lund, D. (2013). Worldwide Internet of Things (IoT) 2013–2020 Forecast: Billions of Things, Trillions of Dollars. IDC Market Analysis.

Tan, P. N., Steinbach, M., & Kumar, V. (2013). Data mining cluster analysis: basic concepts and algorithms. Introduction to data mining.

Troester, M. (2012). Big data meets big data analytics: Three key technologies for extracting real-time business value from the big data that threatens to overwhelm traditional computing architectures. SAS Institute Inc. White Paper.

Spark Streaming Hadoop Project

Agenda:

  • Abstract
  • Existing system
  • Drawbacks of existing system
  • Proposed system
  • System requirements

Abstract:

  • This project uses Spark Streaming, a component of Apache Spark.
  • Spark Streaming is used to analyze both streaming data and batch data.
  • It can read data from HDFS, Flume, Kafka or Twitter, process the data using Scala, Java or Python, and analyze the data based on the scenario.
  • The project implements streaming data analysis to extract error records from the data and analyse the types of errors.

Existing System:

  • Apache Storm is an open source engine that can process data in real time.
  • Distributed architecture.
  • Written predominantly in Clojure and Java programming languages.
  • Stream processing.
  • It processes one incoming event at a time.

Drawbacks of Existing System:

  • One at a time processing
  • Higher network latency

  • Total time for 10 events = 10*(network latency + server latency + network latency) = 20*network latency + 10*server latency

  • “At least once” delivery semantics.
  • Less fault tolerance
  • Duplicate data

Proposed System and Advantages:

  • Micro-batch processing.
  • Low network latency
  • Total time for 10 events = network latency + 10*server latency + network latency = 2*network latency + 10*server latency
  • “Exactly once” delivery semantics.
  • High fault tolerance
  • No duplicate data
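The two total-time formulas can be compared with a small worked example. The latency values below are hypothetical and are chosen only to show how the network cost is paid once per event in one-at-a-time processing but once per batch in micro-batch processing.

```python
def one_at_a_time_total(events, network_latency, server_latency):
    """Each event makes its own round trip: network + server + network."""
    return events * (network_latency + server_latency + network_latency)


def micro_batch_total(events, network_latency, server_latency):
    """The whole batch shares one round trip; server time is still per event."""
    return network_latency + events * server_latency + network_latency


# Hypothetical latencies in milliseconds.
network_ms, server_ms = 5, 1

storm_style = one_at_a_time_total(10, network_ms, server_ms)  # 20*5 + 10*1
spark_style = micro_batch_total(10, network_ms, server_ms)    # 2*5 + 10*1
print(f"one at a time: {storm_style} ms, micro-batch: {spark_style} ms")
```

With these assumed values the one-at-a-time total is 110 ms and the micro-batch total is 20 ms. The gap widens as network latency grows relative to server latency.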

Methodology:

  • Consider different types of logs collected and stored on one host.
  • This produces a large number of log files, from which the information needed for monitoring must be extracted.
  • Flume sends these logs to another host, where they are processed.
  • The streaming solution extracts the error logs from the real-time log data.
  • A file containing the keywords for each error type is supplied to the Spark processing logic for error identification.
  • The processing logic is written in Spark with Scala or Java, and the results are stored in HDFS/HBase for tracking purposes.
  • Flume sends the streaming data to a port; Spark Streaming receives the data from that port, checks for logs that contain error information, extracts those logs and stores them in HDFS or HBase.
  • The stored error data is then categorized by error type using Tableau visualisation.
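The keyword-based error extraction described in the steps above can be sketched in plain Python. This is not Spark Streaming code: the keyword list, log lines and function names are hypothetical, and in the project the same filtering would run inside the Spark job on data received from Flume.

```python
from collections import Counter

# Hypothetical keywords, standing in for the contents of the keyword file.
ERROR_KEYWORDS = ["OutOfMemoryError", "NullPointerException", "Timeout"]


def extract_errors(log_lines, keywords=ERROR_KEYWORDS):
    """Return (keyword, line) pairs for lines mentioning a known error keyword."""
    matches = []
    for line in log_lines:
        lowered = line.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                matches.append((keyword, line))
                break  # assign each line to the first matching error type
    return matches


sample_logs = [
    "2016-01-01 10:00:01 INFO  Service started",
    "2016-01-01 10:00:05 ERROR java.lang.NullPointerException at Foo.bar",
    "2016-01-01 10:00:09 WARN  Connection timeout after 30s",
    "2016-01-01 10:00:12 ERROR java.lang.OutOfMemoryError: Java heap space",
    "2016-01-01 10:00:15 ERROR java.lang.NullPointerException at Baz.qux",
]

errors = extract_errors(sample_logs)
error_counts = Counter(keyword for keyword, _ in errors)
print(error_counts)
```

The per-type counts are the kind of summary that a visualisation tool could then chart.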

Architecture & Flow:

System Requirements:

  • Java
  • Hadoop environment
  • Apache Spark
  • Apache Flume
  • Tableau Software
  • 8 GB RAM
  • 64-bit processor

Apriori Algorithm Implementation Using Map Reduce On Hadoop

The main aim of this project is to use the Apriori algorithm, a data mining algorithm, together with MapReduce. It is mainly used to find the frequent itemsets for an application that consists of many transactions.

The algorithm takes the application's data sets as input and produces the frequent itemsets as output.

In the MapReduce part, the code is written in terms of key-value pairs.
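A minimal sketch of how Apriori can be phrased as key-value pairs is shown below. It simulates the map and reduce phases in a single Python process rather than on Hadoop. The transactions and the minimum support threshold are hypothetical, and the function names are illustrative rather than part of any Hadoop API.

```python
from collections import Counter
from itertools import combinations


def map_phase(transaction, candidates):
    """Emit (itemset, 1) for every candidate itemset contained in the transaction."""
    items = set(transaction)
    return [(candidate, 1) for candidate in candidates if set(candidate) <= items]


def reduce_phase(pairs, min_support):
    """Sum the counts per itemset and keep those meeting the minimum support."""
    counts = Counter()
    for itemset, count in pairs:
        counts[itemset] += count
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def apriori(transactions, min_support):
    """Run one map/reduce pass per itemset size until no candidates remain."""
    items = sorted({item for t in transactions for item in t})
    candidates = [(item,) for item in items]
    frequent, size = {}, 1
    while candidates:
        pairs = [pair for t in transactions for pair in map_phase(t, candidates)]
        level = reduce_phase(pairs, min_support)
        frequent.update(level)
        size += 1
        survivors = sorted({item for itemset in level for item in itemset})
        # Apriori pruning: every (size-1)-subset of a candidate must be frequent.
        candidates = [
            c for c in combinations(survivors, size)
            if all(sub in level for sub in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical shopping-basket transactions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
frequent_itemsets = apriori(baskets, min_support=2)
print(frequent_itemsets)
```

On a real cluster each pass would be a separate MapReduce job, with the candidate list distributed to the mappers.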

Big Data Hadoop Projects Titles

The following are project titles on Big Data Hadoop.

1) Twitter data sentiment analysis using Flume and Hive

2) Business insights of user usage records of data cards

3) Wiki page ranking with Hadoop

4) Health care data management using Apache Hadoop ecosystem

5) Sensex log data processing using BigData tools

6) Retail data analysis using BigData

7) Facebook data analysis using Hadoop and Hive

8) Archiving LFS (Local File System) & CIFS data to Hadoop

9) Aadhar based analysis using Hadoop

10) Web based data management of Apache Hive

11) Automated RDBMS data archiving and dearchiving using Hadoop and Sqoop

12) BigData PDF printer

13) Airline on-time performance

14) Climatic data analysis using Hadoop (NCDC)

15) MovieLens data processing and analysis

16) Two-phase approach for data anonymization using MapReduce

17) Migrating different sources to BigData and its performance

18) Flight history analysis

19) Pseudo-distributed Hadoop cluster in script

HARDWARE REQUIREMENT FOR CLUSTER

  • 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
  • 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
  • 64-512GB of RAM
  • Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)

SOFTWARE REQUIREMENT

  • FRONT END: Jetty server, WebUI in JSP
  • BACK END: Apache Hadoop, Apache Flume, Apache Hive, Apache Pig, JDK 1.6
  • OS: Linux (Ubuntu)
  • IDE: Eclipse

Web Page Ranking using Hadoop

Abstract

With the dramatic growth of the World Wide Web, which exceeds 800 million pages, the quality of search results matters more than the content of any single page. The quality of a page is determined by web page ranking, in which the importance of a page depends on the importance of its parent pages (the pages that link to it). For very large sub-graphs of the web, page rank can be computed with limited memory using Hadoop.

Pages in XML format are given as input to the page ranking program. The forward and backward links are used to compute the rank of a page.
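The link-based ranking idea can be sketched with an iterative page rank computation in plain Python. This is a single-machine illustration, not the Hadoop job. The link graph is hypothetical, and the damping factor of 0.85 and the fixed iteration count are common choices assumed here rather than values taken from the project.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iteratively compute ranks from a {page: [pages it links to]} mapping."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally through its forward links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no forward links spreads its rank over all pages.
                for other in pages:
                    new_ranks[other] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks


# Hypothetical link graph: keys are pages, values are their forward links.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = page_rank(web)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {rank:.4f}")
```

In a MapReduce setting, each iteration would typically be one job: the map step emits rank shares along forward links and the reduce step sums the shares arriving at each page.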

Business insights of User usage records of data cards

Project title: Business insights of User usage records of data cards

Software: Hadoop 1.1.2, Hive, Sqoop, Tableau

Hardware: 5-node cluster.

Project short description: The aim of this project is to derive business insights from current user usage records and use them for business growth. The parameters considered for analysis are:

  1. Daily user count and bytes transmitted in a particular time slot.
  2. Area-wise business (usage) share in the total business.
  3. Partner dependence: every network owner depends on partners to provide service in areas where it does not have its own service tower.

From case 1: We can find exactly when the most users are on the network and when the most downloads and uploads occur. Based on that, the owner can focus tower capacity enhancements where they are needed. If a tower is underutilized, its capacity can be reduced.

From case 2: The owner can identify the areas where more investment will attract more users.

From case 3: The owner can find the areas where partners lead and improve its own tower installations there.

All of the above activities are currently carried out using data warehousing technologies, which are more expensive and time consuming. To do better in this area, we use Hadoop and the Hadoop ecosystem.
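The three analyses can be sketched as simple aggregations. The example below uses plain Python on a handful of hypothetical records. The record layout (user, area, hour, bytes, tower owner) is an assumption made for illustration, and in the project the equivalent grouping would be expressed as Hive queries over the full data set.

```python
from collections import defaultdict

# Hypothetical usage records: (user_id, area, hour_of_day, bytes, tower_owner)
records = [
    ("u1", "north", 9, 500, "owner"),
    ("u2", "north", 9, 300, "partner"),
    ("u1", "south", 21, 700, "owner"),
    ("u3", "south", 21, 200, "partner"),
    ("u3", "south", 21, 300, "partner"),
]


def usage_by_hour(rows):
    """Case 1: distinct users and total bytes per time slot."""
    users, volume = defaultdict(set), defaultdict(int)
    for user, _area, hour, size, _owner in rows:
        users[hour].add(user)
        volume[hour] += size
    return {hour: (len(users[hour]), volume[hour]) for hour in users}


def area_share(rows):
    """Case 2: each area's share of total usage."""
    totals = defaultdict(int)
    for _user, area, _hour, size, _owner in rows:
        totals[area] += size
    grand_total = sum(totals.values())
    return {area: total / grand_total for area, total in totals.items()}


def partner_share_by_area(rows):
    """Case 3: fraction of each area's usage carried on partner towers."""
    partner, totals = defaultdict(int), defaultdict(int)
    for _user, area, _hour, size, owner in rows:
        totals[area] += size
        if owner == "partner":
            partner[area] += size
    return {area: partner[area] / totals[area] for area in totals}


hourly = usage_by_hour(records)
shares = area_share(records)
partner_shares = partner_share_by_area(records)
print(hourly, shares, partner_shares)
```

Areas with a high partner share are candidates for new owner towers, while time slots with high byte counts point to where capacity is needed.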

SOFTWARE AND HARDWARE

HARDWARE REQUIREMENT FOR EACH NODE/MACHINE IN A CLUSTER

Processor: Intel

Speed: 2.5 GHz

RAM: 8 GB or more

Hard disk: 500 GB or more

SOFTWARE REQUIREMENT

Operating system: Linux

Technology: Hadoop

Tools: Hive, Sqoop

Reporting tool: Tableau

Database: MySQL

Java version: JDK 1.6 or higher