Analyzing real-time data with Spark Streaming and Kafka

ban

Project Description

The project deals with the processing the weather data from www.weatherbit.io using Kafka and Spark Streaming. Here we are simulating the streaming data using previous days data. Then we used a PySpark program to run the spark SQL queries to process the data consumed from the kafka topic along with their required dependencies and finally publish the processed data to another Kafka topic. Then we will consume the data into another python program and plotted the real time graph using Matplotlib.

Workflow

Technologies Used

Python 3.9.2
Spark 3.1.2
Kafka 2.8.0
PySpark 2.4.8
Matplotlib 3.4.3
Hadoop 2.7.7
kafka-python 2.0.2
requests 2.26.0

</p>

Features

List of features ready and TODOs for future development

Get the data for any city my making minor changes.
Show a set of graphs that are plotted at near real-time.
Can use the same program for other real-time data like price of cryptocurrency with minor modifications.

To-do

Change the data source to an actual real-time stream rather than simulation.
Create a dashboard to display the real-time data.

Getting Started

All the operations below are for Windows OS

Make sure to install the required dependencies as mentioned in the project.

Start the Zookeeper server

zookeeper-server-start.bat C:\kafka\config\zookeeper.properties

Start the Kafka server

kafka-server-start.bat C:\kafka\config\server.properties

Create the required topics in Kafka

kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic weather
kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic output

Clone this repository and execute the rest of the command within the directory containing the files.

git clone https://github.com/redon-n-roy/Analyzing-real-time-data-with-spark-streaming-and-kafka.git

Usage

The following are the steps to get the program working.

Execute the producer.py program. This will take the data from the API and start publishing to the Kafka topic “weather”.
```
python producer.py
```
Start the consumer using the Spark-Submit. This will start processing the data using Spark Structured Streaming and send the output to the Kafka topic “output”.
```
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 consumer.py
```
Execute the output.py program. This will take the data from the Kafka topic “output” and visualize it.
```
python output.py
```

Output

Contirbutors

License

This project uses the MIT license.

Reference

https://www.weatherbit.io/api

https://www.goavega.com/install-apache-kafka-on-windows/

https://phoenixnap.com/kb/install-spark-on-windows-10

https://matplotlib.org/devdocs/index.html

Analyzing-real-time-data-with-spark-streaming-and-kafka

This is my Project 3 under Revature training