Big Data Chapter 5 – Working with Streaming Data | Spark Streaming Basics

In many real-world applications, data is not generated in batches but arrives
continuously in real time. Examples include sensor data, financial transactions,
social media feeds, and application logs. Streaming data processing enables
systems to analyze and act on this data as it arrives.

Apache Spark provides powerful tools for real-time data processing through
Spark Streaming and Structured Streaming, making it easier to build scalable
streaming applications.

⭐ What is Streaming Data?

Streaming data refers to continuously generated data that is processed in
near real time. Unlike batch processing, streaming systems handle data
incrementally as it flows into the system.

📌 Characteristics of Streaming Data

  • Continuous and unbounded data flow
  • Low-latency processing requirements
  • High volume and velocity
  • Event-driven nature

⭐ Spark Streaming Overview

Spark Streaming is an extension of the Spark Core API that enables scalable,
fault-tolerant stream processing. It divides incoming data streams into
small batches and processes them using Spark’s batch engine.
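The micro-batch idea can be illustrated with a toy sketch in plain Python. This is only an illustration of the concept, not Spark's actual implementation; the batch size and the data are made up:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop an unbounded iterator into small finite batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Toy "stream" of events; in Spark these would arrive over time
events = iter(range(10))

# Each small batch is processed with ordinary batch logic
results = [sum(batch) for batch in micro_batches(events, 3)]
print(results)  # [3, 12, 21, 9]
```

This is exactly the trade-off noted below: each batch gets the full power of the batch engine, but no result is emitted until the batch interval has elapsed.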

📌 Spark Streaming Architecture

  • Data sources (Kafka, Flume, sockets)
  • Micro-batch processing engine
  • Transformation and output operations

📌 Limitations of Classic Spark Streaming

  • Micro-batch latency
  • Limited support for complex event-time processing

⭐ Structured Streaming

Structured Streaming is a high-level streaming API built on Spark SQL.
It treats streaming data as an unbounded table and allows users to write
streaming queries using SQL or DataFrame operations.

📌 Advantages of Structured Streaming

  • Simple and declarative API
  • End-to-end exactly-once guarantees (with replayable sources and idempotent sinks)
  • Built-in fault tolerance
  • Event-time and window-based processing
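Event-time windowing, which Structured Streaming exposes through its `window()` function, can be sketched conceptually in plain Python. This toy example assigns events to 10-second tumbling windows by their event timestamp; it mimics the idea only, and the event data is invented:

```python
from collections import defaultdict

def window_start(ts, width):
    """Start of the tumbling window that contains timestamp ts."""
    return ts - (ts % width)

# (event_time_seconds, word) pairs, possibly arriving out of order
events = [(3, "spark"), (12, "spark"), (7, "data"), (15, "data")]

# Count words per 10-second tumbling window, keyed by window start
counts = defaultdict(int)
for ts, word in events:
    counts[(window_start(ts, 10), word)] += 1

print(dict(counts))
```

Note that grouping is driven by when the event *happened* (its timestamp), not when it arrived, which is what distinguishes event-time from processing-time windows.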

📌 Example: Streaming Data from Socket


from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession as the entry point
spark = SparkSession.builder \
    .appName("StreamingExample") \
    .getOrCreate()

# Read a text stream from a local socket
# (start one with, e.g.: nc -lk 9999)
df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# The socket source yields a single string column named "value"
df.printSchema()

📌 Example: Streaming Word Count


from pyspark.sql.functions import explode, split

# Split each incoming line into words, one word per row
words = df.select(
    explode(split(df.value, " ")).alias("word")
)

# Running count per word across the whole stream
word_counts = words.groupBy("word").count()

# "complete" mode re-emits the full result table after each batch;
# it is required for this unwatermarked aggregation
query = word_counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Block until the stream is stopped
query.awaitTermination()

📌 Common Streaming Data Sources

  • Apache Kafka
  • Apache Flume
  • Sockets
  • Cloud event streams
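For example, reading from Kafka with Structured Streaming follows the same `readStream` pattern as the socket example above. The broker address and topic name here are placeholders; running this fragment also requires an existing SparkSession, the spark-sql-kafka connector package, and a live broker:

```python
# Subscribe to a Kafka topic (broker and topic names are placeholders)
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Kafka delivers key/value as binary; cast the payload to text
lines = df.selectExpr("CAST(value AS STRING) AS value")
```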

📌 Real-Life Applications

  • Real-time fraud detection
  • Live social media analytics
  • Monitoring IoT sensor data
  • Log and event processing

📌 Project Title

Real-Time Streaming Data Processing Using Apache Spark

📌 Project Description

In this project, you will build a real-time data processing pipeline using
Spark Structured Streaming. The system will ingest live data, perform
transformations such as filtering and aggregation, and display results
in real time.
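As a warm-up for the project, the core pattern (filter incoming records, then incrementally update an aggregate as each one arrives) can be sketched in plain Python. The record format and the filtering threshold below are invented for illustration:

```python
# Running state: total amount seen per user
state = {}

def process(record, state, min_amount=10):
    """Filter out small transactions and update a running sum."""
    user, amount = record
    if amount < min_amount:
        return  # filtered out, state unchanged
    state[user] = state.get(user, 0) + amount

# Records arriving one at a time, as in a stream
for record in [("alice", 50), ("bob", 5), ("alice", 20), ("bob", 30)]:
    process(record, state)

print(state)  # {'alice': 70, 'bob': 30}
```

In the actual project, Structured Streaming maintains this kind of state for you: the filter becomes a `where()` on the streaming DataFrame and the running sum becomes a `groupBy().sum()`.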

📌 Summary

Streaming data processing is essential for modern Big Data applications.
Spark Streaming and Structured Streaming provide scalable and fault-tolerant
solutions for handling real-time data. This chapter prepares you to integrate
streaming systems with databases and NoSQL storage solutions.
