
Apache Spark

https://spark.apache.org/
https://sparkbyexamples.com/

Unified engine for large-scale data analytics.

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Key features

  • Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.

  • SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting (see the sketch after this list). Runs faster than most data warehouses.

  • Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

  • Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
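
A minimal PySpark sketch of the batch and SQL features above; the file name sales.csv and the columns region and amount are hypothetical placeholders, not part of the original text:

from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession.
spark = SparkSession.builder.appName("feature-demo").getOrCreate()

# Batch read: path and columns are illustrative placeholders.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# SQL analytics: register the DataFrame as a view and run an ANSI SQL aggregation.
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()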

Stream processing is a continuous method of ingesting, analyzing, and processing data as it is generated. The input data is unbounded and has no predetermined beginning or end. It is a series of events that arrive at the stream processing system (e.g., credit card transactions, clicks on a website, or sensor readings from Internet of Things [IoT] devices).
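
As a rough illustration, a minimal PySpark Structured Streaming job treats such an unbounded input as a continuously growing table. The built-in rate source used here just generates timestamped rows at a fixed pace and stands in for a real event stream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Unbounded input: the "rate" source emits (timestamp, value) rows continuously.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Process events as they arrive and print each micro-batch to the console.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()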

Two prominent technologies, Apache Spark™ and Apache Flink®, are leading frameworks in stream processing. While Spark initially gained popularity for batch processing, it has since evolved to incorporate Structured Streaming for real-time data analysis. In contrast, Flink was built from the ground up for real-time processing and also supports batch processing. Despite their distinct origins, both excel as low-latency, scalable technologies.

This article explores the two frameworks, their features, and why they are often compared in the context of real-time data analysis.

                          Spark                                      Flink
Data ingestion tool       Spark Streaming Sources                    Flink DataStream API
Data processing           Batch/Stream (micro-batch)                 Batch/Stream (real-time)
Windowing                 Tumbling/Sliding                           Tumbling/Sliding/Session/Global
Joins                     Stream-stream/Stream-dataset               Window/Interval
State backend             HDFS                                       In-memory/RocksDB
Fault tolerance           Yes (WAL)                                  Yes (Chandy-Lamport)
User-defined functions    Yes                                        Yes
Languages                 Scala, Java, Python, R, and SQL            Java, Python, SQL, and Scala (deprecated)
API/Libraries             Spark Streaming, Spark SQL, MLlib          DataStream API, Table API, Flink SQL,
                          (machine learning), GraphX (graph          Flink ML, Gelly, PyFlink
                          processing), PySpark
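
To make the windowing row concrete, here is a sketch of a tumbling-window aggregation in Spark Structured Streaming; the rate source and the 10-second window size are illustrative choices, not taken from the table:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

# The rate source emits (timestamp, value) rows; group them into
# non-overlapping (tumbling) 10-second windows and count rows per window.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

# Streaming aggregations need "update" or "complete" output mode.
query = counts.writeStream.format("console").outputMode("update").start()
query.awaitTermination()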

Install Spark

How to Install Spark on Ubuntu
Interactive Analysis with the Spark Shell (with install)

Ubuntu

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

Start Standalone Spark Master Server

start-master.sh

The master's web UI is then available at http://127.0.0.1:8080/.
Start a Spark Worker Server (Worker Process)

Use the following command format to start a worker server in a single-server setup, replacing [master_server] and [port] with the master's hostname or IP address and port (7077 by default):

start-worker.sh spark://[master_server]:[port]

Test Spark Shell

To load the Scala shell, enter:

spark-shell

Type :q and press Enter to exit Scala.


Enter the following command to start the PySpark shell (Python):

pyspark

To exit the PySpark shell, type quit() and press Enter.
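
Once the shell is up, a quick sanity check (not part of the original guide) is to run a small job against the SparkSession that the shell creates for you as spark. The README path below assumes SPARK_HOME=/opt/spark as set earlier:

# Inside the pyspark shell, a SparkSession is already available as `spark`.
spark.range(1000).selectExpr("sum(id)").show()

# Count the lines of Spark's bundled README.
lines = spark.read.text("/opt/spark/README.md")
print(lines.count())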


Basic Commands to Start and Stop Master Server and Workers

Command                                       Description
start-master.sh                               Start the driver (master) server instance on the current machine.
stop-master.sh                                Stop the driver (master) server instance on the current machine.
start-worker.sh spark://master_server:port    Start a worker process and connect it to the master server (use the master's IP or hostname).
stop-worker.sh                                Stop a running worker process.
start-all.sh                                  Start both the driver (master) and worker instances.
stop-all.sh                                   Stop all the driver (master) and worker instances.

The start-all.sh and stop-all.sh commands work for single-node setups, but in multi-node clusters, you must configure passwordless SSH login on each node. This allows the master server to control the worker nodes remotely.

Quick Start


This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python.

To follow along with this guide, first, download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.

Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs were superseded by the Dataset API, which is strongly typed like an RDD but with richer optimizations under the hood. The RDD interface is still supported, and you can find a more detailed reference in the RDD programming guide. However, we highly recommend switching to Dataset, which has better performance than RDD. See the SQL programming guide for more information about Dataset.
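
As a sketch of the difference in PySpark (where the Dataset API surfaces as the untyped DataFrame), assuming a local README.md file from the downloaded Spark release:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quickstart").getOrCreate()

# Legacy RDD API: still supported, but without Catalyst/Tungsten optimizations.
rdd_count = spark.sparkContext.textFile("README.md").count()

# Recommended DataFrame/Dataset API: same result, optimized under the hood.
df_count = spark.read.text("README.md").count()

print(rdd_count, df_count)
spark.stop()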