
Apache Spark

https://spark.apache.org/
https://sparkbyexamples.com/

Unified engine for large-scale data analytics.

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Key features

  • Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.

  • SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting (see the sketch after this list). Runs faster than most data warehouses.

  • Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

  • Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
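
A minimal PySpark sketch of the batch and SQL features above; the file name sales.csv and the columns region and amount are hypothetical placeholders, not part of the original text:

from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession.
spark = SparkSession.builder.appName("feature-demo").getOrCreate()

# Batch read: path and columns are illustrative placeholders.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# SQL analytics: register the DataFrame as a view and run an ANSI SQL aggregation.
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()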

Stream processing is a continuous method of ingesting, analyzing, and processing data as it is generated. The input data is unbounded and has no predetermined beginning or end. It is a series of events that arrive at the stream processing system (e.g., credit card transactions, clicks on a website, or sensor readings from Internet of Things [IoT] devices).
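
As a rough illustration, a minimal PySpark Structured Streaming job treats such an unbounded input as a continuously growing table. The built-in rate source used here just generates timestamped rows at a fixed pace and stands in for a real event stream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Unbounded input: the "rate" source emits (timestamp, value) rows continuously.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Process events as they arrive and print each micro-batch to the console.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()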

Two prominent technologies, Apache Spark™ and Apache Flink®, are leading frameworks in stream processing. While Spark initially gained popularity for batch processing, it has since evolved to incorporate Structured Streaming for real-time data analysis. In contrast, Flink was built from the ground up for real-time processing and also supports batch processing. Despite their distinct origins, both excel as low-latency, scalable technologies.

This article explores the two frameworks, their features, and why they are often compared in the context of real-time data analysis.

                          Spark                                      Flink
Data ingestion tool       Spark Streaming Sources                    Flink DataStream API
Data processing           Batch/Stream (micro-batch)                 Batch/Stream (real-time)
Windowing                 Tumbling/Sliding                           Tumbling/Sliding/Session/Global
Joins                     Stream-stream/Stream-dataset               Window/Interval
State backend             HDFS                                       In-memory/RocksDB
Fault tolerance           Yes (WAL)                                  Yes (Chandy-Lamport)
User-defined functions    Yes                                        Yes
Languages                 Scala, Java, Python, R, and SQL            Java, Python, SQL, and Scala (deprecated)
API/Libraries             Spark Streaming, Spark SQL, MLlib          DataStream API, Table API, Flink SQL,
                          (machine learning), GraphX (graph          Flink ML, Gelly, PyFlink
                          processing), PySpark
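
To make the windowing row concrete, here is a sketch of a tumbling-window aggregation in Spark Structured Streaming; the rate source and the 10-second window size are illustrative choices, not taken from the table:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("windowing-demo").getOrCreate()

# The rate source emits (timestamp, value) rows; group them into
# non-overlapping (tumbling) 10-second windows and count rows per window.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
counts = events.groupBy(window(events.timestamp, "10 seconds")).count()

# Streaming aggregations need "update" or "complete" output mode.
query = counts.writeStream.format("console").outputMode("update").start()
query.awaitTermination()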

Install Spark

How to Install Spark on Ubuntu
Interactive Analysis with the Spark Shell (with install)

Ubuntu

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

Start Standalone Spark Master Server

start-master.sh

The master's web UI is then available at http://127.0.0.1:8080/.
Start a Spark Worker Server (Worker Process)

Use the following command format to start a worker server in a single-server setup, replacing [master_server] and [port] with the master's hostname or IP address and port (7077 by default):

start-worker.sh spark://[master_server]:[port]

Test Spark Shell

To load the Scala shell, enter:

spark-shell

Type :q and press Enter to exit Scala.


Enter the following command to start the PySpark shell (Python):

pyspark

To exit the PySpark shell, type quit() and press Enter.
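
Once the shell is up, a quick sanity check (not part of the original guide) is to run a small job against the SparkSession that the shell creates for you as spark. The README path below assumes SPARK_HOME=/opt/spark as set earlier:

# Inside the pyspark shell, a SparkSession is already available as `spark`.
spark.range(1000).selectExpr("sum(id)").show()

# Count the lines of Spark's bundled README.
lines = spark.read.text("/opt/spark/README.md")
print(lines.count())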


Basic Commands to Start and Stop Master Server and Workers

Command                                       Description
start-master.sh                               Start the driver (master) server instance on the current machine.
stop-master.sh                                Stop the driver (master) server instance on the current machine.
start-worker.sh spark://master_server:port    Start a worker process and connect it to the master server (use the master's IP or hostname).
stop-worker.sh                                Stop a running worker process.
start-all.sh                                  Start both the driver (master) and worker instances.
stop-all.sh                                   Stop all the driver (master) and worker instances.

The start-all.sh and stop-all.sh commands work for single-node setups, but in multi-node clusters, you must configure passwordless SSH login on each node. This allows the master server to control the worker nodes remotely.

Quick Start


This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python.

To follow along with this guide, first, download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.

Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs were superseded by the Dataset API, which is strongly typed like an RDD but with richer optimizations under the hood. The RDD interface is still supported, and you can find a more detailed reference in the RDD programming guide. However, we highly recommend switching to Dataset, which has better performance than RDD. See the SQL programming guide for more information about Dataset.
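
As a sketch of the difference in PySpark (where the Dataset API surfaces as the untyped DataFrame), assuming a local README.md file from the downloaded Spark release:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quickstart").getOrCreate()

# Legacy RDD API: still supported, but without Catalyst/Tungsten optimizations.
rdd_count = spark.sparkContext.textFile("README.md").count()

# Recommended DataFrame/Dataset API: same result, optimized under the hood.
df_count = spark.read.text("README.md").count()

print(rdd_count, df_count)
spark.stop()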