Apache Spark
https://spark.apache.org/
https://sparkbyexamples.com/
Unified engine for large-scale data analytics.
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Key features
- Batch/streaming data: Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R (see the sketch after this list).
- SQL analytics: Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
- Data science at scale: Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
- Machine learning: Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
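As a quick illustration of the batch and SQL features above, here is a minimal PySpark sketch; the application name, column names, and sample rows are made up for illustration.

```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame and SQL APIs.
spark = SparkSession.builder.appName("KeyFeaturesDemo").getOrCreate()

# Batch processing with the DataFrame API (sample data, hypothetical schema).
people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Distributed ANSI SQL over the same data via a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()
```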
Flink vs. Spark
Stream processing is a continuous method of ingesting, analyzing, and processing data as it is generated. The input data is unbounded and has no predetermined beginning or end. It is a series of events that arrive at the stream processing system (e.g., credit card transactions, clicks on a website, or sensor readings from Internet of Things [IoT] devices).
Two prominent technologies, Apache Spark™ and Apache Flink®, are leading frameworks in stream processing. While Spark initially gained popularity for batch processing, it has since evolved to incorporate Structured Streaming for real-time data analysis. In contrast, Flink was built from the ground up for real-time processing and can handle batch processing too. Despite their distinct origins, both excel as low-latency, scalable technologies.
This article explores the two frameworks, their features, and why they are often compared in the context of real-time data analysis.
Summary of differences: Flink vs. Spark
| | Spark | Flink |
|---|---|---|
| Data ingestion tool | Spark Streaming sources | Flink DataStream API |
| Data processing | Batch/stream (micro-batch) | Batch/stream (real-time) |
| Windowing | Tumbling/sliding | Tumbling/sliding/session/global |
| Joins | Stream-stream/stream-dataset | Window/interval |
| State backend | HDFS | In-memory/RocksDB |
| Fault tolerance | Yes (WAL) | Yes (Chandy-Lamport) |
| User-defined functions | Yes | Yes |
| Languages | Scala, Java, Python, R, and SQL | Java, Python, SQL, and Scala (deprecated) |
| API/Libraries | Spark Streaming, Spark SQL, MLlib for machine learning, GraphX for graph processing, PySpark | DataStream API, Table API, Flink SQL, Flink ML, Gelly, PyFlink |
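To make the Spark column concrete, the sketch below uses Structured Streaming's built-in rate source with a tumbling one-minute window, processed in micro-batches; the application name and row rate are arbitrary choices for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("TumblingWindowDemo").getOrCreate()

# The rate source emits rows with `timestamp` and `value` columns,
# which makes it convenient for experiments without external systems.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Tumbling one-minute window over event timestamps; Spark executes this
# as a series of micro-batches.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()
```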
Install Spark
How to Install Spark on Ubuntu
Interactive Analysis with the Spark Shell (with install)
On Ubuntu, add the following environment variables to your shell profile (for example, ~/.profile), then reload the profile:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
Start Standalone Spark Master Server
start-master.sh
The master web UI is available at http://127.0.0.1:8080/ by default.
Start Spark Worker Server (Start a Worker Process)
Use the following command format to start a worker server in a single-server setup, where [master_server] is the master's IP or hostname and [port] is the master's port (7077 by default):
start-worker.sh spark://[master_server]:[port]
Test Spark Shell
To load the Scala shell, enter:
spark-shell
Type :q and press Enter to exit the Scala shell.
Enter the following command to start the PySpark shell (Python):
pyspark
To exit the PySpark shell, type quit() and press Enter.
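Once the PySpark shell is up, a couple of commands are enough to confirm the session works; this is a minimal sketch, relying on the `spark` session object that the shell creates for you.

```python
# Inside the PySpark shell, `spark` (SparkSession) and `sc` (SparkContext)
# are created automatically.
spark.range(1000).count()              # expect 1000
spark.sql("SELECT 1 AS sanity_check").show()
```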
Basic Commands to Start and Stop Master Server and Workers
| Command | Description |
|---|---|
| start-master.sh | Start the standalone master server instance on the current machine. |
| stop-master.sh | Stop the standalone master server instance on the current machine. |
| start-worker.sh spark://master_server:port | Start a worker process and connect it to the master server (use the master's IP or hostname). |
| stop-worker.sh | Stop a running worker process. |
| start-all.sh | Start both the master and worker instances. |
| stop-all.sh | Stop all master and worker instances. |
The start-all.sh and stop-all.sh commands work for single-node setups, but in multi-node clusters, you must configure passwordless SSH login on each node. This allows the master server to control the worker nodes remotely.
Quick Start
This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python.
To follow along with this guide, first, download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.
Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs were superseded by Dataset, which is strongly typed like an RDD but with richer optimizations under the hood. The RDD interface is still supported, and you can find a more detailed reference in the RDD programming guide. However, we highly recommend switching to Dataset, which has better performance than RDD. See the SQL programming guide for more information about Dataset.
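As a taste of the Dataset/DataFrame API recommended above, the sketch below mirrors the official quick start in the PySpark shell; it assumes a README.md file in your Spark directory, so substitute any text file you have.

```python
# In the PySpark shell, `spark` is already available.
textFile = spark.read.text("README.md")   # DataFrame with a single `value` column

textFile.count()    # number of lines in the file
textFile.first()    # first line as a Row

# Filter to the lines that mention Spark and count them.
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
linesWithSpark.count()
```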
📄️ Spark Architecture
Understanding Spark Architecture: How It All Comes Together
📄️ Interactive Analysis with the Spark Shell
Basics
🗃️ Examples
📄️ References
https://spark.apache.org/