Python Spark Quickstart
Description

This program counts the lines containing 'a' and the lines containing 'b' in a text file.
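To make the counting logic concrete before Spark enters the picture, here is a minimal plain-Python sketch of the same idea. The helper name count_lines_containing and the sample lines are hypothetical and only for illustration; the actual program below does this with a Spark DataFrame.

```python
def count_lines_containing(lines, needle):
    """Return how many of the given lines contain the substring `needle`."""
    return sum(1 for line in lines if needle in line)


# Hypothetical sample input standing in for the lines of a text file
sample = ["apache spark", "big data", "hello", "lab"]

num_a = count_lines_containing(sample, "a")
num_b = count_lines_containing(sample, "b")
print("Lines with a: %i, lines with b: %i" % (num_a, num_b))
```

A line that contains both letters is counted once in each total, which is also how the Spark version behaves.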
Prerequisites
- Python
- Spark
Inside the Spark installation directory (spark/), create a my-examples directory, so that the layout looks like this:
.
├── bin
├── conf
├── data
├── examples
├── jars
├── kubernetes
├── LICENSE
├── licenses
├── logs
├── my-examples
├── NOTICE
├── python
├── R
├── README.md
├── RELEASE
├── sbin
└── yarn
In spark/my-examples, create the file python-spark-quickstart.py with the following contents:
"""python-spark-quickstart"""
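from pyspark.sql import SparkSession

# Path to any text file on your system, e.g. "YOUR_SPARK_HOME/README.md".
# In our case Spark is installed in /opt/spark.
logFile = "/opt/spark/README.md"

# Create (or reuse) the Spark session for this application.
spark = SparkSession.builder.appName("python-spark-quickstart").getOrCreate()

# Read the file as a DataFrame with one row per line (column "value").
# It is cached because it is scanned twice below.
logData = spark.read.text(logFile).cache()

# Count the lines that contain each letter.
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

# Release the resources held by the session.
spark.stop()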
Run
# From the Spark installation directory, run:
./bin/spark-submit --master local[4] ./my-examples/python-spark-quickstart.py
--master local[4] runs Spark locally with 4 worker threads. Ideally, set this number to the number of cores on your machine.
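As a rough sketch, the same master setting can also be supplied from the script instead of the command line. The helper names local_master and build_session are hypothetical, and the PySpark import is deferred so the file still loads where PySpark is not installed. A master set in code takes precedence over the --master flag, so hard-coding it is probably best kept to local experiments.

```python
def local_master(threads=None):
    """Build a local master URL: "local[N]" for N worker threads,
    or "local[*]" to use as many threads as there are logical cores."""
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return "local[%d]" % threads


def build_session(threads=4, app_name="python-spark-quickstart"):
    """Create a SparkSession that runs locally with the given thread count."""
    # Deferred import: PySpark is only needed when a session is actually created.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master(threads))
        .appName(app_name)
        .getOrCreate()
    )


print(local_master(4))
print(local_master())
```

With a script written this way, spark = build_session(4) would take the place of the SparkSession.builder line, and the script could be submitted without the --master option.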
Output
25/02/21 16:56:26 INFO SparkContext: Running Spark version 3.5.4
25/02/21 16:56:26 INFO SparkContext: OS info Linux, 6.8.0-52-generic, amd64
25/02/21 16:56:26 INFO SparkContext: Java version 11.0.26
...
...
25/02/21 16:56:29 INFO DAGScheduler: Job 4 finished: count at NativeMethodAccessorImpl.java:0, took 0.011863 s
Lines with a: 72, lines with b: 39
...
...