Python Spark Quickstart

Description

This program counts the lines that contain ‘a’ and the lines that contain ‘b’ in a text file.
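To make clear what is being counted, here is a minimal plain-Python sketch of the same logic, without Spark. The function name count_lines_containing and the sample lines are illustrative assumptions, not part of the Spark program. Note that a line is counted once no matter how many times the letter occurs in it, and that the match is case-sensitive.

```python
"""Plain-Python sketch of the line counts the Spark job computes (no Spark needed)."""
from typing import Iterable


def count_lines_containing(lines: Iterable[str], needle: str) -> int:
    """Return how many lines contain `needle` at least once (case-sensitive)."""
    return sum(1 for line in lines if needle in line)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the lines of a text file
    sample = ["apache spark", "big data", "hello", "lab"]
    print("Lines with a:", count_lines_containing(sample, "a"))
    print("Lines with b:", count_lines_containing(sample, "b"))
```

The Spark version below does the same work, but on a DataFrame that can be split across worker threads or machines.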

Prerequisites

  • Python
  • Spark

In your Spark installation directory, create a my-examples directory:

spark/my-examples

The Spark directory should then look like this:

.
├── bin
├── conf
├── data
├── examples
├── jars
├── kubernetes
├── LICENSE
├── licenses
├── logs
├── my-examples
├── NOTICE
├── python
├── R
├── README.md
├── RELEASE
├── sbin
└── yarn

In spark/my-examples, create the file:

python-spark-quickstart.py

"""python-spark-quickstart"""
from pyspark.sql import SparkSession

# Path to any text file on your system, for example:
# logFile = "YOUR_SPARK_HOME/README.md"

# In our case:
logFile = "/opt/spark/README.md"

spark = SparkSession.builder.appName("python-spark-quickstart").getOrCreate()
# Read the file as a DataFrame with one row per line (column "value");
# cache it because it is scanned twice below
logData = spark.read.text(logFile).cache()

# Count the lines that contain each letter at least once
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()

Run

# From the Spark installation directory, run:
./bin/spark-submit --master local[4] ./my-examples/python-spark-quickstart.py

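--master local[4] runs Spark locally with 4 worker threads, so up to 4 tasks can run in parallel. Ideally, set this number to the number of cores on your machine; local[*] uses as many threads as there are logical cores.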
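The common local master URL forms (local, local[K], local[*]) can be summarized in a small sketch. The helper local_thread_count is hypothetical and is not part of Spark or PySpark; it only mirrors the documented meaning of these strings and deliberately ignores other forms such as local[K,F] or cluster URLs.

```python
import os
import re


def local_thread_count(master: str) -> int:
    """Hypothetical helper: worker threads implied by a simple local master URL."""
    if master == "local":
        # Plain "local" means a single worker thread
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a simple local master URL: {master!r}")
    value = match.group(1)
    if value == "*":
        # "local[*]" means one thread per logical core on this machine
        return os.cpu_count() or 1
    return int(value)


if __name__ == "__main__":
    for url in ("local", "local[4]", "local[*]"):
        print(url, "->", local_thread_count(url), "worker thread(s)")
```

For a small file such as README.md the thread count makes little difference; it matters once the input is large enough to be split into several partitions.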

Output

25/02/21 16:56:26 INFO SparkContext: Running Spark version 3.5.4
25/02/21 16:56:26 INFO SparkContext: OS info Linux, 6.8.0-52-generic, amd64
25/02/21 16:56:26 INFO SparkContext: Java version 11.0.26

...
...

25/02/21 16:56:29 INFO DAGScheduler: Job 4 finished: count at NativeMethodAccessorImpl.java:0, took 0.011863 s
Lines with a: 72, lines with b: 39

...
...