Skip to main content

Python Spark DataFrame

https://spark.apache.org/examples.html

Description

06-dataframe-picture.png

This section shows you how to create a Spark DataFrame and run simple operations. The examples are on a small DataFrame, so you can easily see the functionality.

Prerequisites

  • Python
  • Spark

Change

spark/conf/log4j2.properties:
rootLogger.level = info --> rootLogger.level = error


Go to spark/my-examples and create:

python-spark-DataFrame.py

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

print("Create a Spark DataFrame:")
df = spark.createDataFrame(
[
("sue", 32),
("li", 3),
("bob", 75),
("heo", 13),
],
["first_name", "age"],
)
print("...\n\n")

print("Use the show() method to view the contents of the DataFrame:")
df.show()

print("Now, let’s perform some data processing operations on the DataFrame.")
print("\n")
print("ADD a COLUMN TO a SPARK DataFrame:")
from pyspark.sql.functions import col, when

df1 = df.withColumn(
"life_stage",
when(col("age") < 13, "child")
.when(col("age").between(13, 19), "teenager")
.otherwise("adult"),
)
print("...\n\n")

print("Let’s view the contents of df1.")
df1.show()

print("Notice how the original DataFrame is unchanged:")
df.show()

print("FILTER a SPARK DataFrame")
print("Now, filter the DataFrame so it only includes teenagers and adults.")
df1.where(col("life_stage").isin(["teenager", "adult"])).show()

print("GROUP BY AGGREGATION ON SPARK DataFrame")
print("Now, let’s compute the average age for everyone in the dataset:")
from pyspark.sql.functions import avg
df1.select(avg("age")).show()

print("You can also compute the average age for each life_stage:")
df1.groupBy("life_stage").avg().show()

print("\nSpark lets you run queries on DataFrames with SQL if you don’t want to use the programmatic APIs.")
print("QUERY THE DataFrame WITH SQL")
print("Here’s how you can compute the average age for everyone with SQL:")
spark.sql("select avg(age) from {df1}", df1=df1).show()

print("And here’s how to compute the average age by life_stage with SQL:")
spark.sql("select life_stage, avg(age) from {df1} group by life_stage", df1=df1).show()

Run

./bin/spark-submit --master local[4] ./my-examples/python-spark-DataFrame.py

Output

Create a Spark DataFrame:
...


Use the show() method to view the contents of the DataFrame:
+----------+---+
|first_name|age|
+----------+---+
| sue| 32|
| li| 3|
| bob| 75|
| heo| 13|
+----------+---+

Now, let’s perform some data processing operations on the DataFrame.


ADD a COLUMN TO a SPARK DataFrame:
...


Let’s view the contents of df1.
+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
| sue| 32| adult|
| li| 3| child|
| bob| 75| adult|
| heo| 13| teenager|
+----------+---+----------+

Notice how the original DataFrame is unchanged:
+----------+---+
|first_name|age|
+----------+---+
| sue| 32|

...
...