Python Spark SQL 01
Description

Spark lets you use the programmatic API, the SQL API, or a combination of both. This flexibility makes Spark accessible to a variety of users and powerfully expressive.
Prerequisites
- Python
- Spark
Go to spark/my-examples and create:
python-spark-sql-01.py
Let’s persist the DataFrame in a named Parquet table that is easily accessible via the SQL API. (df1.write.saveAsTable("some_people") - python-spark-sql-01.py)
print("Create a Spark DataFrame")
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
from pyspark.sql.functions import col, when
df = spark.createDataFrame(
[
("sue", 32),
("li", 3),
("bob", 75),
("heo", 13),
],
["first_name", "age"],
)
df1 = df.withColumn(
"life_stage",
when(col("age") < 13, "child")
.when(col("age").between(13, 19), "teenager")
.otherwise("adult"),
)
df1.show()
print("Let’s persist the DataFrame")
df1.write.saveAsTable("some_people")
print("...\n...")
print("\n\nMake sure that the table is accessible via the table name 'select * from some_people':")
spark.sql("select * from some_people").show()
print("Now, let’s use SQL to insert a few more rows of data into the table:")
spark.sql("INSERT INTO some_people VALUES ('frank', 4, 'child')")
print("...\n...")
print("Inspect the table contents to confirm the row was inserted:")
spark.sql("select * from some_people").show()
print("Run a query that returns the teenagers:")
spark.sql("select * from some_people where life_stage='teenager'").show()
print("\n\nstop the spark session")
spark.stop()
Run
./bin/spark-submit --master local[4] ./my-examples/python-spark-sql-01.py
Output
Create a Spark DataFrame
+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
| sue| 32| adult|
| li| 3| child|
| bob| 75| adult|
| heo| 13| teenager|
+----------+---+----------+
Let’s persist the DataFrame
...
...
Make sure that the table is accessible via the table name 'select * from some_people':
+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
| heo| 13| teenager|
| bob| 75| adult|
| sue| 32| adult|
| li| 3| child|
+----------+---+----------+
Now, let’s use SQL to insert a few more rows of data into the table:
...
...
Inspect the table contents to confirm the row was inserted:
+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
| heo| 13| teenager|
| frank| 4| child|
| bob| 75| adult|
| sue| 32| adult|
| li| 3| child|
+----------+---+----------+
Run a query that returns the teenagers:
+----------+---+----------+
|first_name|age|life_stage|
+----------+---+----------+
| heo| 13| teenager|
+----------+---+----------+
stop the spark session