Hello and welcome to this lesson about Spark SQL.
Spark provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
In this lesson, we will create a SQL table from a dataframe
and then query it.
To load a comma-separated file into a dataframe called df, set df = spark.read.csv, where the first argument gives the name of the file.
If the first line of the data gives the column names, set the header argument to True.
Now, to create a Spark table from this dataframe, do df.createOrReplaceTempView, with the first argument giving the desired name of the table.
Next, to run a query, do spark.sql, with the first argument giving the query.
To inspect the column names of this table, set result = spark.sql, with the first argument set to the string "show columns from table".
Another way is to use the query "SELECT * FROM table LIMIT 0".
Yet another way is the "DESCRIBE table" query.
To see the column names of result = spark.sql(query), we can either call result.show() and visually inspect the output, or call print(result.columns).
The dataframe is a fundamental data abstraction in Spark.
A Spark DataFrame is a distributed collection of data organized into named columns.
It is conceptually equivalent to a table in a relational database; this kind of data is often simply called "tabular" data.
We could have two dataframes having the same types of columns, and containing different data.
We could then concatenate the rows of data in these two tables into a single dataframe. Recall that a Spark DataFrame is a distributed collection of data organized into named columns. What do we mean by “distributed”?
Spark can split this dataset into parts then store each part on a different server.
In this case, Spark is partitioning the data and distributing it automatically, on our behalf. This is one technique that Spark uses to handle large datasets, even though each server may not have enough storage to hold the entire dataset on its own. What’s more, Spark allows us to treat a dataframe like a table, and query it using SQL.
SQL stands for “Structured Query Language”. A query tells the computer what to fetch.
What’s useful about the Spark SQL table is that it allows us to take the data that is in a dataframe, namely, a distributed collection of rows having named columns, and treat it as a single table, and fetch data from it using an SQL query.
We often use an instance of a SparkSession object. By convention this is provided in a variable called "spark". Some environments, such as the PySpark shell, automatically provide a SparkSession instance.
The following Spark command reads delimited text data into a dataframe from a file. One of its options allows it to use the first row to define the names of the columns.
It automatically splits each row into columns using the delimiter, which by default is a comma but which can be changed.
Let’s load some data into a dataframe, convert it into a SQL table and query it.