Pivoting Data in SparkSQL
JANUARY 5TH, 2016
One of the core values at Silicon Valley Data Science (SVDS) is contributing back to the community, and one way we do that is through open source contributions. One of the many new features in Spark 1.6.0 is the ability to pivot data in data frames. It was a feature requested by one of my colleagues, and one that I decided to work on.
Pivot tables are an essential part of data analysis and reporting. A pivot can be thought of as translating rows into columns while applying one or more aggregations. Many popular data manipulation tools (pandas, reshape2, and Excel) and databases (MS SQL and Oracle 11g) include the ability to pivot data.
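For instance, pandas offers the same operation through its pivot_table function. Here's a minimal sketch, assuming the mpg CSV used below has already been downloaded locally:
>>> import pandas as pd
>>> pdf = pd.read_csv('mpg.csv')
>>> pd.pivot_table(pdf, values='hwy', index='class', columns='year', aggfunc='mean')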
Below you’ll find some examples of how to use pivot in PySpark, using this dataset: https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv
$ bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0
>>> df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('mpg.csv')
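If you want to double-check that inferschema did its job, printing the schema is a quick sanity check:
>>> df.printSchema()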
>>> df.withColumnRenamed("manufacturer", "manuf").show(5)
+---+-----+-----+-----+----+---+----------+---+---+---+---+-------+
|   |manuf|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+---+-----+-----+-----+----+---+----------+---+---+---+---+-------+
|  1| audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|  2| audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|  3| audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|  4| audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|  5| audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+---+-----+-----+-----+----+---+----------+---+---+---+---+-------+
If we wanted to see the average highway mpg of cars by class and year, we could do a simple group by and aggregation:
>>> df.groupBy('class', 'year').avg('hwy').show()
+----------+----+------------------+
|     class|year|          avg(hwy)|
+----------+----+------------------+
|    pickup|2008|16.941176470588236|
|subcompact|1999|              29.0|
|subcompact|2008|            27.125|
|   2seater|1999|              24.5|
|   2seater|2008|              25.0|
|   compact|1999|             27.92|
|   compact|2008|28.727272727272727|
|   minivan|1999|              22.5|
|   midsize|1999|              26.5|
|   minivan|2008|              22.2|
|   midsize|2008|28.047619047619047|
|       suv|1999|17.551724137931036|
|       suv|2008|18.636363636363637|
|    pickup|1999|           16.8125|
+----------+----+------------------+
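For comparison, the same aggregation can be written in SQL once the data frame is registered as a temporary table (mpg here is just an arbitrary table name):
>>> df.registerTempTable('mpg')
>>> sqlContext.sql('SELECT class, year, avg(hwy) FROM mpg GROUP BY class, year').show()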
Pivoting on the year column gives the same information, just laid out differently:
>>> df.groupBy('class').pivot('year').avg('hwy').show()
+----------+------------------+------------------+
|     class|              1999|              2008|
+----------+------------------+------------------+
|       suv|17.551724137931036|18.636363636363637|
|   2seater|              24.5|              25.0|
|    pickup|           16.8125|16.941176470588236|
|   midsize|              26.5|28.047619047619047|
|   compact|             27.92|28.727272727272727|
|   minivan|              22.5|              22.2|
|subcompact|              29.0|            27.125|
+----------+------------------+------------------+
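The long decimals come straight from the averages; if you'd like a tidier table, one option is to round inside the aggregation (a sketch using functions.round, which takes a column and a number of decimal places):
>>> from pyspark.sql import functions as F
>>> df.groupBy('class').pivot('year').agg(F.round(F.avg('hwy'), 2)).show()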
We can also specify the values of the pivot column explicitly, which is more efficient: otherwise Spark first has to run a query against the data frame to determine the distinct values. To do so, just pass a list of values as the second argument to pivot.
>>> df.groupBy('class').pivot('year', [1999, 2008]).avg('hwy').show()
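Providing explicit values also lets you pivot on only a subset of them. For example, to keep just the 2008 column:
>>> df.groupBy('class').pivot('year', [2008]).avg('hwy').show()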
Finally, just as with a normal group by, we can apply multiple aggregations:
>>> from pyspark.sql import functions as F
>>> df.groupBy('class').pivot('year', [1999, 2008]).agg(F.min(df.hwy), F.max(df.hwy)).show()
+----------+-------------+-------------+-------------+-------------+
|     class|1999_min(hwy)|1999_max(hwy)|2008_min(hwy)|2008_max(hwy)|
+----------+-------------+-------------+-------------+-------------+
|       suv|           15|           25|           12|           27|
|   2seater|           23|           26|           24|           26|
|    pickup|           15|           20|           12|           22|
|   midsize|           24|           29|           23|           32|
|   compact|           23|           44|           25|           37|
|   minivan|           21|           24|           17|           24|
|subcompact|           21|           44|           20|           36|
+----------+-------------+-------------+-------------+-------------+
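The generated column names combine the pivot value with the aggregate expression. If you'd prefer cleaner headers, aliasing each aggregation should produce names like 1999_min_hwy instead (a sketch; the exact naming scheme may vary across Spark versions):
>>> df.groupBy('class').pivot('year', [1999, 2008]).agg(F.min(df.hwy).alias('min_hwy'), F.max(df.hwy).alias('max_hwy')).show()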