Sunday, April 28, 2019

pyspark date and time

Dates_and_Timestamps

Dates and Timestamps

You will often find yourself working with Time and Date information, let's walk through some ways you can deal with it!

In [1]:
from pyspark.sql import SparkSession
# May take a little while on a local computer
spark = SparkSession.builder.appName("dates").getOrCreate()
In [3]:
df = spark.read.csv("appl_stock.csv",header=True,inferSchema=True)
In [4]:
df.show()
+--------------------+------------------+------------------+------------------+------------------+---------+------------------+
|                Date|              Open|              High|               Low|             Close|   Volume|         Adj Close|
+--------------------+------------------+------------------+------------------+------------------+---------+------------------+
|2010-01-04 00:00:...|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|
|2010-01-05 00:00:...|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|
|2010-01-06 00:00:...|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|
|2010-01-07 00:00:...|            211.75|        212.000006|        209.050005|            210.58|119282800|          27.28265|
|2010-01-08 00:00:...|        210.299994|        212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|
|2010-01-11 00:00:...|212.79999700000002|        213.000002|        208.450005|210.11000299999998|115557400|         27.221758|
|2010-01-12 00:00:...|209.18999499999998|209.76999500000002|        206.419998|        207.720001|148614900|          26.91211|
|2010-01-13 00:00:...|        207.870005|210.92999500000002|        204.099998|        210.650002|151473000|          27.29172|
|2010-01-14 00:00:...|210.11000299999998|210.45999700000002|        209.020004|            209.43|108223500|         27.133657|
|2010-01-15 00:00:...|210.92999500000002|211.59999700000003|        205.869999|            205.93|148516900|26.680197999999997|
|2010-01-19 00:00:...|        208.330002|215.18999900000003|        207.240004|        215.039995|182501900|27.860484999999997|
|2010-01-20 00:00:...|        214.910006|        215.549994|        209.500002|            211.73|153038200|         27.431644|
|2010-01-21 00:00:...|        212.079994|213.30999599999998|        207.210003|        208.069996|152038600|         26.957455|
|2010-01-22 00:00:...|206.78000600000001|        207.499996|            197.16|            197.75|220441900|         25.620401|
|2010-01-25 00:00:...|202.51000200000001|        204.699999|        200.190002|        203.070002|266424900|26.309658000000002|
|2010-01-26 00:00:...|205.95000100000001|        213.710005|        202.580004|        205.940001|466777500|         26.681494|
|2010-01-27 00:00:...|        206.849995|            210.58|        199.530001|        207.880005|430642100|26.932840000000002|
|2010-01-28 00:00:...|        204.930004|        205.500004|        198.699995|        199.289995|293375600|25.819922000000002|
|2010-01-29 00:00:...|        201.079996|        202.199995|        190.250002|        192.060003|311488100|         24.883208|
|2010-02-01 00:00:...|192.36999699999998|             196.0|191.29999899999999|        194.729998|187469100|         25.229131|
+--------------------+------------------+------------------+------------------+------------------+---------+------------------+
only showing top 20 rows

Let's walk through how to grab parts of the timestamp data

In [44]:
from pyspark.sql.functions import format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format
In [45]:
df.select(dayofmonth(df['Date'])).show()
+----------------+
|dayofmonth(Date)|
+----------------+
|               4|
|               5|
|               6|
|               7|
|               8|
|              11|
|              12|
|              13|
|              14|
|              15|
|              19|
|              20|
|              21|
|              22|
|              25|
|              26|
|              27|
|              28|
|              29|
|               1|
+----------------+
only showing top 20 rows

In [46]:
df.select(hour(df['Date'])).show()
+----------+
|hour(Date)|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
+----------+
only showing top 20 rows

In [8]:
df.select(dayofyear(df['Date'])).show()
+---------------+
|dayofyear(Date)|
+---------------+
|              4|
|              5|
|              6|
|              7|
|              8|
|             11|
|             12|
|             13|
|             14|
|             15|
|             19|
|             20|
|             21|
|             22|
|             25|
|             26|
|             27|
|             28|
|             29|
|             32|
+---------------+
only showing top 20 rows

In [11]:
df.select(month(df['Date'])).show()
+-----------+
|month(Date)|
+-----------+
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          1|
|          2|
+-----------+
only showing top 20 rows

So for example, let's say we wanted to know the average closing price per year. Easy! With a groupby and the year() function call:

In [15]:
df.select(year(df['Date'])).show()
+----------+
|year(Date)|
+----------+
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
|      2010|
+----------+
only showing top 20 rows

In [19]:
df.withColumn("Year",year(df['Date'])).show()
+--------------------+------------------+------------------+------------------+------------------+---------+------------------+----+
|                Date|              Open|              High|               Low|             Close|   Volume|         Adj Close|Year|
+--------------------+------------------+------------------+------------------+------------------+---------+------------------+----+
|2010-01-04 00:00:...|        213.429998|        214.499996|212.38000099999996|        214.009998|123432400|         27.727039|2010|
|2010-01-05 00:00:...|        214.599998|        215.589994|        213.249994|        214.379993|150476200|27.774976000000002|2010|
|2010-01-06 00:00:...|        214.379993|            215.23|        210.750004|        210.969995|138040000|27.333178000000004|2010|
|2010-01-07 00:00:...|            211.75|        212.000006|        209.050005|            210.58|119282800|          27.28265|2010|
|2010-01-08 00:00:...|        210.299994|        212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|2010|
|2010-01-11 00:00:...|212.79999700000002|        213.000002|        208.450005|210.11000299999998|115557400|         27.221758|2010|
|2010-01-12 00:00:...|209.18999499999998|209.76999500000002|        206.419998|        207.720001|148614900|          26.91211|2010|
|2010-01-13 00:00:...|        207.870005|210.92999500000002|        204.099998|        210.650002|151473000|          27.29172|2010|
|2010-01-14 00:00:...|210.11000299999998|210.45999700000002|        209.020004|            209.43|108223500|         27.133657|2010|
|2010-01-15 00:00:...|210.92999500000002|211.59999700000003|        205.869999|            205.93|148516900|26.680197999999997|2010|
|2010-01-19 00:00:...|        208.330002|215.18999900000003|        207.240004|        215.039995|182501900|27.860484999999997|2010|
|2010-01-20 00:00:...|        214.910006|        215.549994|        209.500002|            211.73|153038200|         27.431644|2010|
|2010-01-21 00:00:...|        212.079994|213.30999599999998|        207.210003|        208.069996|152038600|         26.957455|2010|
|2010-01-22 00:00:...|206.78000600000001|        207.499996|            197.16|            197.75|220441900|         25.620401|2010|
|2010-01-25 00:00:...|202.51000200000001|        204.699999|        200.190002|        203.070002|266424900|26.309658000000002|2010|
|2010-01-26 00:00:...|205.95000100000001|        213.710005|        202.580004|        205.940001|466777500|         26.681494|2010|
|2010-01-27 00:00:...|        206.849995|            210.58|        199.530001|        207.880005|430642100|26.932840000000002|2010|
|2010-01-28 00:00:...|        204.930004|        205.500004|        198.699995|        199.289995|293375600|25.819922000000002|2010|
|2010-01-29 00:00:...|        201.079996|        202.199995|        190.250002|        192.060003|311488100|         24.883208|2010|
|2010-02-01 00:00:...|192.36999699999998|             196.0|191.29999899999999|        194.729998|187469100|         25.229131|2010|
+--------------------+------------------+------------------+------------------+------------------+---------+------------------+----+
only showing top 20 rows

In [29]:
newdf = df.withColumn("Year",year(df['Date']))
newdf.groupBy("Year").mean()[['avg(Year)','avg(Close)']].show()
+---------+------------------+
|avg(Year)|        avg(Close)|
+---------+------------------+
|   2015.0|120.03999980555547|
|   2013.0| 472.6348802857143|
|   2014.0| 295.4023416507935|
|   2012.0| 576.0497195640002|
|   2016.0|104.60400786904763|
|   2010.0| 259.8424600000002|
|   2011.0|364.00432532142867|
+---------+------------------+

Still not quite presentable! Let's use the .alias method as well as round() to clean this up!

In [43]:
result = newdf.groupBy("Year").mean()[['avg(Year)','avg(Close)']]
result = result.withColumnRenamed("avg(Year)","Year")
result = result.select('Year',format_number('avg(Close)',2).alias("Mean Close")).show()
+------+----------+
|  Year|Mean Close|
+------+----------+
|2015.0|    120.04|
|2013.0|    472.63|
|2014.0|    295.40|
|2012.0|    576.05|
|2016.0|    104.60|
|2010.0|    259.84|
|2011.0|    364.00|
+------+----------+

Perfect! Now you know how to work with Date and Timestamp information!

No comments:

Post a Comment

Loud fan of desktop

 Upon restart the fan of the desktop got loud again. I cleaned the desktop from the dust but it was still loud (Lower than the first sound) ...