Wednesday, May 1, 2019

Speed up spark with small data

Spark is faster than Python/R only when it has massive data, then parallelization is powerful. For small data sizes, the parameter partitions should be reduced.

Partitions are the number of chunks the data is broken down too before parallel processing. With small data, it should not have a lot of partitions because the overhead of setting up the shuffles is large

Dont forget to set local[*]

Python will definitely perform better compared to pyspark on smaller data sets. You will see the difference when you are dealing with larger data sets.
By default when you run spark in SQL Context or Hive Context it will use 200 partitions by default. You need to change it to 10 or what ever valueby using sqlContext.sql("set spark.sql.shuffle.partitions=10");. It will be definitely faster than with default.
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
You are right, you will not see much difference at lower volumes. Spark can be slower as well.
2) My spark is running locally and I should run it in something like Amazon EC instead.
For your volume it might not help much.
3) Running locally is okay, but my computing capacity just doesn't cut it. It's a 8 Gig RAM 2015 Macbook.
Again it does not matter for 20MB data set.
4) Spark is slow because I'm running Python. If I'm using Scala it would be much better. (Con argument: I heard lots of people are using PySpark just fine.)
On stand alone there will be difference. Python has more run time overhead than scala, but on larger cluster with distributed capability it need not matter





 https://stackoverflow.com/questions/34625410/why-does-my-spark-run-slower-than-pure-python-performance-comparison

No comments:

Post a Comment

Loud fan of desktop

 Upon restart the fan of the desktop got loud again. I cleaned the desktop from the dust but it was still loud (Lower than the first sound) ...