Key Factors To Consider When Optimizing, Scheduling, And Monitoring Spark Jobs

Spark

5 MIN READ

August 28, 2021


Apache Spark is one of the most popular engines for distributed computing on Big Data clusters, and it has proven its efficiency in many areas. Developing and running something on Spark is a straightforward task; the problem arises when you deploy it on a cluster and want maximum performance, which calls for best practices in developing Spark jobs. Spark job optimization is the number one factor that impacts Spark's performance. Although Spark has its own Catalyst optimizer for query plans, you may still run into memory-related issues, so it is always advisable to follow a few tips and tricks.

 

In this article we will discuss the key factors that we consider at Ksolves for enriched Spark performance. Let's start our journey.

 

What are Spark jobs?

 

Spark jobs are operations that physically move data in order to produce a result. Most jobs are triggered directly through the user API, while others, such as operations that require Spark to inspect the data itself, spawn jobs of their own. Spark jobs come in all shapes and sizes, and so do the clusters they run on.

Let's start with the factors needed for optimizing Spark jobs.

 

Key factors you need to follow for Spark job optimization, Spark job scheduling and Spark job monitoring

 

  • Data serialization

Spark offers two types of serialization: Java serialization and Kryo serialization. Java serialization is the default. Kryo serialization is much faster and more compact than Java serialization, so it is recommended to use Kryo in production and for Spark job optimization. Kryo may sometimes fail to serialize a few classes; to avoid this, you need to register the classes that are being serialized.
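A minimal sketch of how this could look, assuming hypothetical domain classes Order and Customer that stand in for your own serialized types:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical domain classes used only for illustration.
case class Order(id: Long, amount: Double)
case class Customer(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Switch from the default Java serializer to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail fast if a class is not registered, instead of silently
  // falling back to writing full class names with every record.
  .set("spark.kryo.registrationRequired", "true")
  // Register the classes that will be serialized.
  .registerKryoClasses(Array(classOf[Order], classOf[Customer]))

val spark = SparkSession.builder().config(conf).getOrCreate()
```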

  • Broadcasting

Another parameter to consider during Spark job optimization is broadcasting. If tasks across multiple stages need the same data, it is better to broadcast the value than to ship it to the executors with every task. Likewise, when joining two tables, if one of them is small and fits comfortably in memory, it is best to broadcast it in order to avoid a shuffle.
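A sketch of both patterns, using a made-up lookup map and illustrative tables:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-example").getOrCreate()
import spark.implicits._

// Broadcast a small lookup map once instead of shipping it with every task.
val countryCodes = Map("IN" -> "India", "US" -> "United States")
val codesBc = spark.sparkContext.broadcast(countryCodes)

val userRdd = spark.sparkContext.parallelize(Seq((1, "IN"), (2, "US")))
val withNames = userRdd.map { case (id, code) =>
  (id, codesBc.value.getOrElse(code, "Unknown"))
}

// Broadcast join: hint Spark to broadcast the small table so the
// larger table is not shuffled across the cluster.
val users  = Seq((1, "IN"), (2, "US")).toDF("userId", "country")
val orders = Seq((1, 100.0), (2, 250.0), (1, 75.0)).toDF("userId", "amount")
val joined = orders.join(broadcast(users), Seq("userId"))
```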

  • Avoid UDF and UDAF

Ksolves advises you to avoid UDFs and choose built-in SQL functions instead, because a UDF forces Spark to deserialize the data so it can be processed in Scala and then serialize it again. Apart from that, the built-in SQL functions are well tested and will yield better results during Spark job optimization and monitoring.

A UDAF produces a SortAggregate, which is slower than a HashAggregate.
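A small sketch of the difference, with an illustrative column name; the built-in upper function replaces a hand-written UDF:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, upper}

val spark = SparkSession.builder().appName("udf-vs-builtin").getOrCreate()
import spark.implicits._

val df = Seq("alice", "bob").toDF("name")

// UDF version: Spark must deserialize each value into a JVM String,
// run the Scala function, and serialize the result again. The UDF is
// also opaque to the Catalyst optimizer.
val upperUdf = udf((s: String) => s.toUpperCase)
val viaUdf = df.select(upperUdf($"name").alias("name_upper"))

// Built-in version: stays inside Spark's optimized internal format
// and can be analyzed and optimized by Catalyst.
val viaBuiltin = df.select(upper($"name").alias("name_upper"))
```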

 

  • Dynamic allocation

Spark has an excellent mechanism for using cluster resources more efficiently: dynamic allocation, which requires Spark's external shuffle service to be enabled. With it, Spark can scale the number of executors up and down based on the workload, which helps with Spark job optimization.
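A minimal configuration sketch; the minimum, maximum and timeout values below are placeholders to adjust for your own cluster:

```scala
import org.apache.spark.sql.SparkSession

// Enable dynamic allocation together with the external shuffle service,
// so executors can be released without losing their shuffle files.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")    // placeholder
  .config("spark.dynamicAllocation.maxExecutors", "20")   // placeholder
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()
```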

 

  • Avoid long lineage

Spark offers two types of operations: actions and transformations. It is not recommended to chain a large number of transformations in a single lineage when processing high volumes of data with minimal resources. It is advised to break the lineage by checkpointing or writing out intermediate results, which gives better performance.
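A sketch of two ways to break the lineage; the paths and transformations are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("lineage-example").getOrCreate()

// Checkpointing needs a reliable directory (path is a placeholder).
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

val raw = spark.read.parquet("hdfs:///data/events")   // placeholder path

// Imagine many chained transformations here...
val enriched = raw
  .filter(col("status") === "ok")
  .withColumn("amount_usd", col("amount") * 0.012)

// Option 1: checkpoint materializes the data and truncates the lineage.
val checkpointed = enriched.checkpoint()

// Option 2: write the intermediate result and read it back, which also
// starts a fresh lineage for the downstream stages.
enriched.write.mode("overwrite").parquet("hdfs:///tmp/intermediate")
val fresh = spark.read.parquet("hdfs:///tmp/intermediate")
```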

 

  • Executor tuning

The most crucial thing while performing Spark job optimization is to allocate the execution resources optimally. That includes the number of executors, the cores per executor and the memory per executor (a configuration sketch follows the list below).

  • Number of executor cores- defines the number of cores assigned to each executor.
  • Number of executors- depends on the cores and memory available per node and across the cluster.
  • Memory per executor- leave roughly one GB on each node for the OS processes and divide the rest among the executors.
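A minimal sketch of setting these values programmatically; the numbers are placeholders that must be derived from your actual node sizes, and the same settings are commonly passed to spark-submit as --num-executors, --executor-cores and --executor-memory:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder sizing: e.g. nodes with 16 cores and 64 GB RAM, keeping
// ~1 GB per node for the OS before dividing memory among executors.
val spark = SparkSession.builder()
  .appName("executor-tuning-example")
  .config("spark.executor.instances", "10")   // number of executors
  .config("spark.executor.cores", "5")        // cores per executor
  .config("spark.executor.memory", "6g")      // heap memory per executor
  .getOrCreate()
```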

 

  • Partitioning your dataset

If a Spark job runs out of memory or runs slowly, bad partitioning could be one of the reasons. If the dataset is large, you can try repartitioning to allow more parallelism. On the other hand, if your data is small and spread across many partitions, the per-partition overhead can make your job slow; in that case you can use the coalesce method, which is faster because it merges partitions without a full shuffle.
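A sketch of both directions, with placeholder paths and partition counts:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioning-example").getOrCreate()

val df = spark.read.parquet("hdfs:///data/large_table")   // placeholder path

// Too few partitions on a large dataset: increase parallelism.
// repartition() performs a full shuffle to redistribute the data evenly.
val wide = df.repartition(400)

// Too many small partitions after heavy filtering: shrink them.
// coalesce() merges partitions without a full shuffle, so it is cheaper.
val narrow = wide.filter("country = 'IN'").coalesce(20)
```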

 

  • Use datasets instead of RDDs

We all know that RDD is the basic abstraction in Spark, but newer versions of Spark provide the DataFrame and Dataset APIs, where the data carries a schema. DataFrames and Datasets are much faster than RDDs because the schema allows Spark's Catalyst optimizer to optimize the query plan.
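A small sketch contrasting the two styles, using a hypothetical Sale record:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
import spark.implicits._

// Hypothetical record type used for illustration.
case class Sale(id: Long, region: String, amount: Double)

// RDD style: opaque JVM objects, so no query optimization is possible.
val salesRdd = spark.sparkContext.parallelize(
  Seq(Sale(1, "IN", 10.0), Sale(2, "US", 20.0)))
val totalsRdd = salesRdd.map(s => (s.region, s.amount)).reduceByKey(_ + _)

// Dataset/DataFrame style: schema-aware, so Catalyst can optimize the plan.
val salesDs = salesRdd.toDS()
val totalsDf = salesDs.groupBy("region").sum("amount")
```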

 

Ksolves' expertise for better performance of Spark jobs

 

Spark itself is a huge platform with a myriad of excellent features that can optimize your jobs. We have walked you through the key considerations that will help you improve the performance of your Spark jobs. If you are looking for services like Spark job optimization, Spark job scheduling or Spark job monitoring, Ksolves is your one-stop solution. Get all your solutions from 350+ Spark developers providing services across the world.

 

Write to us in the comment section below and we will present you with the best-suited solutions.

AUTHOR

Anil Kushwaha

Spark

Anil Kushwaha, Technology Head at Ksolves, is an expert in Big Data and AI/ML. With over 11 years at Ksolves, he has been pivotal in driving innovative, high-volume data solutions with technologies like Nifi, Cassandra, Spark, Hadoop, etc. Passionate about advancing tech, he ensures smooth data warehousing for client success through tailored, cutting-edge strategies.
