Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it has a major reliance on Java’s memory management and garbage collection. Therefore, garbage collection can be a major issue that can affect many Spark applications.
Common symptoms of excessive Garbage Collection in Spark are:
#Application speed.
#Executor heartbeat timeout.
#garbage collection overhead limit exceeded error.
The first step in Garbage Collection tuning is to collect statistics by choosing – verbose while submitting spark jobs.
In an ideal situation we try to keep GC overheads < 10% of heap memory.
The Spark execution engine and Spark storage can both store data off-heap.
You can switch on off-heap storage using the following commands:
–conf spark.memory.offHeap.enabled = true
–conf spark.memory.offHeap.size = Xgb.
If using RDD-based applications, use data structures with fewer objects. For example, use an array instead of a list.
If you are dealing with primitive data types, consider using specialized data structures like Koloboke or fastutil. These structures optimize memory usage for primitive types.
Be careful when using off-heap storage as it does not impact on-heap memory size, i.e. it won’t shrink heap memory. So, to define an overall memory limit, assign a smaller heap size.
If you are using #sparksql , try to use the built-in functions as much as possible, instead of writing new UDFs. Mostly Spark UDFs can work on UnsafeRow and don’t need to convert to wrapper data types. This avoids creating garbage, also it plays well with code generation.
Remember we may be working with billions of rows. If we create even a small temporary object with 100-byte size for each row, it will create 1 billion * 100 bytes of garbage.