Apache Spark: Cannot grow BufferHolder by size 524432 because the size after growing exceeds size limitation 2147483632
I am working with a large and nested JSON dataset in Apache Spark and encountering a "max buffer size exceeded" exception during the writing process.
My Processing Steps (a rough code sketch follows this list):
- Read the JSON file.
- Explode nested structures.
- Filter out unnecessary data.
- Select relevant columns.
- Count the records.
- Write the final DataFrame.
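A minimal PySpark sketch of these steps, assuming hypothetical paths and column names (`items`, `id`, `status`, etc. are placeholders, not taken from the original question):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-json-pipeline").getOrCreate()

# Read the nested JSON file (the path is a placeholder).
df = spark.read.json("s3://bucket/input/nested.json")

# Explode a nested array column, then filter and select the relevant fields
# (column names are placeholders).
exploded = df.withColumn("item", F.explode(F.col("items")))
result = (
    exploded
    .filter(F.col("item.status") == "ACTIVE")   # drop rows that are not needed
    .select("id", "item.name", "item.value")    # keep only the relevant columns
)

# count() and write() are both actions; each one re-runs the full lineage above
# unless the DataFrame is persisted first.
print(result.count())
result.write.mode("overwrite").parquet("s3://bucket/output/")
```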
Issue: During the count() or write() operations, Spark is recomputing all transformations from the beginning, leading to excessive memory usage and eventually the max buffer size exceeded error.
What I Tried:
Initially, I got a GC (Garbage Collection) error, so I increased the driver and executor memory (spark.driver.memory, spark.executor.memory); a sketch of these settings follows below.
Now, the GC error is gone, but I still get the "max buffer size exceeded" error during count() or write().
Spark seems to recompute all transformations during these actions, leading to excessive memory usage.
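For illustration only (the memory values here are assumptions, not taken from the question), these settings can be supplied when the session is created or via spark-submit:

```python
from pyspark.sql import SparkSession

# Hypothetical memory settings; the right values depend on the cluster.
# Note: spark.driver.memory only takes effect if set before the driver JVM
# starts (e.g. via spark-submit --driver-memory), not from inside a running app.
spark = (
    SparkSession.builder
    .appName("nested-json-pipeline")
    .config("spark.driver.memory", "8g")
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)
```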
Questions: How can I prevent Spark from recomputing all transformations at the final stage?
Is caching or checkpointing an effective solution here?
Are there any specific configurations to handle this buffer size limitation?
asked Mar 27 at 9:29 by Ram Shan

Comments:
- Filter first and select before exploding, if you can. – Ged, Mar 27 at 9:58
- Cache? At an appropriate level. – Ged, Mar 27 at 10:11
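A hedged sketch of what these comments suggest, i.e. pruning rows and columns before the explode and persisting at a disk-backed storage level (paths and column names are placeholders):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("explode-after-pruning").getOrCreate()
df = spark.read.json("s3://bucket/input/nested.json")   # placeholder path

# Prune rows and columns *before* exploding, so each exploded row carries less data
# (column names are placeholders).
slim = df.select("id", "items").filter(F.col("id").isNotNull())
exploded = slim.withColumn("item", F.explode("items")).drop("items")

# Persist at a level that can spill to disk, so count() and write() reuse the
# same result instead of keeping everything in executor memory.
exploded.persist(StorageLevel.MEMORY_AND_DISK)
print(exploded.count())
exploded.write.mode("overwrite").parquet("s3://bucket/output/")
exploded.unpersist()
```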
1 Answer
Spark is not recomputing in the final stage; it is doing lazy evaluation, i.e. it only performs the actual computation when it sees an action. So when it reaches the write action, it starts reading the data, transforms it, and writes it to the destination. This is what Spark normally does. Since you have not attached your code, I am assuming that is the case. Cache would not help here. This error occurs when a single column value grows very large, so you need to check what exactly your explode is doing. Below is a thread on the same error that you can refer to; a couple of solutions are provided in it.
https://community.databricks.com/t5/data-engineering/bufferholder-exceeded-on-json-flattening/td-p/12873
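As a rough way to check what the explode is producing (my own sketch, not taken from the thread above), you could measure the largest value per string column just before writing; `result` stands for whatever DataFrame is being written:

```python
from pyspark.sql import functions as F

# Report the maximum length of each top-level string column; an extreme value
# points at the field that is overflowing the ~2 GB BufferHolder limit.
string_cols = [name for name, dtype in result.dtypes if dtype == "string"]
result.agg(
    *[F.max(F.length(F.col(c))).alias(c) for c in string_cols]
).show(truncate=False)
```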
By the way, you can attach your code and insert the image in the question itself instead of linking it; that will make it easier to check.