Spark-Operator on AWS EKS Systematic Troubleshooting
If you have tried the Spark Operator on AWS EKS, you might have run into several hard-to-solve errors. I would like to share how I troubleshot systematically and found the final clue.
Errors/Exceptions
java.lang.ClassNotFoundException: com.amazonaws.auth.AWSCredentialsProvider
java.lang.NoSuchMethodError: com.amazonaws.http.HttpResponse.getHttpRequest()Lcom/amazonaws/thirdparty/apache/http/client/methods/HttpRequestBase;
java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
java.lang.RuntimeException: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
It is important to understand the differences among them to isolate the problem.
ClassNotFoundException: Thrown when the class is not found on the classpath, or when the executing user has no permission to read the directory containing it.
NoClassDefFoundError: Thrown at runtime when a class that was present at compile time cannot be loaded, typically because an earlier attempt to load it already failed.
NoSuchMethodError: Thrown when the class is available, but its version differs from the version the referring class was compiled against.
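A quick way to tell which case you are dealing with is to probe for the suspect class at runtime. Here is a minimal sketch in Scala (the class name is just an example from the errors above):

// Probe whether a class is visible to the current classloader.
// Class.forName throws ClassNotFoundException if the class is absent.
try {
  val clazz = Class.forName("com.amazonaws.auth.AWSCredentialsProvider")
  // If loading succeeds, print which jar the class was picked up from;
  // this helps spot the version mismatches behind NoSuchMethodError.
  // (getCodeSource can be null for bootstrap classes.)
  println(clazz.getProtectionDomain.getCodeSource.getLocation)
} catch {
  case e: ClassNotFoundException => println(s"Not on the classpath: $e")
  case e: NoClassDefFoundError   => println(s"Class load failed earlier: $e")
}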
Initial trials were around creating a new image with the required aws-java-sdk-core, aws-java-sdk-bundle, and aws-java-sdk-s3 jars at /usr/bin and adding spark.driver.extraClassPath.
These were only half working! Because so many jars are referenced transitively, it still failed with one missing class or another. The reasons became clear only through a proper troubleshooting procedure.
Systematic Root Cause Finding:
Many trials were made to understand the problem first.
Understanding how the spark operator works is one of the keys.
Understanding how the deps.jars, deps.files, deps.repositories, and deps.packages specifications work (see the sketch after this list)
Understanding the Hadoop and StateStoreProvider source code at a high level
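For orientation, here is a minimal sketch of what the deps block looks like in a SparkApplication manifest; the jar path, repository, and package coordinates are placeholders, not the actual ones used here:

spec:
  deps:
    jars:
      - local:///opt/spark/extra-jars/my-extra.jar   # a jar already baked into the image
    repositories:
      - https://repo1.maven.org/maven2
    packages:
      - org.apache.hadoop:hadoop-aws:3.2.1           # resolved via Ivy at submit time

Note that deps.packages is resolved through Ivy at runtime, which is exactly what populates the /root/.ivy2/jars directory mentioned in Problem 1 below.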
Adding classpath entries and extra libraries, baking jars into the image, and referring to them in the spark-driver and spark-executor specs only half solved the problem: class files from other jars were still missing.
Printing class path in Scala is the key:
Follow this article for reference code
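In case that reference is not handy, here is a minimal sketch of printing the runtime classpath from inside the Spark app (on Java 9+ the application classloader is no longer a URLClassLoader, hence the fallback):

import java.net.URLClassLoader

// Print every classpath entry the JVM actually sees, so you can verify
// whether the extraClassPath directories really made it in.
def printClassPath(): Unit = getClass.getClassLoader match {
  case url: URLClassLoader =>
    url.getURLs.foreach(u => println(u.getFile))
  case _ =>
    // Java 9+: fall back to the java.class.path system property.
    System.getProperty("java.class.path")
      .split(java.io.File.pathSeparator)
      .foreach(println)
}

printClassPath()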
Problem 1: The Maven dependencies declared in the Spark app get downloaded to file:/root/.ivy2/jars, and this directory has permission issues because it is owned by root.
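One hedged workaround (not the final fix here) is to point Ivy at a writable cache directory via the standard spark.jars.ivy property in the manifest's sparkConf, for example:

sparkConf:
  "spark.jars.ivy": "/tmp/.ivy2"   # example path; any directory writable by the Spark user works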
Problem 2: Adding PodSecurityContext was not working; the containers were not starting:
error validating data: [ValidationError(SparkApplication.spec.driver.securityContext): unknown field "fsGroup" in io.k8s.sparkoperator.v1beta2.SparkApplication.spec.driver.securityContext, ValidationError(SparkApplication.spec.executor.securityContext): unknown field "fsGroup" in io.k8s.sparkoperator.v1beta2.SparkApplication.spec.executor.securityContext]
There were a few other trials with PodSecurityContext.
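For context, fsGroup is a pod-level setting in Kubernetes, so a CRD whose securityContext field is the container-level SecurityContext will reject it. A sketch of the spec shape that triggers the validation error above (the uid/gid values are placeholders):

spec:
  driver:
    securityContext:   # container-level SecurityContext in this CRD version
      runAsUser: 185   # accepted: exists on container-level contexts
      fsGroup: 185     # rejected: fsGroup only exists on the pod-level PodSecurityContext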
Switching from S3A to S3 gave the final clue
On closer observation, the main problem was coming from the StateStoreProvider classes and S3A initialization. So I updated the s3a:// URLs to s3:// to see what would happen.
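For example (assuming df is a streaming DataFrame; the bucket and paths are placeholders), the change was just the URL scheme on the state/checkpoint locations:

// before: "s3a://my-bucket/checkpoints" (resolved by the S3A connector)
// after:  the s3 scheme forces a different FileSystem lookup
df.writeStream
  .option("checkpointLocation", "s3://my-bucket/checkpoints")
  .format("parquet")
  .option("path", "s3://my-bucket/output")
  .start()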
Problem 3: UnsupportedFileSystemException: No FileSystem for scheme "s3"
After a bit of googling around how to work around this, it finally came down to the AWS troubleshooting link here
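For reference, the s3:// scheme has to be bound to a concrete FileSystem class before Hadoop can use it. A hedged sketch of the two usual bindings in sparkConf (the class names are the standard ones, but verify them against your image):

sparkConf:
  # Option A: bind s3:// to the open-source S3A connector
  "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
  # Option B (EMR images): bind s3:// to EMRFS, whose jars live under /usr/share/aws/emr/emrfs/lib
  # "spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem"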
Root Cause
Final Solution: It was a different error from the one we started with, but after the above troubleshooting article from AWS, I came to understand that all the AWS jars (and the uber bundle jar) were hiding under the /usr/share/aws/ and /usr/share/aws/emr directories:
/usr/share/aws/aws-java-sdk/*
/usr/share/aws/emr/emrfs/lib/*
The Spark emr-6.x.x image keeps a few library jars in the /usr/share/aws directory, while the rest live in /usr/lib/spark/jars/, /usr/lib/hadoop*, etc.
The respective directories have to be added to the following Spark Operator specification properties to get it working (a sketch follows the list)!
spark.driver.extraClassPath
spark.driver.extraLibraryPath
spark.executor.extraClassPath
spark.executor.extraLibraryPath
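Here is a hedged sketch of how this looks in the SparkApplication manifest; the directory list should be verified against the jars actually present in your image, and the native library path is the usual EMR location rather than a confirmed one:

spec:
  sparkConf:
    "spark.driver.extraClassPath": "/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/lib/*:/usr/lib/spark/jars/*:/usr/lib/hadoop/*"
    "spark.driver.extraLibraryPath": "/usr/lib/hadoop/lib/native"
    "spark.executor.extraClassPath": "/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/lib/*:/usr/lib/spark/jars/*:/usr/lib/hadoop/*"
    "spark.executor.extraLibraryPath": "/usr/lib/hadoop/lib/native"

The extraClassPath entries are colon-separated, and the /* wildcard pulls in every jar in a directory, which is what finally stops the whack-a-mole of missing classes.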
Hope you enjoyed my article. You can reach me at