Spark Operator on AWS EKS

Spark Operator on AWS EKS: Systematic Troubleshooting

Aravind Brahmadevara

--

If you have tried the Spark Operator on AWS EKS, you might have run into several hard-to-solve errors. I would like to share how I troubleshot them systematically and arrived at the final clue.

Errors/Exceptions

java.lang.ClassNotFoundException: com.amazonaws.auth.AWSCredentialsProvider

java.lang.NoSuchMethodError: com.amazonaws.http.HttpResponse.getHttpRequest()Lcom/amazonaws/thirdparty/apache/http/client/methods/HttpRequestBase;

java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider

java.lang.RuntimeException: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException

It is important to understand the differences among them to isolate the problem.

ClassNotFoundException: the class is not found on the classpath, or the executing user has no permission to read the directory that contains it.

NoClassDefFoundError: the runtime complains with this when a class that the code was compiled against cannot be loaded at runtime (for example, because an earlier attempt to load it already failed).

NoSuchMethodError: the class is available, but its version differs from the version the referring class was compiled against, so the expected method no longer exists.

Initial trials were around creating a new image with the required aws-java-sdk-core, aws-java-sdk-bundle and aws-java-sdk-s3 jars at /usr/bin and adding them via spark.driver.extraClassPath.

These were only half working! Because so many jars are referenced, the job still failed with one missing class or another. The real reasons only emerged after a proper troubleshooting procedure.

Systematic Root Cause Finding:

A lot of trials went into understanding the problem first.

Understanding how the Spark Operator works is one of the keys.

Understanding how the deps.jars, deps.files, deps.repositories and deps.packages specification works is another.
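For orientation, here is a minimal sketch of where those deps fields live in a SparkApplication manifest; the image, jar and Maven coordinates are placeholders rather than the ones used in this setup.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-spark-app                                # placeholder name
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:emr-6.x.x                # placeholder image
  mainApplicationFile: local:///opt/app/my-app.jar  # placeholder jar
  deps:
    jars:                          # extra jars shipped to driver and executors
      - local:///usr/share/aws/aws-java-sdk/aws-java-sdk-bundle.jar
    files: []                      # extra files, if any
    repositories:                  # Maven repositories used to resolve packages
      - https://repo1.maven.org/maven2
    packages:                      # Maven coordinates resolved at submit time (via Ivy)
      - org.apache.hadoop:hadoop-aws:3.2.1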

Understanding the Hadoop and StateStoreProvider source code at a high level also helps.

Adding classpath entries and extra libraries, baking jars into the image, and referring to them in the spark driver and executor specs only half solved the problem: classes from other, unreferenced jars were still missing.

Printing the classpath from the Scala application is the key:

Follow this article for reference code

Problem 1: The Maven dependencies declared in the Spark app get downloaded to file:/root/.ivy2/jars, and this directory has permission issues because it is owned by root.
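One common workaround (an assumption on my part, not necessarily the fix used here) is to point Ivy at a directory the container user can write to via sparkConf; the path below is only an example.

spec:
  sparkConf:
    # Packages declared under deps.packages are resolved by Ivy; redirect its
    # cache away from /root/.ivy2 to a writable location.
    "spark.jars.ivy": "/tmp/.ivy2"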

Problem 2: Adding a PodSecurityContext was not working; the containers were not starting. error validating data: [ValidationError(SparkApplication.spec.driver.securityContext): unknown field "fsGroup" in io.k8s.sparkoperator.v1beta2.SparkApplication.spec.driver.securityContext, ValidationError(SparkApplication.spec.executor.securityContext): unknown field "fsGroup" in io.k8s.sparkoperator.v1beta2.SparkApplication.spec.executor.securityContext]
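Reconstructed from the error above, the attempted spec looked roughly like this (the user ID is a placeholder). It fails because fsGroup is a pod-level setting, and the v1beta2 driver/executor securityContext field does not accept it.

spec:
  driver:
    securityContext:
      runAsUser: 185      # placeholder UID
      fsGroup: 185        # pod-level field, hence the "unknown field" validation error
  executor:
    securityContext:
      runAsUser: 185
      fsGroup: 185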

There were a few other trials with PodSecurityContext as well.

Switching from S3A to S3 gave the final clue

After keen observation, the main problem appeared to be coming from the StateStoreProvider classes and S3A initialization. So I updated the s3a:// URLs to s3:// to see what would happen.

Problem 3: UnsupportedFileSystemException: No FileSystem for scheme "s3"

After a little googling for a way to work around this, it finally came down to the AWS troubleshooting link here.
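For context, on EMR images the plain s3:// scheme is served by EMRFS; roughly, the wiring looks like the sketch below, and it only resolves when the EMRFS jars under /usr/share/aws/emr/emrfs/lib/ are on the classpath. Treat the exact property and class name as assumptions to verify against your own image.

spec:
  sparkConf:
    # EMRFS backs the s3:// scheme on EMR; the class lives in
    # /usr/share/aws/emr/emrfs/lib/*, so those jars must be on the classpath.
    "spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem"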

Root Cause

Final Solution: It was a different error from the one we started with, but after the above troubleshooting article from AWS, I came to understand that all the AWS jars (and the uber bundle jar) were hiding under the /usr/share/aws/ and /usr/share/aws/emr/ directories:

/usr/share/aws/aws-java-sdk/*
/usr/share/aws/emr/emrfs/lib/*

The Spark emr-6.x.x image has a few library jars in the /usr/share/aws directory while the rest of them are in /usr/lib/spark/jars/, /usr/lib/hadoop*, etc.

The respective directories have to be added to the following spark operator specification properties to get it working (a sketch follows the list)!

spark.driver.extraClassPath
spark.driver.extraLibraryPath
spark.executor.extraClassPath
spark.executor.extraLibraryPath
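Put together, the working configuration looked roughly like this sketch. The classpath entries mirror the directories named above; the native-library paths are an assumption based on typical EMR layouts, so verify them inside your emr-6.x.x image.

spec:
  sparkConf:
    # Classpath: AWS SDK + EMRFS jars plus the usual Spark/Hadoop jar directories
    "spark.driver.extraClassPath": "/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/lib/*:/usr/lib/spark/jars/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*"
    "spark.executor.extraClassPath": "/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/lib/*:/usr/lib/spark/jars/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*"
    # Library path: native libraries (locations typical of EMR images)
    "spark.driver.extraLibraryPath": "/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native"
    "spark.executor.extraLibraryPath": "/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native"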
