
# SparkApplication API

The Spark Operator uses CustomResourceDefinitions named SparkApplication and ScheduledSparkApplication for specifying one-time Spark applications and Spark applications that run on a standard cron schedule, respectively. Like other kinds of Kubernetes resources, they consist of a specification in a Spec field and a Status field. The definitions are organized in the following structure. The v1alpha1 version of the API definition is implemented here.

```
ScheduledSparkApplication
|__ ScheduledSparkApplicationSpec
    |__ SparkApplication
|__ ScheduledSparkApplicationStatus

SparkApplication
|__ SparkApplicationSpec
    |__ DriverSpec
        |__ SparkPodSpec
    |__ ExecutorSpec
        |__ SparkPodSpec
    |__ Dependencies
|__ SparkApplicationStatus
    |__ DriverInfo
```

## API Definition

### SparkApplicationSpec

A SparkApplicationSpec has the following top-level fields:

| Field | Spark configuration property or spark-submit option | Note |
| ----- | ---------------------------------------------------- | ---- |
| Type | N/A | The type of the Spark application. Valid values are Java, Scala, Python, and R. |
| Mode | --mode | Spark deployment mode. Valid values are cluster and client. |
| Image | spark.kubernetes.container.image | Unified container image for the driver, executor, and init-container. |
| InitContainerImage | spark.kubernetes.initContainer.image | Custom init-container image. |
| ImagePullPolicy | spark.kubernetes.container.image.pullPolicy | Container image pull policy. |
| MainClass | --class | Main application class to run. |
| MainApplicationFile | N/A | Main application file, e.g., a bundled jar containing the main class and its dependencies. |
| Arguments | N/A | List of application arguments. |
| SparkConf | N/A | A map of extra Spark configuration properties. |
| HadoopConf | N/A | A map of Hadoop configuration properties. The operator adds the prefix spark.hadoop. to the properties when passing them through the --conf option. |
| SparkConfigMap | N/A | Name of a Kubernetes ConfigMap carrying Spark configuration files, e.g., spark-env.sh. The controller sets the environment variable SPARK_CONF_DIR to where the ConfigMap is mounted. |
| HadoopConfigMap | N/A | Name of a Kubernetes ConfigMap carrying Hadoop configuration files, e.g., core-site.xml. The controller sets the environment variable HADOOP_CONF_DIR to where the ConfigMap is mounted. |
| Volumes | N/A | List of Kubernetes volumes the driver and executors need collectively. |
| Driver | N/A | A DriverSpec field. |
| Executor | N/A | An ExecutorSpec field. |
| Deps | N/A | A Dependencies field. |
| RestartPolicy | N/A | The policy regarding whether and under which conditions the controller should restart a terminated application. Valid values are Never, Always, and OnFailure. |
| NodeSelector | spark.kubernetes.node.selector.[labelKey] | Node selector of the driver pod and executor pods, with key labelKey and value as the label's value. |
| MaxSubmissionRetries | N/A | The maximum number of times to retry a failed submission. |
| SubmissionRetryInterval | N/A | The unit interval in seconds between submission retries. Depending on the implementation, the actual interval between two submission retries may be a multiple of SubmissionRetryInterval, e.g., if linear or exponential backoff is used. |
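
As an illustration, a minimal SparkApplication manifest exercising several of these fields might look like the following sketch. The image, jar path, and names are hypothetical, and the lowerCamelCase field names are assumed from the JSON tags of the corresponding Go struct fields:

```yaml
apiVersion: sparkoperator.k8s.io/v1alpha1
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/example/spark:v2.3.0   # hypothetical image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar  # hypothetical path
  arguments:
    - "1000"
  sparkConf:
    spark.ui.port: "4045"              # example of an extra Spark property
  restartPolicy: Never
```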

### DriverSpec

A DriverSpec embeds a SparkPodSpec and additionally has the following fields:

| Field | Spark configuration property or spark-submit option | Note |
| ----- | ---------------------------------------------------- | ---- |
| PodName | spark.kubernetes.driver.pod.name | Name of the driver pod. |
| ServiceAccount | spark.kubernetes.authenticate.driver.serviceAccountName | Name of the Kubernetes service account to use for the driver pod. |

### ExecutorSpec

Similarly to the DriverSpec, an ExecutorSpec also embeds a SparkPodSpec and additionally has the following fields:

| Field | Spark configuration property or spark-submit option | Note |
| ----- | ---------------------------------------------------- | ---- |
| Instances | spark.executor.instances | Number of executor instances to request. |

### SparkPodSpec

A SparkPodSpec defines common attributes of a driver or executor pod, summarized in the following table.

| Field | Spark configuration property or spark-submit option | Note |
| ----- | ---------------------------------------------------- | ---- |
| Cores | spark.driver.cores or spark.executor.cores | Number of CPU cores for the driver or executor pod. |
| CoreLimit | spark.kubernetes.driver.limit.cores or spark.kubernetes.executor.limit.cores | Hard limit on the number of CPU cores for the driver or executor pod. |
| Memory | spark.driver.memory or spark.executor.memory | Amount of memory to request for the driver or executor pod. |
| Image | spark.kubernetes.driver.container.image or spark.kubernetes.executor.container.image | Custom container image for the driver or executor. |
| ConfigMaps | N/A | A map of Kubernetes ConfigMaps to mount into the driver or executor pod. Keys are ConfigMap names and values are mount paths. |
| Secrets | spark.kubernetes.driver.secrets.[SecretName] or spark.kubernetes.executor.secrets.[SecretName] | A map of Kubernetes secrets to mount into the driver or executor pod. Keys are secret names and values specify the mount paths and secret types. |
| EnvVars | spark.kubernetes.driverEnv.[EnvironmentVariableName] or spark.executorEnv.[EnvironmentVariableName] | A map of environment variables to add to the driver or executor pod. Keys are variable names and values are variable values. |
| Labels | spark.kubernetes.driver.label.[LabelName] or spark.kubernetes.executor.label.[LabelName] | A map of Kubernetes labels to add to the driver or executor pod. Keys are label names and values are label values. |
| Annotations | spark.kubernetes.driver.annotation.[AnnotationName] or spark.kubernetes.executor.annotation.[AnnotationName] | A map of Kubernetes annotations to add to the driver or executor pod. Keys are annotation names and values are annotation values. |
| VolumeMounts | N/A | List of Kubernetes volume mounts for volumes that should be mounted to the pod. |
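
Because both Driver and Executor embed a SparkPodSpec, the same pod-level attributes appear under both sections of a manifest. A sketch with hypothetical values (field names again assume the lowerCamelCase JSON tags):

```yaml
spec:
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    serviceAccount: spark        # DriverSpec-specific field
    labels:
      version: "2.3.0"
    envVars:
      APP_ENV: production        # hypothetical environment variable
  executor:
    instances: 2                 # ExecutorSpec-specific field
    cores: 1
    memory: "512m"
    labels:
      version: "2.3.0"
```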

### Dependencies

A Dependencies specifies the various types of dependencies of a Spark application in a central place.

| Field | Spark configuration property or spark-submit option | Note |
| ----- | ---------------------------------------------------- | ---- |
| Jars | spark.jars or --jars | List of jars the application depends on. |
| Files | spark.files or --files | List of files the application depends on. |
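
For example, jars and files can be listed centrally under deps; the paths below are hypothetical:

```yaml
spec:
  deps:
    jars:
      - local:///opt/spark/extra-jars/my-lib.jar   # hypothetical jar
    files:
      - gs://my-bucket/conf/app.properties         # hypothetical file
```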

### SparkApplicationStatus

A SparkApplicationStatus captures the status of a Spark application, including the state of every executor.

| Field | Note |
| ----- | ---- |
| AppID | A randomly generated ID used to group all Kubernetes resources of an application. |
| SubmissionTime | Time the application is submitted to run. |
| CompletionTime | Time the application completes (if it does). |
| DriverInfo | A DriverInfo field. |
| AppState | Current state of the application. |
| ExecutorState | A map of executor pod names to executor state. |
| SubmissionRetries | The number of submission retries for an application. |

### DriverInfo

A DriverInfo captures information about the driver pod and the Spark web UI running in the driver.

| Field | Note |
| ----- | ---- |
| WebUIServiceName | Name of the service for the Spark web UI. |
| WebUIPort | Port on which the Spark web UI runs. |
| WebUIAddress | Address to access the web UI from outside the cluster. |
| PodName | Name of the driver pod. |
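
Putting the two status types together, the status block of a running application might be reported roughly as follows. All values are hypothetical, and the lowerCamelCase field names are assumptions based on typical JSON tags, not confirmed by this document:

```yaml
status:
  appId: spark-pi-1528024868        # randomly generated grouping ID
  submissionTime: 2018-06-03T10:01:08Z
  driverInfo:
    podName: spark-pi-driver
    webUIServiceName: spark-pi-ui-svc
    webUIPort: 4040
  appState:
    state: RUNNING
  executorState:                    # executor pod name -> state
    spark-pi-exec-1: RUNNING
    spark-pi-exec-2: RUNNING
  submissionRetries: 0
```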

### ScheduledSparkApplicationSpec

A ScheduledSparkApplicationSpec has the following top-level fields:

| Field | Optional | Default | Note |
| ----- | -------- | ------- | ---- |
| Schedule | No | N/A | The cron schedule on which the application should run. |
| Template | No | N/A | A template from which SparkApplication instances of scheduled runs of the application can be created. |
| Suspend | Yes | false | A flag telling the controller to suspend subsequent runs of the application if set to true. |
| ConcurrencyPolicy | Yes | Allow | The policy governing concurrent runs of the application. Valid values are Allow, Forbid, and Replace. |
| SuccessfulRunHistoryLimit | Yes | 1 | The number of past successful runs of the application to keep track of. |
| FailedRunHistoryLimit | Yes | 1 | The number of past failed runs of the application to keep track of. |
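
A ScheduledSparkApplication wraps a SparkApplicationSpec in its Template field, so a scheduled variant of the earlier example might look like this sketch (the schedule and history limits are illustrative, and the jar path is hypothetical):

```yaml
apiVersion: sparkoperator.k8s.io/v1alpha1
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-scheduled
spec:
  schedule: "*/5 * * * *"      # run every five minutes
  concurrencyPolicy: Forbid    # skip a run if the previous one is still going
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 3
  template:                    # a full SparkApplicationSpec
    type: Scala
    mode: cluster
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
    restartPolicy: Never
```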

### ScheduledSparkApplicationStatus

A ScheduledSparkApplicationStatus captures the status of a ScheduledSparkApplication, including the times of recent and upcoming runs and the names of past runs.

| Field | Note |
| ----- | ---- |
| LastRun | The time when the last run of the application started. |
| NextRun | The time when the next run of the application is estimated to start. |
| PastSuccessfulRunNames | The names of SparkApplication objects of past successful runs of the application. The maximum number of names to keep track of is controlled by SuccessfulRunHistoryLimit. |
| PastFailedRunNames | The names of SparkApplication objects of past failed runs of the application. The maximum number of names to keep track of is controlled by FailedRunHistoryLimit. |
| ScheduleState | The current scheduling state of the application. Valid values are FailedValidation and Scheduled. |
| Reason | Human-readable message on why the ScheduledSparkApplication is in the particular ScheduleState. |
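
Correspondingly, the status of a healthy ScheduledSparkApplication might read as follows. As with the earlier status sketch, the values are hypothetical and the lowerCamelCase field names are assumptions:

```yaml
status:
  lastRun: 2018-06-03T10:05:00Z
  nextRun: 2018-06-03T10:10:00Z
  pastSuccessfulRunNames:          # bounded by successfulRunHistoryLimit
    - spark-pi-scheduled-1528020300
  scheduleState: Scheduled
```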