Advanced Job Features

Admittedly, running a simple Job as earlier has little advantage over running a single Pod but there is more to Jobs than meets the eye as you will see in the following lessons.

Automatic Retry-Logic

The Job concept in Kubernetes provides options when a container has not been executed successfully. This does not only cover the creation of Pods and their containers but also the workload running inside them.

A Failing Job

Run the following Job which will fail with certainty. Create the file 30-failing-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: simple-one-off-job-from-yaml
spec:
  template:
    spec:
      containers:
        - name: simple-one-off-job-container
          image: busybox
          imagePullPolicy: Always
          command: ['/bin/sh', '-c']
          args:
            - 'echo "I represent a failing maintenance task"; exit 1'
      restartPolicy: OnFailure

Apply it:

kubectl apply -f 30-failing-job.yaml

As usual a kubectl describe is executed to obtain more information:

kubectl describe job simple-one-off-job-from-yaml

Interestingly, this produces an event SuccessfulCreate originating from the job-controller. Everything seems normal, although we know that the Job must have failed. A closer look reveals there is no contradiction. The event informs that the Pod simple-one-off-job-from-yaml-5x2xt has been created successfully and in fact it has been. It's just that the container inside the Pod has failed. So a closer look at the pod is necessary:

kubectl get pods -l job-name=simple-one-off-job-from-yaml

This time we see a clear indication that something went wrong:

STATUS is CrashLoopBackOff
RESTARTS is 4

And an even closer look with:

kubectl describe pod simple-one-off-job-from-yaml-547xc

Reveals a warning: Back-off restarting failed container which indicated a non-zero return value from starting the container. For those new to Unix/Linux systems [2]:

For the shell's purposes, a command which exits with a status code of zero has succeeded. A non-zero exit status indicates failure. This seemingly counter-intuitive scheme is used so there is one well-defined way to indicate success and a variety of ways to indicate various failure modes. When a command terminates on a fatal signal whose number is N, Bash uses the value 128+N as the exit status.

Without further specification Kubernetes determines the success of a container start by looking at the exit value of the container startup command to be zero. Any non-zero value will be considered a failing container start.

A Flaky Job

But there is more to the previous example. The field RESTARTS: 4 suggests that Kubernetes has not given up immediately but only after four (4) failed attempts. This means that if a failure is sporadic, the retry logic can help to accomplish a Job nevertheless.

Consider a Job that sometimes fails and sometimes succeeds. While such a case calls for a bugfix by the developer, it is also nice if Kubernetes can help so that the developer does not have to get up at night.

We simulate a flaky Job with the following shell command:

(( RANDOM%3 == 0 )) && exit 0 || exit 1

This will randomly exit with either success (0) or failure (1) with a bias towards failure.

Use this version of you want to try it without existing your shell:

(( RANDOM%3 == 0 )) && echo 0 || echo 1

Create the file 40-flaky-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: flaky-job
spec:
  backoffLimit: 5
  template:
    spec:
      containers:
        - name: flaky-job-container
          image: ubuntu
          imagePullPolicy: Always
          command: ['/bin/bash', '-c']
          args:
            - 'echo "I represent a flaky maintenance task"; (( RANDOM%3 == 0 )) && exit 0 || exit 1'
      restartPolicy: OnFailure

Execute the Job:

kubectl apply -f 40-flaky-job.yaml

While the Job is being created open another terminal and observe the Pods belonging to the Job:

kubectl get pods -l job-name=flaky-job -w

The option -w as in "watch" keeps kubectl polling for changes. So you are likely to see a sequence of events such as:

flaky-job-b7nj8   0/1     Pending             0          0s
flaky-job-b7nj8   0/1     Pending             0          0s
flaky-job-b7nj8   0/1     ContainerCreating   0          1s
flaky-job-b7nj8   0/1     Error               0          3s
flaky-job-b7nj8   0/1     Error               1          5s
flaky-job-b7nj8   0/1     CrashLoopBackOff    1          6s
flaky-job-b7nj8   0/1     Completed           2          24s

As the containers are failing randomly you may have to delete and create the Job several times to observe a similar sequence. Give it a try.

You can see from the output that the Pod has failed two times before it succeeded. This is free robustness for workloads when using Kubernetes Jobs.

Related Videos​

Advanced Job Features

Automatic Retry-Logic​

A Failing Job​

A Flaky Job​

Links​

Related Videos

Automatic Retry-Logic

A Failing Job

A Flaky Job

Links