This approach uses a Persistent Volume to share a set of SDC Stage Libraries across multiple SDC Pods. The stage libs are mounted by SDC Pods at deployment time so there is no need to package them in a custom SDC image in advance (as is done in the Custom Docker Image example).
There are several Persistent Volume types to choose from, with various capabilities. For this example, I'll deploy SDC on AKS using an Azure File Volume, which makes it easy to share SDC stage libs across multiple nodes and Pods.
Choose your PV type carefully, as not all PV types can share files across more than one node. See the docs here for a list of Access Modes for each PV type. For example, an AWSElasticBlockStore can only be shared across Pods on the same node. And Google provides an example here of how to write to a GCEPersistentDisk using ReadWriteOnce
access mode and then how to unmount the disk and mount it again with ReadOnlyMany
access mode so that multiple Pods and nodes can access the disk's data.
In this example, a Persistent Volume will be dynamically provisioned by a Persistent Volume Claim and populated by a Job. The Job will download a set of stage libs based on a list stored in a ConfigMap. The stage libs on the Persistent Volume will then be shared with multiple SDC Pods.
Here is an example ConfigMap that contains two lists: one for standard SDC stage libs and one for Enterprise stage libs. Two separate lists are needed because Enterprise stage libs are downloaded from a different URL than standard SDC stage libs:
apiVersion: v1
kind: ConfigMap
metadata:
name: sdc-stage-libs-list
data:
sdc-stage-libs: |
streamsets-datacollector-aws-lib
streamsets-datacollector-basic-lib
streamsets-datacollector-bigtable-lib
streamsets-datacollector-dataformats-lib
streamsets-datacollector-dev-lib
streamsets-datacollector-google-cloud-lib
streamsets-datacollector-groovy_2_4-lib
streamsets-datacollector-jdbc-lib
streamsets-datacollector-jms-lib
streamsets-datacollector-jython_2_7-lib
streamsets-datacollector-stats-lib
streamsets-datacollector-windows-lib
sdc-enterprise-stage-libs: |
streamsets-datacollector-databricks-lib-1.0.0
streamsets-datacollector-snowflake-lib-1.4.0
streamsets-datacollector-oracle-lib-1.2.0
streamsets-datacollector-sql-server-bdc-lib-1.0.1
Create the ConfigMap by executing the command:
$ kubectl apply -f sdc-stage-libs-configmap.yaml
Here is an example StorageClass for an Azure File Volume:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: sdc-stage-libs-sc
provisioner: kubernetes.io/azure-file
mountOptions:
- dir_mode=0777
- file_mode=0777
- uid=0
- gid=0
- mfsymlinks
- cache=strict
parameters:
skuName: Standard_LRS
Create the StorageClass by executing the command:
$ kubectl apply -f sdc-stage-libs-sc.yaml
Create a Persistent Volume Claim (PVC) that requests 5GB of storage and refers to the StorageClass defined above. I'll use ReadWriteOnce
access mode as there is only one writer (the Job) but an Azure File Share is readable across multiple nodes):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: sdc-stage-libs-pvc
spec:
accessModes:
- ReadWriteOnce
storageClassName: sdc-stage-libs-sc
resources:
requests:
storage: 5Gi
When this PVC is created, it will dynamically create an Azure File-based Persistent Volume.
Create the PVC by executing the command:
$ kubectl apply -f sdc-stage-libs-pvc.yaml
Inspect the PVC and wait until its status is Bound
. For example here's what I see in my environment:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES
sdc-stage-libs-pvc Bound pvc-a2d1f7d6-6b6b-4ec7-8b8b-defa0422894c 5Gi RWO
Create a Job to populate the Persistent Volume.
An example Job is defined here. It spins up a BusyBox container that uses shell commands to download and extract the stage libs, and writes them to the directory /streamsets-libs
in the Persistent Volume.
Run the Job by executing the command:
$ kubectl apply -f sdc-stage-libs-job.yaml
You can tail the log of the Job's Pod to see the progress of the stage libs downloads:
$ kubectl logs -f sdc-stage-libs-job-4xtln
Make sure the Job completes successfully:
$ kubectl get jobs
NAME COMPLETIONS DURATION AGE
sdc-stage-libs-job 1/1 73s 98s
When the Job completes, the Persistent Volume will be populated with the specified SDC stage libs.
Create a Control Hub-based SDC Deployment that mounts the Persistent Volume using the PVC, using a manifest like this:
apiVersion: apps/v1
kind: Deployment
metadata:
name: sdc
spec:
selector:
matchLabels:
app: sdc
template:
metadata:
labels:
app: sdc
spec:
containers:
- name: sdc
image: streamsets/datacollector:latest
ports:
- containerPort: 18630
env:
- name: SDC_JAVA_OPTS
value: "-Xmx2g -Xms2g"
volumeMounts:
- name: sdc-stage-libs
mountPath: /opt/streamsets-datacollector-3.16.1/streamsets-libs
readOnly: "true"
volumes:
- name: sdc-stage-libs
persistentVolumeClaim:
claimName: sdc-stage-libs-pvc
Specify three instance of SDC and start the deployment. These Pods will load the streamsets-libs
directory as a read-only file system.
We can see the three SDC Pods are running (along with the completed Job and our Control Agent):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
control-agent-764778f746-pjcpr 1/1 Running 0 168m
sdc-8689648457-2pkhk 1/1 Running 0 62s
sdc-8689648457-9tth5 1/1 Running 0 62s
sdc-8689648457-f8r9b 1/1 Running 0 62s
sdc-stage-libs-job-4xtln 0/1 Completed 0 5m35s
The three SDC instances will register with Control Hub:
Here we can see (using kubectl port-forward
) the installed stage libs for one of the SDCs: