Building Kubeflow pipeline components
This walkthrough shows how to write a Kubeflow pipeline component specification.
Test Environment
- Provisioning machine - Ubuntu 18.04
- Node machines - Ubuntu 20.04
- DeepOps - 22.08
- Kubernetes - 1.23.7
- Kubeflow - 1.6.1
Walkthrough
0. Prerequisites
- Kubeflow installed on Kubernetes
1. Understanding pipeline components
A pipeline component represents one step in your MLOps pipeline, such as preprocessing data or training a model.
To create a component, you write the component's executable code and build the Docker container image that the code runs in. You also define your component's metadata and interface in a component specification.
When Kubeflow Pipelines executes a component, a container image is started in a Kubernetes Pod and your component’s inputs are passed in as command-line arguments.
You can pass small inputs, such as strings and numbers, by value. Larger inputs, such as CSV data, must be passed as paths to files. When your component has finished, the component’s outputs are returned as files.
The following is an example program written in Python 3. It downloads a directory from remote storage over SFTP. Let's build a component from this example.
import argparse
import os

import pysftp

# Read connection details and paths from command-line arguments.
parser = argparse.ArgumentParser()
parser.add_argument('--host')
parser.add_argument('--username')
parser.add_argument('--pwd')
parser.add_argument('--src_dir')
parser.add_argument('--save_dir')
args = parser.parse_args()

host = args.host
username = args.username
pwd = args.pwd
src_dir = args.src_dir
save_dir = args.save_dir

# Create the local directory the dataset will be downloaded into.
os.makedirs(save_dir)

cnopt = pysftp.CnOpts()
hostkeys = None
if cnopt.hostkeys.lookup(host) is None:
    # The host is not in known_hosts, so skip hostkey verification for this run.
    print("Hostkey for", host, "does not exist. Skipping hostkey verification.")
    hostkeys = cnopt.hostkeys
    cnopt.hostkeys = None

with pysftp.Connection(host, username=username, password=pwd, cnopts=cnopt) as sftp:
    with sftp.cd(src_dir):
        # Recursively download the remote directory into save_dir.
        sftp.get_r(".", save_dir)

print("Download Completed")
2. Containerize your code
The component must be packaged as a Docker image and pushed to a container registry that your Kubernetes cluster can access.
Here is an example Dockerfile.
FROM python:latest
WORKDIR /usr/src/app
# Install Python dependencies first so this layer is cached between builds.
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
# Copy the component code (sftp_download.py) into the image.
COPY src/ .
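The requirements.txt copied above lists the Python dependencies of the program. For the example script, the only third-party package is pysftp, so it could be as simple as:

pysftp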
Then you can build and push the Docker image.
docker build -t YOUR_REPO/IMAGE_NAME:IMAGE_TAG . # Don't forget the trailing dot (build context path)
docker push YOUR_REPO/IMAGE_NAME:IMAGE_TAG
3. Creating a component specification
The component specification defines the component's metadata, interface, and implementation. The following example builds a component specification YAML step by step.
First, define your component's metadata and interface.
name: Download dataset
description: Download dataset via sftp
inputs:
- {name: sftp_host, type: String, description: 'IP address of the NAS'}
- {name: sftp_username, type: String, description: 'Username to access'}
- {name: sftp_pwd, type: String, description: 'Password to access'}
- {name: sftp_dir, type: String, description: 'Directory path for dataset on NAS'}
outputs:
- {name: raw_data_dir, description: 'Path to store dataset on local.'}
To define an input in your component.yaml, add an item to the inputs list with the following attributes:
- name: Name of this input. Each input’s name must be unique.
- description(Optional): Description of the input.
- default(Optional): The default value for this input.
- type(Optional): The input’s type.
- optional: Specifies if this input is optional. The default value is false.
These are the attributes for the outputs list:
- name: Name of this output. Each output's name must be unique.
- description(Optional): Description of the output.
- type(Optional): The output's type.
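For illustration, an input that uses the default and optional attributes could be declared like this (sftp_port is a hypothetical input, not part of the component in this walkthrough):

- {name: sftp_port, type: Integer, default: '22', optional: true, description: 'SSH port of the NAS'}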
Then, write your component’s implementation section. You have to specify the name of your container image and define a command for your component’s implementation.
implementation:
  container:
    image: YOUR_REPO/IMAGE_NAME:IMAGE_TAG
    # command is a list of strings (command-line arguments).
    # The YAML language has two syntaxes for lists and you can use either of them.
    # Here we use the "flow syntax" - comma-separated strings inside square brackets.
    command: [
      python3,
      # Path of the program inside the container
      /usr/src/app/sftp_download.py,
      --host,
      {inputValue: sftp_host},
      --username,
      {inputValue: sftp_username},
      --pwd,
      {inputValue: sftp_pwd},
      --src_dir,
      {inputValue: sftp_dir},
      --save_dir,
      {outputPath: raw_data_dir},
    ]
The command is formatted as a list of strings. Each string in the command is a command-line argument or a placeholder. At runtime, placeholders are replaced with an input or output.
There are three types of input/output placeholders:
- {inputValue: input-name}: This placeholder is replaced with the value of the specified input. This is useful for small pieces of input data, such as numbers or small strings.
- {inputPath: input-name}: This placeholder is replaced with the path to this input as a file. Your component can read the contents of that input at that path during the pipeline run.
- {outputPath: output-name}: This placeholder is replaced with the path where your program writes this output's data. This lets the Kubeflow Pipelines system read the contents of the file and store it as the value of the specified output.
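The example component only uses inputValue and outputPath. If it also accepted a large input as a file, say a hypothetical list of remote paths to download, that argument would be passed by path instead of by value, for example:

command: [
  python3,
  /usr/src/app/sftp_download.py,
  --file_list,
  {inputPath: remote_file_list},
  --save_dir,
  {outputPath: raw_data_dir},
]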
After you define all sections of the component, the component.yaml should look like the following.
name: Download dataset
description: Download dataset via sftp
inputs:
- {name: sftp_host, type: String, description: 'IP address of the NAS'}
- {name: sftp_username, type: String, description: 'Username to access'}
- {name: sftp_pwd, type: String, description: 'Password to access'}
- {name: sftp_dir, type: String, description: 'Directory path for dataset on NAS'}
outputs:
- {name: raw_data_dir, description: 'Path to store dataset on local.'}
implementation:
  container:
    image: YOUR_REPO/IMAGE_NAME:IMAGE_TAG
    command: [
      python3,
      /usr/src/app/sftp_download.py,
      --host,
      {inputValue: sftp_host},
      --username,
      {inputValue: sftp_username},
      --pwd,
      {inputValue: sftp_pwd},
      --src_dir,
      {inputValue: sftp_dir},
      --save_dir,
      {outputPath: raw_data_dir},
    ]
Done! You can use the Kubeflow Pipelines SDK to load your component.
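For example, a minimal pipeline that loads and runs this component with the kfp v1 SDK could look like the following sketch (the pipeline name and compile step are illustrative; the input names come from the component.yaml above):

import kfp
from kfp import dsl

# Load the component from its specification file.
download_op = kfp.components.load_component_from_file('component.yaml')

@dsl.pipeline(name='Download dataset pipeline',
              description='Runs the SFTP download component.')
def download_pipeline(sftp_host: str, sftp_username: str,
                      sftp_pwd: str, sftp_dir: str):
    # Instantiating the loaded component adds one step (Pod) to the pipeline.
    download_op(
        sftp_host=sftp_host,
        sftp_username=sftp_username,
        sftp_pwd=sftp_pwd,
        sftp_dir=sftp_dir,
    )

if __name__ == '__main__':
    # Compile to a workflow file that can be uploaded in the Kubeflow Pipelines UI.
    kfp.compiler.Compiler().compile(download_pipeline, 'download_pipeline.yaml')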
4. Building components from a Python function
If your component's code is implemented as a Python function, you can use the Kubeflow Pipelines SDK to package your function as a component.
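As a brief sketch with the kfp v1 SDK, a plain Python function can be turned into a component in a few lines; the function below is purely illustrative:

from kfp.components import create_component_from_func

def add(a: float, b: float) -> float:
    """A trivial step that adds two numbers."""
    return a + b

# The SDK generates the component specification and wraps the function
# so it runs inside the given base image.
add_op = create_component_from_func(add, base_image='python:3.9')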