Install kubernetes and kubeflow on your on-premise machines

2 minute read

You can set up the k8s through deepops and install the kubeflow on your on-premise machines with this walkthrough.

Test Environment

  • Provisioning machine - Ubuntu 18.04
  • Node machines - Ubuntu 20.04
  • deepops - 22.08
  • kubernetes - 1.23.7
  • kubeflow - 1.6.1

Walkthrough

0. Prerequisition

  • Provisioning machine(Linux)
  • At least one Node machine(Linux)

1. Clone deepops repository

# provisioning machine
git clone https://github.com/NVIDIA/deepops.git
cd deepops
git checkout tags/22.08

2. SSH public key authentication

Add public key to nodes for logging into them without typing password. (This step may duplicate with k8s-cluster playbook in step 5)

# provisioning machine
ssh-keygen # Just press Enter for input (Generating key with defalut option)
scp ~/.ssh/id_rsa.pub YourUser@Your.Node.Ip:id_rsa.pub

# node machines
cat id_rsa.pub >> /home/YourUser/.ssh/authorized_keys

# provisioning machine (testing ssh access without password)
ssh YourUser@Your.Node.IP

3. Run the setup script for provisioning machine

This script will activate the python virtual environment(venv). you can change the venv directory path by modifying your environment variable $VENV_DIR (default is /opt/deepops/env) Also, this will create config directory which is copied from default configuration directory (config.example).

# deepops directory, provisioning machine
./scripts/setup.sh

4. Modify configuration files

One of the first things that the setup script does is copy the config.example directory to a new directory called config. This directory is your DeepOps configuration directory, and contains several files that govern the behavior of the Ansible playbooks used by DeepOps.

You can refer to this link as you edit the configurations.

# deepops directory, provisioning machine
vi config/inventory
vi config/group_vars/k8s-cluster.yml
vi config/group_vars/all.yml
...

# Verify the configurations
ansible all -m raw -a "hostname" -K 
# opthon 'K' for passing sudo password of nodes

5. Run ansible playbook(Set up Kubernetes cluster)

ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml -K
# Add option like '-e kube_version=v1.22.8' if you want another k8s version

You don’t need to add option -K for your second run and after.

6. Install kubeflow

# Clone kubeflow manifests repository
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout tags/v1.6.1

# Download kustomize 3.2.0
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x kustomize_3.2.0_linux_amd64
sudo cp ./kustomize_3.2.0_linux_amd64 /usr/local/bin/kustomize

# Install all Kubeflow official components
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

# Make sure all pods are ready
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com

# Accessing kubeflow via port-forward 
# defalut login email address is `user@example.com` with password `12341234`
kubectl port-forward svc/istio-ingressgateway -n istio-system --address=0.0.0.0 8080:80

99. FAQ

  • Error ‘usr/bin/python: not found’
    # Add the variable to the bottom of your inventory file
    vi config/inventory
    '''
    [all:vars]
    ~ existed vars ~
    ansible_python_interpreter=/usr/bin/python3
    '''
    
  • Stuck on the copying kubectl step
    # Run the playbook without -K option
    ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
    #I guess the `-K` option of Ansible which ask node machine's sudo password is applying the password to you provisioning machine.
    #Maybe if the provisioning machine's password is different with node machines, Ansible will stuck on the step.
    
  • Training training-operator keeps crashing
    # Edit the limitation of memory resource to 100Mi
    kubectl edit deployment -n kubeflow training-operator
    '''
    resources:
    limits:
      cpu: 100m
      memory: 100Mi #Edit Here
    '''
    kubectl rollout restart deployment -n kubeflow training-operator
    

Leave a comment