如何在阿里云上搭建Kubeflow

Database and Ruby, Python, History


什么是 Kubeflow

Kubeflow 是 Google 开源的一个基于 Kubernetes 的 ML workflow 平台,集成了很多机器学习工具,比如 Jupyterlab,pipeline 等。

安装 KubeFlow

Kubeflow 有一个官方的文档,提示如何在 google 云上安装。但是因为国内的环境,一键安装几乎不太可能。

安装 local-path-provisioner 动态分配 PV

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

并且需要把这个设置为默认存储,这样创建一个 PVC 就会自动创建一个新的 PV。

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations: #添加为默认StorageClass
    storageclass.beta.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

安装 Kubeflow

接下来就是同步镜像。因为镜像是在 gcr.io 上,被墙了,所以需要同步到自己的 CR 上,再更新manifests 文档里面的所用到的镜像地址。我的做法是一步一步地拉起服务,确认没有问题,再拉起下一个服务。

images = [
  'gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:421aa67057240fa0c56ebf2c6e5b482a12842005805c46e067129402d1751220',
  'gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:bfa1dfea77aff6dfa7959f4822d8e61c4f7933053874cd3f27352323e6ecd985',
  'gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:dabaecec38860ca4c972e6821d5dc825549faf50c6feb8feb4c04802f2338b8a',
  'gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:c2994c2b6c2c7f38ad1b85c71789bf1753cc8979926423c83231e62258837cb9',
  'gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:8319aa662b4912e8175018bd7cc90c63838562a27515197b803bdcd5634c7007',
  'gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:98a2cc7fd62ee95e137116504e7166c32c65efef42c3d1454630780410abf943',
  'gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping@sha256:f66c41ad7a73f5d4f4bdfec4294d5459c477f09f3ce52934d1a215e32316b59b',
  'gcr.io/knative-releases/knative.dev/serving/cmd/domain-mapping-webhook@sha256:7368aaddf2be8d8784dc7195f5bc272ecfe49d429697f48de0ddc44f278167aa',
  'gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:4305209ce498caf783f39c8f3e85dfa635ece6947033bf50b0b627983fd65953',
  'gcr.io/ml-pipeline/cache-server:2.0.1',
  'gcr.io/ml-pipeline/metadata-envoy:2.0.1',
  'gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0',
  'gcr.io/ml-pipeline/metadata-writer:2.0.1',
  'gcr.io/ml-pipeline/api-server:2.0.1',
  'gcr.io/ml-pipeline/persistenceagent:2.0.1',
  'gcr.io/ml-pipeline/scheduledworkflow:2.0.1',
  'gcr.io/ml-pipeline/frontend:2.0.1',
  'gcr.io/ml-pipeline/viewer-crd-controller:2.0.1',
  'gcr.io/ml-pipeline/visualization-server:2.0.1',
  'gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance',
  'gcr.io/ml-pipeline/workflow-controller:v3.3.10-license-compliance',
  'gcr.io/ml-pipeline/argoexec:v3.3.10-license-compliance',
  'gcr.io/ml-pipeline/cache-deployer:2.0.1',
  'gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1',
  'gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0'
]