Install on a Kubernetes Cluster

Here we provide instructions for installing and configuring dask-gateway-server on a Kubernetes cluster.
Architecture

When running on Kubernetes, Dask Gateway is composed of the following components:

- Multiple active Dask clusters (potentially more than one per user)
- A Traefik proxy, which proxies both the connections between user clients and their respective schedulers, and the Dask Web UI for each cluster
- A Gateway API server, which handles user API requests
- A Gateway controller, which manages the kubernetes objects (e.g. pods, secrets, etc.) used by each cluster

Both the Traefik Proxy deployment and the Gateway API Server deployment can be scaled to multiple replicas for increased availability and scalability.

The Dask Gateway pods running on Kubernetes include the following:

- api: the Gateway API server
- traefik: the Traefik proxy
- controller: the Kubernetes Gateway controller, managing the Dask-Gateway resources
- scheduler and worker: the users' Dask schedulers and workers
Network communication takes place in the following ways:

- traefik pods proxy connections to the api pods on port 8000, and to the scheduler pods on ports 8786 and 8787.
- api pods send api requests to the scheduler pods on port 8788.
- If JupyterHub authentication is used, the api pod sends requests to the JupyterHub server to authenticate users. Depending on configuration, requests are sent to the JupyterHub pods either directly via a service lookup or through the JupyterHub proxy.
- worker pods communicate with the scheduler on port 8786.
- traefik pods proxy worker communication on port 8787 (for the dashboard).
- worker pods listen for incoming communication on a random high port that the scheduler connects back to. worker pods also communicate with each other over these random high ports.
- scheduler pods send heartbeat requests to the api server pods via the api service DNS name on port 8000.
- The controller pod only communicates with the Kubernetes API and receives no inbound traffic.
Create a Kubernetes Cluster (optional)

If you don't already have a running cluster, you'll need to create one. There are plenty of guides online for how to do this. We recommend following the documentation provided by zero-to-jupyterhub-k8s.
Install Helm

If you don't already have Helm installed, you'll need to install it locally. As above, plenty of instructional material is available online. We recommend following the guide provided by zero-to-jupyterhub-k8s.
Install the Dask-Gateway Helm chart

At this point you should have a Kubernetes cluster up and running. Now you can install the Dask-Gateway Helm chart on it.
Configuration

The Helm chart provides access to configure most aspects of dask-gateway-server. These values are provided via a YAML configuration file (the name of this file doesn't matter, we'll use config.yaml).

The Helm chart exposes many configuration values; see the default values.yaml file for more information.
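For example, a minimal config.yaml that overrides the scheduler/worker image might look like the following (the image name and tag here are illustrative placeholders):

```yaml
gateway:
  backend:
    image:
      name: my-registry.example.com/my-dask-image
      tag: "2024.1.0"
```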
Install the Helm Chart

To install the Dask-Gateway Helm chart, run the following command:
RELEASE=dask-gateway
NAMESPACE=dask-gateway
helm upgrade $RELEASE dask-gateway \
--repo=https://helm.dask.org \
--install \
--namespace $NAMESPACE \
--values path/to/your/config.yaml
where:

- RELEASE is the Helm release name to use (we suggest dask-gateway, but any release name is fine).
- NAMESPACE is the Kubernetes namespace to install the gateway into (we suggest dask-gateway, but any namespace is fine).
- path/to/your/config.yaml is the path to the config.yaml file you created above.
Running this command may take some time, as resources are created and images are downloaded. When everything is ready, running the following command will show the EXTERNAL-IP address of the LoadBalancer service (highlighted below).
kubectl get service --namespace dask-gateway
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
api-<RELEASE>-dask-gateway ClusterIP 10.51.245.233 <none> 8000/TCP 6m54s
traefik-<RELEASE>-dask-gateway LoadBalancer 10.51.247.160 146.148.58.187 80:30304/TCP 6m54s
You can also check that the daskcluster CRD was installed successfully:
kubectl get daskcluster -o yaml
apiVersion: v1
items: []
kind: List
metadata:
resourceVersion: ""
selfLink: ""
At this point, you have a fully running dask-gateway-server.
Connecting to the gateway

To connect to the running dask-gateway-server, you'll need the external IP address of the traefik-* service above. The Traefik service serves API requests, proxies the Dask dashboards, and proxies TCP traffic between Dask clients and their schedulers. (You can also optionally have Traefik handle scheduler traffic on a separate port; see the Helm chart reference.)

To connect, create a dask_gateway.Gateway object with the gateway address (if scheduler traffic is served on a separate port, that second traefik-* port goes under proxy_address). Using the same values as above:
from dask_gateway import Gateway
gateway = Gateway(
"http://146.148.58.187",
)
You should now be able to make API calls with the gateway client. To verify this, call dask_gateway.Gateway.list_clusters(). Since you don't have any clusters running yet, this should return an empty list.
gateway.list_clusters()
Additional configuration

Here we provide a few configuration snippets for common deployment scenarios. See the Helm chart reference for all available configuration fields.

Using a custom image

By default, the schedulers/workers started by dask-gateway use the daskgateway/dask-gateway image. This is a basic image with only the minimal dependencies installed. To use a custom image, you can configure:

- gateway.backend.image.name: the default image name
- gateway.backend.image.tag: the default image tag

For an image to work with dask-gateway, it must have a compatible version of dask-gateway installed (we recommend always using the same version as that deployed in the dask-gateway-server).

We also recommend using an init process in your images. This isn't strictly required, but running without an init process can lead to odd worker behavior. We recommend tini, but any init process should be fine.

Beyond that, there are no image requirements; any image that meets the above should work fine. You may install any additional libraries or dependencies you need.

We encourage you to maintain your own images for the scheduler and worker pods, as this project only provides a minimal image intended for testing purposes.
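As a sketch of these requirements, a custom image might be built from a Dockerfile like the one below (the base image, version pin, and extra packages are placeholders to adapt):

```dockerfile
# Placeholder base image -- any image with a working Python installation is fine.
FROM python:3.11-slim

# Install tini to use as the init process.
RUN apt-get update \
    && apt-get install -y --no-install-recommends tini \
    && rm -rf /var/lib/apt/lists/*

# Install dask-gateway, pinned to the same version as your dask-gateway-server
# deployment, plus any extra libraries your workloads need.
RUN pip install --no-cache-dir dask-gateway==2024.1.0 numpy pandas

ENTRYPOINT ["tini", "-g", "--"]
```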
Using extraPodConfig/extraContainerConfig

The Kubernetes API is powerful, and not every configuration field you might want to set on the scheduler/worker pods is directly exposed by the Helm chart. To work around this, we provide a few fields for forwarding configuration directly to the underlying kubernetes objects:

- gateway.backend.scheduler.extraPodConfig
- gateway.backend.scheduler.extraContainerConfig
- gateway.backend.worker.extraPodConfig
- gateway.backend.worker.extraContainerConfig

These fields let you configure any unexposed fields on the scheduler and worker pods/containers respectively. Each takes a mapping of key-value pairs, which is deep-merged with any settings dask-gateway itself sets (with preference given to the extra*Config values). Note that keys should use camelCase (rather than snake_case) to match the naming conventions of the kubernetes API.

This is useful for setting things like tolerations or node affinities on the scheduler or worker pods. For example, here we configure a node anti-affinity for the scheduler pods to avoid preemptible nodes:
gateway:
backend:
scheduler:
extraPodConfig:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-preemptible
operator: DoesNotExist
See the Kubernetes documentation for information on allowed fields.
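The deep-merge behavior described above can be illustrated with a small sketch (this is illustrative Python, not the controller's actual implementation, and the pod-spec values are placeholders):

```python
def deep_merge(base, extra):
    """Recursively merge two mappings, with `extra` winning on conflicts."""
    merged = dict(base)
    for key, value in extra.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Settings dask-gateway might generate for a scheduler pod spec (illustrative):
generated = {"spec": {"restartPolicy": "OnFailure", "tolerations": []}}
# User-provided extraPodConfig:
extra = {"spec": {"priorityClassName": "high", "restartPolicy": "Never"}}

print(deep_merge(generated, extra))
# → {'spec': {'restartPolicy': 'Never', 'tolerations': [], 'priorityClassName': 'high'}}
```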
Using extraConfig

Not all configuration options have been exposed via the helm chart. To set unexposed options, you can use the gateway.extraConfig field. This field takes either:

- A single python code block (as a string), appended to the end of the generated dask_gateway_config.py file.
- A map of key -> code block (recommended). When applied in this form, the code blocks are appended in alphabetical order by key (the keys themselves are meaningless). This allows merging multiple values.yaml files, since Helm can natively merge maps.
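The map form is conceptually equivalent to sorting the blocks by key and concatenating them, as in this sketch of the semantics (the keys and config snippets are placeholders, and this is not the chart's actual template code):

```python
# Illustrative extraConfig map; the keys and code snippets are placeholders.
extra_config = {
    "20-clusteroptions": "c.Backend.cluster_options = ...",
    "10-logging": "c.DaskGateway.log_level = 'DEBUG'",
}

# Blocks are appended in alphabetical order of their keys; the keys carry no
# meaning beyond this ordering.
rendered = "\n".join(block for _, block in sorted(extra_config.items()))
print(rendered)
```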
For example, here we use gateway.extraConfig to set c.Backend.cluster_options, exposing options for worker resources and image (see Exposing Cluster Options for more information):
gateway:
extraConfig:
# Note that the key name here doesn't matter. Values in the
# `extraConfig` map are concatenated, sorted by key name.
clusteroptions: |
from dask_gateway_server.options import Options, Integer, Float, String
def option_handler(options):
return {
"worker_cores": options.worker_cores,
"worker_memory": "%fG" % options.worker_memory,
"image": options.image,
}
c.Backend.cluster_options = Options(
Integer("worker_cores", 2, min=1, max=4, label="Worker Cores"),
Float("worker_memory", 4, min=1, max=8, label="Worker Memory (GiB)"),
String("image", default="daskgateway/dask-gateway:latest", label="Image"),
handler=option_handler,
)
See the Configuration Reference for information on all available configuration options (in particular the KubeClusterConfig section).
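The option_handler in the example above is plain Python, so its mapping logic can be exercised standalone by stubbing the options object (a sketch reusing the example's field names):

```python
from types import SimpleNamespace

def option_handler(options):
    # Same mapping as the extraConfig example: user-facing options become
    # overrides on the cluster's configuration.
    return {
        "worker_cores": options.worker_cores,
        "worker_memory": "%fG" % options.worker_memory,
        "image": options.image,
    }

# Stub standing in for the validated options object dask-gateway would pass in.
opts = SimpleNamespace(worker_cores=2, worker_memory=4.0,
                       image="daskgateway/dask-gateway:latest")
print(option_handler(opts))
# → {'worker_cores': 2, 'worker_memory': '4.000000G', 'image': 'daskgateway/dask-gateway:latest'}
```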
Authenticating with JupyterHub

JupyterHub provides a multi-user interactive notebook environment. Through the zero-to-jupyterhub-k8s project, many companies and institutions have set up JupyterHub to run on Kubernetes. When deploying Dask-Gateway alongside JupyterHub, you can configure Dask-Gateway to use JupyterHub for authentication.

Configuring this is more straightforward if the dask-gateway chart and the jupyterhub chart are installed in the same namespace, for two reasons. First, the JupyterHub chart generates api tokens for registered services and stores them in a k8s Secret that dask-gateway can make use of. Second, the dask-gateway pods/containers can automatically detect the k8s Service among the JupyterHub chart's resources.

This is the recommended way to configure things if dask-gateway is installed in the same namespace as jupyterhub:
# jupyterhub chart configuration
hub:
services:
dask-gateway:
display: false
Note

The display property hides dask-gateway from the "Services" dropdown on the JupyterHub home page, since dask-gateway doesn't provide any UI.
# dask-gateway chart configuration
gateway:
auth:
type: jupyterhub
Note

This configuration relies on the default values of gateway.auth.jupyterhub.apiTokenFromSecretName and gateway.auth.jupyterhub.apiTokenFromSecretKey in the dask-gateway chart, which you can inspect in the default values.yaml file.

These are the recommended configuration steps if dask-gateway is not installed in the same namespace as jupyterhub.
First generate an api token to use, for example with openssl:
openssl rand -hex 32
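If openssl isn't available, Python's standard-library secrets module produces an equivalent token:

```python
import secrets

# 32 random bytes, hex-encoded: a 64-character token,
# equivalent to `openssl rand -hex 32`.
token = secrets.token_hex(32)
print(token)
```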
With the api token generated, your configuration should look like the following, where <API URL> should be something like https://<JUPYTERHUB-HOST>:<JUPYTERHUB-PORT>/hub/api, and <API TOKEN> is the api token you generated.
# jupyterhub chart configuration
hub:
services:
dask-gateway:
apiToken: "<API TOKEN>"
display: false
# dask-gateway chart configuration
gateway:
auth:
type: jupyterhub
jupyterhub:
apiToken: "<API TOKEN>"
apiUrl: "<API URL>"
With JupyterHub authentication configured, it can be used to authenticate requests between dask-gateway client users and the dask-gateway server running in the api-dask-gateway pod.

Dask-Gateway client users should pass auth="jupyterhub" when creating their dask_gateway.Gateway objects, or provide configuration making the dask-gateway client authenticate via JupyterHub.
from dask_gateway import Gateway
gateway = Gateway(
"http://146.148.58.187",
auth="jupyterhub",
)
Opting out of installing Traefik

If Traefik is already installed in your cluster, you can opt out of installing it again. This is done by setting traefik.installTraefik to false in your values.yaml file:
traefik:
installTraefik: false
This prevents the Traefik service from being installed when the helm chart runs. When running helm, you may also want to skip installing Traefik's CRDs, which can be done by passing the --skip-crds flag to the helm command. However, that also skips the daskclusters CRD, which then needs to be installed manually.

Replace 2024.1.0 with the dask-gateway version you are using:
kubectl apply \
-f https://raw.githubusercontent.com/dask/dask-gateway/2024.1.0/resources/helm/dask-gateway/crds/daskclusters.yaml
Helm chart reference

The full default values.yaml file for the dask-gateway Helm chart is included here for reference:
## Provide a name to partially substitute for the full names of resources (will maintain the release name)
##
nameOverride: ""
## Provide a name to substitute for the full names of resources
##
fullnameOverride: ""
# gateway nested config relates to the api Pod and the dask-gateway-server
# running within it, the k8s Service exposing it, as well as the schedulers
# (gateway.backend.scheduler) and workers (gateway.backend.worker) created by the
# controller when a DaskCluster k8s resource is registered.
gateway:
# Number of instances of the gateway-server to run
replicas: 1
# Annotations to apply to the gateway-server pods.
annotations: {}
# Resource requests/limits for the gateway-server pod.
resources: {}
# Path prefix to serve dask-gateway api requests under
# This prefix will be added to all routes the gateway manages
# in the traefik proxy.
prefix: /
# The gateway server log level
loglevel: INFO
# The image to use for the dask-gateway-server pod (api pod)
image:
name: ghcr.io/dask/dask-gateway-server
tag: "set-by-chartpress"
pullPolicy:
# Add additional environment variables to the gateway pod
# e.g.
# env:
# - name: MYENV
# value: "my value"
env: []
# Image pull secrets for gateway-server pod
imagePullSecrets: []
# Configuration for the gateway-server service
service:
annotations: {}
auth:
# The auth type to use. One of {simple, kerberos, jupyterhub, custom}.
type: simple
simple:
# A shared password to use for all users.
password:
kerberos:
# Path to the HTTP keytab for this node.
keytab:
jupyterhub:
# A JupyterHub api token for dask-gateway to use. See
# https://gateway.dask.org.cn/install-kube.html#authenticating-with-jupyterhub.
apiToken:
# The JupyterHub Helm chart will automatically generate a token for a
# registered service. If you don't specify an apiToken explicitly as
# required in dask-gateway version <=2022.6.1, the dask-gateway Helm chart
# will try to look for a token from a k8s Secret created by the JupyterHub
# Helm chart in the same namespace. A failure to find this k8s Secret and
# key will cause a MountFailure when the api-dask-gateway pod is
# starting.
apiTokenFromSecretName: hub
apiTokenFromSecretKey: hub.services.dask-gateway.apiToken
# JupyterHub's api url. Inferred from JupyterHub's service name if running
# in the same namespace.
apiUrl:
custom:
# The full authenticator class name.
class:
# Configuration fields to set on the authenticator class.
config: {}
livenessProbe:
# Enables the livenessProbe.
enabled: true
# Configures the livenessProbe.
initialDelaySeconds: 5
timeoutSeconds: 2
periodSeconds: 10
failureThreshold: 6
readinessProbe:
# Enables the readinessProbe.
enabled: true
# Configures the readinessProbe.
initialDelaySeconds: 5
timeoutSeconds: 2
periodSeconds: 10
failureThreshold: 3
# nodeSelector, affinity, and tolerations for the `api` pod running dask-gateway-server
nodeSelector: {}
affinity: {}
tolerations: []
# Any extra configuration code to append to the generated `dask_gateway_config.py`
# file. Can be either a single code-block, or a map of key -> code-block
# (code-blocks are run in alphabetical order by key, the key value itself is
# meaningless). The map version is useful as it supports merging multiple
# `values.yaml` files, but is unnecessary in other cases.
extraConfig: {}
# backend nested configuration relates to the scheduler and worker resources
# created for DaskCluster k8s resources by the controller.
backend:
# The image to use for both schedulers and workers.
image:
name: ghcr.io/dask/dask-gateway
tag: "set-by-chartpress"
pullPolicy:
# Image pull secrets for a dask cluster's scheduler and worker pods
imagePullSecrets: []
# The namespace to launch dask clusters in. If not specified, defaults to
# the same namespace the gateway is running in.
namespace:
# A mapping of environment variables to set for both schedulers and workers.
environment: {}
scheduler:
# Any extra configuration for the scheduler pod. Sets
# `c.KubeClusterConfig.scheduler_extra_pod_config`.
extraPodConfig: {}
# Any extra configuration for the scheduler container.
# Sets `c.KubeClusterConfig.scheduler_extra_container_config`.
extraContainerConfig: {}
# Cores request/limit for the scheduler.
cores:
request:
limit:
# Memory request/limit for the scheduler.
memory:
request:
limit:
worker:
# Any extra configuration for the worker pod. Sets
# `c.KubeClusterConfig.worker_extra_pod_config`.
extraPodConfig: {}
# Any extra configuration for the worker container. Sets
# `c.KubeClusterConfig.worker_extra_container_config`.
extraContainerConfig: {}
# Cores request/limit for each worker.
cores:
request:
limit:
# Memory request/limit for each worker.
memory:
request:
limit:
# Number of threads available for a worker. Sets
# `c.KubeClusterConfig.worker_threads`
threads:
# controller nested config relates to the controller Pod and the
# dask-gateway-server running within it that makes things happen when changes to
# DaskCluster k8s resources are observed.
controller:
# Whether the controller should be deployed. Disabling the controller allows
# running it locally for development/debugging purposes.
enabled: true
# Any annotations to add to the controller pod
annotations: {}
# Resource requests/limits for the controller pod
resources: {}
# Image pull secrets for controller pod
imagePullSecrets: []
# The controller log level
loglevel: INFO
# Max time (in seconds) to keep around records of completed clusters.
# Default is 24 hours.
completedClusterMaxAge: 86400
# Time (in seconds) between cleanup tasks removing records of completed
# clusters. Default is 5 minutes.
completedClusterCleanupPeriod: 600
# Base delay (in seconds) for backoff when retrying after failures.
backoffBaseDelay: 0.1
# Max delay (in seconds) for backoff when retrying after failures.
backoffMaxDelay: 300
# Limit on the average number of k8s api calls per second.
k8sApiRateLimit: 50
# Limit on the maximum number of k8s api calls per second.
k8sApiRateLimitBurst: 100
# The image to use for the controller pod.
image:
name: ghcr.io/dask/dask-gateway-server
tag: "set-by-chartpress"
pullPolicy:
# Settings for nodeSelector, affinity, and tolerations for the controller pods
nodeSelector: {}
affinity: {}
tolerations: []
# traefik nested config relates to the traefik Pod and Traefik running within it
# that is acting as a proxy for traffic towards the gateway or user created
# DaskCluster resources.
traefik:
# If traefik is already installed in the cluster, we do not need to install traefik
# To not install CRDs use --skip-crds flag with helm install, the daskclusters crd then
# needs to be installed manually.
# `kubectl apply -f https://raw.githubusercontent.com/dask/dask-gateway/main/resources/helm/dask-gateway/crds/daskclusters.yaml`
installTraefik: true
# Number of instances of the proxy to run
replicas: 1
# Any annotations to add to the proxy pods
annotations: {}
# Resource requests/limits for the proxy pods
resources: {}
# The image to use for the proxy pod
image:
name: docker.io/traefik
tag: "3.3.5"
pullPolicy:
imagePullSecrets: []
# Any additional arguments to forward to traefik
additionalArguments: []
# The proxy log level
loglevel: WARN
# Whether to expose the dashboard on port 9000 (enable for debugging only!)
dashboard: false
# Additional configuration for the traefik service
service:
type: LoadBalancer
annotations: {}
spec: {}
ports:
web:
# The port HTTP(s) requests will be served on
port: 80
nodePort:
tcp:
# The port TCP requests will be served on. Set to `web` to share the
# web service port
port: web
nodePort:
# Settings for nodeSelector, affinity, and tolerations for the traefik pods
nodeSelector: {}
affinity: {}
tolerations: []
# rbac nested configuration relates to the choice of creating or replacing
# resources like (Cluster)Role, (Cluster)RoleBinding, and ServiceAccount.
rbac:
# Whether to enable RBAC.
enabled: true
# Existing names to use if ClusterRoles, ClusterRoleBindings, and
# ServiceAccounts have already been created by other means (leave set to
# `null` to create all required roles at install time)
controller:
serviceAccountName:
gateway:
serviceAccountName:
traefik:
serviceAccountName:
# global nested configuration is accessible by all Helm charts that may depend
# on each other, but not used by this Helm chart. An entry is created here to
# validate its use and catch YAML typos via this configuration's associated JSON
# schema.
global: {}