Install on a Kubernetes Cluster

Here we provide instructions for installing and configuring dask-gateway-server on a Kubernetes cluster.

Architecture

When running on Kubernetes, Dask Gateway is composed of the following components:

  • Multiple active Dask Clusters (potentially more than one per user)

  • A Traefik Proxy to proxy both the connections between users' clients and their respective schedulers, and the Dask Web UI for each cluster

  • A Gateway API Server that handles user API requests

  • A Gateway Controller that manages the kubernetes objects (e.g. pods, secrets, etc.) used by each cluster.

(Figure: Dask-Gateway high-level kubernetes architecture)

Both the Traefik Proxy deployment and the Gateway API Server deployment can be scaled out to multiple replicas for increased availability and scalability.

The Dask Gateway pods running on Kubernetes include the following:

  • api: the Gateway API server

  • traefik: the Traefik proxy

  • controller: the Kubernetes Gateway controller, which manages Dask-Gateway resources

  • scheduler, worker: the users' Dask schedulers and workers
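
Once the chart is installed (see the install steps below), you can list these pods with kubectl; this assumes the dask-gateway namespace used later in this guide, and note that scheduler/worker pods only appear while clusters are running:

kubectl get pods --namespace dask-gateway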

Network communication happens in the following ways:

  • The traefik pods proxy connections to the api pods on port 8000, and to the scheduler pods on ports 8786 and 8787.

  • The api pods send api requests to the scheduler pods on port 8788.

  • If JupyterHub authentication is used, the api pods send requests to the JupyterHub server to authenticate users.

  • Depending on configuration, requests are sent to the JupyterHub pods either directly via a service lookup or through the JupyterHub proxy.

  • The worker pods communicate with the scheduler on port 8786.

  • The traefik pods proxy worker communication on port 8787 (for the dashboards).

  • The worker pods listen for incoming communication on a random high port that the scheduler connects back to.

  • The worker pods also communicate with each other on these random high ports.

  • The scheduler pods send heartbeat requests to the api server pods using the api service DNS name on port 8000.

  • The controller pod only communicates with the Kubernetes API and receives no inbound traffic.

Create a Kubernetes Cluster (optional)

If you don't already have a running cluster, you'll need to create one. There are plenty of guides online for how to do this. We recommend following the guide provided by zero-to-jupyterhub-k8s.
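
As a hypothetical starting point (not part of the guide above), a throwaway local cluster can be created with minikube, or a managed cluster with your cloud provider's CLI:

# Local test cluster (assumes minikube is installed)
minikube start

# Or a managed GKE cluster (assumes the gcloud CLI is installed and configured)
gcloud container clusters create my-dask-gateway-cluster --num-nodes=3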

Install Helm

If you don't already have Helm installed, you'll need to install it locally. As above, there are plenty of instructional materials online for doing this. We recommend following the guide provided by zero-to-jupyterhub-k8s.
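
For example, the Helm maintainers publish an installer script that fetches and installs the latest Helm release (see the Helm documentation for other installation methods):

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh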

Install the Dask-Gateway Helm chart

At this point you should have a running Kubernetes cluster. You can now install the Dask-Gateway Helm chart on that cluster.

Configuration

The Helm chart provides access for configuring most aspects of the dask-gateway-server. These are provided via a YAML configuration file (the name of this file doesn't matter; we'll use config.yaml).

The Helm chart exposes many configuration values; see the default values.yaml file for more information.
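
As a minimal illustrative sketch (every value here is an assumption, not a requirement), a config.yaml might set a shared password for the default simple authenticator and pin the scheduler/worker image tag:

# config.yaml -- illustrative values only
gateway:
  auth:
    type: simple
    simple:
      password: "change-me"            # shared password for all users
  backend:
    image:
      name: ghcr.io/dask/dask-gateway
      tag: "2024.1.0"                  # pin to your dask-gateway version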

Install the Helm Chart

To install the Dask-Gateway Helm chart, run the following command:

RELEASE=dask-gateway
NAMESPACE=dask-gateway

helm upgrade $RELEASE dask-gateway \
    --repo=https://helm.dask.org \
    --install \
    --namespace $NAMESPACE \
    --values path/to/your/config.yaml

where:

  • RELEASE is the Helm release name to use (we suggest dask-gateway, but any release name is fine).

  • NAMESPACE is the Kubernetes namespace to install the gateway into (we suggest dask-gateway, but any namespace is fine).

  • path/to/your/config.yaml is the path to the config.yaml file you created above.

Running this command may take some time, as resources are created and images are downloaded. Once everything is ready, running the following command will show the EXTERNAL-IP address for the LoadBalancer service (highlighted below).

kubectl get service --namespace dask-gateway

NAME                              TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE
api-<RELEASE>-dask-gateway        ClusterIP      10.51.245.233   <none>           8000/TCP         6m54s
traefik-<RELEASE>-dask-gateway    LoadBalancer   10.51.247.160   146.148.58.187   80:30304/TCP     6m54s
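
If your cluster lacks LoadBalancer support (e.g. a local test cluster), the EXTERNAL-IP may remain <pending>. As a workaround, you can forward the traefik service's web port to your machine and use http://localhost:8000 as the gateway address:

# Forward local port 8000 to the traefik service's web port (80)
kubectl port-forward --namespace dask-gateway service/traefik-<RELEASE>-dask-gateway 8000:80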

You can also check that the daskcluster CRD was installed successfully:

kubectl get daskcluster -o yaml

apiVersion: v1
items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

At this point, you have a fully running dask-gateway-server.

Connecting to the gateway

To connect to the running dask-gateway-server, you'll need the external IP address of the traefik-* service shown above. The Traefik service provides access for api requests, proxies the Dask dashboards, and proxies TCP traffic between Dask clients and schedulers. (You can also choose to have Traefik handle scheduler traffic on a separate port; see the Helm chart reference.)

To connect, create a dask_gateway.Gateway object, specifying both addresses (if using separate ports, the second traefik-* port should go in under proxy_address). Using the same values as above:

from dask_gateway import Gateway
gateway = Gateway(
    "http://146.148.58.187",
)
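
If you instead configured traefik to serve scheduler traffic on a separate port (see the Helm chart reference), a sketch of the two-address form might look like the following; the port number is an assumption standing in for whatever traefik.service.ports you configured:

from dask_gateway import Gateway
gateway = Gateway(
    "http://146.148.58.187",
    proxy_address=8786,  # assumed separate TCP port; an int reuses the host from the first address
)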

You should now be able to make API calls using the gateway client. To verify this, call dask_gateway.Gateway.list_clusters(). Since you have no clusters running yet, this should return an empty list.

gateway.list_clusters()

Shutting everything down

When you're done with the gateway, you'll want to delete your deployment and clean everything up. You can do this with the helm delete command:

helm delete $RELEASE
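
Note that helm delete uses the namespace from your current kubectl context. If you installed into a dedicated namespace as in the install step above, target it explicitly, and optionally remove the namespace itself afterwards:

helm delete $RELEASE --namespace $NAMESPACE
kubectl delete namespace $NAMESPACE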

Additional configuration

Here we provide configuration snippets for a few common deployment scenarios. For information on all available configuration fields, see the Helm chart reference.

Using a custom image

By default, the schedulers/workers started by dask-gateway use the ghcr.io/dask/dask-gateway image (as set in the default values.yaml). This is a basic image with only the minimal dependencies installed. To use a custom image, you can configure the following (an example follows the list):

  • gateway.backend.image.name: the default image name

  • gateway.backend.image.tag: the default image tag
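
For example, a values snippet pointing the backend at a custom image (the image name and tag below are hypothetical placeholders):

gateway:
  backend:
    image:
      name: registry.example.com/my-org/my-dask-image  # hypothetical custom image
      tag: "2024.1.0"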

For an image to work with dask-gateway, it must have a compatible version of dask-gateway installed (we recommend always using the same version as deployed in the dask-gateway-server).

We also recommend having an init process in your images. This isn't strictly required, but running without an init process may lead to odd worker behaviors. We recommend tini, but any init process should work fine.

There are no other requirements for the images; anything that meets the above should work fine. You can install whatever additional libraries or dependencies you need.

We encourage you to maintain your own images for the scheduler and worker pods, as this project only provides a minimal image intended for testing purposes.

Using extraPodConfig/extraContainerConfig

The Kubernetes API is powerful, but not every configuration field you might want to set on the scheduler/worker pods is directly exposed by the Helm chart. To work around this, we provide a few fields for forwarding configuration directly to the underlying kubernetes objects:

  • gateway.backend.scheduler.extraPodConfig

  • gateway.backend.scheduler.extraContainerConfig

  • gateway.backend.worker.extraPodConfig

  • gateway.backend.worker.extraContainerConfig

These fields let you configure any unexposed fields on the scheduler and worker pods/containers, respectively. Each takes a mapping of key-value pairs, which is deep-merged with any settings dask-gateway itself sets (with preference given to the extra*Config values). Note that keys should be camelCase (rather than snake_case) to match the naming conventions of the kubernetes API.

This is useful, for example, for setting things like tolerations or node affinities on the scheduler or worker pods. Here we configure a node anti-affinity for the scheduler pods, to avoid preemptible nodes:

gateway:
  backend:
    scheduler:
      extraPodConfig:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                  - key: cloud.google.com/gke-preemptible
                    operator: DoesNotExist

For information on the allowed fields, see the Kubernetes documentation.

Using extraConfig

Not all configuration options have been exposed via the helm chart. To set unexposed options, you can use the gateway.extraConfig field. This takes either:

  • A single Python code-block (as a string), appended to the end of the generated dask_gateway_config.py file.

  • A map of key -> code-block (recommended). When applied in this form, the code-blocks are appended in alphabetical order by key (the key names themselves are meaningless). This allows merging multiple values.yaml files, since Helm can natively merge maps.

For example, here we use gateway.extraConfig to set c.Backend.cluster_options, exposing options for worker resources and image (see Exposing Cluster Options for more information).

gateway:
  extraConfig:
    # Note that the key name here doesn't matter. Values in the
    # `extraConfig` map are concatenated, sorted by key name.
    clusteroptions: |
        from dask_gateway_server.options import Options, Integer, Float, String

        def option_handler(options):
            return {
                "worker_cores": options.worker_cores,
                "worker_memory": "%fG" % options.worker_memory,
                "image": options.image,
            }

        c.Backend.cluster_options = Options(
            Integer("worker_cores", 2, min=1, max=4, label="Worker Cores"),
            Float("worker_memory", 4, min=1, max=8, label="Worker Memory (GiB)"),
            String("image", default="daskgateway/dask-gateway:latest", label="Image"),
            handler=option_handler,
        )
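
On the client side, options exposed this way can then be set when creating a cluster. A brief sketch, reusing the gateway object from earlier:

options = gateway.cluster_options()
options.worker_cores = 4      # must fall within the 1-4 range defined above
options.worker_memory = 8     # GiB, within the 1-8 range defined above
cluster = gateway.new_cluster(options)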

For information on all available configuration options, see the configuration reference (and the KubeClusterConfig section in particular).

Authenticating with JupyterHub

JupyterHub provides a multi-user interactive notebook environment. Through the zero-to-jupyterhub-k8s project, many companies and institutions have set JupyterHub up to run on Kubernetes. When deploying Dask-Gateway alongside JupyterHub, you can configure Dask-Gateway to use JupyterHub for authentication.

If the dask-gateway chart and the jupyterhub chart are installed in the same namespace, configuring them is more straightforward for two reasons. First, the JupyterHub chart generates api tokens for registered services and stores them in a k8s Secret that dask-gateway can make use of. Second, the dask-gateway pods/containers can automatically detect the k8s Service among the JupyterHub chart's resources.

This is the recommended way to configure things if dask-gateway is installed in the same namespace as jupyterhub.

# jupyterhub chart configuration
hub:
  services:
    dask-gateway:
      display: false

Note

The display property hides dask-gateway from the "Services" dropdown menu on the JupyterHub home page, since dask-gateway doesn't provide any UI.

# dask-gateway chart configuration
gateway:
  auth:
    type: jupyterhub

Note

This configuration relies on the default values of gateway.auth.jupyterhub.apiTokenFromSecretName and gateway.auth.jupyterhub.apiTokenFromSecretKey in the dask-gateway chart, which you can inspect in the default values.yaml file.

If dask-gateway is not installed in the same namespace as jupyterhub, the following configuration steps are recommended.

Start by generating an api token to use, for example with openssl:

openssl rand -hex 32

With an api token generated, your configuration should look like the following, where <API URL> should look something like https://<JUPYTERHUB-HOST>:<JUPYTERHUB-PORT>/hub/api, and <API TOKEN> is the api token you generated.

# jupyterhub chart configuration
hub:
  services:
    dask-gateway:
      apiToken: "<API TOKEN>"
      display: false
# dask-gateway chart configuration
gateway:
  auth:
    type: jupyterhub
    jupyterhub:
      apiToken: "<API TOKEN>"
      apiUrl: "<API URL>"

With JupyterHub authentication configured, it can be used to authenticate requests between dask-gateway client users and the dask-gateway server running in the api-dask-gateway pod.

Dask-Gateway client users should pass auth="jupyterhub" when creating a dask_gateway.Gateway object, or provide configuration that makes the dask-gateway client authenticate via JupyterHub.

from dask_gateway import Gateway
gateway = Gateway(
    "http://146.148.58.187",
    auth="jupyterhub",
)
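
Alternatively, the same settings can be provided via the dask configuration system so users don't have to pass arguments explicitly; a sketch of e.g. ~/.config/dask/gateway.yaml (the address is an assumption):

# ~/.config/dask/gateway.yaml -- illustrative
gateway:
  address: "http://146.148.58.187"
  auth:
    type: jupyterhub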

Opting out of installing Traefik

If Traefik is already installed in your cluster, you can opt out of installing it again. This is done by setting traefik.installTraefik to false in your values.yaml file.

traefik:
  installTraefik: false

This prevents the Traefik service from being installed when the helm chart is run. You may also want to skip installing Traefik's CRDs when running helm, which can be done by passing the --skip-crds flag to the helm command. Note, however, that this also prevents the daskclusters CRD from being installed, so that CRD then needs to be installed manually:

Replace 2024.1.0 with the dask-gateway version you are using.

kubectl apply \
  -f https://raw.githubusercontent.com/dask/dask-gateway/2024.1.0/resources/helm/dask-gateway/crds/daskclusters.yaml
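
Afterwards you can verify the CRD is present (the full CRD name below assumes the gateway.dask.org API group used by this chart):

kubectl get crd daskclusters.gateway.dask.org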

Helm chart reference

The full default values.yaml file for the dask-gateway Helm chart is included here for reference:

## Provide a name to partially substitute for the full names of resources (will maintain the release name)
##
nameOverride: ""

## Provide a name to substitute for the full names of resources
##
fullnameOverride: ""

# gateway nested config relates to the api Pod and the dask-gateway-server
# running within it, the k8s Service exposing it, as well as the schedulers
# (gateway.backend.scheduler) and workers (gateway.backend.worker) created by the
# controller when a DaskCluster k8s resource is registered.
gateway:
  # Number of instances of the gateway-server to run
  replicas: 1

  # Annotations to apply to the gateway-server pods.
  annotations: {}

  # Resource requests/limits for the gateway-server pod.
  resources: {}

  # Path prefix to serve dask-gateway api requests under
  # This prefix will be added to all routes the gateway manages
  # in the traefik proxy.
  prefix: /

  # The gateway server log level
  loglevel: INFO

  # The image to use for the dask-gateway-server pod (api pod)
  image:
    name: ghcr.io/dask/dask-gateway-server
    tag: "set-by-chartpress"
    pullPolicy:

  # Add additional environment variables to the gateway pod
  # e.g.
  # env:
  # - name: MYENV
  #   value: "my value"
  env: []

  # Image pull secrets for gateway-server pod
  imagePullSecrets: []

  # Configuration for the gateway-server service
  service:
    annotations: {}

  auth:
    # The auth type to use. One of {simple, kerberos, jupyterhub, custom}.
    type: simple

    simple:
      # A shared password to use for all users.
      password:

    kerberos:
      # Path to the HTTP keytab for this node.
      keytab:

    jupyterhub:
      # A JupyterHub api token for dask-gateway to use. See
      # https://gateway.dask.org.cn/install-kube.html#authenticating-with-jupyterhub.
      apiToken:

      # The JupyterHub Helm chart will automatically generate a token for a
      # registered service. If you don't specify an apiToken explicitly as
      # required in dask-gateway version <=2022.6.1, the dask-gateway Helm chart
      # will try to look for a token from a k8s Secret created by the JupyterHub
      # Helm chart in the same namespace. A failure to find this k8s Secret and
      # key will cause a MountFailure when the api-dask-gateway pod is
      # starting.
      apiTokenFromSecretName: hub
      apiTokenFromSecretKey: hub.services.dask-gateway.apiToken

      # JupyterHub's api url. Inferred from JupyterHub's service name if running
      # in the same namespace.
      apiUrl:

    custom:
      # The full authenticator class name.
      class:

      # Configuration fields to set on the authenticator class.
      config: {}

  livenessProbe:
    # Enables the livenessProbe.
    enabled: true
    # Configures the livenessProbe.
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 6
  readinessProbe:
    # Enables the readinessProbe.
    enabled: true
    # Configures the readinessProbe.
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 3

  # nodeSelector, affinity, and tolerations for the `api` pod running dask-gateway-server
  nodeSelector: {}
  affinity: {}
  tolerations: []

  # Any extra configuration code to append to the generated `dask_gateway_config.py`
  # file. Can be either a single code-block, or a map of key -> code-block
  # (code-blocks are run in alphabetical order by key, the key value itself is
  # meaningless). The map version is useful as it supports merging multiple
  # `values.yaml` files, but is unnecessary in other cases.
  extraConfig: {}

  # backend nested configuration relates to the scheduler and worker resources
  # created for DaskCluster k8s resources by the controller.
  backend:
    # The image to use for both schedulers and workers.
    image:
      name: ghcr.io/dask/dask-gateway
      tag: "set-by-chartpress"
      pullPolicy:

    # Image pull secrets for a dask cluster's scheduler and worker pods
    imagePullSecrets: []

    # The namespace to launch dask clusters in. If not specified, defaults to
    # the same namespace the gateway is running in.
    namespace:

    # A mapping of environment variables to set for both schedulers and workers.
    environment: {}

    scheduler:
      # Any extra configuration for the scheduler pod. Sets
      # `c.KubeClusterConfig.scheduler_extra_pod_config`.
      extraPodConfig: {}

      # Any extra configuration for the scheduler container.
      # Sets `c.KubeClusterConfig.scheduler_extra_container_config`.
      extraContainerConfig: {}

      # Cores request/limit for the scheduler.
      cores:
        request:
        limit:

      # Memory request/limit for the scheduler.
      memory:
        request:
        limit:

    worker:
      # Any extra configuration for the worker pod. Sets
      # `c.KubeClusterConfig.worker_extra_pod_config`.
      extraPodConfig: {}

      # Any extra configuration for the worker container. Sets
      # `c.KubeClusterConfig.worker_extra_container_config`.
      extraContainerConfig: {}

      # Cores request/limit for each worker.
      cores:
        request:
        limit:

      # Memory request/limit for each worker.
      memory:
        request:
        limit:

      # Number of threads available for a worker. Sets
      # `c.KubeClusterConfig.worker_threads`
      threads:


# controller nested config relates to the controller Pod and the
# dask-gateway-server running within it that makes things happen when changes to
# DaskCluster k8s resources are observed.
controller:
  # Whether the controller should be deployed. Disabling the controller allows
  # running it locally for development/debugging purposes.
  enabled: true

  # Any annotations to add to the controller pod
  annotations: {}

  # Resource requests/limits for the controller pod
  resources: {}

  # Image pull secrets for controller pod
  imagePullSecrets: []

  # The controller log level
  loglevel: INFO

  # Max time (in seconds) to keep around records of completed clusters.
  # Default is 24 hours.
  completedClusterMaxAge: 86400

  # Time (in seconds) between cleanup tasks removing records of completed
  # clusters. Default is 5 minutes.
  completedClusterCleanupPeriod: 600

  # Base delay (in seconds) for backoff when retrying after failures.
  backoffBaseDelay: 0.1

  # Max delay (in seconds) for backoff when retrying after failures.
  backoffMaxDelay: 300

  # Limit on the average number of k8s api calls per second.
  k8sApiRateLimit: 50

  # Limit on the maximum number of k8s api calls per second.
  k8sApiRateLimitBurst: 100

  # The image to use for the controller pod.
  image:
    name: ghcr.io/dask/dask-gateway-server
    tag: "set-by-chartpress"
    pullPolicy:

  # Settings for nodeSelector, affinity, and tolerations for the controller pods
  nodeSelector: {}
  affinity: {}
  tolerations: []



# traefik nested config relates to the traefik Pod and Traefik running within it
# that is acting as a proxy for traffic towards the gateway or user created
# DaskCluster resources.
traefik:
  # If traefik is already installed in the cluster, we do not need to install traefik
  # To not install CRDs use --skip-crds flag with helm install, the daskclusters crd then
  # needs to be installed manually.
  # `kubectl apply -f https://raw.githubusercontent.com/dask/dask-gateway/main/resources/helm/dask-gateway/crds/daskclusters.yaml`
  installTraefik: true

  # Number of instances of the proxy to run
  replicas: 1

  # Any annotations to add to the proxy pods
  annotations: {}

  # Resource requests/limits for the proxy pods
  resources: {}

  # The image to use for the proxy pod
  image:
    name: docker.io/traefik
    tag: "3.3.5"
    pullPolicy:
  imagePullSecrets: []

  # Any additional arguments to forward to traefik
  additionalArguments: []

  # The proxy log level
  loglevel: WARN

  # Whether to expose the dashboard on port 9000 (enable for debugging only!)
  dashboard: false

  # Additional configuration for the traefik service
  service:
    type: LoadBalancer
    annotations: {}
    spec: {}
    ports:
      web:
        # The port HTTP(s) requests will be served on
        port: 80
        nodePort:
      tcp:
        # The port TCP requests will be served on. Set to `web` to share the
        # web service port
        port: web
        nodePort:

  # Settings for nodeSelector, affinity, and tolerations for the traefik pods
  nodeSelector: {}
  affinity: {}
  tolerations: []



# rbac nested configuration relates to the choice of creating or replacing
# resources like (Cluster)Role, (Cluster)RoleBinding, and ServiceAccount.
rbac:
  # Whether to enable RBAC.
  enabled: true

  # Existing names to use if ClusterRoles, ClusterRoleBindings, and
  # ServiceAccounts have already been created by other means (leave set to
  # `null` to create all required roles at install time)
  controller:
    serviceAccountName:

  gateway:
    serviceAccountName:

  traefik:
    serviceAccountName:



# global nested configuration is accessible by all Helm charts that may depend
# on each other, but not used by this Helm chart. An entry is created here to
# validate its use and catch YAML typos via this configuration's associated JSON
# schema.
global: {}