blog

Kubernetes 基于 Namespace 的物理队列实现

Author: ninehills
Labels: blog
Created: 2020-04-10T04:22:00Z
Link and comments: https://github.com/ninehills/blog/issues/77

Kubernetes 基于 Namespace 的物理队列实现

作者：swulling+pub@gmail.com
摘要：Kubernetes 实现基于 Namespace 的物理队列，即Namespace下的Pod和Node的强绑定

0x00 背景

Kuberntes 目前在实际业务部署时，有两个流派：一派推崇小集群，一个或数个业务共享小集群，全公司有数百上千个小集群组成；另一派推崇大集群，每个AZ（可用区）一个或数个大集群，各个业务通过Namespace的方式进行隔离。

两者各有优劣，但是从资源利用率提升和维护成本的角度，大集群的优势更加突出。但同时大集群也带来相当多的安全、可用性、性能的挑战和维护管理成本。

本文属于Kubernetes多租户大集群实践的一部分，用来解决多租户场景下，如何实现传统的物理队列隔离。

物理队列并不是一个通用的业界名词，它来源于一种集群资源管理模型，该模型简化下如下：

逻辑队列（Logical Queue）：逻辑队列是虚拟资源分配的最小单元，将虚拟资源配额（Quota）配置在逻辑队列上（如CPU 200 标准核、内存 800GB等）
- 逻辑队列对应Kubernetes的Namespace概念。参考Resource Quotas
- 不同的逻辑队列之间可以设置Qos优先级，实现优先级调度。参考Limit Priority Class consumption by default可以限制每个Namespace下Pod的优先级选择
- 配额分两种：Requests（提供保障的资源）和Limits（资源的最大限制），其中仅Requests才能算Quota，Limits 由管理员视情况选择
物理队列（Physical Queue）：物理队列对应底层物理机资源，同一台物理机仅能从属于同一个物理队列。物理队列的资源总额就是其下物理机可提供的资源的总和。
- 物理队列当前在Kubernetes下缺乏概念映射
- 逻辑队列和物理队列是多对多绑定的关系，即同一个逻辑队列可以跨多个物理队列。
- 逻辑队列的配额总和 / 物理队列的资源总和 = 全局超售比
租户：租户可以绑定多个逻辑队列，对应关系仅影响往对应的Namespace中部署Pod的权限。

资源结构如图所示：

物理队列和逻辑队列

0x01 原理

物理队列实现：

给节点配置Label和Taint，Label用于选择，Taint用于拒绝非该物理队列的Pod部署。

和Namespace的自动绑定的原理：

配置两个Admission Controller: PodNodeSelector和PodTolerationRestriction，参考Admission Controllers
给Namespace增加默认的NodeSelector和Tolerations策略，并自动应用到该 Namespace 下的全部新增 Pod 上，从而自动将Pod绑定到物理队列上。

0x02 配置

测试集群版本：1.16.4, 1.17.0，使用的测试集群为Kind创建

1.18.0 版本的Kind集群创建有问题，后续进行测试

# this config file contains all config fields with comments
# NOTE: this is not a particularly useful config file
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# patch the generated kubeadm config with some extra settings
kubeadmConfigPatches:
- |
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionHard:
    nodefs.available: "0%"
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  apiServer:
    extraArgs:
      enable-admission-plugins: PodNodeSelector,PodTolerationRestriction
# 1 control plane node and 3 workers
nodes:
# the control plane node config
- role: control-plane
# the three workers
- role: worker
- role: worker
- role: worker

集群开启Admission Controller: PodNodeSelector,PodTolerationRestriction

可以使用 kubeadmin 或 api-server启动参数：

apiServer:
  extraArgs:
    enable-admission-plugins: PodNodeSelector,PodTolerationRestriction

创建 Namespace

apiVersion: v1
kind: Namespace
metadata:
  name: public
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: "node-restriction.kubernetes.io/physical_queue=public-phy"
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Equal", "effect": "NoSchedule", "key": "node-restriction.kubernetes.io/physical_queue", "value": "public-phy"}]'
    # scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[{"operator": "Equal", "effect": "NoSchedule", "key": "node-restriction.kubernetes.io/physical_queue", "value": "public-phy"}]'

此处要点：

文档有问题，toleration 配置是一个list，配置错误在部署时会提示解析JSON错误
tolerationsWhitelist配置后，就算配置有defaultTolerations且相同，也需要在Pod中指定对应的toleration，所以不能配置
NoSchedule 已经足够限制，无需 NoExecute，Node配置的时候同样配置，此处可根据需求进行选择。
物理队列的前缀建议为 node-restriction.kubernetes.io/physical_queue，此处是根据文档的建议，后续可以配合NodeRestriction admission plugin限制kubelet自定配置
目前Namespace尚不能绑定多个物理队列:
- NodeSelector 无法支持in语法，见 0x04
- defaultTolerations 可以配置多个 Torleration

给Node绑定物理队列

$ kubectl label node kind-worker node-restriction.kubernetes.io/physical_queue=public-phy
$ kubectl taint nodes kind-worker node-restriction.kubernetes.io/physical_queue=public-phy:NoSchedule

0x03 测试

测试的Deployment如下

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

验证提交到指定物理队列中的Pod默认增加NodeSelector和Toleration

$ kubectl apply -f nginx_deployment.yaml --namespace public
$ kubectl describe pod nginx-deployment-574b87c764-kb9k7 --namespace public
Node-Selectors:  node-restriction.kubernetes.io/physical_queue=public-phy
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 node-restriction.kubernetes.io/physical_queue=public-phy:NoSchedule

验证Pod是否都配置到一起

kubectl describe node kind-worker

验证和物理队列中指定的NodeSelector冲突的Pod无法提交

$ kubectl delete deployment nginx-deployment --namespace public
# 修改nginx_deployment.yaml ，增加spec.template.spec.nodeSelector
      nodeSelector:
        node-restriction.kubernetes.io/physical_queue: second-phy
# 验证能否部署
$ kubectl apply -f nginx_deployment.yaml --namespace public
# 查看deployments
$ kubectl describe replicaset nginx-deployment-585fcd8d7d --namespace public
  Warning  FailedCreate  49s (x15 over 2m11s)  replicaset-controller  Error creating: pods is forbidden: pod node label selector conflicts with its namespace node label selector

0x04 相关问题

Q: NodeSelector无法使用Set-based语法，导致逻辑队列（NameSpace）无法绑定多个物理队列

后续考虑使用Node Affinity配置节点亲和性。但是目前并没有现成的Adminssion Controller去给Namespace绑定默认的节点亲和性，如有需求需要自己开发。

NodeSelector 和 Toleration 的功能，可以被 Node Affinity 进行替代，且后者提供更高级的调度功能，后续尝试是否基于此进行资源调度的整体设计。

此外Node Affinity还可以实现一个逻辑队列绑定多个物理队列的情况下，配置物理队列的调度权重的功能，即优先部署到某个物理队列。