What is Kubernetes on NVIDIA GPUs?

A product announced at the GTC 2018 keynote.

Kubernetes on NVIDIA GPUs:
https://developer.nvidia.com/kubernetes-gpu

It sounds impressive, but it does not use GPUs to accelerate K8s itself. It is NVIDIA's K8s package with support for scheduling GPU resources.

Main components

Kubernetes (+ Device Plugin)

NVIDIA device plugin for Kubernetes

nvidia-docker2

Prometheus + Grafana

For anyone who has already been working with GPUs on K8s, none of this is actually all that new.

What is added

Scheduling by Compute Capability and GPU memory

GPU monitoring with Prometheus and Grafana

Support for multiple container runtimes (Docker, CRI-O)

Official NVIDIA support for DGX systems

Scheduling by Compute Capability and GPU memory

With stock K8s plus a device plugin, pods can only be scheduled by the number of GPUs. NVIDIA's package extends both K8s and the device plugin so that finer-grained scheduling is possible.

NVIDIA Kubernetes: https://github.com/NVIDIA/kuber…
Based on v1.9.6, with changes around the Device Plugin and scheduler extensions.

NVIDIA device plugin for Kubernetes (nvidiak8s/v1.9 branch): https://github.com/NVIDIA/k8s-device-plugin/tree/nvidiak8s/v1.9
Extended, together with NVIDIA K8s, to pass various pieces of GPU information to K8s. https://github.com/NVIDIA/k8s-device-plugin/commit/…

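With these changes, the following attributes of the GPUs on each node

GPU memory capacity

Whether GPU memory ECC is enabled

Compute Capability

are collected automatically and can be used as scheduling conditions.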

gpu-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
      affinity:
        required:
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            values: ["8000"] # change value to appropriate mem for GPU

Requesting only nvidia.com/gpu treats every GPU as interchangeable for scheduling, so these extra conditions may be handy in clusters that mix several GPU models.
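To make the effect of the affinity condition in gpu-pod.yaml concrete, here is a toy Python model of the filtering step. It is only a sketch, not NVIDIA's scheduler code: the node inventory, the attribute values, and the function names matches and eligible_nodes are all hypothetical, and it assumes that "Gt" means strictly greater than, as it does for standard Kubernetes node affinity.

```python
from typing import Dict, List

# Hypothetical inventory: node name -> GPU attributes a device plugin might report.
NODES: Dict[str, Dict[str, int]] = {
    "node-a": {"nvidia.com/gpu-memory": 16000},
    "node-b": {"nvidia.com/gpu-memory": 8000},
    "node-c": {"nvidia.com/gpu-memory": 4000},
}


def matches(attrs: Dict[str, int], requirement: dict) -> bool:
    """Return True if one node's attributes satisfy a single requirement."""
    key = requirement["key"]
    if key not in attrs:
        return False
    threshold = int(requirement["values"][0])
    operator = requirement["operator"]
    if operator == "Gt":  # strictly greater than
        return attrs[key] > threshold
    if operator == "Lt":  # strictly less than
        return attrs[key] < threshold
    raise ValueError(f"unsupported operator in this sketch: {operator}")


def eligible_nodes(nodes: Dict[str, Dict[str, int]], required: List[dict]) -> List[str]:
    """Return the sorted names of nodes that satisfy every requirement."""
    return sorted(
        name
        for name, attrs in nodes.items()
        if all(matches(attrs, req) for req in required)
    )


# Same condition as in gpu-pod.yaml: GPU memory greater than 8000.
REQUIRED = [{"key": "nvidia.com/gpu-memory", "operator": "Gt", "values": ["8000"]}]
print(eligible_nodes(NODES, REQUIRED))
```

In this model a node whose GPU reports exactly 8000 is excluded, so on a mixed cluster the threshold should sit below the capacity of the smallest GPU that is acceptable.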

GPU monitoring with Prometheus and Grafana

This builds on Prometheus Operator ( https://github.com/coreos/p… ) and extends it so that GPU information can be monitored using dcgm. This part can also be deployed on ordinary K8s.

node-exporter-daemonset.yaml

$ diff -u  manifests/node-exporter/node-exporter-daemonset.yaml /etc/kubeadm/dcgm/node-exporter-daemonset.yaml
--- manifests/node-exporter/node-exporter-daemonset.yaml    2018-04-09 13:25:28.000000000 +0000
+++ /etc/kubeadm/dcgm/node-exporter-daemonset.yaml  2018-06-15 22:44:00.000000000 +0000
@@ -14,18 +14,31 @@
       name: node-exporter
     spec:
       serviceAccountName: node-exporter
-      securityContext:
-        runAsNonRoot: true
-        runAsUser: 65534
       hostNetwork: true
       hostPID: true
+      nodeSelector:
+        hardware-type: NVIDIAGPU
+      initContainers:
+      - image: nvcr.io/nvidia/k8s/dcgm-exporter:1.4.3
+        name: nvidia-dcgm-exporter-hook
+        command: ["cp"]
+        args:
+        - "/work/dcgm.json"
+        - "/hook/dcgm.json"
+        volumeMounts:
+        - name: dcgm-docker-hook
+          mountPath: /hook
       containers:
       - image: quay.io/prometheus/node-exporter:v0.15.2
         args:
         - "--web.listen-address=127.0.0.1:9101"
         - "--path.procfs=/host/proc"
         - "--path.sysfs=/host/sys"
+        - "--collector.textfile.directory=/run/prometheus"
         name: node-exporter
+        securityContext:
+          runAsNonRoot: true
+          runAsUser: 65534
         resources:
           requests:
             memory: 30Mi
@@ -40,6 +53,17 @@
         - name: sys
           readOnly: true
           mountPath: /host/sys
+        - name: collector-textfiles
+          readOnly: true
+          mountPath: /run/prometheus
+      - image: nvcr.io/nvidia/k8s/dcgm-exporter:1.4.3
+        name: nvidia-dcgm-exporter
+        securityContext:
+          runAsNonRoot: false
+          runAsUser: 0
+        volumeMounts:
+        - name: collector-textfiles
+          mountPath: /run/prometheus
       - name: kube-rbac-proxy
         image: quay.io/brancz/kube-rbac-proxy:v0.2.0
         args:
@@ -66,4 +90,9 @@
       - name: sys
         hostPath:
           path: /sys
-
+      - name: collector-textfiles
+        emptyDir:
+          medium: Memory
+      - name: dcgm-docker-hook
+        hostPath:
+          path: /usr/share/containers/docker/hooks.d
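Reading the diff above, the dcgm-exporter container and node-exporter share the collector-textfiles volume at /run/prometheus, and node-exporter picks the files up through --collector.textfile.directory. The textfile collector reads *.prom files in the Prometheus text exposition format. As a rough sketch of what such a file contains, the following Python parses a small sample. The metric names and values in SAMPLE are hypothetical, not captured from a real dcgm-exporter, and parse_samples is a simplified illustrative helper rather than a complete parser.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical sample in Prometheus text exposition format.
SAMPLE = '''\
# HELP dcgm_gpu_temp GPU temperature (hypothetical sample).
# TYPE dcgm_gpu_temp gauge
dcgm_gpu_temp{gpu="0"} 41
dcgm_gpu_temp{gpu="1"} 47
dcgm_fb_used{gpu="0"} 1024
'''

# metric_name{label="value",...} value
# Lines with a trailing timestamp do not match and are skipped by this sketch.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_samples(text: str) -> List[Sample]:
    """Parse metric lines into (name, labels, value), ignoring comments."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_samples(SAMPLE)
hottest = max(value for name, _, value in samples if name == "dcgm_gpu_temp")
print(samples)
print(hottest)
```

In a real deployment Prometheus scrapes these values through node-exporter, so a parser like this is only useful for inspecting the textfile directory by hand while debugging.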

Support for multiple container runtimes (Docker, CRI-O)

This is CRI-O support in nvidia-container-runtime, which nvidia-docker2 also uses. It can be used independently of NVIDIA K8s.
https://github.com/…

Reference

Original article (What is Kubernetes on NVIDIA GPUs?): https://qiita.com/tmatsu/items/0bc9c166a7a89d7f1024
