Training TensorFlow with GPUs on Kubernetes (2)


In the previous post I said that to use a GPU you need to share the NVIDIA driver with the container, but there is actually a neat way to avoid doing that explicitly.

Training TensorFlow with GPUs on Kubernetes (1)

Training TensorFlow with GPUs on Kubernetes (3)

Here is the best tutorial I know of. This post basically follows it.

The TfJob operator can be installed with the following commands (Helm must already be installed).

CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace --set cloud=azure

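A resource called TfJob is now available. Without it, a plain Job has to mount the NVIDIA driver directories from the host explicitly, like this: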

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    metadata:
      name: example-job
    spec:
      restartPolicy: OnFailure
      volumes:
      - name: bin
        hostPath: 
          path: /usr/lib/nvidia-384/bin
      - name: lib
        hostPath: 
          path: /usr/lib/nvidia-384
      containers:
      - name: tensorflow
        image: tsuyoshiushio/mnst-gpu
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1 
        volumeMounts:
        - name: bin
          mountPath: /usr/local/nvidia/bin
        - name: lib
          mountPath: /usr/local/nvidia/lib64

With TfJob, this is all that is needed:

apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: example-tfjob
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: tsuyoshiushio/mnst-gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure

Under the hood, the driver settings are written in a ConfigMap called tf-job-operator-config, and the operator applies them for you. In addition, a new pod called tf-job-operator is deployed. It monitors the cluster and watches for new TfJob resources.

$ kubectl describe configmaps tf-job-operator-config
Name:         tf-job-operator-config
Namespace:    default
Labels:       <none>
Annotations:  <none>

Data
====
controller_config_file.yaml:
----
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
accelerators:
  alpha.kubernetes.io/nvidia-gpu:
    volumes:
      - name: lib
        mountPath: /usr/local/nvidia/lib64
        hostPath:  /usr/lib/nvidia-384
      - name: bin
        mountPath: /usr/local/nvidia/bin
        hostPath: /usr/lib/nvidia-384/bin
      - name: libcuda
        mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
        hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1

Events:  <none>

Retrieving the trained model and logs

Where can you get the model and logs produced by training inside the container? One option is to mount an Azure Files share.

Using Azure Files with Kubernetes

Concretely, running kubectl create -f on a manifest like the following creates the Secret.

apiVersion: v1
kind: Secret
metadata:
  name: azure-secret
type: Opaque
data:
  azurestorageaccountname: BASE64_STORAGE_ACCOUNT_NAME
  azurestorageaccountkey: BASE64_STORAGE_ACCOUNT_KEY
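
Both data values in the Secret have to be base64-encoded. A minimal sketch of producing them in a shell follows. The account name and key used here are hypothetical placeholders, not values from this post, so substitute your own.

```shell
# Hypothetical placeholder values; replace them with your real storage account name and key.
STORAGE_ACCOUNT_NAME="mystorageaccount"
STORAGE_ACCOUNT_KEY="REPLACE_WITH_YOUR_KEY"

# printf '%s' avoids the trailing newline that a plain echo would add to the encoded value.
# tr -d '\n' removes the line wrapping that some base64 implementations apply to long input.
ENCODED_NAME=$(printf '%s' "$STORAGE_ACCOUNT_NAME" | base64 | tr -d '\n')
ENCODED_KEY=$(printf '%s' "$STORAGE_ACCOUNT_KEY" | base64 | tr -d '\n')

# Print the two lines in the shape the Secret manifest expects under "data:".
echo "  azurestorageaccountname: $ENCODED_NAME"
echo "  azurestorageaccountkey: $ENCODED_KEY"
```

The two printed lines can be pasted under data: in the Secret manifest in place of the BASE64_... placeholders.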

With the following configuration, the model and logs are written to the acsshare share on Azure Files. Simple.

apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: module5-ex2-gpu
spec:
  replicaSpecs:
    - template:
        spec:
          containers:
            - image: tsuyoshiushio/tf-mnist:gpu
              name: tensorflow
              volumeMounts:
              - name: azurefile
                subPath: module5-ex2-gpu
                mountPath: /tmp/tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
          volumes:
            - name: azurefile
              azureFile:
                secretName: azure-secret
                shareName: acsshare
                readOnly: false
          restartPolicy: OnFailure

It looks like TensorBoard can be used as well. Let's add it to the YAML.

apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: module5-ex3
spec:
  replicaSpecs:
    - template:
        spec:
          volumes:
            - name: azurefile
              azureFile:
                  secretName: azure-secret
                  shareName: acsshare
                  readOnly: false
          containers:
            - image: tsuyoshiushio/tf-mnist:gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                - mountPath: /tmp/tensorflow
                  subPath: module5-ex3 # Again we isolate the logs in a new directory on Azure Files
                  name: azurefile
          restartPolicy: OnFailure
  tensorboard:
    logDir: /tmp/tensorflow/logs
    serviceType: LoadBalancer # We request a public IP for our TensorBoard instance
    volumes:
      - name: azurefile
        azureFile:
            secretName: azure-secret
            shareName: acsshare
    volumeMounts:
      - mountPath: /tmp/tensorflow/ # This could be any other path. All that matters is that logDir reflects it.
        subPath: module5-ex3 # This should match the directory our Master is actually writing in
        name: azurefile

Because TensorBoard is exposed through a Service of type LoadBalancer, you can open it in an ordinary browser.

Note, however, that Azure Files provides only about 1000 IOPS, so if you need more than that, consider another approach.

Reference

Source: Training TensorFlow with GPUs on Kubernetes (2), https://qiita.com/TsuyoshiUshio@github/items/f3c28498e75ff1294114
