Training TensorFlow with GPUs on Kubernetes (3)

This third post covers distributed training with TfJob, which I have not fully digested myself yet. It is mostly a memo to myself.

Training TensorFlow with GPUs on Kubernetes (1)

Training TensorFlow with GPUs on Kubernetes (2)

I am linking the best tutorial I know of. This post basically follows it.

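There are two main points:

· Changing the model code
· Writing the TfJob definition with a master, worker, and parameter server configuration

That is the gist of it.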

Changing the model code

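Add the code below. Basically, it does the following:

* Get the TF_CONFIG environment variable
* Build the cluster spec from it
* Build the server definition from the cluster spec
* Determine whether this process is the master

The master creates the session and saves summaries. (Workers do not.)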

main.py


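import json
import os

import tensorflow as tf

# Read TF_CONFIG from the environment; default to an empty JSON object.
tf_config_json = os.environ.get("TF_CONFIG", "{}")
tf_config = json.loads(tf_config_json)

# "task" describes this process; "cluster" lists every process in the job.
task = tf_config.get("task", {})
cluster_spec = tf_config.get("cluster", {})
cluster_spec_object = tf.train.ClusterSpec(cluster_spec)
job_name = task["type"]
task_id = task["index"]

# Start the gRPC server for this task.
server_def = tf.train.ServerDef(
    cluster=cluster_spec_object.as_cluster_def(),
    protocol="grpc",
    job_name=job_name,
    task_index=task_id)
server = tf.train.Server(server_def)

# Only the master acts as the chief.
is_chief = (job_name == 'master')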
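To make the shape of TF_CONFIG concrete, here is a minimal sketch of the parsing step on its own, without TensorFlow. The sample value, its host names, and the helper name parse_tf_config are assumptions for illustration, not what the tutorial or the TfJob operator actually produces. The fallback to a single local master when TF_CONFIG is empty is also my own addition.

```python
import json

# Hypothetical TF_CONFIG for 1 master, 2 workers, and 1 parameter server.
# The host names and port are made up for illustration.
SAMPLE_TF_CONFIG = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
})


def parse_tf_config(tf_config_json):
    """Return (job_name, task_index, is_chief, cluster) from a TF_CONFIG string."""
    tf_config = json.loads(tf_config_json or "{}")
    task = tf_config.get("task", {})
    cluster = tf_config.get("cluster", {})
    # Assumption: with no TF_CONFIG, behave as a single local master.
    job_name = task.get("type", "master")
    task_index = int(task.get("index", 0))
    # Same rule as main.py: only the master is the chief.
    is_chief = job_name == "master"
    return job_name, task_index, is_chief, cluster


if __name__ == "__main__":
    job_name, task_index, is_chief, cluster = parse_tf_config(SAMPLE_TF_CONFIG)
    print(job_name, task_index, is_chief)
    print(sorted(cluster))
```

Keeping the parsing in a small function like this makes it possible to check the chief logic locally before running the job on the cluster.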

In the YAML, specify the master, worker, and parameter server replicas. The AzureFile volume is configured only for the master.

module6-ex2-gpu.yaml

apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: module6-ex2
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: azurefile
              azureFile:
                  secretName: azure-secret
                  shareName: acsshare
                  readOnly: false
          containers:
            - image: tsuyoshiushio/minstexp:gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                - mountPath: /tmp/tensorflow
                  subPath: module6-ex2 # Again we isolate the logs in a new directory on Azure Files
                  name: azurefile
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: tsuyoshiushio/minstexp:gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
  tensorboard:
    logDir: /tmp/tensorflow/logs
    serviceType: LoadBalancer # We request a public IP for our TensorBoard instance
    volumes:
      - name: azurefile
        azureFile:
            secretName: azure-secret
            shareName: acsshare
    volumeMounts:
      - mountPath: /tmp/tensorflow/ # This could be any other path. All that matters is that logDir reflects it.
        subPath: module6-ex2 # This should match the directory our Master is actually writing in
        name: azurefile
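As a rough check of what this spec asks for, the sketch below lists the tasks and the total GPU request implied by the replicaSpecs. The replica counts and GPU requests are copied from the YAML above. Mapping tfReplicaType to a lower-cased job name is an assumption, chosen because main.py compares against 'master'. The helper names are hypothetical.

```python
# Replica layout copied from module6-ex2-gpu.yaml.
# "gpus" is the alpha.kubernetes.io/nvidia-gpu request per replica;
# the PS entry has no template, so it requests none.
REPLICA_SPECS = [
    {"tfReplicaType": "MASTER", "replicas": 1, "gpus": 1},
    {"tfReplicaType": "WORKER", "replicas": 2, "gpus": 1},
    {"tfReplicaType": "PS", "replicas": 1, "gpus": 0},
]


def expected_tasks(replica_specs):
    """List the (job_name, task_index) pairs implied by the replica specs."""
    tasks = []
    for spec in replica_specs:
        # Assumption: the job name is the lower-cased tfReplicaType.
        job_name = spec["tfReplicaType"].lower()
        for index in range(spec["replicas"]):
            tasks.append((job_name, index))
    return tasks


def total_gpus(replica_specs):
    """Sum the GPU requests over all replicas."""
    return sum(spec["replicas"] * spec["gpus"] for spec in replica_specs)


if __name__ == "__main__":
    for job_name, index in expected_tasks(REPLICA_SPECS):
        print(job_name, index)
    print("GPUs requested:", total_gpus(REPLICA_SPECS))
```

With this layout the cluster needs three schedulable GPUs, one for the master and one for each of the two workers.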

Execution result

$ kubectl get jobs
NAME                                 DESIRED   SUCCESSFUL   AGE
module7-tf-paint-0-0-master-qdna-0   1         1            4h
module7-tf-paint-0-1-master-yuxn-0   1         1            4h
module7-tf-paint-0-2-master-eytn-0   1         1            4h
module7-tf-paint-1-0-master-eeie-0   1         1            4h
module7-tf-paint-1-1-master-xtmz-0   1         1            4h
module7-tf-paint-1-2-master-ffr4-0   1         1            4h
module7-tf-paint-2-0-master-evil-0   1         1            4h
module7-tf-paint-2-1-master-3mza-0   1         1            4h
module7-tf-paint-2-2-master-vab2-0   1         1            4h

Reference

Original article: https://qiita.com/TsuyoshiUshio@github/items/33217e66067891f34fd1
