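Training TensorFlow with GPUs on Kubernetes (3)
I'm pinning the best tutorial I have found here; this post basically follows it.
There are two key points:
· Changing the model code
· Writing the TfJob spec with a master, worker, and parameter server configuration
That is roughly it.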
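Changing the model code
Add the code below to the model. In outline, it does the following:
* Read the TF_CONFIG environment variable
* Build the cluster configuration from it
* Build the server configuration from that
* Determine whether this task is the master
The master creates the session and saves the summaries (the workers do not).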
main.py
import json
import os

import tensorflow as tf  # TensorFlow 1.x API

# Read the cluster layout and this task's role from TF_CONFIG.
tf_config_json = os.environ.get("TF_CONFIG", "{}")
tf_config = json.loads(tf_config_json)
task = tf_config.get("task", {})
cluster_spec = tf_config.get("cluster", {})
cluster_spec_object = tf.train.ClusterSpec(cluster_spec)
job_name = task["type"]
task_id = task["index"]

# Describe and start the gRPC server for this task.
server_def = tf.train.ServerDef(
    cluster=cluster_spec_object.as_cluster_def(),
    protocol="grpc",
    job_name=job_name,
    task_index=task_id)
server = tf.train.Server(server_def)

# Only the master acts as chief.
is_chief = (job_name == 'master')
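To make the shape of TF_CONFIG more concrete, here is a minimal sketch that uses only the standard library. The host names, the port, and the helper name parse_tf_config are hypothetical. Falling back to a lone master when TF_CONFIG is missing is my own assumption, not something the tutorial prescribes.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster, job_name, task_index, is_chief) from a TF_CONFIG string.

    Hypothetical helper: falls back to a lone "master" task when TF_CONFIG
    is unset or empty, so the same script can also run outside the cluster.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    job_name = task.get("type", "master")
    task_index = int(task.get("index", 0))
    return cluster, job_name, task_index, job_name == "master"


# Example value shaped like what the second worker replica might receive.
# The host names and port are made up for illustration.
example = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

cluster, job_name, task_index, is_chief = parse_tf_config(example)
print(job_name, task_index, is_chief)  # worker 1 False
```

Calling parse_tf_config("{}") returns the master role with index 0, which is handy for a local smoke test before submitting the job.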
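In the YAML, specify the master, the workers, and the parameter server. Among the training replicas, only the master mounts the Azure Files share; TensorBoard mounts the same share so that it can read the logs.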
module6-ex2-gpu.yaml
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
metadata:
  name: module6-ex2
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          volumes:
            - name: azurefile
              azureFile:
                secretName: azure-secret
                shareName: acsshare
                readOnly: false
          containers:
            - image: tsuyoshiushio/minstexp:gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                - mountPath: /tmp/tensorflow
                  subPath: module6-ex2 # Again we isolate the logs in a new directory on Azure Files
                  name: azurefile
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: tsuyoshiushio/minstexp:gpu
              name: tensorflow
              resources:
                requests:
                  alpha.kubernetes.io/nvidia-gpu: 1
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
  tensorboard:
    logDir: /tmp/tensorflow/logs
    serviceType: LoadBalancer # We request a public IP for our TensorBoard instance
    volumes:
      - name: azurefile
        azureFile:
          secretName: azure-secret
          shareName: acsshare
    volumeMounts:
      - mountPath: /tmp/tensorflow/ # This could be any other path. All that matters is that logDir reflects it.
        subPath: module6-ex2 # This should match the directory our Master is actually writing in
        name: azurefile
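As a quick sanity check on cluster capacity, the sketch below adds up the GPUs that the manifest above requests. The replica list is transcribed by hand from the YAML into plain Python dicts (nothing is read from the cluster), and the helper name total_gpus is hypothetical.

```python
GPU_KEY = "alpha.kubernetes.io/nvidia-gpu"

# Hand-transcribed subset of replicaSpecs: replica count plus the
# per-container resource requests. PS has no template, so no containers.
replica_specs = [
    {"type": "MASTER", "replicas": 1, "containers": [{"requests": {GPU_KEY: 1}}]},
    {"type": "WORKER", "replicas": 2, "containers": [{"requests": {GPU_KEY: 1}}]},
    {"type": "PS", "replicas": 1, "containers": []},
]


def total_gpus(specs):
    """Sum the GPU requests over every replica of every replica type."""
    total = 0
    for spec in specs:
        per_replica = sum(
            container.get("requests", {}).get(GPU_KEY, 0)
            for container in spec["containers"]
        )
        total += spec["replicas"] * per_replica
    return total


print(total_gpus(replica_specs))  # 3
```

With one master and two workers each requesting one GPU, the job needs three schedulable GPUs. If the cluster has fewer, some of the pods will stay Pending.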
Result
$ kubectl get jobs
NAME                                 DESIRED   SUCCESSFUL   AGE
module7-tf-paint-0-0-master-qdna-0   1         1            4h
module7-tf-paint-0-1-master-yuxn-0   1         1            4h
module7-tf-paint-0-2-master-eytn-0   1         1            4h
module7-tf-paint-1-0-master-eeie-0   1         1            4h
module7-tf-paint-1-1-master-xtmz-0   1         1            4h
module7-tf-paint-1-2-master-ffr4-0   1         1            4h
module7-tf-paint-2-0-master-evil-0   1         1            4h
module7-tf-paint-2-1-master-3mza-0   1         1            4h
module7-tf-paint-2-2-master-vab2-0   1         1            4h
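When there are many jobs, scanning a listing like the one above by eye gets tedious. The sketch below is one way to flag jobs that have not finished. It assumes the NAME / DESIRED / SUCCESSFUL column layout shown above (other kubectl versions may print different columns), and the helper name unfinished_jobs is hypothetical.

```python
def unfinished_jobs(kubectl_output):
    """Return the names of jobs whose SUCCESSFUL count is below DESIRED."""
    lines = kubectl_output.strip().splitlines()
    header = lines[0].split()
    desired_col = header.index("DESIRED")
    successful_col = header.index("SUCCESSFUL")
    pending = []
    for line in lines[1:]:
        fields = line.split()
        if int(fields[successful_col]) < int(fields[desired_col]):
            pending.append(fields[0])
    return pending


# Two rows copied from the listing above.
sample = """\
NAME                                 DESIRED   SUCCESSFUL   AGE
module7-tf-paint-0-0-master-qdna-0   1         1            4h
module7-tf-paint-0-1-master-yuxn-0   1         1            4h
"""

print(unfinished_jobs(sample))  # []
```

An empty list means every listed job reached its desired number of successful completions.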
Reference
Original article for this post (Training TensorFlow with GPUs on Kubernetes (3)):
https://qiita.com/TsuyoshiUshio@github/items/33217e66067891f34fd1
This text may be freely shared or copied, but please keep the URL above as the reference URL.