PPO Hyperparameter Notes #2a: Batch Size (Discrete Action Space)

Introduction

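These are notes from experiments on what happens when various PPO hyperparameters are tweaked. This time the topic is the batch size when training PPO in an environment with a discrete action space.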

An article called Best Practices when training with PPO says the following about batch size.

batch_size corresponds to how many experiences are used for each gradient descent update. This should always be a fraction of the buffer_size. If you are using a continuous action space, this value should be large (in 1000s). If you are using a discrete action space, this value should be smaller (in 10s).
Typical Range (Continuous): 512 - 5120
Typical Range (Discrete): 32 - 512

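In short:

The batch size (batch_size) is the number of samples used for each gradient descent update.

The buffer size (buffer_size) should be a multiple of the batch size.

A smaller batch size is better when the action space is discrete, and a larger one when it is continuous.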
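
To make the relationship between batch_size and buffer_size concrete, here is a minimal sketch of how one buffer of collected experience might be shuffled and cut into minibatches for a single epoch. The function name `iter_minibatches` and the use of integers as stand-ins for transitions are assumptions for illustration; this is not the actual implementation of ChainerRL or of the article quoted above.

```python
import random


def iter_minibatches(buffer, batch_size, rng):
    """Yield shuffled minibatches that together cover the buffer exactly once."""
    if len(buffer) % batch_size != 0:
        raise ValueError("buffer size should be a multiple of batch_size")
    indices = list(range(len(buffer)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [buffer[i] for i in indices[start:start + batch_size]]


# Integers stand in for stored transitions (hypothetical data).
buffer = list(range(2048))
batches = list(iter_minibatches(buffer, batch_size=64, rng=random.Random(0)))
print(len(batches), len(batches[0]))
```

Each minibatch corresponds to one gradient update, so with a buffer of 2048 and a batch size of 64 one pass over the buffer gives 2048 / 64 = 32 updates.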

This time I will (roughly) check this on MountainCar-v0, which has a discrete action space.

Experiment

I fix update_interval (= buffer size) at 2048 and try batch sizes of 1, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048. The "typical range" for discrete action spaces is supposedly 32 to 512, but is that really true?
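
As a rough guide to what each setting implies, the sketch below counts gradient updates per batch size. It assumes the defaults that appear in the script's argument list (--update-interval 2048, --epochs 10, --steps 10 ** 6) and assumes that each epoch sweeps the whole buffer once in minibatches; the helper name `updates_per_ppo_update` is made up for illustration.

```python
BUFFER_SIZE = 2048      # --update-interval
EPOCHS = 10             # --epochs
TOTAL_STEPS = 10 ** 6   # --steps
BATCH_SIZES = [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


def updates_per_ppo_update(buffer_size, batch_size, epochs):
    """Gradient updates performed each time the buffer fills up."""
    minibatches, remainder = divmod(buffer_size, batch_size)
    if remainder:
        raise ValueError("batch_size should divide buffer_size evenly")
    return minibatches * epochs


# Number of times the buffer fills up during one training run.
ppo_updates = TOTAL_STEPS // BUFFER_SIZE
for batch_size in BATCH_SIZES:
    per_update = updates_per_ppo_update(BUFFER_SIZE, batch_size, EPOCHS)
    print(f"batch_size={batch_size:5d}  per buffer={per_update:6d}  "
          f"whole run={per_update * ppo_updates:9d}")
```

Under these assumptions, batch size 1 performs 2048 times as many gradient updates as batch size 2048 on the same amount of experience.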

The experiments use ChainerRL. I borrowed the source code from here.
The only argument I change is --batchsize (plus --outdir, since each experiment is saved to a different location).


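    import argparse
    import logging

    # Command-line options of the training script (defaults shown).
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpu', type=int, default=0)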
    parser.add_argument('--env', type=str, default='MountainCar-v0')
    parser.add_argument('--arch', type=str, default='FFSoftmax',
                        choices=('FFSoftmax', 'FFMellowmax',
                                 'FFGaussian'))
    parser.add_argument('--bound-mean', action='store_true')
    parser.add_argument('--seed', type=int, default=0,
                        help='Random seed [0, 2 ** 32)')
    parser.add_argument('--outdir', type=str, default='results',
                        help='Directory path to save output files.'
                             ' If it does not exist, it will be created.')
    parser.add_argument('--steps', type=int, default=10 ** 6)
    parser.add_argument('--eval-interval', type=int, default=10000)
    parser.add_argument('--eval-n-runs', type=int, default=10)
    parser.add_argument('--reward-scale-factor', type=float, default=1e-2)
    parser.add_argument('--standardize-advantages', action='store_true')
    parser.add_argument('--render', action='store_true', default=False)
    parser.add_argument('--lr', type=float, default=3e-4)
    parser.add_argument('--weight-decay', type=float, default=0.0)
    parser.add_argument('--demo', action='store_true', default=False)
    parser.add_argument('--load', type=str, default='')
    parser.add_argument('--logger-level', type=int, default=logging.INFO)
    parser.add_argument('--monitor', action='store_true')

    # PPO-specific settings; only --batchsize is varied in this experiment.
    parser.add_argument('--update-interval', type=int, default=2048)
    parser.add_argument('--batchsize', type=int, default=64)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--entropy-coef', type=float, default=0.0)
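
One way to run the sweep is to generate one command per batch size, each with its own output directory. In this sketch the script name `train_ppo_gym.py` and the `results/batchsize_<N>` directory layout are assumptions (the text only says that --batchsize and --outdir were changed); the commands are built and printed, not executed.

```python
import shlex

BATCH_SIZES = [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]


def build_commands(script, batch_sizes, outdir_root):
    """Return one argv list per batch size, varying only --batchsize and --outdir."""
    commands = []
    for batch_size in batch_sizes:
        commands.append([
            "python", script,
            "--batchsize", str(batch_size),
            "--outdir", f"{outdir_root}/batchsize_{batch_size}",
        ])
    return commands


# "train_ppo_gym.py" is a placeholder name for the training script.
for argv in build_commands("train_ppo_gym.py", BATCH_SIZES, "results"):
    # To actually launch a run, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping every other argument at its default makes the batch size the only difference between runs.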

Following the results of the previous note, the optimizer is SMORMS3 (with no arguments).

Test environment

CPU: Intel Core i7-8700 CPU @ 3.20GHz x 12

Memory: 32GB

GPU: GeForce RTX 2080 Ti 11GB

Results

Learning curves

Moving average over a window of 100. (With batch size 2048, performance never improved through the end of training.)
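
For reference, a moving average over a window of 100 can be computed as follows. This is a minimal sketch, assuming the per-episode returns are available as a plain list; the function name `moving_average` and the toy data are illustrative, and how the original plots were actually produced is not stated.

```python
from collections import deque


def moving_average(values, window=100):
    """Average of each run of `window` consecutive values (full windows only)."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    recent = deque()
    running_sum = 0.0
    for value in values:
        recent.append(value)
        running_sum += value
        if len(recent) > window:
            running_sum -= recent.popleft()
        if len(recent) == window:
            averages.append(running_sum / window)
    return averages


# Toy returns: 150 failed episodes followed by 150 better ones (made-up data).
returns = [-200.0] * 150 + [-120.0] * 150
smoothed = moving_average(returns, window=100)
print(len(smoothed), smoothed[0], smoothed[-1])
```

The smoothed curve has window - 1 fewer points than the raw one, because positions without a full window are dropped.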

Best model performance

Box plot of the cumulative reward over 10000 episodes run with the best model. For reference, the highest return observed was -83 and the lowest was -200.
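
A box plot is built from the quartiles of the episode returns, with whiskers that, depending on the plotting library, reach the extremes or a multiple of the interquartile range. In MountainCar-v0 each step yields a reward of -1 and an episode is cut off after 200 steps, which is why -200 is the floor. The sketch below computes the minimum, quartiles, and maximum with the standard library; the function name `five_number_summary` and the sample returns are made up for illustration and are not the experiment's data.

```python
import statistics


def five_number_summary(returns):
    """Return (min, Q1, median, Q3, max) of a list of episode returns."""
    q1, median, q3 = statistics.quantiles(returns, n=4, method="inclusive")
    return min(returns), q1, median, q3, max(returns)


# Hypothetical returns from five evaluation episodes.
sample_returns = [-200, -150, -120, -100, -83]
print(five_number_summary(sample_returns))
```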

Training time

The right panel is a zoomed-in view. Unsurprisingly, reducing the batch size increases the training time.

Conclusion

Balancing training time against performance, a batch size of around 32 seems best.

Typical Range (Discrete): 32 - 512

The wisdom of those who came before was right. To be continued, if I feel like it.

Reference

For more on this topic (PPO Hyperparameter Notes #2a: Batch Size (Discrete Action Space)), see https://qiita.com/deepirl_learner/items/fbd222870a2dfc439343
