During policy gradient training, both the average return and the loss increase. How can this be?
0 votes
April 7, 2019

In the OpenAI Spinning Up Introduction to Policy Gradients, the loss keeps increasing during training. Training itself appears to be going well, since the average episode return is also increasing:

    For epoch 0 loss = 16.201689 , average return = 18.91698113207547 average length = 18.91698113207547
    For epoch 1 loss = 20.150068 , average return = 23.189814814814813 average length = 23.189814814814813
    For epoch 2 loss = 27.689816 , average return = 29.362573099415204 average length = 29.362573099415204
    For epoch 3 loss = 36.150036 , average return = 37.096296296296295 average length = 37.096296296296295
    For epoch 4 loss = 35.464123 , average return = 37.984848484848484 average length = 37.984848484848484
    For epoch 5 loss = 37.448246 , average return = 43.03389830508475 average length = 43.03389830508475
    For epoch 6 loss = 35.55625 , average return = 45.054054054054056 average length = 45.054054054054056
    For epoch 7 loss = 41.18429 , average return = 47.05607476635514 average length = 47.05607476635514
    For epoch 8 loss = 42.95302 , average return = 53.903225806451616 average length = 53.903225806451616
    For epoch 9 loss = 45.53735 , average return = 59.55952380952381 average length = 59.55952380952381
    For epoch 10 loss = 46.496105 , average return = 62.333333333333336 average length = 62.333333333333336
    For epoch 11 loss = 46.55202 , average return = 62.6 average length = 62.6
    For epoch 12 loss = 45.9215 , average return = 64.3974358974359 average length = 64.3974358974359
    For epoch 13 loss = 54.20737 , average return = 72.8695652173913 average length = 72.8695652173913
    For epoch 14 loss = 51.20874 , average return = 72.14285714285714 average length = 72.14285714285714
    For epoch 15 loss = 56.763493 , average return = 77.3030303030303 average length = 77.3030303030303
    For epoch 16 loss = 58.00795 , average return = 78.1875 average length = 78.1875
    For epoch 17 loss = 60.020435 , average return = 80.92063492063492 average length = 80.92063492063492
    For epoch 18 loss = 60.610153 , average return = 83.33870967741936 average length = 83.33870967741936
    For epoch 19 loss = 76.766464 , average return = 109.69565217391305 average length = 109.69565217391305
    For epoch 20 loss = 78.76357 , average return = 111.97777777777777 average length = 111.97777777777777
    For epoch 21 loss = 85.76431 , average return = 120.02380952380952 average length = 120.02380952380952
    For epoch 22 loss = 90.31511 , average return = 138.21621621621622 average length = 138.21621621621622
    For epoch 23 loss = 95.106926 , average return = 143.54285714285714 average length = 143.54285714285714
    For epoch 24 loss = 101.54981 , average return = 165.2258064516129 average length = 165.2258064516129
    For epoch 25 loss = 100.40508 , average return = 163.1290322580645 average length = 163.1290322580645
    For epoch 26 loss = 103.56158 , average return = 165.4516129032258 average length = 165.4516129032258
    For epoch 27 loss = 101.53268 , average return = 163.8709677419355 average length = 163.8709677419355
    For epoch 28 loss = 98.53237 , average return = 160.65625 average length = 160.65625
    For epoch 29 loss = 102.55508 , average return = 164.51612903225808 average length = 164.51612903225808
    For epoch 30 loss = 103.08756 , average return = 166.29032258064515 average length = 166.29032258064515
    For epoch 31 loss = 105.67014 , average return = 181.07142857142858 average length = 181.07142857142858
    For epoch 32 loss = 110.90505 , average return = 191.4814814814815 average length = 191.4814814814815
    For epoch 33 loss = 109.92474 , average return = 186.88888888888889 average length = 186.88888888888889
    For epoch 34 loss = 111.94729 , average return = 198.3846153846154 average length = 198.3846153846154
    For epoch 35 loss = 108.702065 , average return = 188.03703703703704 average length = 188.03703703703704
    For epoch 36 loss = 105.90295 , average return = 184.89285714285714 average length = 184.89285714285714
    For epoch 37 loss = 108.243744 , average return = 189.85185185185185 average length = 189.85185185185185
    For epoch 38 loss = 106.979355 , average return = 189.1851851851852 average length = 189.1851851851852
    For epoch 39 loss = 105.431175 , average return = 188.74074074074073 average length = 188.74074074074073
    For epoch 40 loss = 105.67837 , average return = 190.03703703703704 average length = 190.03703703703704
    For epoch 41 loss = 104.9668 , average return = 191.85185185185185 average length = 191.85185185185185
    For epoch 42 loss = 106.79012 , average return = 193.03846153846155 average length = 193.03846153846155
    For epoch 43 loss = 105.757576 , average return = 196.1153846153846 average length = 196.1153846153846
    For epoch 44 loss = 106.6897 , average return = 196.69230769230768 average length = 196.69230769230768
    For epoch 45 loss = 103.75026 , average return = 192.92307692307693 average length = 192.92307692307693
    For epoch 46 loss = 103.76846 , average return = 196.07692307692307 average length = 196.07692307692307
    For epoch 47 loss = 101.90884 , average return = 188.8148148148148 average length = 188.8148148148148
    For epoch 48 loss = 101.55637 , average return = 185.92592592592592 average length = 185.92592592592592
    For epoch 49 loss = 92.17365 , average return = 165.41935483870967 average length = 165.41935483870967

According to the code, we are minimizing the loss, as expected:

    train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)

How can the loss be increasing?
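For context, the quantity the Spinning Up example reports as "loss" is the policy-gradient pseudo-loss, roughly -mean(log π(a|s) · R). A minimal plain-Python sketch (the function name and the numbers are hypothetical, chosen only to illustrate scale) shows that this quantity grows with the returns even when the policy itself is unchanged:

```python
import math

def pseudo_loss(log_probs, returns):
    # Policy-gradient pseudo-loss: -mean(log pi(a|s) * R).
    # Its gradient matches the policy gradient, but its value is
    # not a performance measure the way a supervised loss is.
    n = len(log_probs)
    return -sum(lp * r for lp, r in zip(log_probs, returns)) / n

# Identical per-step log-probabilities (same policy), but episodes
# later in training last longer, so the returns R are larger.
log_probs = [math.log(0.5)] * 4   # hypothetical per-step log-probs
early_returns = [20.0] * 4        # returns early in training
late_returns = [200.0] * 4        # returns later in training

print(pseudo_loss(log_probs, early_returns))  # smaller magnitude
print(pseudo_loss(log_probs, late_returns))   # ~10x larger, same policy
```

Here the reported "loss" scales linearly with the returns used as weights, so it can rise alongside the average return.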

...