How do I generate sequences using Q-Learning? - PullRequest
January 21, 2020

My Q-Learning function finds the optimal sequence from one location to another, but I would like to generate all of the valid sequences. Say these are my sequences:

['Hi,there,how,have,you,been,noreply,?',
 'Hi,there,where,have,you,been,noreply,?',
 'Hi,there,who,have,you,been,reply,?',
 'Hi,there,yes,have,you,been,reply,?']

My function:

import numpy as np

# Assumes gamma (discount factor), alpha (learning rate), rewards,
# location_to_state and state_to_location are defined at module level.
def get_optimal_route(start_location, end_location):
    # Work on a copy so the goal bonus does not leak into later calls.
    rewards_new = np.copy(rewards)
    ending_state = location_to_state[end_location]
    rewards_new[ending_state, ending_state] = 999
    n_states = rewards_new.shape[0]  # 12 for the rewards matrix below
    Q = np.zeros((n_states, n_states))

    # Q-Learning: sample a random state, then a random playable action.
    for _ in range(1000):
        current_state = np.random.randint(0, n_states)  # upper bound is exclusive
        playable_actions = np.flatnonzero(rewards_new[current_state] > 0)
        next_state = np.random.choice(playable_actions)
        # Temporal-difference update.
        TD = (rewards_new[current_state, next_state]
              + gamma * np.max(Q[next_state])
              - Q[current_state, next_state])
        Q[current_state, next_state] += alpha * TD

    # Follow the greedy policy from the start location to the end location.
    route = [start_location]
    next_location = start_location
    while next_location != end_location:
        current_state = location_to_state[next_location]
        next_state = np.argmax(Q[current_state])
        next_location = state_to_location[next_state]
        route.append(next_location)
    return route

Say the two parameters I pass are 'Hi' and '?'; I would like to generate all 4 of the sequences.
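A greedy argmax over Q returns a single route by construction, so one possible way to get several routes is to enumerate paths over the observed token transitions instead. The following is only a sketch under that assumption, not a Q-Learning solution; build_transitions and all_routes are hypothetical helper names, and the transitions are taken from the four example sequences listed at the top of the question.

```python
from collections import defaultdict

# The four example sequences from the question.
sequences = [
    'Hi,there,how,have,you,been,noreply,?',
    'Hi,there,where,have,you,been,noreply,?',
    'Hi,there,who,have,you,been,reply,?',
    'Hi,there,yes,have,you,been,reply,?',
]

def build_transitions(seqs):
    """Map each token to the set of tokens observed directly after it."""
    transitions = defaultdict(set)
    for seq in seqs:
        tokens = seq.split(',')
        for current, following in zip(tokens, tokens[1:]):
            transitions[current].add(following)
    return transitions

def all_routes(transitions, start, end):
    """Depth-first enumeration of every loop-free route from start to end."""
    routes = []
    stack = [[start]]
    while stack:
        route = stack.pop()
        if route[-1] == end:
            routes.append(route)
            continue
        for nxt in sorted(transitions.get(route[-1], ())):
            if nxt not in route:  # never revisit a token
                stack.append(route + [nxt])
    return routes

transitions = build_transitions(sequences)
candidates = all_routes(transitions, 'Hi', '?')

# Pairwise transitions alone also allow mixes such as "how ... reply",
# so keep only the routes that appear in the original data.
known = set(sequences)
valid = sorted(','.join(r) for r in candidates if ','.join(r) in known)
```

With this data, candidates holds 8 routes (4 choices for the third token times 2 for the seventh), while valid holds the 4 original ones. The gap suggests that a pairwise transition structure, which is also what the rewards matrix encodes, does not by itself distinguish the 4 observed sequences from the other 4 combinations.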

This is my rewards matrix:
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 1.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])

It is based on these tokenized sequences:
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 0,  1,  8,  3,  4,  5,  6,  7],
       [ 0,  1,  9,  3,  4,  5, 10,  7],
       [ 0,  1, 11,  3,  4,  5, 10,  7]])
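For reference, a rewards matrix of this shape can be derived from the tokenized sequences by marking every pair of adjacent token ids. This is a minimal sketch, assuming that is how the matrix was built; build_rewards is a hypothetical helper, and the reverse direction is filled in only because the posted matrix is symmetric.

```python
import numpy as np

# Tokenized sequences from the question (one row per sentence).
tokenized = np.array([
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 1, 8, 3, 4, 5, 6, 7],
    [0, 1, 9, 3, 4, 5, 10, 7],
    [0, 1, 11, 3, 4, 5, 10, 7],
])

def build_rewards(token_rows):
    """Reward 1 for every pair of token ids that are adjacent in some row."""
    n_states = int(token_rows.max()) + 1
    rewards = np.zeros((n_states, n_states))
    for row in token_rows:
        for a, b in zip(row[:-1], row[1:]):
            rewards[a, b] = 1.0  # forward move
            rewards[b, a] = 1.0  # reverse move, matching the symmetric matrix
    return rewards

rewards = build_rewards(tokenized)
```

Because the matrix is symmetric, an agent may also step backwards (for example from state 3 to state 2); dropping the reverse assignment would give a directed matrix that only allows moves in sentence order.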

...