def update_training_gameover(net_context, move_history, q_learning_player,
                             final_board, discount_factor):
    game_result_reward = get_game_result_value(q_learning_player, final_board)

    # move history is in reverse-chronological order - last to first
    next_position, move_index = move_history[0]
    backpropagate(net_context, next_position, move_index, game_result_reward)

    for (position, move_index) in list(move_history)[1:]:
        next_q_values = get_q_values(next_position, net_context.target_net)
        qv = torch.max(next_q_values).item()

        backpropagate(net_context, position, move_index, discount_factor * qv)

        next_position = position

    net_context.target_net.load_state_dict(net_context.policy_net.state_dict())


def backpropagate(net_context, position, move_index, target_value):
    net_context.optimizer.zero_grad()
    output = net_context.policy_net(convert_to_tensor(position))

    target = output.clone().detach()
    target[move_index] = target_value
    illegal_move_indexes = position.get_illegal_move_indexes()
    for mi in illegal_move_indexes:
        target[mi] = LOSS_VALUE

    loss = net_context.loss_function(output, target)
    loss.backward()
    net_context.optimizer.step()
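
The training code above relies on a few pieces defined elsewhere in the project: a net_context object that bundles the policy network, target network, optimizer, and loss function, plus helpers such as convert_to_tensor and get_q_values. As a rough sketch only (the names mirror the calls above, but these bodies are assumptions, not the project's actual definitions), they might look something like this:

import torch

class NetContext:
    """Bundles the policy network, target network, optimizer, and loss function."""
    def __init__(self, policy_net, target_net, optimizer, loss_function):
        self.policy_net = policy_net
        self.target_net = target_net
        # Start the target network as a copy of the policy network; the
        # training loop refreshes it again at the end of every game.
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()
        self.optimizer = optimizer
        self.loss_function = loss_function

def convert_to_tensor(position):
    # Assumed board representation: a flat sequence of cell values that can
    # be fed to the network as a 1-D float tensor.
    return torch.tensor(position.board, dtype=torch.float)

def get_q_values(position, net):
    # Forward pass without gradient tracking: the caller only reads the
    # Q-value estimates, it does not train through them.
    with torch.no_grad():
        return net(convert_to_tensor(position))
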
C:\Dev\python\tictac>python -m tictac.main
Playing random vs random
-------------------------
x wins: 60.10%
o wins: 28.90%
draw : 11.00%
Playing minimax not random vs minimax random:
---------------------------------------------
x wins: 0.00%
o wins: 0.00%
draw : 100.00%
Playing minimax random vs minimax not random:
---------------------------------------------
x wins: 0.00%
o wins: 0.00%
draw : 100.00%
Playing minimax not random vs minimax not random:
-------------------------------------------------
x wins: 0.00%
o wins: 0.00%
draw : 100.00%
Playing minimax random vs minimax random: