Fig. 5
From: Importance sampling in reinforcement learning with an estimated behavior policy

A simplified version of the neural network architecture used in the Cart Pole experiments; the true architecture has 32 hidden units in each layer. The current policy \({\pi _{\varvec{\theta }}}\) is a neural network that outputs action probabilities as a function of the state (black nodes). The estimated policy \(\hat{\pi }\) is a linear policy that takes as input the activations of the final hidden layer of \({\pi _{\varvec{\theta }}}\). Only the weights on the red, dashed connections are changed when estimating \(\hat{\pi }\).
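
The setup in the figure can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the state dimension (4), action count (2), hidden sizes, activations, and the maximum-likelihood update for \(\hat{\pi }\) are all assumptions chosen to match Cart Pole and the figure's description. The network weights of \({\pi _{\varvec{\theta }}}\) are held fixed; only the linear head's weights (the red, dashed connections) change when fitting \(\hat{\pi }\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed policy network pi_theta for Cart Pole: 4-d state -> 2 action probs.
# Two hidden layers, matching the figure (32 units each in the true architecture).
W1, b1 = rng.normal(scale=0.5, size=(4, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 32)), np.zeros(32)
W3, b3 = rng.normal(scale=0.5, size=(32, 2)), np.zeros(2)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hidden_features(s):
    """Activations of pi_theta's final hidden layer (input to pi_hat)."""
    h1 = np.tanh(s @ W1 + b1)
    return np.tanh(h1 @ W2 + b2)

def pi_theta(s):
    """Current policy: action probabilities from the full network."""
    return softmax(hidden_features(s) @ W3 + b3)

# Estimated policy pi_hat: a linear-softmax head on the frozen features.
# Only W_hat and b_hat (the red, dashed connections) are updated.
W_hat, b_hat = np.zeros((32, 2)), np.zeros(2)

def pi_hat(s):
    return softmax(hidden_features(s) @ W_hat + b_hat)

def fit_step(states, actions, lr=0.1):
    """One gradient step of maximum likelihood on observed (s, a) pairs,
    treating the hidden features as a fixed input representation."""
    global W_hat, b_hat
    phi = hidden_features(states)          # (n, 32)
    probs = softmax(phi @ W_hat + b_hat)   # (n, 2)
    onehot = np.eye(2)[actions]
    grad = probs - onehot                  # gradient of -log-likelihood w.r.t. logits
    W_hat -= lr * phi.T @ grad / len(states)
    b_hat -= lr * grad.mean(axis=0)

# Fit pi_hat on a batch of hypothetical (state, action) samples.
states = rng.normal(size=(64, 4))
actions = rng.integers(0, 2, size=64)
for _ in range(50):
    fit_step(states, actions)
```

Because \(\hat{\pi }\) reuses \({\pi _{\varvec{\theta }}}\)'s learned representation, estimating it reduces to fitting a multinomial logistic regression on the final hidden activations.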