I am running an advantage actor-critic (A2C) reinforcement learning model, but when I change the kernel_initializer it gives me an error, even though my state contains valid values. It only works with kernel_initializer=tf.zeros_initializer(). I have changed the model to the code below, and I now face a different problem: it keeps repeating the same action. However, when I change the kernel_initializer back to tf.zeros_initializer(), it starts choosing different actions.
state = [-103.91446672 -109.   7.93509779   0.   0.   1.]
The model:
import tensorflow as tf  # TF 1.x API (tf.placeholder, tf.layers, tf.train)


class Actor:
    """The actor class"""

    def __init__(self, sess, num_actions, observation_shape, config):
        self._sess = sess

        self._state = tf.placeholder(dtype=tf.float32, shape=observation_shape, name='state')
        self._action = tf.placeholder(dtype=tf.int32, name='action')
        self._target = tf.placeholder(dtype=tf.float32, name='target')

        # Policy network: one hidden layer, softmax over the action logits.
        self._hidden_layer = tf.layers.dense(inputs=tf.expand_dims(self._state, 0), units=32, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
        self._output_layer = tf.layers.dense(inputs=self._hidden_layer, units=num_actions, kernel_initializer=tf.zeros_initializer())
        self._action_probs = tf.squeeze(tf.nn.softmax(self._output_layer))
        self._picked_action_prob = tf.gather(self._action_probs, self._action)

        # Policy-gradient loss weighted by the target (advantage).
        self._loss = -tf.log(self._picked_action_prob) * self._target
        self._optimizer = tf.train.AdamOptimizer(learning_rate=config.learning_rate)
        self._train_op = self._optimizer.minimize(self._loss)

    def predict(self, s):
        return self._sess.run(self._action_probs, {self._state: s})

    def update(self, s, a, target):
        self._sess.run(self._train_op, {self._state: s, self._action: a, self._target: target})


class Critic:
    """The critic class"""

    def __init__(self, sess, observation_shape, config):
        self._sess = sess
        self._config = config
        self._name = config.critic_name
        self._observation_shape = observation_shape
        self._build_model()

    def _build_model(self):
        with tf.variable_scope(self._name):
            self._state = tf.placeholder(dtype=tf.float32, shape=self._observation_shape, name='state')
            self._target = tf.placeholder(dtype=tf.float32, name='target')

            # Value network: one hidden layer, scalar state-value output.
            self._hidden_layer = tf.layers.dense(inputs=tf.expand_dims(self._state, 0), units=32, activation=tf.nn.relu, kernel_initializer=tf.zeros_initializer())
            self._out = tf.layers.dense(inputs=self._hidden_layer, units=1, kernel_initializer=tf.zeros_initializer())
            self._value_estimate = tf.squeeze(self._out)

            self._loss = tf.squared_difference(self._out, self._target)
            self._optimizer = tf.train.AdamOptimizer(learning_rate=self._config.learning_rate)
            self._update_step = self._optimizer.minimize(self._loss)

    def predict(self, s):
        return self._sess.run(self._value_estimate, feed_dict={self._state: s})

    def update(self, s, target):
        self._sess.run(self._update_step, feed_dict={self._state: s, self._target: target})
The problem is that I need to improve the learning process, so I thought changing the kernel_initializer might help, but it gives me this error message:
action = np.random.choice(np.arange(len(action)), p=action_prob)
File "mtrand.pyx", line 935, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities contain NaN
Any idea what is causing this?
1 Answer
Using tf.zeros_initializer() as the kernel_initializer of the dense layers in your actor and critic networks can lead to the issue you are experiencing, where the loss becomes NaN and the model repeats the same action. Initializing all of the weights to zero means every unit in a layer computes the same output and receives the same gradient, which can prevent the network from learning.
In general it is better to use an initializer that gives the weights small random values, such as tf.random_normal_initializer() or tf.glorot_uniform_initializer(). Random starting weights break this symmetry, let the network learn, and produce more diverse outputs, which should avoid both the NaN loss and the repeated actions.
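As a minimal sketch, the two dense layers inside Actor.__init__ could be switched to Glorot (Xavier) initialization like this (Glorot is one reasonable choice, not the only one; the biases keep their default zero initialization):

# Drop-in replacement for the layer definitions inside Actor.__init__:
# the kernels now start from small random values instead of all zeros.
self._hidden_layer = tf.layers.dense(inputs=tf.expand_dims(self._state, 0), units=32, activation=tf.nn.relu, kernel_initializer=tf.glorot_uniform_initializer())
self._output_layer = tf.layers.dense(inputs=self._hidden_layer, units=num_actions, kernel_initializer=tf.glorot_uniform_initializer())
self._action_probs = tf.squeeze(tf.nn.softmax(self._output_layer))

The same change applies to the two dense layers in Critic._build_model.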
You can also try a different optimizer, such as RMSProp or Adagrad, which may be better suited to this problem, and adjust the learning rate and other hyperparameters to see whether that improves performance.
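For instance, a sketch of swapping Adam for RMSProp inside Actor.__init__ (the learning rate and decay below are only illustrative values, not tuned for your problem):

# Illustrative optimizer swap; learning_rate=1e-4 and decay=0.99 are example values to tune.
self._optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-4, decay=0.99)
self._train_op = self._optimizer.minimize(self._loss)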
If tf.zeros_initializer is the only initializer that works for your network but the performance is still poor, there are several things you can try to improve it.
First, note that tf.zeros_initializer has no parameters to tune, so to fine-tune the starting weights you will have to switch to an initializer that does have parameters and adjust those.
For example, tf.random_normal_initializer draws the starting weights from a normal distribution; its mean and stddev parameters control that distribution, and you can experiment with different values to see which works best for your network.
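A sketch of that idea, again as a drop-in for the layer definitions in Actor.__init__ (mean=0.0 and stddev=0.01 are illustrative starting points, not tuned values):

# Near-zero random-normal initializer: the initial softmax stays close to
# uniform, much like the all-zeros case, but the hidden units are no longer
# identical, so gradients can differentiate them.
small_init = tf.random_normal_initializer(mean=0.0, stddev=0.01)
self._hidden_layer = tf.layers.dense(inputs=tf.expand_dims(self._state, 0), units=32, activation=tf.nn.relu, kernel_initializer=small_init)
self._output_layer = tf.layers.dense(inputs=self._hidden_layer, units=num_actions, kernel_initializer=small_init)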
Alternatively, you can adjust other hyperparameters, such as the learning rate or the optimizer (for example Adam or RMSprop, as above), to see whether they give better results.
You can also revisit the state, action, and reward definitions to see whether a different representation helps. In particular, a different set of state features, or a different scaling or normalization method, can matter a lot: in the state you printed, the features span very different ranges (around -100 for the first two versus 0 to 1 for the rest), which can make training with a randomly initialized network unstable.
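As a minimal normalization sketch, assuming you know (or can estimate) per-feature bounds; the bounds below are hypothetical values loosely based on the state you printed and should be replaced with the real ranges from your environment:

import numpy as np

# Hypothetical per-feature bounds; replace with the true ranges of your state.
STATE_LOW = np.array([-150.0, -150.0, 0.0, 0.0, 0.0, 0.0], dtype=np.float32)
STATE_HIGH = np.array([0.0, 0.0, 20.0, 1.0, 1.0, 1.0], dtype=np.float32)

def normalize_state(s):
    """Scale each state feature to roughly [0, 1] before feeding the networks."""
    s = np.asarray(s, dtype=np.float32)
    return (s - STATE_LOW) / (STATE_HIGH - STATE_LOW + 1e-8)

# Usage: action_prob = actor.predict(normalize_state(raw_state))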
Finally, you can try training on more data or using a more complex architecture, for example a deeper or wider network. For more information, see the TensorFlow guide on training and evaluating models: https://www.tensorflow.org/guide/keras/train_and_evaluate
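One sketch of a slightly larger actor, assuming the rest of the class stays unchanged (the 64/32 layer sizes are arbitrary illustrative choices):

# Two hidden layers instead of one inside Actor.__init__; sizes are illustrative.
h1 = tf.layers.dense(inputs=tf.expand_dims(self._state, 0), units=64, activation=tf.nn.relu, kernel_initializer=tf.glorot_uniform_initializer())
h2 = tf.layers.dense(inputs=h1, units=32, activation=tf.nn.relu, kernel_initializer=tf.glorot_uniform_initializer())
self._output_layer = tf.layers.dense(inputs=h2, units=num_actions, kernel_initializer=tf.glorot_uniform_initializer())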