It’s an interesting idea, I have two questions.
- Surprise is detected by the norm of the gradients. So, doesn’t this suggest that the model already has a way of adjusting to surprise?
- Is there a danger of model instability when the gradients become larger and the learning rate is also increased?
It’s an interesting idea, I have two questions.
- Surprise is detected by the norm of the gradients. So, doesn’t this suggest that the model already has a way of adjusting to surprise?
- Is there a danger of model instability when the gradients become larger and the learning rate is also increased?