Can you try increasing `max_epochs` in your config to some large number? (You can always kill the job early once you see the training curves converge.) Most likely the checkpoint has already reached the configured number of training epochs, so when you resume from it, training stops immediately.
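For example, here is a minimal sketch assuming a PyTorch Lightning-style setup (the framework, paths, and variable names are illustrative; adapt them to your config):

```python
# Minimal sketch, assuming a PyTorch Lightning-style setup (names are illustrative).
import pytorch_lightning as pl

# `model` is assumed to be your LightningModule instance.
trainer = pl.Trainer(
    max_epochs=1000,  # set much higher than the epoch stored in the checkpoint
)

# If the checkpoint's epoch already equals the old max_epochs, resuming would
# exit immediately; with the larger limit, training continues past it.
trainer.fit(model, ckpt_path="path/to/last.ckpt")
```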