Training
Explore how to train machine learning models using TensorFlow's MonitoredTrainingSession. Learn to log training metrics, handle NaN loss conditions, and save model checkpoints for efficient and reliable model execution.
Chapter Goals:
- Understand how a MonitoredTrainingSession works
- Learn about saving checkpoints and tracking scalar values during training
- Train a machine learning model using a MonitoredTrainingSession
A. Logging values
While tf.summary.scalar lets us keep track of certain values in an events file for TensorBoard, it is also useful to directly log values to STDOUT during training. For instance, it is customary to log the loss and iteration count, so we can stop training if there is an issue.
You’ll notice each line of output is prefixed with “INFO:tensorflow”. This just means the message was logged at the INFO level.
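To see where a prefix of this shape comes from, here is a minimal sketch that uses only Python's standard logging module rather than TensorFlow itself. It assumes the prefix follows the standard library's basic format (level name, then logger name, then message); the helper name format_info_line and the sample message are hypothetical and chosen only for illustration.

```python
import io
import logging


def format_info_line(message):
    """Log `message` at INFO level on a logger named 'tensorflow'
    and return the formatted line as a string.

    Hypothetical helper for illustration only; it does not import TensorFlow.
    """
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    # BASIC_FORMAT is "%(levelname)s:%(name)s:%(message)s"
    handler.setFormatter(logging.Formatter(logging.BASIC_FORMAT))

    logger = logging.getLogger('tensorflow')
    previous_level = logger.level
    logger.setLevel(logging.INFO)  # messages below INFO are dropped
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        # Restore the logger so the sketch has no lasting side effects.
        logger.removeHandler(handler)
        logger.setLevel(previous_level)
    return stream.getvalue().strip()


if __name__ == '__main__':
    # Sample message is made up; real training output will differ.
    print(format_info_line('loss = 0.5, step = 10'))
```

The first field of each line is the logging level and the second is the logger's name, which is why raising the level above INFO would silence these messages.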
We log specific values while training using a tf.compat.v1.train.LoggingTensorHook object. The object is initialized with a dictionary mapping labels to scalar-valued tensors. In our example, the labels we used were 'loss' and 'step', for the ...
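As a rough sketch of how such a hook can be wired up, the example below builds a toy model and logs 'loss' and 'step' every ten iterations. The toy problem (fitting a single weight w), the learning rate, the function name run_training, and the every_n_iter value are all assumptions made for illustration; they are not the lesson's actual model or settings.

```python
import tensorflow as tf


def run_training(num_steps=30):
    """Train a hypothetical one-weight model and log 'loss' and 'step'.

    Returns (final_loss, final_step). Illustrative sketch only.
    """
    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

    # Build everything in a fresh graph so the function can be called repeatedly.
    with tf.Graph().as_default():
        # Toy problem (assumption): find w such that w * 2.0 is close to 4.0.
        w = tf.compat.v1.get_variable(
            'w', shape=[], initializer=tf.compat.v1.zeros_initializer())
        loss = tf.square(w * 2.0 - 4.0)

        global_step = tf.compat.v1.train.get_or_create_global_step()
        optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.05)
        train_op = optimizer.minimize(loss, global_step=global_step)

        # Dictionary maps each label to a scalar-valued tensor.
        log_hook = tf.compat.v1.train.LoggingTensorHook(
            {'loss': loss, 'step': global_step}, every_n_iter=10)

        with tf.compat.v1.train.MonitoredTrainingSession(
                hooks=[log_hook]) as sess:
            for _ in range(num_steps):
                sess.run(train_op)
            # Read the final values before the session closes.
            final_loss, final_step = sess.run([loss, global_step])

    return float(final_loss), int(final_step)


if __name__ == '__main__':
    print(run_training())
```

Passing every_n_iter=10 keeps the output short; a smaller value gives finer-grained feedback at the cost of noisier logs.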