Variational Autoencoders: Introduction and Implementation (Part 1)

Krishna Chaitanya
10 min read · Jun 14, 2023


Welcome to the comprehensive guide on Variational Autoencoders (VAEs), a fascinating area within machine learning and artificial intelligence that has significantly influenced the field of data generation and processing. Originally introduced by Kingma and Welling in 2013, VAEs have been employed in an impressive range of applications such as image generation, style transfer, and as crucial components in complex systems like image-to-image translation.

VAEs are a type of autoencoder: a neural network trained to reproduce its input at its output. Autoencoders are typically employed for tasks like dimensionality reduction and feature extraction. VAEs, however, inject a probabilistic twist into the autoencoding process, setting them apart from traditional models.

Unlike a standard autoencoder, which compresses the input data to a latent representation and decompresses it back, a VAE learns the parameters of a probability distribution that represents the data. This allows VAEs to sample novel data points from the learned distribution, even ones absent from the training set, which makes them generative models.
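
Formally, the encoder learns an approximate posterior over a latent variable z, and the decoder models the data given z:

$$
q_\phi(z \mid x) = \mathcal{N}\!\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big), \qquad p(z) = \mathcal{N}(0, I).
$$

Generating a new data point then amounts to drawing $z \sim p(z)$ and passing it through the decoder $p_\theta(x \mid z)$.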

In this three-part series, we primarily focus on the practical implementation of VAEs using TensorFlow. The first part, “Variational Autoencoders: Introduction and Implementation,” lays the foundation for VAEs and walks you through the TensorFlow-based coding process.

Our journey continues in the second part, Variational Autoencoders: Training Procedures, where we delve deeper into the training methodologies. The series concludes with the third installment, Variational Autoencoders: Hyperparameter Tuning with Docker and Bash Scripts, where we discuss hyperparameter tuning, facilitating efficient experimentation using Docker and Bash scripts.

For those ready to get hands-on with the code, you can find the complete code for this series in our GitHub repository. We encourage you to clone the repository and experiment with the code. Through this practical approach, we hope you’ll gain a richer understanding of VAEs and their implementation using TensorFlow. Happy coding!

VAEs consist of two main parts: an encoder and a decoder. The encoder (also known as the recognition or inference model) transforms the input data into two sets of parameters in a latent space: a mean and a variance (stored as a log-variance in the implementation below). The decoder (also known as the generative model) then reverses this process, taking samples from the latent space and generating outputs in the original input space.

Encoder

An encoder in an autoencoder is a neural network component that transforms the input data (usually high-dimensional) into a different representation, often of lower dimensionality, known as the latent representation. This process involves learning the underlying features of the input data, essentially compressing it.

from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense
from tensorflow.keras.models import Model

def encoder_model(input_shape, filters, dense_layer_dim, latent_dim):
    """
    Creates an encoder model for grayscale images that maps input images to a lower-dimensional latent space.

    Args:
    - input_shape: Tuple representing the shape of the input images (height, width, channels).
    - filters: List of integers representing the number of filters in each convolutional layer.
    - dense_layer_dim: Integer representing the number of neurons in the dense layer.
    - latent_dim: Integer representing the dimensionality of the latent space.

    Returns:
    - encoder: Keras Model object representing the encoder model.
    - encoder_layers_dim: List of tuples representing the dimensionality of each layer in the encoder.
    """
    encoder_layers_dim = []  # List to store the dimensions of each layer in the encoder

    # Define the input layer
    encoder_inputs = Input(shape=input_shape)
    encoder_layers_dim.append(tuple(encoder_inputs.shape[1:]))  # Add input layer dimensions to list

    # Add the first convolutional layer with the specified number of filters and activation function
    x = Conv2D(filters[0], (3, 3), activation="relu", strides=2, padding="same")(encoder_inputs)
    encoder_layers_dim.append(tuple(x.shape[1:]))  # Add conv layer dimensions to list

    # Add additional convolutional layers with the specified numbers of filters
    mid_layers = [Conv2D(f, 3, activation="relu", strides=2, padding="same") for f in filters[1:]]
    for mid_layer in mid_layers:
        x = mid_layer(x)
        encoder_layers_dim.append(tuple(x.shape[1:]))  # Add mid layer dimensions to list

    # Flatten convolutional output to prepare for dense layers
    x = Flatten()(x)
    encoder_layers_dim.append(tuple(x.shape[1:]))  # Add flattened layer dimensions to list

    # Add dense layer with the specified number of neurons and activation function
    x = Dense(dense_layer_dim, activation="relu")(x)

    # Add output layers for the latent space (mean and log-variance) and sample from this space
    # (Sampling is the custom reparameterization layer described below)
    z_mean = Dense(latent_dim, name="z_mean")(x)
    z_log_var = Dense(latent_dim, name="z_log_var")(x)
    z = Sampling()([z_mean, z_log_var])
    encoder_layers_dim.append(tuple(z.shape[1:]))  # Add output layer dimensions to list

    # Create encoder model
    return Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder"), encoder_layers_dim

In the provided implementation, the function encoder_model defines the structure of the encoder. The steps involved are:

  1. Input layer: The model starts by defining an input layer. This layer takes in data of a shape specified by the parameter input_shape.
  2. Convolutional layers: The input data is then passed through one or more convolutional layers. The number of convolutional layers and the number of filters in each layer are defined by the filters parameter. These layers extract features from the input data. The output dimensions of each layer are stored in the encoder_layers_dim list (illustrated in the example after this list).
  3. Flattening layer: The output from the convolutional layers is a multi-dimensional tensor. This tensor is flattened into a 1-D vector using the Flatten layer, preparing the data for the dense layer.
  4. Dense layer: The flattened data is passed through a dense layer with dense_layer_dim neurons, which learns a nonlinear combination of the extracted features.
  5. Latent layers: Two dense layers are then added, which output z_mean and z_log_var. These represent the mean and log-variance of the latent space distribution respectively.
  6. Sampling layer: Finally, a sample is taken from the distribution defined by z_mean and z_log_var using the Sampling layer.
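
To make the dimension bookkeeping concrete, here is a small hypothetical configuration (the numbers are illustrative only, not the repository's defaults) together with the encoder_layers_dim list it would produce:

encoder, encoder_layers_dim = encoder_model(
    input_shape=(28, 28, 1), filters=[32, 64], dense_layer_dim=16, latent_dim=2
)

# encoder_layers_dim now records the shape after each stage:
# [(28, 28, 1),   input
#  (14, 14, 32),  first Conv2D, stride 2
#  (7, 7, 64),    second Conv2D, stride 2
#  (3136,),       Flatten (7 * 7 * 64)
#  (2,)]          sampled latent vector z

These recorded shapes are what the decoder (below) uses to mirror the encoder architecture.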

The Sampling layer is a custom layer that is used to sample from the distribution defined by z_mean and z_log_var. It takes as input the mean and log variance and outputs a random sample from the distribution. This is achieved by first generating a random normal variable, epsilon, and then transforming it using the formula z_mean + tf_exp(0.5 * z_log_var) * epsilon.
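
The exact Sampling layer lives in the repository; a minimal sketch consistent with the description above would look like this:

import tensorflow as tf
from tensorflow.keras.layers import Layer

class Sampling(Layer):
    """Reparameterization trick: z = z_mean + exp(0.5 * z_log_var) * epsilon, with epsilon ~ N(0, I)."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

Sampling this way, rather than drawing directly from the distribution, keeps the operation differentiable with respect to z_mean and z_log_var, which is what allows gradients to flow through the encoder during training.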

Decoder

The decoder model is the second half of the Variational Autoencoder (VAE). It takes as input a point in the latent space and outputs a reconstruction of the original data. The goal of the decoder is to map these latent points back to the original input space.

from tensorflow.keras.layers import Input, Dense, Reshape, Conv2DTranspose
from tensorflow.keras.models import Model

# decoder model for grayscale images
def decoder_model(encoder_layers_dim):
    # Extract the dimensions recorded while building the encoder
    latent_dim = encoder_layers_dim[-1][0]
    dense_layer_dim = encoder_layers_dim[-2][0]
    first_conv_layer_dim = encoder_layers_dim[-3]
    output_layer = encoder_layers_dim[0]

    # Create input layer for the latent space vector
    latent_inputs = Input(shape=(latent_dim,))

    # Determine the number of filters for each transpose convolutional layer
    filters = [f[-1] for f in encoder_layers_dim[1:-2]]

    # Feed the latent vector through a dense layer with ReLU activation,
    # then reshape it to match the encoder's last convolutional feature map
    x = Dense(dense_layer_dim, activation="relu")(latent_inputs)
    x = Reshape(first_conv_layer_dim)(x)

    # Apply a series of transpose convolutional layers with ReLU activation, same padding, and stride-2 upsampling
    mid_layers = [Conv2DTranspose(f, 3, activation="relu", strides=2, padding="same") for f in filters[::-1]]
    for mid_layer in mid_layers:
        x = mid_layer(x)

    # Apply a final transpose convolutional layer with sigmoid activation to output the reconstructed image
    decoder_outputs = Conv2DTranspose(output_layer[-1], 3, activation="sigmoid", padding="same")(x)

    # Create and return a Keras model with the latent vector as input and the reconstructed image as output
    return Model(latent_inputs, decoder_outputs, name="decoder")

The steps involved in the decoder model are as follows:

  1. Extracting necessary dimensions: The dimensions necessary for building the decoder model are extracted from the encoder_layers_dim list. The dimensions of the latent space, dense layer, first convolutional layer, and output layer are stored in the latent_dim, dense_layer_dim, first_conv_layer_dim, and output_layer variables respectively.
  2. Input layer: The decoder starts by defining an input layer. This layer takes as input a vector in the latent space. The shape of this vector is defined by latent_dim.
  3. Dense layer: The latent vector is then passed through a dense layer with dense_layer_dim neurons and a ReLU activation function. This transforms the latent vector into a form that can be fed into the transpose convolutional layers.
  4. Reshape layer: The output of the dense layer is reshaped to match the dimensions of the first convolutional layer in the encoder. This is achieved using the Reshape function.
  5. Transpose Convolutional layers: The reshaped data is then passed through one or more transpose convolutional layers, which perform the inverse operation of the encoder's convolutional layers, upsampling the data back toward its original spatial dimensions. The number of filters in each layer is determined from the encoder_layers_dim list.
  6. Output layer: Finally, the data is passed through a final transpose convolutional layer with a sigmoid activation function. This layer outputs the reconstructed image. The number of filters in this layer is determined by the output_layer variable, which is equal to the number of channels in the original images.

The result of the decoder is a reconstruction of the original input data. The aim of the VAE is to minimize the difference between these reconstructed outputs and the original inputs, while also ensuring that the latent space has good properties that enable it to generate new data. This is achieved by defining a suitable loss function and training the model on a dataset.
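
For reference, the quantity minimized by the training code below is the standard VAE objective: a reconstruction term plus a KL-divergence term, the latter with the closed-form Gaussian expression used in train_step:

$$
\mathcal{L}(x) \;=\; \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\big[-\log p_\theta(x \mid z)\big]}_{\text{reconstruction loss}} \;+\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\big\|\,\mathcal{N}(0, I)\big)}_{\text{KL loss}},
\qquad
D_{\mathrm{KL}} \;=\; -\tfrac{1}{2}\sum_{j=1}^{d}\big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big),
$$

where $\mu$ and $\log\sigma^2$ are the z_mean and z_log_var produced by the encoder, and $d$ is the latent dimensionality.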

VAE

Following the creation of the encoder and decoder components, our next step involves building the overarching Variational Autoencoder model. This higher-level model is established using Keras’ Model Subclassing API, which grants us greater flexibility and control over the model’s internal operations. Within this model, we connect our encoder and decoder, establishing the VAE’s capacity to encode inputs into the latent space and subsequently decode them back into their original form. The model also defines the training process, handling the computation of losses, and implements gradient updates during training. This holistic approach offers an effective method to train the model in an end-to-end fashion.

# Imports (aliased to match the names used below)
from tensorflow import GradientTape, reduce_mean, reduce_sum
from tensorflow import exp as tf_exp, square as tf_square
from tensorflow.keras.models import Model
from tensorflow.keras.metrics import Mean
from tensorflow.keras.losses import binary_crossentropy


class VAE(Model):
    """
    This is a Variational Autoencoder (VAE) implemented using the Keras Model API.
    It has an encoder and a decoder network defined separately and passed to the constructor as arguments.
    The VAE class inherits from the Keras Model class and overrides the train_step() method to define the training loop.

    During the forward pass, the encoder takes an input image and outputs the mean and log-variance
    of a latent space distribution, as well as a sampled vector from that distribution.
    The decoder takes the sampled vector and outputs a reconstructed image.

    The training loop consists of computing the reconstruction loss and the
    KL divergence loss, and then computing gradients and updating weights using the optimizer.
    The reconstruction loss measures the difference between the input image and the reconstructed image,
    while the KL divergence loss measures the divergence between the latent space distribution and a standard normal distribution.
    The total loss is the sum of the two losses.

    The VAE class also defines metrics to track during training: the total loss, the reconstruction loss,
    and the KL divergence loss. These metrics are updated in the train_step() method and can be accessed via the metrics property.
    The train_step() method returns a dictionary of these metrics.
    """

    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder

        # Define metrics to track during training
        self.total_loss_tracker = Mean(name="loss")
        self.reconstruction_loss_tracker = Mean(name="recon_loss")
        self.kl_loss_tracker = Mean(name="kl_loss")
        # Define metrics to track during validation
        self.val_total_loss_tracker = Mean(name="val_loss")
        self.val_reconstruction_loss_tracker = Mean(name="val_recon_loss")
        self.val_kl_loss_tracker = Mean(name="val_kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
            self.val_total_loss_tracker,
            self.val_reconstruction_loss_tracker,
            self.val_kl_loss_tracker,
        ]

    # Define forward pass
    def call(self, x):
        z_mean, z_log_var, z = self.encoder(x)
        reconstruction = self.decoder(z)
        return z_mean, z_log_var, z, reconstruction

    # Define training step
    def train_step(self, data):
        with GradientTape() as tape:
            # Forward pass through encoder and decoder
            z_mean, z_log_var, z, reconstruction = self(data)

            # Compute reconstruction loss (per-pixel binary cross-entropy, summed over the image)
            reconstruction_loss = reduce_mean(
                reduce_sum(
                    binary_crossentropy(data, reconstruction), axis=(1, 2)
                )
            )

            # Compute KL divergence loss
            kl_loss = -0.5 * (1 + z_log_var - tf_square(z_mean) - tf_exp(z_log_var))
            kl_loss = reduce_mean(reduce_sum(kl_loss, axis=1))

            # Compute total loss
            total_loss = reconstruction_loss + kl_loss

        # Compute gradients and update weights
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))

        # Update metrics
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)

        # Return metrics as a dictionary
        return {
            "loss": self.total_loss_tracker.result(),
            "recon_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

    def test_step(self, data):
        # Forward pass through encoder and decoder
        z_mean, z_log_var, z, reconstruction = self(data)

        # Compute reconstruction loss
        reconstruction_loss = reduce_mean(
            reduce_sum(
                binary_crossentropy(data, reconstruction), axis=(1, 2)
            )
        )

        # Compute KL divergence loss
        kl_loss = -0.5 * (1 + z_log_var - tf_square(z_mean) - tf_exp(z_log_var))
        kl_loss = reduce_mean(reduce_sum(kl_loss, axis=1))

        # Compute total loss and update validation metrics
        total_loss = reconstruction_loss + kl_loss
        self.val_total_loss_tracker.update_state(total_loss)
        self.val_reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.val_kl_loss_tracker.update_state(kl_loss)

        return {
            "loss": self.val_total_loss_tracker.result(),
            "recon_loss": self.val_reconstruction_loss_tracker.result(),
            "kl_loss": self.val_kl_loss_tracker.result(),
        }

    def on_epoch_end(self):
        # Reset the running averages so each epoch's metrics are computed independently
        self.total_loss_tracker.reset_states()
        self.reconstruction_loss_tracker.reset_states()
        self.kl_loss_tracker.reset_states()
        self.val_total_loss_tracker.reset_states()
        self.val_reconstruction_loss_tracker.reset_states()
        self.val_kl_loss_tracker.reset_states()

The operational mechanics of this Variational Autoencoder (VAE) class can be summarized in the following manner:

  1. Initialization: The constructor of the VAE class takes an encoder and a decoder model as arguments. It also initializes trackers for the different losses computed during training and validation: total loss, reconstruction loss, and KL divergence loss. These trackers are instances of the Mean class, a metric that computes the mean of the values it receives.
  2. Metrics: The metrics property returns a list of the six trackers. Keras accesses this property during training to log the evolution of these metrics.
  3. Forward pass (call method): This method first applies the encoder to the input data x to get the mean, log-variance, and a sample from the latent distribution. It then applies the decoder to the sampled latent vector to get a reconstruction of the input data. The method returns all four of these quantities.
  4. Training step (train_step method): This method defines the operations performed in each step of the training process:
     - Forward pass: The input data is passed through the VAE (via the call method) to get the mean, log-variance, latent sample, and reconstruction.
     - Loss computation: The reconstruction loss and KL divergence loss are computed. The reconstruction loss measures the difference between the input data and the reconstruction and is computed using binary cross-entropy. The KL divergence loss measures the difference between the latent distribution and a standard normal distribution.
     - Gradient computation and weight update: The total loss, which is the sum of the reconstruction loss and the KL divergence loss, is used to compute the gradients of the trainable weights. These gradients are then applied to update the weights.
     - Metric update: The total loss, reconstruction loss, and KL divergence loss are tracked by their respective trackers. The method returns a dictionary containing the current values of the loss metrics.
  5. Validation step (test_step method): This method is similar to the train_step method, but is used for validation data. It does not perform any weight updates.
  6. Reset states (on_epoch_end method): At the end of each training epoch, the states of the loss trackers are reset. This is done because these trackers are instances of the Mean class, which computes a running average of the values it receives. Resetting the trackers ensures that the average is computed separately for each epoch.

In conclusion, this VAE class is a custom Keras model that defines a VAE’s forward pass and training step, and tracks several metrics during training. It also handles validation data and provides functionality for resetting loss trackers at the end of each epoch.
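
To tie the pieces together, here is a short usage sketch showing how the three components could be assembled and trained. The hyperparameters and data shapes are illustrative examples, not the repository's defaults:

from tensorflow.keras.optimizers import Adam

# Build the encoder and capture the per-layer dimensions it records
encoder, encoder_layers_dim = encoder_model(
    input_shape=(28, 28, 1), filters=[32, 64], dense_layer_dim=16, latent_dim=2
)

# Build the decoder by mirroring those dimensions
decoder = decoder_model(encoder_layers_dim)

# Assemble and compile the VAE; the custom train_step uses self.optimizer
vae = VAE(encoder, decoder)
vae.compile(optimizer=Adam(learning_rate=1e-3))

# x_train: float array of shape (N, 28, 28, 1), with pixel values scaled to [0, 1]
# vae.fit(x_train, epochs=30, batch_size=128)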

To access the complete code for this series, please visit our GitHub repository at https://github.com/asokraju/ImageAutoEncoder.
