Delving into Generative AI Model Architectures and Core Algorithms
Generative Artificial Intelligence (GenAI) is a rapidly expanding subfield of artificial intelligence dedicated to models that generate novel, plausible samples resembling a target data distribution. In this article, I provide a survey of prevalent GenAI architectures, their optimization objectives, and their loss functions, laying a conceptual foundation for further GenAI work. I discuss canonical GenAI applications, examine widely adopted architectures, namely autoregressive models, Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs), and review the fundamental optimization objectives that shape GenAI behavior.
Generative AI refers to the branch of artificial intelligence devoted to designing models that produce new, convincing examples reflecting a prespecified data distribution. In contrast to discriminative models, which focus on distinguishing classes, generative models strive to faithfully reproduce the statistical properties of their training data, making them versatile engines for creative pursuits and simulation-based experimentation.
Common GenAI applications
- Text generation: Producing grammatically sound sentences, paragraphs, or entire narratives, GenAI systems facilitate storytelling, automated journalism, and chatbot functionalities.
- Image synthesis: Fabricating visually appealing still images or animations, GenAI caters to virtual reality, video games, film production, and advertising.
- Music composition: Orchestrating compositions conforming to musical theory conventions, GenAI finds utility in digital audio workstation plugins, assisted songwriting software, and background scoring for videos or movies.
Popular GenAI Architectures
- Generative AI encompasses a diverse array of architectures, each with its own strengths and applications. Prominent among them are autoregressive models, Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs).
- Autoregressive models: Autoregressive models generate data sequentially, with each output conditioned on the previously generated outputs. GPT (Generative Pre-trained Transformer), PixelRNN, and WaveNet are notable examples; a minimal sketch of autoregressive sampling follows this group.
- GPT: Representing one of the most influential recent breakthroughs in NLP, GPT (Generative Pre-trained Transformer) stacks masked self-attention layers that encode rich linguistic nuances in input text. Owing to its effectiveness, GPT underpins applications such as OpenAI's ChatGPT, and the same decoder-only Transformer recipe underlies comparable large language models such as Microsoft's Turing-NLG.
- PixelRNN: Scanning pixels sequentially in raster order, PixelRNN estimates conditional pixel-intensity distributions using Long Short-Term Memory (LSTM) layers. Although computationally intensive, PixelRNN achieves strong likelihood scores on image-generation benchmarks.
- WaveNet: Capitalizing on dilated causal convolutions, WaveNet produces raw audio waveforms with remarkable fidelity. By replacing recurrence with stacked dilations, it covers long temporal contexts efficiently, benefiting speech synthesis, voice conversion, and voice-driven robotic interaction.
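To make the autoregressive idea concrete, here is a minimal sketch of sequential sampling: each token is drawn from a distribution conditioned on what has already been generated. The tiny bigram table, the vocabulary, and the `sample_sequence` helper are illustrative assumptions, not part of any model above; GPT, PixelRNN, and WaveNet condition on the full prefix with a neural network rather than a lookup table.

```python
# Minimal sketch of autoregressive sampling: each new token is drawn from a
# distribution conditioned on what has been generated so far.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "cat", "sat", "mat", "."]

# Hypothetical next-token probabilities P(x_t | x_{t-1}); purely illustrative.
transitions = {
    "<s>": [0.0, 0.8, 0.2, 0.0, 0.0, 0.0],
    "the": [0.0, 0.0, 0.5, 0.0, 0.5, 0.0],
    "cat": [0.0, 0.0, 0.0, 0.9, 0.0, 0.1],
    "sat": [0.0, 0.4, 0.0, 0.0, 0.0, 0.6],
    "mat": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    ".":   [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

def sample_sequence(max_len=8):
    tokens = ["<s>"]
    for _ in range(max_len):
        probs = transitions[tokens[-1]]              # condition on previous output
        nxt = vocab[rng.choice(len(vocab), p=probs)]  # sample the next token
        tokens.append(nxt)
        if nxt == ".":                                # stop at end of sentence
            break
    return " ".join(tokens[1:])

print(sample_sequence())
```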
- Variational Autoencoders (VAEs): VAEs learn a latent representation of the data's probability distribution, enabling them to generate new samples with similar characteristics; a sketch of the shared VAE objective follows this group. Variants such as Beta-VAE, InfoVAE, and NVAE are popular in the VAE landscape.
- Beta-VAE: Augmenting the standard VAE framework, Beta-VAE introduces a hyperparameter that controls the trade-off between reconstruction precision and latent-space regularization. This encourages disentangled representations, yielding interpretable embeddings suited to downstream tasks.
- InfoVAE: InfoVAE generalizes the standard VAE objective by adding a mutual-information term between inputs and latent codes and by allowing the KL regularizer to be swapped for more flexible divergences such as maximum mean discrepancy (MMD), striking a balance between tractability and expressiveness. Empirically, InfoVAE matches or exceeds competing VAE variants while avoiding common failure modes such as uninformative latent codes.
- NVAE: A deep hierarchical variational autoencoder, NVAE ranks among the most scalable deep generative models available, accommodating voluminous datasets without sacrificing sample quality. Designed around hierarchically structured priors, NVAE produces high-fidelity samples indicative of its capacity to capture intricate structure in the training data.
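As a concrete reference for the objective these variants share, below is a minimal sketch of the VAE loss: a reconstruction term plus a KL regularizer, with a `beta` factor that recovers the plain VAE at 1.0 and Beta-VAE otherwise. The function name, the Bernoulli (binary cross-entropy) decoder, and the tensor shapes are assumptions made purely for illustration.

```python
# Minimal sketch of the VAE objective: reconstruction + beta * KL.
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon_logits, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input
    # (Bernoulli decoder assumed, hence binary cross-entropy).
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I),
    # in closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # beta = 1 gives the plain VAE; beta > 1 gives the Beta-VAE trade-off.
    return recon + beta * kl
```

Increasing `beta` trades reconstruction fidelity for a more heavily regularized, and often more disentangled, latent space.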
- Generative Adversarial Networks (GANs): GANs pair two neural networks, a generator and a discriminator, trained in an adversarial manner so that the generator learns to produce realistic data samples; a sketch of one adversarial training step follows this group. DCGAN, CycleGAN, and StyleGAN are prominent examples of GAN architectures.
- DCGAN: Short for Deep Convolutional GAN, DCGAN replaces pooling layers with strided convolutions and uses fractionally-strided (transposed) convolutions for up-sampling, stabilizing training dynamics and consistently delivering visually pleasing outputs.
- CycleGAN: Addressing the challenge of unpaired image-to-image translation, CycleGAN exploits cycle-consistency losses to preserve source content, removing the need for aligned dataset pairs. Exhibiting broad appeal, CycleGAN fuels artistic style transfer, satellite-image enhancement, and facial-expression swapping.
- StyleGAN: Built around adaptive instance normalization (AdaIN), StyleGAN significantly improves upon earlier GANs, granting fine-grained control over attribute manipulation and style mixing. Subsequent revisions culminated in StyleGAN3, which removes the aliasing ("texture-sticking") artifacts that commonly afflicted earlier generated samples.
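The adversarial recipe these GANs share can be sketched in a few lines. Below is a minimal single training step using the standard non-saturating binary cross-entropy losses; the generator `G`, discriminator `D` (assumed to return logits), their optimizers, and `latent_dim` are placeholders assumed to be defined elsewhere, not a specific published implementation.

```python
# Minimal sketch of one adversarial GAN training step (non-saturating losses).
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, latent_dim=100):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator update: push real samples toward 1, generated toward 0.
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()                      # do not backprop into G here
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones) +
              F.binary_cross_entropy_with_logits(D(fake), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to fool the discriminator into predicting 1.
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```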
Fundamental Optimization Objectives & Loss Functions
Optimizing generative AI models involves defining suitable objectives and loss functions tailored to the specific architecture and application.
- Maximum Likelihood Estimation (MLE): MLE maximizes the likelihood of the observed data under the model's learned distribution. Employed extensively in classical statistics, MLE seeks parameters that assign maximal joint probability to the training observations, forming the cornerstone of many GenAI algorithms. PixelRNN and WaveNet are trained by MLE directly, while plain VAEs maximize a variational lower bound (the ELBO) on the likelihood; a small numeric example relating these objectives follows this list.
- Cross-Entropy Loss: Measuring the disparity between predicted and actual label distributions, cross-entropy loss penalizes discordant predictions and acts as a practical proxy for the log-likelihood. Many GenAI implementations adopt cross-entropy loss for its simplicity and compatibility with backpropagation, using it to minimize the discrepancy between generated samples and ground-truth data.
- Kullback-Leibler (KL) Divergence: Quantifying the deviation between the true distribution and the model's approximation, KL divergence gauges how closely a generative model matches its target, steering training toward parameters of minimal dissimilarity. Within VAEs, KL divergence regularizes the latent space by pulling the approximate posterior toward the prior.
- Jensen-Shannon (JS) Divergence: Symmetric and bounded, unlike KL divergence, JS divergence sidesteps problems of asymmetry and undefined values, and it is central to GANs: the original GAN objective implicitly minimizes the JS divergence between the data and generator distributions. Its tendency to yield vanishing gradients when those distributions barely overlap is what later motivated alternatives based on the Earth Mover's (Wasserstein) distance.
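To tie these objectives together, here is a small numeric example: cross-entropy decomposes into the data entropy plus the KL divergence, which is why minimizing cross-entropy is equivalent to maximum likelihood in expectation, and JS divergence is the symmetrized, bounded relative. The toy probabilities below are made up purely for illustration.

```python
# Toy categorical distributions illustrating NLL/cross-entropy, KL, and JS.
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" data distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

cross_entropy = -np.sum(p * np.log(q))        # H(p, q)
entropy       = -np.sum(p * np.log(p))        # H(p)
kl_divergence =  np.sum(p * np.log(p / q))    # KL(p || q)

# H(p, q) = H(p) + KL(p || q): minimizing cross-entropy w.r.t. q is the same
# as minimizing KL(p || q), i.e. maximizing likelihood in expectation.
assert np.isclose(cross_entropy, entropy + kl_divergence)

m = 0.5 * (p + q)                             # midpoint distribution
js_divergence = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))
print(cross_entropy, kl_divergence, js_divergence)
```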
Wasserstein distance
Relaxing the assumptions behind KL and JS divergences, the Wasserstein (Earth Mover's) distance measures the cost of transporting probability mass from one distribution to another, and it remains well defined and provides useful gradients even when the two distributions have disjoint supports. Demonstrably better behaved on multi-modal distributions, it is an attractive choice for difficult settings, most notably as the training objective of Wasserstein GANs (WGANs), sketched below.
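Below is a minimal sketch of one WGAN critic update under the original weight-clipping recipe, in which a critic network approximates the Earth Mover's distance between real and generated batches. The `critic`, its optimizer, the real batch, the (already detached) generated batch, and the `clip` bound are assumptions for illustration, not a definitive implementation.

```python
# Minimal sketch of a WGAN critic update with weight clipping.
import torch

def wgan_critic_step(critic, opt_c, real, fake, clip=0.01):
    # The critic maximizes E[critic(real)] - E[critic(fake)]; we minimize the
    # negation. Unlike the JS-based GAN loss, this stays informative even when
    # the real and generated distributions barely overlap.
    # `fake` is assumed to be detached from the generator's graph.
    loss = critic(fake).mean() - critic(real).mean()
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    # Crude 1-Lipschitz enforcement via weight clipping (the original WGAN
    # recipe; gradient penalties are the more common modern alternative).
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)
    return loss.item()
```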
In this article, I have provided a compendium of prevailing GenAI architectures, outlining their respective merits alongside the fundamental optimization objectives that shape GenAI behavior. As the technology evolves, I expect innovative uses to emerge beyond the established applications discussed here, fueling curiosity and catalyzing further breakthroughs in this invigorating discipline.