Mode Collapse and WGANs

Kun Ouyang
5 min read · Jun 8, 2019

--

A Chinese annotation of this article is also available.

Revisiting GANs

GANs are good generators. The training process of GANs was first framed as a Min-Max Game [1][2]: the discriminator tries to tell real samples from generated ones, while the generator tries to fool it. Alternatively, it can be viewed as an E-M-like process: first approximate the true divergence between distributions (e.g., JS), then minimize that divergence. I prefer the second viewpoint. Under it, every D step corresponds to approximating the *true* divergence between P (the real data distribution) and Q (the generated distribution), with the optimal D* measuring that divergence (see the sketch after the list below). Several well-known divergences can be used:

  1. In the vanilla GAN, the divergence is the JS divergence [1]
  2. With the -log(D) trick [2], the divergence becomes KL(Q||P) - 2 JS(P||Q) [3]
  3. The Wasserstein distance (also known as the earth mover's distance) [4]
  4. Maximum Mean Discrepancy (MMD) [5]
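
To make the divergence viewpoint concrete, here is a small LaTeX sketch of the vanilla GAN objective from [1] and what it reduces to at the optimal discriminator; the notation (P for real data, Q for the generated distribution) follows the text above.

```latex
% Vanilla GAN min-max objective [1], with P the real and Q the generated distribution
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim P}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% For a fixed G, the optimal discriminator is
D^*(x) = \frac{P(x)}{P(x) + Q(x)}

% and plugging D^* back in gives
V(D^*, G) = 2\,\mathrm{JS}(P \,\|\, Q) - \log 4
```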

Let’s mainly talk about [1–4], which are the baselines. [5] goes further than [4] in that it does not need the K-Lipschitz requirement.

Pitfalls of GANs [1–3]

For vanilla GANs, when D -> D*, the JS divergence saturates and the generator's gradients vanish. The theoretical results are Theorems 2.1–2.4 in [3].

Adapted from Jonathon Hui, an empirical result:

Figure 1

Another factor that leads to vanishing gradients is the final layer of D, a sigmoid that produces saturated values when the real and generated samples are easily separable (outputting probabilities of nearly pure 1s and 0s). The red curve in the figure below shows this problem:

Figure 2. From [4]

So, although we would like to compute JS(P||Q) accurately, the more accurate the approximation gets, the less useful the gradients become, while an inaccurate JS only brings instability to training. Therefore, in this vanilla version, one needs to choose the number of D updates per G update carefully: not too large (so D stays short of optimal) and not too small (so it still provides some useful guidance).
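
A tiny numerical sketch of this saturation (my own illustration, not from the papers): once the sigmoid discriminator confidently labels a generated sample as fake, the gradient of the original log(1 - D) generator loss is essentially zero, while the -log D alternative still provides a signal.

```python
# Minimal numeric illustration of sigmoid saturation (illustrative, not from [1-4]).
# s is the discriminator's pre-sigmoid score for a generated sample;
# a confident D assigns it a very negative score, i.e. D = sigmoid(s) ~ 0.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

for s in [0.0, -2.0, -5.0, -10.0]:
    d = sigmoid(s)
    grad_original = -d            # d/ds log(1 - sigmoid(s))  -> vanishes as D -> 0
    grad_logd_trick = -(1.0 - d)  # d/ds [-log sigmoid(s)]    -> stays close to -1
    print(f"s={s:6.1f}  D={d:.5f}  grad log(1-D)={grad_original:.5f}  "
          f"grad -logD={grad_logd_trick:.5f}")
```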

Goodfellow also noticed the vanishing problem in [2], where he applied the -log D trick. This gets rid of the vanishing gradients, but leads to instability and mode collapse [3]. Specifically, the divergence to minimize now becomes KL(Q||P) - 2 JS(P||Q). This is problematic and leads to instability, since it minimizes a KL divergence while simultaneously maximizing a JS divergence. Moreover, the KL term is now the reverse KL(Q||P): there is almost no penalty when Q is small where P is large (i.e., when the generator drops modes of the real data), yet a huge penalty when Q is large where P is small (i.e., fake-looking samples), which pushes the generator toward mode collapse.
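
A small numerical sketch of this asymmetry (my own toy example, not from [3]): for a two-mode real distribution P and a generator Q that has collapsed onto one mode, the reverse KL(Q||P) that the -log D trick effectively minimizes stays small, whereas the forward KL(P||Q) would punish the missing mode heavily.

```python
# Toy illustration of why reverse KL barely punishes mode dropping (illustrative).
import numpy as np

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Real data P has two equally likely modes (bins 0 and 1); bin 2 is nearly empty.
P = np.array([0.499, 0.499, 0.002])
# Collapsed generator Q puts nearly all its mass on a single mode of P.
Q = np.array([0.998, 0.001, 0.001])

print("reverse KL(Q||P):", round(kl(Q, P), 3))  # ~0.68: dropping a mode is cheap
print("forward KL(P||Q):", round(kl(P, Q), 3))  # ~2.76: the missing mode is punished
```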

In summary, [1][2] have some specific design problems. The authors of [3] then proposed training with the Earth Mover (Wasserstein) distance instead, termed WGAN [4].

WGANs

The main motivation for WGANs is that the JS, KL, and Total Variation distances are not well behaved. KL can blow up (when the denominator -> 0), and none of them is continuous everywhere with respect to the generator's parameters. This is especially true for JS when the two distributions P and Q are (1) discrete or (2) supported on manifolds with negligible overlap (as in Figure 1).

Using the W-distance instead provides a continuous divergence measure, and thus neither exploding nor vanishing gradients for training G. Also, one can simply train D (called the critic in this work) as far as one likes toward optimality, without worrying about the trade-off between approximating the divergence correctly and keeping useful gradients. The K-Lipschitz constraint is needed to apply the Kantorovich–Rubinstein duality, which is what makes W(P, Q) computable.
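
For reference, a LaTeX sketch of the dual form used in [4], with the critic playing the role of the function f:

```latex
% Kantorovich-Rubinstein duality: the Wasserstein-1 distance as a supremum
% over 1-Lipschitz functions f
W(P, Q) = \sup_{\|f\|_L \le 1}
    \mathbb{E}_{x \sim P}\big[f(x)\big] - \mathbb{E}_{x \sim Q}\big[f(x)\big]

% If the critic is only constrained to be K-Lipschitz (e.g. via weight
% clipping), the supremum yields K \cdot W(P, Q) instead, which only
% rescales the gradients.
```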

There are two main differences between training WGANs and GANs: (1) clip the weights of D to enforce the K-Lipschitz constraint, and (2) do not use a sigmoid at the last layer, to prevent saturation. Note that in WGAN-GP, a further difference is the gradient penalty term, whose backward pass involves second-order gradients.
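
Below is a minimal PyTorch sketch of these two differences inside a WGAN training loop, assuming toy MLP networks and toy data. The architecture and the sample_real helper are illustrative placeholders, not the paper's exact setup, though RMSprop with lr = 5e-5, clip = 0.01, and n_critic = 5 follow the defaults reported in [4].

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
clip_value, n_critic, lr = 0.01, 5, 5e-5   # defaults reported in [4]

# Toy MLPs; the critic has NO sigmoid on its last layer (difference (2)).
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_G = torch.optim.RMSprop(G.parameters(), lr=lr)
opt_D = torch.optim.RMSprop(D.parameters(), lr=lr)

def sample_real(batch_size):
    # Placeholder "real" data: a shifted Gaussian blob; swap in a data loader.
    return torch.randn(batch_size, data_dim) + 3.0

for step in range(100):
    # Train the critic n_critic times per generator step.
    for _ in range(n_critic):
        x_real, z = sample_real(64), torch.randn(64, latent_dim)
        x_fake = G(z).detach()
        # Maximize E[D(real)] - E[D(fake)], i.e. minimize its negative.
        loss_D = -(D(x_real).mean() - D(x_fake).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()
        # Difference (1): clip weights to (crudely) enforce K-Lipschitz.
        with torch.no_grad():
            for p in D.parameters():
                p.clamp_(-clip_value, clip_value)

    # Generator step: minimize -E[D(G(z))].
    z = torch.randn(64, latent_dim)
    loss_G = -D(G(z)).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```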

Mode collapse

Another well-known problem is mode collapse. However, there is no single convincing explanation so far. Here are three attempts from the literature, with my comments:

  1. “Therefore, in standard GAN training, each generator update step is a partial collapse towards a delta function.” From Unrolled GAN [6]. Although each update of G is like maximizing the likelihood under D (chasing the most prominent mode), one small step of G would not necessarily collapse to a delta. Also, when the divergence is correctly measured, a cost is incurred for the missing-mode regions (so WGAN partially solves this problem by computing the W-distance correctly and giving useful gradients).
Figure 3. Justification from the WGAN paper [4]

2. “However, even if we train the discriminator to distinguish between these two manifolds, we have no control over the shape of the discriminator function in between these manifolds. In fact, the shape of the discriminator function in the data can be very non-linear with bad plateaus and wrong maxima.” From MDGAN [7]. To me, this explanation is too weak.

3. “An intuition behind why mode collapse occurs is that the only information that the objective function provides about γ is mediated by the discriminator network Dω. For example, if Dω is a constant, then O_GAN is constant with respect to γ, and so learning the generator is impossible. When this situation occurs in a localized region of input space, for example, when there is a specific type of image that the generator cannot replicate, this can cause mode collapse.” From VEEGAN [8]. This intuition is similar to the argument in [3], which states that the JS divergence can be (locally) constant. But I don't think JS will be constant in many cases.

All in all, the short answer is that the JS divergence used in the original GAN paper is not a good enough measure of distance between distributions.

One mitigation is multi-level reconstruction: use an encoder to encode both generated and real data, and minimize the distance in the encoded space [7, 8]. Besides, reconstructing G(E(x)) back to x in the data space is another possible method [7], roughly as sketched below.
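
A hedged sketch of that second idea: train an encoder E alongside G so that G(E(x)) reconstructs x, and add this reconstruction term to the usual adversarial generator loss. The MLP shapes and the loss weight lam below are my own illustrative choices, not the exact architecture of [7].

```python
import torch
import torch.nn as nn

latent_dim, data_dim, lam = 8, 2, 0.1    # lam is an illustrative loss weight

# Hypothetical encoder E and generator G (toy MLP shapes, not from [7]).
E = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
opt = torch.optim.Adam(list(E.parameters()) + list(G.parameters()), lr=1e-4)

x = torch.randn(32, data_dim)            # stand-in for a batch of real data
recon_loss = ((G(E(x)) - x) ** 2).mean() # data-space reconstruction of G(E(x))

# In a full model this term is added to the adversarial generator loss, e.g.
# total_G_loss = adv_loss + lam * recon_loss, pulling G toward covering the
# regions of data space that real samples (via E) come from.
opt.zero_grad()
(lam * recon_loss).backward()
opt.step()
```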

[1] Generative Adversarial Nets

[2] NIPS 2016 Tutorial: Generative Adversarial Networks

[3] Towards Principled Methods for Training Generative Adversarial Networks

[4] Wasserstein Generative Adversarial Networks

[5] Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy

[6] Unrolled Generative Adversarial Networks

[7] Mode Regularized Generative Adversarial Networks

[8] VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning
