Generative Adversarial Networks for Adversarial Deep Learning
Landon Chu and Joseph Collins
Thomas Jefferson High School for Science and Technology
Abstract
Deep neural networks are highly powerful and flexible models that have been able to achieve above-human performance on image recognition. It has recently been shown, however, that these networks are susceptible to adversarial examples, inputs that have been modified with small but targeted perturbations in order to force misclassification by the network. In this paper, we examine the feasibility of using an adversarial training method based on the Generative Adversarial Network architecture to improve the adversarial robustness of image classifiers.
Background
Szegedy et al. [5] originally demonstrated the susceptibility of neural networks to adversarial examples. Simply put, such models are easily fooled; in some cases, changes can be made to an input image that would be imperceptible to human observers, yet which can successfully force a high-performance image recognition model to misclassify the image with high confidence.
Goodfellow et al. [2] demonstrated the effectiveness of the "fast gradient sign method" (FGSM) for quickly and reliably generating adversarial examples for neural networks. Their paper showed that standard regularization procedures, such as augmenting the dataset with randomly noised or transformed images, do not improve the adversarial accuracy of image classifiers; however, adversarial retraining, in which the training data is augmented with correctly labeled adversarial examples generated by FGSM, raised adversarial accuracy to 82.1%. They hypothesized that the adversarial susceptibility of neural networks is due to the linearity of standard trained models in high-dimensional input spaces.
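For reference, a minimal sketch of FGSM in TensorFlow/Keras (the framework used later in this paper) is shown below; the attack perturbs an input in the direction of the sign of the loss gradient with respect to that input. The model interface, loss, and epsilon value here are illustrative assumptions rather than the configuration used in [2].

```python
import tensorflow as tf

def fgsm_perturb(model, images, labels, epsilon=0.25):
    """Generate FGSM adversarial examples: x_adv = x + epsilon * sign(grad_x J(x, y)).

    Assumes `model` outputs softmax class probabilities and `labels` are integer class IDs.
    """
    images = tf.convert_to_tensor(images, dtype=tf.float32)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(images)
        predictions = model(images)
        loss = loss_fn(labels, predictions)
    gradients = tape.gradient(loss, images)
    adversarial = images + epsilon * tf.sign(gradients)
    # Keep pixel values in the valid [0, 1] range.
    return tf.clip_by_value(adversarial, 0.0, 1.0)
```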
Generative Adversarial Networks (GANs), developed by Goodfellow et al. [1], are a framework consisting of two neural networks, an image generator and a discriminator, training against each other. The discriminator attempts to distinguish images produced by the generator from those within its training dataset of real images, while the generator's objective is to produce images realistic enough to fool the discriminator. Generator networks trained using variants of this method have been able to produce highly realistic images.
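For reference, the two-player objective introduced in [1] can be written as the minimax game

\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],
\]

where $D(x)$ is the discriminator's estimate of the probability that $x$ is a real image and $G(z)$ is the generator's output for noise $z$ drawn from a prior $p_z$.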
The objective of this paper was to evaluate the effectiveness of adversarially training an image classifier with an adaptive two-player training scheme similar to that of a GAN, which could continuously improve the classifier's adversarial performance without requiring any training data beyond the original dataset.
Methodology
The dataset used for this paper was the MNIST set of grayscale handwritten digits, with 60,000 training examples and 10,000 test examples. Pixel values were scaled to the range [0, 1]. All neural network models were written in Python 3 using Keras with TensorFlow as the backend.
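A minimal sketch of this data preparation using the Keras-bundled MNIST loader is shown below; the exact preprocessing code used for this paper is an assumption.

```python
import numpy as np
from tensorflow import keras

# Load the 60,000/10,000 train/test split of 28x28 grayscale digits.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale pixel values from [0, 255] to [0, 1] and add a channel dimension.
x_train = x_train.astype(np.float32)[..., np.newaxis] / 255.0
x_test = x_test.astype(np.float32)[..., np.newaxis] / 255.0
```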
As in a GAN, training involved a generative network set in opposition to a standard-architecture image classifier. Rather than generating realistic images from random noise as in the original GAN architecture, however, the generator was set up to create small image perturbations. Specifically, the generator accepted a dataset image as input, created a corresponding perturbation with a maximum pixel value of 0.5, and output the original image with the perturbation added.
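One possible Keras implementation of such a perturbation generator is sketched below. The single hidden dense layer is an illustrative assumption rather than the architecture used in this paper; the key property from the text is preserved by a tanh output scaled by 0.5, which bounds each perturbation pixel to at most 0.5 in magnitude before it is added back onto the input image.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_generator():
    """Map a dataset image to that image plus a bounded perturbation (illustrative architecture)."""
    image = keras.Input(shape=(28, 28, 1))
    x = layers.Flatten()(image)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(28 * 28, activation="tanh")(x)                # values in [-1, 1]
    perturbation = layers.Reshape((28, 28, 1))(x)
    perturbation = layers.Lambda(lambda p: 0.5 * p)(perturbation)  # bound magnitude to 0.5
    perturbed = layers.Add()([image, perturbation])                # original image + perturbation
    return keras.Model(image, perturbed, name="generator")
```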
To train the generator to output perturbations that would cause the discriminator to misclassify, several additional layers were added on top of the discriminator's output. The original discriminator output consisted of categorical label confidences as a probability distribution over the digits 0-9. A fixed layer was added that computed 1 minus each label confidence, so that its output represented the discriminator's confidence that the input image did not belong to each class. This was followed by a layer that took the previous layer's output together with the correct image label and produced a single value: the discriminator's confidence that the input image did not belong to its correct class, in other words, a measure of how wrong the discriminator was. The generator was trained, with the discriminator frozen, to maximize this value for any given image.
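One way the described "wrongness" head could be wired up in Keras is sketched below. The optimizer and the choice of a loss that pushes the wrongness output toward 1 are assumptions consistent with the generator loss behavior reported in the Results, not the paper's exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_adversarial_model(generator, discriminator):
    """Stack generator -> frozen discriminator -> 'wrongness' head for generator training.

    Assumes `generator` maps an image to a perturbed image and `discriminator`
    outputs a softmax distribution over the 10 digit classes.
    """
    discriminator.trainable = False  # only the generator's weights are updated here

    image = keras.Input(shape=(28, 28, 1))
    label_onehot = keras.Input(shape=(10,))           # correct label as a one-hot vector

    confidences = discriminator(generator(image))     # P(class | perturbed image)
    not_class = layers.Lambda(lambda p: 1.0 - p)(confidences)   # fixed "1 - value" layer
    # Select the entry for the correct class: the discriminator's "wrongness".
    wrongness = layers.Dot(axes=1)([not_class, label_onehot])

    model = keras.Model([image, label_onehot], wrongness)
    # Training the wrongness output toward a target of 1 with mean absolute error is one
    # simple stand-in; it yields a generator loss that approaches 1 when the discriminator
    # is never fooled, matching the behavior reported in the Results.
    model.compile(optimizer="adam", loss="mae")
    return model
```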
The training setup was as follows (a code sketch of this loop is given after the list):
1. Pre-train the discriminator on dataset images.
2. Freeze the discriminator and unfreeze the generator; train the generator to maximize discriminator wrongness.
3. Produce a number of perturbed images using the generator, and pair them with their original labels.
4. Freeze the generator and unfreeze the discriminator; train the discriminator on a combination of dataset images and the previously produced perturbed images.
5. Repeat steps 2-4.
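A condensed sketch of this alternating loop is shown below, assuming the build_generator() and build_adversarial_model() sketches above and a `discriminator` classifier already compiled with sparse categorical crossentropy; the number of perturbed images per cycle, the epoch counts, and the 2:1 ratio are illustrative assumptions.

```python
import numpy as np

def adversarial_training(discriminator, generator, x_train, y_train,
                         cycles=10, gen_epochs=2, disc_epochs=1, n_perturbed=10000):
    """Alternate generator and discriminator training as in steps 1-5 above (sketch)."""
    y_onehot = np.eye(10)[y_train]
    adversarial = build_adversarial_model(generator, discriminator)

    # Step 1: pre-train the discriminator on clean dataset images.
    discriminator.fit(x_train, y_train, epochs=1, verbose=0)

    for _ in range(cycles):
        # Step 2: train the generator (the discriminator is frozen inside `adversarial`).
        adversarial.fit([x_train, y_onehot], np.ones(len(x_train)),
                        epochs=gen_epochs, verbose=0)

        # Step 3: produce perturbed images and pair them with their original labels.
        idx = np.random.choice(len(x_train), n_perturbed, replace=False)
        x_perturbed = generator.predict(x_train[idx], verbose=0)

        # Step 4: train the discriminator on a mix of clean and perturbed images.
        x_mix = np.concatenate([x_train, x_perturbed])
        y_mix = np.concatenate([y_train, y_train[idx]])
        discriminator.fit(x_mix, y_mix, epochs=disc_epochs, verbose=0)
```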
Results
The adversarial effectiveness of the training architecture was measured using the trained classifier's accuracy on a set of perturbed images produced with the fast gradient sign method. In addition, the discriminator's performance on standard dataset images was measured to verify that accuracy on the original image classification task was maintained.
To achieve better convergence, several schemes were used to adjust the number of epochs for which the generator was trained relative to the discriminator in each training cycle. Initially, the generator was given a constant two training epochs per discriminator training epoch. Accuracy results are shown in Table 1.
With continued training, the generator's loss approached 1 and the discriminator's adversarial accuracy reached an upper limit, signifying that the generator was no longer able to produce useful adversarial images. Better accuracy was observed when the generator-to-discriminator training ratio was increased to 4:1, as shown in Table 2.
Further improvements were made by stepping the training ratio between 6:1, 8:1, and 12:1 during training, as shown in Table 3. Even with these adjustments, however, adversarial performance still peaked as the generator loss slowly increased towards 1, as shown in Figure 7.
To remedy this, we allowed continuous adjustment of the training ratio so as to maintain a generator loss below 0.6, with results shown in Table 4. The training ratio was observed to climb to over 700:1 to maintain generator performance, resulting in much slower training cycles but significantly improved adversarial performance. It appears likely that continued training would yield still higher discriminator accuracy, as no plateau in performance has yet been observed.
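One simple way this adaptive scheme could be implemented, sketched under the same assumptions as the loop above, is to keep training the generator within each cycle until its loss drops back below the 0.6 threshold; the hard cap on generator epochs is an added safeguard rather than something described in the text.

```python
def train_generator_until(adversarial, x_train, y_onehot, targets,
                          loss_threshold=0.6, max_epochs=1000):
    """Train the generator until its loss falls below the threshold (sketch).

    `targets` is the constant array of ones used as the wrongness target above.
    Returns the number of generator epochs used, i.e. the effective
    generator-to-discriminator training ratio for this cycle.
    """
    epochs_used = 0
    while epochs_used < max_epochs:
        history = adversarial.fit([x_train, y_onehot], targets, epochs=1, verbose=0)
        epochs_used += 1
        if history.history["loss"][-1] < loss_threshold:
            break
    return epochs_used
```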
Conclusion
The primary observation of this paper is that adversarial training of an image classifier using neural network-generated images provides a substantial improvement in adversarial robustness over a standard classifier. While this method has not matched the performance of adversarial retraining with FGSM-perturbed images, it is possible that such performance could be achieved with further tuning of the two networks' parameters, such as learning rates and training ratios. None of these models have yet been tested against other methods of producing adversarial perturbations; thus, it remains to be seen whether the FGSM-trained classifier holds an advantage in general adversarial accuracy or only against FGSM-generated adversarial examples.

Papernot et al. [4] proposed a hierarchy of adversarial tasks according to the adversary's goals and capabilities. Both the fast gradient sign method and this paper's generator require knowledge of the discriminator's network architecture and parameters to compute a gradient, but not of its training data. The fast gradient sign method falls under the goal category of simple misclassification, that is, forcing the discriminator's output for any given input into any class different from the original. In its current implementation, the generator model used in this paper also falls under this goal category. However, the generator could easily be adapted to learn more difficult goals, including targeted misclassification (forcing any given input to produce a specific output) and source/target misclassification (forcing a specific input to produce a specific output).

As the universal approximation theorem [3] guarantees the ability of deep neural networks to approximate arbitrary input-output functions, the culprit behind adversarial vulnerability is the inability of standard training methods to learn a function that accounts for adversarial examples. The results of this paper support augmenting training data with adversarial examples as a viable and effective method for producing adversarially robust models.
References
[1] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. ArXiv e-prints, June 2014.
[2] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. ArXiv e-prints, December 2014.
[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 - 366, 1989.
[4] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. Berkay Celik, and A. Swami. The Limitations of Deep Learning in Adversarial Settings. ArXiv e-prints, November 2015.
[5] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.