Final Project for NYU's graduate course on Deep Reinforcement Learning

Nikhil Verma [email protected]

Advika Reddy [email protected]

Abstract

Prior work (Huang et al. (2017) [3], Kos et al. (2017) [4]) has shown that deep RL policies are vulnerable to small adversarial perturbations of their observations, similar to adversarial examples (Szegedy et al. (2013) [5]) in image classifiers. Such threat models assume that the attacker can directly modify the victim's observations, which is rarely practical in the real world. In contrast, we study attacks via an adversarial policy designed specifically for two-agent zero-sum environments: the attacker manipulates the opponent's behavior in order to make a well-trained agent fail at the game. Specifically, we explore adversarial-policy attacks in low-dimensional environments.

Background

What are Adversarial Attacks?

An adversarial attack is a method for generating adversarial examples. In a classification system, an adversarial example is an input designed to cause the model to make a mistake in its prediction. It is created by adding a carefully chosen perturbation, imperceptible to the human eye, to an input that would otherwise be classified correctly. The following image from Goodfellow et al. (2014) [1] shows a representative example.

Fig. 1: The input image $x$, when fed to a classifier, is classified as a panda with 57.7% confidence. However, when a small amount of noise is added, the resultant image is classified as a gibbon with 99.3% confidence.

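The perturbation in Fig. 1 was produced with the Fast Gradient Sign Method (FGSM). The snippet below is a minimal PyTorch sketch of FGSM; the model, input batch x, labels y, and budget epsilon are placeholders for illustration rather than the exact setup of Goodfellow et al. (2014) [1].

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.007):
    """Fast Gradient Sign Method:
    x_adv = x + epsilon * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # classification loss J(theta, x, y)
    loss.backward()                        # gradient of the loss w.r.t. the input pixels
    x_adv = x + epsilon * x.grad.sign()    # step in the direction that increases the loss
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid [0, 1] range
```

Because the step is bounded by epsilon in the L-infinity norm, the perturbed image looks unchanged to a human while the classifier's prediction flips, as in the panda/gibbon example above.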

How are they different for RL Agents?

Adversarial attacks on deep RL agents differ from those on classification systems: rather than causing a single misclassification, the adversary aims to degrade the agent's cumulative reward over a sequence of decisions.

Types of Adversarial Attacks

Adversarial attacks can be broadly divided into two types:

White-box Attacks

Here, the adversary has complete access to the victim's model, including its architecture, parameters, and policy. Most white-box attacks are pixel-based: the adversary directly perturbs the victim's observations. Other attacks target vulnerabilities of the underlying neural networks themselves.
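To illustrate the pixel-based setting, the sketch below adapts FGSM to a policy network in the spirit of Huang et al. (2017) [3]: the observation is perturbed so that the action the clean policy would have taken becomes less likely. `policy_net`, `obs`, and `epsilon` are hypothetical placeholders, not the exact attack configuration used in the cited work.

```python
import torch
import torch.nn.functional as F

def perturb_observation(policy_net, obs, epsilon=0.01):
    """FGSM-style white-box attack on an RL policy: push the observation
    in the direction that lowers the probability of the action the
    unperturbed policy would have chosen."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy_net(obs)                    # action preferences pi(. | obs), shape (N, A)
    preferred = logits.argmax(dim=-1).detach()  # action taken on the clean observation
    loss = F.cross_entropy(logits, preferred)   # treat the preferred action as the "label"
    loss.backward()
    adv_obs = obs + epsilon * obs.grad.sign()   # small L-infinity perturbation of the pixels
    return adv_obs.detach()
```

Some popular white-box attacks are listed below: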