Final Project for NYU's graduate course on Deep Reinforcement Learning

Nikhil Verma [email protected]

Advika Reddy [email protected]

Abstract

Prior work (Huang et al. (2017) [3], Kos et al. (2017) [4]) has shown that deep RL policies are vulnerable to small adversarial perturbations of their observations, similar to adversarial examples (Szegedy et al. (2013) [5]) in image classifiers. Such threat models assume that the attacker can directly modify the victim's observations, which is rarely practical in the real world. In contrast, we study attacks via an adversarial policy designed specifically for two-agent zero-sum environments: the attacker manipulates the opponent's behavior in order to make a well-trained agent fail at the game. Specifically, we explore adversarial-policy attacks in low-dimensional environments.

Background

What are Adversarial Attacks?

An adversarial attack is a method for generating adversarial examples. In a classification system, an adversarial example is an input designed to cause the model to make a mistake in its prediction. It is created by adding a carefully chosen perturbation, imperceptible to the human eye, to an input that would otherwise be classified correctly. The following image from Goodfellow et al. (2014) [1] shows a representative example.

Fig. 1: The input image $x$, when fed to a classifier, is classified as a panda with 57.7% confidence. However, when a small amount of noise is added, the resultant image is classified as a gibbon with 99.3% confidence.

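The perturbation in Fig. 1 was produced with the Fast Gradient Sign Method (FGSM). The snippet below is a minimal PyTorch sketch of FGSM; the model, input batch x, labels y, and budget epsilon are placeholders for illustration rather than the exact setup of Goodfellow et al. (2014) [1].

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.007):
    """Fast Gradient Sign Method:
    x_adv = x + epsilon * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # classification loss J(theta, x, y)
    loss.backward()                        # gradient of the loss w.r.t. the input pixels
    x_adv = x + epsilon * x.grad.sign()    # step in the direction that increases the loss
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid [0, 1] range
```

Because the step is bounded by epsilon in the L-infinity norm, the perturbed image looks unchanged to a human while the classifier's prediction flips, as in the panda/gibbon example above.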

How are they different for RL Agents?

Adversarial attacks on deep RL agents differ from those on classification systems: rather than causing a single misclassification, the adversary aims to degrade the agent's cumulative reward over a sequence of decisions.

Types of Adversarial Attacks

Adversarial attacks can be broadly divided into two types:

White-box Attacks

Here, the adversary has complete access to the victim's model, including its architecture, parameters, and policy. Most white-box attacks are pixel-based: the adversary directly perturbs the victim's observations. Other attacks target vulnerabilities of the underlying neural networks themselves.
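To illustrate the pixel-based setting, the sketch below adapts FGSM to a policy network in the spirit of Huang et al. (2017) [3]: the observation is perturbed so that the action the clean policy would have taken becomes less likely. `policy_net`, `obs`, and `epsilon` are hypothetical placeholders, not the exact attack configuration used in the cited work.

```python
import torch
import torch.nn.functional as F

def perturb_observation(policy_net, obs, epsilon=0.01):
    """FGSM-style white-box attack on an RL policy: push the observation
    in the direction that lowers the probability of the action the
    unperturbed policy would have chosen."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy_net(obs)                    # action preferences pi(. | obs), shape (N, A)
    preferred = logits.argmax(dim=-1).detach()  # action taken on the clean observation
    loss = F.cross_entropy(logits, preferred)   # treat the preferred action as the "label"
    loss.backward()
    adv_obs = obs + epsilon * obs.grad.sign()   # small L-infinity perturbation of the pixels
    return adv_obs.detach()
```

Some popular white-box attacks are listed below: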