Authors used arbitrary resolution (e.g. 96x96) when the input image size is not static but variable. But they used largest input resolution (256x256) when it's possible.
Smaller image size
Smaller image size yields more semantic details rather than global context of an image