How to evaluate on downstream tasks?
====================================

In our paper, we evaluate our pretrained VirTex models on seven different
downstream tasks. Our codebase supports all of these evaluations.

Throughout this documentation, we consider a specific example of our VirTex
pretrained model being evaluated, to ensure filepath uniformity in the
following example command snippets. Paths can be trivially adjusted for any
other VirTex model; evaluating the baselines (MoCo, ImageNet-supervised,
Random Init) requires additional changes in commands, which are explained in
the last sub-section.

As an example, consider a pretraining job for our best performing VirTex model
(``width_ablations/bicaptioning_R_50_L1_H2048.yaml``). The serialization
directory might look something like this:

.. code-block:: text

    /tmp/bicaptioning_R_50_L1_H2048
        pretrain_config.yaml
        log-rank0.txt            # stdout/stderr per GPU process
        log-rank1.txt
        ...
        log-rank7.txt
        checkpoint_2000.pth
        checkpoint_4000.pth
        ...
        checkpoint_498000.pth
        checkpoint_500000.pth    # serialized checkpoints
        train_captioning_forward/
            events.out.* ...     # tensorboard logs
        ...

We evaluate all checkpoints on **PASCAL VOC 2007 Linear Classification**, and
then evaluate the best checkpoint (here, iteration 500000) on all other
downstream tasks.


PASCAL VOC 2007 Linear Classification
-------------------------------------

Evaluate a single VirTex pretrained checkpoint on the VOC 2007 ``trainval``
split:

.. code-block:: shell

    python scripts/clf_voc07.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --down-config configs/downstream/voc07_clf.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --weight-init virtex \
        --num-gpus-per-machine 1 \
        --cpu-workers 4 \
        --serialization-dir /tmp/bicaptioning_R_50_L1_H2048

To evaluate the 100 most recent checkpoints in the sub-directory, this command
can be looped over as follows (a complete version is sketched below):

.. code-block:: shell

    for ((iter = 300000; iter <= 500000; iter+=2000)); do
        # add command with `checkpoint_$iter.pth`
    done

This script writes metrics to tensorboard logs in the same pretraining
directory; all VOC07 mAP curves appear together with the pretraining loss
curves.
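
A complete version of this loop might look like the following sketch, reusing
the single-checkpoint command above with only the checkpoint path varying
(adjust the iteration range to the checkpoints you actually have):

.. code-block:: shell

    for ((iter = 300000; iter <= 500000; iter+=2000)); do
        # Same command as above, evaluating one checkpoint per iteration.
        python scripts/clf_voc07.py \
            --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
            --down-config configs/downstream/voc07_clf.yaml \
            --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_$iter.pth \
            --weight-init virtex \
            --num-gpus-per-machine 1 \
            --cpu-workers 4 \
            --serialization-dir /tmp/bicaptioning_R_50_L1_H2048
    done

The resulting VOC07 mAP curves can then be browsed alongside the pretraining
loss curves with, for example,
``tensorboard --logdir /tmp/bicaptioning_R_50_L1_H2048``.
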

-------------------------------------------------------------------------------


ImageNet Linear Classification
------------------------------

We train a linear classifier on 2048-dimensional global average pooled
features extracted from a frozen visual backbone. Evaluate a checkpoint
(for example, iteration 500000) on this task as:

.. code-block:: shell

    python scripts/clf_linear.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --down-config configs/downstream/imagenet_clf.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --weight-init virtex \
        --num-gpus-per-machine 8 \
        --cpu-workers 4 \
        --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/imagenet_500000 \
        --checkpoint-every 5005  # 1 epoch of ImageNet

-------------------------------------------------------------------------------


Instance Segmentation (and Object Detection) on COCO
----------------------------------------------------

Train a Mask R-CNN with FPN backbone for COCO Instance Segmentation (and
Object Detection, because it also has a box head) by initializing the
backbone from VirTex pretrained weights:

.. code-block:: shell

    python scripts/eval_detectron2.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --d2-config configs/detectron2/coco_segm_default_init_2x.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --weight-init virtex \
        --num-gpus-per-machine 8 \
        --cpu-workers 2 \
        --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/coco_segm_500000 \
        --checkpoint-every 5000

.. note::

    1. This script periodically serializes checkpoints, but skips the
       validation step during training to save time. To evaluate a serialized
       checkpoint and write results to tensorboard, provide it as
       ``--checkpoint-path`` along with the additional flags
       ``--resume --eval-only`` (as sketched below).

    2. Note that ``--d2-config`` here is in Detectron2 format, and not our
       package's :class:`~virtex.config.Config`.

    These points are applicable for all tasks described below.
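
As a concrete example of point 1, a checkpoint serialized by the COCO job
above could be re-evaluated roughly as follows. This is a sketch: the
checkpoint filename is a placeholder (assumed to live inside the job's
serialization directory), and the remaining flags are simply kept unchanged
from the training command.

.. code-block:: shell

    # "checkpoint_ITERATION.pth" is a placeholder; substitute an actual
    # checkpoint file found in the COCO job's serialization directory.
    python scripts/eval_detectron2.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --d2-config configs/detectron2/coco_segm_default_init_2x.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/coco_segm_500000/checkpoint_ITERATION.pth \
        --weight-init virtex \
        --num-gpus-per-machine 8 \
        --cpu-workers 2 \
        --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/coco_segm_500000 \
        --resume --eval-only
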

-------------------------------------------------------------------------------


Instance Segmentation on LVIS
-----------------------------

Train a Mask R-CNN with FPN backbone for LVIS Instance Segmentation by
initializing the backbone from VirTex pretrained weights:

.. code-block:: shell

    python scripts/eval_detectron2.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --d2-config configs/detectron2/lvis_segm_default_init_2x.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --weight-init virtex \
        --num-gpus-per-machine 8 \
        --cpu-workers 2 \
        --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/lvis_segm_500000 \
        --checkpoint-every 5000

-------------------------------------------------------------------------------


Object Detection on PASCAL VOC 2007+12
--------------------------------------

Train a Faster R-CNN with C4 backbone for PASCAL VOC 2007+12 Object Detection
by initializing the backbone from VirTex pretrained weights:

.. code-block:: shell

    python scripts/eval_detectron2.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --d2-config configs/detectron2/voc_det_default_init_24k.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --weight-init virtex \
        --num-gpus-per-machine 8 \
        --cpu-workers 2 \
        --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/voc_det_500000 \
        --checkpoint-every 2500

-------------------------------------------------------------------------------


iNaturalist 2018 Fine-Grained Classification
--------------------------------------------

Fine-tune the VirTex pretrained visual backbone end-to-end on the iNaturalist
2018 dataset:

.. code-block:: shell

    python scripts/clf_linear.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --down-config configs/downstream/inaturalist_clf.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --weight-init virtex \
        --num-gpus-per-machine 8 \
        --cpu-workers 4 \
        --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/inaturalist_500000 \
        --checkpoint-every 1710  # 1 epoch of iNaturalist

-------------------------------------------------------------------------------


Image Captioning on COCO Captions val2017
-----------------------------------------

Evaluate a pretrained VirTex model on image captioning for the COCO Captions
val2017 split (reporting CIDEr and SPICE metrics):

.. code-block:: shell

    python scripts/eval_captioning.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --calc-metrics \
        --num-gpus-per-machine 1 \
        --cpu-workers 4

-------------------------------------------------------------------------------


Running Image Captioning Inference on Arbitrary Images
------------------------------------------------------

The above script can also be used to generate captions for any images in a
directory. Replace certain arguments as follows:

.. code-block:: shell

    python scripts/eval_captioning.py \
        --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
        --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
        --data-root /path/to/images_dir \
        --output /path/to/save/predictions.json \
        --num-gpus-per-machine 1 \
        --cpu-workers 4

This script will save predictions in JSON format. Since improving image
captioning is not our goal, these models may not generate the best captions.
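
To quickly sanity-check the saved predictions, the JSON file can be
pretty-printed from the shell. This is only a convenience sketch; the exact
fields depend on the script's output format.

.. code-block:: shell

    # Pretty-print the predictions JSON and show the first few entries.
    python -m json.tool /path/to/save/predictions.json | head -n 20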