How to evaluate on downstream tasks?

In our paper, we evaluate our pretrained VirTex models on seven different downstream tasks. Our codebase supports all of these evaluations. To keep file paths uniform, the example commands throughout this documentation evaluate one specific pretrained VirTex model. Paths can be trivially adjusted for any other VirTex model; evaluating the baselines (MoCo, ImageNet-supervised, Random Init) requires additional changes to the commands, explained in the last sub-section.

As an example, consider a pretraining job for our best performing VirTex model (width_ablations/bicaptioning_R_50_L1_H2048.yaml). The serialization directory might look something like this:

    log-rank0.txt            # stdout/stderr per GPU process
    checkpoint_500000.pth    # serialized checkpoints
    events.out.* ...         # tensorboard logs

We evaluate all checkpoints on PASCAL VOC 2007 Linear Classification, and then evaluate the best checkpoint (here, it was iteration 500000) on all other downstream tasks.

PASCAL VOC 2007 Linear Classification

Evaluate a single VirTex pretrained checkpoint on the VOC 2007 trainval split:

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --down-config configs/downstream/voc07_clf.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --weight-init virtex \
    --num-gpus-per-machine 1 \
    --cpu-workers 4 \
    --serialization-dir /tmp/bicaptioning_R_50_L1_H2048

To evaluate the most recent 100 checkpoints in the sub-directory, this command can be looped over as follows:

for ((iter = 300000; iter <= 500000; iter+=2000)); do
    # Replace the echo below with the command above, using `checkpoint_$iter.pth`.
    echo "checkpoint_$iter.pth"
done

This script writes metrics to tensorboard logs in the same pretraining directory, so all VOC07 mAP curves appear together with the pretraining loss curves.
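
Before launching a long sweep, it can help to check which checkpoints are actually present in the serialization directory. The following is a minimal sketch, not part of the VirTex codebase: `list_checkpoints` is a hypothetical helper name, and it assumes checkpoints follow the `checkpoint_<iteration>.pth` naming shown above. It is written in POSIX sh so it does not depend on bash-specific loop syntax.

```shell
# Hypothetical helper (not part of VirTex): print the paths of checkpoints
# that exist in a serialization directory for an inclusive iteration range.
list_checkpoints() {
    ckpt_dir=$1; start=$2; end=$3; step=$4
    iter=$start
    while [ "$iter" -le "$end" ]; do
        path="$ckpt_dir/checkpoint_$iter.pth"
        # Skip iterations whose checkpoint was never written or was deleted.
        if [ -f "$path" ]; then
            printf '%s\n' "$path"
        fi
        iter=$((iter + step))
    done
}

# Dry run: show which checkpoints a sweep over this range would evaluate.
list_checkpoints /tmp/bicaptioning_R_50_L1_H2048 300000 500000 2000
```

Each printed path can then be passed as `--checkpoint-path` to the evaluation command. Note that the inclusive range 300000 to 500000 in steps of 2000 covers 101 iteration values.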

ImageNet Linear Classification

We train a linear classifier on 2048-dimensional global average pooled features extracted from a frozen visual backbone. Evaluate a checkpoint (for example, iteration 500000) on this task as:

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --down-config configs/downstream/imagenet_clf.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --weight-init virtex \
    --num-gpus-per-machine 8 \
    --cpu-workers 4 \
    --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/imagenet_500000 \
    --checkpoint-every 5005  # 1 epoch of ImageNet
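
The value 5005 can be reproduced as the number of iterations needed to see every training image once. The sketch below assumes a total batch size of 256 and the commonly cited ImageNet training set size of 1,281,167 images; neither number is stated in this documentation, so check them against the downstream config before relying on the result. `iters_per_epoch` is a hypothetical helper name.

```shell
# Iterations per epoch = ceil(num_images / batch_size), in integer arithmetic.
iters_per_epoch() {
    num_images=$1; batch_size=$2
    echo $(( (num_images + batch_size - 1) / batch_size ))
}

# Assumed values: 1,281,167 ImageNet training images, total batch size 256.
iters_per_epoch 1281167 256   # prints 5005
```

Under the same batch size assumption, 437,513 training images (the commonly cited size of iNaturalist 2018) gives 1710, which matches the `--checkpoint-every` value used for iNaturalist.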

Instance Segmentation (and Object Detection) on COCO

Train a Mask R-CNN with FPN backbone for COCO Instance Segmentation (and Object Detection, because it also has a box head) by initializing the backbone from VirTex pretrained weights:

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --d2-config configs/detectron2/coco_segm_default_init_2x.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --weight-init virtex \
    --num-gpus-per-machine 8 \
    --cpu-workers 2 \
    --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/coco_segm_500000 \
    --checkpoint-every 5000


  1. This script periodically serializes checkpoints but skips the validation step during training to save time. To evaluate a serialized checkpoint and write results to tensorboard, pass it as --checkpoint-path together with the additional flags --resume --eval-only.

  2. Note that --d2-config here is in Detectron2 format, and not our package Config.

These points are applicable for all tasks described below.
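
As an illustration of the first point, the sketch below prints (rather than runs) an evaluation command for the COCO example. It is an assumption-laden dry run: `EVAL_SCRIPT` and `DOWNSTREAM_CKPT` are hypothetical placeholders for the training script and for a checkpoint it serialized, and the remaining arguments are simply copied from the training command above on the assumption that they stay unchanged.

```shell
# Dry-run sketch: prints the evaluation command instead of executing it.
# EVAL_SCRIPT and DOWNSTREAM_CKPT are hypothetical placeholders; set them to
# the script used for training and to a checkpoint that script serialized.
EVAL_SCRIPT=${EVAL_SCRIPT:-"<same-script-as-training>"}
RUN_DIR=/tmp/bicaptioning_R_50_L1_H2048
DOWNSTREAM_CKPT=${DOWNSTREAM_CKPT:-"$RUN_DIR/coco_segm_500000/<serialized-checkpoint>.pth"}

print_eval_command() {
    # Training arguments, with the checkpoint path swapped and the two
    # extra flags from the note above appended.
    echo python "$EVAL_SCRIPT" \
        --config "$RUN_DIR/pretrain_config.yaml" \
        --d2-config configs/detectron2/coco_segm_default_init_2x.yaml \
        --checkpoint-path "$DOWNSTREAM_CKPT" \
        --weight-init virtex \
        --num-gpus-per-machine 8 \
        --cpu-workers 2 \
        --serialization-dir "$RUN_DIR/coco_segm_500000" \
        --checkpoint-every 5000 \
        --resume --eval-only
}

print_eval_command
```

Inspect the printed command, fill in the two placeholders, and run it once the paths look right.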

Instance Segmentation on LVIS

Train a Mask R-CNN with FPN backbone for LVIS Instance Segmentation by initializing the backbone from VirTex pretrained weights:

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --d2-config configs/detectron2/lvis_segm_default_init_2x.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --weight-init virtex \
    --num-gpus-per-machine 8 \
    --cpu-workers 2 \
    --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/lvis_segm_500000 \
    --checkpoint-every 5000

Object Detection on PASCAL VOC 2007+12

Train a Faster R-CNN with C4 backbone for PASCAL VOC 2007+12 Object Detection by initializing the backbone from VirTex pretrained weights:

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --d2-config configs/detectron2/voc_det_default_init_24k.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --weight-init virtex \
    --num-gpus-per-machine 8 \
    --cpu-workers 2 \
    --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/voc_det_500000 \
    --checkpoint-every 2500

iNaturalist 2018 Fine-Grained Classification

Fine-tune the VirTex pretrained visual backbone end-to-end on the iNaturalist 2018 dataset:

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --down-config configs/downstream/inaturalist_clf.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --weight-init virtex \
    --num-gpus-per-machine 8 \
    --cpu-workers 4 \
    --serialization-dir /tmp/bicaptioning_R_50_L1_H2048/inaturalist_500000 \
    --checkpoint-every 1710  # 1 epoch of iNaturalist

Image Captioning on COCO Captions val2017

Evaluate a pretrained VirTex model on image captioning for the COCO Captions val2017 split (reporting CIDEr and SPICE metrics):

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --calc-metrics \
    --num-gpus-per-machine 1 \
    --cpu-workers 4

Running Image Captioning Inference on Arbitrary Images

The above script can also generate captions for any images in a directory. Replace certain arguments as follows:

python scripts/ \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --data-root /path/to/images_dir \
    --output /path/to/save/predictions.json \
    --num-gpus-per-machine 1 \
    --cpu-workers 4

This script will save predictions in JSON format. Since improving image captioning is not our goal, these models may not generate the best captions.