
Under the hood difference between Feature Map Extraction and Image Embeddings

#1
by atapixxel - opened

Hi,

Going through the code example, I wonder what the difference is between the features we get from Feature Map Extraction (which gives three [1, 1024, 16, 16] tensors) and the Image Embeddings ([1, 261, 1024]). To be specific: what are the three tensors given by Feature Map Extraction, and in what way are they different from image_embed[:, 5:, :].permute(1, 2, 0).view(1, 1024, 16, 16)?
[image: dino_example]

I want to correctly link this back to the original implementation.

PyTorch Image Models org
edited 6 days ago

@atapixxel the image_embeds is essentially the features right before the classifier (aka 'pre-logits').

At this point, it's unpooled and includes the 'prefix tokens', so any registers and/or class token are included in that 261 ==> 1 class token + 4 register tokens + 16x16 = 256 spatial tokens = 261.
The output of forward_features has the final norm layer applied.

output = model.forward_features(transforms(img).unsqueeze(0))
# output is unpooled, a (1, 261, 1024) shaped tensor
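To make the mapping back to a spatial feature map concrete, here's a minimal torch-only sketch. The 1 + 4 prefix split comes from the breakdown above; `image_embeds` is a random stand-in for real model output. Note that `reshape` is needed rather than `view`, since `permute` makes the tensor non-contiguous:

```python
import torch

# Stand-in for the (1, 261, 1024) forward_features output described above:
# 1 class token + 4 register tokens + 16*16 spatial tokens.
image_embeds = torch.randn(1, 261, 1024)

num_prefix = 1 + 4                      # class token + register tokens
spatial = image_embeds[:, num_prefix:]  # (1, 256, 1024), spatial tokens only

# NLC -> NCHW: channels first, then fold the 256 tokens back into a 16x16 grid.
fmap = spatial.permute(0, 2, 1).reshape(1, 1024, 16, 16)

print(fmap.shape)  # torch.Size([1, 1024, 16, 16])
```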

After forward_head below, it's pooled: either the class token is extracted, or the spatial tokens are average pooled (without the prefix tokens, by default).

output = model.forward_head(output, pre_logits=True)
# output is a (1, num_features) shaped tensor

The feature maps via features_only=True evolved from convnets, where we grab feature maps (spatial tokens only) from the deepest layer in each stage, a stage being delineated by the spatial pooling (size reduction) level. For EVA / ViT I adapted it to work, but it uses the last 3 block outputs, with no norm applied. You can change the specific indices being extracted by passing out_indices (a tuple/list of indices, or an int for the last N).

There's also a more advanced forward call supported in timm now called forward_intermediates ... it's related to the intermediate feature helpers in the original dinov2/v3 repos ... I got Claude to make a nice summary of it for someone else a little while back. The features_only extractor for ViTs actually uses forward_intermediates underneath, but forward_intermediates exposes extra functionality, like returning the prefix tokens, spatial vs flat sequence output, etc.

https://claude.ai/share/a351e66f-419f-4254-9540-5466508e5098
