Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

Overview

Ultra-high-resolution semantic segmentation faces a core trade-off:

Local crops preserve fine details but lose global context.
Global views preserve context but miss small structures.

This work introduces Relay Tokens, a lightweight mechanism that couples local and global transformer branches so that the two context scales inform each other. Instead of choosing between detail and context, the model processes both views jointly and exchanges information between them throughout the network.

Figure 1: Context and Details

Method at a Glance

Build local and global token streams from the same image.
Introduce a compact set of relay tokens.
Use relay tokens to aggregate cross-scale information and propagate it back.
Decode enriched features for segmentation.
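
The first step above can be illustrated with a minimal sketch. It assumes the local view is a full-resolution crop and the global view is an average-pooled copy of the whole image; the function names, the pooling choice, and the patch size are all hypothetical and are not taken from the paper.

```python
def average_pool(image, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks."""
    height, width = len(image), len(image[0])
    pooled = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            block = [image[top + i][left + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def crop(image, top, left, size):
    """Cut a size x size full-resolution window out of the image."""
    return [row[left:left + size] for row in image[top:top + size]]


def patchify(view, patch):
    """Flatten non-overlapping patch x patch squares into one token each."""
    tokens = []
    for top in range(0, len(view), patch):
        for left in range(0, len(view[0]), patch):
            tokens.append([view[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def build_streams(image, top, left, size, factor, patch):
    """Return (local_tokens, global_tokens) built from the same image."""
    local_view = crop(image, top, left, size)   # fine detail, narrow context
    global_view = average_pool(image, factor)   # coarse detail, wide context
    return patchify(local_view, patch), patchify(global_view, patch)


# Toy 16 x 16 single-channel image whose pixel value encodes its position.
image = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
local_tokens, global_tokens = build_streams(
    image, top=4, left=8, size=4, factor=4, patch=2)
```

In this toy setting both streams contain four tokens of four values each, but the local tokens cover a 4 x 4 pixel window while the global tokens summarize the entire image.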

Figure 2: ViT with Cross-Resolution Relay Tokens

In each transformer block, relay tokens are first updated with the global stream, then passed to the local stream, then forwarded to the next block. This sequential cross-resolution update gives local predictions access to global semantics while preserving high-resolution precision.
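
The sequential update described above might be sketched as follows. This is a toy illustration under strong assumptions: relay tokens are simply concatenated with each stream, attention is single-head with identity projections, and there is no normalization or MLP. The names `attend`, `self_attention_block`, and `relay_block` are hypothetical.

```python
import math


def attend(queries, keys_values):
    """Single-head dot-product attention with identity projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, keys_values)) / total
                        for d in range(dim)])
    return outputs


def self_attention_block(tokens):
    """Residual self-attention over a single token sequence."""
    attended = attend(tokens, tokens)
    return [[t + a for t, a in zip(tok, att)]
            for tok, att in zip(tokens, attended)]


def relay_block(global_tokens, local_tokens, relay_tokens):
    """One block: relay tokens visit the global stream, then the local stream."""
    n_relay = len(relay_tokens)
    # Step 1: relay tokens are updated together with the global stream.
    out = self_attention_block(relay_tokens + global_tokens)
    relay_tokens, global_tokens = out[:n_relay], out[n_relay:]
    # Step 2: the updated relay tokens are passed to the local stream.
    out = self_attention_block(relay_tokens + local_tokens)
    relay_tokens, local_tokens = out[:n_relay], out[n_relay:]
    # Step 3: everything is forwarded to the next block.
    return global_tokens, local_tokens, relay_tokens


global_tokens = [[1.0, 0.0], [0.0, 1.0]]
local_tokens = [[0.5, 0.5], [1.0, -1.0]]
relay_tokens = [[0.1, 0.2]]
for _ in range(2):  # two stacked blocks
    global_tokens, local_tokens, relay_tokens = relay_block(
        global_tokens, local_tokens, relay_tokens)
```

Local and global patch tokens never attend to each other directly in this sketch; the relay tokens are the only path by which global content reaches the local stream, which keeps each attention call close to the cost of a single-scale one.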

Training supervises both scales and combines three loss terms:

a global supervision term on the global branch output
a local supervision term on the local branch output
a cross-resolution consistency term to align predictions between scales

Together, these terms stabilize optimization and encourage agreement between coarse context and fine-detail predictions.
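
As a rough sketch of how the three terms could be combined: the exact form of each term is not specified here, so the code below assumes pixel-wise cross-entropy for the two supervision terms and a mean squared difference between class probabilities for the consistency term. It also assumes the global prediction has already been resampled onto the local crop (`global_logits_on_crop`), and `consistency_weight` is a hypothetical hyperparameter.

```python
import math


def softmax(logits):
    """Convert one pixel's class scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, labels):
    """Mean negative log-likelihood over pixels."""
    losses = [-math.log(softmax(pixel)[label])
              for pixel, label in zip(logits, labels)]
    return sum(losses) / len(losses)


def consistency(local_logits, global_logits_on_crop):
    """Mean squared difference between class probabilities (assumed form)."""
    total, count = 0.0, 0
    for local_pixel, global_pixel in zip(local_logits, global_logits_on_crop):
        for p, q in zip(softmax(local_pixel), softmax(global_pixel)):
            total += (p - q) ** 2
            count += 1
    return total / count


def total_loss(global_logits, global_labels, local_logits, local_labels,
               global_logits_on_crop, consistency_weight=1.0):
    """Global supervision + local supervision + weighted consistency."""
    return (cross_entropy(global_logits, global_labels)
            + cross_entropy(local_logits, local_labels)
            + consistency_weight * consistency(local_logits, global_logits_on_crop))


# One pixel, two classes, uninformative predictions at both scales.
loss = total_loss(
    global_logits=[[0.0, 0.0]], global_labels=[0],
    local_logits=[[0.0, 0.0]], local_labels=[1],
    global_logits_on_crop=[[0.0, 0.0]])
```

When the two scales already agree, the consistency term vanishes and only the two supervision terms contribute.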

Experimental Results

The method is validated on multiple datasets, including:

Archaeoscape (archaeological mapping from airborne laser scanning)
URUR (ultra-high-resolution remote sensing)
Gleason (histopathology tissue segmentation)
Cityscapes (urban street-scene segmentation)

Across these diverse domains, Relay Tokens yields consistent gains, with a relative mIoU improvement of up to 15%. The same design transfers across very different visual statistics, from natural street scenes to medical imagery and geospatial data.
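
To make the metric concrete, the sketch below computes mIoU from a confusion matrix and distinguishes a relative improvement from an absolute one. The confusion matrix and the baseline and improved scores are invented for illustration only and are not results from the paper.

```python
def mean_iou(confusion):
    """mIoU from a square confusion matrix (rows: ground truth, cols: prediction)."""
    n_classes = len(confusion)
    ious = []
    for c in range(n_classes):
        true_pos = confusion[c][c]
        false_pos = sum(confusion[r][c] for r in range(n_classes)) - true_pos
        false_neg = sum(confusion[c]) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(true_pos / denom)
    return sum(ious) / len(ious)


def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline score."""
    return (improved - baseline) / baseline


# Hypothetical two-class confusion matrix.
toy_miou = mean_iou([[3, 1], [1, 3]])

# Hypothetical scores: 0.40 -> 0.46 is +6 points absolute, +15% relative.
gain = relative_improvement(0.40, 0.46)
```

A relative figure depends on the baseline: the same absolute gain counts for more on a hard dataset with a low baseline mIoU than on one where the baseline is already high.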

Figure 4: Qualitative Results

To analyze how relay tokens operate, we inspect average relay-to-patch attention over both scales.

Figure 8: Relay Token Attention Maps
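
One way such an average could be computed is sketched below. It assumes that each attention matrix is row-normalized, that the first `n_relay` rows and columns belong to relay tokens, and that the remaining columns are patch tokens in row-major grid order. This token layout and the function name are assumptions, not details confirmed by the paper.

```python
def relay_to_patch_map(attentions, n_relay, grid_h, grid_w):
    """Average the attention that relay queries pay to each patch key.

    attentions: list of [tokens x tokens] matrices, e.g. one per layer or head.
    Returns a grid_h x grid_w map of mean attention weights.
    """
    n_patches = grid_h * grid_w
    totals = [0.0] * n_patches
    count = 0
    for attn in attentions:
        for row in attn[:n_relay]:  # rows whose query is a relay token
            for p in range(n_patches):
                totals[p] += row[n_relay + p]  # skip the relay key columns
            count += 1
    mean = [t / count for t in totals]
    return [mean[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy example: one relay token and a 2 x 2 patch grid (5 tokens in total).
uniform = [0.2] * 5
layer_a = [[0.2, 0.1, 0.2, 0.3, 0.2]] + [uniform] * 4
layer_b = [[0.0, 0.3, 0.2, 0.1, 0.4]] + [uniform] * 4
attention_map = relay_to_patch_map([layer_a, layer_b], n_relay=1, grid_h=2, grid_w=2)
```

Running the same function on the attention matrices of the global and local streams gives one map per scale, which can then be overlaid on the corresponding view.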

Practical Notes

The module is easy to integrate in existing transformer segmentation pipelines.
The module adds fewer than 2% extra parameters, so the model-size overhead is small.
Because both the local and global branches are processed, compute is higher than for single-scale baselines; this extra cost is the price of the improved segmentation quality.

Citation

@article{perron2026relaytokens,
  title   = {Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens},
  author  = {Perron, Yohann and Sydorov, Vladyslav and Pottier, Christophe and Landrieu, Loic},
  journal = {Transactions on Machine Learning Research},
  year    = {2026}
}