AutoVER

Grounding Language Models for Visual Entity Recognition

Rice University, Microsoft

Abstract

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model with retrieval-augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling at queries that require visual reasoning. Without an external retriever, our method learns to distinguish similar entities within a vast label space by training contrastively on hard negative pairs in parallel with a sequence-to-sequence objective. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across dataset splits in the recently proposed Oven-Wiki benchmark, raising accuracy on the Entity seen split from 32.7% to 61.5%. It surpasses prior approaches on the unseen and query splits by a substantial double-digit margin, while preserving the ability to transfer effectively to other generic visual question answering benchmarks without further training.

Task Overview

Visual Entity Recognition can be viewed as a multimodal knowledge-grounding task: given an image-text pair, the model must predict the corresponding Wikipedia entity. To rule out trivial shortcuts, every image-text pair is annotated so that the question cannot be answered correctly without the image; a question like "What is the model of this aircraft?" is unanswerable from the text alone.
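A minimal sketch in Python of what one such example might look like; the field names and values below are illustrative assumptions, not the benchmark's exact schema.

# One Oven-Wiki-style example (hypothetical field names and values).
example = {
    "image": "query_image.jpg",                          # visual context
    "question": "What is the model of this aircraft?",   # unanswerable without the image
    "entity_id": "Q00000",                               # Wikidata-style identifier (placeholder)
    "entity_name": "Boeing 747",                         # entity name the model must produce
}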

Method

  • During training, we jointly optimize a contrastive objective and a language modeling objective, thereby augmenting the underlying multi-modal language model with fine-grained visual recognition capabilities (see the training sketch after this list).
  • At inference time, we dynamically build a prefix tree from the top-scoring retrieved entity names. The prefix tree guides decoding so that the model generates only within the set of retrieved candidates; improbable decoding paths are removed from the search (see the decoding sketch after this list).
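A minimal sketch of the joint training objective, assuming a PyTorch model that exposes a pooled multimodal query embedding and entity-name embeddings; encode_query, encode_entity, and the batching of mined hard negatives are illustrative assumptions, not the released AutoVER API.

import torch
import torch.nn.functional as F

def joint_loss(model, batch, temperature=0.07, alpha=1.0):
    # Sequence-to-sequence objective: next-token prediction of the entity name.
    lm_loss = model(pixel_values=batch["pixel_values"],
                    input_ids=batch["input_ids"],
                    labels=batch["labels"]).loss

    # In-batch contrastive objective; hard negatives (visually or textually
    # similar entities) are assumed to be mined into the same batch.
    q = F.normalize(model.encode_query(batch["pixel_values"], batch["input_ids"]), dim=-1)
    e = F.normalize(model.encode_entity(batch["entity_input_ids"]), dim=-1)
    logits = q @ e.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    contrastive_loss = F.cross_entropy(logits, targets)

    return lm_loss + alpha * contrastive_loss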
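A minimal sketch of prefix-tree-constrained decoding, using Hugging Face's prefix_allowed_tokens_fn hook in model.generate; the retrieval step, tokenizer, and prompt_len below are assumptions for exposition, not the released implementation.

class Trie:
    # Token-level prefix tree over candidate entity-name token sequences.
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def next_tokens(self, prefix):
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []   # prefix left the trie: no valid continuation
        return list(node.keys())

# Build the trie from the top-k retrieved entity names for this query;
# appending EOS lets decoding terminate exactly at a candidate name.
candidates = [tokenizer.encode(name, add_special_tokens=False) + [tokenizer.eos_token_id]
              for name in retrieved_entity_names]
trie = Trie(candidates)

def allowed(batch_id, input_ids):
    # Keep only tokens that extend some retrieved entity name; every other
    # decoding path is removed from the beam.
    generated = input_ids[prompt_len:].tolist()   # strip the prompt tokens
    return trie.next_tokens(generated)

outputs = model.generate(**inputs, prefix_allowed_tokens_fn=allowed)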

Experiments

  • AutoVER demonstrates consistent improvements across all data splits and subsets, benefiting from the reasoning and visual localization capabilities inherent in the underlying MLLM.
  • AutoVER falls short of zero-shot GPT-4V only on the Query Unseen split.

  • AutoVER adeptly captures slight variations in the query text and retrieves entirely different entity candidates, which in turn ground the generative decisions of the language model.
  • AutoVER demonstrates consistent improvements on most data splits and subsets compared with previous discriminative, generative, and zero-shot methods.

BibTeX

@inproceedings{xiao2024grounding,
  author = {Zilin Xiao and Ming Gong and Paola Cascante-Bonilla and Xingyao Zhang and Jie Wu and Vicente Ordonez},
  title  = {Grounding Language Models for Visual Entity Recognition},
  year   = {2024},
  eprint = {arXiv:2402.18695},
}