Polos: Multimodal Metric Learning
from Human Feedback for Image Captioning

Anonymous submission

Abstract
Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partially because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, comprising 131K human judgments from 550 evaluators, making it approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.

Fig 1. Our supervised metric Polos computes evaluation scores from multimodal inputs by integrating human feedback within the novel framework $\mathrm{M^2LHF}$. Polos is capable of modeling intricate relationships within the vector space of text-image pairs as well as text-text pairs, thereby effectively evaluating the depicted samples.

Overview

Fig 2. Overview of the proposed metric. In alignment with the principles of $\mathrm{M^2LHF}$, Polos computes the evaluation $\hat{y}$ based on multimodal inputs and regresses the human evaluation. The proposed metric extracts effective features for caption evaluation using the difference and Hadamard product of features derived from both CLIP and RoBERTa.
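The fusion step described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: it assumes precomputed CLIP/RoBERTa embeddings are given as tensors, and the class name, dimensions, and regressor head are hypothetical. The key idea shown is combining two embeddings via their difference and Hadamard (element-wise) product before regressing a score:

```python
import torch
import torch.nn as nn

class ParallelFeatureFusion(nn.Module):
    """Sketch of difference/Hadamard-product feature fusion.

    Takes a pair of precomputed embeddings (e.g., candidate caption vs.
    reference caption, or caption vs. image) and regresses a scalar score.
    Dimensions and the MLP head are illustrative assumptions.
    """

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        # Fused vector: [a, b, |a - b|, a * b] -> 4 * dim features
        self.regressor = nn.Sequential(
            nn.Linear(4 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # assumes human judgments normalized to [0, 1]
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Difference captures disagreement; Hadamard product captures agreement.
        fused = torch.cat([a, b, torch.abs(a - b), a * b], dim=-1)
        return self.regressor(fused).squeeze(-1)

# Toy usage with random stand-ins for CLIP/RoBERTa embeddings.
cand = torch.randn(2, 512)  # candidate-caption embeddings (batch of 2)
ref = torch.randn(2, 512)   # reference embeddings
score = ParallelFeatureFusion()(cand, ref)
```

Training such a head end-to-end against collected human judgments (here, the Polaris dataset) is what distinguishes a supervised metric from similarity-only metrics computed directly on frozen embeddings.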

Results Overview

Table 1. Correlation coefficients between various metrics and human judgments. The symbol `--' indicates non-executable code or unavailable data. Bold font indicates the highest recorded value, and underlining indicates the second-highest value.

Table 2. PASCAL-50S accuracy results (five references) and FOIL hallucination pairwise detection accuracy results.

BibTeX
To appear.