Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals
If you’re building applications for visual shopping, image or document understanding, or chart analysis, you need a way to verify whether your model’s response is actually grounded in the source image. A text-only evaluator cannot tell you whether a caption faithfully describes an image, whether an extracted invoice total matches the document, or whether a screen summary hallucinated a button that was never on the page. Gartner predicts that by 2030, 80% of enterprise software will be multimodal, up from less than 10% in 2024. Without automated multimodal evaluation, you’re stuck between expensive human review and unreliable text-only proxies.

Today, we’re announcing four new multimodal large language model (MLLM)-as-a-Judge evaluators for image-to-text tasks in the Strands Evals software development kit (SDK): Overall Quality, Correctness, Faithfulness, and Instruction Following. Each evaluator scores image-to-text outputs against the source image. The evaluator sends the image directly to a multimodal judge model, alongside the query, the response, and (optionally) a reference answer. The judge returns a score grounded in the image, together with a reasoning string you can use for debugging.

You can use these evaluators as drop-in replacements for text-only judges in your existing Strands Evals Case → Experiment → Report workflow, and plug them into continuous integration (CI) to catch visual hallucinations, factual errors, and instruction violations automatically.

In this post, you will learn how to:

- Set up the four multimodal evaluators and run them on an image-to-text task.
- Switch between reference-based and reference-free evaluation with the same evaluator.
- Write a custom multimodal rubric for domain-specific criteria.
- Choose a judge model on Amazon Bedrock that balances accuracy, cost, and latency.
- Apply prompt-design choices that improved judge-to-human alignment in our experiments.

Figure 1: Overview of the multimodal judge framework. Given an image (or document image), a textual query, and a model-generated response, the framework constructs a multimodal evaluation prompt, applies an MLLM-based judge, and returns a score (Likert 1-5 or binary) along with reasoning. The framework supports both reference-based and reference-free evaluation, and integrates with Strands Evals for case management and reporting.

Prerequisites

To follow the walkthrough in this post, you need:

- Python 3.10 or later installed in your environment.
- pip install strands-agents-evals for the evaluators, and pip install strands-agents for the target agent used in the walkthrough.
- An AWS account with access to Amazon Bedrock.
- AWS credentials configured locally (for example, via aws configure or an AWS Identity and Access Management (AWS IAM) role) with the Amazon Bedrock InvokeModel permission for the judge model.
- Familiarity with the Strands Evals Case → Experiment → Report workflow. If you are new to Strands Evals, see the Strands Evals launch blog post for a quick tour.

Why text-only judges miss image-grounded failures

Suppose you’ve shipped a model that reads invoices, summarizes dashboards, or narrates screenshots. Running a text-only LLM-as-a-Judge over the response gets you some signal (the writing is fluent, the structure is clean), but it misses exactly the failures that matter:…

