Bridges text-only models such as DeepSeek to a vision model, exposing MCP tools that describe images, extract text from images, and compare images.
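
As a rough illustration of the bridging idea, the sketch below routes a tool call from a text-only model to a vision backend and returns plain text. It is a minimal sketch, not this project's implementation: the tool names (`describe_image`, `extract_text`, `compare_images`), the prompts, `handle_tool_call`, and `fake_backend` are all hypothetical, and no real MCP SDK or vision API is used.

```python
from typing import Callable, Dict, List

# Hypothetical tool names and prompts; the real project's names may differ.
TOOL_PROMPTS: Dict[str, str] = {
    "describe_image": "Describe this image in detail.",
    "extract_text": "Transcribe all text visible in this image.",
    "compare_images": "Compare these images and list the differences.",
}

# A vision backend takes a prompt plus image paths and returns plain text.
VisionBackend = Callable[[str, List[str]], str]


def handle_tool_call(tool: str, image_paths: List[str], backend: VisionBackend) -> str:
    """Route one tool call to the vision backend and return its text answer."""
    if tool not in TOOL_PROMPTS:
        raise ValueError(f"unknown tool: {tool}")
    if tool == "compare_images" and len(image_paths) < 2:
        raise ValueError("compare_images needs at least two images")
    if not image_paths:
        raise ValueError("at least one image path is required")
    return backend(TOOL_PROMPTS[tool], image_paths)


def fake_backend(prompt: str, image_paths: List[str]) -> str:
    """Stand-in for a real vision model call; echoes what it was asked."""
    return f"[{len(image_paths)} image(s)] {prompt}"


if __name__ == "__main__":
    # The text-only model only ever sees the returned string.
    print(handle_tool_call("extract_text", ["receipt.png"], fake_backend))
```

In a real deployment, `fake_backend` would be replaced by a call to whichever vision model is configured, and `handle_tool_call` would be registered as the handler behind each MCP tool.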