Omni-modal models, which take multimodal input and produce multimodal output, are emerging. However, benchmarking their multimodal generation, especially image generation, is challenging due to the subtleties of human preferences and model biases. Many image generation benchmarks focus on aesthetics rather than fine-grained generation capability, and therefore fail to evaluate these models' visual intelligence with objective metrics. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision.
With our benchmark and experiments, we find that the latest Gemini 3 Pro Image exhibits emergent image generation capabilities: it generates semantic masks with high fidelity in a zero-shot setting, showcasing previously unseen visual intelligence and true generalization to new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases.
Since the release of GPT-4o, omni-modal models (OMMs), which support multiple input and output modalities, have been a focus of research. While much attention has been paid to aesthetics, few works have quantitatively examined the precision and generalizability of their image generation capabilities.
In PixelArena, we propose using pixel-level tasks, specifically semantic segmentation, to examine OMMs' fine-grained control capability, which we term Pixel-Precision Visual Intelligence (PPVI).
A benchmark that uses semantic segmentation to measure fine-grained generative control.
A demonstration of surprising zero-shot capabilities in Gemini 3 Pro Image.
An in-depth qualitative and quantitative analysis of failure modes.
We use the COCO and CelebAMask-HQ datasets, randomly sampling 150 images and their corresponding masks from each.
We also compare against specialized segmentation models: SegFace, OneFormer, and SAM 3.
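The sampling step is straightforward; below is a minimal sketch, assuming a flat directory of images with same-named mask files (the paths and layout here are our assumptions, not the datasets' actual structure).

```python
import random
from pathlib import Path

def sample_pairs(image_dir, mask_dir, n=150, seed=0):
    """Randomly sample n (image, mask) pairs; matching filenames are assumed."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    chosen = random.Random(seed).sample(images, n)
    return [(img, Path(mask_dir) / f"{img.stem}.png") for img in chosen]

# Hypothetical layout; the real dataset directories may differ:
# pairs = sample_pairs("CelebAMask-HQ/images", "CelebAMask-HQ/masks")
```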
Prompt (CelebAMask-HQ):
I want you to do semantic segmentation based on facial features. The label encodings are ``` background : [0, 0, 0] ... ``` For your convenience, I've also given you a color palette (the second image) for the label encodings. Please draw a colorful mask, given the photo (the first image), the color palette and the label encodings.
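The palette image handed to the model can be rendered directly from the label encodings. Below is a minimal sketch; only `background` is taken from the prompt above, and the other entries are illustrative placeholders since the full encoding list is truncated there.

```python
import numpy as np
from PIL import Image

# Truncated mapping mirroring the prompt; `background` comes from the prompt,
# the remaining entries are illustrative placeholders.
LABEL_COLORS = {
    "background": (0, 0, 0),
    "skin":       (204, 0, 0),
    "hair":       (0, 0, 204),
}

def render_palette(label_colors, swatch=32):
    """Stack one solid color swatch per class into a single palette image."""
    rows = [np.full((swatch, swatch * 4, 3), color, dtype=np.uint8)
            for color in label_colors.values()]
    return Image.fromarray(np.vstack(rows))

# render_palette(LABEL_COLORS).save("palette.png")
```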

Standard color encodings for CelebAMask-HQ
Gemini 3 Pro Image is the only OMM that understands the task requirements and completes the task with high quality; the others either fail to understand the task or lack precise control.


Gemini 3 Pro Best Prediction

Gemini 3 Pro Worst Prediction

To test for memorization, we shuffled the color encodings. Surprisingly, performance increased by ~10%.
Conclusion: the model truly understands the task and is not merely retrieving memorized masks.
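A minimal sketch of the shuffle control described above (the exact permutation protocol is an assumption): permute the colors across classes, regenerate the palette and prompt, and apply the same permutation to the ground-truth masks before scoring.

```python
import random

def shuffle_encodings(label_colors, seed=0):
    """Return a label->color mapping with the colors permuted across classes."""
    labels, colors = list(label_colors), list(label_colors.values())
    random.Random(seed).shuffle(colors)
    return dict(zip(labels, colors))

# shuffled = shuffle_encodings(LABEL_COLORS)  # mapping from the palette sketch above
# Ground-truth masks must be recolored with the same permutation before evaluation.
```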
With 144 classes, COCO is significantly more challenging; most models failed to generate valid masks.
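What counts as a "valid mask" is a judgment call; one simple criterion (our assumption, not the benchmark's exact rule) is the fraction of generated pixels that land near any palette color.

```python
import numpy as np

def valid_fraction(pred_rgb, palette, tol=20.0):
    """Fraction of pixels within Euclidean distance `tol` of some palette color.

    pred_rgb: (H, W, 3) uint8 generated mask; palette: (K, 3) class colors.
    """
    px = pred_rgb.reshape(-1, 1, 3).astype(np.float32)
    pal = np.asarray(palette, dtype=np.float32).reshape(1, -1, 3)
    nearest = np.linalg.norm(px - pal, axis=-1).min(axis=1)  # nearest color per pixel
    return float((nearest <= tol).mean())

# e.g., reject a prediction as invalid if valid_fraction(...) < 0.9
```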



PixelArena can be easily extended to other segmentation datasets.
OMM results are good enough to serve as annotation drafts for new datasets.
Under-specification in prompts (e.g., whether an eye label means the eyeball or the whole periorbital region) affects results.
Score discrepancies do not always track visual similarity; better metrics are needed.
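For context, a standard scoring recipe, sketched below under our assumptions rather than as the exact implementation, snaps each generated pixel to its nearest palette color and computes mean IoU over classes.

```python
import numpy as np

def rgb_to_labels(mask_rgb, palette):
    """Assign each pixel the index of its nearest palette color."""
    px = mask_rgb.reshape(-1, 1, 3).astype(np.float32)
    pal = np.asarray(palette, dtype=np.float32).reshape(1, -1, 3)
    idx = np.linalg.norm(px - pal, axis=-1).argmin(axis=1)
    return idx.reshape(mask_rgb.shape[:2])

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that appear in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        union = np.logical_or(pred == c, gt == c).sum()
        if union:
            inter = np.logical_and(pred == c, gt == c).sum()
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```

Under such a metric, a boundary shifted by a few pixels can sharply lower the IoU of thin classes (e.g., eyebrows or lips) while the mask remains visually faithful, which is one plausible source of the discrepancy noted above.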
We present PixelArena and show that Gemini 3 Pro Image represents a major breakthrough in pixel-precision visual intelligence. Our findings on zero-shot capabilities, failure modes, and data contamination provide a foundation for future research on OMMs.