PixelArena

A Benchmark for Pixel-Precision Visual Intelligence in Omni-Modal Models


Abstract

Omni-modal models that accept and produce multiple modalities are emerging. However, benchmarking their multimodal generation, especially image generation, is challenging due to the subtleties of human preferences and model biases. Many image generation benchmarks focus on aesthetics rather than the fine-grained generation capabilities of these models, and thus fail to evaluate their visual intelligence with objective metrics. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence at pixel precision.

With our benchmark and experiments, we find that the latest Gemini 3 Pro Image exhibits emergent image generation capabilities, producing semantic masks with high fidelity in a zero-shot setting and showcasing previously unseen visual intelligence and true generalization to new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases.

Introduction

Since the release of GPT-4o, omni-modal models (OMMs), which support multiple input and output modalities, have been a focus of research. While much attention has been paid to aesthetics, few works have quantitatively examined the precision and generalizability of their image generation capabilities.

In PixelArena, we propose using pixel-level tasks—specifically semantic segmentation tasks—to examine OMMs' fine-grained control capability, which we term Pixel-Precision Visual Intelligence (PPVI).

PixelArena Benchmark

A benchmark using semantic segmentation to measure fine-grained control.

Emergent Zero-Shot

Revealing surprising zero-shot capabilities in Gemini 3 Pro Image.

Failure Analysis

In-depth qualitative and quantitative analysis of failure modes.

Methodology

1. Datasets

We use the COCO and CelebAMask-HQ datasets, randomly sampling 150 images and their corresponding masks from each.

COCO (150 images) · CelebAMask-HQ (150 images) · 1024×1024 resolution
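The sampling step above can be sketched as follows. This is a minimal sketch; the `sample_pairs` helper and the image/mask filename convention are hypothetical, not taken from the benchmark:

```python
import random

def sample_pairs(image_ids, k=150, seed=0):
    """Draw k image/mask pairs without replacement, assuming each image
    id maps to a same-stem mask file (hypothetical naming convention)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    chosen = rng.sample(image_ids, k)
    return [(f"{i}.jpg", f"{i}_mask.png") for i in chosen]
```

Seeding the RNG keeps the 150-image subset reproducible across runs, which matters when comparing multiple models on the same samples.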

2. Models Tested

Gemini 3 Pro Image
Gemini 2.5 Flash
GPT Image 1
Emu 3.5
Uni-MoE-2

Also compared with specialized models: SegFace, OneFormer, SAM 3.

Prompt Template (CelebAMask-HQ)

I want you to do semantic segmentation based on facial features. 
The label encodings are

```
background : [0, 0, 0]
...
```

For your convenience, I've also given you a color palette 
(the second image) for the label encodings.

Please draw a colorful mask, given the photo (the first image), 
the color palette and the label encodings.

Color Palette

Standard color encodings for CelebAMask-HQ
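A palette image like the one supplied in the prompt can be rendered directly from the label encodings. The sketch below uses NumPy; only `background : [0, 0, 0]` comes from the prompt above, and the remaining colors are placeholders for illustration, not the real CelebAMask-HQ encodings:

```python
import numpy as np

# Only `background` comes from the prompt template; the other colors
# here are invented placeholders, not the real CelebAMask-HQ encodings.
LABELS = {
    "background": (0, 0, 0),
    "skin":       (204, 0, 0),
    "hair":       (0, 153, 76),
}

def palette_image(labels, swatch=32):
    """Stack one solid color bar per label into a single RGB array."""
    rows = [np.full((swatch, 4 * swatch, 3), color, dtype=np.uint8)
            for color in labels.values()]
    return np.concatenate(rows, axis=0)
```

The resulting array can be saved as a PNG and attached as the second image in the prompt.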

CelebAMask-HQ Results

Gemini 3 Pro Image is the only OMM that understands the task requirements and completes it with high quality. Others either fail to understand the task or lack precise control.

Visual Comparison on CelebAMask-HQ
Visual comparison of different OMMs on CelebAMask-HQ face parsing.

Best Result (F1: 0.708)

Gemini 3 Pro best prediction

Worst Result (F1: 0.081)

Gemini 3 Pro worst prediction

Quantitative Results (F1)

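A pixel-level F1 of the kind reported in this section can be computed per class and then averaged. The benchmark's exact aggregation is not specified here, so the macro-average over classes present in either mask is an assumption of this sketch:

```python
import numpy as np

def pixel_f1(pred, gt, num_classes):
    """Macro-averaged per-class pixel F1 between two integer class maps."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # correctly labeled pixels
        fp = np.sum((pred == c) & (gt != c))   # over-segmented pixels
        fn = np.sum((pred != c) & (gt == c))   # missed pixels
        if tp + fp + fn == 0:                  # class absent in both; skip
            continue
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))
```

Before scoring, each RGB mask would need to be mapped back to integer class ids via the label encodings, typically by nearest-color matching since generated masks rarely reproduce the palette exactly.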

Data Contamination Check

We shuffled the color encodings to test for memorization. Surprisingly, performance increased by ~10%.

Conclusion:

The model truly understands the task and is not just retrieving memorized masks.
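The contamination check can be implemented by permuting the label-to-color mapping before building the prompt; if the model were merely retrieving memorized masks, the shuffled palette would break its output. A minimal sketch (the helper name is assumed):

```python
import random

def shuffle_encoding(label_colors, seed=0):
    """Return a new label->color dict with the colors randomly permuted,
    so a memorized palette no longer matches the requested one."""
    rng = random.Random(seed)
    colors = list(label_colors.values())
    rng.shuffle(colors)
    return dict(zip(label_colors, colors))
```

The shuffled mapping is then substituted into the prompt template and palette image in place of the standard encodings.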

COCO Results

COCO is significantly more challenging, with 144 classes. Most models failed to generate valid masks.

COCO Comparison
OneFormer (Specialized) vs Gemini 3 Pro vs Gemini 2.5 Flash on COCO.

Best Prediction (F1: 0.269)


Worst Prediction (F1: 0.0)


Discussion

📊

Datasets

PixelArena can be easily extended to other segmentation datasets.

🔄

Data Refinement

OMM results are good enough to serve as annotation drafts for new datasets.

✍️

Better Prompts

Under-specification in prompts (e.g., eyeball vs periorbital) affects results.

📐

Metric Design

Score discrepancies don't always reflect visual similarity; better metrics are needed.

Conclusion

We present PixelArena and show that Gemini 3 Pro Image marks a substantial step forward in pixel-precision visual intelligence. Our findings on zero-shot capabilities, failure modes, and data contamination provide a foundation for future research on OMMs.