POEM: Precise Object-level Editing via MLLM control

Abstract

We present POEM, a method for precise object-level image editing that uses off-the-shelf diffusion models and MLLMs, with no training or fine-tuning. Our approach supports precise edit instructions (e.g., "move cat to the left by 12.5 px"). It first uses an MLLM to analyze the scene and identify objects, then refines the detections and enhances the object masks using Grounded SAM. A text-based LLM then predicts the transformation matrix of the initial segmentation mask, and an image-to-image translation guided by these steps produces the edited image. This structured pipeline enables precise object-level editing with high visual fidelity while preserving spatial and visual coherence.

Demo

Method Overview

Given an image and an edit prompt, we first use an MLLM to analyze the scene and identify objects. Then, we refine the detections and enhance object masks using Grounded SAM. Next, we use a text-based LLM to predict the transformation matrix of the initial segmentation mask. Finally, we perform an image-to-image translation guided by these steps to generate the edited image. This structured pipeline enables precise object-level editing with high visual fidelity while preserving spatial and visual coherence.
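To make the mask-transformation step concrete, the sketch below shows how a 2D affine matrix could be applied to a binary segmentation mask. It is an illustration under stated assumptions, not the POEM implementation: the helper names `affine_matrix` and `warp_mask` are hypothetical, the matrix is built by hand rather than predicted by an LLM, and the warp uses plain NumPy with nearest-neighbour sampling.

```python
import numpy as np


def affine_matrix(tx=0.0, ty=0.0, sx=1.0, sy=1.0, center=(0.0, 0.0)):
    """Hypothetical helper: 3x3 homogeneous matrix that scales about
    `center` (x, y) and then translates by (tx, ty) pixels."""
    cx, cy = center
    to_origin = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]], dtype=float)
    scale = np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], dtype=float)
    back = np.array([[1, 0, cx + tx], [0, 1, cy + ty], [0, 0, 1]], dtype=float)
    return back @ scale @ to_origin


def warp_mask(mask, matrix):
    """Hypothetical helper: warp a binary mask with an affine matrix.

    Uses inverse mapping: every output pixel looks up the source pixel
    it came from, with nearest-neighbour rounding.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous (x, y, 1) coordinates of every output pixel.
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(matrix) @ dst
    src_x = np.rint(src[0]).astype(int)
    src_y = np.rint(src[1]).astype(int)
    # Pixels whose source falls outside the image stay empty.
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros(h * w, dtype=mask.dtype)
    out[valid] = mask[src_y[valid], src_x[valid]]
    return out.reshape(h, w)


# Usage: shift a 2x2 object three pixels to the left.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:4, 4:6] = 1
moved = warp_mask(mask, affine_matrix(tx=-3))
```

Because this sketch rounds to the nearest pixel, a sub-pixel shift such as 12.5 px would need an interpolating warp instead; the integer shift is used here only to keep the example easy to check.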

Results

We compare POEM with state-of-the-art image editing models. Our edit instructions cover translation, scaling, appearance changes, and combinations of these, to showcase the precision of our pipeline.

BibTeX

@inproceedings{schouten2025poem,
  title={POEM: Precise Object-level Editing via MLLM control},
  author={Schouten, Marco and Kaya, M. Onurcan and Belongie, Serge and Papadopoulos, Dim P.},
  booktitle={Scandinavian Conference on Image Analysis},
  year={2025},
}
