ShapeUP: Scalable Image-Conditioned 3D Editing

Abstract

Overview

Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling.

In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training.

In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset.

Gallery

Our Results

ShapeUP results across a range of geometry and texture edits. Click any example to expand.

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Source

→

Edit Cond.

→

Textured

Untextured

Method

How It Works

Method Overview. ShapeUP formulates image-guided 3D editing as a supervised latent-to-latent translation within a pretrained 3D Diffusion Transformer. Given a source shape and an edit-conditioning image, ShapeUP produces an edited 3D asset with faithful geometry and appearance while preserving unedited regions.

Geometry Editing

Geometry Editing Pipeline. ShapeUP encodes the source 3D shape into the latent space of a pretrained Step1X-3D DiT, then conditions diffusion on both the subsampled shape latents and the image prompt via LoRA adapters trained on the MMDiT backbone. The model learns to translate the source shape into the edited target geometry while preserving structural identity.

Texture Editing

Texture Editing Pipeline. Multi-view renders of the source shape are injected through the model's cross-attention layers alongside the edit conditioning image. View-axis positional encodings distinguish source-texture tokens from edit tokens, enabling the model to synthesize consistent appearance that respects both the edit and the original fine-grained texture details.

Evaluation

Quantitative Results

Quantitative comparison on BenchUp. We evaluate image-guided 3D shape editing by measuring Condition Alignment and Occluded Region Fidelity. ShapeUP outperforms all baselines across every metric, demonstrating superior edit fidelity without sacrificing preservation of unedited regions.

Method	Condition Alignment					Occluded Region Fidelity
Method	SSIM ↑	LPIPS ↓	CLIP-I ↑	DINO-I ↑	C-Dir ↑	CLIP-I ↑	DINO-I ↑
3DEditFormer	0.733	0.270	0.908	0.849	0.441	0.877	0.736
EditP23	0.759	0.254	0.917	0.851	0.455	0.880	0.748
ShapeUP (Ours)	0.763	0.198	0.943	0.915	0.520	0.928	0.878

Ablation study. We evaluate the impact of the number of latent vectors and DFM training data on geometry editing, and compare texture conditioning strategies. Using 1024 latents with DFM data achieves the best overall balance of condition alignment and occluded region preservation.

Variant	Condition Alignment					Occluded Region Fidelity
Variant	SSIM ↑	LPIPS ↓	CLIP-I ↑	DINO-I ↑	C-Dir ↑	CLIP-I ↑	DINO-I ↑
w/o DFM data	0.769	0.196	0.943	0.909	0.505	0.932	0.884
256 latents	0.766	0.206	0.941	0.909	0.525	0.917	0.869
512 latents	0.768	0.220	0.918	0.868	0.506	0.905	0.837
Concat MV (texture)	0.779	0.187	0.943	0.912	0.555	0.897	0.832
ShapeUP (Ours)	0.763	0.198	0.943	0.915	0.520	0.928	0.878

User Study

Human preference study. Participants viewed the source object, target edit, and two results (ours vs. one baseline) in a two-alternative forced-choice setup, selecting the result that better achieves the edit while preserving the original. ShapeUP was strongly preferred across all criteria. Results from 664 comparisons by 34 participants.

vs. 3DEditFormer

ShapeUP

~83%

Baseline

~17%

vs. EditP23

ShapeUP

~77%

Baseline

~23%

Overall Preference

ShapeUP

~80%

Baselines

~20%

ShapeUP: Scalable Image-Conditioned 3D Editing

Overview

Our Results

How It Works

Geometry Editing

Texture Editing

Quantitative Results

User Study

vs. 3DEditFormer

vs. EditP23

Overall Preference

Qualitative Comparison

BibTeX