Authors: Xin Jin; Yichuan Zhong; Yapeng Tian
arXiv: 2601.08011 | DOI (arXiv): 10.48550/arXiv.2601.08011 | OpenReview/TMLR: q6M73uOBZE
Current text-conditioned diffusion editors handle single-object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight, training-free framework that receives two separate textual prompts—one specifying a blend object and the other defining a target style—and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors: Cross-Attention Object Fusion (CAOF) and Self-Attention Style Fusion (SASF). Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
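The abstract names two attention-level mechanisms—cross-attention fusion for object content and self-attention fusion for style—without implementation details. Below is a minimal toy sketch of the general idea, assuming (hypothetically) that CAOF injects the object prompt's text embeddings as cross-attention keys/values while SASF swaps in keys/values from a style branch, with the two outputs blended per spatial token. All tensor shapes, values, and the 50/50 blend weight are illustrative assumptions, not the paper's actual method.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a plain Python list."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention on toy lists of vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Toy latent queries: 2 spatial tokens, dim 2.
Q = [[1.0, 0.0], [0.0, 1.0]]

# CAOF-style injection (hypothetical): cross-attention K/V taken from
# the *object* prompt's text embeddings.
K_obj = [[1.0, 0.0], [0.0, 1.0]]
V_obj = [[2.0, 0.0], [0.0, 2.0]]

# SASF-style injection (hypothetical): self-attention K/V swapped in
# from a *style* branch's latent features.
K_sty = [[0.5, 0.5], [0.5, -0.5]]
V_sty = [[1.0, 1.0], [1.0, -1.0]]

cross_out = attention(Q, K_obj, V_obj)  # object-content pathway
self_out = attention(Q, K_sty, V_sty)   # style pathway
# Blend the two pathways per token (assumed equal weighting).
blended = [[0.5 * c + 0.5 * s for c, s in zip(cr, se)]
           for cr, se in zip(cross_out, self_out)]
print(blended)
```

Since each attention output row is a convex combination of the value rows, the object pathway stays bounded by `V_obj`'s range; the blend then mixes in the style pathway token by token.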
@misc{jin2026tpblend,
  title={{TP}-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models},
  author={Xin Jin and Yichuan Zhong and Yapeng Tian},
  year={2026},
  eprint={2601.08011},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2601.08011},
  url={https://arxiv.org/abs/2601.08011}
}
@article{jin2025tpblend,
  title={{TP}-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models},
  author={Xin Jin and Yichuan Zhong and Yapeng Tian},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=q6M73uOBZE}
}