Authors: Xin Jin; Yichuan Zhong; Yapeng Tian
arXiv: 2601.08011 | DOI (arXiv): 10.48550/arXiv.2601.08011 | OpenReview/TMLR: q6M73uOBZE
Current text-conditioned diffusion editors handle single-object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight, training-free framework that receives two separate textual prompts—one specifying a blend object and the other defining a target style—and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors: Cross-Attention Object Fusion (CAOF) and Self-Attention Style Fusion (SASF). Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
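The abstract names two attention-level mechanisms—cross-attention fusion for object content and self-attention fusion for style—without implementation details. Below is a minimal toy sketch of the general idea, assuming (hypothetically) that CAOF injects the object prompt's text embeddings as cross-attention keys/values while SASF swaps in keys/values from a style branch, with the two outputs blended per spatial token. All tensor shapes, values, and the 50/50 blend weight are illustrative assumptions, not the paper's actual method.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a plain Python list."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention on toy lists of vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Toy latent queries: 2 spatial tokens, dim 2.
Q = [[1.0, 0.0], [0.0, 1.0]]

# CAOF-style injection (hypothetical): cross-attention K/V taken from
# the *object* prompt's text embeddings.
K_obj = [[1.0, 0.0], [0.0, 1.0]]
V_obj = [[2.0, 0.0], [0.0, 2.0]]

# SASF-style injection (hypothetical): self-attention K/V swapped in
# from a *style* branch's latent features.
K_sty = [[0.5, 0.5], [0.5, -0.5]]
V_sty = [[1.0, 1.0], [1.0, -1.0]]

cross_out = attention(Q, K_obj, V_obj)  # object-content pathway
self_out = attention(Q, K_sty, V_sty)   # style pathway
# Blend the two pathways per token (assumed equal weighting).
blended = [[0.5 * c + 0.5 * s for c, s in zip(cr, se)]
           for cr, se in zip(cross_out, self_out)]
print(blended)
```

Since each attention output row is a convex combination of the value rows, the object pathway stays bounded by `V_obj`'s range; the blend then mixes in the style pathway token by token.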
@misc{jin2026tpblend,
  title={{TP}-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models},
  author={Xin Jin and Yichuan Zhong and Yapeng Tian},
  year={2026},
  eprint={2601.08011},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2601.08011},
  url={https://arxiv.org/abs/2601.08011}
}
@article{jin2025tpblend,
  title={{TP}-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models},
  author={Xin Jin and Yichuan Zhong and Yapeng Tian},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=q6M73uOBZE}
}