The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models

ECCV 2026

University of Central Florida, USA

Abstract

Safety alignment of text-to-image (T2I) diffusion models aims to suppress harmful generations while preserving utility on benign prompts. Recent methods often appear to deliver high safety with high utility, but this conclusion rests largely on coarse global utility metrics (e.g., FID, CLIPScore) that are insensitive to fine-grained semantic correctness, creating an illusion of high utility. We show that when utility is measured with structured evaluation, this illusion breaks: on TIFA (Text-to-Image Faithfulness evaluation with Question Answering), safety-aligned models suffer substantial drops in semantic fidelity, including failures in object counts, attributes, and relationships. To diagnose the source of this gap, we analyze the text-encoder prompt embedding space and uncover semantic collapse, a contraction of embedding spread coupled with distortion of inter-prompt similarity structure, which strongly correlates with structured utility loss. Guided by this insight, we propose Structure-Aware Geometric Regularization (SAGE), a safety alignment objective that explicitly preserves embedding spread and inter-prompt relational structure during adaptation. Our method restores structured utility (TIFA +5.0% over prior state-of-the-art) while maintaining strong safety performance and competitive coarse-grained utility scores.

Comparison showing coarse CLIPScore can miss fine-grained utility degradation in safety-aligned text-to-image models.
The illusion of high utility under coarse evaluation. Comparison between the base model and a safe (unlearned) model on fine-grained prompts. Although the safe model fails to generate specific attributes (e.g., the yellow beak or the correct vase colors), the standard CLIPScore assigns misleadingly higher scores to the incorrect images. In contrast, the fine-grained metric TIFA captures the utility degradation, penalizing the safe model for failing to satisfy the detailed visual requirements of the prompt.

The Utility Illusion

The central issue is not that safety alignment always destroys utility; it is that the most common utility measurements can make degraded models look healthier than they are. Prior T2I safety papers typically report coarse global metrics such as FID and CLIPScore, which summarize distributional realism or broad image-text similarity. Those scores are useful, but they do not directly ask whether the generated image preserves the prompt's objects, attributes, counts, and relationships.

This creates the illusion of high utility: under CLIPScore, a safety-aligned model can appear to preserve global alignment, while a fine-grained metric such as TIFA reveals that it has lost important semantic details. In the teaser examples above, the safe model receives competitive or even higher CLIPScore despite missing the yellow beak and the requested vase colors. TIFA penalizes those failures because it decomposes the prompt into targeted visual questions.
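As a rough illustration of why question-level scoring catches what a global similarity score can miss, the sketch below scores an image as the fraction of prompt-derived questions answered correctly. The questions, expected answers, and the two hard-coded answer sets are hypothetical stand-ins loosely modeled on the yellow-beak example; an actual TIFA run would generate the questions with a language model and answer them with a VQA model, neither of which is reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    expected: str


def tifa_style_score(questions, answers):
    """Return the fraction of questions whose answer matches the expected one."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(
        answers.get(q.text, "").strip().lower() == q.expected.lower()
        for q in questions
    )
    return correct / len(questions)


# Hypothetical questions for a prompt that asks for a bird with a yellow beak.
questions = [
    Question("Is there a bird?", "yes"),
    Question("What color is the beak?", "yellow"),
    Question("Is the bird on a branch?", "yes"),
]

# Hypothetical VQA answers for the two generated images.
base_answers = {
    "Is there a bird?": "yes",
    "What color is the beak?": "yellow",
    "Is the bird on a branch?": "yes",
}
safe_answers = {
    "Is there a bird?": "yes",
    "What color is the beak?": "black",  # the requested attribute is missing
    "Is the bird on a branch?": "yes",
}

base_score = tifa_style_score(questions, base_answers)
safe_score = tifa_style_score(questions, safe_answers)
print(f"base: {base_score:.2f}, safe: {safe_score:.2f}")
```

Both images still show a bird, so a single image-text similarity score could rate them similarly, whereas the per-question score drops as soon as one requested attribute is wrong.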

Utility evaluation setups used in prior T2I safety methods. Most prior work reports coarse metrics such as FID and CLIPScore, which motivates testing safety-aligned models with structured metrics as well.

Method | Utility Dataset | FID, CLIPScore
Safe-CLIP | COCO, LAION-400M |
SafeR-CLIP | Parti-Prompts | ×
STEREO | I2P Unsafe Prompts |
ADV-Unlearn | COCO |
ESD | COCO |
DES | COCO |
SALUN | Non-forget Classes | ×
RECE | COCO |
RACE | COCO |
MACE | COCO |
SLD | COCO, Human Study |
AlignGuard | COCO |

What CLIPScore Suggests

CLIPScore stays nearly constant across semantic categories for DES, hovering around 0.30. From that global signal alone, utility appears broadly preserved after safety alignment.

What TIFA Reveals

TIFA exposes category-specific degradation: food, material, color, activity, and object prompts lose much more semantic fidelity than CLIPScore indicates.

Category-level TIFA utility drop compared with CLIPScore for DES generations.
Comparison between category-level TIFA utility drop and CLIPScore for images generated by DES. While TIFA reveals substantial degradation in certain semantic categories (e.g., food), CLIPScore remains nearly constant across categories (around 0.30), indicating limited sensitivity to fine-grained semantic errors.
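To make the category-level gap concrete, the sketch below recomputes per-category TIFA drops for DES from the Base (SD v1.4) and DES rows of the category-wise TIFA table in the Results section. Using absolute point differences is an assumption; the figure may plot a different quantity, such as a relative drop.

```python
# Category scores copied from the category-wise TIFA table (Base and DES rows).
CATEGORIES = ["Obj.", "Ani.", "Loc.", "Col.", "Food", "Mat.",
              "Att.", "Cnt.", "Sha.", "Act.", "Spa."]
BASE = [78.9, 82.8, 89.8, 79.9, 84.1, 83.7, 79.6, 63.6, 58.0, 68.2, 52.9]
DES = [73.2, 78.5, 85.4, 73.7, 71.1, 74.6, 77.4, 59.3, 53.6, 63.5, 51.7]

# Absolute drop in TIFA points per category (assumed definition of "drop").
drops = {c: round(b - d, 1) for c, b, d in zip(CATEGORIES, BASE, DES)}

# Rank categories from the largest to the smallest drop.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)
top5 = [category for category, _ in ranked[:5]]

for category, drop in ranked:
    print(f"{category:5s} -{drop:.1f}")
```

Under this reading, the five largest drops fall on food, material, color, object, and activity, the same categories named above.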

Method

SAGE targets the representation-level failure mode behind hidden utility loss. Existing text-encoder safety methods can keep each benign prompt close to its original embedding while still compressing the overall embedding space and reshuffling local neighborhoods. This semantic collapse weakens the model's ability to preserve counts, attributes, and relationships.

The method augments safety alignment with two geometry-aware regularizers: Embedding Spread Preservation keeps the benign prompt distribution from contracting relative to the base encoder, while Local Structure Alignment preserves pairwise similarity relationships among nearby prompts, including under unsafe-concept perturbations. The result is a safety objective that suppresses unsafe generations without erasing the fine-grained geometry needed for faithful prompt following.
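The page does not give the exact loss formulas, so the following is only a minimal sketch of what two such regularizers could look like, not the authors' implementation. It assumes spread is the mean squared distance from the centroid, penalizes only contraction, and compares cosine similarities over each prompt's k nearest base-encoder neighbors. The function names, the choice of k, and the toy 2-D embeddings are all hypothetical, and plain Python lists are used so the example runs without dependencies.

```python
import math


def centroid(embs):
    n = len(embs)
    return [sum(col) / n for col in zip(*embs)]


def total_spread(embs):
    """Mean squared distance of the embeddings from their centroid."""
    c = centroid(embs)
    return sum(sum((x - m) ** 2 for x, m in zip(e, c)) for e in embs) / len(embs)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def spread_loss(base, tuned):
    """Penalize contraction of the tuned spread relative to the base spread."""
    ratio = total_spread(tuned) / total_spread(base)
    return max(0.0, 1.0 - ratio) ** 2


def local_structure_loss(base, tuned, k=2):
    """Mean squared change in cosine similarity over base-encoder k-NN pairs."""
    n = len(base)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        neighbors = sorted(
            others, key=lambda j: cosine(base[i], base[j]), reverse=True
        )[:k]
        for j in neighbors:
            total += (cosine(tuned[i], tuned[j]) - cosine(base[i], base[j])) ** 2
            count += 1
    return total / count


# Hypothetical 2-D prompt embeddings from the base encoder.
base = [[1.0, 0.0], [0.9, 0.3], [0.6, 0.8], [-0.8, 0.5]]

# Simulated collapse: every embedding moves halfway toward the centroid.
c = centroid(base)
collapsed = [[m + 0.5 * (x - m) for x, m in zip(e, c)] for e in base]

print("spread loss:", spread_loss(base, collapsed))
print("local structure loss:", local_structure_loss(base, collapsed))
```

In this toy setup, both terms are zero when the tuned embeddings equal the base embeddings and grow once the embeddings contract, which is the behavior the two regularizers are described as discouraging.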

Embedding geometry under safety alignment for base model, prior methods, and SAGE.
Embedding geometry under safety alignment. The figure illustrates how safety alignment alters the structure of the text-encoder embedding space. Left: Base Model. The prompt "Three golden retrievers" is used as a reference. In the base model, semantically related prompts are arranged according to their similarity: "Three dogs" lies closest to the reference, followed by "Two golden retrievers", while an unrelated concept ("cat") appears far away. The arrows visualize the spread of prompt embeddings, and the circular region highlights the local semantic neighborhood around the reference prompt. Top-right: Prior safety alignment methods. Safety tuning often alters this structure in two ways. First, the overall spread of embeddings shrinks, meaning different prompts become more tightly clustered in the embedding space. Second, the local semantic neighborhood becomes distorted: prompts that should be far apart can move closer together, causing unrelated concepts (e.g., "cat") to appear within the neighborhood of the reference prompt. Bottom-right: Our method. Our alignment objective preserves both properties of the original embedding space. The overall embedding spread remains similar to the base model, and the local neighborhood structure around the reference prompt is maintained.

Results

Coarse metrics alone suggest that recent safety methods preserve utility, but TIFA exposes category-level failures in object, attribute, count, and relation fidelity. SAGE nearly recovers base-model structured utility while keeping safety competitive with the strongest alignment baselines.

Across the main results, SAGE reaches a TIFA average of 75.4, a 1.2% relative drop from the base model's 76.3, compared with a 6.2% drop for DES (71.6). That is a gap of 5.0 percentage points in SAGE's favor, achieved while keeping average ASR low at 1.2%. SAGE also preserves embedding geometry better than prior text-encoder methods, with a 0.96 spread ratio and 0.63 Jaccard neighborhood overlap.
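The two geometry statistics quoted above can be approximated as follows. This is a sketch under stated assumptions: spread is taken as the mean squared distance from the centroid, neighborhoods are the k nearest prompts under Euclidean distance, and the toy embeddings (loosely modeled on the "Three golden retrievers" figure) are hypothetical. The paper's exact definitions may differ.

```python
import math


def total_spread(embs):
    """Mean squared distance of the embeddings from their centroid."""
    n = len(embs)
    center = [sum(col) / n for col in zip(*embs)]
    return sum(math.dist(e, center) ** 2 for e in embs) / n


def spread_ratio(base, tuned):
    """Values below 1 mean the tuned embedding space has contracted."""
    return total_spread(tuned) / total_spread(base)


def knn(embs, i, k):
    """Indices of the k nearest neighbors of prompt i (Euclidean distance)."""
    others = [j for j in range(len(embs)) if j != i]
    return set(sorted(others, key=lambda j: math.dist(embs[i], embs[j]))[:k])


def jaccard_overlap(base, tuned, k=2):
    """Average Jaccard index between base and tuned k-NN sets."""
    scores = []
    for i in range(len(base)):
        a, b = knn(base, i, k), knn(tuned, i, k)
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores)


# Hypothetical embeddings, in order: reference prompt, "Three dogs",
# "Two golden retrievers", and an unrelated "cat" prompt.
base = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [6.0, 6.0]]

# Simulated safety tuning: the space shrinks and "cat" drifts into the
# reference prompt's neighborhood.
tuned = [[0.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.3, 0.3]]

print("spread ratio:", round(spread_ratio(base, tuned), 3))
print("Jaccard overlap:", round(jaccard_overlap(base, tuned), 3))
```

A spread ratio near 1 and a Jaccard overlap near 1 both indicate that the tuned encoder has kept the base encoder's geometry; the toy example shows both dropping when the space contracts and an unrelated prompt enters the local neighborhood.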

Relationship between embedding spread ratio and TIFA utility.
Relationship between spread ratio (Rs) and structured utility (TIFA). Methods with larger reductions in total embedding spread exhibit larger TIFA drops, indicating that embedding compression is closely associated with compositional degradation.

Category-wise TIFA evaluation. Existing safety interventions degrade structural fidelity across multiple categories, while our method maintains strong performance (75.4 TIFA). Red highlights indicate the largest drop relative to the base model.

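Method | Obj. | Ani. | Loc. | Col. | Food | Mat. | Att. | Cnt. | Sha. | Act. | Spa. | TIFA Avg ↑ | CLIP ↑ | FID ↓
Base (SD v1.4) | 78.9 | 82.8 | 89.8 | 79.9 | 84.1 | 83.7 | 79.6 | 63.6 | 58.0 | 68.2 | 52.9 | 76.3 | 26.5 | 17.23
DES | 73.2 | 78.5 | 85.4 | 73.7 | 71.1 | 74.6 | 77.4 | 59.3 | 53.6 | 63.5 | 51.7 | 71.6 (↓6.2%) | 25.5 | 16.23
AdvUnlearn | 67.1 | 69.6 | 79.7 | 64.6 | 68.7 | 74.6 | 68.0 | 44.5 | 37.7 | 49.9 | 41.9 | 63.1 (↓17.3%) | 23.9 | 20.67
STEREO | 71.7 | 75.4 | 86.9 | 73.7 | 79.4 | 77.5 | 74.2 | 61.9 | 52.2 | 58.3 | 48.2 | 69.9 (↓8.4%) | 24.6 | 21.69
RECE | 78.4 | 80.5 | 88.0 | 79.9 | 80.2 | 85.7 | 77.6 | 61.1 | 66.7 | 65.6 | 50.9 | 74.8 (↓2.0%) | 26.0 | 17.51
MACE | 62.9 | 68.7 | 79.6 | 62.5 | 54.9 | 68.9 | 69.6 | 55.5 | 46.4 | 53.9 | 47.4 | 62.6 (↓18.0%) | 23.8 | 24.87
SafeCLIP | 59.0 | 67.4 | 79.1 | 69.5 | 58.1 | 75.2 | 71.4 | 56.6 | 50.0 | 46.6 | 40.5 | 60.1 (↓21.2%) | 22.3 | 33.40
SafeRCLIP | 58.6 | 66.1 | 78.2 | 69.1 | 58.5 | 74.6 | 71.6 | 55.9 | 50.7 | 46.1 | 43.7 | 60.7 (↓20.4%) | 22.4 | 32.31
SLD | 77.3 | 80.0 | 88.0 | 77.8 | 82.3 | 81.3 | 76.8 | 63.0 | 52.2 | 63.2 | 50.4 | 73.9 (↓3.1%) | 25.5 | 21.85
Ours | 77.6 | 80.8 | 88.3 | 79.6 | 83.5 | 83.7 | 80.1 | 61.1 | 58.0 | 66.3 | 53.8 | 75.4 (↓1.2%) | 26.4 | 15.93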
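The ↓ percentages in the TIFA Avg column appear to be drops relative to the base model's average of 76.3 rather than absolute point differences. The short check below, a sketch based on that reading, recomputes them from the table's TIFA averages.

```python
BASE_TIFA = 76.3  # TIFA average of Base (SD v1.4) from the table

# TIFA averages copied from the table.
tifa_avg = {
    "DES": 71.6, "AdvUnlearn": 63.1, "STEREO": 69.9, "RECE": 74.8,
    "MACE": 62.6, "SafeCLIP": 60.1, "SafeRCLIP": 60.7, "SLD": 73.9,
    "Ours": 75.4,
}


def relative_drop(score, base=BASE_TIFA):
    """Percentage drop relative to the base score, rounded to one decimal."""
    return round(100.0 * (base - score) / base, 1)


drops = {method: relative_drop(score) for method, score in tifa_avg.items()}

# Gap between DES and the proposed method, in percentage points.
gap_vs_des = round(drops["DES"] - drops["Ours"], 1)

for method, drop in drops.items():
    print(f"{method:10s} {drop:5.1f}%")
print("gap vs. DES:", gap_vs_des)
```

Under this reading, the recomputed values match the percentages reported in the table, and the difference between the DES and SAGE drops corresponds to the 5.0% improvement quoted for SAGE.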

Comparison of Attack Success Rate (ASR, in %) and CLIPScore across methods. Lower ASR indicates better safety, and higher CLIPScore indicates better utility preservation.

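Method | MMA ↓ | Sneaky ↓ | I2P-S ↓ | Ring ↓ | P4D ↓ | Avg. ASR ↓ | CLIPScore ↑
Base (SD v1.4) | 80.4 | 42.7 | 34.3 | 98.1 | 82.4 | 67.6 | 26.5
DES | 0.2 | 0.8 | 1.2 | 2.8 | 0.0 | 1.0 | 25.5
Adv-Unlearn | 0.3 | 0.8 | 1.1 | 0.0 | 0.0 | 0.4 | 23.9
SafeCLIP | 25.2 | 17.7 | 24.0 | 65.4 | 57.7 | 38.1 | 22.3
SafeRCLIP | 24.6 | 16.1 | 17.9 | 73.8 | 43.0 | 35.1 | 22.4
STEREO | 2.2 | 3.2 | 1.1 | 2.8 | 3.3 | 2.5 | 24.6
SLD | 74.3 | 31.5 | 20.8 | 98.1 | 74.3 | 59.8 | 25.5
RECE | 36.1 | 6.5 | 6.0 | 15.9 | 26.1 | 18.1 | 26.0
MACE | 8.6 | 2.4 | 6.3 | 9.4 | 10.3 | 7.4 | 23.8
Ours | 0.4 | 0.8 | 1.2 | 2.8 | 1.0 | 1.2 | 26.4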

Conclusion

The paper shows that the standard story around safety-utility tradeoffs is incomplete: global scores like FID and CLIPScore can hide substantial fine-grained failures. Having diagnosed those failures as semantic collapse in the text-encoder embedding space, the paper proposes SAGE, which gives safety alignment a more structural target: preserve embedding spread and local relationships while steering unsafe prompts away from harmful generations.

The resulting model maintains strong robustness against unsafe and adversarial prompts while restoring much of the structured utility that prior safety methods lose, suggesting that representation geometry is a practical lever for safer and more faithful text-to-image generation.

BibTeX

@misc{yousaf2026illusionhighutilitysafety,
      title={The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models}, 
      author={Adeel Yousaf and Soumik Ghosh and James Beetham and Amrit Singh Bedi and Mubarak Shah},
      year={2026},
      eprint={2607.00402},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2607.00402}, 
    }