SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge
[AAAI-2026]

1Center for Research in Computer Vision (CRCV), University of Central Florida
2SAFERR AI Lab, University of Central Florida

Warning: This paper includes explicit content and language that may be offensive or distressing to some readers.

Abstract

Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure.

To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SafeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SafeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety.

To support more rigorous evaluation, we also contribute NSFWCaps, a new benchmark of 1,000 highly aligned safe–unsafe pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.

Paper Details

SafeR-CLIP motivation and approach

Figure 1: An unsafe concept can have multiple semantically valid safe alternatives. In this example, the unsafe caption "A deadly looking gun on a table next to a child" could plausibly align with safe counterparts such as "A kid sitting at a table with some food" or "A child at a table sitting next to stacked items", which preserve the underlying semantics while removing unsafe elements. (Left) Safe-CLIP (Poppi et al. 2024a) enforces a rigid alignment between the unsafe input and a single predefined safe caption (e.g., "A delicious looking bunt cake on a table next to fruit"), while treating other valid alternatives as potential negatives. This leads to two major issues: (1) due to the noisy nature of existing datasets like ViSU (Poppi et al. 2024a), the selected unsafe–safe pair may be semantically misaligned, as shown; and (2) semantically closer safe alternatives are incorrectly penalized. (Right) Our method, SafeR-CLIP, addresses these limitations by aligning each unsafe input with its most semantically compatible safe counterpart while pushing it away from the unsafe embedding, ensuring a better safety–generalization trade-off.

Relative Cross-Modal Redirection

Safe-CLIP’s InfoNCE-style cross-modal loss treats many semantically valid safe alternatives as negatives, pushing them away when they appear in the same batch and distorting CLIP’s cross-modal geometry; this drives a large zero-shot accuracy drop. We fix the negative selection by using a single targeted hard negative—the unsafe counterpart—so the model moves unsafe inputs toward the intended safe target without repelling related safe concepts. This preserves generalization while still suppressing unsafe associations.
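Below is a minimal PyTorch sketch of this idea: the redirected unsafe embedding is pulled toward its safe target and contrasted only against its own unsafe counterpart, so other safe captions in the batch are never used as negatives. The function and argument names are illustrative, and which encoder (frozen vs. fine-tuned) produces each embedding is an assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def redirection_loss(unsafe_emb, safe_target_emb, unsafe_anchor_emb, temperature=0.07):
    """Pull each redirected unsafe embedding toward its safe target while pushing it
    away from its own unsafe counterpart (a single targeted hard negative).

    All embeddings are assumed L2-normalized with shape (batch, dim).
    """
    pos = F.cosine_similarity(unsafe_emb, safe_target_emb, dim=-1) / temperature
    neg = F.cosine_similarity(unsafe_emb, unsafe_anchor_emb, dim=-1) / temperature
    # Two-way softmax over {positive, negative}: unlike batch-wide InfoNCE,
    # other safe captions in the batch are never treated as negatives.
    logits = torch.stack([pos, neg], dim=-1)                      # (batch, 2)
    labels = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```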

Proximity-Based Alignment

Instead of forcing each unsafe input toward a fixed, sometimes poorly matched safe target, we retrieve its closest semantically compatible safe alternative in the pretrained space and align to that. This “minimal-intervention” step respects CLIP’s geometry, reducing representational shift and avoiding noisy supervision. The result is safer alignment with significantly less loss of overall generalization performance (i.e., zero-shot accuracy).
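As a concrete illustration, this retrieval step can be sketched as a nearest-neighbor lookup in the pretrained embedding space. How the bank of candidate safe embeddings is built, and the variable names used here, are assumptions rather than the paper's exact procedure.

```python
import torch

@torch.no_grad()
def retrieve_safe_targets(unsafe_text_emb, safe_text_emb_bank):
    """For each unsafe caption embedding, pick the most similar safe caption
    from a bank of candidate safe embeddings (pretrained CLIP text space).

    unsafe_text_emb:    (N, D), L2-normalized
    safe_text_emb_bank: (M, D), L2-normalized
    Returns the index of the nearest safe candidate for each unsafe input.
    """
    sims = unsafe_text_emb @ safe_text_emb_bank.T   # cosine similarities, (N, M)
    return sims.argmax(dim=-1)                      # (N,)
```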

Progressive Training for Safety Alignment

We introduce unsafe–safe pairs in a curriculum from easy to hard, stabilizing adaptation and preventing abrupt representational shifts. Early training focuses on clearly aligned pairs, while more challenging examples are added gradually. This controlled progression maintains model generalization while steadily increasing safety alignment strength.
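One way to implement such a curriculum, sketched below, is to rank pairs by how close each unsafe embedding already is to its retrieved safe target and reveal them in stages, easiest first. Using similarity as the difficulty proxy and splitting into equal-sized stages are illustrative assumptions, not the paper's exact schedule.

```python
import torch

def curriculum_order(unsafe_emb, safe_target_emb, num_stages=4):
    """Order unsafe-safe pairs from easy to hard, using the cosine similarity
    between an unsafe embedding and its retrieved safe target as a difficulty
    proxy (higher similarity = easier redirection). Returns per-stage indices.
    """
    sims = (unsafe_emb * safe_target_emb).sum(dim=-1)   # (N,) cosine similarities
    order = torch.argsort(sims, descending=True)        # easiest pairs first
    return list(torch.chunk(order, num_stages))         # stage 0 = easiest subset
```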

NSFWCaps: Robust Safety Evaluation Set

Prior benchmarks like ViSU often pair unsafe and safe samples that are not semantically related, leading to noisy evaluation (e.g., an unsafe gun caption paired with a random “cake on a table” caption), as shown in Figure 1.

NSFWCaps fixes this by generating unsafe variants that preserve the original meaning, only changing the safety-relevant elements. We then apply safety filtering and semantic similarity scoring to ensure each safe–unsafe pair truly describes the same underlying scene. The final dataset contains 1,000 tightly aligned safe–unsafe quadruples, providing a much more reliable benchmark for evaluating cross-modal safety alignment.
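A simplified sketch of the filtering stage is shown below: a candidate pair is kept only if the unsafe variant is flagged by a safety classifier and remains semantically close to its safe source. The thresholds, the choice of NSFW classifier, and the embedding model used for similarity scoring are illustrative assumptions, not the exact pipeline used to build NSFWCaps.

```python
import torch

@torch.no_grad()
def keep_aligned_pairs(safe_emb, unsafe_emb, nsfw_scores,
                       sim_thresh=0.8, nsfw_thresh=0.5):
    """Keep candidate pairs whose unsafe variant (i) is flagged as NSFW by a
    safety classifier and (ii) stays semantically close to its safe source.

    safe_emb / unsafe_emb: (N, D) L2-normalized caption embeddings.
    nsfw_scores:           (N,) probabilities from an off-the-shelf NSFW filter.
    """
    sims = (safe_emb * unsafe_emb).sum(dim=-1)           # per-pair cosine similarity
    mask = (sims >= sim_thresh) & (nsfw_scores >= nsfw_thresh)
    return mask.nonzero(as_tuple=True)[0]                # indices of retained pairs
```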

NSFWCaps dataset visualization

Figure 2: Overview of the NSFWCaps dataset. Unsafe captions and images are minimally modified versions of their safe counterparts, preserving the original context while introducing NSFW elements. This ensures tight semantic alignment and enables controlled evaluation of cross-modal safety.

Results

Result figures

Conclusion

We present SafeR-CLIP, a fine-tuning strategy that redirects unsafe embeddings toward safe counterparts while preserving model utility. Unlike prior approaches that rely on noisy mappings, our method uses proximity-based redirection to guide unsafe inputs toward semantically aligned safe alternatives. This improves safety alignment across multiple tasks—enhancing redirection in retrieval, reducing unsafe generations in text-to-image synthesis, and lowering toxicity in image captioning—while retaining strong generalization, as demonstrated by zero-shot classification results. These findings highlight that proximity-aware redirection offers an effective balance between safety and performance. Future work may explore asymmetric encoder adaptation and broader real-world deployment of safety-tuned models.

BibTeX

@inproceedings{adeel2026safer,
  author    = {Yousaf, Adeel and Fioresi, Joseph and Beetham, James and Bedi, Amrit Singh and Shah, Mubarak},
  title     = {SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  year      = {2026},
}