Recent advancements in artificial intelligence have significantly improved how machines learn to associate visual content with language. Contrastive learning models have been pivotal in this transformation, particularly those aligning images and text through a shared embedding space. These models are central to zero-shot classification, image-text retrieval, and multimodal reasoning. However, while these tools have pushed…
