Rethinking Cross-Modal Fine-Tuning: Optimizing the Interaction between Feature Alignment and Target Fitting
Published in the 29th International Conference on Artificial Intelligence and Statistics, 2026
Abstract: Adapting pre-trained models to unseen feature modalities is increasingly important due to evolving data acquisition technologies and the growing need for cross-disciplinary knowledge integration. A key challenge in such cross-modal fine-tuning scenarios is aligning the representations of new feature modalities with the most relevant regime of a pre-trained model’s representation space, so that knowledge transfer is effective and contextually appropriate. This requires combining feature alignment with target fitting, but uncalibrated combinations can exacerbate misalignment between the source and target feature-label structures, reducing generalization performance on the target task. Existing work, however, lacks a theoretical understanding of this critical interaction between feature alignment and target fitting, as well as of its impact on generalization. To bridge this gap, we propose a principled framework that establishes a provable generalization bound on the target error, explaining the interaction between feature alignment and target fitting through a novel concept of feature-label distortion. This bound offers actionable insights into how the interaction should be optimized, providing a provable guide for algorithm design. The resulting approach significantly improves cross-modal fine-tuning performance over recent state-of-the-art methods across a wide range of benchmark datasets.
