In multimodal real-world scenarios, humans interpret intents using modalities like expressions, body movements, and speech tone. Yet, most methods neglect the links between modalities and fail in ...