Abstract: Many current studies on video-text cross-modal retrieval focus on narrowing the semantic gap between video and text, but ignore the semantic differences between different sampled ...