Abstract: Based on the fixed language descriptions in the initial frames, a vision-language tracker typically adopts a two-stream model structure to align vision and language features at the feature ...