Abstract: Based on the fixed language descriptions in the initial frames, a vision-language tracker typically adopts a two-stream model structure to align vision and language features at the feature ...