TAG: Tracking at Any Granularity

Adam W. Harley, Yang You, Alex Sun, Yang Zheng, Nikhil Raghuraman, Sheldon Liang, Wen-Hsuan Chu, Suya You, Achal Dave, Pavel Tokmakov, 
Rares Ambrus, Katerina Fragkiadaki, Leonidas Guibas

[Code]   [Paper]



Abstract

We introduce Tracking at Any Granularity (TAG): a new task, model, and dataset for tracking arbitrary targets in videos. We seek a tracking method that treats points, parts, and objects as equally trackable target types, embracing the fact that the distinction between these granularities is ambiguous. We introduce a generic high-capacity transformer for the task, which takes as input a video and a target prompt (indicating what to track, in the form of a click, box, or mask) and produces as output the target's segmentation on every frame. To train the model, we aggregate nearly all publicly available tracking datasets that we are aware of, currently 75 in total, amounting to millions of clips with tracking annotations, including a long tail of rare subjects such as body keypoints on insects and microscopy data. Our model is competitive with the state of the art on standard benchmarks for point tracking, mask tracking, and box tracking, but more importantly, it achieves zero-shot performance far superior to prior work, largely thanks to the data effort. We will publicly release our code, model, and aggregated dataset, to provide a foundation model for motion and video understanding and to facilitate future research in this direction.
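
To make the input/output contract concrete, here is a minimal shape-level sketch of the prompt-to-segmentation interface described above. The names (Prompt, TAGModel) and tensor shapes are illustrative assumptions, not the released API; the model body is a stub standing in for the transformer.

```python
# Hypothetical interface sketch -- all names and shapes are assumptions
# for illustration, not the actual released code.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class Prompt:
    """A target prompt on a reference frame: one of three granularities."""
    frame_idx: int
    point: Optional[torch.Tensor] = None  # (2,) pixel coordinates of a click
    box: Optional[torch.Tensor] = None    # (4,) x0, y0, x1, y1
    mask: Optional[torch.Tensor] = None   # (H, W) binary segmentation

class TAGModel(torch.nn.Module):
    """Stub in place of the transformer; returns a mask for every frame."""
    def forward(self, video: torch.Tensor, prompt: Prompt) -> torch.Tensor:
        t, _, h, w = video.shape
        # A real model would attend over the video conditioned on the prompt;
        # here we only illustrate the contract: one target mask per frame.
        return torch.zeros(t, h, w)

video = torch.rand(8, 3, 256, 256)                    # (T, C, H, W) clip
prompt = Prompt(frame_idx=0, point=torch.tensor([128.0, 64.0]))
masks = TAGModel()(video, prompt)                     # (T, H, W) masks
```

The same call works unchanged whether the prompt is a click, a box, or a mask, which is the sense in which points, parts, and objects are equally trackable target types.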


Paper

Adam W. Harley, Yang You, Alex Sun, Yang Zheng, Nikhil Raghuraman, Sheldon Liang, Wen-Hsuan Chu, Suya You, Achal Dave, Pavel Tokmakov, 
Rares Ambrus, Katerina Fragkiadaki, Leonidas Guibas. TAG: Tracking at Any Granularity. arXiv 2024.

[pdf]   [bibtex]