Mining Visual Knowledge from Large Pre-trained Models (Talk)
Computer vision has made huge progress in the past decade under the dominant supervised learning paradigm, that is, training large-scale neural networks for each task on ever-larger datasets. However, in many cases, collecting data or annotations at scale is intractable. In contrast, humans can easily adapt to new vision tasks with very little data or few labels. To bridge this gap, we found that rich visual knowledge already exists in large pre-trained models, i.e., models trained on large-scale internet images with either self-supervised or generative objectives. We proposed different techniques to extract this implicit knowledge and use it to accomplish specific downstream tasks where data is constrained, including recognition, dense prediction, and generation. Specifically, I will present the following three works. First, I will introduce an efficient and effective way to adapt pre-trained vision transformers to a variety of low-shot downstream tasks while tuning less than 1 percent of the model parameters. Second, I will show that accurate visual correspondences emerge from a strong generative model (i.e., a diffusion model) without any supervision. Finally, I will demonstrate that an adapted diffusion model can complete a photo with the true scene contents using only a few casually captured reference images.
Biography: Luming Tang is a final-year PhD student at Cornell University, working with Prof. Bharath Hariharan. Before that, he studied physics as an undergraduate at Tsinghua University. He has broad research interests in computer vision and machine learning, including generative models and representation learning, with a particular focus on solving challenging real-world vision problems where data or annotations are constrained.