Emulating human-like adaptive vision for efficient and flexible machine visual perception

Abstract

Human vision is highly adaptive, efficiently sampling intricate environments by sequentially fixating on task-relevant regions. In contrast, prevailing machine vision models passively process entire scenes at once, resulting in excessive resource demands scaling with spatial–temporal input resolution and model size, yielding critical limitations impeding both future advancements and real-world application. Here we introduce AdaptiveNN, a general framework aiming to enable the transition from ‘passive’ to ‘active and adaptive’ vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process, progressively identifying and attending to regions pertinent to the task, incrementally combining information across fixations and actively concluding observation when sufficient. We establish a theory integrating representation learning with self-rewarding reinforcement learning, enabling end-to-end training of the non-differentiable AdaptiveNN without additional supervision on fixation locations. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, processing images from real driving and medical scenarios, language-driven embodied artificial intelligence and side-by-side comparisons with humans. AdaptiveNN achieves up to 28 times inference cost reduction without sacrificing accuracy, flexibly adapts to varying task demands and resource budgets without retraining, and provides enhanced interpretability via its fixation patterns, demonstrating a promising avenue towards efficient, flexible and interpretable computer vision. Furthermore, AdaptiveNN exhibits closely human-like perceptual behaviours in many cases, revealing its potential as a valuable tool for investigating visual cognition.
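As a rough illustration of the coarse-to-fine loop described above (glance at a cheap low-resolution view, fixate on promising regions one at a time, combine what each fixation reveals, and stop once the evidence suffices), the following toy sketch may help. It is not the authors' implementation: the function names `coarse_glance` and `adaptive_perceive`, the parameters `patch`, `threshold` and `max_fixations`, and the toy visual-search task are all hypothetical, and a hand-written saliency heuristic stands in for the reinforcement-learned fixation policy.

```python
def coarse_glance(image, patch):
    """Cheap low-resolution view: mean intensity of each patch x patch block."""
    rows, cols = len(image), len(image[0])
    coarse = {}
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            block = [image[i][j]
                     for i in range(r, min(r + patch, rows))
                     for j in range(c, min(c + patch, cols))]
            coarse[(r, c)] = sum(block) / len(block)
    return coarse


def adaptive_perceive(image, patch=2, threshold=0.9, max_fixations=4):
    """Toy visual search for a pixel whose value reaches `threshold`.

    Returns (fixations, confidence, stopped_early).
    """
    rows, cols = len(image), len(image[0])
    coarse = coarse_glance(image, patch)
    # ASSUMPTION: visiting the brightest coarse cells first is a hand-written
    # stand-in for a learned fixation policy.
    order = sorted(coarse, key=coarse.get, reverse=True)
    fixations, confidence = [], 0.0
    for r, c in order[:max_fixations]:
        fixations.append((r, c))
        # Fine inspection: read the full-resolution pixels of this patch only.
        fine = [image[i][j]
                for i in range(r, min(r + patch, rows))
                for j in range(c, min(c + patch, cols))]
        # Combine information across fixations (here: keep the strongest evidence).
        confidence = max(confidence, max(fine))
        # Conclude observation as soon as the evidence is sufficient.
        if confidence >= threshold:
            return fixations, confidence, True
    return fixations, confidence, False


scene = [[0.1] * 4 for _ in range(4)]
scene[2][3] = 1.0  # a single bright target pixel
print(adaptive_perceive(scene))  # ([(2, 2)], 1.0, True)
```

In this toy setting the target is found after one fixation, so only 4 of the 16 full-resolution pixels are read; raising `threshold` or lowering `max_fixations` trades accuracy against cost without changing the code, which loosely mirrors the budget flexibility the abstract describes.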

Data availability

Most data used in this study are publicly available, including from ImageNet (ref. 25) at https://www.image-net.org/, CUB-200-2011 (ref. 100) at https://www.vision.caltech.edu/datasets/cub_200_2011/, NABirds (ref. 101) at https://dl.allaboutbirds.org/nabirds, Oxford-IIIT Pet (ref. 102) at https://www.robots.ox.ac.uk/~vgg/data/pets/, Stanford Dogs (ref. 103) at https://paperswithcode.com/dataset/stanford-dogs, Stanford Cars (ref. 104) at https://paperswithcode.com/dataset/stanford-cars, FGVC-Aircraft (ref. 105) at https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/, STSD (ref. 74) at https://www.cvl.isy.liu.se/research/datasets/traffic-signs-dataset/, MNIST (ref. 48) at https://paperswithcode.com/dataset/mnist, RSNA pneumonia detection (ref. 76) at https://www.rsna.org/rsnai/ai-image-challenge/rsna-pneumonia-detection-challenge-2018, CALVIN (ref. 78) at https://github.com/mees/calvin, SALICON (ref. 79) at http://salicon.net and MIT1003 (ref. 106) at https://saliency.tuebingen.ai/. A minimum dataset for our visual Turing tests is provided in Supplementary Figs. 12 and 13.

Code availability

Implementation code is available via GitHub at https://github.com/LeapLabTHU/AdaptiveNN (ref. 107).

References

1. Biederman, I. Perceiving real-world scenes. Science 177, 77–80 (1972).
2. Sperling, G. & Melchner, M. J. The attention operating characteristic: examples from visual search. Science 202, 315–318 (1978).
3. Sagi, D. & Julesz, B. ‘Where’ and ‘what’ in vision. Science 228, 1217–1219 (1985).
4. Moran, J. & Desimone, R. Selective attention gates visual processing in the extrastriate cortex. Science 229, 782–784 (1985).
5. Ölveczky, B. P., Baccus, S. A. & Meister, M. Segregation of object and background motion in the retina. Nature 423, 401–408 (2003).
6. Moore, T. & Armstrong, K. M. Selective gating of visual signals by microstimulation of frontal cortex. Nature 421, 370–373 (2003).
7. Najemnik, J. & Geisler, W. S. Optimal eye movement strategies in visual search. Nature 434, 387–391 (2005).
8. Carrasco, M. Visual attention: the past 25 years. Vis. Res. 51, 1484–1525 (2011).
9. Wolfe, J. M. & Horowitz, T. S. Five factors that guide attention in visual search. Nat. Hum. Behav. 1, 0058 (2017).
10. Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In Proc. 36th International Conference on Neural Information Processing Systems 23716–23736 (ACM, 2022).
11. OpenAI. GPT-4 Technical Report (OpenAI, 2023).
12. Gemini Team Google. Gemini: A Family of Highly Capable Multimodal Models Technical Report (Google, 2023).
13. Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473 (2024).
14. Kaufmann, E. et al. Champion-level drone racing using deep reinforcement learning. Nature 620, 982–987 (2023).
15. Zitkovich, B. et al. RT-2: vision-language-action models transfer web knowledge to robotic control. In Proc. 7th Conference on Robot Learning (eds Jie, T. & Marc, T.) 2165–2183 (PMLR, 2023).
16. O’Neill, A. et al. Open X-Embodiment: robotic learning datasets and RT-X models: Open X-Embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation 6892–6903 (IEEE, 2024).
17. Gehrig, D. & Scaramuzza, D. Low-latency automotive vision with event cameras. Nature 629, 1034–1040 (2024).
18. Chen, A. I., Balter, M. L., Maguire, T. J. & Yarmush, M. L. Deep learning robotic guidance for autonomous vascular access. Nat. Mach. Intell. 2, 104–115 (2020).
19. Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
20. Wang, X. et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature 634, 970–978 (2024).
21. Schäfer, R. et al. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. Nat. Comput. Sci. 4, 495–509 (2024).
22. Lake, B. M., Salakhutdinov, R. & Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science 350, 1332–1338 (2015).
23. Orhan, A. E. & Lake, B. M. Learning high-level visual representations from a child’s perspective without strong inductive biases. Nat. Mach. Intell. 6, 271–283 (2024).
24. Vong, W. K., Wang, W., Orhan, A. E. & Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science 383, 504–511 (2024).
25. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Computer Vis. 115, 211–252 (2015).
26. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (eds Lourdes, A. et al.) 770–778 (IEEE, 2016).
27. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (eds Jim, R. et al.) 4700–4708 (IEEE, 2017).
28. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In International Conference on Learning Representations (eds Katja, H. et al.) (ICLR, 2021).
29. Dehghani, M. et al. Scaling vision transformers to 22 billion parameters. In Proc. 40th International Conference on Machine Learning 7480–7512 (PMLR, 2023).
30. Zou, Z., Chen, K., Shi, Z., Guo, Y. & Ye, J. Object detection in 20 years: a survey. Proc. IEEE 111, 257–276 (2023).
31. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Marina, M. & Tong, Z.) 8748–8763 (PMLR, 2021).
32. Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).
33. Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
34. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
35. Du, Z. et al. ShiDianNao: shifting vision processing closer to the sensor. In Proc. 42nd Annual International Symposium on Computer Architecture (ed. David, A.) 92–104 (ACM, 2015).
36. Bai, J., Lian, S., Liu, Z., Wang, K. & Liu, D. Smart guiding glasses for visually impaired people in indoor environment. IEEE Trans. Consum. Electron. 63, 258–266 (2017).
37. Howard, A. G. et al. MobileNets: efficient convolutional neural networks for mobile vision applications. Preprint at https://arxiv.org/abs/1704.04861 (2017).
38. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds David, F. et al.) 4510–4520 (IEEE, 2018).
39. Huang, G., Liu, S., Van der Maaten, L. & Weinberger, K. Q. CondenseNet: an efficient DenseNet using learned group convolutions. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds David, F. et al.) 2752–2761 (IEEE, 2018).
40. Chen, J. & Ran, X. Deep learning with edge computing: a review. Proc. IEEE 107, 1655–1674 (2019).
41. Wang, X. et al. Convergence of edge computing and deep learning: a comprehensive survey. IEEE Commun. Surv. Tutor. 22, 869–904 (2020).
42. Murshed, M. S. et al. Machine learning at the network edge: a survey. ACM Comput. Surv. 54, 1–37 (2021).
43. Bourzac, K. Fixing AI’s energy crisis. Nature https://doi.org/10.1038/d41586-024-03408-z (2024).
44. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
45. LeCun, Y. et al. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 396–404 (NeurIPS, 1989).
46. Arbib, M. A. The Handbook of Brain Theory and Neural Networks (MIT, 1995).
47. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
48. LeCun, Y. The MNIST Database of Handwritten Digits (MNIST, 1998); http://yann.lecun.com/exdb/mnist/
49. Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).
50. Chen, Z. et al. InternVL: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds Zeynep, A. et al.) 24185–24198 (IEEE, 2024).
51. Oquab, M. et al. DINOv2: learning robust visual features without supervision. Trans. Mach. Learn. Res. (2024).
52. Ward, D. J. & MacKay, D. J. Fast hands-free writing by gaze direction. Nature 418, 838–838 (2002).
53. Ma, W. J., Navalpakkam, V., Beck, J. M., van den Berg, R. & Pouget, A. Behavior and neural basis of near-optimal visual search. Nat. Neurosci. 14, 783–790 (2011).
54. Henderson, J. M. & Hayes, T. R. Meaning-based guidance of attention in scenes as revealed by meaning maps. Nat. Hum. Behav. 1, 743–747 (2017).
55. Wolfe, J. M. & Horowitz, T. S. What attributes guide the deployment of visual attention and how do they do it? Nat. Rev. Neurosci. 5, 495–501 (2004).
56. Hanning, N. M., Fernández, A. & Carrasco, M. Dissociable roles of human frontal eye fields and early visual cortex in presaccadic attention. Nat. Commun. 14, 5381 (2023).
57. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
58. Mnih, V., Heess, N., Graves, A. & Kavukcuoglu, K. Recurrent models of visual attention. In Advances in Neural Information Processing Systems 2204–2212 (NeurIPS, 2014).
59. Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple object recognition with visual attention. In International Conference on Learning Representations (eds Brian, K. et al.) (ICLR, 2015).
60. Yang, L. et al. Resolution adaptive networks for efficient inference. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds Ce, L. et al.) 2369–2378 (IEEE, 2020).
61. Zelinsky, G. J., Chen, Y., Ahn, S. & Adeli, H. Changing perspectives on goal-directed attention control: the past, present, and future of modeling fixations during visual search. Psychol. Learn. Motiv. 73, 231–286 (2020).
62. Wang, Y., Huang, R., Song, S., Huang, Z. & Huang, G. Not all images are worth 16 × 16 words: dynamic transformers for efficient image recognition. In Proc. 35th International Conference on Neural Information Processing Systems 11960–11973 (NeurIPS, 2021).
63. Rao, Y. et al. DynamicViT: efficient vision transformers with dynamic token sparsification. In 35th Conference on Neural Information Processing Systems 13937–13949 (NeurIPS, 2021).
64. Huang, G. et al. Glance and focus networks for dynamic visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45, 4605–4621 (2022).
65. Bolya, D. et al. Token merging: your ViT but faster. In International Conference on Learning Representations (eds Been, K. et al.) (ICLR, 2023).
66. Gottlieb, J. & Oudeyer, P.-Y. Towards a neuroscience of active sampling and curiosity. Nat. Rev. Neurosci. 19, 758–770 (2018).
67. Navon, D. Forest before trees: the precedence of global features in visual perception. Cogn. Psychol. 9, 353–383 (1977).
68. Chen, L. Topological structure in visual perception. Science 218, 699–700 (1982).
69. Hochstein, S. & Ahissar, M. View from the top: hierarchies and reverse hierarchies in the visual system. Neuron 36, 791–804 (2002).
70. Ganel, T. & Goodale, M. A. Visual control of action but not perception requires analytical processing of object shape. Nature 426, 664–667 (2003).
71. Oliva, A. & Torralba, A. Building the gist of a scene: the role of global image features in recognition. Prog. Brain Res. 155, 23–36 (2006).
72. Peelen, M. V., Berlot, E. & de Lange, F. P. Predictive processing of scenes and objects. Nat. Rev. Psychol. 3, 13–26 (2024).
73. Touvron, H. et al. Training data-efficient image transformers & distillation through attention. In Proc. 38th International Conference on Machine Learning (eds Marina, M. & Tong, Z.) 10347–10357 (PMLR, 2021).
74. Larsson, F. & Felsberg, M. Using Fourier descriptors and spatial models for traffic sign recognition. In Proc. Image Analysis: 17th Scandinavian Conference, SCIA 2011 (eds Heyden, A. et al.) 238–249 (Springer, 2011).
75. Valliappan, N. et al. Accelerating eye movement research via accurate and affordable smartphone eye tracking. Nat. Commun. 11, 4553 (2020).
76. Shih, G. et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artif. Intell. 1, 180041 (2019).
77. Li, X. et al. Vision-language foundation models as effective robot imitators. In International Conference on Learning Representations (eds Swarat, C. et al.) (ICLR, 2024).
78. Mees, O., Hermann, L., Rosete-Beas, E. & Burgard, W. CALVIN: a benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robot. Autom. Lett. 7, 7327–7334 (2022).
79. Jiang, M., Huang, S., Duan, J. & Zhao, Q. SALICON: Saliency in Context. In IEEE Conference on Computer Vision and Pattern Recognition (eds Kristen, G. et al.) 1072–1080 (IEEE, 2015).
80. Itti, L. & Koch, C. Computational modelling of visual attention. Nat. Rev. Neurosci. 2, 194–203 (2001).
81. Henderson, J. M. Human gaze control during real-world scene perception. Trends Cogn. Sci. 7, 498–504 (2003).
82. Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
83. Kellman, P. J. & Spelke, E. S. Perception of partly occluded objects in infancy. Cogn. Psychol. 15, 483–524 (1983).
84. Spelke, E. S., Breinlinger, K., Macomber, J. & Jacobson, K. Origins of knowledge. Psychol. Rev. 99, 605–632 (1992).
85. Spelke, E. Initial knowledge: six suggestions. Cognition 50, 431–445 (1994).
86. Viola Macchi, C., Turati, C. & Simion, F. Can a nonspecific bias toward top-heavy patterns explain newborns’ face preference? Psychol. Sci. 15, 379–383 (2004).
87. Simion, F., Di Giorgio, E., Leo, I. & Bardi, L. The processing of social stimuli in early infancy: from faces to biological motion perception. Prog. Brain Res. 189, 173–193 (2011).
88. Ullman, S., Harari, D. & Dorfman, N. From simple innate biases to complex visual concepts. Proc. Natl Acad. Sci. USA 109, 18215–18220 (2012).
89. Stahl, A. E. & Feigenson, L. Observing the unexpected enhances infants’ learning and exploration. Science 348, 91–94 (2015).
90. Reynolds, G. D. & Roth, K. C. The development of attentional biases for faces in infancy: a developmental systems perspective. Front. Psychol. 9, 315789 (2018).
91. Bambach, S., Crandall, D., Smith, L. & Yu, C. Toddler-inspired visual object learning. In Proc. 32nd International Conference on Neural Information Processing Systems 1209–1218 (ACM, 2018).
92. Orhan, E., Gupta, V. & Lake, B. M. Self-supervised learning through the eyes of a child. In 34th Conference on Neural Information Processing Systems 9960–9971 (NeurIPS, 2020).
93. Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (eds Hugo, L. et al.) (ICLR, 2016).
94. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
95. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. Preprint at https://arxiv.org/abs/1707.06347 (2017).
96. Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proc. 13th International Conference on Neural Information Processing Systems 1057–1063 (ACM, 1999).
97. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
98. Wang, L. et al. Incorporating neuro-inspired adaptability for continual learning in artificial intelligence. Nat. Mach. Intell. 5, 1356–1368 (2023).
99. Miller, G. A. WordNet: a lexical database for English. Commun. ACM 38, 39–41 (1995).
100. Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset (Caltech, 2011).
101. Van Horn, G. et al. Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (eds Kristen, G. et al.) 595–604 (IEEE, 2015).
102. Parkhi, O. M., Vedaldi, A., Zisserman, A. & Jawahar, C. Cats and dogs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (eds Serge, B. et al.) 3498–3505 (IEEE, 2012).
103. Khosla, A., Jayadevaprakash, N., Yao, B. & Li, F.-F. Novel dataset for fine-grained image categorization: Stanford Dogs. In Proc. CVPR Workshop on Fine-grained Visual Categorization (FGVC) 2 (2011).
104. Krause, J., Stark, M., Deng, J. & Fei-Fei, L. 3D object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops (eds Kyros, K. et al.) 554–561 (IEEE, 2013).
105. Maji, S., Rahtu, E., Kannala, J., Blaschko, M. & Vedaldi, A. Fine-grained visual classification of aircraft. Preprint at https://arxiv.org/abs/1306.5151 (2013).
106. Judd, T., Ehinger, K., Durand, F. & Torralba, A. Learning to predict where humans look. In 2009 IEEE 12th International Conference on Computer Vision (eds Roberto, C. et al.) 2106–2113 (IEEE, 2009).
107. Yue, Y. LeapLab: LeapLabTHU/AdaptiveNN: official release. Zenodo https://doi.org/10.5281/zenodo.16810996 (2025).
108. Caesar, H. et al. nuScenes: a multimodal dataset for autonomous driving. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition 11618–11628 (IEEE, 2020).
109. NIH chest X-ray dataset. Kaggle www.kaggle.com/datasets/nih-chest-xrays/data (2017).
110. Indian diabetic retinopathy image dataset (IDRiD). Kaggle www.kaggle.com/datasets/mohamedabdalkader/indian-diabetic-retinopathy-image-dataset-idrid (2018).
111. COCO 2017 dataset. Kaggle www.kaggle.com/datasets/awsaf49/coco-2017-dataset (2017).
112. CUB2002011 dataset. Kaggle www.kaggle.com/datasets/wenewone/cub2002011 (2011).
113. ImageNet-1k-valid dataset. Kaggle www.kaggle.com/datasets/sautkin/imagenet1kvalid (2015).
114. The Oxford-IIIT pet dataset. Kaggle www.kaggle.com/datasets/tanlikesmath/the-oxfordiiit-pet-dataset (2012).
115. Stanford cars (folder, crop, segment) dataset. Kaggle www.kaggle.com/datasets/senemanu/stanfordcarsfcs (2013).
116. FGVC aircraft dataset. Kaggle www.kaggle.com/datasets/seryouxblaster764/fgvc-aircraft (2013).
117. Awadalla, A. et al. OpenFlamingo: an open-source framework for training large autoregressive vision-language models. Preprint at https://arxiv.org/abs/2308.01390 (2023).


Acknowledgements

G.H. is supported by the National Key R&D Program of China under grant no. 2024YFB4708200, the National Natural Science Foundation of China under grant nos. U24B20173 and 62276150, and the Scientific Research Innovation Capability Support Project for Young Faculty under grant no. ZYGXQNJSKYCXNLZCXM-I20. S.S. is supported by the National Natural Science Foundation of China under grant no. 42327901. We thank S. Zhang, M. Yao and Y. Wu for helpful discussions and comments on an earlier version of this paper.

Author information

Author notes

These authors contributed equally: Yulin Wang (王语霖), Yang Yue (乐洋), Yang Yue (乐阳).

Authors and Affiliations

Department of Automation, Tsinghua University, Beijing, China

Yulin Wang (王语霖), Yang Yue (乐洋), Yang Yue (乐阳), Huanqian Wang, Haojun Jiang, Yizeng Han, Zanlin Ni, Yifan Pu, Minglei Shi, Rui Lu, Qisen Yang, Andrew Zhao, Zhuofan Xia, Shiji Song & Gao Huang
