Within the realm of artificial intelligence, the emergence of powerful autoregressive (AR) large language models (LLMs), such as the GPT series, has marked a major milestone. Despite facing challenges such as hallucinations, these models are hailed as substantial strides toward artificial general intelligence (AGI). Their effectiveness lies in their self-supervised learning strategy, which involves predicting the next token in a sequence. Studies have underscored their scalability and generalizability, which enable them to adapt to diverse, unseen tasks through zero-shot and few-shot learning. These characteristics position AR models as promising candidates for learning from vast amounts of unlabeled data, encapsulating the essence of AGI.
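The next-token objective behind these models is simple to state. Below is a minimal sketch, in PyTorch, of the autoregressive cross-entropy loss; the `model` argument and tensor shapes are illustrative placeholders rather than details taken from any specific GPT implementation.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Standard autoregressive objective: predict token t+1 from tokens <= t.

    token_ids: LongTensor of shape (batch, seq_len) holding discrete token IDs.
    model:     any causal transformer returning logits of shape
               (batch, seq_len, vocab_size). Both are illustrative placeholders.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                      # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # flatten batch and time dims
        targets.reshape(-1),
    )
```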
Concurrently, the field of computer vision has been exploring the potential of large autoregressive or world models to replicate the scalability and generalizability witnessed in language models. Efforts such as VQGAN and DALL-E, alongside their successors, have showcased the potential of AR models in image generation. These models use a visual tokenizer to discretize continuous images into 2D tokens and then flatten them into a 1D sequence for AR learning. Nevertheless, despite these advancements, the scaling laws of such models remain largely unexplored, and their performance significantly lags behind that of diffusion models.
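The "tokenize then flatten" recipe described above can be sketched as follows; the encoder, codebook size, and nearest-neighbour quantization shown here are simplified placeholders, not the exact configurations used by VQGAN or DALL-E.

```python
import torch

def image_to_token_sequence(image, encoder, codebook):
    """Sketch of the tokenize-and-flatten recipe used by VQGAN/DALL-E-style models.

    image:    (batch, 3, H, W) tensor.
    encoder:  CNN that downsamples to a (batch, dim, h, w) feature map (placeholder).
    codebook: (vocab_size, dim) embedding table for nearest-neighbour quantization.
    """
    feats = encoder(image)                            # (B, dim, h, w)
    B, D, h, w = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(-1, D)   # (B*h*w, dim)
    # Nearest codebook entry at every spatial position -> discrete 2D token map.
    dists = torch.cdist(flat, codebook)               # (B*h*w, vocab_size)
    tokens_2d = dists.argmin(dim=-1).reshape(B, h, w)
    # Flatten the 2D grid into a 1D sequence (raster order) for next-token AR training.
    return tokens_2d.reshape(B, h * w)
```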
To address this gap, researchers at Peking University have proposed a novel AI approach to autoregressive learning for images, termed Visual AutoRegressive (VAR) modeling. Inspired by the hierarchical nature of human perception and the design principles of multi-scale systems, VAR introduces a "next-scale prediction" paradigm. In VAR, images are encoded into multi-scale token maps, and the autoregressive process begins from a low-resolution token map and progressively expands to higher resolutions. The method, which leverages a GPT-2-like transformer architecture, significantly improves on AR baselines, especially on the ImageNet 256×256 benchmark.
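A simplified view of this coarse-to-fine loop: generation starts from the smallest token map, and each autoregressive step predicts all tokens of the next, larger map conditioned on every coarser map generated so far. The function signature and scale schedule below are hypothetical; the actual VAR implementation differs in details such as its residual multi-scale VQ tokenizer.

```python
import torch

def generate_next_scale(transformer, scales=(1, 2, 4, 8, 16), vocab=4096):
    """Hypothetical sketch of coarse-to-fine 'next-scale prediction'.

    transformer: model that, given all previously generated token maps (flattened
                 and concatenated) plus the side length of the next map, returns
                 logits of shape (batch, side*side, vocab). Placeholder API.
    scales:      side lengths of the square token maps, from low to high resolution.
    """
    generated = []                                     # token maps, coarse to fine
    context = torch.empty(1, 0, dtype=torch.long)      # nothing generated yet
    for side in scales:
        logits = transformer(context, next_side=side)  # (1, side*side, vocab)
        probs = torch.softmax(logits, dim=-1)
        # Unlike next-token AR, all side*side tokens of this scale are sampled in one step.
        tokens = torch.multinomial(probs.reshape(-1, vocab), 1).reshape(1, side * side)
        generated.append(tokens.reshape(1, side, side))
        context = torch.cat([context, tokens], dim=1)  # condition on all coarser scales
    return generated  # list of token maps; the finest one is decoded back to an image
```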
The empirical validation of VAR models has revealed scaling laws akin to those observed in LLMs, highlighting their potential for further advancement and application to various tasks. Notably, VAR models have showcased zero-shot generalization capabilities in tasks such as image in-painting, out-painting, and editing. This breakthrough not only signifies a leap in visual autoregressive model performance but also marks the first instance of GPT-style autoregressive methods surpassing strong diffusion models in image synthesis.
In conclusion, the contributions outlined in their work encompass a new visual generative framework employing a multi-scale autoregressive paradigm, empirical validation of its scaling laws and zero-shot generalization potential, significant advancements in visual autoregressive model performance, and the release of a comprehensive open-source code suite. These efforts aim to propel the advancement of visual autoregressive learning, bridging the gap between language models and computer vision and unlocking new possibilities in artificial intelligence research and applications.
Check out the Paper and Code. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter. Join our Telegram Channel, Discord Channel, and LinkedIn Group.
If you like our work, you will love our newsletter.
Don't forget to join our 40k+ ML SubReddit.
Want to get in front of a 1.5 Million+ AI audience? Work with us here.
Arshad is an intern at MarktechPost. He is currently pursuing his Int. MSc in Physics at the Indian Institute of Technology Kharagpur. Understanding things at a fundamental level leads to new discoveries, which in turn lead to advancements in technology. He is passionate about understanding nature at a fundamental level with the help of tools like mathematical models, ML models, and AI.