AI Tool Generates High-Quality Images Faster
By Adam Zewe, MIT News
Researchers from MIT and NVIDIA have developed a new artificial intelligence tool capable of generating high-quality images more quickly and efficiently than existing methods. This innovative approach combines the strengths of two popular AI model types, promising advancements in various fields.

Researchers combined two types of generative AI models, an autoregressive model and a diffusion model, to create a tool that leverages the best of each model to rapidly generate high-quality images. Credit: Christine Daniloff, MIT; image of astronaut on horseback courtesy of the researchers

The new image generator, called HART (short for Hybrid Autoregressive Transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster. Credit: Courtesy of the researchers
This new tool, called HART (Hybrid Autoregressive Transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models. However, it operates about nine times faster and consumes fewer computational resources. This allows HART to run locally on a standard laptop or smartphone. Users simply enter a natural language prompt into HART to generate an image.
“If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART,” says Haotian Tang SM ’22, PhD ’25, co-lead author of a new paper on HART.
The Challenge of Image Generation
The ability to generate high-quality images quickly is crucial for producing realistic simulated environments. These environments can be used to train self-driving cars to avoid unpredictable hazards, enhancing their safety on real streets. However, generative AI techniques currently used for this purpose have limitations. Diffusion models can create very realistic images, but they are slow and computationally intensive.
Autoregressive models, on the other hand, like those used in LLMs such as ChatGPT, are much faster, but they often produce lower-quality images.
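To see why the two families trade off speed and quality this way, consider a toy sketch (our own illustration with stand-in models and typical step counts, not the researchers' code):

```python
# Toy contrast between the two generative model families (illustrative only).
# `denoise_step` and `predict_next_token` are stand-in callables.
import numpy as np

def diffusion_generate(denoise_step, shape=(256, 256, 3), num_steps=50):
    """Diffusion: start from noise and refine EVERY pixel at EVERY step."""
    image = np.random.randn(*shape)          # pure noise
    for t in reversed(range(num_steps)):     # dozens of full-image passes
        image = denoise_step(image, t)       # predict and subtract the noise
    return image                             # realistic, but slow to produce

def autoregressive_generate(predict_next_token, num_tokens=1024):
    """Autoregressive: predict compressed image tokens one after another."""
    tokens = []
    for _ in range(num_tokens):              # a single pass per token
        tokens.append(predict_next_token(tokens))
    return tokens                            # fast, but detail is lost in compression
```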
A Hybrid Approach: HART
To overcome these limitations, the MIT and NVIDIA researchers combined the best features of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the broad outlines of an image and then a small diffusion model to refine the details.
This division of labor is what gives HART its roughly nine-fold speedup over state-of-the-art diffusion models at matching or better image quality, and its smaller computational footprint is what lets it run locally on a commercial laptop or smartphone.
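As a concrete sketch of that division of labor, the pipeline might look like the following (the component names here are our own illustration, not the researchers' actual API):

```python
# A minimal sketch of HART's two-stage generation, assuming hypothetical
# components; the names below are illustrative, not the paper's actual API.
def hart_generate(prompt, autoregressive_model, diffusion_model, autoencoder):
    # Stage 1: the autoregressive transformer quickly drafts the "big picture"
    # as a sequence of compressed, discrete image tokens.
    discrete_tokens = autoregressive_model.draft_tokens(prompt)

    # Stage 2: a small diffusion model predicts residual tokens that restore
    # the high-frequency detail the discrete tokens cannot represent.
    residual_tokens = diffusion_model.refine(discrete_tokens, prompt)

    # The autoencoder's decoder maps the combined tokens back to pixels.
    return autoencoder.decode(discrete_tokens, residual_tokens)
```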
Applications and Impact
HART has the potential to be applied in various fields. It could assist researchers training robots to perform complex real-world tasks and aid designers in creating striking scenes for video games.
HART’s efficiency and compatibility with existing models open up further possibilities. For example, it can be integrated with unified vision-language generative models; in the future, users might interact with such a model by asking it to show the intermediate steps involved in a task like assembling a piece of furniture. Efficient image-generation models, Tang notes, could unlock many such possibilities.
Technical Details
Popular diffusion models, such as Stable Diffusion and DALL-E, are known for producing highly detailed images. They generate them through an iterative process: at each step, the model predicts the noise in every pixel and subtracts it. Because all pixels are processed at every step, generation is slow and computationally expensive.

Autoregressive models, by contrast, predict an image as a sequence of patches and are much faster. They work on tokens rather than raw pixels: an autoencoder compresses the image into discrete tokens, the model predicts those tokens one after another, and the autoencoder’s decoder reconstructs the image from them. Because the compression discards some information, however, these discrete tokens struggle to represent fine detail.

With HART, the researchers developed a hybrid approach: an autoregressive model first predicts the compressed, discrete image tokens, and then a small diffusion model predicts residual tokens that capture the high-frequency detail the discrete tokens miss.
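The notion of a residual token can be made concrete with a toy example. The sketch below uses a simple nearest-neighbor codebook quantizer as a stand-in for HART's learned autoencoder (our own simplification): whatever the discrete tokens fail to capture is exactly the residual the small diffusion model learns to predict.

```python
# Toy illustration of "residual tokens", assuming a numpy codebook-based
# quantizer; HART's actual tokenizer is a learned autoencoder.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 16))    # 512 discrete codes, 16-dim each

def quantize(latents):
    """Snap each continuous latent vector to its nearest codebook entry."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)               # discrete token ids
    return ids, codebook[ids]                # ids plus their quantized vectors

continuous = rng.standard_normal((64, 16))   # encoder output for 64 patches
token_ids, quantized = quantize(continuous)

# The residual is exactly the information discretization throws away; in HART,
# a small diffusion model learns to predict it and restore fine detail.
residual = continuous - quantized
print(token_ids[:8], float(np.abs(residual).mean()))
```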
“We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person’s hair, eyes, or mouth. These are places where discrete tokens can make mistakes,” says Tang. The use of the small diffusion model as the final step of the process allows HART to maintain the speed advantages of autoregressive models.
HART’s method uses an autoregressive transformer model with 700 million parameters, yet it generates images whose quality is comparable to that of larger diffusion models, about nine times faster and with roughly 31 percent less computation than state-of-the-art models.
Funding and Collaboration
This research was supported, in part, by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation. NVIDIA provided the GPU infrastructure for training the model.
Originally published March 21, 2025