Enhancing Amazon’s Just Walk Out Technology with Multi-Modal AI
Since its introduction in 2018, Amazon’s Just Walk Out technology has redefined the retail experience. Customers can enter a store, select their items, and leave without waiting in line to pay. This checkout-free technology is available in over 180 locations worldwide, spanning retail environments such as sports stadiums, entertainment venues, and convenience stores.
The Just Walk Out system automatically identifies the products each customer selects, generating digital receipts and eliminating checkout lines. This post highlights the latest advancements in Just Walk Out technology, powered by a sophisticated multi-modal foundation model (FM).
We’ve engineered this multi-modal FM for physical stores, using a transformer-based architecture that is similar to the technology behind many generative AI applications. This model leverages data from multiple sources, including a network of overhead video cameras, advanced weight sensors on shelves, digital floor plans, and product catalog images, to generate highly accurate shopping receipts.
In essence, a multi-modal model processes data from multiple input channels. Our investment in state-of-the-art multi-modal FMs enables the Just Walk Out system to be deployed in diverse shopping environments with improved accuracy and efficiency. Much as large language models (LLMs) generate text, the new system is engineered to generate a precise sales receipt for every shopper who visits the store.
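To make the idea concrete, here is a minimal PyTorch sketch of multi-modal fusion: per-modality features (camera views, weight sensors, catalog images) are projected into a shared token space and passed through a transformer backbone. All dimensions, modality splits, and layer counts are illustrative assumptions, not details of the production model.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Toy sketch: project per-modality features into a shared token space.

    Dimensions and modality names are illustrative, not Amazon's actual design.
    """
    def __init__(self, d_model=256):
        super().__init__()
        # One linear projection per input channel (camera, weight, catalog).
        self.camera_proj = nn.Linear(512, d_model)   # per-view video features
        self.weight_proj = nn.Linear(16, d_model)    # shelf weight-sensor readings
        self.catalog_proj = nn.Linear(128, d_model)  # catalog image embeddings
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)

    def forward(self, camera, weight, catalog):
        # Each input: (batch, seq_len_i, feature_dim_i) -> shared (batch, seq, d_model)
        tokens = torch.cat([
            self.camera_proj(camera),
            self.weight_proj(weight),
            self.catalog_proj(catalog),
        ], dim=1)
        return self.backbone(tokens)  # fused multi-modal tokens

# Example: fuse 8 camera tokens, 4 weight tokens, and 2 catalog tokens.
model = MultiModalFusion()
fused = model(torch.randn(1, 8, 512), torch.randn(1, 4, 16), torch.randn(1, 2, 128))
print(fused.shape)  # torch.Size([1, 14, 256])
```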
The Challenge: Addressing Complex Shopping Scenarios
Just Walk Out stores present a unique technical challenge because of their checkout-free environment: retailers and shoppers alike demand nearly perfect checkout accuracy. Meeting that bar means handling unusual shopping behaviors that generate complex activity sequences, which require significant effort to analyze.
Previous iterations of the Just Walk Out system used a modular architecture, breaking down the shopper’s visit into distinct tasks such as shopper interaction detection, item tracking, product identification, and quantity counting. These components were integrated into sequential pipelines. While this approach delivered highly accurate receipts, addressing challenges in new or complex shopping scenarios demanded substantial engineering effort, which limited scalability.
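For contrast, the earlier modular design can be schematized as a chain of independently engineered stages, each feeding the next. The stage names and payloads below are illustrative placeholders, not the actual internal interfaces.

```python
# Schematic of the earlier modular design: separately engineered components
# chained into a sequential pipeline.

def detect_interactions(video_frames):
    # e.g., find (shopper, shelf) interaction events in the video
    return [{"shopper": "A", "shelf": 3}]

def track_items(interactions):
    # e.g., associate each interaction with a moved item
    return [{"shopper": "A", "item_track": 17}]

def identify_products(tracks):
    # e.g., match tracked items against the product catalog
    return [{"shopper": "A", "product": "sparkling-water"}]

def count_quantities(identified):
    # e.g., aggregate per-shopper, per-product counts into a receipt
    receipt = {}
    for event in identified:
        key = (event["shopper"], event["product"])
        receipt[key] = receipt.get(key, 0) + 1
    return receipt

def modular_pipeline(video_frames):
    # Sequential hand-offs: an error in any stage propagates downstream, and
    # a new shopping scenario may require re-engineering several stages.
    return count_quantities(identify_products(track_items(detect_interactions(video_frames))))

print(modular_pipeline(video_frames=[]))  # {('A', 'sparkling-water'): 1}
```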
The Solution: Just Walk Out Multi-Modal AI
To overcome these challenges, we introduced a new multi-modal FM designed for retail store environments, empowering Just Walk Out technology to handle complex real-world shopping situations. The new FM improves the system’s ability to generalize to new store formats, new products, and customer behaviors—key to scaling Just Walk Out technology.
Continuous learning is another advantage: the model automatically adapts and learns from new and challenging scenarios, helping it maintain high performance even as retail environments evolve. The system combines end-to-end learning with enhanced generalization, allowing it to handle a broader array of dynamic and complex retail settings and enabling a more user-friendly checkout-free shopping experience.
The core elements of the Just Walk Out model include:
- Flexible data inputs: The system tracks how shoppers interact with products and fixtures. It relies primarily on multi-view video feeds, supplemented by weight sensors on shelves to track small items. The model maintains a 3D digital representation of the store and cross-references catalog images to identify products, even when items are returned to shelves in the wrong place.
- Multi-modal AI tokens to represent shoppers’ journeys: Encoders process the multi-modal inputs and compress them into transformer tokens, the foundational units of the receipt model. These tokens enable the model to interpret hand movements, distinguish between items, and count items picked up or returned to the shelf quickly and accurately.
- Continuously updating receipts: The system uses these tokens to generate digital receipts, distinguishing between shopper sessions and dynamically updating each receipt as items are selected or returned (see the sketch after this list).
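As a rough illustration of the third element, the sketch below shows how decoded “take” and “return” events might maintain a per-session receipt. In the real system such events would come from the model’s token predictions; here they are hand-written stand-ins.

```python
from collections import defaultdict

def update_receipt(receipt, event):
    """Apply a single decoded shopper event to a running receipt."""
    product, qty = event["product"], event["quantity"]
    if event["action"] == "take":
        receipt[product] += qty
    elif event["action"] == "return":
        # Returns decrement the count, never below zero.
        receipt[product] = max(0, receipt[product] - qty)
    return receipt

# One receipt per shopper session; it updates dynamically as events arrive.
sessions = defaultdict(lambda: defaultdict(int))
events = [
    ("session-1", {"action": "take", "product": "trail-mix", "quantity": 2}),
    ("session-1", {"action": "return", "product": "trail-mix", "quantity": 1}),
    ("session-2", {"action": "take", "product": "cold-brew", "quantity": 1}),
]
for session_id, event in events:
    update_receipt(sessions[session_id], event)

print(dict(sessions["session-1"]))  # {'trail-mix': 1}
```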
Training the Just Walk Out FM
Fed large amounts of multi-modal data, the Just Walk Out FM learned to consistently generate, or technically to “predict”, accurate receipts. To improve accuracy, we designed more than 10 auxiliary tasks, such as detection, tracking, image segmentation, grounding (connecting abstract concepts to real-world objects), and activity recognition, all within a single model. These tasks enhanced the model’s ability to work in new store setups, with new products, and with new customer behaviors, which is integral to expanding Just Walk Out technology to new locations.
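A common way to combine auxiliary tasks like these within a single model is a weighted multi-task loss, sketched below. The task names and weights are hypothetical, not the actual training configuration.

```python
import torch

def total_loss(receipt_loss: torch.Tensor,
               aux_losses: dict[str, torch.Tensor],
               aux_weights: dict[str, float]) -> torch.Tensor:
    # Receipt prediction is the primary objective; auxiliary tasks
    # (detection, tracking, segmentation, grounding, activity recognition, ...)
    # contribute weighted extra terms that shape the shared representation.
    loss = receipt_loss
    for name, aux in aux_losses.items():
        loss = loss + aux_weights.get(name, 1.0) * aux
    return loss

loss = total_loss(
    receipt_loss=torch.tensor(0.42),
    aux_losses={"detection": torch.tensor(0.8), "tracking": torch.tensor(0.5)},
    aux_weights={"detection": 0.3, "tracking": 0.1},
)
print(loss)  # tensor(0.7100)
```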
To achieve the most accurate results, model training relies on carefully curated data fed to selected algorithms. We accelerated training by employing a data flywheel: a continuous, self-reinforcing cycle of data mining and labeling designed to integrate iterative improvements with minimal human intervention.
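Schematically, one flywheel iteration might look like the following; the confidence scoring, auto-labeling, and retraining functions are simplified stand-ins for the real components.

```python
import random

def model_confidence(session):
    return session["confidence"]           # stand-in for a real model score

def auto_label(session):
    return {**session, "label": "auto"}    # stand-in for auto-labeling models

def retrain(training_set):
    print(f"retraining on {len(training_set)} examples")

def flywheel_iteration(session_pool, training_set, threshold=0.9):
    # 1. Mine: keep the sessions the current model finds hardest.
    hard_cases = [s for s in session_pool if model_confidence(s) < threshold]
    # 2. Auto-label them with minimal human intervention.
    training_set.extend(auto_label(s) for s in hard_cases)
    # 3. Retrain so the next pass surfaces new hard cases.
    retrain(training_set)

pool = [{"id": i, "confidence": random.random()} for i in range(100)]
flywheel_iteration(pool, training_set=[])
```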
To address the massive data volume required for training high-capacity neural networks, we built the infrastructure for the Just Walk Out model on Amazon Web Services (AWS). We used Amazon Simple Storage Service (Amazon S3) for data storage and Amazon SageMaker for training.
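For a sense of what this looks like in practice, here is a minimal sketch of launching a training job with the SageMaker Python SDK. The script name, IAM role, S3 path, and instance settings are placeholders, not the actual configuration.

```python
from sagemaker.pytorch import PyTorch

# Hypothetical training-job launch; all values below are placeholders.
estimator = PyTorch(
    entry_point="train.py",             # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    framework_version="2.2",
    py_version="py310",
    instance_count=4,                   # distributed training across instances
    instance_type="ml.p4d.24xlarge",    # GPU instances for a high-capacity model
)

# Training data is read from, and model artifacts are written back to, Amazon S3.
estimator.fit({"train": "s3://my-bucket/just-walk-out/training-data/"})
```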
Key steps in the FM training process:
- Selecting challenging data sources: We focused training data on tricky shopping scenarios that tested the system’s limits. Though rare, these cases proved especially valuable because they let the model learn from its mistakes.
- Leveraging auto-labeling: We built algorithms and models to attach meaningful labels to data automatically, improving operational efficiency. These labeling pipelines cover not only receipt prediction but also the auxiliary tasks described earlier.
- Pre-training the model: The FM was pre-trained on a diverse collection of multi-modal data to enhance its ability to generalize to new store settings.
- Fine-tuning the model: We further refined the model and applied quantization techniques to produce a smaller, more efficient model for edge computing (a quantization sketch follows this list).
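As one example of the kind of quantization mentioned in the last step, PyTorch’s post-training dynamic quantization stores linear-layer weights as int8 for a smaller, faster model. The toy model below stands in for the fine-tuned FM; this is not necessarily the technique used in production.

```python
import os
import torch
import torch.nn as nn

# Toy stand-in for the fine-tuned FM.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))

# Convert Linear weights to int8 with post-training dynamic quantization.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m, path="/tmp/model.pt"):
    torch.save(m.state_dict(), path)
    return os.path.getsize(path) / 1e6

# The int8 copy should be roughly 4x smaller than the fp32 original.
print(f"fp32: {size_mb(model):.2f} MB, int8: {size_mb(quantized):.2f} MB")
```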
As the data flywheel continues, more difficult cases will inform the training set, increasing the model’s accuracy and applicability in new retail environments.

Conclusion
This multi-modal AI system represents a substantial step forward for Just Walk Out technology. This innovative approach shifts from modular AI, which depends on human-defined subcomponents and interfaces, to end-to-end trained AI systems that are simpler and more scalable. Multi-modal AI is improving the shopping experience in more Just Walk Out stores worldwide.

About the Authors
Tian Lan is a Principal Scientist at AWS.
Chris Broaddus is a Senior Manager at AWS.