Breaking News in Technology & Business – Tech Geekwire

    Microsoft Unveils ‘Magma,’ a Groundbreaking AI Model Merging Language and Visual Understanding

    By techgeekwire, February 25, 2025

    Microsoft has developed Magma, a pioneering foundational model capable of comprehending both images and language, marking a significant step forward in multimodal AI. This innovation allows AI agents to execute tasks ranging from navigating user interfaces to controlling robots.

    Developed by researchers from Microsoft Research, the University of Maryland, the University of Wisconsin-Madison, KAIST, and the University of Washington, Magma stands as the first foundational model designed to interpret and contextualize multimodal inputs within its environment.

    Bridging Verbal and Spatial Intelligence

    Magma integrates verbal and spatial intelligence, so it can formulate plans and execute actions to achieve specified goals. Microsoft highlights that the model extends the capabilities of vision-language (VL) models, preserving their verbal understanding while also enabling them to plan and act in the visual-spatial world. This allows Magma to perform complex agentic tasks, such as UI navigation and direct robot manipulation.

    [Figure: Magma at a glance]

    “Magma effectively transfers knowledge from publicly available visual and language data, connecting verbal, spatial, and temporal intelligence to navigate complex tasks and settings,” Microsoft explains.

    Training Magma

    Magma was pre-trained on vast and varied VL datasets that include images, videos, and robotics data. Researchers used a method called Set-of-Mark (SoM) to label actionable items in images – for example, clickable buttons in a graphical user interface. They also used Trace-of-Mark (ToM) to label movements in videos, such as the trajectory of a robotic arm.

    What sets Magma apart is its acquisition of spatial intelligence, learned from extensive training data through SoM and ToM. Microsoft reports that Magma achieves new state-of-the-art results in tasks such as UI navigation and robotic manipulation, outperforming previous models that are specifically designed for these tasks.

    “On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets,” Microsoft adds.

    To train Magma, researchers divided text into smaller units (tokens). Images and videos from different sources were encoded by a shared vision encoder, which converted visual information into a format the model could process. These discrete and continuous tokens were then fed into a large language model (LLM) to generate outputs in verbal, spatial, and action formats.
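
    The input pipeline described above can be pictured with a toy sketch. This is not Magma's code: the function names, the embedding width, and the way patches are "encoded" are all assumptions made for illustration. It only shows the general idea of turning text into discrete token ids, turning image patches into continuous vectors through one shared encoder, and joining both into a single sequence for a language model.

```python
# Illustrative sketch only: every name and size here is hypothetical,
# not taken from Magma's code.
from typing import Dict, List

EMBED_DIM = 4  # toy embedding width (assumption)


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Split text on whitespace and map each word to a discrete token id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]


def embed_text(token_ids: List[int]) -> List[List[float]]:
    """Toy lookup table: turn each discrete id into a fixed-width vector."""
    return [[float(tid + 1)] * EMBED_DIM for tid in token_ids]


def encode_image(patches: List[List[float]]) -> List[List[float]]:
    """Stand-in for a shared vision encoder: one continuous vector per patch."""
    encoded = []
    for patch in patches:
        mean = sum(patch) / len(patch)
        encoded.append([mean] * EMBED_DIM)
    return encoded


def build_sequence(text: str, image_patches: List[List[float]],
                   vocab: Dict[str, int]) -> List[List[float]]:
    """Concatenate visual and text embeddings into one input sequence."""
    return encode_image(image_patches) + embed_text(tokenize(text, vocab))


vocab: Dict[str, int] = {}
sequence = build_sequence(
    "click the search button",
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # two toy image patches
    vocab,
)
print(len(sequence))  # 2 visual vectors + 4 text vectors
```

    In a real system the lookup table and the vision encoder would be learned networks; the sketch only mirrors the data flow.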

    [Figure: Set-of-Mark (SoM) for Action Grounding]

    SoM is used to ground actions across diverse data types. In the image above, SoM prompting enables effective action grounding in images for UI screenshots (left), robot manipulation (middle), and human video (right) by having the model predict numeric marks for clickable buttons or robot arms in image space.
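
    As a rough illustration of the mark-based grounding described above, the sketch below assigns numeric marks to candidate regions of a screenshot and resolves a predicted mark back to a click position. The boxes, the helper names, and the "predicted" mark are hypothetical; the article does not describe how candidate regions are detected or how Magma's output is decoded.

```python
# Illustrative sketch only: helper names and data are hypothetical.
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Give each candidate region a numeric mark, starting at 1."""
    return {mark: box for mark, box in enumerate(boxes, start=1)}


def mark_to_click(marks: Dict[int, Box], predicted_mark: int) -> Tuple[int, int]:
    """Resolve a predicted mark to the center pixel of its region."""
    left, top, right, bottom = marks[predicted_mark]
    return ((left + right) // 2, (top + bottom) // 2)


# Hypothetical candidate regions found on a UI screenshot.
candidate_boxes = [(10, 10, 110, 50), (10, 70, 110, 110), (200, 10, 260, 40)]
marks = assign_marks(candidate_boxes)

# Suppose the model answers "mark 2" for the instruction "open settings".
click_point = mark_to_click(marks, predicted_mark=2)
print(click_point)  # (60, 90)
```

    The point of the design is that the model only has to choose among a few numbered marks rather than output raw pixel coordinates.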

    [Figure: Trace-of-Mark (ToM) for Action Planning]

    ToM helps label and interpret movements in videos and robotics data. The above image shows ToM supervision for robot manipulation (left) and human action (right). This helps the model comprehend video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture action-related dynamics.
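
    The trace idea described above can be sketched in a few lines. The tracks, the helper name, and the token counts below are assumptions chosen for illustration, not figures from Microsoft: the sketch collects the future positions of each mark and compares a rough token budget against predicting whole future frames.

```python
# Illustrative sketch only: data, names, and token counts are assumptions.
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def future_traces(tracks: Dict[int, List[Point]], t: int,
                  horizon: int) -> Dict[int, List[Point]]:
    """For each mark, return its positions over the `horizon` frames after t."""
    return {mark: pts[t + 1 : t + 1 + horizon] for mark, pts in tracks.items()}


# Hypothetical per-frame positions of two marks (e.g. a gripper and an object).
tracks = {
    1: [(0, 0), (2, 1), (4, 2), (6, 3), (8, 4)],
    2: [(5, 5), (5, 5), (5, 4), (5, 3), (5, 2)],
}
traces = future_traces(tracks, t=0, horizon=3)
print(traces[1])  # [(2, 1), (4, 2), (6, 3)]

# Rough token budget under assumed numbers: two coordinate tokens per point,
# versus an assumed 256 visual tokens for each predicted future frame.
trace_tokens = sum(len(pts) * 2 for pts in traces.values())
frame_tokens = 3 * 256
print(trace_tokens, frame_tokens)  # 12 768
```

    Under these assumed numbers, a handful of coordinates per mark stands in for hundreds of tokens per predicted frame, which is the kind of saving the comparison with next-frame prediction refers to.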

    Real-World Capabilities

    Microsoft also presented a zero-shot evaluation of Magma’s agentic intelligence, emphasizing its ability to handle a full range of tasks. In UI navigation, the model successfully performed actions like checking the weather, enabling flight mode, sharing files, and texting specific individuals.

    [Figure: Magma multimodal agentic foundation]

    In robot manipulation, Microsoft claims that Magma consistently outperformed OpenVLA (fine-tuned) in soft object manipulation and pick-and-place operations. The model demonstrated reliable performance in both in-distribution and out-of-distribution generalization tasks on real robots.

    In spatial reasoning assessments, Microsoft notes that Magma surpasses GPT-4o, answering spatial reasoning questions relatively well, despite using less pretraining data. Furthermore, in multimodal understanding, Magma performed competitively and even outperformed some state-of-the-art approaches, like Video-Llama2 and ShareGPT4Video, on most benchmarks despite using less video instruction tuning data.
