Evolving From "Data Fusion" to "Native Architecture", SenseTime Releases NEO Architecture Redefining the Efficiency Boundaries of Multimodal Models
SenseTime has officially released and open-sourced NEO, a new multimodal model architecture co-developed with S-Lab of Nanyang Technological University, laying the architectural cornerstone for the next generation of SenseNova multimodal models.
As the industry's first usable natively integrated vision-language model (Native VLM), NEO is no longer constrained by the traditional "modular" paradigm. Designed from the ground up for multimodality, it achieves an across-the-board breakthrough in performance, efficiency, and versatility by fusing modalities at the core architectural level. NEO redefines the efficiency boundaries of multimodal models and marks the start of a "native architecture" era for multimodal AI.

Breaking bottlenecks through "native" architecture
Most mainstream multimodal models in the industry follow the modular paradigm of "visual encoder + projector + language model". While this extension of large language models (LLMs) makes image input possible, it remains essentially language-centric, fusing vision and language only at the data level. Such a "patchwork" modular design not only learns inefficiently, but also limits the model in complex multimodal scenarios, such as capturing fine image details or understanding intricate spatial structures.
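For reference, the conventional pipeline looks roughly like the sketch below. The module names and dimensions are generic placeholders rather than any specific model; the point is that vision is encoded in isolation and only concatenated with text at the LLM's input.

```python
import torch
import torch.nn as nn

class ModularVLM(nn.Module):
    """Schematic of the 'visual encoder + projector + language model' pattern."""
    def __init__(self, vis_dim=256, llm_dim=512):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoderLayer(
            d_model=vis_dim, nhead=4, batch_first=True)
        self.projector = nn.Linear(vis_dim, llm_dim)   # the "glue" between parts
        self.llm = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=4, batch_first=True)

    def forward(self, vis_tokens, text_embeds):
        vis = self.projector(self.vision_encoder(vis_tokens))
        # Fusion happens only here, at the data level: image features are
        # simply prepended to the text sequence.
        return self.llm(torch.cat([vis, text_embeds], dim=1))

out = ModularVLM()(torch.randn(1, 16, 256), torch.randn(1, 8, 512))
print(out.shape)  # torch.Size([1, 24, 512])
```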
SenseTime's NEO architecture aims to eliminate this pain point. In the second half of 2024, SenseTime became the first in China to achieve native multimodal fusion training, winning championships in both the SuperCLUE language evaluation and the OpenCompass multimodal evaluation with a single model. Building on this breakthrough, SenseTime launched SenseNova 6.0 with industry-leading multimodal reasoning capabilities.
With the release of SenseNova 6.5 in July 2025, SenseTime achieved early fusion at the encoder level, tripling the cost-performance ratio of its multimodal models and becoming the first in China to ship commercial-grade interleaved text-image reasoning. Now SenseTime has gone a step further, abandoning the traditional modular structure entirely and launching NEO, a native architecture designed from first principles.
Three core innovations for deep unification of vision and language
Built on the core concepts of extreme efficiency and deep integration, NEO natively handles vision and language in a unified way through underlying innovations in three key dimensions: the attention mechanism, position encoding, and semantic mapping.
Native Patch Embedding: Abandoning discrete image tokenizers, NEO builds a bottom-up continuous mapping from pixels to tokens through its native Patch Embedding Layer (PEL). This design captures more image detail, fundamentally breaking the image-modeling bottleneck of mainstream models.
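A minimal sketch of a continuous patch-embedding layer of this kind, assuming a standard strided-convolution formulation (the class name and hyperparameters are illustrative, not NEO's actual implementation):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Continuous pixels-to-tokens mapping: no discrete tokenizer or codebook."""
    def __init__(self, patch_size=14, in_chans=3, embed_dim=1024):
        super().__init__()
        # One strided convolution = a linear projection of each
        # non-overlapping patch straight into the token space.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels):                  # (B, 3, H, W)
        x = self.proj(pixels)                   # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, D)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```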
Native 3D Rotational Position Embedding (Native-RoPE): This innovation decouples frequency allocation across the three spatiotemporal dimensions, assigning high frequencies to the visual dimensions and low frequencies to the text dimension to match the natural structure of each modality. As a result, NEO not only captures the spatial structure of images accurately, but also extends seamlessly to complex scenarios such as video processing and cross-frame modeling.
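The announcement only states the direction of the split (high frequencies for vision, low for text), so the bases and dimension split in the sketch below are assumptions chosen for illustration:

```python
import torch

def rope_inv_freq(dim, base):
    # Standard RoPE inverse-frequency schedule: base^(-2i/dim), i = 0..dim/2-1.
    return base ** (-torch.arange(0, dim, 2).float() / dim)

head_dim = 96
# Text: one sequence axis with a large base, i.e. low frequencies that
# rotate slowly and preserve long-range order.
text_inv_freq = rope_inv_freq(head_dim, base=1_000_000)  # assumed base
# Vision: the head dimension is split across time/height/width, each with a
# smaller base, i.e. higher frequencies for fine spatial detail.
axis_dim = head_dim // 3
vis_inv_freq = {axis: rope_inv_freq(axis_dim, base=10_000)  # assumed base
                for axis in ("t", "h", "w")}

def rotate(x, pos, inv_freq):
    # Apply the rotary transform to features x at integer positions pos.
    angles = pos.float()[:, None] * inv_freq[None, :]     # (N, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), dim=-1).flatten(-2)
```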
Native Multi-Head Attention: Tailored to the characteristics of each modality, NEO lets autoregressive attention for text tokens and bidirectional attention for visual tokens coexist within a unified framework. This design greatly improves the model's use of spatial-structure correlations, better supporting complex mixed image-text understanding and reasoning.
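A minimal sketch of such a hybrid attention mask, assuming text tokens stay causal while tokens of the same image attend to each other bidirectionally (the sequence layout is made up for the example):

```python
import torch

def hybrid_attention_mask(span_id):
    # span_id: (L,) ints, -1 for text tokens, k >= 0 for tokens of image k.
    L = span_id.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))  # autoregressive text
    # Tokens belonging to the same image may attend to each other freely.
    same_image = (span_id[:, None] == span_id[None, :]) & (span_id[:, None] >= 0)
    return causal | same_image                               # True = may attend

# Example: 3 text tokens, a 4-token image, then 2 more text tokens.
span_id = torch.tensor([-1, -1, -1, 0, 0, 0, 0, -1, -1])
print(hybrid_attention_mask(span_id).int())
```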
In addition, combined with an innovative two-stage Pre-Buffer & Post-LLM fusion training strategy, NEO absorbs the full language reasoning capability of the original LLM while building strong visual perception from scratch, eliminating the language-capability degradation that plagues traditional cross-modal training.
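How the two stages are wired internally is not spelled out in the announcement; the sketch below shows one common way such a freeze-then-unfreeze fusion schedule is implemented, with hypothetical module names (pre_buffer, llm) and placeholder sizes:

```python
import torch.nn as nn

class NativeVLMSketch(nn.Module):
    # Hypothetical layout: new vision-facing "pre-buffer" layers feeding an
    # inherited LLM trunk; dimensions are placeholders.
    def __init__(self, dim=64):
        super().__init__()
        self.pre_buffer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                     batch_first=True)
        self.llm = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                              batch_first=True)

    def forward(self, tokens):
        return self.llm(self.pre_buffer(tokens))

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = NativeVLMSketch()
# Stage 1 ("Pre-Buffer"): freeze the inherited LLM so its language ability
# is preserved intact; train only the new vision-facing parameters.
set_trainable(model.llm, False)
set_trainable(model.pre_buffer, True)
# ... run stage-1 multimodal training here ...

# Stage 2 ("Post-LLM"): unfreeze everything for joint training once visual
# perception has been built up.
set_trainable(model.llm, True)
# ... run stage-2 joint training here ...
```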

Matching flagship-level performance with 1/10 of the data
Driven by architectural innovations, NEO demonstrates impressive data efficiency and performance advantages.
Superior Data Efficiency: Trained on only 390 million image-text pairs, roughly one tenth of the data required by industry models of equivalent performance, NEO develops top-tier visual perception capabilities. Without relying on massive data or an additional visual encoder, its concise architecture matches top modular flagship models such as Qwen2-VL and InternVL3 across multiple visual understanding tasks.
Outstanding and Balanced Performance: NEO scores highly on multiple authoritative public benchmarks, including MMMU, MMB, MMStar, SEED-I, and POPE, outperforming other native VLMs overall and showing that the native architecture sacrifices no precision.
Extreme Inference Cost-Performance: In the 0.6B-8B parameter range in particular, NEO holds significant advantages for edge deployment. It delivers a dual leap in precision and efficiency while greatly reducing inference cost, pushing the cost-performance ratio of multimodal visual perception to the extreme.
Open-source collaboration for next-generation AI infrastructure
Architecture is the "backbone" of a model, and only a solid backbone can support the future of multimodal technology. NEO's early-fusion design supports arbitrary-resolution and long-image input, extends seamlessly to cutting-edge fields such as video and embodied intelligence, and achieves true bottom-up, end-to-end integration.
From an application perspective, this end-to-end native-integration design provides solid technical support for diverse scenarios, including embodied robot interaction, multimodal responses on intelligent terminals, video understanding, and 3D interaction.

SenseTime has officially open-sourced NEO-based models in two sizes, 2B and 9B, to drive innovation and adoption of the native multimodal architecture in the open-source community.
Through open-source collaboration and real-world deployment, SenseTime is committed to making NEO a scalable, reusable next-generation AI infrastructure, moving native multimodal technology from the laboratory to broad industrial application and accelerating the establishment of industrial-grade standards for next-generation native multimodal technology.




