Evolving From "Data Fusion" to "Native Architecture", SenseTime Releases NEO Architecture Redefining the Efficiency Boundaries of Multimodal Models
SenseTime has officially released and open-sourced NEO, a new multimodal model architecture co-developed with S-Lab of Nanyang Technological University, laying the architectural cornerstone for the next generation of SenseNova multimodal models.
As the industry's first usable natively integrated vision-language model (Native VLM), NEO is no longer constrained by the traditional "modular" paradigm. Designed from the ground up for multimodality, it achieves an across-the-board breakthrough in performance, efficiency, and versatility by fusing modalities at the core architectural level. NEO redefines the efficiency boundaries of multimodal models and marks the start of a "native architecture" era for multimodal AI.

Breaking bottlenecks through "native" architecture
Most mainstream multimodal models in the industry follow the modular paradigm of "visual encoder + projector + language model". While this extension of large language models (LLMs) makes image input possible, it remains essentially language-centric, fusing vision and language only at the data level. Such a "patchwork" modular design not only learns inefficiently, but also limits the model in complex multimodal scenarios, such as capturing fine image details or understanding intricate spatial structures.
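For reference, the conventional pipeline looks roughly like the sketch below. The module names and dimensions are generic placeholders rather than any specific model; the point is that vision is encoded in isolation and only concatenated with text at the LLM's input.

```python
import torch
import torch.nn as nn

class ModularVLM(nn.Module):
    """Schematic of the 'visual encoder + projector + language model' pattern."""
    def __init__(self, vis_dim=256, llm_dim=512):
        super().__init__()
        self.vision_encoder = nn.TransformerEncoderLayer(
            d_model=vis_dim, nhead=4, batch_first=True)
        self.projector = nn.Linear(vis_dim, llm_dim)   # the "glue" between parts
        self.llm = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=4, batch_first=True)

    def forward(self, vis_tokens, text_embeds):
        vis = self.projector(self.vision_encoder(vis_tokens))
        # Fusion happens only here, at the data level: image features are
        # simply prepended to the text sequence.
        return self.llm(torch.cat([vis, text_embeds], dim=1))

out = ModularVLM()(torch.randn(1, 16, 256), torch.randn(1, 8, 512))
print(out.shape)  # torch.Size([1, 24, 512])
```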
SenseTime's NEO architecture aims to eliminate this pain point. In the second half of 2024, SenseTime became the first in China to achieve native multimodal fusion training, winning championships in both the SuperCLUE language evaluation and the OpenCompass multimodal evaluation with a single model. Building on this breakthrough, SenseTime launched SenseNova 6.0 with industry-leading multimodal reasoning capabilities.
With the release of SenseNova 6.5 in July 2025, SenseTime achieved early fusion at the encoder level, tripling the cost-performance ratio of its multimodal models and becoming the first in China to ship commercial-grade interleaved text-image reasoning. Now SenseTime has gone a step further, abandoning the traditional modular structure entirely and launching NEO, a native architecture designed from first principles.
Three core innovations for deep unification of vision and language
Built on the core concepts of extreme efficiency and deep integration, NEO natively handles vision and language in a unified way through underlying innovations in three key dimensions: the attention mechanism, position encoding, and semantic mapping.
Native Patch Embedding: Abandoning discrete image tokenizers, NEO builds a bottom-up continuous mapping from pixels to tokens through its native Patch Embedding Layer (PEL). This design captures more image detail, fundamentally breaking the image-modeling bottleneck of mainstream models.
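A minimal sketch of a continuous patch-embedding layer of this kind, assuming a standard strided-convolution formulation (the class name and hyperparameters are illustrative, not NEO's actual implementation):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Continuous pixels-to-tokens mapping: no discrete tokenizer or codebook."""
    def __init__(self, patch_size=14, in_chans=3, embed_dim=1024):
        super().__init__()
        # One strided convolution = a linear projection of each
        # non-overlapping patch straight into the token space.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixels):                  # (B, 3, H, W)
        x = self.proj(pixels)                   # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, D)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 256, 1024])
```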
Native 3D Rotational Position Embedding (Native-RoPE): This innovation decouples frequency allocation across the three spatiotemporal dimensions, assigning high frequencies to the visual dimensions and low frequencies to the text dimension to match the natural structure of each modality. As a result, NEO not only captures the spatial structure of images accurately, but also extends seamlessly to complex scenarios such as video processing and cross-frame modeling.
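The announcement only states the direction of the split (high frequencies for vision, low for text), so the bases and dimension split in the sketch below are assumptions chosen for illustration:

```python
import torch

def rope_inv_freq(dim, base):
    # Standard RoPE inverse-frequency schedule: base^(-2i/dim), i = 0..dim/2-1.
    return base ** (-torch.arange(0, dim, 2).float() / dim)

head_dim = 96
# Text: one sequence axis with a large base, i.e. low frequencies that
# rotate slowly and preserve long-range order.
text_inv_freq = rope_inv_freq(head_dim, base=1_000_000)  # assumed base
# Vision: the head dimension is split across time/height/width, each with a
# smaller base, i.e. higher frequencies for fine spatial detail.
axis_dim = head_dim // 3
vis_inv_freq = {axis: rope_inv_freq(axis_dim, base=10_000)  # assumed base
                for axis in ("t", "h", "w")}

def rotate(x, pos, inv_freq):
    # Apply the rotary transform to features x at integer positions pos.
    angles = pos.float()[:, None] * inv_freq[None, :]     # (N, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), dim=-1).flatten(-2)
```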
Native Multi-Head Attention: Tailored to the characteristics of each modality, NEO lets autoregressive attention for text tokens and bidirectional attention for visual tokens coexist within a unified framework. This design greatly improves the model's use of spatial-structure correlations, better supporting complex mixed image-text understanding and reasoning.
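A minimal sketch of such a hybrid attention mask, assuming text tokens stay causal while tokens of the same image attend to each other bidirectionally (the sequence layout is made up for the example):

```python
import torch

def hybrid_attention_mask(span_id):
    # span_id: (L,) ints, -1 for text tokens, k >= 0 for tokens of image k.
    L = span_id.shape[0]
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))  # autoregressive text
    # Tokens belonging to the same image may attend to each other freely.
    same_image = (span_id[:, None] == span_id[None, :]) & (span_id[:, None] >= 0)
    return causal | same_image                               # True = may attend

# Example: 3 text tokens, a 4-token image, then 2 more text tokens.
span_id = torch.tensor([-1, -1, -1, 0, 0, 0, 0, -1, -1])
print(hybrid_attention_mask(span_id).int())
```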
In addition, combined with an innovative two-stage Pre-Buffer & Post-LLM fusion training strategy, NEO absorbs the full language reasoning capability of the original LLM while building strong visual perception from scratch, eliminating the language-capability degradation that plagues traditional cross-modal training.
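How the two stages are wired internally is not spelled out in the announcement; the sketch below shows one common way such a freeze-then-unfreeze fusion schedule is implemented, with hypothetical module names (pre_buffer, llm) and placeholder sizes:

```python
import torch.nn as nn

class NativeVLMSketch(nn.Module):
    # Hypothetical layout: new vision-facing "pre-buffer" layers feeding an
    # inherited LLM trunk; dimensions are placeholders.
    def __init__(self, dim=64):
        super().__init__()
        self.pre_buffer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                     batch_first=True)
        self.llm = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                              batch_first=True)

    def forward(self, tokens):
        return self.llm(self.pre_buffer(tokens))

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

model = NativeVLMSketch()
# Stage 1 ("Pre-Buffer"): freeze the inherited LLM so its language ability
# is preserved intact; train only the new vision-facing parameters.
set_trainable(model.llm, False)
set_trainable(model.pre_buffer, True)
# ... run stage-1 multimodal training here ...

# Stage 2 ("Post-LLM"): unfreeze everything for joint training once visual
# perception has been built up.
set_trainable(model.llm, True)
# ... run stage-2 joint training here ...
```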

Matching flagship-level performance with 1/10 of the data
Driven by architectural innovations, NEO demonstrates impressive data efficiency and performance advantages.
Superior Data Efficiency: Trained on only 390 million image-text pairs, roughly one tenth of the data required by industry models of equivalent performance, NEO develops top-tier visual perception capabilities. Without relying on massive data or an additional visual encoder, its concise architecture matches top modular flagship models such as Qwen2-VL and InternVL3 across multiple visual understanding tasks.
Outstanding and Balanced Performance: NEO scores highly on multiple authoritative public benchmarks, including MMMU, MMB, MMStar, SEED-I, and POPE, outperforming other native VLMs overall and showing that the native architecture sacrifices no precision.
Extreme Inference Cost-Performance: In the 0.6B-8B parameter range in particular, NEO holds significant advantages for edge deployment. It delivers a dual leap in precision and efficiency while greatly reducing inference cost, pushing the cost-performance ratio of multimodal visual perception to the extreme.
Open-source collaboration for next-generation AI infrastructure
Architecture is the "backbone" of a model, and only a solid backbone can support the future of multimodal technology. NEO's early-fusion design supports arbitrary-resolution and long-image input, extends seamlessly to cutting-edge fields such as video and embodied intelligence, and achieves true bottom-up, end-to-end integration.
From an application perspective, this end-to-end native-integration design provides solid technical support for diverse scenarios, including embodied robot interaction, multimodal responses on intelligent terminals, video understanding, and 3D interaction.

SenseTime has officially open-sourced NEO-based models in two sizes, 2B and 9B, to drive innovation and adoption of the native multimodal architecture in the open-source community.
Through open-source collaboration and real-world deployment, SenseTime is committed to making NEO a scalable, reusable next-generation AI infrastructure, moving native multimodal technology from the laboratory to broad industrial application and accelerating the establishment of industrial-grade standards for next-generation native multimodal technology.




