Scaling Spatial Intelligence: SenseTime Open-Sources SenseNova-SI-1.3, Ranked No.1 Overall Across Eight Spatial Intelligence Benchmarks
SenseTime has officially open-sourced its spatial intelligence model, SenseNova-SI-1.3. The new release delivers significant enhancements across core tasks, including metric measurement, perspective-taking, and comprehensive reasoning, and further strengthens the model’s ability to answer open-ended questions compared with previous versions. On EASI, a comprehensive evaluation platform that integrates multiple key spatial intelligence benchmarks, SenseNova-SI-1.3 surpasses Gemini-3-Pro in overall performance, achieving the top average score on EASI-8, a unified evaluation across eight major spatial intelligence benchmarks. The model delivers particularly strong results on a range of highly challenging spatial tasks, most notably perspective-taking.

Spatial Intelligence Put to the Test
EASI-8 comprises a set of deliberately challenging questions designed to test advanced spatial reasoning: tasks that frequently prove difficult for models such as Gemini-3-Pro. So, how does SenseNova-SI-1.3 perform?

This task requires identifying the total number of building models visible across two viewpoints. The core challenge lies in correctly mapping correspondences between the images to avoid overlooking occluded structures or double-counting duplicates. From the second viewpoint, a dark gray building occluded in the first image becomes visible, while several models appear in both images. Gemini-3-Pro fails to fully de-duplicate and reports an incorrect count of six, whereas SenseNova-SI 1.3 correctly answers “four.”

This question provides two partial photos of a study room. Given that an iMac is located on the north side of the room, the task asks for the orientation of the student’s study area. Solving this requires recognizing that the two images depict the same space and stitching them together using visual cues. Gemini-3-Pro incorrectly places the study area on the west side, while SenseNova-SI 1.3 precisely identifies it as the “northwest corner.”

This question tests perspective-taking: from the viewpoint of a man not wearing glasses, determine the position of a nearby man who is wearing glasses. Models often default to the observer’s viewpoint when judging directions. Gemini-3-Pro falls into this trap and selects “right,” whereas SenseNova-SI-1.3 correctly answers “left.”

This question presents four images taken from the front, back, left, and right of a pink bottle, asking which object lies to the bottle’s left in the fourth view. This requires integrating multi-view information to mentally reconstruct the room’s global layout and then switching to the target viewpoint. In the fourth image, the bottle’s left side is entirely outside the view and can only be inferred from clues in the first three images, such as the window, bed, and closet. Gemini-3-Pro incorrectly chooses “window and blue curtains,” while SenseNova-SI-1.3 accurately identifies “closet and door.”

This question involves a double-decker bus and a bus stop. The task is to avoid the common-sense pitfall of assuming that “UK buses drive on the left, so the stop must be on the left,” and instead to judge direction from the actual visual scene. Gemini-3-Pro incorrectly answers “left,” whereas SenseNova-SI-1.3 correctly concludes “right.”
Spatial Intelligence Is a Unique Multimodal Capability

Core Knowledge Deficits in Multi-Modal Language Models (2025) finds that the “perspective transformation” task exhibits low correlation with other multimodal capabilities (highlighted row; blue indicates low correlation)
A 2025 paper published at the top-tier machine learning conference ICML, Core Knowledge Deficits in Multi-Modal Language Models, reports a striking finding: perspective transformation (perspective-taking) correlates abnormally weakly with virtually all other capabilities measured by existing multimodal tasks. This suggests that much of the current research paradigm offers limited help in addressing spatial intelligence. The finding also helps explain why even leading multimodal foundation models perform poorly on spatial intelligence–related tasks.

Core Knowledge Deficits in Multi-Modal Language Models (2025) finds that scaling model size does not help on the perspective transformation task
The same work further shows that scaling up model size does not effectively improve performance on perspective-taking tasks. In other words, spatial intelligence appears to exhibit an anti-scaling effect, where larger models do not necessarily perform better. Similar observations are echoed in the official report of EASI, which identifies perspective-taking as one of the most challenging foundational spatial capabilities that remains largely unsolved.
From 3D World Data Scarcity to Scaling Spatial Intelligence

Perspective-taking, the core of spatial intelligence, can be decomposed into three key subtasks: establishing cross-view correspondences, understanding camera motion, and imagining allocentric transformations.
Most existing academic datasets emphasize object recognition and scene understanding. As a result, models often remain at the level of 2D pattern matching and struggle to develop stable spatial understanding. Based on this insight, tackling spatial intelligence (especially perspective-taking) requires more than simply scaling dataset size. Instead, perspective-taking should be treated as a critical bridge from 2D visual perception to true 3D understanding. Accordingly, the task is decomposed into three increasingly difficult stages: establishing cross-view correspondences, understanding camera motion, and imagining allocentric transformation, with large volumes of carefully curated training data for each stage to progressively build comprehensive spatial understanding.
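To make the staged decomposition concrete, here is a minimal sketch, in Python, of how the three subtasks could be framed as question templates during data curation. The stage names, template wording, and ordering are illustrative assumptions, not the released training recipe.

```python
# Illustrative sketch only: the three perspective-taking stages expressed as
# question templates for data curation. Names and wording are assumptions.
STAGES = [
    ("cross_view_correspondence",
     "Is the {object} in image 1 the same instance as the {object} in image 2?"),
    ("camera_motion",
     "From image 1 to image 2, did the camera move left, right, forward, or backward?"),
    ("allocentric_transformation",
     "Standing at the {person}'s position and facing the {landmark}, is the {object} on their left or right?"),
]

for name, template in STAGES:
    print(f"[{name}] {template}")
```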
As data scale continued to grow, the SenseNova-SI team systematically mined and reorganized multi-view data resources, converting many previously underutilized annotations into effective training data for perspective-taking. For example, the MessyTable dataset provides scenes with high object complexity; its cross-view object instance labels and camera pose labels are well suited for training correspondence and camera motion estimation. Meanwhile, portions of indoor scene scanning datasets such as CA-1M, which include object orientation annotations, were repurposed to supply scarce but crucial supervision for viewpoint transformation and imagination. This cross-dataset recomposition and reuse made it possible to accumulate large-scale, rich, and systematic spatial understanding data.
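As an illustration of this kind of annotation reuse, the sketch below shows how camera pose labels from a multi-view dataset might be turned into a camera-motion question. The pose convention, field names, and question template are assumptions made for illustration; they are not taken from the released data pipeline.

```python
import numpy as np

def relative_yaw_deg(pose_a: np.ndarray, pose_b: np.ndarray) -> float:
    """Yaw of camera B relative to camera A, given 4x4 camera-to-world poses
    (assumed convention: x right, y down, z forward)."""
    rel = np.linalg.inv(pose_a) @ pose_b           # camera B expressed in A's frame
    fwd = rel[:3, :3] @ np.array([0.0, 0.0, 1.0])  # B's forward axis seen from A
    return float(np.degrees(np.arctan2(fwd[0], fwd[2])))

def make_camera_motion_qa(pose_a, pose_b, image_a, image_b):
    """Build one hypothetical training example from a labeled view pair."""
    answer = "right" if relative_yaw_deg(pose_a, pose_b) > 0 else "left"
    return {
        "images": [image_a, image_b],
        "question": "Between the first and second image, did the camera rotate left or right?",
        "answer": answer,
    }
```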

Scaling spatial intelligence: SenseNova-SI surpasses GPT-5 on perspective-taking
With access to large volumes of high-quality spatial intelligence data, the SenseNova-SI team ultimately verified the scaling effects of spatial intelligence. On perspective-taking, the 8B-parameter SenseNova-SI base model surpasses strong closed-source models such as GPT-5, while the smaller 2B-parameter model also performs competitively: trained on the same data scale, it even outperforms 7B-parameter models such as Cambrian-S from New York University and VST from ByteDance.

Training on egocentric-exocentric matching on Ego-Exo4D surprisingly improves performance on 2D maze navigation in MMSI by a significant margin (+90.4%)
More intriguingly, the team observed early signs of emergent intelligence during this process. Tasks that initially appeared unrelated (but upon closer inspection may share underlying capabilities) began to improve synergistically. In addition, models trained on perspective-taking tasks also exhibited gains in other fundamental capabilities such as mental reconstruction and comprehensive spatial reasoning, suggesting that perspective-taking may act as a foundational driver for broader spatial intelligence development.
SenseTime Leads an Inclusive Spatial Intelligence Ecosystem
Behind the scenes of SenseNova-SI-1.3 is SenseTime’s long-standing commitment to lowering technical barriers and making state-of-the-art spatial intelligence accessible to a broader ecosystem of developers and enterprises. For researchers, SenseNova-SI-1.3 validates the scaling effects of data for spatial intelligence, delivering a powerful pre-trained model and baseline that is fully compatible with existing foundation models yet excels specifically in spatial intelligence. Already officially adopted by popular benchmarks such as VSI-Bench and MMSI-Bench, SenseNova-SI can be used directly as a starting point for novel algorithm design or continued training, helping push spatial intelligence closer to human-level performance. For enterprises, SenseNova-SI-1.3 enables rapid deployment of spatially intelligent applications, significantly shortening R&D cycles and lowering the technical threshold for adoption. For end users, the impact will be felt through a growing range of products powered by advanced spatial intelligence, from smart home devices and autonomous driving systems to industrial robots and educational tools. These applications will better understand the 3D world and align more closely with real-world needs, bringing practical, human-centric spatial intelligence into everyday life.
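For researchers who want to try the model as a baseline, a minimal loading sketch with Hugging Face Transformers might look like the following. The repository ID and model classes below are assumptions; consult the model cards in the collection linked under Open-source Links for the actual identifiers and usage instructions.

```python
# Minimal sketch, not official usage: load an open-sourced SenseNova-SI
# checkpoint as a starting point for inference or continued training.
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "sensenova/SenseNova-SI-1.3-8B"  # hypothetical repository name

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

# From here, the checkpoint can be prompted with multi-view images for spatial
# QA, or plugged into an existing fine-tuning pipeline for continued training.
```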

SenseNova-SI also explores spatially intelligent embodied agents.
Open-source Links
SenseNova-SI Model Family: https://huggingface.co/collections/sensenova/sensenova-si
SenseNova-SI Code: https://github.com/OpenSenseNova/SenseNova-SI
Discord Invite Link: https://discord.gg/WBzH62bk
Join the SenseNova-SI WeCom group