SenseTime Open-Sources SenseNova-MARS: A Breakthrough in Multimodal Search and Reasoning
SenseTime today officially open-sourced its multimodal autonomous reasoning model SenseNova-MARS (available in 8B and 32B versions). In multiple core benchmarks for multimodal search and reasoning, the model achieved an average score of 69.74, outperforming Gemini 3 Pro (69.06) and GPT-5.2 (67.64).
SenseNova-MARS is the first open-source agentic vision-language model (VLM) to simultaneously support dynamic visual reasoning and both text and image search. It can autonomously plan steps, invoke tools, and efficiently handle complex tasks, giving artificial intelligence (AI) genuine “execution capability.”
Across benchmarks, including MMSearch, HR-MMSearch, FVQA, InfoSeek, SimpleVQA, and LiveVQA, SenseNova-MARS outperformed other open-source models and exceeded leading closed-source models, such as Gemini 3 Pro and GPT-5.2, establishing leadership in both search reasoning and visual understanding.
For details, please refer to the technical report (https://arxiv.org/abs/2512.24330). Developers and industry users are invited to test and experience the model.
Autonomous Problem-Solving and Comprehensive Leadership
SenseNova-MARS demonstrated clear advantages in multimodal search evaluations, with an average score of 69.74, surpassing Gemini 3 Pro’s 69.06 and GPT-5.2’s 67.64.

On the MMSearch leaderboard, a core benchmark for text-image search, SenseNova-MARS tied Gemini 3 Pro for the top spot at 74.27, well ahead of GPT-5.2 (66.08). On the HR-MMSearch benchmark, which tests high-resolution detail reasoning, SenseNova-MARS led decisively with a score of 54.43, outperforming closed-source models.

The HR-MMSearch test set is regarded as the “Olympics of AI”. It uses 305 newly collected 4K ultra-high-definition images from 2025, preventing AI from relying on old knowledge to “cheat”. All questions focus on details occupying less than 5% of the area of the image—such as small logos, fine print, or tiny objects—which require cropping tools to discern. The dataset spans eight domains: sports, entertainment and culture, science and technology, business and finance, gaming, academic research, geography, and travel. Notably, 60% of the questions require at least three tools to solve.
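To make the task format concrete, here is a sketch of what a single HR-MMSearch-style item might look like as a data record. The field names and the example values are our own illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass

@dataclass
class HRMMSearchItem:
    """Hypothetical record for one HR-MMSearch-style question.

    The field names are illustrative assumptions, not the benchmark's
    actual schema.
    """
    image_path: str   # 4K source image collected in 2025
    question: str     # asks about a detail covering less than 5% of the image
    answer: str       # ground-truth short answer
    domain: str       # one of the eight domains, e.g. "sports"
    min_tools: int    # minimum number of tool calls needed to solve it

# Illustrative item loosely modeled on the racing-suit example later in this article.
example = HRMMSearchItem(
    image_path="images/race_4k_0137.jpg",
    question="How many years older is the company whose logo appears on the "
             "driver's collar than the driver?",
    answer="(ground-truth short answer)",
    domain="sports",
    min_tools=3,
)
```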
In short, whether tackling knowledge-intensive tasks that demand “searching the entire web” or fine-grained visual analysis requiring “sharp eyes”, SenseNova-MARS consistently delivers leading performance.
Multi-Tool Collaboration for Real-World Problems
Conventional AI systems can only search text or view images; they cannot handle complex tasks requiring “zooming in on details, identifying objects, and then retrieving background information”. SenseNova-MARS addresses these challenges through multi-step reasoning and multi-tool collaboration, enabling practical applications in everyday and professional scenarios.
Examples of SenseNova-MARS’s autonomous reasoning and benchmark problem-solving:

Identify a tiny logo on a racing suit → search for the company’s founding year → search for the driver’s birth date → calculate the difference. SenseNova-MARS autonomously invokes image cropping, text search, and image search tools to provide a closed-loop solution without human intervention.

From product and summit photos, it can identify corporate logos, gather product and company information, and extract details such as dates, quantities, and technical parameters to support industry insights.

From race photos, SenseNova-MARS recognizes logos and individuals, traces their competition history or personal background, and quickly supplies supporting details for reporting.

SenseNova-MARS handles long multi-step multimodal reasoning tasks, invoking more than three tools to crop and analyze details, search for research data, and rapidly validate hypotheses to reach key conclusions.
SenseNova-MARS combines autonomous reasoning and multi-tool collaboration, enabling automated solutions for complex tasks involving detail recognition, information retrieval, and logical reasoning—enhancing work efficiency.
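To illustrate what such a closed loop can look like in code, here is a minimal, self-contained sketch of an agent loop in the spirit of the racing-suit example above. The tool functions, the `plan_step` callback standing in for the model, and the stopping rule are all our own placeholders, not SenseNova-MARS's actual interface.

```python
from typing import Callable, Dict

# Placeholder tools: real implementations would call a cropping routine,
# a web search API, and a reverse image search API.
def crop_image(image: str, region: str) -> str:
    return f"cropped view of {region} in {image}"

def text_search(query: str) -> str:
    return f"top web results for: {query}"

def image_search(image: str) -> str:
    return f"visually similar matches for: {image}"

TOOLS: Dict[str, Callable[..., str]] = {
    "crop_image": crop_image,
    "text_search": text_search,
    "image_search": image_search,
}

def run_agent(question: str, image: str, plan_step, max_steps: int = 8) -> str:
    """Minimal closed-loop agent: the model plans a step, a tool runs,
    and the observation is fed back until the model answers.

    `plan_step` stands in for the VLM and must return either
    ("answer", text) or ("tool", tool_name, kwargs).
    """
    history = [("question", question), ("image", image)]
    for _ in range(max_steps):
        action = plan_step(history)
        if action[0] == "answer":
            return action[1]
        _, tool_name, kwargs = action
        observation = TOOLS[tool_name](**kwargs)
        history.append((tool_name, observation))
    return "no answer within the step budget"
```

In the racing-suit case, such a loop would crop the logo, image-search the crop, text-search the founding year and the driver's birth date, and finally compute the difference before answering.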
Image cropping:
It focuses precisely on minute details occupying less than 5% of an image—such as small logos on racing suits or banners in the stands—by cropping and enlarging them for clearer analysis.
Image search:
It recognizes objects, people, or scenes and automatically matches them with relevant information, such as identifying a driver’s identity or the model of a niche device.
Text search:
It rapidly retrieves precise information, such as company founding year, birth date, or the latest industry data.
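For developers wiring up comparable tools, the three capabilities above could be declared to a tool-using VLM as function-calling schemas along the following lines. The layout follows the common JSON function-calling convention, and the parameter names are our own assumptions rather than SenseNova-MARS's published API.

```python
# Hypothetical function-calling declarations for the three tools described above.
# Parameter names are illustrative, not SenseNova-MARS's actual API.
TOOL_SCHEMAS = [
    {
        "name": "crop_image",
        "description": "Crop and enlarge a region of the image for closer inspection.",
        "parameters": {
            "type": "object",
            "properties": {
                "bbox": {
                    "type": "array",
                    "items": {"type": "number"},
                    "description": "Region to crop as [x1, y1, x2, y2], normalized to [0, 1].",
                }
            },
            "required": ["bbox"],
        },
    },
    {
        "name": "image_search",
        "description": "Reverse-search an image or crop to identify objects, people, or scenes.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_id": {
                    "type": "string",
                    "description": "Identifier of the image or crop to search.",
                }
            },
            "required": ["image_id"],
        },
    },
    {
        "name": "text_search",
        "description": "Query the web for facts such as founding years, birth dates, or recent data.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query text."}
            },
            "required": ["query"],
        },
    },
]
```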
Cultivating Intuition and Building Experience for AI
SenseNova-MARS is trained with a tailored two-phase approach.
Phase 1: Laying a Solid Foundation
To address the shortage of training data for multimodal multi-hop search reasoning, SenseTime developed an automated data synthesis engine powered by multimodal agents. Using fine-grained visual anchors combined with multi-hop deep associative retrieval, the engine dynamically mines and connects logical relationships between entities across webpages and automatically constructs highly complex multi-hop reasoning chains. It also integrates closed-loop self-consistency verification to filter out hallucinated data, producing multi-hop search question-answering datasets with rigorous logical chains and high knowledge density.

The team used 3,000 carefully curated high-difficulty cases as training material, ensuring the AI was exposed to complex, real-world scenarios from the outset. Each case was annotated with the specific tools to be used and the corresponding operational steps, so the AI could master fundamental reasoning logic much like a detective learning investigative procedure.
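The technical report describes the synthesis engine in detail; as a rough illustration of the pipeline's overall shape, the sketch below strings the stages described above into one loop. Every helper function here is a placeholder of our own, not SenseTime's implementation.

```python
from typing import Dict, List, Optional

# --- Placeholder stages; each stands in for a component described above. ---

def extract_visual_anchors(image: str) -> List[str]:
    """Pick small, identifiable details (logos, fine print) as anchors."""
    return [f"anchor in {image}"]

def retrieve_linked_entity(entity: str) -> Optional[str]:
    """Follow an associative link from one entity to another across webpages (stub)."""
    return None

def compose_question(image: str, chain: List[str]) -> Dict:
    """Turn an entity chain into a multi-hop question-answer pair."""
    return {"image": image, "question": f"multi-hop question over {chain}", "answer": ""}

def self_consistent(qa: Dict) -> bool:
    """Keep the item only if independent re-answering agrees (stub)."""
    return bool(qa["question"])

def synthesize_multihop_qa(images: List[str], max_hops: int = 3) -> List[Dict]:
    """Sketch of the described pipeline: visual anchoring -> multi-hop
    retrieval -> question composition -> self-consistency filtering."""
    dataset = []
    for image in images:
        for anchor in extract_visual_anchors(image):
            chain = [anchor]
            for _ in range(max_hops):
                nxt = retrieve_linked_entity(chain[-1])
                if nxt is None:
                    break
                chain.append(nxt)
            qa = compose_question(image, chain)
            if self_consistent(qa):
                dataset.append(qa)
    return dataset
```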
Phase 2: Accumulating Practical Experience
Reinforcement learning is employed in this phase. Just as detectives gain expertise through successive investigations, the AI is rewarded for correct decisions, such as selecting the right tools and following reasonable procedures, and adjusts its strategy when it makes mistakes. To keep learning from becoming biased toward particular tasks, the research team added a stabilizer, the BN-GSPO algorithm, which keeps the AI progressing steadily on both simple and complex tasks. Its two-stage normalization mechanism smooths out the optimization fluctuations caused by the diverse distributions of results returned by dynamic tool calls, keeping the learning signal on a consistent scale and thereby resolving the convergence challenges encountered when training multimodal, multi-step, multi-tool agents.
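The precise BN-GSPO formulation is given in the technical report. As an illustration of the general idea of a two-stage normalization of the learning signal, the sketch below first normalizes rewards within each group of rollouts that share a prompt (as in group-relative methods), then normalizes the resulting advantages across the batch; this is our own interpretation for illustration, not the paper's definition.

```python
import numpy as np

def two_stage_normalized_advantages(rewards, group_ids, eps=1e-6):
    """Illustrative two-stage normalization of rollout rewards.

    Stage 1 normalizes within each group of rollouts that share a prompt;
    stage 2 normalizes the resulting advantages across the whole batch.
    This is a sketch of the general idea, not the BN-GSPO definition
    from the report.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    group_ids = np.asarray(group_ids)
    adv = np.empty_like(rewards)

    # Stage 1: group-wise normalization (per prompt).
    for g in np.unique(group_ids):
        mask = group_ids == g
        r = rewards[mask]
        adv[mask] = (r - r.mean()) / (r.std() + eps)

    # Stage 2: batch-wise normalization, keeping the learning signal on a
    # consistent scale across heterogeneous tool-call outcomes.
    return (adv - adv.mean()) / (adv.std() + eps)

# Example: two prompts, three rollouts each, with different reward scales.
rewards = [1.0, 0.0, 0.5, 10.0, 2.0, 6.0]
groups  = [0,   0,   0,   1,    1,   1]
print(two_stage_normalized_advantages(rewards, groups))
```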
Fully Open-Sourced Model, Code, and Data
SenseTime’s SenseNova-MARS model, code, and datasets are fully open-sourced and available for direct download via Hugging Face.
GitHub repository: https://github.com/OpenSenseNova/SenseNova-MARS
Model repositories:
• 32B: https://huggingface.co/sensenova/SenseNova-MARS-32B
• 8B: https://huggingface.co/sensenova/SenseNova-MARS-8B
Technical report (click “View PDF”): https://arxiv.org/abs/2512.24330
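As a quick start, the checkpoints can be fetched from the repositories above with the huggingface_hub client; the appropriate inference loader afterwards depends on the model card, so treat that step as something to confirm in the repository.

```python
from huggingface_hub import snapshot_download

# Download the 8B checkpoint locally; swap in "sensenova/SenseNova-MARS-32B"
# for the larger model. The inference class to load the weights with
# (e.g. a transformers or vLLM multimodal loader) should be confirmed on
# the repository's model card.
local_dir = snapshot_download(repo_id="sensenova/SenseNova-MARS-8B")
print("Model files downloaded to:", local_dir)
```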