
SenseTime Launches the SenseNova Unified Large Model, Clinches Top Rankings Across Key Model Benchmark Evaluations

2025-01-13

SenseTime has officially launched the SenseNova Unified Large Model, leading the way in natively integrated modalities. The model, which unifies in-depth reasoning and multimodal information processing capabilities, clinched first place on two comprehensive benchmarks, one testing language capabilities and the other visual content understanding.

According to the latest “2024 Chinese Large Model Benchmark Report” by SuperCLUE, a leading model evaluation institute, the SenseNova Unified Large Model achieved an outstanding score of 68.3, ranking first in China alongside DeepSeek V3.


[Image: SenseNova2.png]

Source: SuperCLUE


Meanwhile, the model secured the highest score in the multimodal evaluation conducted by OpenCompass, outpacing GPT-4o by a substantial margin.


[Image: SenseNova3.png]

Source: OpenCompass


SenseTime’s top positions on both the language and multimodal benchmarks demonstrate its breakthrough in native multimodal training, marking a critical step towards unifying large language models and multimodal models and setting the stage for a more unified approach to developing large models.

The SenseNova Unified Large Model overcomes the long-standing challenges in modality integration, seamlessly bridging different modalities and paving the way towards the integration of deep reasoning capabilities and multimodal information processing.

 

Excellence Across Humanities and Sciences: Writing, Perception and Reasoning

The SenseNova Unified Large Model excels across humanities and sciences. In SuperCLUE’s annual evaluation, the model ranked first globally with a score of 81.8 points in humanities, surpassing OpenAI’s o1 model. It also earned top honors in sciences, including a score of 78.2 points in the computational evaluation, taking first place in China.

With its native multimodal capabilities, the SenseNova Unified Large Model goes beyond human-level “reading” and “reasoning” to handle complex tasks such as deciphering unclear handwriting, extracting information from charts, and generating literary content.


[Image: Handwriting.jpg]

The SenseNova Unified Large Model quickly and accurately deciphers unclear handwriting.

 

Expanding the Horizons of Multimodal Applications

In practical application scenarios, the SenseNova Unified Large Model demonstrates significant advantages over traditional large language models that primarily support pure textual inputs. It excels in scenarios involving diverse multimodal information, such as autonomous driving, interactive videos, workplace settings, education, finance, business park management, and industrial manufacturing.

The model effectively meets users' needs for comprehensive processing of information across various sources and formats, including images, videos, speech, and text.

For example, SenseTime’s “Raccoon” application leverages the SenseNova Unified Large Model to efficiently process and analyze complex multimodal documents that combine formats such as tables, text, images and videos, which are commonly used in the business and financial sectors.

The model also enhances user experience in interactive applications such as online education and voice-based customer service, seamlessly integrating speech and natural language capabilities.

While native multimodal foundational models have become a focus of industry discussions, many attempts have not been successful due to limitations in data and training methods. Such challenges often lead to diminished performance in core language tasks, such as following instructions and reasoning.

Drawing on a decade of expertise in computer vision and extensive experience in empowering various sectors through artificial intelligence, SenseTime firmly believes that multimodal models are the cornerstone of AI 2.0’s real-world implementations. Guided by this vision, SenseTime has developed unique insights into multimodal research. In its efforts to integrate large language models and multimodal models, SenseTime has pioneered two key innovations: unified modal data synthesis and enhanced joint task training. These innovations have driven the successful development and market launch of the SenseNova Unified Large Model.

During the pre-training phase, SenseTime leveraged a vast amount of naturally interleaved text-image data and synthesized a large volume of data combining multiple modalities through methods such as inverse rendering and image generation based on mixed semantics. This approach bridged the image and text modalities at scale, building strong interconnections between them that laid the foundation for executing cross-modal tasks more effectively and significantly enhanced the model’s overall performance.
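
To make the idea of interleaved text-image pre-training data concrete, here is a minimal sketch of how an interleaved document might be packed into a single training sequence. The special tokens, the toy tokenizer, and the patch placeholders are illustrative assumptions for exposition, not SenseTime’s actual pipeline.

```python
# Minimal sketch: packing a naturally interleaved text-image document into one
# training sequence. The special tokens, the whitespace "tokenizer", and the
# patch placeholders are illustrative assumptions, not SenseTime's pipeline.
from dataclasses import dataclass
from typing import List, Union

IMG_START, IMG_END = "<img>", "</img>"  # assumed image boundary tokens


@dataclass
class ImageRef:
    path: str  # an image that appears inline in the document


def encode_text(text: str) -> List[str]:
    # Stand-in for a real tokenizer: naive whitespace split.
    return text.split()


def encode_image(image: ImageRef, n_patches: int = 4) -> List[str]:
    # Stand-in for a vision encoder: one placeholder token per image patch.
    return [f"<patch:{image.path}:{i}>" for i in range(n_patches)]


def pack_interleaved(doc: List[Union[str, ImageRef]]) -> List[str]:
    """Flatten text segments and images, in reading order, into one token
    stream so the model learns text and visual context jointly."""
    tokens: List[str] = []
    for segment in doc:
        if isinstance(segment, ImageRef):
            tokens += [IMG_START, *encode_image(segment), IMG_END]
        else:
            tokens += encode_text(segment)
    return tokens


if __name__ == "__main__":
    doc = [
        "The chart below shows quarterly revenue.",
        ImageRef("revenue_q3.png"),
        "Revenue grew 12% quarter over quarter.",
    ]
    print(pack_interleaved(doc))
```

Training on sequences packed in this reading order is one way a model can attend across text and image segments jointly, which is the kind of cross-modal interconnection described above.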

In the post-training phase, SenseTime established a large number of cross-modal tasks based on its understanding of a wide range of business scenarios, such as interactive videos, multimodal document analysis, and comprehension of urban and in-vehicle scenarios. By incorporating these tasks into an enhanced training process, the SenseNova Unified Large Model has not only developed robust capabilities for integrating, interpreting, and analyzing multimodal information, but can also respond effectively to real-world business scenarios. This establishes a closed-loop system where practical applications drive iterative improvements to the foundational model.
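
As a rough illustration of what such a cross-modal post-training task could look like as data, the sketch below structures one supervised example with an instruction, ordered mixed-modality inputs, and a reference answer. The field names, task labels, and file names are hypothetical and are not drawn from SenseTime’s training setup.

```python
# Illustrative sketch of how one cross-modal post-training example might be
# structured: an instruction, ordered mixed-modality inputs, and a reference
# answer for supervised tuning. Field names, task labels, and file names are
# hypothetical and not drawn from SenseTime's training setup.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CrossModalExample:
    task: str                     # e.g. "doc_analysis", "chart_qa", "video_dialogue"
    instruction: str              # what the model is asked to do
    inputs: List[Dict[str, str]]  # mixed-modality inputs in reading order
    target: str                   # reference answer used as the training label


examples = [
    CrossModalExample(
        task="doc_analysis",
        instruction="Summarize the key figures in this financial report page.",
        inputs=[
            {"type": "image", "uri": "report_page_3.png"},
            {"type": "text", "content": "Q3 2024 earnings release"},
        ],
        target="Revenue rose year over year while operating costs stayed flat.",
    ),
]

# A post-training run would mix such cross-modal examples with text-only
# instruction data so language skills are preserved while multimodal skills
# and business-scenario coverage are added.
for ex in examples:
    print(ex.task, "->", ex.instruction)
```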

Achieving seamless multimodal interactions and deep integration is a critical step towards realizing a unified model, which in turn is a prerequisite for developing a world model. SenseTime’s innovative approach and advancements in this field have firmly established its leading position, placing the company at the forefront of shaping the future of foundational AI models.