Dr. Xu Li, Chairman of the Board and CEO of SenseTime: A Decade of AI Evolution | WAIC 2025 Main Forum
From the very beginning of artificial intelligence, one core question has persisted: Where does intelligence come from?
Human intelligence originates through the continuous and self-directed exploration of the world, evolving through interaction with physical space.
In contrast, machine intelligence has largely relied on the limited body of human knowledge. And on its own, that may not be enough to fully access or integrate into the physical world.
As the evolution of unimodal systems approaches its limits, a new question emerges: What path lies ahead for AI?

At the WAIC 2025 Main Forum, SenseTime Chairman of the Board and CEO Dr. Xu Li delivered a keynote speech titled “A Decade of AI Evolution.”
The full transcript is provided below.
Good morning. It’s a pleasure to be here and to share some reflections on the evolution of artificial intelligence.
Today, I’ll be talking about “A Decade of AI Evolution” because the past ten years have seen some of the most rapid shifts in public understanding of AI, and, coincidentally, SenseTime is also in its tenth year. So, it feels fitting to reflect on this journey.
From Perception AI to Generative AI: Tracing the Shifts in Intelligence
Google Trends shows that public interest in AI has risen exponentially at several key moments over the past decade, each marking a major shift in public perception.

The first wave was perception AI, around 2011–2012, when deep learning began gaining traction in computer vision. Algorithms like CNNs and ResNet sparked a wave of progress and real-world applications. Then in 2017–2018, with the rise of transformers and natural language models, the industry entered the generative AI phase, which further propelled the AI sector forward and reshaped public understanding. From GPT to agents and multimodal foundation models, we now arrive at a third wave, centered on embodied AI and world models that aim to transform the real world.

One core question we’ve continually asked through each stage is this: where does intelligence come from? Algorithmic iteration and breakthroughs in computing power, including GPUs, are certainly important. However, at its core, AI and machine learning are defined by what exactly is being learned.
In the era of perception AI, we benefited from the internet’s vast “copies” of the physical world – images, videos, and other visual data were already widely available. As a result, AI in this period drew its intelligence mainly from human annotations: it learned through labeled data. Large volumes of annotated data were used to train models tailored to specific verticals. For instance, our SenseFoundry platform at the time integrated more than 10,000 perception models covering tasks across different domains.
The capabilities of perception models depended on the volume of labeled data used for training. Take the breakthrough by Hinton’s team on ImageNet in 2012, which involved around 14 million images. If one person were to label that many images, it would take about 10 years. While that may seem like a huge dataset, it still only reflects 10 years’ worth of one individual’s knowledge, which inherently limits the model’s generalization. At the time, AI’s role was clearly that of a tool, constrained to specific vertical domains. We had to build dedicated models for each task.
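As a rough back-of-envelope check on that figure, the arithmetic looks like this (the per-day labeling rate below is an assumption for illustration, not a number from the talk):

# Sanity check on the "about 10 years" estimate for labeling ImageNet alone.
total_images = 14_000_000          # approximate size of the full ImageNet dataset
images_per_day = 4_000             # assumed throughput for one person labeling every day
years = total_images / (images_per_day * 365)
print(f"~{years:.0f} years of labeling for a single annotator")   # roughly 10 years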

So how is generative (or more general-purpose) AI different? The key lies in its foundation. Today’s more general AI is rooted in natural language. Text on the internet naturally carries embedded knowledge and doesn’t require post-processing or labeling. Although the volume of images and videos far exceeds that of text, text has a much higher knowledge density. Take GPT-3, which was trained on 750 billion tokens. If a single person were to write that much text, it would take approximately 100,000 years. That is a ten-thousandfold jump from the roughly 10 years it would take to annotate ImageNet. This immense density of information is what gives language-based models their power to generalize and scale, and it has become the cornerstone of today’s general AI development.
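For scale, the ratio behind that comparison can be made explicit, taking the person-year figures exactly as quoted in the speech:

# Ratio of the two person-year estimates quoted above.
imagenet_person_years = 10          # one person labeling ~14 million images
gpt3_person_years = 100_000         # one person writing ~750 billion tokens
print(gpt3_person_years // imagenet_person_years)   # 10,000x more person-years of embedded knowledge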
We have also realized that this kind of data could soon be depleted. Image labeling requires manual effort, and even natural language data is projected to run out by 2027 or 2028. In fact, the pace of content creation now lags far behind the growth of computing power, leading to a mismatch between model demand and data supply. Can we extract more knowledge from raw, unstructured video and image data? It is possible, but doing so will require much deeper exploration.

We’ve chosen what we believe is a natural path forward: fusing visual perception with language to build natively multimodal systems. That means establishing more natural connections between text and images, and building longer chains of multimodal reasoning. These reasoning chains help activate model capabilities. Intelligence is sparked, not generated from nothing. While today’s models can already carry out recursive self-learning to some degree, we still need to build a systematic evolution path for foundation models.
When we introduced large volumes of image-text data and even higher-order multimodal reasoning chains, we observed an interesting phenomenon. The same model that excelled at audio interaction and multimodal reasoning also showed significant improvements in text-based inference. This improvement was driven by the richness and sufficiency of the multimodal reasoning chains. It suggests that we can extract knowledge from multimodal internet data and embed it into text-based models, thus enhancing their ability to handle tasks such as spatial reasoning and physical world understanding. Such models are also better able to control generation outputs, including guiding image and video generation.

However, we will eventually face this fundamental question – once we exhaust knowledge from books and the internet, where will the next wave of intelligence come from? The first stage relied on labeling, the second on language, so what comes next?
Let’s consider how humans learn. From birth, we acquire intelligence through continuous interaction with the physical world, not by starting with language or supervised learning. This interaction with the world is a key source of intelligence. Naturally, it follows that data generated through interaction with the real world can also produce intelligence.

Why do we need large volumes of data? Because exploration must cover a wide range of physical environments. One major bottleneck in robotics and embodied AI is the high demand for this kind of high-quality interactive data.
There’s an interesting study from 1963 known as the ‘Kitten Carousel’ experiment. Two cats were connected by a rod, and one could move freely and interact with the real world, while the other was immobilized and could only passively observe the same visual input. Despite receiving the same visual data, the cat that could interact learned much faster. This illustrates the principle behind embodiment, where active exploration and interaction with the physical world are essential for developing intelligence.

However, challenges remain. When physical machines interact with the complex and vast real world, the exploration space becomes enormous. Embodied intelligence often relies on simulation platforms, but these inevitably face the Sim-to-Real gap. Could we instead build a unified world model that integrates both understanding and generation based on a deeper understanding of the real world? It’s possible but not without its own set of challenges.
For instance, model-generated data, which is currently used mainly in autonomous driving, has shown promising results. However, it can go against the laws of physics: vehicles might ‘phase’ through intersections or trigger unpredictable accidents. Even the best video generation models today still respond slowly. For real-time interaction, users often have to wait a long time for results, and sometimes the output is random, like drawing a card, making the outcome unpredictable.
Unpredictable outcome 1: Vehicles "phasing" at an intersection
Unpredictable outcome 2: The white car drives over the top of the black car
Unpredictable outcome 3: A slow-to-generate video showing an elephant and a squirrel of vastly different sizes playing on a see-saw
What is the path forward? We need powerful models that understand the real world, working in tandem with deep 3D understanding models to enhance this capability.
To this end, we have launched our own KaiWu World Model, powered by SenseNova V6.5. KaiWu is a video generation model that incorporates both temporal and spatial consistency.

For example, autonomous driving typically requires collecting data from complex perspectives, such as seven different camera views. Our model can generate realistic seven-view driving simulation videos based solely on natural language descriptions.

Let’s look at the details. As a vehicle moves, its position across each camera view remains precisely synchronized. Temporal consistency is also preserved: regardless of distance, the details captured across different camera angles (like license plates) remain coherent over time. If a video generation engine lacks sufficient understanding of physical laws, a turn of the steering wheel could cause jarring shifts in the visual field, like trees by the road jumping positions, which would break spatiotemporal consistency.
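To make the spatial-consistency requirement concrete, here is a minimal sketch, with made-up camera intrinsics and poses rather than KaiWu's actual configuration, of how a single 3D vehicle position constrains where that vehicle must appear in each camera view:

import numpy as np

def project(point_world, K, R, t):
    """Project a 3D world point into one camera with a simple pinhole model."""
    p_cam = R @ point_world + t            # world -> camera coordinates
    uv = K @ p_cam
    return uv[:2] / uv[2]                  # perspective divide -> pixel coordinates

# Shared intrinsics and two hypothetical camera poses (front and front-right).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(30)                     # front-right camera yawed 30 degrees
R_front = np.eye(3)
R_right = np.array([[np.cos(theta), 0.0, -np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [np.sin(theta), 0.0, np.cos(theta)]])

vehicle = np.array([2.0, 0.0, 20.0])       # a car 20 m ahead and 2 m to the right

print("front view pixel:", project(vehicle, K, R_front, np.zeros(3)))
print("front-right view pixel:", project(vehicle, K, R_right, np.zeros(3)))
# A generated seven-view clip is spatially consistent only if the rendered car sits at
# (approximately) these projected pixel locations in the corresponding views, frame after frame.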


Let’s examine one typical long-tail scenario in autonomous driving – the 'cut-in.' It’s a common but difficult situation even for human drivers: beginners hesitate, and experienced drivers have to take risks. Autonomous systems must learn to navigate it too. Being too cautious reduces efficiency, but being too aggressive increases collision risk. Consider two autonomous vehicles both attempting to cut in, blocking each other and entering a cycle where they try to outdo each other. Collecting real-world data for such high-risk cases is extremely difficult, and the cases themselves are rare.
A real-world cut-in scenario in autonomous driving
Can the KaiWu World Model generate seven-view cut-in driving videos? Absolutely.

Based on descriptions specifying the direction, timing, and angle of the maneuver, the model produces videos that maintain spatiotemporal consistency. More importantly, it can reliably generate a wide range of diverse and controllable scenes, with adjustable lighting (day or night), weather (clear, cloudy, rainy), road structures (straight roads, curves, even F1 tracks), traffic density, speed, and vehicle types (from small to large).
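One way to picture that controllability is as a structured scenario specification rendered into a prompt. The ScenarioSpec and to_prompt names below are purely illustrative and are not part of the KaiWu platform's actual interface:

from dataclasses import dataclass

@dataclass
class ScenarioSpec:
    lighting: str = "day"            # "day" or "night"
    weather: str = "clear"           # "clear", "cloudy", "rainy"
    road: str = "straight road"      # "straight road", "curve", "F1 track"
    traffic_density: str = "medium"
    ego_speed_kmh: int = 60
    cut_in_vehicle: str = "sedan"    # type of vehicle performing the maneuver
    cut_in_side: str = "left"        # direction of the cut-in
    cut_in_time_s: float = 2.0       # when the maneuver begins

def to_prompt(spec: ScenarioSpec) -> str:
    """Render the structured scenario into a natural-language description."""
    return (f"A {spec.lighting}, {spec.weather} drive on a {spec.road} with "
            f"{spec.traffic_density} traffic at {spec.ego_speed_kmh} km/h; "
            f"a {spec.cut_in_vehicle} cuts in from the {spec.cut_in_side} after "
            f"{spec.cut_in_time_s} seconds, rendered consistently across seven camera views.")

print(to_prompt(ScenarioSpec(lighting="night", weather="rainy", cut_in_side="right")))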

This means that with controllable generated videos, we are opening up new possibilities for AI to explore the real world through simulation. In our early days of developing autonomous driving, we built simulation platforms (similar to reinforcement learning environments for robots) to practice before deploying in the real world, but we always faced the Sim-to-Real gap. Now, with more capable foundation models and deeper world understanding, the unification of understanding and generation is unlocking new forms of interaction.
SenseAuto's assisted driving 3D simulator
Here’s a special example – the video is generated from inputs like steering, braking, and throttle, and yet it produces a realistic, seven-camera driving simulation.
It’s like playing a real-world version of 'Need for Speed'. The user steers through varied lighting and vehicle conditions, with each camera offering a distinct perspective that remains visually consistent. This capability holds promise across many industries and could expand how we explore the real world through AI. The question of whether we can generate more data from a smaller base, and even achieve a degree of AI self-learning, is one well worth pursuing.
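As a rough sketch of that interaction loop, the snippet below uses a simple kinematic bicycle model as a stand-in for the real system: user control inputs advance a vehicle state, and in the generative setting each updated state (plus camera poses) would condition the next set of rendered frames rather than a hand-built 3D scene. None of this is SenseTime's actual simulator or the KaiWu model itself:

import math

def step(state, steering_rad, throttle, brake, dt=0.1, wheelbase=2.8):
    """Advance (x, y, heading, speed) one time step from user control inputs."""
    x, y, heading, speed = state
    accel = 3.0 * throttle - 6.0 * brake          # assumed acceleration and braking limits
    speed = max(0.0, speed + accel * dt)
    heading += speed / wheelbase * math.tan(steering_rad) * dt
    x += speed * math.cos(heading) * dt
    y += speed * math.sin(heading) * dt
    return (x, y, heading, speed)

state = (0.0, 0.0, 0.0, 15.0)                     # start at 15 m/s heading along +x
for _ in range(30):                               # three seconds of a gentle turn
    state = step(state, steering_rad=-0.05, throttle=0.2, brake=0.0)
    # Here a world model would be asked for the next seven camera frames conditioned
    # on this state, instead of rendering them from a pre-modeled environment.
print(f"x = {state[0]:.1f} m, y = {state[1]:.1f} m, speed = {state[3]:.1f} m/s")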
Today, we officially launched the KaiWu World Model platform. With a single prompt, anyone can now use natural language descriptions to generate video clips grounded in 3D physical laws and rendered from specific camera perspectives. The goal isn’t cinematic quality: the videos should obey the laws of physics and align with practical application needs, like letting you virtually drive through the real world in a ‘Need for Speed’ scenario. This capability can also extend to how robots learn and operate, which holds tremendous potential.
We look forward to moving through the three stages of AI development together: first perceiving the world, then better understanding and generating it, and ultimately, interacting with the physical world. Thank you.




