Technically Speaking: Auto-labeling With Offline Perception

November 24, 2021 Technically Speaking

Motional’s Technically Speaking series takes a deep dive into how our top team of engineers and scientists are making driverless vehicles a safe, reliable, and accessible reality. In Part 1, we introduced our approach to machine learning and how our Continuous Learning Framework allows us to train our autonomous vehicles faster. In Part 2, we share how we build a world-class offline perception system to automatically label the data that will train our next-generation vehicles.

The recent breakthroughs in autonomous vehicle (AV) technology have been fueled in large part by machine learning (ML) and the availability of large-scale datasets. Much like human beings, these ML systems learn to detect objects from examples: during training, they are shown millions of examples of cars and other traffic participants.

These examples, found within camera images and lidar scans, are labeled by human annotators in a painstaking process to guarantee high quality. However, this process is expensive and time-consuming. To scale AV usage from selected neighborhoods to any road in the world, automated solutions, often referred to as “auto-labeling,” are required to complement the small human-labeled datasets with vast auto-labeled datasets.

In 2018, Motional released the ground-breaking nuScenes dataset to the research community. It’s a collection of five hours of driving data and features a total of 1.4 million human-annotated objects. While nuScenes has become one of the standard benchmarks for AV performance in the industry, it is also abundantly clear that five hours of training data from two cities are not enough to teach an AV system how to drive anywhere.

Moving The Work To The Cloud

The straightforward solution to auto-labeling a dataset would be to use the “online perception” system employed by the AV to detect the traffic participants (vehicles, cyclists, pedestrians, traffic cones, etc.) in its environment. However, online perception systems are heavily constrained by multiple factors. First, the AV stack must detect every object in real time, within a matter of milliseconds. Second, the computational power and memory of onboard computer systems are limited by cost and energy consumption. The computer system should cost a fraction of what the car itself costs and should not drastically reduce the battery life of an electric vehicle.

Therefore, a promising alternative is to work with an “offline perception” system that runs in the cloud and is not limited by any of the above constraints. As shown by many of the community submissions to our nuScenes benchmark challenges, removing the need for a system to operate in real time can drastically improve its ability to detect traffic participants. By increasing network capacity, training for a longer period, and observing more data examples, we can substantially improve the ability of our neural network to generalize from the presented examples of vehicles to new types of vehicles that it has not previously encountered. When working with range sensors like radar or lidar, we can accumulate sensor readings over a larger time window and get denser point clouds, which allows us to label data in 4-D (3-D space plus time) and see farther into the distance.
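To make the idea of accumulating sensor readings concrete, here is a minimal sketch in Python. It is not Motional's implementation: the function names, the 2-D pose format (x, y, yaw), and the sample data are all assumptions chosen for illustration. Each sweep is transformed into a shared world frame using the vehicle pose at that time, and all sweeps inside a time window are merged into one cloud.

```python
import math

def to_world(points, pose):
    """Transform 2-D points from the sensor frame into the world frame.

    pose is (x, y, yaw): assumed vehicle position and heading at sweep time.
    """
    x0, y0, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return [(x0 + c * px - s * py, y0 + s * px + c * py) for px, py in points]

def accumulate_sweeps(sweeps, window):
    """Merge all sweeps whose timestamp lies in `window` into one world-frame cloud.

    sweeps: list of (timestamp, pose, points) tuples.
    window: (t_start, t_end), inclusive. Offline, t_end may lie after the
    frame being labeled, which an online system cannot do.
    """
    t_start, t_end = window
    cloud = []
    for timestamp, pose, points in sweeps:
        if t_start <= timestamp <= t_end:
            cloud.extend(to_world(points, pose))
    return cloud

# Hypothetical data: a static pole near world position (10, 0), observed by a
# car driving along the x-axis. Closer sweeps return more points.
sweeps = [
    (0.0, (0.0, 0.0, 0.0), [(10.0, 0.0)]),
    (0.5, (2.0, 0.0, 0.0), [(8.0, 0.0), (8.0, 0.1)]),
    (1.0, (4.0, 0.0, 0.0), [(6.0, 0.0), (6.0, 0.1), (6.0, -0.1)]),
]

single = accumulate_sweeps(sweeps, (0.0, 0.0))  # one sweep only
dense = accumulate_sweeps(sweeps, (0.0, 1.0))   # one-second window
print(len(single), len(dense))  # 1 6
```

Note that this simple accumulation is only valid for static objects. Points on moving objects would smear across the window unless their motion is compensated as well.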

Looking Into The Future

Perhaps the most exciting aspect of offline perception is that it allows us to look into the future. Since the data has already been recorded, we can jump any number of seconds ahead and find out whether the lights in the distance do indeed belong to a car that is approaching us at night, or if they're instead just streetlamps on the side of the road.

Similarly, it is often challenging to estimate the real size of a truck when we only see the back, but none of the sides. By using offline perception we can wait until our vehicle overtakes the truck and then find a better vantage point from which we can confidently assess the true size. More generally speaking, this approach allows us to infer a globally consistent estimate of the scene, rather than one that is based on only a short moment in time.
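The truck example can be illustrated with a small Python sketch. It rests on a simplifying assumption that is ours, not the article's: partial views of a rigid object can only underestimate its extent, so the largest extent seen anywhere in the track is the best estimate. The function names and the measurements are hypothetical.

```python
def online_size(observations, frame_index):
    """Size estimate available online: only the current frame's observation."""
    return observations[frame_index]

def offline_size(observations):
    """Size estimate available offline: per-dimension maximum over the whole track.

    Assumes the object is rigid and that partial views only ever shrink the
    observed extent, so the largest observation is the most complete one.
    """
    lengths, widths = zip(*observations)
    return (max(lengths), max(widths))

# Hypothetical (length, width) extents in meters for one truck track. Early
# frames see only the back of the truck; later frames see its side.
track = [(1.2, 2.5), (1.5, 2.5), (6.8, 2.4), (12.1, 2.5), (12.0, 2.3)]

print(online_size(track, 0))  # (1.2, 2.5)
print(offline_size(track))    # (12.1, 2.5)
```

A production system would likely need something more robust to noisy measurements than a plain maximum, for example a high percentile or a learned refinement over the full track.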

When we build a world-class offline perception system, the size of our networks and the speed at which we can train them are often constrained by the available graphics card memory, even on the largest cards currently available on the market. It is therefore crucial to distribute the system over a large number of machines and graphics cards and train on different subsets of the data in parallel. This parallelism enables us to speed up our large-scale network training from weeks to a matter of hours.
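Training on different subsets of the data in parallel is commonly done with synchronous data-parallel gradient descent. The sketch below illustrates that general technique in plain Python on a toy one-parameter model. It is an assumption about the style of parallelism, not a description of Motional's training stack, and the workers are simulated sequentially rather than run on separate machines.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr):
    """One synchronous training step.

    Each worker computes a gradient on its own shard (in a real cluster these
    run concurrently), then the gradients are averaged, which corresponds to
    an all-reduce, and every worker applies the same update.
    """
    grads = [shard_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Hypothetical training data generated with a true weight of 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
# Four equal-size shards, one per simulated worker.
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(round(w, 6))  # 3.0
```

Because the shards have equal size, the averaged gradient equals the gradient over the full dataset, so the result matches single-machine training while each worker only touches a quarter of the data per step.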

Unlocking New Capabilities

What capabilities does auto-labeling with offline perception unlock? We can now annotate any amount of data collected by our steadily growing fleet with a system whose accuracy is approaching that of human annotators.

One of the big use cases of offline perception is in our Continuous Learning Framework, where we continuously mine our new driving data for difficult scenarios by comparing detections from online and offline perception. In some cases these two perception systems disagree. For example, online perception may not be able to detect a pedestrian who is hidden behind a tree. However, offline perception can use foresight and hindsight to infer that a pedestrian who has been observed in the past and in the future must also be there in the present. This is also referred to as object permanence. In cases where the two perception systems disagree, we can use the offline perception outputs as training data to improve our model, since offline perception performs significantly better than online perception.
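The following Python sketch shows one way object permanence and disagreement mining could fit together. It makes several assumptions that are ours: the track format (frame index mapped to a 2-D position), linear interpolation between the last and next sighting, the 1-meter matching threshold, and the function names. For simplicity it also derives the offline track from the online detections, whereas a real offline system would be a separate, stronger model.

```python
import math

def fill_gaps(track):
    """Fill missing frames by linear interpolation between observed frames.

    track: dict mapping frame index -> (x, y) where the object was detected.
    Assumes roughly constant velocity while the object is occluded.
    """
    frames = sorted(track)
    filled = dict(track)
    for prev, nxt in zip(frames, frames[1:]):
        (x0, y0), (x1, y1) = track[prev], track[nxt]
        for f in range(prev + 1, nxt):
            a = (f - prev) / (nxt - prev)
            filled[f] = (x0 + a * (x1 - x0), y0 + a * (y1 - y0))
    return filled

def mine_disagreements(online, offline, max_dist=1.0):
    """Return frames where offline has the object but online has no match."""
    hard = []
    for f, (ox, oy) in sorted(offline.items()):
        det = online.get(f)
        if det is None or math.hypot(det[0] - ox, det[1] - oy) > max_dist:
            hard.append(f)
    return hard

# Hypothetical pedestrian track: hidden behind a tree in frames 2 and 3.
online = {0: (0.0, 5.0), 1: (1.0, 5.0), 4: (4.0, 5.0), 5: (5.0, 5.0)}
offline = fill_gaps(online)

hard_frames = mine_disagreements(online, offline)
print(hard_frames)  # [2, 3]
```

The frames returned by `mine_disagreements` are the candidates for difficult scenarios, and the interpolated positions in `offline` play the role of the auto-generated labels for those frames.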

Auto-labeling also reduces the time it takes us to annotate new driving data from weeks to hours. This is important because it allows us to react to unforeseen issues, such as road closures, on the same day by retraining our system with examples of that road closure. It also helps us expand to new cities faster by quickly learning the unique characteristics of a city’s roadways.

Auto-labeled data does not just help the perception system of our vehicle. It can also be used as training data in essential tasks across the entire AV stack: tracking, prediction, planning, and even semantic mapping. For example, in the context of motion planning, we can discover through error mining whether we are performing badly in a particular scenario, such as when jaywalkers cross the road. We can then find thousands of similar scenarios in our searchable database and retrain our on-car system to learn how to avoid jaywalkers.

nuPlan Launch Coming Soon

We are once again planning to share our progress with the research community. This coming December, we will be releasing our nuPlan dataset. nuPlan is a dataset of unprecedented size that is entirely auto-labeled. It contains 1,800 hours of driving data from four cities, and it will allow practitioners in academia and industry alike to develop the next generation of ML systems that are needed to further improve autonomous vehicle performance.

Motional’s mission is to change the way the world moves by making safe, reliable, and accessible robotaxis a reality. By using an offline perception system that fully automates the labeling process, we’ll enable our AVs to make smarter and safer decisions faster.

We invite you to follow us on social media @motionaldrive to learn more about our work. You can also search our engineering job openings on