The next frontier of autonomous vehicles (AVs) is being shaped by data and the ability to uncover the intelligence buried within. At Motional, we recognize that data is the “dark matter” of autonomy: vast, unseen, and yet essential to technical breakthroughs. As AV technology evolves from rule-based, siloed systems to machine learning (ML)-driven, end-to-end solutions, data mining emerges as a key enabler of the scalability, adaptability, and safety promised by ML-powered AV technology.
The classic data mining approach typically involves building a structured, rule-based pipeline:
Data from diverse sensor modalities → Extract attributes via hand-crafted criteria → Store in a relational database → Craft structured queries from expert-defined rules that leverage the engineered attributes
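As a concrete (and purely hypothetical) illustration of this pipeline, the sketch below extracts a few hand-crafted attributes per drive segment, stores them in a relational table, and answers a mining request with an expert-written query. The attribute names, thresholds, and sample data are placeholders rather than anything from our actual stack.

```python
# Hypothetical sketch of a classic rule-based mining pipeline: hand-crafted
# attributes per drive segment, a relational store, and expert-written queries.
# All attribute names, thresholds, and the sample segment are illustrative only.
import sqlite3

def extract_attributes(segment):
    """Hand-crafted attribute extraction from one drive segment."""
    return (
        segment["id"],
        max(segment["ego_speeds_mps"]),
        min(segment["pedestrian_ranges_m"], default=999.0),
        int(segment["rain_intensity"] > 0.5),
    )

drive_segments = [  # placeholder for raw per-segment sensor summaries
    {"id": "seg-001", "ego_speeds_mps": [7.2, 9.1, 10.4],
     "pedestrian_ranges_m": [23.0, 8.5], "rain_intensity": 0.8},
]

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE segment_attributes (
    segment_id TEXT, max_ego_speed_mps REAL,
    min_pedestrian_range_m REAL, is_raining INTEGER)""")
db.executemany("INSERT INTO segment_attributes VALUES (?, ?, ?, ?)",
               [extract_attributes(seg) for seg in drive_segments])

# Expert-defined rule: close pedestrian interactions at speed, in the rain.
hits = db.execute("""
    SELECT segment_id FROM segment_attributes
    WHERE min_pedestrian_range_m < 10.0
      AND max_ego_speed_mps > 8.0
      AND is_raining = 1""").fetchall()
print(hits)  # [('seg-001',)]
```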
While this approach affords separation of concerns and interpretability of the mined data, it also comes with major limitations:
Manual Bottlenecks: Attribute extraction and query crafting are heavily human-driven, relying on domain expertise and hand-tuned heuristics. This makes development slow, inconsistent, and at times error-prone.
Fragility at Scale: As rules accumulate to handle more, increasingly complex use cases, rule-based systems become brittle and difficult to maintain and extend.
Loss of Information: Hand-crafted attributes focus only on predefined signals, potentially discarding rich, latent information present in the raw sensor data.
Limited Expressiveness: Rule-based systems struggle to model complex relationships, limiting their effectiveness when mining dynamic or ambiguous scenarios.
By embracing an ML-first approach, Motional is building next-generation ML-based data mining systems inspired by the teacher-student paradigm, in which powerful offline models prepare the high-quality datasets required by the lighter, end-to-end AV models running on-car. In this blog post, we introduce Omnitag, an ML-Powered Multimodal Data Mining Framework that transforms the “dark matter” of autonomy into refined, ready-to-use fuel for next-generation AVs.
Omnitag: Towards “Example-Driven” Data Mining
At the heart of Motional’s next-generation ML-powered data mining systems is Omnitag, a multimodal framework designed to adapt to diverse mining requests with minimal human intervention—while improving scalability, capability, and applicability.
The diagram below illustrates how Omnitag embodies the concept of “omni” through its support for diverse modalities—including image, video, audio, world-state, and LiDAR. Each modality enables distinct, and often complementary, data mining capabilities. By leveraging the strengths of each, Omnitag supports a wide range of mining requests across heterogeneous data sources.

Our approach is built on three core pillars:
Multimodal Encoding for transforming raw, heterogeneous data into unified representations
One-Off Few-Shot Dataset Creation and Decoding to enable rapid, accurate mining with minimal supervision
Encoder-Decoder Adaptation to bridge domain gaps by tailoring models to our in-house datasets
Multimodal Encoding
Leveraging Multimodal Foundation Models for Data Mining
Omnitag leverages advancements in multimodal foundation models emerging from the open-source community. These models are trained on web-scale datasets with billions of parameters across massive compute clusters, learning general-purpose representations that can be adapted to a wide range of data mining tasks with minimal supervision. Unlike classic supervised models that rely on predefined and task-specific labels, multimodal foundation models infer structure, semantics, and relationships directly from raw, heterogeneous data—enabling broader generalization and adaptability.
How it works:
Multimodal Data Preprocessing: Raw data from diverse modalities—image, video, text, audio, world-state, and point cloud—are first aligned in time and space, normalized, and cleaned. This preprocessing step ensures that heterogeneous data are synchronized and formatted consistently, making them suitable for joint encoding. This unified treatment lays the groundwork for subsequent multimodal encoding.
Pretrained Multimodal Encoder: Multimodal foundation models pretrained across diverse modalities are adapted to encode preprocessed multimodal data into high-dimensional embeddings that preserve rich semantic and contextual information.
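The sketch below is a minimal, illustrative version of these two steps: timestamps from different sensors are aligned with a nearest-neighbor match, and an off-the-shelf CLIP model from Hugging Face transformers stands in for the pretrained multimodal encoders (which, along with the exact preprocessing, are not detailed in this post).

```python
# Illustrative only: nearest-neighbor timestamp alignment plus an off-the-shelf
# CLIP model (via Hugging Face transformers) standing in for Omnitag's
# pretrained multimodal encoders.
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def align_to_camera(camera_ts, other_ts):
    """Nearest-neighbor index of each camera timestamp within another sensor stream."""
    camera_ts, other_ts = np.asarray(camera_ts), np.asarray(other_ts)
    return np.abs(other_ts[None, :] - camera_ts[:, None]).argmin(axis=1)

@torch.no_grad()
def encode_images(frames):
    """Encode camera frames (PIL images or numpy arrays) into unit-norm embeddings."""
    inputs = processor(images=frames, return_tensors="pt")
    return torch.nn.functional.normalize(model.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def encode_texts(texts):
    """Encode text (e.g., a mining request) into the same embedding space."""
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    return torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
```

Because frames and text land in the same unit-normalized space, a mining request expressed in natural language reduces to a cosine-similarity lookup against cached frame embeddings.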
This approach breaks down the rigid silos of traditional rule-based data mining pipelines by transforming heterogeneous data into unified representation spaces. Each modality can inform and enrich the others, enabling scalable, flexible, and context-aware data mining that surfaces ambiguous or rare scenarios through cross-modal cues, such as using LiDAR to disambiguate visual occlusions in video, or referencing images to explain anomalies in audio.
One-Off Few-Shot Dataset Creation and Decoding
One of the key challenges in data mining is efficiently extracting relevant features from heterogeneous data for diverse data mining requests with minimal manual effort. Traditional approaches are, at best, slow to adapt, often requiring extensive hand-engineering and tuning. Omnitag addresses this by enabling “example-driven” data mining through the creation of one-off few-shot datasets. With just a handful of curated examples, users can train a lightweight decoder or leverage a larger, more powerful decoder in a zero-shot, in-context learning fashion.
Key Components:
RAG-Driven Interactive Few-Shot Dataset Creation: We leverage a Retrieval-Augmented Generation (RAG) loop to surface informative positive and negative examples based on similarity and dissimilarity in unified representation spaces. This interactive process allows users to quickly curate and refine few-shot datasets with minimal manual effort; a simplified sketch of this loop and of the decoders described below follows this list.
Few-Shot Decoding: The curated examples are used to train lightweight decoders which in turn are used to efficiently tag cached multimodal embeddings at scale. These small models are quick to iterate on and can be continually improved through the RAG-driven interactive loop.
Zero-Shot Decoding: The same few-shot examples can be used as in-context prompts for powerful decoders during zero-shot inference. This makes it easy to experiment with and evaluate emerging open-source models, allowing Omnitag to quickly improve as more capable decoders become available.
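The sketch below is a simplified, hypothetical version of this flow: cosine-similarity retrieval over cached embeddings surfaces candidates for interactive labeling, a small scikit-learn classifier plays the role of the lightweight few-shot decoder, and the same curated examples are formatted into an in-context prompt for a larger zero-shot decoder. The function names and prompt format are illustrative, not Omnitag's actual implementation.

```python
# Hypothetical sketch: interactive few-shot curation over cached embeddings,
# a lightweight few-shot decoder, and an in-context prompt for a zero-shot decoder.
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrieve_candidates(query_emb, cached_embs, k=50):
    """Surface the k most similar cached embeddings (cosine similarity on unit vectors)."""
    scores = cached_embs @ query_emb
    return np.argsort(-scores)[:k]

def curate_few_shot_set(query_emb, cached_embs, label_fn, k=50):
    """One round of the interactive loop: retrieve candidates, ask the user
    (here, label_fn) for positive/negative labels, and return the curated set."""
    idx = retrieve_candidates(query_emb, cached_embs, k)
    labels = np.array([label_fn(i) for i in idx])  # 1 = positive, 0 = negative
    return cached_embs[idx], labels

def train_few_shot_decoder(examples, labels):
    """Lightweight decoder: a linear classifier over frozen embeddings."""
    return LogisticRegression(max_iter=1000).fit(examples, labels)

def tag_at_scale(decoder, cached_embs, threshold=0.5):
    """Tag every cached embedding with the trained few-shot decoder."""
    return decoder.predict_proba(cached_embs)[:, 1] >= threshold

def build_zero_shot_prompt(descriptions, labels, query_description):
    """Format the same curated examples as an in-context prompt for a large decoder."""
    lines = [f"Example: {d} -> {'positive' if y else 'negative'}"
             for d, y in zip(descriptions, labels)]
    lines.append(f"Example: {query_description} -> ?")
    return "\n".join(lines)
```

In practice, repeated rounds of retrieve, label, and retrain let the lightweight decoder improve quickly, while the prompt builder reuses the same handful of examples to steer a more powerful model without any training.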
By combining RAG-driven dataset creation with few-shot and zero-shot decoding, Omnitag turns a small number of labeled examples into rich supervision signals, accurately tagging embeddings to support a wide range of data mining needs.
Encoder-Decoder Domain Adaptation
Multimodal foundation models pretrained on web-scale datasets offer impressive generalization across diverse tasks. At Motional, we adapt these powerful encoders and decoders to our in-house datasets to bridge the gap between the general-purpose capabilities of off-the-shelf open-source models and the specific needs of real-world AV deployment across diverse Operational Design Domains (ODDs).
Rather than relying solely on pretrained models, we fine-tune them on curated, high-quality data collected from our own fleet—data that captures edge cases and domain-specific distributions rarely represented in public datasets. This ensures that the representations learned by our encoders and the predictions made by our decoders are tightly aligned with the real-world challenges and data needs of our autonomy stack.
Our adaptation process includes two key strategies:
Domain-Specific Fine-Tuning: We selectively fine-tune pretrained encoders and decoders on representative data from our target ODDs, aligning the learned feature space with real-world deployment scenarios and supporting a wide range of targeted data mining requests; an illustrative fine-tuning sketch follows this list.
Continuous Feedback Loop: As new data is collected and rare or long-tail events are discovered, we incrementally adapt our models—enabling continuous improvement with minimal overhead.
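As an illustration of the first strategy, the sketch below freezes most of a pretrained encoder, fine-tunes only its last few blocks together with a small tagging head, and trains on in-house ODD data. The encoder layout (a `blocks` module list), the multi-label head, and the training loop are assumptions made for the sake of the example rather than Motional's actual recipe.

```python
# Illustrative domain-specific fine-tuning: freeze most of a pretrained encoder
# and adapt only its final blocks plus a small head on in-house ODD data.
# The encoder interface, dataloader, and label set are placeholders.
import torch
import torch.nn as nn

def build_finetune_model(pretrained_encoder, embed_dim, num_tags, trainable_blocks=2):
    # Freeze everything, then unfreeze the last few transformer blocks (assumes
    # the encoder exposes a `blocks` ModuleList, a common but not universal layout).
    for p in pretrained_encoder.parameters():
        p.requires_grad = False
    for block in list(pretrained_encoder.blocks)[-trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    head = nn.Linear(embed_dim, num_tags)  # small task head trained from scratch
    return pretrained_encoder, head

def finetune(encoder, head, dataloader, epochs=3, lr=1e-4):
    params = [p for p in encoder.parameters() if p.requires_grad] + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # multi-label tagging objective
    encoder.train(); head.train()
    for _ in range(epochs):
        for batch, targets in dataloader:  # in-house ODD data (assumed format)
            logits = head(encoder(batch))  # encoder(batch) assumed to return (B, embed_dim)
            loss = loss_fn(logits, targets)
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder, head
```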
By adapting multimodal foundation models to our in-house datasets, Omnitag retains the flexibility and scalability of general-purpose pretrained models while delivering strong performance on our specific data mining tasks. It also keeps the ML-based data mining system aligned with Motional’s evolving needs, even in the presence of significant domain or distribution shifts.
A Step Toward ML-First Data Mining for Autonomous Vehicles
Omnitag marks a fundamental shift in how autonomous vehicle data is mined, curated, and turned into actionable intelligence. By combining multimodal foundation models, example-driven few-shot learning, and domain-specific adaptation, Omnitag replaces brittle, manual data mining pipelines with a flexible, scalable, and continuously improving data mining system.
This ML-first data mining approach empowers Motional to adapt to diverse ODDs and surface the “dark matter” of autonomy: the rich, hidden structure embedded in raw multimodal data. As open-source models rapidly evolve, Omnitag ensures Motional can quickly harness these advances to transform raw, multimodal data into refined, ready-to-use datasets with which our ML-first AV systems are continuously enhanced to navigate more complex environments and provide a safe and comfortable rider experience.