Solving the Data Management Challenge of Autonomous Driving

This sponsored post from IBM explores how the industry can manage and utilize all the data used in AI development for autonomous driving and cars. 

The action of driving an automobile is destined to take a very different route in the coming years as autonomous vehicles become commonplace on our roadways. The notion of autonomous driving (AD) is both scary and exciting, pushing the boundaries of what we know. However, the AD space is rife with technology challenges. This first post in a series of three looks at one of the most vexing challenges posed by autonomous vehicles: how to manage all the data used in artificial intelligence (AI) development.

With hundreds of sensors onboard, a single vehicle can produce terabytes of data each day. (Photo: Shutterstock/metamorworks)

Reading Hundreds of Sensors Per Vehicle

With hundreds of sensors onboard, a single vehicle can produce terabytes of data each day. But data scientists typically do not look at just one vehicle. As time goes on, developers of autonomous vehicles might have multiple AI models, with multiple versions and hundreds of different data subsets.

Data scientists need to retrieve and analyze essential parts of this data—but not all of it—to improve the core models and the intelligence derived from them, which is then redistributed to the individual platforms. Locating the right data is a major challenge, and for data scientists the 80/20 rule generally applies—that is, 80 percent of a data scientist’s time can be spent simply finding, cleansing and organizing data. These curatorial tasks can leave the data scientist only 20 percent of their valuable time to actually build and apply deep learning models and perform the required analysis.

In addition, data scientists are in short supply. One way organizations can address the shortage of skills and save data scientists’ time is to simplify or automate portions of the AI development process.

Leveraging Data About Data

Autonomous vehicle development requires responses in mere seconds, and the key to achieving such rapid results is to have very good metadata—the data about the data—that helps scientists and developers manage complexity.

What’s needed to meet this challenge is software that can index the metadata for hundreds of billions of files, query that metadata, and very quickly identify the files and objects needed. IBM Spectrum Discover, for example, is designed to provide these capabilities: it offers data scientists a metadata management tool that makes it easier to categorize, search, locate and tag data based on attributes at petabyte scale.

IBM Spectrum Discover makes it possible to be very specific in calling up data, whether for an individual vehicle and event, or for an entire fleet deployed over a particular time frame. For example, a user could say, “I want to see this specific data from all of the vehicles that are a particular model in our fleet running a particular version of this system and software.”
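
To make the idea of such a query concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not the IBM Spectrum Discover API; the table layout, field names (vehicle_model, software_version, sensor, recorded_at), file paths and sample records are all hypothetical and chosen only for illustration. The point is that the search runs against a small metadata index rather than against the sensor files themselves.

```python
import sqlite3

# Hypothetical metadata records: one row per sensor file.
# Columns: path, vehicle_id, vehicle_model, software_version, sensor, recorded_at
SAMPLE = [
    ("/fleet/v001/cam_0001.bin", "v001", "model-a", "2.1", "camera", "2019-03-01T08:00:00"),
    ("/fleet/v001/lidar_0001.bin", "v001", "model-a", "2.1", "lidar", "2019-03-01T08:00:00"),
    ("/fleet/v002/cam_0001.bin", "v002", "model-a", "2.0", "camera", "2019-03-01T09:00:00"),
    ("/fleet/v003/cam_0001.bin", "v003", "model-b", "2.1", "camera", "2019-03-02T10:00:00"),
    ("/fleet/v004/cam_0001.bin", "v004", "model-a", "2.1", "camera", "2019-03-05T11:00:00"),
]


def build_index(records):
    """Load file metadata into an in-memory SQLite table and index it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_meta ("
        "path TEXT PRIMARY KEY, vehicle_id TEXT, vehicle_model TEXT, "
        "software_version TEXT, sensor TEXT, recorded_at TEXT)"
    )
    # Index the attributes that queries filter on most often (an assumption).
    conn.execute(
        "CREATE INDEX idx_model_version "
        "ON file_meta (vehicle_model, software_version)"
    )
    conn.executemany("INSERT INTO file_meta VALUES (?, ?, ?, ?, ?, ?)", records)
    return conn


def find_files(conn, vehicle_model, software_version, sensor, start, end):
    """Return paths of files matching the attributes within [start, end)."""
    rows = conn.execute(
        "SELECT path FROM file_meta "
        "WHERE vehicle_model = ? AND software_version = ? AND sensor = ? "
        "AND recorded_at >= ? AND recorded_at < ? "
        "ORDER BY path",
        (vehicle_model, software_version, sensor, start, end),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_index(SAMPLE)
    # "Camera data from all model-a vehicles running version 2.1, March 1 to 3."
    for path in find_files(conn, "model-a", "2.1", "camera",
                           "2019-03-01T00:00:00", "2019-03-04T00:00:00"):
        print(path)
```

Only the matching file paths come back, so a data scientist would ingest that subset instead of scanning every file. A production system at the scale described here would need a very different storage engine; the sketch only illustrates the shape of the query.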

Tagging Data to Add Value Over Time

The more data scientists can narrow the selection of data to ingest for analysis, the more efficient they can be with both their time and computing resources. The point of aggregation where data scientists are working and using the data is the most obvious place for tagging. In an ideal architecture, the data is tagged at both the aggregation level and the individual vehicle level.

Each time data is aggregated and analyzed, the AI models performing the analysis can derive new information and use tagging to apply that added knowledge to the data. As a result, the semantic content and usefulness of each piece of data increases. And in the data-driven world of AI and autonomous vehicles, getting maximum value from information is one of the keys to unlocking the future.
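
The way tags accumulate over successive analysis passes can be sketched as follows. This is an illustrative Python example under stated assumptions, not a description of how any particular product stores tags: the TagStore class, the tag names (vehicle_model, weather) and the file names are hypothetical.

```python
from collections import defaultdict


class TagStore:
    """Hypothetical in-memory store mapping a file path to its tags."""

    def __init__(self):
        self._tags = defaultdict(dict)

    def add_tags(self, path, new_tags):
        """Merge tags from one pass into the existing tags for a file."""
        self._tags[path].update(new_tags)

    def tags_for(self, path):
        """Return a copy of the tags currently attached to a file."""
        return dict(self._tags.get(path, {}))

    def find(self, **criteria):
        """Return sorted paths whose tags match every given key/value pair."""
        return sorted(
            path
            for path, tags in self._tags.items()
            if all(tags.get(key) == value for key, value in criteria.items())
        )


if __name__ == "__main__":
    store = TagStore()
    # Pass 1: tags applied at the vehicle level when the data is recorded.
    store.add_tags("a.bin", {"vehicle_model": "model-a"})
    store.add_tags("b.bin", {"vehicle_model": "model-a"})
    # Pass 2: a later analysis at the aggregation level adds derived knowledge.
    store.add_tags("a.bin", {"weather": "snow"})
    print(store.find(vehicle_model="model-a", weather="snow"))
```

After the second pass, the file a.bin can be found by a query that was not possible after the first pass, which is the sense in which each round of tagging increases the usefulness of the data.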

Learn more about IBM Spectrum Discover, and if you are an existing IBM client, sign up for a FREE 90-day trial.

Look for part 2 of this series to learn even more about what IBM is doing to help solve the AD challenge at every stage of the AI pipeline.


  1. John Anderson says

    This is a significant consideration for any industry and resonates with the Insurance industry in particular. Not just the data from vehicles but the data from the connected home and a more intriguing concept of the IoT readouts from infrastructure. Imagine combining data from traffic flow (with average vehicle weight) against weather (heavy snow), age of a bridge (and structural grade).
    Question: When moving from analysis of all the data (boiling the ocean) to identification of the most relevant data (establishing a relevant data set), how do you make that move? And how often are you monitoring seemingly irrelevant conditions from the “ocean” that may suddenly become relevant?