Finding the needles in 'big data' haystacks

Mar. 7—A seemingly bottomless ocean of "big data" has flooded our world. Bits and bytes are pouring in from sources ranging from satellites and MRI scans to massive computer simulations and seismic-sensor networks, from security cameras to smartphones, from genome sequencing of SARS-CoV-2 to COVID-19 test results, from social networks to texts zipping from phone to phone.

Making sense of this ever-increasing racket is vital to national security, economic stability, individual health and practically every branch of science — and the job is getting easier, thanks to the SmartTensors artificial intelligence tool we have developed at Los Alamos National Laboratory.

Without any human guidance, this technology sifts through millions of millions of bytes, or terabytes, of diverse data to find the hidden patterns and features that make the data understandable, revealing its underlying processes or causes. SmartTensors also can identify just how many features are needed to make sense of enormous, multidimensional datasets.

Finding that optimal number reduces a massive dataset to a scale that's manageable for computers to process and for subject-matter experts to analyze. The features that SmartTensors extracts are explainable, understandable chunks of data.

What makes a face?

Take, for example, facial recognition algorithms, which rely on large datasets. A face is a collection of features, some essential and some that matter less: noses, eyes, eyebrows, ears, mouths, cheeks, foreheads, jawlines, hairlines and chins. Pointed at a large number of photos of faces, SmartTensors can isolate the features that matter most for recognizing faces. It also can determine how many of those features, the optimal number, are required to do the job accurately and reliably. Perhaps only specific shapes of eyes, noses and mouths are needed for facial recognition, or perhaps it is useful to group together all the faces that have oval eyes and slim noses.
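If you are curious what that looks like in practice, here is a minimal sketch, assuming Python with the open-source scikit-learn library and its small public Olivetti faces dataset (neither is part of SmartTensors). It uses non-negative matrix factorization to pull face "parts" out of the photos and a simple reconstruction-error scan to gauge how many features are enough; SmartTensors' own factorizations, and its automatic way of choosing that number, are more sophisticated than this heuristic.

# Sketch: extract face "parts" with non-negative matrix factorization (NMF)
# and scan a few candidate feature counts. Illustration only, not SmartTensors.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF

# 400 grayscale face photos, each flattened to a 4,096-pixel row
# (downloads a small public dataset the first time it runs).
faces = fetch_olivetti_faces(shuffle=True, random_state=0).data

errors = {}
for k in (5, 10, 20, 40):  # candidate numbers of hidden features
    model = NMF(n_components=k, init="nndsvda", max_iter=300, random_state=0)
    usage = model.fit_transform(faces)        # how strongly each face uses each feature
    errors[k] = model.reconstruction_err_     # how well k features rebuild the photos

for k, err in errors.items():
    print(f"{k:>3} features -> reconstruction error {err:.1f}")

# Each row of model.components_ is a learned facial "part" (an eye, nose or mouth
# region); reshaping a row to 64 x 64 pixels and plotting it makes that visible.

The error keeps shrinking as features are added; the practical question, and the one SmartTensors answers automatically, is where the payoff levels off.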

In other datasets, the features needed to represent the whole might not be so obvious. Very large collections of data, measured in billions of millions of bytes (petabytes), typically are made up of unknown features obscured by a torrent of less useful information and noise.

Vast datasets, such as COVID-19 test results or readings from earthquake sensors, are built entirely from things we can observe directly. But in big-data analytics, it is difficult to link these observables to the underlying processes that drive the system's behavior and generate the data. These hidden processes, or features, are not directly observable, and they are confusingly mixed with one another, with unimportant features and with noise.
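As a toy illustration of that mixing, the short sketch below builds "observed" data as blends of a few hidden processes plus noise; the three processes, the 20 sensors and the noise level are all invented for the example (plain Python with NumPy, nothing from SmartTensors). The blended, noisy rows are the only thing an analyst would ever get to see.

import numpy as np

rng = np.random.default_rng(0)

# Three hidden processes we never observe directly (all invented for illustration).
hidden = np.vstack([
    np.sin(np.linspace(0, 8, 500)) ** 2,            # a smooth periodic process
    np.exp(-np.linspace(0, 5, 500)),                 # a slowly decaying process
    (rng.random(500) < 0.02).astype(float),          # rare spike events
])

# Each of 20 "sensors" records its own blend of the hidden processes, plus noise.
mixing = rng.random((20, 3))
observed = mixing @ hidden + 0.05 * rng.normal(size=(20, 500))

print(observed.shape)  # (20, 500): the mixed, noisy data is all we have to work with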

Cocktail party problem

The problem is similar to picking out individual voices at a noisy cocktail party using a set of microphones that record all the chatter. How do you isolate one or more conversations while people are moving around and talking over one another? The number of hidden features here is the number of individual voices, each with its own characteristics, such as pitch and tone. Once that's determined, it's easier to follow a conversational thread or a particular person.
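A classic way to attack this in code is blind source separation. The sketch below, assuming Python with scikit-learn, mixes three made-up "voices" into four simulated microphone recordings and then unmixes them with independent component analysis (FastICA). SmartTensors itself is built on non-negative matrix and tensor factorizations rather than ICA, so treat this as an illustration of recovering hidden sources, not as the Laboratory's method.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.linspace(0, 8, 2000)

# Three "voices" (simple stand-in signals for speakers at the party).
voices = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]

# Four microphones, each hearing its own mixture of the three voices.
mic_gains = rng.random((3, 4))
recordings = voices @ mic_gains              # shape (2000, 4): what we actually record

# Assume we already know (or have estimated) that there are 3 hidden sources.
ica = FastICA(n_components=3, random_state=0)
recovered = ica.fit_transform(recordings)    # shape (2000, 3): the separated "voices"

print(recordings.shape, recovered.shape)

The recovered signals come back in arbitrary order and scale, which is typical of blind separation; matching them to particular speakers is a second step.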

Similarly, to sort out the important information in a dataset, SmartTensors organizes the information into a data cube, or tensor, with three or more dimensions. Each dimension is a particular category of information within the data. In the cocktail party example, the pitch of a voice might be one dimension, its tonal qualities another, its volume a third, and so on. If you think of the data cube as a stack of many small cubes, each small cube holds information about some or all of the features in the data. Representing the data as a tensor allows fast processing as the AI churns through it all.
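To make the data-cube idea concrete, here is a minimal sketch that builds a small three-way tensor and factors it into three hidden features, with one factor vector per dimension per feature. It assumes Python with the open-source TensorLy library and its non-negative PARAFAC routine, and it assumes the number of features (three) up front; finding that number automatically from the data is the part SmartTensors adds, and its factorization methods differ from this off-the-shelf routine.

import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(2)

# A toy data cube: 10 speakers x 6 pitch bands x 50 time windows (all invented).
cube = tl.tensor(rng.random((10, 6, 50)))

# Factor the cube into 3 hidden features; each feature gets one vector per dimension.
cp = non_negative_parafac(cube, rank=3, n_iter_max=200)
weights, factors = cp
speakers, pitches, times = factors
print(speakers.shape, pitches.shape, times.shape)   # (10, 3) (6, 3) (50, 3)

# How closely do just 3 features reproduce the full cube?
error = tl.norm(cube - tl.cp_to_tensor(cp)) / tl.norm(cube)
print(f"relative reconstruction error: {error:.2f}")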

As you might expect, we've applied SmartTensors to more important problems than separating individual conversations at a cocktail party. SmartTensors is helping us understand climate processes, watershed mechanisms, hidden geothermal resources, carbon sequestration processes, chemical reactions, protein structures, pharmaceutical molecules, cancerous mutations in human genomes, and more. In a world swimming in big data, this kind of tool just might help us all keep our heads above water.

Boian Alexandrov is an AI expert and principal investigator on the SmartTensors project in the Physics and Chemistry of Materials group at Los Alamos National Laboratory. Velimir "Monty" Vesselinov is an expert in machine learning, data analytics and model diagnostics in the Computational Earth Science group at Los Alamos, and also a principal investigator on the project. SmartTensors was funded by the Laboratory Directed Research and Development (LDRD) program at Los Alamos. For more information, visit the SmartTensors website.