What the Tech is deterministic vs. probabilistic data?

Illustration by Ollie Catton / The Current

With Google predicting it will sunset third-party cookies roughly by the end of 2024, and regulators tightening consumer privacy rules, advertisers are on the hunt to replace the stale tracking technology and, at the same time, deliver an improved, privacy-conscious ad experience for consumers. The marketplace is giving advertisers several identity solutions to replace the industry standard, such as Unified ID 2.0 (UID2), among others.

And because nothing is simple in ad tech, the move toward new identifiers is not just about replacing cookies. It’s about leveling the playing field between the open internet and walled gardens.

We know that sentiment is loaded with things to unpack, so let’s start with the basics. To understand all of this, it helps to look at the types of data that identifiers rely on to work effectively, namely deterministic and probabilistic data, and the difference between them.

Deterministic, what?

Deterministic data is any information you might supply directly to a brand, typically through an account you’ve created to log in to a company’s site or app. This data is very specific (your name, age, gender, address, email, and phone number) and likely accurate because you’re entering it yourself.

So, that pet subscription service you signed up for? Or that new summer outfit you bought online? Or that pizza you ordered on Seamless last night? Yep, all examples of ways you may have supplied deterministic data — all by providing your email address at some point in the process, whether it’s by logging in or checking out. Deterministic data is one piece of first-party data, which also includes what brands can glean from actions you’ve taken on a website or app, such as shopping behavior and the number of items you normally keep in your online cart.

Ok. So how is deterministic data used to identify consumers?

With deterministic data, a consumer can be identified when consumer data owned by an advertiser (such as an email address supplied when logging in to a website) is matched to publisher inventory that has the same email address associated with it. Sometimes multiple pieces of deterministic data are used to confirm a match.
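
As a rough illustration of that matching step, here is a minimal Python sketch. It assumes both sides normalize email addresses the same way and compare SHA-256 hashes rather than raw addresses. The function names, the sample addresses, and the normalization rule (trim and lowercase) are all hypothetical; real identity solutions define their own normalization and hashing requirements, and this is not a description of any of them.

```python
import hashlib


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so trivially different spellings match."""
    return email.strip().lower()


def hash_email(email: str) -> str:
    """Return a SHA-256 hex digest of the normalized email (hypothetical match key)."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()


def deterministic_matches(advertiser_emails, publisher_emails):
    """Return the hashed keys present on both sides. Only exact matches count."""
    advertiser_keys = {hash_email(e) for e in advertiser_emails}
    publisher_keys = {hash_email(e) for e in publisher_emails}
    return advertiser_keys & publisher_keys


# Made-up sample data: one address appears on both sides with different casing.
advertiser_crm = ["Pat@example.com", "sam@example.com"]
publisher_logins = [" pat@example.com ", "lee@example.com"]

matched = deterministic_matches(advertiser_crm, publisher_logins)
print(len(matched))
```

The point of the sketch is that a deterministic match is all or nothing: either the same key exists on both sides or it does not.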

I think I’m following. Now what is this probabilistic data you mentioned earlier?

Probabilistic data, also known as contextual or cognitive data, is a puzzle made of many small pieces of information used to determine the (you guessed it) probable identity of a consumer. Here, the data might be made up of mobile ad IDs, device attributes, Wi-Fi signals, or browser-level data, which is compared against third-party data sets to associate other likely attributes with the consumer. Overall, wider nets are cast to identify a single individual.

Got it. So, how exactly is probabilistic data used to identify consumers, then?

“Identify” is a loose term with probabilistic data, and that’s because probabilistic data stitches together an image of who a consumer might be based on a few known attributes.

All of the signal data mentioned earlier can ladder up to an inferred identity. Statistical models and algorithms, already found within programmatic marketplaces, can be used by advertisers to filter through probabilistic data to build user profiles and lookalike models. This could mean grouping users into categories, such as the media they are most likely to consume or the interests they gravitate toward (like traveling to beach destinations). Probabilistic data can also be used to create a device graph, which connects different devices to the same user based on patterns in their behavior.
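
To make the device graph idea concrete, here is a toy Python sketch that links devices whose shared signals add up to a high enough score. Everything in it is an assumption for illustration: the signal names, the point weights, the threshold of 70, and the sample devices are invented, and production systems rely on far richer statistical models than a weighted sum.

```python
from itertools import combinations

# Hypothetical points (out of 100) for each signal two devices might share.
SIGNAL_WEIGHTS = {
    "ip_network": 40,
    "wifi_hash": 30,
    "browser_language": 20,
    "timezone": 10,
}


def match_score(device_a: dict, device_b: dict) -> int:
    """Sum the weights of signals that both devices report with equal values."""
    score = 0
    for signal, weight in SIGNAL_WEIGHTS.items():
        value = device_a.get(signal)
        if value is not None and value == device_b.get(signal):
            score += weight
    return score


def build_device_graph(devices: dict, threshold: int = 70) -> list:
    """Return (id_a, id_b, score) edges for pairs scoring at or above the threshold."""
    edges = []
    for id_a, id_b in combinations(sorted(devices), 2):
        score = match_score(devices[id_a], devices[id_b])
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges


# Made-up devices: the phone and laptop share a network and Wi-Fi; the tablet does not.
devices = {
    "phone": {"ip_network": "203.0.113.0/24", "wifi_hash": "ab12",
              "browser_language": "en-US", "timezone": "America/New_York"},
    "laptop": {"ip_network": "203.0.113.0/24", "wifi_hash": "ab12",
               "browser_language": "en-US", "timezone": "America/New_York"},
    "tablet": {"ip_network": "198.51.100.0/24", "wifi_hash": "ff90",
               "browser_language": "en-US", "timezone": "America/New_York"},
}

edges = build_device_graph(devices)
print(edges)
```

Unlike an exact email match, the link here is only a likelihood. Lowering the threshold connects more devices at the cost of more wrong guesses.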

And, what exactly does all of this have to do with cookies going away?

Deterministic data is often difficult to come by, and even then it can be mismatched or inaccurate. If a needed piece of the data is not available, the result can be low precision and accuracy in targeting.

And with more consumer privacy laws in effect around the world, advertisers have less leeway to collect and use deterministic data for advertising purposes. It’s for this reason, along with ongoing browser updates, that cookies are predicted to go away completely.

Does it all boil down to consumer privacy?

Largely. But it may not be what you think.

The misuse of walled gardens’ deterministic data has been at the center of ongoing news cycles (remember the Cambridge Analytica scandal?), but the emphasis of these allegations should be on the alleged misuse, not the data itself. Access to and use of deterministic data have propelled walled gardens to grow their market share disproportionately.

But deterministic doesn’t mean “bad” and probabilistic doesn’t mean “good” (or vice versa). Deterministic signals are about as precise as it gets, which means they can be used to model higher-quality audiences and achieve greater scale compared with probabilistic signals.

In general, probabilistic data presents a wider swath of individuals. With probabilistic data, advertisers or publishers can approximate identity with multiple data points, even if a user is not logged in or has not supplied specific details like their email address, potentially lowering the risk of privacy issues.

It’s the difference between reaching a larger audience of similar consumers, which can help with overall reach, and pinpointing exact consumers. Both types of data can be used to serve more relevant ads to the end user and are often used in tandem to create more robust data sets for targeting.