V is for Privacy


Recently I attended a big data event sponsored by a banking association. The content was pretty valuable, with the tracks organized into two main threads:

  • What big data is. An architectural discussion, followed by the original three “Vs” (volume, velocity and variety) that Doug Laney (then at META Group, which Gartner later acquired) used to define big data back in 2001. This document explains more. Many other Vs followed the original three, in a running joke that went too far and left behind important stuff not beginning with a “V”.
  • What you can do with big data. This was the best thread, putting the focus on predictive analytics. The speakers provided lots of examples and use cases about the magic of predictive analytics, ranging from targeted advertising to pregnancy or divorce predictions based on shopping cart contents.

The most interesting part was not what they said, but rather what they did not say. First of all, there was no mention of security or privacy problems and solutions in big data environments, but that did not surprise me. As with most new technologies, we tend to push them first and make them more secure only in a second phase.

It’s really a pity that neither “privacy” nor “security” begins with a “V”.

Predictive Analytics and Big Data

Predictive analytics and big data are not new technologies, but each has been boosting the other in recent years.

Today we’re (almost) all aware of the privacy issues related to personal data shared over the internet. Conscious users like me always stop for a second before hitting the “Submit” button to think about the consequences: what kind of data is being shared, with whom, and what will be done with it, at least in my country. Then I share it anyway, because it’s fun to get services for free, but that is a different discussion. :-)

Privacy Regulation and Big Data

The country I live in, Italy, adopted a pretty good privacy regulation back in 1996. That regulation distinguishes between simple personal data (such as your email or what you buy) and sensitive personal data (such as medical data, sexual preferences, politics, religion, etc.). When registering for a local online or offline service, I’m told how the service provider will use my data, and there’s always a separate checkbox that lets me accept or refuse the sharing of my data with third parties. It’s always an opt-in scheme.

So, in most cases, I know what will happen to my personal or sensitive data before sharing it, and I can decide whether to sign up for that fancy service or not. Unfortunately, the internet spans multiple countries, and many of them work with opt-out-based privacy regulations.
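The difference between the two schemes comes down to what happens when the user does nothing. Here is a minimal sketch of that idea in Python; the `Consent` record and the `default_consent` function are hypothetical names invented for this post, not part of any real regulation or library.

```python
from dataclasses import dataclass


@dataclass
class Consent:
    """Hypothetical consent record for one user of one service."""
    share_with_third_parties: bool


def default_consent(scheme: str) -> Consent:
    """Return the consent state of a user who never touched the checkbox."""
    if scheme == "opt-in":
        # Sharing stays off until the user explicitly ticks the box.
        return Consent(share_with_third_parties=False)
    if scheme == "opt-out":
        # Sharing is on until the user explicitly objects.
        return Consent(share_with_third_parties=True)
    raise ValueError(f"unknown scheme: {scheme!r}")
```

Under an opt-in scheme the lazy user shares nothing with third parties; under an opt-out scheme the same lazy user shares everything.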

Predictive analytics is a game changer. I’m generally positive about predictive technology, but I have a lot of concerns about the use and abuse of my (big) data in the hands of corporations. For example, a few days ago I purchased a smart wristband, created an account on the vendor’s website, installed the companion app on my phone and started using it. Besides the large amount of data it tracks automatically, I noticed I can enter information about the food I eat, either manually or semi-automatically by scanning the bar code on food packages. The purpose is to show the balance between calories eaten and calories burned.

Let’s take a look at what “Smartband Vendor, Inc.” is putting into its data (probably big data) ecosystem, with my consent:

  • Name, birthday, email and other personal data
  • Heart rate. I’m not sure a simple heart rate counts as sensitive medical data; let’s assume it doesn’t
  • Stairs climbed
  • Steps walked
  • Sleep time
  • Food eaten. (Umm… does the bar code indirectly also tell where I bought it?)
  • GPS data. The mobile app is authorized to read it, so let’s assume it tracks me

There’s no predictive algorithm running on my data as far as I know. Let’s be creative and invent one for the sake of writing this blog post!

With all this data coming from thousands of different users, they could build a cancer risk factor or a heart disease prediction model. “If you eat that thing three times a week, and you sleep badly, and you go to the gym only twice a month, there’s an XX% probability you will have a heart attack before age 54”. Some thoughts on this hypothetical scenario:

  • I’d be happy to know that result and take action, but I’d be especially happy to know that information is kept private.
  • I agreed to share simple personal data, but the above is sensitive medical information, which falls into a different privacy category that I did not agree to share.
  • The above information may be FALSE!
  • The predicted data is probably unregulated (I did not submit it), thus not protected by my beloved Italian privacy regulation. Smartband Vendor, Inc. could do whatever they want with it.
  • Would an employer look for such data? Would they hire me after learning my predicted health status?
  • Would my life insurance rates increase?
  • Is that data store encrypted? Protected in a secured repository?
  • What’s the value of such information? And the big data database? And the prediction algorithm? Are we at the dawn of a new black market?
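To make the invented heart attack prediction a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption made up for this post: the field names, the logistic formula, the 8-hour sleep target and every coefficient. It is not a medical model and not anything a real vendor is known to run. The point is only that every input is “simple” personal data, while the output looks a lot like sensitive medical information.

```python
import math
from dataclasses import dataclass


@dataclass
class Profile:
    """Simple personal data, as a hypothetical wristband app might collect it."""
    junk_food_meals_per_week: int  # from scanned bar codes
    avg_sleep_hours: float         # from sleep tracking
    gym_visits_per_month: int      # from GPS and activity data


# All coefficients are invented for illustration only.
INTERCEPT = -3.0
W_JUNK_FOOD = 0.4
W_SLEEP_DEFICIT = 0.5
W_GYM = -0.3


def heart_risk(profile: Profile) -> float:
    """Return a made-up risk score between 0 and 1 (logistic function)."""
    # Hours of sleep missing compared with an assumed 8-hour target.
    sleep_deficit = max(0.0, 8.0 - profile.avg_sleep_hours)
    z = (
        INTERCEPT
        + W_JUNK_FOOD * profile.junk_food_meals_per_week
        + W_SLEEP_DEFICIT * sleep_deficit
        + W_GYM * profile.gym_visits_per_month
    )
    return 1.0 / (1.0 + math.exp(-z))


# The user from the quote: that thing three times a week, bad sleep, gym twice a month.
me = Profile(junk_food_meals_per_week=3, avg_sleep_hours=5.5, gym_visits_per_month=2)
print(f"Predicted risk: {heart_risk(me):.0%}")
```

I never typed that percentage into any form, yet it now sits in somebody’s database next to my name.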

Today we have regulations on data privacy. I’m pretty sure that one day we will have regulations on predictive analytics too. On that day you will have a compliance problem to solve, but if you don’t protect your data, your security problems have already started. Be prepared.

You can learn more on big data attacks at the following blog article.