What's the deal with those weird GDPR emails?

Edited 12/17: Comments from more people about the impact of this project

Edited 12/22: Updates from the researchers

The other day I got an email with the title “Questions About GDPR Data Access Process for [a website]”. I’m not going to give the specific site, but some important things to know are that it doesn’t collect personal data for any purpose, and that people in the EU are not the intended audience. The email goes on to state that the sender is in Russia, and has some questions about our handling of GDPR requests. Here’s the full text:

To Whom It May Concern:

My name is Vlad Orlov, and I am a resident of Moscow, Russia. I have a few questions about your process for responding to General Data Protection Regulation (GDPR) data access requests:

Would you process a GDPR data access request from me even though I am not a resident of the European Union?

Do you process GDPR data access requests via email, a website, or telephone? If via a website, what is the URL I should go to?

What personal information do I have to submit for you to verify and process a GDPR data access request?

What information do you provide in response to a GDPR data access request?

To be clear, I am not submitting a data access request at this time. My questions are about your process for when I do submit a request.

Thank you in advance for your answers to these questions. If there is a better contact for processing GDPR requests regarding [website], I kindly ask that you forward my request to them.

I look forward to your reply without undue delay and at most within one month of this email, as required by Article 12 of GDPR.

Sincerely,

Vlad Orlov

Several things seemed sketchy about this. The email address it was sent to is not the one I would give if someone wanted to contact me about this website. Also, if Vlad were even vaguely familiar with the site, they would know that it’s unlikely we would need to respond to an actual GDPR request, for the reasons I mentioned above. The sender clearly knows that there’s no legal obligation for a site in the US to respond to a request made by someone in Russia (although lots of people might choose to do so anyhow, rather than have one policy for people in the EU and one for those outside it). The email doesn’t say anything about why they want the information or how it will be used. Last, there’s a time pressure that implies we shouldn’t just ignore this email, suggesting there may be consequences.

I mentioned it to a friend of mine as a passing curiosity – this looks like phishing, but what’s the angle? She mentioned having seen something similar at work, so I offered to forward what I received so we could compare notes. Then my friend found some discussion of these messages on Mailop, which is a mailing list for people who run email services. There seem to be several versions of the emails, including some that are under the researcher’s actual name. Thus, we know that Ross Teixeira, a PhD student at Princeton, is conducting a “Study on Privacy Law Implementation”.

As part of the study, we are asking public websites about their processes for responding to GDPR and CCPA data access requests. We attempt to identify a website’s correct email address for data access requests through an automated system. While we have evaluated the system to confirm that it has high accuracy, some emails may be incorrectly directed to a website or email address.

Here’s the problem: you can’t just scrape lots of email addresses and start sending out questionnaires under fake names in the guise of a research project. Anti-spam laws require you to say how you got that email address, why you’re contacting them, who you actually are, and how people can tell you to go away. Research ethics dictate that you don’t just enroll people in a study without them knowing, deceive them about the nature of the research, or do things that may cause harm without someone’s consent (at the very least, someone receiving this sort of message might feel obliged to talk to a lawyer). The recipients of these messages aren’t computer systems; they’re people in a variety of situations. This draft of a complaint letter does a good job of explaining those ethical concerns.

Ross Teixeira has answered a few questions about the project on Twitter, but none of the answers really explains how the researchers can justify any of these problematic tactics as necessary to the research. I thought this was interesting, though: “We built and evaluated an automated system for identifying email addresses that are designated for data access requests from websites.” So this is some sort of classifier? What are you planning to do with it?

As Richard Hughes remarks, “I should have a ‘right to know’ where the data was collected from under the CCPA and also need to be able to do a ‘subject access request’ under the GDPR…”

Weird, right? I wanted to write this up because I do think the emails have been distressing for people who don’t typically get data privacy requests for their blog. I’m also curious how the folks responsible for oversight at Princeton are going to respond. Someone must have signed off on the project – what was the rationale?

Jeff Kosseff has some details about that:

I have no idea how many websites received this email, but both friends are at fairly off-the-radar organizations - and neither is covered by CCPA due to size/nonprofit status. Yet they were devoting time and legal resources to figure out the answers to these weird questions.

Princeton’s IRB has apparently determined this is not human subject research - a conclusion that I question. While the requests go to websites, the research ultimately inquires into how the people who run those websites respond to a series of questions about a confusing law.

The email asks questions that demonstrate why CCPA is so confusing, but then places the burden on unsuspecting website operators – many of whom are operating on a shoestring budget during a pandemic – to spend money and time to figure it out, while failing to identify that this is a study. I’ve practiced privacy law for more than a decade, and the responses would require me to do some research and put some time into it. I understand the value in “secret shopper” type research, but this is different because many businesses will need to turn to outside counsel and their costly billable hours to come up with a response. They have no idea they’re taking part in a study, and they just want to avoid getting a letter from the California AG.

I’ll add more if I find out anything new. There are more than enough major security issues happening right now, so I’m glad this is at least one that people can choose to ignore.

Updates:

The research website has had three updates since I first posted this. They added a FAQ, a message from Jonathan Mayer, who is the Principal Investigator, and another update to their research plans. The original FAQ indicated that they would be going ahead with using the data collected for the research project, but as of yesterday (12/21) the plan is to delete all responses and disable the email accounts on December 31.

In terms of resolving everyone’s concerns, the plan they’ve shared includes contacting recipients of the emails, replying to people who contacted the researchers directly, writing a research ethics case study, and following up with the mail operator and privacy rights communities on future practices. To me this all seems like a reasonable start, but my biggest concern is still that the researchers don’t understand the scope of harm caused.

According to the FAQ, “The set of websites for this study is sampled from the Tranco list of popular websites and publicly available datasets of third-party tracking websites.” I’ve checked: the Alexa rank for the site I was contacted about is over 2 million, which is way in the weeds! The basic assumption that the study was engaging with organizations that should have the ability to respond to GDPR or CCPA requests is extremely flawed. GDPR exempts purely personal activities and relaxes its record-keeping requirements for organizations with fewer than 250 employees. CCPA is even more limited: only for-profit businesses are included, and they need over $25 million in annual revenue, data from at least 50,000 California residents, or at least half of their revenue coming from selling personal information. Does that sound like your blog or community organization? Of course not.

It’s not enough to focus on more ethical data collection if they don’t also address how sites are being selected. Otherwise, this research is going to continue to be a waste of our time.