Removing PII from Your Google Analytics Implementation


May 22, 2017

When it comes to data collection, Personally Identifiable Information (PII) is a sensitive subject. Terms of service aside, most users want to know that any data collected about them is not personally identifiable. While the standards and expectations of the end customer vary, it is important to always be aware of what information you are collecting and to ensure that you meet, at a minimum, the standards of the analytics tool you are using. Beyond the toolset, it’s important to be aware of the countries you are serving and their various regulations.

During the audit process, we’ve found that the biggest PII offender is the URL, and overwhelmingly we find it via the Page dimension. In some cases PII is found in other places, and it should still be addressed there.
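As a rough illustration of that kind of audit, the sketch below scans a list of page paths for anything that looks like an email address. The helper name `findEmailLikePaths`, the deliberately loose pattern, and the sample paths are all assumptions made for this example; in practice the input would come from an export of the Page dimension.

```javascript
// Hypothetical audit helper: flags page paths that appear to contain an email address.
// The pattern is intentionally loose and also matches a URL-encoded "@" (%40, %2540).
var EMAIL_LIKE = /[^\s\/?&=]+(?:@|%40|%2540)[^\s\/?&=]+\.[a-z]{2,}/i;

function findEmailLikePaths(pagePaths) {
  return pagePaths.filter(function (path) {
    return EMAIL_LIKE.test(path);
  });
}

// Made-up sample rows standing in for an export of the Page dimension
var samplePaths = [
  '/home',
  '/unsubscribe?email=jane.doe@example.com',
  '/reset-password?u=jane%40example.com',
  '/products?id=42'
];

var flagged = findEmailLikePaths(samplePaths);
console.log(flagged); // the two paths carrying an email address
```

A check like this only catches email addresses. Names, addresses, and other identifiers still need a manual review of the query parameters in use.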

 

Filtering PII from Google Analytics

Using filters in GA does not remove PII thoroughly enough to meet the terms of service. The challenge is that filters are applied at the view level, which means that even if every view is filtered, you are still collecting the PII in your web property. Filters are a great asset for working within a single view; for example, they can be useful for passing non-PII values to custom dimensions. However, when it comes to ensuring that you aren’t collecting PII, filters just don’t cut it.

 

Removing PII from Google Analytics with GTM

As a standard, Analytics Pros recommends using a tag management platform. Most Analytics Pros clients use either Google Tag Manager (GTM) or Tealium. Because GTM offers a free version, we will focus on how to solve for PII collection in the URL using GTM. Two fields need attention to ensure we aren’t collecting PII via the URL: “page” and “referrer.” It’s important that you set these two fields on ALL tags within GTM. The page and referrer fields apply to pageview and event tag types alike, and it’s important that these settings match across tags.

[Screen capture: GTM fields to set for page and referrer]
Using matching JavaScript variables for both fields to set gives you a single place to make updates and ensures parity between the two fields.
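To illustrate why shared logic keeps the two fields in parity, here is a minimal sketch in plain JavaScript, run outside GTM. The helper `stripParams`, the `PII_PARAMS` list, and the sample URLs are hypothetical names and values chosen for this example; the same function is applied once to a page path and once to a full referrer URL.

```javascript
// Hypothetical shared helper: removes the listed query parameters
// from either a page path or a full URL.
function stripParams(url, namesToRemove) {
  var qIndex = url.indexOf('?');
  if (qIndex === -1) return url; // nothing to scrub
  var base = url.substring(0, qIndex);
  var kept = url.substring(qIndex + 1).split('&').filter(function (pair) {
    var name = pair.split('=')[0].toLowerCase();
    return pair !== '' && namesToRemove.indexOf(name) === -1;
  });
  return base + (kept.length ? '?' + kept.join('&') : '');
}

// Illustrative list only; build yours from an audit of your own parameters
var PII_PARAMS = ['email', 'firstname', 'lastname'];

// The same logic feeds both fields, so an update to the list applies to each
var page = stripParams('/signup?email=a@b.com&plan=pro', PII_PARAMS);
var referrer = stripParams('https://www.example.com/login?firstname=Ann&next=/cart', PII_PARAMS);

console.log(page);     // "/signup?plan=pro"
console.log(referrer); // "https://www.example.com/login?next=/cart"
```

Because both values pass through one function and one parameter list, a newly discovered offender only has to be added in one place.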

The code used (a Custom JavaScript variable) to help control PII takes a few factors into account. The first is a predefined list of known offenders. As you audit your query parameters and find known issues, you’ll want to add the parameter names to this JS. The second is looking for email addresses. As a backup to the known offender list, we look for email addresses and remove them from the URL as well. An email address in the URL is the most common PII offense and often appears on pages such as email subscription, email unsubscribe, or forgotten password forms. Finally, while not technically PII, we also like to clean up the language and locale paths often found at the beginning of the URL.
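Before the full variable, a stripped-down sketch of the email and locale steps may help show what each one does to a path. The function names `maskEmails` and `stripLocale`, the simplified regexes, and the sample path are assumptions for illustration and are not the production patterns.

```javascript
// Simplified email step: replace anything email-like, including a URL-encoded "@"
function maskEmails(path, replacement) {
  return path.replace(/[^\s\/?&=]+(?:@|%40|%2540)[^\s\/?&=]+\.[a-z]{2,}/gi, replacement);
}

// Simplified locale step: drop a leading segment such as /en-us
function stripLocale(path) {
  var stripped = path.replace(/^\/[a-z]{2}-[a-z]{2}(?=\/|\?|$)/i, '');
  return stripped || '/'; // a bare locale path becomes "/"
}

// Made-up example path
var raw = '/en-us/unsubscribe?token=abc&contact=jane.doe%40example.com';
var cleaned = maskEmails(stripLocale(raw), 'email_removed');

console.log(cleaned); // "/unsubscribe?token=abc&contact=email_removed"
```

Note that the email step acts as a safety net: the address is masked even though `contact` is not on any known-offender list.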

The variable setup that handles these basics is provided below. It should give you a solid starting point for meeting the PII requirements of both your users and your analytics platform. This code is the JS variable used for scrubbing the page path.


function() {
  function pagePathScrubber(pagePath, piiQueryParams, emailRemovedText) {
    // Remove a leading language segment such as /en-us (case-insensitive).
    // Change the regex to match the language setup of your site.
    // The lookahead keeps paths like /my-account from being truncated.
    var removeLangPagePath = function(pagePath) {
      var newPath = pagePath.replace(/^\/[a-z]{2}-[a-z]{2}(?=\/|\?|$)/i, '');
      // If only the language was in the path, return "/"
      if (!newPath) return '/';
      return newPath;
    };

    // Look for anything resembling an email address and replace it
    var removeEmailPagePath = function(pagePath, emailRemovedText) {
      return pagePath.replace(/([a-zA-Z0-9\.\+_-`~!#\$%\^&*\(\)]+(@|%40|%2540)[a-zA-Z0-9\.\+_-`~!#\$%\^&*\(\)]+\.[a-zA-Z0-9\.\+_-`~!#\$%\^*\(\)]+)/gi, emailRemovedText);
    };

    var removePiiQueryParams = function(query, pii) {
      var isEmail = function(email) {
        var re = /(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))/;
        return re.test(email);
      };

      // If there is no query, return
      if (query === "") {
        return "";
      }

      var queryArray = query.split(/[&;]/);
      var resultArray = [];

      // Check each parameter for PII names and email addresses
      for (var i = 0; i < queryArray.length; i++) {
        var pair = queryArray[i];
        var eqIndex = pair.indexOf("=");
        var name = eqIndex === -1 ? pair : pair.substring(0, eqIndex);
        var value = eqIndex === -1 ? "" : pair.substring(eqIndex + 1);
        var include = !isEmail(name) && !isEmail(value);
        // Check all PII names each time in case there are repeat values
        for (var k = 0; k < pii.length; k++) {
          if (name.toLowerCase() === pii[k].toLowerCase()) include = false;
        }
        // Keep the original pair so parameters without a value stay intact
        if (include && pair !== "") resultArray.push(pair);
      }

      // Recompose the query; return nothing if every parameter was removed
      return resultArray.length ? "?" + resultArray.join("&") : "";
    };

    var cleanedPagePath = removeLangPagePath(pagePath);
    var urlSplit = cleanedPagePath.split('?');
    var baseUrl = urlSplit[0];
    var qp = urlSplit.length > 1 ? cleanedPagePath.replace(baseUrl, '').substring(1) : "";
    var cleanedQp = removePiiQueryParams(qp, piiQueryParams);
    cleanedPagePath = baseUrl + cleanedQp;
    cleanedPagePath = removeEmailPagePath(cleanedPagePath, emailRemovedText);
    return cleanedPagePath;
  }

  // List all query parameters to remove from the URL here
  var piiParams = ['firstname', 'lastname', 'nickname', 'address', 'gender', 'p', 'e', 'profileurl', 'email', 'pwd', 'fname', 'lname', 'user'];
  // Text string to replace any email address found
  var emailRemovedText = "email_removed";
  // {{PagePath}} must include the query string for the parameter scrubbing to have any effect
  return pagePathScrubber({{PagePath}}, piiParams, emailRemovedText);
}