Customer-centric companies like Vista are in one long wrestling match – sweating over how to protect personal data and explore analytics in a way that’s good for everyone. Add in the exponential nature of data mesh architecture and an ever-changing data world, and balancing innovation with privacy protocols requires fancy footwork. DnA’s Jannik Podlesny, Principal for Data Governance, Architecture and Technology, and Andrew Graziani, VP, Chief Security & Privacy Officer, reflect on where the Vista team stands.
A shared problem
Startups and tech players are focused on fresh tooling around data privacy and anonymization. Yet, as far as we know, there isn’t an established solution that ensures scalability and resilience. And no one is immune to the issue. We’ve seen the impact of privacy compromises across multiple industries, with the exposure of tens of thousands of health records from a clinical laboratory network and the de-anonymization of search history resulting in a class-action lawsuit against a sizeable internet company. Users trust us with their data, and Vista dedicates huge resources to mitigating any potential risk of exposure. It’s what we owe our customers, and being trustworthy is part of who we are as a brand.
The reality of being first
Not many – if any – companies have fully implemented a large-scale data mesh structure. Vista is recognized as one of the first, which adds another dimension to the data privacy challenge. Here, we weigh the flexibility and autonomy of data product teams higher than anything else and love the mesh, but complexity comes with it. You wave goodbye to the ‘easier’ central governance afforded by the traditional, centralized Relational Database Management System (RDBMS), yet the need to meet the EU General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA) in the US remains. What replaces the RDBMS defines the kind of company you want to be, and data privacy and security protocols become the backbone.
Pulling up data lineage and scanning for personally identifiable information (PII) ceases to be straightforward. A decentralized structure comes with a highly fragmented data landscape: customer details and design preferences need to be found and evaluated. Harvard Professor Latanya Sweeney’s work long since demonstrated that a mix of date of birth, gender and ZIP code could uniquely identify 87% of the US population, so our vigilance has to be inexhaustible, given that data might be duplicated by autonomous domain teams. And PII could be processed as a side product of what Vista does best – seamless, customized design at scale for small businesses.
The nub of the problem is locating combinations of attributes that qualify as PII together, yet aren’t suspect alone. These are known as quasi-identifiers because, combined, they can re-identify individuals. Finding them gets really hard – NP-hard or W[2]-complete, which will excite fans of computational complexity theory. And then there’s the tricky issue of data deletion. Will that become impractical in a distributed mesh topology, where data circulates without a central gateway?
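To make the idea of a quasi-identifier concrete, here is a minimal brute-force sketch in Python. It is not Vista's scanner or the method from the papers cited in this article; the function names, the toy records and the `threshold` parameter are all assumptions made for illustration. It tests every combination of columns and reports the smallest ones that single out each record, which also shows why the search space grows exponentially: with n columns there are up to 2^n - 1 combinations to check.

```python
from collections import Counter
from itertools import combinations


def unique_fraction(rows, columns):
    """Share of rows whose values on `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


def find_quasi_identifiers(rows, columns, threshold=1.0):
    """Return minimal column combinations that single out at least
    `threshold` of the rows (hypothetical helper, brute force).

    Visits up to 2**len(columns) - 1 combinations, smallest first.
    """
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # A superset of a known quasi-identifier is not minimal.
            if any(set(q) <= set(combo) for q in found):
                continue
            if unique_fraction(rows, combo) >= threshold:
                found.append(combo)
    return found


# Toy records (invented): no single column identifies everyone.
rows = [
    {"age": 34, "gender": "f", "zip": "02139"},
    {"age": 34, "gender": "m", "zip": "02139"},
    {"age": 41, "gender": "f", "zip": "02139"},
    {"age": 41, "gender": "f", "zip": "10001"},
]

qids = find_quasi_identifiers(rows, ["age", "gender", "zip"])
print(qids)
```

In this toy data set no single column and no pair of columns singles out every record, but the three columns together do. At real scale, exhaustive enumeration like this quickly becomes infeasible, which is why smarter search strategies and hardware acceleration matter.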
Engineering a solution
If the central question is how to discover attributes that uniquely identify people, the first step is transparency – highlighting their existence in Vista’s data meshes. One option is to deploy a continuous PII scanner to search for hidden quasi-identifiers, label them correctly in the data catalogue, and proactively raise awareness of their presence. With this knowledge, our role-based access control (RBAC) paradigm grants access to data depending on what job you do – reducing the ‘blast radius’ if oversharing ever occurs, and enforcing high, zero-trust security standards with two-factor authentication (2FA).
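As a rough illustration of how catalogue labels and role-based access can work together, the sketch below filters readable columns by role and denies everything without 2FA. The label names, roles, grants and the `allowed_columns` helper are hypothetical and chosen only for this example; they do not describe Vista's actual catalogue or access model.

```python
from dataclasses import dataclass

# Hypothetical sensitivity labels a scanner might write to the catalogue.
PUBLIC, QUASI_IDENTIFIER, PII = "public", "quasi_identifier", "pii"

# Hypothetical mapping from role to the labels that role may read.
ROLE_GRANTS = {
    "analyst": {PUBLIC},
    "data_steward": {PUBLIC, QUASI_IDENTIFIER},
    "privacy_officer": {PUBLIC, QUASI_IDENTIFIER, PII},
}


@dataclass
class User:
    name: str
    role: str
    passed_2fa: bool = False


def allowed_columns(user, catalogue):
    """Columns the user may read. Deny by default: no 2FA or an
    unknown role yields an empty list."""
    if not user.passed_2fa:
        return []
    grants = ROLE_GRANTS.get(user.role, set())
    return [col for col, label in catalogue.items() if label in grants]


# Invented catalogue entry for one data product.
catalogue = {"order_total": PUBLIC, "zip": QUASI_IDENTIFIER, "email": PII}

analyst = User("ana", "analyst", passed_2fa=True)
steward = User("sam", "data_steward", passed_2fa=True)
print(allowed_columns(analyst, catalogue))
print(allowed_columns(steward, catalogue))
```

The point of the design is that access decisions depend on the labels, so a column newly flagged as a quasi-identifier is restricted as soon as the catalogue is updated, without touching per-user rules.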
Step two is reducing the amount of information we hold. Do we need this data point for analytical purposes? To recommend a new personalized product, for instance? If it’s not valuable, let’s delete or descope it – archive it in a bunker and lock the door.
The exponential growth of finding privacy exposing quasi-identifiers
Source: Nikolai Jannik Podlesny, Anne VDM Kayem, and Christoph Meinel. “Parallel Quasi-identifier Discovery Scheme for Dependable Data Anonymisation.” Transactions on Large-Scale Data- and Knowledge-Centered Systems L. Springer, Berlin, Heidelberg, 2021. 1-24.
Benchmarking execution time of different anonymization techniques to remove quasi-identifiers
Source: Nikolai Jannik Podlesny, Anne VDM Kayem, and Christoph Meinel. “Attribute compartmentation and greedy UCC discovery for high-dimensional data anonymization.” Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy. 2019.
Creating new standards
Data privacy is a hot industry topic. But a global definition of PII remains elusive in the face of tightening, varied international regulations. We address the different legal requirements by seeking the best creative solutions that help us take a balanced approach across the board.
Compliance isn’t the only driver – we want to go beyond all existing standards. We initiate reviews, anonymization, or data removal by exploring scalable techniques with state-of-the-art solutions like GPU-accelerated computing to discover quasi-identifiers that become PII in our data meshes. Why? Customers trust us with many personal details and images, and it’s a privilege. We return their trust by operating at the highest ethical standards to protect their information.
Speeding up data-intensive applications: Computer Processing Units (CPU) versus Graphics Processing Units (GPU)
Source: Nikolai Jannik Podlesny, Stephen Simpson, and Henning Soller, “The business case for using GPUs to accelerate analytics processing”, McKinsey & Company Tech:Forward, 2020
Nikolai Jannik Podlesny, Anne VDM Kayem, and Christoph Meinel. “GPU Accelerated Bayesian Inference for Quasi-Identifier Discovery in High-Dimensional Data.” International Conference on Advanced Information Networking and Applications. Springer, Cham, 2021
The new world
Data privacy is a legal, business, math or engineering challenge, depending on who you talk to. For Vista, it’s simple. Distributed data is the foundation of all we do, so investing in customer privacy on a large scale is a natural choice. We employ PII scanners to frequently check more than 100 TB of data, but there isn’t one solution you can quickly roll out or neatly explain – yet. As more companies consider mesh architecture, we hope sharing our journey will help other teams make the leap.