In a landscape where data security is becoming more and more important, it’s worth taking a hard look at what data needs to live where. Storing sensitive data, like PII, comes with a legal responsibility to keep that data safe, and the risk of heavy legal ramifications from a leak grows with every additional record you store. While storing some sensitive data is a necessary evil for many systems and platforms, a little creative, outside-the-box thinking can reduce the risk by reducing the need to store that data in the first place.
Just because you need to display someone’s address doesn’t always mean you need to store it.
For white label apps and platforms, the integration points to other data stores and lakes can turn out to be a benefit. If the sensitive data is accessible via a robust API, it can be retrieved and translated on the fly into whatever data structure is needed. It will never be as fast as owning the data outright, but as long as you design a good user experience around displaying it, users will never notice that you don’t actually hold their data.
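As a rough sketch of what that looks like in practice, here is how a display layer might pull an address from a partner API and reshape it at request time. The endpoint, field names, and auth handling below are placeholders, not a real integration:

```typescript
// Hypothetical: retrieve an address from a partner API at display time
// instead of storing it. Endpoint, fields, and token handling are placeholders.

// Shape the partner API is assumed to return.
interface PartnerAddressResponse {
  line_1: string;
  line_2?: string;
  city: string;
  region: string;
  postal_code: string;
}

// Shape our UI layer expects.
interface DisplayAddress {
  street: string;
  cityLine: string;
}

async function fetchDisplayAddress(userId: string, token: string): Promise<DisplayAddress> {
  const res = await fetch(`https://api.partner.example/v1/users/${userId}/address`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) {
    throw new Error(`Address lookup failed: ${res.status}`);
  }
  const raw = (await res.json()) as PartnerAddressResponse;

  // Translate on the fly into the structure the UI needs; nothing is persisted.
  return {
    street: [raw.line_1, raw.line_2].filter(Boolean).join(", "),
    cityLine: `${raw.city}, ${raw.region} ${raw.postal_code}`,
  };
}
```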
No matter what tools do the heavy lifting of retrieving and transforming data, the most important piece is a unified structure for the data expected from each data source. These schemas are the blueprints that not only keep internal services aligned, but also confirm with each data source that it can provide the data needed to deliver a quality user experience.
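One way such a blueprint might look in code is a single interface plus a runtime check that every data source’s response passes through at the boundary. The field names here are illustrative, not a real contract:

```typescript
// Illustrative schema: one interface every data source must satisfy.
interface AddressRecord {
  street: string;
  city: string;
  region: string;
  postalCode: string;
  countryCode: string; // ISO 3166-1 alpha-2
}

// Runtime validation so a misbehaving data source fails loudly at the boundary
// instead of degrading the user experience downstream.
function isAddressRecord(value: unknown): value is AddressRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.street === "string" &&
    typeof v.city === "string" &&
    typeof v.region === "string" &&
    typeof v.postalCode === "string" &&
    typeof v.countryCode === "string" &&
    v.countryCode.length === 2
  );
}
```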
These contracts should not only exist in a form your developers can build deeply into their systems, but also in a form that non-technical people can understand and use. Everyone on both sides should fully understand what data is being delivered, how it’s being delivered, and the security concerns around it.
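The same contract can also live as plain data that renders cleanly into documentation for non-technical stakeholders while still being consumed by code. A hypothetical example, with illustrative fields and security notes:

```typescript
// Hypothetical contract expressed as data: readable by people, usable by tooling.
const addressContract = {
  name: "UserAddress",
  transport: "HTTPS, read-only, bearer-token auth",
  retention: "Never persisted server-side; retrieved for display only",
  fields: [
    { name: "street",      type: "string", description: "Street address, including unit" },
    { name: "city",        type: "string", description: "City or locality" },
    { name: "region",      type: "string", description: "State, province, or region" },
    { name: "postalCode",  type: "string", description: "Postal or ZIP code" },
    { name: "countryCode", type: "string", description: "ISO 3166-1 alpha-2 country code" },
  ],
} as const;
```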
Systems with this level of access to sensitive information should strive to be as secure as possible. This sounds like a no-brainer, but in the heat of development people often forget this simple truth. If the APIs used to access the data are full Create, Read, Update, Delete (CRUD), there should be a way to lock functionality down to read-only. Every connection should be encrypted, and there should be an auditable log of access. The more you can limit access or narrow the scope of data, the better. Be wary of global access.
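A minimal sketch of that idea: wrap a full CRUD client in a read-only surface and log every access before it happens. The client and logger below are stand-ins, not a specific library:

```typescript
// Hypothetical full CRUD client for the external data source.
interface CrudClient {
  create(path: string, body: unknown): Promise<unknown>;
  read(path: string): Promise<unknown>;
  update(path: string, body: unknown): Promise<unknown>;
  delete(path: string): Promise<unknown>;
}

// The only surface the rest of the system ever sees.
interface ReadOnlyClient {
  read(path: string): Promise<unknown>;
}

function makeReadOnly(client: CrudClient, audit: (entry: string) => void): ReadOnlyClient {
  return {
    async read(path: string) {
      // Every access is recorded, leaving an auditable trail while the
      // write operations are simply never exposed.
      audit(`${new Date().toISOString()} READ ${path}`);
      return client.read(path);
    },
  };
}
```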
The easiest way to keep this system fast is to limit the number of times you have to use it. The data that works best is data that rarely changes and only needs to be retrieved zero or one times per user session. Local caching on users’ machines is paramount here, since any cloud caching of this data would amount to owning it (and negate the benefits). It’s worth noting that, beyond the performance of the system you are integrating with, the over-the-wire time of simply sending the data adds latency. It won’t be as fast as owning and managing the data yourself, and if that matters, it may be worth taking on the extra risk that comes with owning the data.
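A small sketch of that once-per-session pattern, assuming a browser environment where sessionStorage is available and the caller supplies the fetcher:

```typescript
// Cache rarely-changing data on the user's machine for the length of a
// session, so the remote lookup happens at most once per session.
async function oncePerSession<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const cached = sessionStorage.getItem(key);
  if (cached !== null) {
    return JSON.parse(cached) as T; // already fetched this session
  }
  const fresh = await fetcher();
  sessionStorage.setItem(key, JSON.stringify(fresh)); // stays on the user's machine
  return fresh;
}

// Usage, reusing the earlier (hypothetical) fetchDisplayAddress sketch:
// const address = await oncePerSession(`address:${userId}`, () =>
//   fetchDisplayAddress(userId, token)
// );
```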
An Example Flow