Safe Data Framework, a pragmatic approach for dealing with sensitive data in a BI environment

posted 14.04.2016

Sensitive data within organisations

Dealing with sensitive data is a challenge in most BI environments. There are different types of sensitive data. For communication purposes, we make a distinction between business sensitive data and compliance sensitive data.
Business sensitive data is categorised as sensitive by the owners of the data. Examples are financial data or production metrics.
Compliance sensitive data is subject to legislation. An example is any data concerning privacy.

The challenge in your BI effort

Conforming to all the rules and regulations is a challenge for every BI team. Often, legislation does not translate directly into a solution or design. Use of data is often permitted or restricted only within a certain context. Legislation is not always consistent either; it sometimes contradicts itself, which does not help in creating clear guidelines.

Privacy related data is sometimes used as an integration point. For instance, patient data in hospitals is often the only common element in otherwise unrelated hospital systems. Once data is integrated and validated, most of the usage in a data warehouse is on the aggregate of the integrated data set. The privacy related data itself is not queried or requested by users. An example is the sick leave percentage rate per department, per function group, per period. Who is on sick leave is of no concern, but the individual records are needed to create the aggregated information.

A confusing conversation

Separation of concerns also complicates the implementation of guidelines. Security Officers, Privacy Officers and Legal departments have the responsibility to create policies on security and data. They also have the obligation to ensure compliance with these policies.
Most legal officers are not technologically savvy. Information security policies apply to all sorts of IT systems and are drawn up in generic terms.
Business intelligence developers are not trained to think in a legal way and have a hard time translating the generic policies to the specific challenges they face in the BI domain. Conversations between legal officers and BI developers about pragmatic technical guidelines often feel like a Tower of Babel discussion. We try to understand each other, but our professional worlds seem to spin in different galaxies.

Not all is equal

To make matters worse, there are degrees of sensitivity. Privacy related data could be regarded as more sensitive than quarterly revenue figures. Often, this distinction is not expressed clearly in the design of the data warehouse. As a result of the Tower of Babel discussions, you find yourself applying the strictest security to all data. The resulting maintenance and development effort on the data warehouse is harder and more expensive than needed.
As a result, users feel it is a hassle to get the data they need from the BI environment. They will try to find other ways to get their information. In fact, they will do their utmost to circumvent the security measures to get the data, thus rendering those measures ineffective.

Over time, views on privacy change

Over time, laws, the interpretation of laws and the organisational attitude towards privacy related issues can and will change.

How to deal with all of this, without further complicating the maintenance, changeability or accessibility of the data warehouse and without incurring huge operational costs?

Introducing The Safe Data Framework

A method that helps you deal with this complicated matter in a very simple way is the Safe Data Framework. It offers you a way to move along with changes in policies while maintaining a single implementation in the data warehouse. The Safe Data Framework enables you to create auditable examples of problem areas that anyone can comprehend. It opens up the possibility of a clear conversation between all parties involved.

Does this sound like a magic trick that evaporates all those hard problems? It isn’t. You still need to develop your data models and ETL, but it is a pragmatic approach that addresses the needs of the BI developer, the legal officers and the users of the data at the same time.

Disclaimer: this is not the definitive answer to all your security and privacy concerns. It is a proposal for a workable solution. Your legal officers still need to determine if this proposed solution is sufficient to meet the requirements set by their policies.

How does the Safe Data Framework help you?

  • Easy to implement and maintain in the data warehouse;
  • Offering methods to screen sensitive data from users;
  • Low impact when changing the selection of sensitive data items or method to screen data, even when being applied in retrospect;
  • Enabling constructive communication between legal officers and developers through tangible examples.

Easy to implement and maintain in the data warehouse

The approach is to create two containers with data in the data warehouse:

  • A secured data container;
  • A general available data container.

The Safe Data Framework depicted in an image

The general available data container offers access to the data warehouse with a uniform access model and security policy. In the database, an all-or-nothing access policy is applied to the general available container. Additional, finer-grained access rights can be granted within a BI tool that accesses the data in the general available container.

The secured data container locks down the data for user access through BI tools, but the data remains available for processing. Access to this container is kept to a minimum and all access is system based or name based. For instance, ETL tooling has a system account that can read, process and write the data from and to the container. Name based accounts for developers, maintenance and operations personnel, such as DBAs, are available for regular development, maintenance and troubleshooting operations.

These name based maintenance accounts are governed by additional policies to which people have to comply and by audit trails that log all access and usage. If unavoidable, users who have clearance to use the actual data can get name-based accounts to query this data. This is standard policy for most organisations, so adopting this approach will not meet much resistance.

The secured container can be implemented as a separate schema in the same database, a separate database or even as a database on a separate server. The appropriate choice is determined by the security policies of the organisation, each has its own challenges.

Each table of the data model is assigned to either the secured or the general available container. For each table, an assessment is made whether it contains data classified as sensitive. This assessment process is a conversation between legal officers and developers, guided by the data security policies. Tables with one or more columns classified as sensitive are stored in the secured container. Tables that do not contain any data classified as sensitive are stored in the general available container.
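This assignment rule can be sketched in a few lines. The table and column names below are hypothetical; the classification itself is the outcome of the conversation between legal officers and developers, not something the code decides.

```python
# Hypothetical classification result: per table, the set of columns that
# legal officers and developers agreed to classify as sensitive.
sensitive_columns = {
    "patient":    {"name", "postal_code", "birth_date"},
    "department": set(),                 # no sensitive columns
    "sick_leave": {"patient_id"},
}

def assign_container(table: str) -> str:
    """A table with one or more sensitive columns goes to the secured
    container; all other tables go to the general available container."""
    return "secured" if sensitive_columns.get(table) else "general"

assignment = {table: assign_container(table) for table in sensitive_columns}
print(assignment)
# {'patient': 'secured', 'department': 'general', 'sick_leave': 'secured'}
```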

It is not recommended to split the secured attributes from the generally accessible attributes. In general, let the business context and data modelling method determine the data model, not the security context.

Try to keep the number of tables stored in the secured container to the minimal needed set.

Offering methods to screen sensitive data from users

The next step is the crucial part of the Safe Data Framework. Replicate the structure of each table in the secured container to the general available container: same table name and same number of columns. Next, the data is replicated, applying one of the screening methods outlined below. These methods apply only to columns classified as sensitive; all other data can be copied over without modification.

For each screening method, the format of the data after the rule has been applied and the consequences of using that rule are outlined below.

Leave it blank
Data is not copied over at all. The column contains a NULL value.
Example: the postal code ‘1234 AB’ becomes NULL.
Consequences: There is no data to use for aggregation or filtering in reports or analysis. Users unaware of the application of the rule might think there is an error in the data.

Obfuscating
Data is stored in an obfuscated way, for instance by replacing the value with four asterisk symbols.
Example: the postal code ‘1234 AB’ becomes ‘****’.
Consequences: Aggregation on this column will lead to one grand total of the measure against the value ‘****’. The column becomes useless for filtering the data. The format of the data clearly shows the user that the data is screened.

Pseudonyms
Actual or generated data is used to substitute the real values. In most applications of pseudonyms, actual values in a column are shuffled in random order over the rows.
Example: the postal code ‘1234 AB’ becomes ‘5678 YZ’.
Consequences: Users who are unaware of the pseudonyms and take the value for truth might jump to the wrong conclusions, because the value looks ‘real’. This method is not recommended for production systems, but it is often applied for acceptance testing of applications.

Aggregation
Data is truncated or aggregated to an allowed level of visibility.
Example: the postal code ‘1234 AB’ becomes ‘12’.
Consequences: Filtering on these values is crude, and aggregates are not fine grained. This method is often applied when, for instance, the age of a person is represented as an age range. It enables users to derive useful information from reports and analysis results within the limitations of use.
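The four screening methods above can be sketched as simple transformations. This is a minimal illustration, not a production implementation: the function names and the two-character truncation level for aggregation are assumptions for the example.

```python
import random

def screen(value: str, method: str):
    """Apply one of the per-value screening methods to a sensitive value."""
    if method == "blank":        # leave it blank: the column becomes NULL
        return None
    if method == "obfuscate":    # obfuscating: fixed placeholder value
        return "****"
    if method == "aggregate":    # aggregation: truncate to an allowed level
        return value[:2]         # e.g. postal code region only (assumption)
    raise ValueError(f"unknown screening method: {method}")

def pseudonymise(column: list) -> list:
    """Pseudonyms work column-wide: the actual values are shuffled in
    random order over the rows, so each value still looks 'real'."""
    shuffled = column[:]
    random.shuffle(shuffled)
    return shuffled

postal_code = "1234 AB"
print(screen(postal_code, "blank"))      # None
print(screen(postal_code, "obfuscate"))  # ****
print(screen(postal_code, "aggregate"))  # 12
```

Note that pseudonymisation is the only method here that needs the whole column at once; the other three operate value by value, which keeps the ETL flow trivial.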

Why do it this way? The granularity of your data is not compromised, which means all calculations are still valid on all other data. Individual records can be retrieved and inspected by users, which makes the implications of the application of data security policies very easy to communicate. Also, since all records are represented it is still possible for users to do reconciliations or other comparisons on the general available data items. Like most solutions, doing it the simple way makes your life a lot easier.

Low impact when changing the selection of sensitive data items or method to screen data, even when being applied in retrospect

Laws and perceptions on using privacy sensitive data change over time. It takes a while before these changes are reflected in new policies and new data governance rules. We have experienced more than once that new rules had to be applied retroactively from some date in the past. Using the outlined methodology, it is easy to apply these rules in retrospect, as long as you have the historical original data available in the secured container. Even if you have to move a table from the general available container to the secured container, the impact is relatively low. The structure of the table in the general available container won’t change, just its contents. As a result, the impact is localised to one or a few ETL flows. The flexibility in changing existing rules or applying new security rules is high, while the implementation is straightforward.
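Applying a changed rule in retrospect then amounts to rebuilding the general available table from the original rows kept in the secured container. A minimal sketch, assuming hypothetical column names and screening rules expressed as per-value functions:

```python
def rebuild_general(secured_rows: list, screening_rules: dict) -> list:
    """Re-derive the general available table from the secured original,
    applying the current screening rule to each sensitive column and
    copying all other columns over unchanged."""
    general_rows = []
    for row in secured_rows:
        screened = dict(row)  # non-sensitive columns pass through as-is
        for column, rule in screening_rules.items():
            screened[column] = rule(screened[column])
        general_rows.append(screened)
    return general_rows

secured = [{"postal_code": "1234 AB", "sick_days": 3}]

# Old policy: obfuscate. New policy, applied in retrospect: truncate.
old = rebuild_general(secured, {"postal_code": lambda v: "****"})
new = rebuild_general(secured, {"postal_code": lambda v: v[:2]})
print(old)  # [{'postal_code': '****', 'sick_days': 3}]
print(new)  # [{'postal_code': '12', 'sick_days': 3}]
```

Because the table structure in the general available container never changes, only this one rebuild flow is touched when a rule changes.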

The setup in containers enables you to define different retention policies for the data in each container and thus comply with legal obligations. You are often allowed to keep anonymous or aggregated data for a longer period than the original records. Retention policies limit the application of rules in retrospect, of course: once the source data is deleted, it is hard to apply different rules to it.

Enabling constructive communication between legal officers and developers through tangible examples

The flexibility in implementation helps the conversation between users, developers and legal officers during the development process. It is very easy to create a prototype of a report with the right security policies applied. Sit together, discuss how the policies should be interpreted, and point out the limitations in using the data due to the policies applied.

Putting it into practice

Having a solution that addresses the persistent issue of sensitive data, and being able to deal with its effects in an adaptable way, alleviates the stress felt by BI developers. But where do you start?

The best advice is to put the technical framework with containers in place first. Even when the discussion about privacy is still diffuse, it is easy to identify the tables that contain sensitive data with regard to privacy. Move those tables to the secured container and copy the data over to the mirrored table in the general available container without applying any screening methods. This won’t impact your existing reports.
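This first step is a pass-through copy: the mirrored table gets the same structure and, for now, the same values. A sketch of that degenerate case, with hypothetical column names:

```python
def copy_through(secured_rows: list) -> list:
    """Initial step: mirror the secured table into the general available
    container with no screening applied yet. Swapping in real screening
    rules later is a local change to this one flow."""
    return [dict(row) for row in secured_rows]

secured = [{"postal_code": "1234 AB", "sick_days": 3}]
general = copy_through(secured)
assert general == secured  # identical values: existing reports are unaffected
```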

Structure the discussion by creating examples with different screening methods applied. This will help everyone. As the governance on sensitive data matures within your organisation, you can adjust the data stored in the secured container and arrive at a balanced set of screening methods applied to your data.