MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
Explore the new MySQL Enterprise Data Masking and De-Identification Features

Overview

Data privacy is an important issue and a core requirement for regulatory compliance. Its required by:

  • EU General Data Protection Directive (GDPR).
  • US Health Insurance Portability and Accountability Act (HIPAA).
  • UK Data Protection Act.
  • Sarbanes Oxley (SOX).
  • Family Educational Rights and Privacy Act (FERPA).
  • And many more.

Data masking and de-identification enable the delivery of useful information or assure proper data presentation without violating privacy. MySQL Enterprise Edition provides the following capabilities to deal with the problem:

Masking parts of text by replacing meaningful characters by placeholders.
For instance, payment applications may present the Primary Account Number (PAN) of a credit card with the last four digits replaced by ‘XXXX’. This is enough for the end user to identify his card on the list without revealing the whole number. There are general-purpose masking functions that mask some parts of the input string as specified by the caller and special-purpose ones that use well defined masking patterns on well defined data formats like PAN.

Generation of random data
Data types such as phone numbers, email addresses or bank account numbers can be protected using random data. This generated data is provided with the correct format and observes restrictions like control digits if applicable. Those functions may be used to generate test data sets for developers, application testers, or analysts.

Random dictionary substitution
Enables DBAs to substitute actual data with a list of “dictionary” terms. This can provide realistic content in an end user application without disclosing the real data when the data domain is a finite set (e.g., city names).

The masking functions work on server side next to the data, versus client side which risks exposing unmasked data on the client.

Prior to 8.0.33, MySQL enabled masking and de-identification capabilities using a MySQL server plugin. We have transitioned the masking capabilities to use the more modern component infrastructure. The new approach extends and improves the functionality of the plugin in a few ways:

  • Support for all MySQL character sets in general-purpose functions.
  • Additional special-purpose masking and random data generation functions.
  • Enhancement of some random data generation functions with additional parameters.
  • Dictionaries are persisted in the database and not in files.
  • Modification of dictionaries is possible with admin functions secured by a special privilege.

Installation

Prerequisites

If the legacy plugin was in use, it must be uninstalled prior to installing the components and all its functions must be dropped:

UNINSTALL PLUGIN data_masking;
DROP FUNCTION gen_blocklist;
DROP FUNCTION gen_dictionary;
DROP FUNCTION gen_dictionary_drop;
DROP FUNCTION gen_dictionary_load;
DROP FUNCTION gen_range;
DROP FUNCTION gen_rnd_email;
DROP FUNCTION gen_rnd_pan;
DROP FUNCTION gen_rnd_ssn;
DROP FUNCTION gen_rnd_us_phone;
DROP FUNCTION mask_inner;
DROP FUNCTION mask_outer;
DROP FUNCTION mask_pan;
DROP FUNCTION mask_pan_relaxed;
DROP FUNCTION mask_ssn;

Components

To install the masking components, use the following statements:

INSTALL COMPONENT 'file://component_masking';
INSTALL COMPONENT 'file://component_masking_functions';

The component_masking component implements the core data masking functionality. The component_masking_functions calls the component_masking APIs to provide loadable functions to be used in SQL statements.

Masking Dictionaries table

In order to enable dictionary replacement functionality, the masking_dictionaries table must be created:

CREATE TABLE IF NOT EXISTS
mysql.masking_dictionaries(
    Dictionary VARCHAR(256) NOT NULL,
    Term VARCHAR(256) NOT NULL,
    UNIQUE INDEX dictionary_term_idx (Dictionary, Term),
    INDEX dictionary_idx (Dictionary)
) ENGINE = InnoDB DEFAULT CHARSET=utf8mb4;

That table will be used as a persistent storage for dictionary terms.

Usage

General-Purpose Masking Functions

The functions mask_inner() and mask_outer() mask internal or outer parts of arbitrary strings. The first argument of the functions is the string to be masked, the following two arguments are the sizes of left and right margin and the last one (optional) is a masking character. It is worth noticing, that these functions (opposite to their siblings provided by the legacy data masking and de-identification plugin) deal correctly with strings of single byte and multibyte encoded characters. Examples:

mysql> SELECT mask_inner('Romeo and Juliet', 6, 7);
+--------------------------------------+
| mask_inner('Romeo and Juliet', 6, 7) |
+--------------------------------------+
| Romeo XXX Juliet                     |
+--------------------------------------+

mysql> SELECT mask_inner('源氏物語', 1, 1, '*');
+---------------------------------------+
| mask_inner('源氏物語', 1, 1, '*')     |
+---------------------------------------+
| 源**語                                |
+---------------------------------------+

mysql> SELECT mask_outer('Под игото', 2, 4, 'т');
+---------------------------------------------+
| mask_outer('Под игото', 2, 4, 'т')          |
+---------------------------------------------+
| ттд итттт                                   |
+---------------------------------------------+

Special-Purpose Masking Functions

There are a number of special data types related to ids, bank card or account numbers, insurance numbers, etc. They usually have one or few acceptable presentation formats and masking patterns. While such data may be obviously masked using the general-purpose functions, a number of special-purpose functions were implemented to make it simpler and ensure correctness.

For example, let’s consider UK National Insurance Numbers (NIN). They consist of two letters, six numbers and one number. The presentation format of NIN may differ, but the mask_uk_nin() function accepts different formats:

mysql> select mask_uk_nin('QQ123456C') AS '#########', mask_uk_nin('QQ 123456 C') AS '## ###### #', mask_uk_nin('QQ 12 34 56 C') AS '## ## ## ## #', mask_uk_nin('QQ-12-34-56-C') AS '##-##-##-##-#';
+-----------+-------------+---------------+---------------+
| ######### | ## ###### # | ## ## ## ## # | ##-##-##-##-# |
+-----------+-------------+---------------+---------------+
| QQ******* | QQ ****** * | QQ ** ** ** * | QQ-**-**-**-* |
+-----------+-------------+---------------+---------------+

Still, the function protects against unwanted information disclosure in case the input string is malformed, for example: 

mysql> select mask_uk_nin('Q-Q 1234 56 C') AS 'bad 1', mask_uk_nin('Q Q123456C') AS 'bad 2', mask_uk_nin(' QQ123456C') AS 'bad 3';
+-------+-------+-------+
| bad 1 | bad 2 | bad 3 |
+-------+-------+-------+
| NULL  | NULL  | NULL  |
+-------+-------+-------+

Generating Random Data with Specific Characteristics

Functions that generate random values may be used for testing or simulation purposes. Let’s look at a few of them in action:

mysql> SELECT gen_rnd_email(),
              gen_rnd_email(5, 7, 'testdomain.com'),
              gen_rnd_iban('XX', 24),
              gen_rnd_uuid()\G
*************************** 1. row ***************************
                      gen_rnd_email(): atbpl.fesiyip@example.com
gen_rnd_email(5, 7, 'testdomain.com'): hfhjr.pspvhbj@testdomain.com
               gen_rnd_iban('XX', 24): XX96 M1UQ D1TS ELYR J5D5 31RW
                       gen_rnd_uuid(): 66540665-7879-1978-5100-019943361126

One must be careful when using these functions as sometimes they may output real-world values, which we don’t want. This can be mitigated in a few ways. For example:

  • A non-existing domain may be passed to gen_rnd_email().
  • gen_rnd_iban() may take an invalid country code (first 2 characters, ‘ZZ’ by default).
  • The range of possible UUIDs is very large, so the probability that  gen_rnd_uuid() matches a real value is relatively small.

Even though the functions produce non-real values, the values are formally correct: an email follows the email format and IBAN would be positively verified with Luhn algorithm, etc.

Generating Random Data Using Dictionaries

A masking dictionary is just a set of text strings called terms. Functions that select random terms obviously use previously created and filled dictionaries. To modify dictionaries, a special privilege is needed, so we should start with granting it to some user:

mysql> GRANT MASKING_DICTIONARIES_ADMIN ON *.* TO dict_admin;

Then dict_admin may add some German city names to 'DE_Cities' dictionary and some Japan cities to 'JP_Cities' dictionary:

mysql> SELECT masking_dictionary_term_add('DE_Cities', 'Berlin'),
              masking_dictionary_term_add('DE_Cities', 'Hamburg'),
              masking_dictionary_term_add('DE_Cities', 'München'),
              masking_dictionary_term_add('DE_Cities', 'Köln');
mysql> SELECT masking_dictionary_term_add('JP_Cities', '東京'),
              masking_dictionary_term_add('JP_Cities', '横浜市'),
              masking_dictionary_term_add('JP_Cities', '大阪市'),
              masking_dictionary_term_add('JP_Cities', '名古屋市');

Now any user may select some random terms from dictionary 'DE_Cities':

mysql> SELECT gen_dictionary('DE_Cities'), gen_dictionary('DE_Cities');
+-----------------------------+-----------------------------+
| gen_dictionary('DE_Cities') | gen_dictionary('DE_Cities') |
+-----------------------------+-----------------------------+
| Köln                        | Berlin                      |
+-----------------------------+-----------------------------+

In another scenario, it may be required to substitute some German city name by a Japanese city name, but only if the first one exists in the dictionary:

mysql> SELECT gen_blocklist('München', 'DE_Cities', 'JP_Cities') AS existing, gen_blocklist('Bonn', 'DE_Cities', 'JP_Cities') AS nonexisting;

+-----------+-------------+
| existing  | nonexisting |
+-----------+-------------+
| 大阪市    | Bonn        |
+-----------+-------------+

Creating Views that Display Masked Data

Creating a view limiting the data scope, and a user with limited privileges is an easy way to protect data privacy.
Let’s create a table with complete data and a view with limited data:

mysql> CREATE TABLE customer(
       id         BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
       first_name VARCHAR(40),
       last_name  VARCHAR(40),
       ssn        VARCHAR(11));
mysql> CREATE VIEW masked_customer AS SELECT
       id,
       mask_inner(first_name, 1, 0) AS first_name,
       mask_inner(last_name, 1, 0) AS last_name,
       mask_ssn(ssn) AS ssn
       FROM customer;

Create a user with SELECT privilege on the view:

mysql> CREATE USER app;
mysql> GRANT SELECT ON masked_customer TO app;

Populate the table with some dummy data:

mysql> INSERT INTO customer (first_name, last_name, ssn) VALUES
       ('Ann', 'White', gen_rnd_ssn()),
       ('John', 'Brown', gen_rnd_ssn()),
       ('Justin', 'Green', gen_rnd_ssn()),
       ('Melanie', 'Orange', gen_rnd_ssn());

Finally, login as app user to test which information may be accessed:

mysql> SELECT * FROM customer;
ERROR 1142 (42000): SELECT command denied to user 'app'@'localhost' for table 'customer'
mysql> SELECT * FROM masked_customer;
+----+------------+-----------+-------------+
| id | first_name | last_name | ssn         |
+----+------------+-----------+-------------+
|  3 | AXX        | WXXXX     | XXX-XX-6897 |
|  4 | JXXX       | BXXXX     | XXX-XX-9524 |
|  5 | JXXXXX     | GXXXX     | XXX-XX-8495 |
|  6 | MXXXXXX    | OXXXXX    | XXX-XX-6777 |
+----+------------+-----------+-------------+

Conclusion

Using the new MySQL Enterprise Data Masking and De-Identification featuresallow developers and DBAs to easily address regulatory and data privacy requirements. They can protect sensitive columns of information by providing functions that:

  • De-identifies the data:  Scrambling identifiers of individuals, also known as personally identifiable information (PII).  
  • Masks sensitive data:  Mask data that, if associated with personally identifiable information (PII), would cause privacy concerns.  Examples include compensation, health, and employment information.  
  • Maintains data validity:  Provide data that can be used by fully functional applications.

Additional Documentation