MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
Mask and De-identify Data in MySQL Enterprise Edition

MySQL Enterprise Edition has for years featured methods for masking and de-identifying data.  Let's explore, with examples, how to use them!

Agenda of the article

The Data Masking and De-Identification extensions of MySQL Enterprise Edition

Data masking refers to masking portions or all of the data by replacing some of its content with a masking character.  For example, the SSN 123-45-6789 can be thus masked: xxx-xx-6789, to only display the last 4 digits.  Data de-identification changes a value to a different value in a way that it is still useful to the application, for example by replacing a unique value with another unique value, but makes it impossible to know the original value.  As the name implies, the purpose de-identification is to prevent identification of private data.

Data masking and data de-identification functionality is provided as an extension of MySQL Enterprise Edition in the form of both a plugin and a component.  Plugins are the original mechanism for extending MySQL functionality.  They provide an API to extend various functions of the server.  For example, there is a plugin API type for authentication methods, one plugin API type for storage engines, and so on.  MySQL components are the new kid-on-the-block, introduced in MySQL 8.0.  The component framework extends the server functionality like plugins but offers much more functionality to the developers as well as it allows individual components to interact with each other, which plugins are unable to do.

The reason I bring this up is because MySQL Enterprise Edition has both a data masking and de-identification plugin and component installed.  And they both provide similar but different functionality!  The table below contains a list of differences between the functionality provided by the plugin and that provided by the component.  The database administrator must be careful to use only either the plugin or the component as only one of them can be enabled at a time.  Of the two, the component should be favored as it provides enhanced functionality and is actively developed, whereas the plugin is subject to removal in a future release (due and timely notice will be provided).

Function Plugin Component
Interface Loadable functions Service functions, loadable functions
Support for multibyte character sets No Yes for general-purpose masking functions
General-purpose masking functions mask_inner(), mask_outer() mask_inner(), mask_outer
Masking of specific types PAN, SSN PAN, SSN, IBAN, UUID, Canada SIN, UK NIN
Random generation of specific types Email, US phone, PAN, SSN Email, US phone, PAN, SSN, IBAN, UUID, Canada SIN, UK NIN
Random generation of integer from given ranges Yes Yes
Persisting substitution dictionaries File-based In database
Privilege to manage dictionaries FILE Privilege specific to component (MASKING_DICTIONARIES_ADMIN)

Automated loadable function registration and deregistration

during installation and uninstallation

No Yes
Enhancements to existing functions N/A More arguments to the function gen_rnd_email()

The steps to install the component are the following (note: it comes as two component files):

INSTALL COMPONENT 'file://component_masking';
INSTALL COMPONENT 'file://component_masking_functions';

To confirm that the component is installed:

mysql> SELECT * FROM `component`;
+--------------+--------------------+------------------------------------+
| component_id | component_group_id | component_urn                      |
+--------------+--------------------+------------------------------------+
| ...          | ...                | ...                                |
|            2 |                  2 | file://component_masking           |
|            3 |                  3 | file://component_masking_functions |
+--------------+--------------------+------------------------------------+

Example Dataset

This article contains several practical examples that use the following dataset:

First Name Last Name DOB SSN Phone Number Card Number
George Washington 1732-02-22 NULL (804)555-0101 4716958491933491
John Adams 1735-10-30 756-16-0134 (508)555-0102 5548026033569479
Thomas Jefferson 1743-04-13 647-42-4787 NULL 341605099040689
James Madison 1751-03-16 939-87-6230 (276)555-0104 6223050740560951
James Monroe 1758-04-28 615-50-9245 (571)555-0105 4916285199808116
John Adams 1767-07-11 677-54-5716 (617)555-0106 NULL
Andrew Jackson 1767-03-15 302-80-6308 (803)555-0107 370325552763685
Martin Van Buren 1782-12-05 133-85-9284 (329)555-0108 4024007112991004
William Harrison 1773-02-09 397-35-5070 (757)555-0109 5535714026624691
John Tyler 1790-03-29 037-67-9069 (804)555-0110 374077354201234

Here is the SQL to provision the above dataset:

CREATE SCHEMA IF NOT EXISTS `test`;
CREATE TABLE `test`.`persons`
(
  `id` INT PRIMARY KEY AUTO_INCREMENT,
  `First Name` VARCHAR(32) NOT NULL,
  `Last Name` VARCHAR(32) NULL,
  `DOB` DATE NOT NULL,
  `SSN` CHAR(11) NULL,
  `Phone Number` CHAR(13) NULL,
  `Card Number` VARCHAR(16) NULL
);

INSERT INTO `test`.`persons`
  ( `First Name`, `Last Name`,  `DOB`,        `SSN`,         `Phone Number`,  `Card Number`      )
VALUES
  ( 'George',     'Washington', '1732-02-22', NULL,          '(804)555-0101', '4716958491933491' ),
  ( 'John',       'Adams',      '1735-10-30', '756-16-0134', '(508)555-0102', '5548026033569479' ),
  ( 'Thomas',     'Jefferson',  '1743-04-13', '647-42-4787', NULL,            '341605099040689'  ),
  ( 'James',      'Madison',    '1751-03-16', '939-87-6230', '(276)555-0104', '6223050740560951' ),
  ( 'James',      'Monroe',     '1758-04-28', '615-50-9245', '(571)555-0105', '4916285199808116' ),
  ( 'John',       'Adams',      '1767-07-11', '677-54-5716', '(617)555-0106', NULL               ),
  ( 'Andrew',     'Jackson',    '1767-03-15', '302-80-6308', '(803)555-0107', '370325552763685'  ),
  ( 'Martin',     'Van Buren',  '1782-12-05', '133-85-9284', '(329)555-0108', '4024007112991004' ),
  ( 'William',    'Harrison',   '1773-02-09', '397-35-5070', '(757)555-0109', '5535714026624691' ),
  ( 'John',       'Tyler',      '1790-03-29', '037-67-9069', '(804)555-0110', '374077354201234'  );

Data Masking and De-Identification in SELECT statements

Below is an example using the data masking and de-identification functions.  These functions can be used anywhere that SQL functions may normally be used such as within WHERE clauses, projection lists, and elsewhere.  The documentation for these functions can be found here.

mysql> SELECT
    ->   `First Name`,
    ->   `Last Name`,
    ->   `DOB`,
    ->   mask_ssn(`SSN`,'x') AS `SSN`,
    ->   mask_inner(`Phone Number`,5,5,'*') AS `Phone Number`,
    ->   `Card Number`,
    ->   gen_rnd_pan(16) AS `Random Number`
    -> FROM
    ->   `test`.`persons`;
+------------+------------+------------+-------------+---------------+------------------+------------------+
| First Name | Last Name  | DOB        | SSN         | Phone         | Card Number      | Random Number    |
+------------+------------+------------+-------------+---------------+------------------+------------------+
| George     | Washington | 1732-02-22 | NULL        | (804)***-0101 | 4716958491933491 | 4701543887067000 |
| John       | Adams      | 1735-10-30 | xxx-xx-0134 | (508)***-0102 | 5548026033569479 | 4812011388774058 |
| Thomas     | Jefferson  | 1743-04-13 | xxx-xx-4787 | NULL          | 341605099040689  | 8206380111725769 |
| James      | Madison    | 1751-03-16 | xxx-xx-6230 | (276)***-0104 | 6223050740560951 | 0102261140601415 |
| James      | Monroe     | 1758-04-28 | xxx-xx-9245 | (571)***-0105 | 4916285199808116 | 6123363455567462 |
| John       | Adams      | 1767-07-11 | xxx-xx-5716 | (617)***-0106 | NULL             | 8357707102536143 |
| Andrew     | Jackson    | 1767-03-15 | xxx-xx-6308 | (803)***-0107 | 370325552763685  | 1202083866771074 |
| Martin     | Van Buren  | 1782-12-05 | xxx-xx-9284 | (329)***-0108 | 4024007112991004 | 0847031686483639 |
| William    | Harrison   | 1773-02-09 | xxx-xx-5070 | (757)***-0109 | 5535714026624691 | 6107678513862674 |
| John       | Tyler      | 1790-03-29 | xxx-xx-9069 | (804)***-0110 | 374077354201234  | 4335451458208116 |
+------------+------------+------------+-------------+---------------+------------------+------------------+
10 rows in set (0.00 sec)

The fields SSN and Phone Number are masked with specialized data masking functions, and random card numbers are generated that pass validation (Luhn).  Note how the original NULL values remain NULL when masked.  The randomly generated card number values, which are presented in a seperate column only to show that they are different from the original values, do not have NULL values because the generation function does not take into account the original values.  To account for them would require the use of a "CASE" clause in the projection list in the style of:

CASE
  WHEN `Card Number` IS NULL THEN NULL
  ELSE gen_rnd_pan(16)
END AS `Random Card`

SELECT INTO and CREATE TABLE AS SELECT

The SELECT ... INTO and CREATE TABLE ... AS SELECT statements work similarly as SELECT statements.  The data masking and de-identification functions can be used anywhere SQL functions may be used.

Here is an example of a CREATE TABLE ... AS SELECT statement.  To add value to the example, I have included the "CASE" clause that was mentioned above.

CREATE TABLE `test`.`persons2`
(
  `id` INT PRIMARY KEY AUTO_INCREMENT,
  `First Name` VARCHAR(32) NOT NULL,
  `Last Name` VARCHAR(32) NULL,
  `DOB` DATE NOT NULL,
  `SSN` CHAR(11) NULL,
  `Phone Number` CHAR(13) NULL,
  `Card Number` VARCHAR(16) NULL
) AS
SELECT
  `First Name`,
  `Last Name`,
  `DOB`,
  mask_ssn(`SSN`,'x') AS `SSN`,
  mask_inner(`Phone Number`,5,5,'*') AS `Phone Number`,
  CASE
    WHEN `Card Number` IS NULL THEN NULL
    ELSE gen_rnd_pan(16)
  END AS `Card Number`
FROM
  `test`.`persons`;

To confirm that the statement worked as expected:

mysql> SELECT
    ->   `Original`.`First Name`,
    ->   `Original`.`Last Name`,
    ->   `Original`.`DOB`,
    ->   `Original`.`SSN`,
    ->   `New`.`SSN` AS `New SSN`,
    ->   `Original`.`Phone Number`,
    ->   `New`.`Phone Number` AS `New Phone Number`,
    ->   `Original`.`Card Number`,
    ->   `New`.`Card Number` AS `New Card Number`
    -> FROM
    ->     `test`.`persons` `Original`
    ->   INNER JOIN
    ->     `test`.`persons2` `New`
    ->   ON
    ->     `Original`.`First Name` = `New`.`First Name` AND
    ->     `Original`.`Last Name` = `New`.`Last Name` AND
    ->     `Original`.`DOB` = `New`.`DOB`;
+------------+------------+------------+-------------+-------------+---------------+------------------+------------------+------------------+
| First Name | Last Name  | DOB        | SSN         | New SSN     | Phone Number  | New Phone Number | Card Number      | New Card Number  |
+------------+------------+------------+-------------+-------------+---------------+------------------+------------------+------------------+
| George     | Washington | 1732-02-22 | NULL        | NULL        | (804)555-0101 | (804)***-0101    | 4716958491933491 | 5343770563021230 |
| John       | Adams      | 1735-10-30 | 756-16-0134 | xxx-xx-0134 | (508)555-0102 | (508)***-0102    | 5548026033569479 | 1202657013173118 |
| Thomas     | Jefferson  | 1743-04-13 | 647-42-4787 | xxx-xx-4787 | NULL          | NULL             | 341605099040689  | 3776752473350728 |
| James      | Madison    | 1751-03-16 | 939-87-6230 | xxx-xx-6230 | (276)555-0104 | (276)***-0104    | 6223050740560951 | 5875320742028874 |
| James      | Monroe     | 1758-04-28 | 615-50-9245 | xxx-xx-9245 | (571)555-0105 | (571)***-0105    | 4916285199808116 | 3337533857432262 |
| John       | Adams      | 1767-07-11 | 677-54-5716 | xxx-xx-5716 | (617)555-0106 | (617)***-0106    | NULL             | NULL             |
| Andrew     | Jackson    | 1767-03-15 | 302-80-6308 | xxx-xx-6308 | (803)555-0107 | (803)***-0107    | 370325552763685  | 4201660821245511 |
| Martin     | Van Buren  | 1782-12-05 | 133-85-9284 | xxx-xx-9284 | (329)555-0108 | (329)***-0108    | 4024007112991004 | 7325744713583041 |
| William    | Harrison   | 1773-02-09 | 397-35-5070 | xxx-xx-5070 | (757)555-0109 | (757)***-0109    | 5535714026624691 | 5775203611674070 |
| John       | Tyler      | 1790-03-29 | 037-67-9069 | xxx-xx-9069 | (804)555-0110 | (804)***-0110    | 374077354201234  | 4040166113407848 |
+------------+------------+------------+-------------+-------------+---------------+------------------+------------------+------------------+
10 rows in set (0.00 sec)

Data Masking and De-Identification in INSERT statements

The MySQL data masking and de-identification functions work perfectly fine with INSERT statements, as seen below:

INSERT INTO `test`.`persons`
(
  `First Name`,
  `Last Name`,
  `DOB`,
  `SSN`,
  `Phone Number`,
  `Card Number`
)
VALUES
( 'James',
  'Polk',
  '1795-11-02',
  mask_ssn('196-15-3788'),
  '(919)555-0111',
  '5355367287235827'
);

To test this:

mysql> SELECT * FROM `test`.`persons` WHERE `First Name` = 'James' AND `Last Name` = 'Polk';
+----+------------+-----------+------------+-------------+---------------+------------------+
| id | First Name | Last Name | DOB        | SSN         | Phone Number  | Card Number      |
+----+------------+-----------+------------+-------------+---------------+------------------+
| 11 | James      | Polk      | 1795-11-02 | XXX-XX-3788 | (919)555-0111 | 5355367287235827 |
+----+------------+-----------+------------+-------------+---------------+------------------+
1 row in set (0.00 sec)

Data Masking and De-Identification in UPDATE statements

Now, an example of an UPDATE statement that modifies the row inserted in the previous example:

UPDATE
  `test`.`persons`
SET
  `Phone Number` = mask_inner(`Phone Number`,5,5,'*') AND
  `Card Number` = gen_rnd_pan(16)
WHERE
  `First Name` = 'James' AND
  `Last Name` = 'Polk';

And, we test it:

mysql> SELECT * FROM `test`.`persons` WHERE `First Name` = 'James' AND `Last Name` = 'Polk';
+----+------------+-----------+------------+-------------+---------------+------------------+
| id | First Name | Last Name | DOB        | SSN         | Phone Number  | Card Number      |
+----+------------+-----------+------------+-------------+---------------+------------------+
| 11 | James      | Polk      | 1795-11-02 | XXX-XX-3788 | (919)***-0111 | 7610524457065262 |
+----+------------+-----------+------------+-------------+---------------+------------------+
1 row in set (0.00 sec)

Implementing Data Masking and De-Identification Transparently

Ideally, data masking and de-identification should be applied at the database level in a way that is transparent to the application.  Unfortunately, MySQL does not natively implement these features transparently.  It is up the database administrator and/or developer to figure out a way to implement it.  Below are examples of methods that can help.

Triggers

MySQL supports triggers for INSERT, UPDATE, and DELETE operations.  In our scenario, we probably don't need to implement DELETE triggers.  MySQL INSERT triggers are applicable to INSERT, LOAD DATA, and REPLACE statements.  As a side note, UPDATE triggers, which are applicable to UPDATE statements, do not apply to the REPLACE statement because this statement is a combination of INSERT and DELETE operations, and not a derivative of the UPDATE statement.

Embedding data masking and de-identification operations in triggers is a great way to implement the functionality transparently for an application.  Below are examples of INSERT and UPDATE triggers.  The UPDATE trigger uses IF clauses that verify whether the incoming data is already masked.   This validation is probably superflous in our very simple example where further masking already masked values would not make any difference, but I present it anyway because it is a good concern to have in mind when implemeting an UPDATE trigger that masks data.

DELIMITER //

CREATE TRIGGER `test`.`persons_insert` BEFORE INSERT ON `test`.`persons` FOR EACH ROW
  BEGIN
    SET NEW.`SSN` = mask_ssn(NEW.`SSN`);
    SET NEW.`Phone Number` = mask_inner(NEW.`Phone Number`,5,5,'*');
    SET NEW.`Card Number` = gen_rnd_pan(16);
  END; //

CREATE TRIGGER `test`.`persons_update` BEFORE UPDATE ON `test`.`persons` FOR EACH ROW
  BEGIN
        IF LEFT(NEW.`SSN`,1) <> '*' THEN
          SET NEW.`SSN` = mask_ssn(NEW.`SSN`);
        END IF;
        IF SUBSTRING(NEW.`Phone Number`,5,3) <> '***' THEN
          SET NEW.`Phone Number` = mask_inner(NEW.`Phone Number`,5,5,'*');
        END IF;
        SET NEW.`Card Number` = gen_rnd_pan(16);
  END; //

DELIMITER ;

Now, some INSERT and UPDATE statements to test these triggers:

INSERT INTO `test`.`persons`
(
  `First Name`,
  `Last Name`,
  `DOB`,
  `SSN`,
  `Phone Number`,
  `Card Number`
)
VALUES
(
  'Zachary',
  'Taylor',
  '1784-11-24',
  mask_ssn('948-15-3788'),
  '(919)555-0112',
  '1234567890ABCDEF'
);

UPDATE
  `test`.`persons`
SET
  `SSN` = NULL,
  `Phone Number` = '(123)456-7890',
  `Card Number` = '1234567890ABCDEF'
WHERE
  `First Name` = 'James' AND
  `Last Name` = 'Polk';

Verify the results:

mysql> SELECT * FROM `test`.`persons` WHERE `Last Name` IN ('Polk', 'Taylor');
+----+------------+-----------+------------+-------------+---------------+------------------+
| id | First Name | Last Name | DOB        | SSN         | Phone Number  | Card Number      |
+----+------------+-----------+------------+-------------+---------------+------------------+
| 11 | James      | Polk      | 1795-11-02 | NULL        | (123)***-7890 | 4120428138834320 |
| 12 | Zachary    | Taylor    | 1784-11-24 | XXX-XX-3788 | (919)***-0112 | 8287051411720113 |
+----+------------+-----------+------------+-------------+---------------+------------------+
2 rows in set (0.00 sec)

SELECT statements

Transparently masking or de-identifying data for SELECT statements is a bigger challenge, which can be surmounted, albeit not in a bullet-proof way, by aid of the MySQL Query Rewriter.  The idea is to configure MySQL to modify incoming SELECT statements to use masking and de-identification functions in the projection list and elsewhere, if needed.

The MySQL Query Rewriter must first be installed as follows.  Command specifics may vary depending on the MySQL installation (especially the path).

mysql -uroot -p < /usr/share/mysql-8.4/install_rewriter.sql

Once installed, the Query Rewriter must be enabled:

SET GLOBAL rewriter_enabled = ON;

With the Query Rewriter now installed and enabled, a rule can thus be created to modify incoming INSERT statements on the table persons:

INSERT INTO `query_rewrite`.`rewrite_rules`(`pattern`, `replacement`) VALUES
(
  'SELECT `First Name`, `Last Name`, `DOB`, `SSN`, `Phone Number`, `Card Number` FROM `test`.`persons`',
  'SELECT `First Name`, `Last Name`, `DOB`, mask_ssn(`SSN`) AS `SSN`,`Phone Number`, gen_rnd_pan(16) AS `Card Number` FROM `test`.`persons`'
);

The new rule(s) must be flushed for the query digests to be created and the rules to be enabled:

CALL query_rewrite.flush_rewrite_rules();

Now, to test our rule:

mysql> SELECT `First Name`, `Last Name`, `DOB`, `SSN`, `Phone Number`, `Card Number` FROM `test`.`persons`;
+------------+------------+------------+-------------+---------------+------------------+
| First Name | Last Name  | DOB        | SSN         | Phone Number  | Card Number      |
+------------+------------+------------+-------------+---------------+------------------+
| George     | Washington | 1732-02-22 | NULL        | (804)555-0101 | 6762566676671220 |
| John       | Adams      | 1735-10-30 | XXX-XX-0134 | (508)555-0102 | 4505312582063302 |
| Thomas     | Jefferson  | 1743-04-13 | XXX-XX-4787 | NULL          | 6433845517288264 |
| James      | Madison    | 1751-03-16 | XXX-XX-6230 | (276)555-0104 | 3345803412061488 |
| James      | Monroe     | 1758-04-28 | XXX-XX-9245 | (571)555-0105 | 6668688581256346 |
| John       | Adams      | 1767-07-11 | XXX-XX-5716 | (617)555-0106 | 0357680107216358 |
| Andrew     | Jackson    | 1767-03-15 | XXX-XX-6308 | (803)555-0107 | 0540308164535431 |
| Martin     | Van Buren  | 1782-12-05 | XXX-XX-9284 | (329)555-0108 | 7864047134683383 |
| William    | Harrison   | 1773-02-09 | XXX-XX-5070 | (757)555-0109 | 0813006343624320 |
| John       | Tyler      | 1790-03-29 | XXX-XX-9069 | (804)555-0109 | 1131026814832510 |
| James      | Polk       | 1795-11-02 | NULL        | (123)***-7890 | 5346414414473731 |
| Zachary    | Taylor     | 1784-11-24 | XXX-XX-3788 | (919)***-0111 | 3865052427700231 |
+------------+------------+------------+-------------+---------------+------------------+
12 rows in set, 1 warning (0.00 sec)

The data returned is as desired for a query that did not originally contain any of the data masking and de-identification functions.  The SSN field is masked and the Card Number field is randomly generated, as expected.

However, I mentioned earlier that this method is not bullet-proof.  This is because the MySQL Query Rewriter works by comparing the digests of the patterns in the rules with the digests of incoming statements to find matches.  Any query that is not exactly structured the same way as are the patterns in the rules will have a different digest.  Therefore, the MySQL Query Rewriter, while great at transforming known queries as are those normally generated by enterprise applications, would not help against ad-hoc queries.  Here is an example in which I simply omit field First Name in the projection list.  We immediately observe that the SSN values are not masked.

mysql> SELECT `Last Name`, `DOB`, `SSN`, `Phone Number`, `Card Number` FROM `test`.`persons`;
+------------+------------+-------------+---------------+------------------+
| Last Name  | DOB        | SSN         | Phone Number  | Card Number      |
+------------+------------+-------------+---------------+------------------+
| Washington | 1732-02-22 | NULL        | (804)555-0101 | 4716958491933491 |
| Adams      | 1735-10-30 | 756-16-0134 | (508)555-0102 | 5548026033569479 |
| Jefferson  | 1743-04-13 | 647-42-4787 | NULL          | 341605099040689  |
| Madison    | 1751-03-16 | 939-87-6230 | (276)555-0104 | 6223050740560951 |
| Monroe     | 1758-04-28 | 615-50-9245 | (571)555-0105 | 4916285199808116 |
| Adams      | 1767-07-11 | 677-54-5716 | (617)555-0106 | NULL             |
| Jackson    | 1767-03-15 | 302-80-6308 | (803)555-0107 | 370325552763685  |
| Van Buren  | 1782-12-05 | 133-85-9284 | (329)555-0108 | 4024007112991004 |
| Harrison   | 1773-02-09 | 397-35-5070 | (757)555-0109 | 5535714026624691 |
| Tyler      | 1790-03-29 | 037-67-9069 | (804)555-0109 | 374077354201234  |
| Polk       | 1795-11-02 | NULL        | (123)***-7890 | 4120428138834320 |
| Taylor     | 1784-11-24 | XXX-XX-3788 | (919)***-0111 | 6715143436862560 |
+------------+------------+-------------+---------------+------------------+
12 rows in set (0.00 sec)

Conclusion

We saw in this article how to implement data masking and de-identification with MySQL Enterprise Edition.  In my opinion, the best part is that these operations can be done transparently for the application, as was demonstrated in the article.

These techniques modify the data, usually for the purpose of privacy, in a way that the original data cannot be recovered.  MySQL Enterprise Edition also supports a data obfuscation technique, which modifies the data in a way that can be used for data de-identification and allows for the reconstruction of the original data.  We will explore this technique in another article.

As always, please let me know of any errors or glaring omissions.

References