MySQL Enterprise Edition has for years featured methods for masking and de-identifying data. Let's explore, with examples, how to use them!
Agenda of the article
- The Data Masking and De-Identification extensions of MySQL Enterprise Edition
- Example dataset
- Data Masking and De-Identification in SELECT statements
- Data Masking and De-Identification in INSERT statements
- Data Masking and De-Identification in UPDATE statements
- Implementing Data Masking and De-Identification transparently
- Conclusion
- References
The Data Masking and De-Identification extensions of MySQL Enterprise Edition
Data masking refers to masking portions or all of the data by replacing some of its content with a masking character. For example, the SSN 123-45-6789 can be thus masked: xxx-xx-6789, to only display the last 4 digits. Data de-identification changes a value to a different value in a way that it is still useful to the application, for example by replacing a unique value with another unique value, but makes it impossible to know the original value. As the name implies, the purpose de-identification is to prevent identification of private data.
Data masking and data de-identification functionality is provided as an extension of MySQL Enterprise Edition in the form of both a plugin and a component. Plugins are the original mechanism for extending MySQL functionality. They provide an API to extend various functions of the server. For example, there is a plugin API type for authentication methods, one plugin API type for storage engines, and so on. MySQL components are the new kid-on-the-block, introduced in MySQL 8.0. The component framework extends the server functionality like plugins but offers much more functionality to the developers as well as it allows individual components to interact with each other, which plugins are unable to do.
The reason I bring this up is because MySQL Enterprise Edition has both a data masking and de-identification plugin and component installed. And they both provide similar but different functionality! The table below contains a list of differences between the functionality provided by the plugin and that provided by the component. The database administrator must be careful to use only either the plugin or the component as only one of them can be enabled at a time. Of the two, the component should be favored as it provides enhanced functionality and is actively developed, whereas the plugin is subject to removal in a future release (due and timely notice will be provided).
Function | Plugin | Component |
---|---|---|
Interface | Loadable functions | Service functions, loadable functions |
Support for multibyte character sets | No | Yes for general-purpose masking functions |
General-purpose masking functions | mask_inner(), mask_outer() | mask_inner(), mask_outer |
Masking of specific types | PAN, SSN | PAN, SSN, IBAN, UUID, Canada SIN, UK NIN |
Random generation of specific types | Email, US phone, PAN, SSN | Email, US phone, PAN, SSN, IBAN, UUID, Canada SIN, UK NIN |
Random generation of integer from given ranges | Yes | Yes |
Persisting substitution dictionaries | File-based | In database |
Privilege to manage dictionaries | FILE | Privilege specific to component (MASKING_DICTIONARIES_ADMIN) |
Automated loadable function registration and deregistration during installation and uninstallation |
No | Yes |
Enhancements to existing functions | N/A | More arguments to the function gen_rnd_email() |
The steps to install the component are the following (note: it comes as two component files):
INSTALL COMPONENT 'file://component_masking';
INSTALL COMPONENT 'file://component_masking_functions';
To confirm that the component is installed:
mysql> SELECT * FROM `component`;
+--------------+--------------------+------------------------------------+
| component_id | component_group_id | component_urn |
+--------------+--------------------+------------------------------------+
| ... | ... | ... |
| 2 | 2 | file://component_masking |
| 3 | 3 | file://component_masking_functions |
+--------------+--------------------+------------------------------------+
Example Dataset
This article contains several practical examples that use the following dataset:
First Name | Last Name | DOB | SSN | Phone Number | Card Number |
---|---|---|---|---|---|
George | Washington | 1732-02-22 | NULL | (804)555-0101 | 4716958491933491 |
John | Adams | 1735-10-30 | 756-16-0134 | (508)555-0102 | 5548026033569479 |
Thomas | Jefferson | 1743-04-13 | 647-42-4787 | NULL | 341605099040689 |
James | Madison | 1751-03-16 | 939-87-6230 | (276)555-0104 | 6223050740560951 |
James | Monroe | 1758-04-28 | 615-50-9245 | (571)555-0105 | 4916285199808116 |
John | Adams | 1767-07-11 | 677-54-5716 | (617)555-0106 | NULL |
Andrew | Jackson | 1767-03-15 | 302-80-6308 | (803)555-0107 | 370325552763685 |
Martin | Van Buren | 1782-12-05 | 133-85-9284 | (329)555-0108 | 4024007112991004 |
William | Harrison | 1773-02-09 | 397-35-5070 | (757)555-0109 | 5535714026624691 |
John | Tyler | 1790-03-29 | 037-67-9069 | (804)555-0110 | 374077354201234 |
Here is the SQL to provision the above dataset:
CREATE SCHEMA IF NOT EXISTS `test`;
CREATE TABLE `test`.`persons`
(
`id` INT PRIMARY KEY AUTO_INCREMENT,
`First Name` VARCHAR(32) NOT NULL,
`Last Name` VARCHAR(32) NULL,
`DOB` DATE NOT NULL,
`SSN` CHAR(11) NULL,
`Phone Number` CHAR(13) NULL,
`Card Number` VARCHAR(16) NULL
);
INSERT INTO `test`.`persons`
( `First Name`, `Last Name`, `DOB`, `SSN`, `Phone Number`, `Card Number` )
VALUES
( 'George', 'Washington', '1732-02-22', NULL, '(804)555-0101', '4716958491933491' ),
( 'John', 'Adams', '1735-10-30', '756-16-0134', '(508)555-0102', '5548026033569479' ),
( 'Thomas', 'Jefferson', '1743-04-13', '647-42-4787', NULL, '341605099040689' ),
( 'James', 'Madison', '1751-03-16', '939-87-6230', '(276)555-0104', '6223050740560951' ),
( 'James', 'Monroe', '1758-04-28', '615-50-9245', '(571)555-0105', '4916285199808116' ),
( 'John', 'Adams', '1767-07-11', '677-54-5716', '(617)555-0106', NULL ),
( 'Andrew', 'Jackson', '1767-03-15', '302-80-6308', '(803)555-0107', '370325552763685' ),
( 'Martin', 'Van Buren', '1782-12-05', '133-85-9284', '(329)555-0108', '4024007112991004' ),
( 'William', 'Harrison', '1773-02-09', '397-35-5070', '(757)555-0109', '5535714026624691' ),
( 'John', 'Tyler', '1790-03-29', '037-67-9069', '(804)555-0110', '374077354201234' );
Data Masking and De-Identification in SELECT statements
Below is an example using the data masking and de-identification functions. These functions can be used anywhere that SQL functions may normally be used such as within WHERE clauses, projection lists, and elsewhere. The documentation for these functions can be found here.
mysql> SELECT
-> `First Name`,
-> `Last Name`,
-> `DOB`,
-> mask_ssn(`SSN`,'x') AS `SSN`,
-> mask_inner(`Phone Number`,5,5,'*') AS `Phone Number`,
-> `Card Number`,
-> gen_rnd_pan(16) AS `Random Number`
-> FROM
-> `test`.`persons`;
+------------+------------+------------+-------------+---------------+------------------+------------------+
| First Name | Last Name | DOB | SSN | Phone | Card Number | Random Number |
+------------+------------+------------+-------------+---------------+------------------+------------------+
| George | Washington | 1732-02-22 | NULL | (804)***-0101 | 4716958491933491 | 4701543887067000 |
| John | Adams | 1735-10-30 | xxx-xx-0134 | (508)***-0102 | 5548026033569479 | 4812011388774058 |
| Thomas | Jefferson | 1743-04-13 | xxx-xx-4787 | NULL | 341605099040689 | 8206380111725769 |
| James | Madison | 1751-03-16 | xxx-xx-6230 | (276)***-0104 | 6223050740560951 | 0102261140601415 |
| James | Monroe | 1758-04-28 | xxx-xx-9245 | (571)***-0105 | 4916285199808116 | 6123363455567462 |
| John | Adams | 1767-07-11 | xxx-xx-5716 | (617)***-0106 | NULL | 8357707102536143 |
| Andrew | Jackson | 1767-03-15 | xxx-xx-6308 | (803)***-0107 | 370325552763685 | 1202083866771074 |
| Martin | Van Buren | 1782-12-05 | xxx-xx-9284 | (329)***-0108 | 4024007112991004 | 0847031686483639 |
| William | Harrison | 1773-02-09 | xxx-xx-5070 | (757)***-0109 | 5535714026624691 | 6107678513862674 |
| John | Tyler | 1790-03-29 | xxx-xx-9069 | (804)***-0110 | 374077354201234 | 4335451458208116 |
+------------+------------+------------+-------------+---------------+------------------+------------------+
10 rows in set (0.00 sec)
The fields SSN and Phone Number are masked with specialized data masking functions, and random card numbers are generated that pass validation (Luhn). Note how the original NULL values remain NULL when masked. The randomly generated card number values, which are presented in a seperate column only to show that they are different from the original values, do not have NULL values because the generation function does not take into account the original values. To account for them would require the use of a "CASE" clause in the projection list in the style of:
CASE
WHEN `Card Number` IS NULL THEN NULL
ELSE gen_rnd_pan(16)
END AS `Random Card`
SELECT INTO and CREATE TABLE AS SELECT
The SELECT ... INTO and CREATE TABLE ... AS SELECT statements work similarly as SELECT statements. The data masking and de-identification functions can be used anywhere SQL functions may be used.
Here is an example of a CREATE TABLE ... AS SELECT statement. To add value to the example, I have included the "CASE" clause that was mentioned above.
CREATE TABLE `test`.`persons2`
(
`id` INT PRIMARY KEY AUTO_INCREMENT,
`First Name` VARCHAR(32) NOT NULL,
`Last Name` VARCHAR(32) NULL,
`DOB` DATE NOT NULL,
`SSN` CHAR(11) NULL,
`Phone Number` CHAR(13) NULL,
`Card Number` VARCHAR(16) NULL
) AS
SELECT
`First Name`,
`Last Name`,
`DOB`,
mask_ssn(`SSN`,'x') AS `SSN`,
mask_inner(`Phone Number`,5,5,'*') AS `Phone Number`,
CASE
WHEN `Card Number` IS NULL THEN NULL
ELSE gen_rnd_pan(16)
END AS `Card Number`
FROM
`test`.`persons`;
To confirm that the statement worked as expected:
mysql> SELECT
-> `Original`.`First Name`,
-> `Original`.`Last Name`,
-> `Original`.`DOB`,
-> `Original`.`SSN`,
-> `New`.`SSN` AS `New SSN`,
-> `Original`.`Phone Number`,
-> `New`.`Phone Number` AS `New Phone Number`,
-> `Original`.`Card Number`,
-> `New`.`Card Number` AS `New Card Number`
-> FROM
-> `test`.`persons` `Original`
-> INNER JOIN
-> `test`.`persons2` `New`
-> ON
-> `Original`.`First Name` = `New`.`First Name` AND
-> `Original`.`Last Name` = `New`.`Last Name` AND
-> `Original`.`DOB` = `New`.`DOB`;
+------------+------------+------------+-------------+-------------+---------------+------------------+------------------+------------------+
| First Name | Last Name | DOB | SSN | New SSN | Phone Number | New Phone Number | Card Number | New Card Number |
+------------+------------+------------+-------------+-------------+---------------+------------------+------------------+------------------+
| George | Washington | 1732-02-22 | NULL | NULL | (804)555-0101 | (804)***-0101 | 4716958491933491 | 5343770563021230 |
| John | Adams | 1735-10-30 | 756-16-0134 | xxx-xx-0134 | (508)555-0102 | (508)***-0102 | 5548026033569479 | 1202657013173118 |
| Thomas | Jefferson | 1743-04-13 | 647-42-4787 | xxx-xx-4787 | NULL | NULL | 341605099040689 | 3776752473350728 |
| James | Madison | 1751-03-16 | 939-87-6230 | xxx-xx-6230 | (276)555-0104 | (276)***-0104 | 6223050740560951 | 5875320742028874 |
| James | Monroe | 1758-04-28 | 615-50-9245 | xxx-xx-9245 | (571)555-0105 | (571)***-0105 | 4916285199808116 | 3337533857432262 |
| John | Adams | 1767-07-11 | 677-54-5716 | xxx-xx-5716 | (617)555-0106 | (617)***-0106 | NULL | NULL |
| Andrew | Jackson | 1767-03-15 | 302-80-6308 | xxx-xx-6308 | (803)555-0107 | (803)***-0107 | 370325552763685 | 4201660821245511 |
| Martin | Van Buren | 1782-12-05 | 133-85-9284 | xxx-xx-9284 | (329)555-0108 | (329)***-0108 | 4024007112991004 | 7325744713583041 |
| William | Harrison | 1773-02-09 | 397-35-5070 | xxx-xx-5070 | (757)555-0109 | (757)***-0109 | 5535714026624691 | 5775203611674070 |
| John | Tyler | 1790-03-29 | 037-67-9069 | xxx-xx-9069 | (804)555-0110 | (804)***-0110 | 374077354201234 | 4040166113407848 |
+------------+------------+------------+-------------+-------------+---------------+------------------+------------------+------------------+
10 rows in set (0.00 sec)
Data Masking and De-Identification in INSERT statements
The MySQL data masking and de-identification functions work perfectly fine with INSERT statements, as seen below:
INSERT INTO `test`.`persons`
(
`First Name`,
`Last Name`,
`DOB`,
`SSN`,
`Phone Number`,
`Card Number`
)
VALUES
( 'James',
'Polk',
'1795-11-02',
mask_ssn('196-15-3788'),
'(919)555-0111',
'5355367287235827'
);
To test this:
mysql> SELECT * FROM `test`.`persons` WHERE `First Name` = 'James' AND `Last Name` = 'Polk';
+----+------------+-----------+------------+-------------+---------------+------------------+
| id | First Name | Last Name | DOB | SSN | Phone Number | Card Number |
+----+------------+-----------+------------+-------------+---------------+------------------+
| 11 | James | Polk | 1795-11-02 | XXX-XX-3788 | (919)555-0111 | 5355367287235827 |
+----+------------+-----------+------------+-------------+---------------+------------------+
1 row in set (0.00 sec)
Data Masking and De-Identification in UPDATE statements
Now, an example of an UPDATE statement that modifies the row inserted in the previous example:
UPDATE
`test`.`persons`
SET
`Phone Number` = mask_inner(`Phone Number`,5,5,'*') AND
`Card Number` = gen_rnd_pan(16)
WHERE
`First Name` = 'James' AND
`Last Name` = 'Polk';
And, we test it:
mysql> SELECT * FROM `test`.`persons` WHERE `First Name` = 'James' AND `Last Name` = 'Polk';
+----+------------+-----------+------------+-------------+---------------+------------------+
| id | First Name | Last Name | DOB | SSN | Phone Number | Card Number |
+----+------------+-----------+------------+-------------+---------------+------------------+
| 11 | James | Polk | 1795-11-02 | XXX-XX-3788 | (919)***-0111 | 7610524457065262 |
+----+------------+-----------+------------+-------------+---------------+------------------+
1 row in set (0.00 sec)
Implementing Data Masking and De-Identification Transparently
Ideally, data masking and de-identification should be applied at the database level in a way that is transparent to the application. Unfortunately, MySQL does not natively implement these features transparently. It is up the database administrator and/or developer to figure out a way to implement it. Below are examples of methods that can help.
Triggers
MySQL supports triggers for INSERT, UPDATE, and DELETE operations. In our scenario, we probably don't need to implement DELETE triggers. MySQL INSERT triggers are applicable to INSERT, LOAD DATA, and REPLACE statements. As a side note, UPDATE triggers, which are applicable to UPDATE statements, do not apply to the REPLACE statement because this statement is a combination of INSERT and DELETE operations, and not a derivative of the UPDATE statement.
Embedding data masking and de-identification operations in triggers is a great way to implement the functionality transparently for an application. Below are examples of INSERT and UPDATE triggers. The UPDATE trigger uses IF clauses that verify whether the incoming data is already masked. This validation is probably superflous in our very simple example where further masking already masked values would not make any difference, but I present it anyway because it is a good concern to have in mind when implemeting an UPDATE trigger that masks data.
DELIMITER //
CREATE TRIGGER `test`.`persons_insert` BEFORE INSERT ON `test`.`persons` FOR EACH ROW
BEGIN
SET NEW.`SSN` = mask_ssn(NEW.`SSN`);
SET NEW.`Phone Number` = mask_inner(NEW.`Phone Number`,5,5,'*');
SET NEW.`Card Number` = gen_rnd_pan(16);
END; //
CREATE TRIGGER `test`.`persons_update` BEFORE UPDATE ON `test`.`persons` FOR EACH ROW
BEGIN
IF LEFT(NEW.`SSN`,1) <> '*' THEN
SET NEW.`SSN` = mask_ssn(NEW.`SSN`);
END IF;
IF SUBSTRING(NEW.`Phone Number`,5,3) <> '***' THEN
SET NEW.`Phone Number` = mask_inner(NEW.`Phone Number`,5,5,'*');
END IF;
SET NEW.`Card Number` = gen_rnd_pan(16);
END; //
DELIMITER ;
Now, some INSERT and UPDATE statements to test these triggers:
INSERT INTO `test`.`persons`
(
`First Name`,
`Last Name`,
`DOB`,
`SSN`,
`Phone Number`,
`Card Number`
)
VALUES
(
'Zachary',
'Taylor',
'1784-11-24',
mask_ssn('948-15-3788'),
'(919)555-0112',
'1234567890ABCDEF'
);
UPDATE
`test`.`persons`
SET
`SSN` = NULL,
`Phone Number` = '(123)456-7890',
`Card Number` = '1234567890ABCDEF'
WHERE
`First Name` = 'James' AND
`Last Name` = 'Polk';
Verify the results:
mysql> SELECT * FROM `test`.`persons` WHERE `Last Name` IN ('Polk', 'Taylor');
+----+------------+-----------+------------+-------------+---------------+------------------+
| id | First Name | Last Name | DOB | SSN | Phone Number | Card Number |
+----+------------+-----------+------------+-------------+---------------+------------------+
| 11 | James | Polk | 1795-11-02 | NULL | (123)***-7890 | 4120428138834320 |
| 12 | Zachary | Taylor | 1784-11-24 | XXX-XX-3788 | (919)***-0112 | 8287051411720113 |
+----+------------+-----------+------------+-------------+---------------+------------------+
2 rows in set (0.00 sec)
SELECT statements
Transparently masking or de-identifying data for SELECT statements is a bigger challenge, which can be surmounted, albeit not in a bullet-proof way, by aid of the MySQL Query Rewriter. The idea is to configure MySQL to modify incoming SELECT statements to use masking and de-identification functions in the projection list and elsewhere, if needed.
The MySQL Query Rewriter must first be installed as follows. Command specifics may vary depending on the MySQL installation (especially the path).
mysql -uroot -p < /usr/share/mysql-8.4/install_rewriter.sql
Once installed, the Query Rewriter must be enabled:
SET GLOBAL rewriter_enabled = ON;
With the Query Rewriter now installed and enabled, a rule can thus be created to modify incoming INSERT statements on the table persons:
INSERT INTO `query_rewrite`.`rewrite_rules`(`pattern`, `replacement`) VALUES
(
'SELECT `First Name`, `Last Name`, `DOB`, `SSN`, `Phone Number`, `Card Number` FROM `test`.`persons`',
'SELECT `First Name`, `Last Name`, `DOB`, mask_ssn(`SSN`) AS `SSN`,`Phone Number`, gen_rnd_pan(16) AS `Card Number` FROM `test`.`persons`'
);
The new rule(s) must be flushed for the query digests to be created and the rules to be enabled:
CALL query_rewrite.flush_rewrite_rules();
Now, to test our rule:
mysql> SELECT `First Name`, `Last Name`, `DOB`, `SSN`, `Phone Number`, `Card Number` FROM `test`.`persons`;
+------------+------------+------------+-------------+---------------+------------------+
| First Name | Last Name | DOB | SSN | Phone Number | Card Number |
+------------+------------+------------+-------------+---------------+------------------+
| George | Washington | 1732-02-22 | NULL | (804)555-0101 | 6762566676671220 |
| John | Adams | 1735-10-30 | XXX-XX-0134 | (508)555-0102 | 4505312582063302 |
| Thomas | Jefferson | 1743-04-13 | XXX-XX-4787 | NULL | 6433845517288264 |
| James | Madison | 1751-03-16 | XXX-XX-6230 | (276)555-0104 | 3345803412061488 |
| James | Monroe | 1758-04-28 | XXX-XX-9245 | (571)555-0105 | 6668688581256346 |
| John | Adams | 1767-07-11 | XXX-XX-5716 | (617)555-0106 | 0357680107216358 |
| Andrew | Jackson | 1767-03-15 | XXX-XX-6308 | (803)555-0107 | 0540308164535431 |
| Martin | Van Buren | 1782-12-05 | XXX-XX-9284 | (329)555-0108 | 7864047134683383 |
| William | Harrison | 1773-02-09 | XXX-XX-5070 | (757)555-0109 | 0813006343624320 |
| John | Tyler | 1790-03-29 | XXX-XX-9069 | (804)555-0109 | 1131026814832510 |
| James | Polk | 1795-11-02 | NULL | (123)***-7890 | 5346414414473731 |
| Zachary | Taylor | 1784-11-24 | XXX-XX-3788 | (919)***-0111 | 3865052427700231 |
+------------+------------+------------+-------------+---------------+------------------+
12 rows in set, 1 warning (0.00 sec)
The data returned is as desired for a query that did not originally contain any of the data masking and de-identification functions. The SSN field is masked and the Card Number field is randomly generated, as expected.
However, I mentioned earlier that this method is not bullet-proof. This is because the MySQL Query Rewriter works by comparing the digests of the patterns in the rules with the digests of incoming statements to find matches. Any query that is not exactly structured the same way as are the patterns in the rules will have a different digest. Therefore, the MySQL Query Rewriter, while great at transforming known queries as are those normally generated by enterprise applications, would not help against ad-hoc queries. Here is an example in which I simply omit field First Name in the projection list. We immediately observe that the SSN values are not masked.
mysql> SELECT `Last Name`, `DOB`, `SSN`, `Phone Number`, `Card Number` FROM `test`.`persons`;
+------------+------------+-------------+---------------+------------------+
| Last Name | DOB | SSN | Phone Number | Card Number |
+------------+------------+-------------+---------------+------------------+
| Washington | 1732-02-22 | NULL | (804)555-0101 | 4716958491933491 |
| Adams | 1735-10-30 | 756-16-0134 | (508)555-0102 | 5548026033569479 |
| Jefferson | 1743-04-13 | 647-42-4787 | NULL | 341605099040689 |
| Madison | 1751-03-16 | 939-87-6230 | (276)555-0104 | 6223050740560951 |
| Monroe | 1758-04-28 | 615-50-9245 | (571)555-0105 | 4916285199808116 |
| Adams | 1767-07-11 | 677-54-5716 | (617)555-0106 | NULL |
| Jackson | 1767-03-15 | 302-80-6308 | (803)555-0107 | 370325552763685 |
| Van Buren | 1782-12-05 | 133-85-9284 | (329)555-0108 | 4024007112991004 |
| Harrison | 1773-02-09 | 397-35-5070 | (757)555-0109 | 5535714026624691 |
| Tyler | 1790-03-29 | 037-67-9069 | (804)555-0109 | 374077354201234 |
| Polk | 1795-11-02 | NULL | (123)***-7890 | 4120428138834320 |
| Taylor | 1784-11-24 | XXX-XX-3788 | (919)***-0111 | 6715143436862560 |
+------------+------------+-------------+---------------+------------------+
12 rows in set (0.00 sec)
Conclusion
We saw in this article how to implement data masking and de-identification with MySQL Enterprise Edition. In my opinion, the best part is that these operations can be done transparently for the application, as was demonstrated in the article.
These techniques modify the data, usually for the purpose of privacy, in a way that the original data cannot be recovered. MySQL Enterprise Edition also supports a data obfuscation technique, which modifies the data in a way that can be used for data de-identification and allows for the reconstruction of the original data. We will explore this technique in another article.
As always, please let me know of any errors or glaring omissions.
References
- Documentation: https://dev.mysql.com/doc/refman/8.4/en/data-masking.html
- Developer article: https://blogs.oracle.com/mysql/post/explore-the-new-mysql-enterprise-data-masking-and-deidentification-features
- Query Rewriter documentation: https://dev.mysql.com/doc/refman/8.4/en/rewriter-query-rewrite-plugin.html