MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
Data Masking in MySQL

If you’re here, that probably means you’re about to acquire some sensitive data to take care of. Or that you’ve already acquired it and are protecting it, and you want to see how MySQL Enterprise Data Masking and De-Identification features can help you do it in an easier, better and more efficient manner. It also doesn’t hurt to just add another item to your “bag of tricks”.

Whatever the reason, with recent legislature pushing (can I hear GDPR?) and continuous industry focus on data security, data masking is slowly becoming more and more indispensable part of our daily routine. In this article we show what is currently possible with MySQL Enterprise Data Masking and De-Identification (let’s call it data masking, for short). If you’re interested in data masking in general, the all-mighty Wikipedia provides a nice, lean-and-mean intro material.

First step

Customers who’ve had an early look at this new data masking have responded very positively, and remarked on how useful this set of functions will be for them. We initially focused on the most common requirements by releasing this suite of simple to install functions. They can be enabled extremely easy within your running MySQL server, both on 5.7.24 and 8.0.13 releases. We’ve already gotten new requests from customers for enhancements, and we plan to add them going forward.

Ofcourse, there are many reasons to be rocking our latest-and-greatest (8.0 series) and you should really, really, look into it 🙂

Specialized or generic – what’s your flavour?

Often we have to do with typical formats of sensitive data, such as credit card numbers (also known as PAN, Personal Account Number), social security numbers etc. For such purposes, we’ve created appropriate, fixed-function style solutions.

mysql> SELECT mask_pan(“1234567890123456”);
+------------------------------+
| mask_pan(“1234567890123456”) |
+------------------------------+
| XXXXXXXXXXXX3456             |
+------------------------------+
1 row in set (0,00 sec)
 
mysql> SELECT mask_pan_relaxed(“1234567890123456”);
+--------------------------------------+
| mask_pan_relaxed(“1234567890123456”) |
+--------------------------------------+
| 123456XXXXXX3456                     |
+--------------------------------------+
1 row in set (0,00 sec)
 
mysql> SELECT mask_ssn(“123-45-6789”);
+-------------------------+
| mask_ssn(“123-45-6789”) |
+-------------------------+
| XXX-XX-6789             |
+-------------------------+
1 row in set (0,00 sec)

As you can see, these functions perform partial masking on a well known format using predefined, industry standard masking definitions. Partial masking exposes part(s) of the original data which are deemed safe enough to be used, e.g. for identification purposes.

We imagine you’ll probably think of more places where partial masking could get used (and when you do, we’d like to know more about it, so ping us). In the mean time, if provided fixed-format functions do not apply to your particular problem, maybe our generic partial masking will do the trick:

mysql> SELECT mask_inner(“Michael”, 1, 0);
+-----------------------------+
| mask_inner(“Michael”, 1, 0) |
+-----------------------------+
| MXXXXXX                     |
+-----------------------------+
1 row in set (0,01 sec)
 
mysql> SELECT mask_inner(“1234567890123456”, 0, 4, ‘*’);
+-------------------------------------------+
| mask_inner(“1234567890123456”, 0, 4, ‘*’) |
+-------------------------------------------+
| ************3456                          |
+-------------------------------------------+
1 row in set (0,00 sec)
 
mysql> SELECT mask_inner(“john-the-ceo@company.com”, 0, 12, ‘-‘);
+----------------------------------------------------+
| mask_inner(“john-the-ceo@company.com”, 0, 12, ‘-‘) |
+----------------------------------------------------+
| ------------@company.com                           |
+----------------------------------------------------+
1 row in set (0,00 sec)
 
mysql> SELECT mask_outer(“1-555-123-4567”, 5, 0, ‘?’);
+-----------------------------------------+
| mask_outer(“1-555-123-4567”, 5, 0, ‘?’) |
+-----------------------------------------+
| ?????-123-4567                          |
+-----------------------------------------+
1 row in set (0,00 sec)

The usage is intended to be rather self-explanatory, but you can always take a look at our data masking reference for all the details.

What about my data?

“Hang on”, you say, “but I have sensitive data stored in tables, how does this help me?” We’re only getting warmed up here, so stay tuned. For the purposes of this blog entry, we decide to (ab)use a part of our data masking facilities responsible for random data generation.

mysql> CREATE TABLE employees(
-> id INT NOT NULL AUTO_INCREMENT,
-> name VARCHAR(100) NOT NULL,
-> surname VARCHAR(100) NOT NULL,
-> email VARCHAR(100) NOT NULL,
-> phone VARCHAR(20) NOT NULL,
-> title VARCHAR(50) NOT NULL,
-> salary INT NOT NULL,
-> city VARCHAR(30) NOT NULL,
-> ssn CHAR(11) NOT NULL,
-> credit_card CHAR(19) NOT NULL,
-> PRIMARY KEY (id))
-> CHARACTER SET=latin1;
Query OK, 0 rows affected (0,12 sec)
 
mysql> delimiter //
mysql> CREATE procedure masking.make_10_rows()
-> wholeblock:BEGIN
-> DECLARE x INT;
-> SET x = 10;
-> REPEAT
-> INSERT INTO employees(
-> name,surname,email,phone,title,
-> salary,city,ssn,credit_card
-> ) SELECT
-> gen_dictionary(“Names”),
-> gen_dictionary(“Surnames”),
-> gen_rnd_email(),
-> gen_rnd_us_phone(),
-> gen_dictionary(“Titles”),
-> gen_range(50000, 250000),
-> gen_dictionary(“US_Cities”),
-> gen_rnd_ssn(),
-> gen_rnd_pan();
-> select sleep(1);
-> SET x = x -- 1;
-> UNTIL x <= 0
-> END REPEAT;
-> END//
Query OK, 0 rows affected (0,04 sec)
 
mysql> delimiter ;
mysql> call make_10_rows();

The original use case for our random data generators was masking of sensitive data which is not to be seen at all, not even partially. In many cases, we needed to mimic the look-and-feel of the original data, so that many applications relying on data format would continue to function properly. Turns out, random data generation, if used properly, can also have an alter-ego (as many heroes do), the “Data Fabricator”. We trust you to use this power for good purpose. Here’s what your data may look like after random generating data:

mysql> SELECT name,surname,email FROM employees;
+----------+-----------+---------------------------+
| name     | surname   | email                     |
+----------+-----------+---------------------------+
| Damian   | Rodriguez | ipnmh.jnxisba@example.com |
| Oscar    | Byrne     | gaqli.vstccfo@example.com |
| Richard  | Li        | hgyoh.dvqqhvq@example.com |
| Emily    | Brown     | vjotj.hkvqisk@example.com |
| Olivia   | Davies    | xduhr.lttnyqo@example.com |
| Margaret | White     | rmdqr.hdttduf@example.com |
| Joseph   | Li        | kexmk.lspmuqd@example.com |
| Jeniffer | White     | wrxjc.xhmhhlx@example.com |
| Thomas   | O’Kelly   | imqsd.bqrounw@example.com |
| John     | Garcia    | qxntl.ksuhnrg@example.com |
+----------+-----------+---------------------------+
10 rows in set (0,00 sec)
 
mysql> SELECT phone,title,city FROM employees;
+----------------+--------------+---------------+
| phone          | title        | city          |
+----------------+--------------+---------------+
| 1-555-710-7876 | Manager 2    | San Jose      |
| 1-555-086-4016 | CTO          | Fresno        |
| 1-555-478-8364 | Manager 3    | Seattle       |
| 1-555-771-6028 | Engineer 3   | Kansas City   |
| 1-555-132-2027 | Specialist 2 | Jacksonville  |
| 1-555-065-5578 | Engineer 3   | Oklahoma City |
| 1-555-171-3075 | Manager 3    | Phoenix       |
| 1-555-741-0173 | Engineer 3   | Boston        |
| 1-555-530-8384 | Engineer 2   | Los Angeles   |
| 1-555-822-1087 | Engineer 2   | Columbus      |
+----------------+--------------+---------------+
10 rows in set (0,00 sec)
 
mysql> SELECT salary,ssn,credit_card FROM employees;
+--------+-------------+------------------+
| salary | ssn         | credit_card      |
+--------+-------------+------------------+
| 67940 | 909-38-3013  | 1234567890123456 |
| 192235 | 954-48-6304 | 2345678901234567 |
| 207216 | 943-09-7108 | 3456789012345678 |
| 232295 | 907-20-3293 | 4567890123456789 |
| 66443 | 981-19-9069  | 5678901234567890 |
| 62366 | 966-24-1895  | 6789012345678901 |
| 126843 | 996-10-1105 | 7890123456789012 |
| 120765 | 939-55-6878 | 8901234567890123 |
| 109612 | 972-02-9968 | 9012345678901234 |
| 116995 | 907-03-8417 | 0123456789012345 |
+--------+-------------+------------------+
10 rows in set (0,00 sec)

Notice: although gen_rnd_pan() function will create a proper (Luhn check compliant) credit card number, we have “edited” generated numbers for the purpose of publication, and replaced them with sequence of digits. There is no actual guarantee that some of the generated numbers are not belonging to a real person account.

Dictionaries

One of the interesting features we used “under the hood” in this example is a data masking dictionary. When data is generated randomly, you typically can choose a range, and can expect an output within that whole range. Data masking dictionaries enable us to select a random value from a set of predefined values we can “load” into the masking engine. In order for previous example to be possible, we had to add our dictionaries beforehand in a way similar to this:

SELECT gen_dictionary_load("/dict/path/file.txt", "DictionaryName");

At this point, dictionaries support only string values, and dictionary files are essentially newline delimited rows of distinct values. Sometimes the simple solution works best, and we tried to ease it up, for starters. The dictionary shall be available for usage until we shut down MySQL server, or we issue the following command:

SELECT gen_dictionary_drop("DictionaryName");

For the purposes of generating data, we’ve created short dictionaries for columns name, surname, city and title. We have also created another dictionary with german city names, which is going to be used for following examples.

Masking the data

At this point, we believe hope that “fabricated” data bears some resemblance to your actual production data, give or take a column or two. It’s obvious that most of the columns fall into sensitive data territory. Now we’ll use combined forces of partial and random data masking to create a masked view of our data.

mysql> CREATE VIEW employee_mask AS
-> SELECT
-> id,
-> mask_inner(name, 1, 0, _binary’*’) AS name,
-> mask_inner(surname, 1, 0, _binary’*’) AS surname,
-> gen_rnd_email() as email,
-> gen_rnd_us_phone() as phone,
-> gen_blacklist(title, “C-level”, “Worker”) as title,
-> gen_range(100000, 120000) as salary,
-> gen_dictionary(“DE_Cities”) as city,
-> mask_ssn(ssn) as ssn,
-> mask_pan(credit_card) as credit_card
-> FROM employees;
Query OK, 0 rows affected (0,04 sec)

Voila – you have yourself a masked version of your sensitive data:

mysql> SELECT name,surname,email FROM employee_mask;
+----------+-----------+---------------------------+
| name     | surname   | email                     |
+----------+-----------+---------------------------+
| D*****   | R******** | xanfk.ivkvhcu@example.com |
| O****    | B****     | jeiod.xajqyby@example.com |
| R******  | L*        | ktxfp.cummlvh@example.com |
| E****    | B****     | qftuk.enjsrho@example.com |
| O*****   | D*****    | hdajm.hrsyogv@example.com |
| M******* | W****     | aapgy.kvojqfa@example.com |
| J*****   | L*        | aixfs.weoqrab@example.com |
| J******* | W****     | cyosn.tvburvk@example.com |
| T*****   | O******   | ouaou.xfhycfm@example.com |
| J***     | G*****    | ysqwy.ohablhi@example.com |
+----------+-----------+---------------------------+
10 rows in set (0,00 sec)

Our first and last names got cosily switched to initials, and emails are all from example.com domain (not that they weren’t before, but your actual e-mail addresses are probably not from this domain).

mysql> SELECT phone,title,city FROM employee_mask;
+----------------+--------------+--------------+
| phone          | title        | city         |
+----------------+--------------+--------------+
| 1-555-316-5755 | Manager 2    | Bremen       |
| 1-555-132-6885 | Janitor      | Muenster     |
| 1-555-461-5475 | Manager 3    | Mannheim     |
| 1-555-415-1856 | Engineer 3   | Munich       |
| 1-555-665-3002 | Specialist 2 | Wiesbaden    |
| 1-555-683-0121 | Engineer 3   | Leipzig      |
| 1-555-861-6081 | Manager 3    | Hannover     |
| 1-555-352-8884 | Engineer 3   | Duisburg     |
| 1-555-033-3441 | Engineer 2   | Wuppertal    |
| 1-555-428-7360 | Engineer 2   | Braunschweig |
+----------------+--------------+--------------+
10 rows in set (0,00 sec)

New US phone numbers got applied, and special C-level employee got slightly demoted. Not to mention that the whole team transferred to EU.

mysql> SELECT salary,ssn,credit_card FROM employee_mask;
+--------+-------------+------------------+
| salary | ssn         | credit_card      |
+--------+-------------+------------------+
| 112256 | XXX-XX-3013 | XXXXXXXXXXXX6905 |
| 108754 | XXX-XX-6304 | XXXXXXXXXXXX2471 |
| 102690 | XXX-XX-7108 | XXXXXXXXXXXX9396 |
| 119930 | XXX-XX-3293 | XXXXXXXXXXXX0512 |
| 113246 | XXX-XX-9069 | XXXXXXXXXXXX6875 |
| 113359 | XXX-XX-1895 | XXXXXXXXXXXX4129 |
| 115219 | XXX-XX-1105 | XXXXXXXXXXXX3346 |
| 113183 | XXX-XX-6878 | XXXXXXXXXXXX1672 |
| 106002 | XXX-XX-9968 | XXXXXXXXXXXX1890 |
| 109498 | XXX-XX-8417 | XXXXXXXXXXXX9717 |
+--------+-------------+------------------+
10 rows in set (0,00 sec)

Finally, salaries are “compressed” a bit, social security numbers are partially masked, and no one will get to use your employee’s credit card numbers in an online store.

Conclusion

Security is of utmost importance in today’s world, MySQL will continue to focus on security features and for sure data masking and de-identification. We look for customer feedback to guide our roadmap to meet your top challenges first. Tell us what you want. And as always thanks for using MySQL.