A practical guide
DALIBO
Feb. 2023
This is a 4 hours workshop that demonstrates various anonymization techniques using the PostgreSQL Anonymizer extension.
Paul’s boutique has a lot of customers. Paul asks his friend Pierre, a Data Scientist, to make some statistics about his clients : average age, etc…
Pierre wants a direct access to the database in order to write SQL queries.
Jack is an employee of Paul. He’s in charge of relationship with the various suppliers of the shop.
Paul respects his suppliers privacy. He needs to hide the personnal information to Pierre, but Jack needs read and write access the real data.
Using the simple example above, we will learn:
postgresql_anonymizer
is an extension to mask or replace
personally
identifiable information (PII) or commercially sensitive data from a
PostgreSQL database.
The project has a declarative approach of anonymization. This means you can declare the masking rules using the PostgreSQL Data Definition Language (DDL) and specify your anonymization strategy inside the table definition itself.
Once the maskings rules are defined, you can access the anonymized data in 4 different ways:
This presentation does not go into the details of the GPDR act and the general concepts of anonymization.
For more information about it, please refer to the talk below:
In order to make this workshop, you will need:
Debian 11 bullseye
or
Ubuntu 22.04
)PostgreSQL 14
)A simple way to deploy a workshop environment is to install Docker Desktop and download the image below:
Check out the INSTALL section in the documentation to learn how to install the extension in your PostgreSQL instance.
We will with 3 differents users:
Unless stated otherwise, all commands must be executed with the role
paul
.
Setup a .pgpass
file to simplify the connections !
cat > ~/.pgpass << EOL
*:*:boutique:paul:CHANGEME
*:*:boutique:pierre:CHANGEME
*:*:boutique:jack:CHANGEME
EOL
chmod 0600 ~/.pgpass
We will work on a database called “boutique”:
We need to activate the anon
library inside that
database:
This workshop is a collective work from Damien Clochard, Be Hai Tran, Florent Jardin, Frédéric Yhuel.
This document is distributed under the PostgreSQL license.
The source is available at
https://gitlab.com/dalibo/postgresql_anonymizer/-/tree/master/docs/how-to
Static Masking is the simplest way to hide personal information! This idea is simply to destroy the original data or replace it with an artificial one.
Over the years, Paul has collected data about his customers and their purchases in a simple database. He recently installed a brand new sales application and the old database is now obsolete. He wants to save it and he would like to remove all personal information before archiving it.
In this section, we will learn:
DROP TABLE IF EXISTS customer CASCADE;
DROP TABLE IF EXISTS payout CASCADE;
CREATE TABLE customer ( id SERIAL PRIMARY KEY, firstname TEXT, lastname TEXT, phone TEXT, birth DATE, postcode TEXT );
Insert a few persons:
INSERT INTO customer
VALUES (107,'Sarah','Conor','060-911-0911', '1965-10-10', '90016'),
(258,'Luke', 'Skywalker', NULL, '1951-09-25', '90120'),
(341,'Don', 'Draper','347-515-3423', '1926-06-01', '04520') ;
id | firstname | lastname | phone | birth | postcode |
---|---|---|---|---|---|
107 | Sarah | Conor | 060-911-0911 | 1965-10-10 | 90016 |
258 | Luke | Skywalker | None | 1951-09-25 | 90120 |
341 | Don | Draper | 347-515-3423 | 1926-06-01 | 04520 |
Sales are tracked in a simple table:
CREATE TABLE payout ( id SERIAL PRIMARY KEY, fk_customer_id INT REFERENCES customer(id), order_date DATE, payment_date DATE, amount INT );
Let's add some orders:
Paul wants to hide the last name and the phone numbers of his
clients. He will use the fake_last_name()
and
partial()
functions for that:
anonymize_table |
---|
True |
id | firstname | lastname | phone |
---|---|---|---|
107 | Sarah | Morgan | 06X-XXX-XX11 |
258 | Luke | Thomas | None |
341 | Don | Clarke | 34X-XXX-XX23 |
This is called
Static Masking
because the real data has been permanently replaced. We'll see later how we can use dynamic anonymization or anonymous exports.
Declare a new masking rule and run the static anonymization function again.
Paul realizes that the postcode gives a clear indication of where his customers live. However he would like to have statistics based on their "postcode area".
Add a new masking rule to replace the last 3 digits by ‘x’.
Aggregate the customers based on their anonymized postcode.
Paul wants age-based statistic. But he also wants to hide the real birth date of the customers.
Replace all the birth dates by January 1rst, while keeping the real year.
HINT: You can use the make_date function !
Even if the "customer" is properly anonymized, we can still isolate a given individual based on data stored outside of the table. For instance, we can identify the best client of Paul's boutique with a query like this:
WITH best_client AS
(SELECT SUM(amount),
fk_customer_id
FROM payout
GROUP BY fk_customer_id
ORDER BY 1 DESC
LIMIT 1)
SELECT c.*
FROM customer c
JOIN best_client b ON (c.id = b.fk_customer_id)
id | firstname | lastname | phone | birth | postcode |
---|---|---|---|---|---|
341 | Don | Clarke | 34X-XXX-XX23 | 1926-06-01 | 04520 |
This is called Singling Out a person.
We need to anonymize even further by removing the link between a person and its company. In the "order" table, this link is materialized by a foreign key on the field "fk_company_id". However we can't remove values from this column or insert fake identifiers because if would break the foreign key constraint.
How can we separate the customers from their payouts while respecting the integrity of the data?
Find a function that will shuffle the column "fk_company_id" of the "payout" table
HINT: Check out the static masking section of the documentation
postcode | count |
---|---|
90xxx | 2 |
04xxx | 1 |
SECURITY LABEL
FOR anon ON COLUMN customer.birth IS 'MASKED WITH FUNCTION make_date(EXTRACT(YEAR FROM birth)::INT,1,1)';
SELECT anon.anonymize_table('customer');
SELECT id,
firstname,
lastname,
birth
FROM customer;
run-postgres: ‘MASKED WITH FUNCTION make_date(EXTRACT(YEAR FROM birth)::INT,1,1)’ is not a valid masking function
Let's mix up the values of the fk_customer_id:
shuffle_column |
---|
True |
Now let's try to single out the best client again :
WITH best_client AS
(SELECT SUM(amount),
fk_customer_id
FROM payout
GROUP BY fk_customer_id
ORDER BY 1 DESC
LIMIT 1)
SELECT c.*
FROM customer c
JOIN best_client b ON (c.id = b.fk_customer_id);
id | firstname | lastname | phone | birth | postcode |
---|---|---|---|---|---|
258 | Timothy | Morris | None | 1951-09-25 | 90xxx |
WARNING
Note that the link between a customer
and its
payout
is now completely false. For instance, if a customer
A had 2 payouts. One of these payout may be linked to a customer B,
while the second one is linked to a customer C.
In other words, this shuffling method with respect the foreign key constraint (aka the referential integrity) but it will break the data integrity. For some use case, this may be a problem.
In this case, Pierre will not be able to produce a BI report with the shuffle data, because the links between the customers and their payments are fake.
With Dynamic Masking, the database owner can hide personnal data for some users, while other users are still allowed to read and write the authentic data.
Paul has 2 employees:
In this section, we will learn:
DROP TABLE IF EXISTS supplier CASCADE;
DROP TABLE IF EXISTS company CASCADE;
CREATE TABLE company ( id SERIAL PRIMARY KEY, name TEXT, vat_id TEXT UNIQUE );
INSERT INTO company
VALUES (952,'Shadrach', 'FR62684255667'),
(194,E'Johnny\'s Shoe Store','CHE670945644'),
(346,'Capitol Records','GB663829617823') ;
id | name | vat_id |
---|---|---|
952 | Shadrach | FR62684255667 |
194 | Johnny's Shoe Store | CHE670945644 |
346 | Capitol Records | GB663829617823 |
CREATE TABLE supplier ( id SERIAL PRIMARY KEY, fk_company_id INT REFERENCES company(id), contact TEXT, phone TEXT, job_title TEXT );
INSERT INTO supplier
VALUES (299,194,'Johnny Ryall','597-500-569','CEO'),
(157,346,'George Clinton', '131-002-530','Sales manager') ;
id | fk_company_id | contact | phone | job_title |
---|---|---|---|---|
299 | 194 | Johnny Ryall | 597-500-569 | CEO |
157 | 346 | George Clinton | 131-002-530 | Sales manager |
start_dynamic_masking |
---|
True |
Now connect as Pierre and try to read the supplier table:
id | fk_company_id | contact | phone | job_title |
---|---|---|---|---|
299 | 194 | Johnny Ryall | 597-500-569 | CEO |
157 | 346 | George Clinton | 131-002-530 | Sales manager |
For the moment, there is no masking rule so Pierre can see the original data in each table.
Connect as Paul and define a masking rule on the supplier table:
Now connect as Pierre and try to read the supplier table again:
id | fk_company_id | contact | phone | job_title |
---|---|---|---|---|
299 | 194 | CONFIDENTIAL | 597-500-569 | CEO |
157 | 346 | CONFIDENTIAL | 131-002-530 | Sales manager |
Now connect as Jack and try to read the real data:
id | fk_company_id | contact | phone | job_title |
---|---|---|---|---|
299 | 194 | Johnny Ryall | 597-500-569 | CEO |
157 | 346 | George Clinton | 131-002-530 | Sales manager |
Masking the supplier name is clearly not enough to provide anonymity.
Connect as Pierre and write a simple SQL query that would reindentify some suppliers based on their job and their company.
Company names and job positions are available in many public datasets. A simple search on Linkedin or Google, would give you the names of the top executives of most companies..
This is called Linkability: the ability to connect multiple records concerning the same data subject.
We need to anonymize the "company" table, too. Even if they don't contain personal information, some fields can be used to infer the identity of their employees...
Write 2 masking rules for the company table. The first one will replace the "name" field with a fake name. The second will replace the "vat_id" with a random sequence of 10 characters
HINT: Go to the documentation and look at the faking functions and random functions!
Connect as Pierre and check that he cannot view the real company info:
Because of dynamic masking, the fake values will be different everytime Pierre tries to read the table.
Pierre would like to have always the same fake values for a given company. This is called pseudonymization.
Write a new masking rule over the "vat_id" field by generating 10 random characters using the md5() function.
Write a new masking rule over the "name" field by using a pseudonymizing function.
SELECT s.id,
s.contact,
s.job_title,
c.name
FROM supplier s
JOIN company c ON s.fk_company_id = c.id;
id | contact | job_title | name |
---|---|---|---|
299 | CONFIDENTIAL | CEO | Johnny's Shoe Store |
157 | CONFIDENTIAL | Sales manager | Capitol Records |
SECURITY LABEL
FOR anon ON COLUMN company.name IS 'MASKED WITH FUNCTION anon.fake_company()';
SECURITY LABEL
FOR anon ON COLUMN company.vat_id IS 'MASKED WITH FUNCTION anon.random_string(10)';
Now connect as Pierre and read the table again:
id | name | vat_id |
---|---|---|
952 | Thomas-Carter | R55CFB8ZRJ |
194 | Logan Ltd | GCX0YZCSJW |
346 | Tyler LLC | 2UVC557K6C |
Pierre will see different "fake data" everytime he reads the table:
id | name | vat_id |
---|---|---|
952 | Turner-Valencia | P8GBIZTU7E |
194 | Briggs, Lewis and Frederick | VGUZ04X7JR |
346 | Torres PLC | H2CAWUPUN6 |
ALTER FUNCTION anon.pseudo_company SECURITY DEFINER;
SECURITY LABEL
FOR anon ON COLUMN company.name IS 'MASKED WITH FUNCTION anon.pseudo_company(id)';
Connect as Pierre and read the table multiple times:
id | name | vat_id |
---|---|---|
952 | Wilkinson LLC | HWHX1FK8NW |
194 | Johnson PLC | OD3DNK0PCS |
346 | Young-Carpenter | YQM770DXQ1 |
id | name | vat_id |
---|---|---|
952 | Wilkinson LLC | H7QQ5FCUVA |
194 | Johnson PLC | HF6ZQCA50J |
346 | Young-Carpenter | CI74UW1TRH |
Now the fake company name is always the same.
In many situation, what we want is simply to export the anonymized data into another database (for testing or to produce statistics). This is what pg_dump_anon does!
Paul has a website and a comment section where customers can express their views.
He hired a web agency to develop a new design for his website. The agency asked for a SQL export (dump) of the current website database. Paul wants to "clean" the database export and remove any personnal information contained in the comment section.
SELECT message->'meta'->'name' AS name,
message->'content' AS content
FROM website_comment
ORDER BY id ASC
name | content |
---|---|
Lee Perry | Hello Nasty! |
Great Shop | |
Jimmy | Hi ! This is me, Jimmy James |
The "comment" field is filled with personal information and the fact the field does not have a standard schema makes our tasks harder.
In general, unstructured data are difficult to mask.
As we can see, web visitors can write any kind of information in the comment section. Our best option is to remove this key entirely because there's no way to extract personnal data properly.
We can clean the comment column simply by removing the "content" key!
?column? |
---|
{'meta': {'name': 'Lee Perry', 'ip_addr': '40.87.29.113'}} |
First let's create a dedicated schema and declare it as trusted. This means the "anon" extension will accept the functions located in this schema as valid masking functions. Only a superuser should be able to add functions in this schema.
Now we can write a function that remove the message content:
Let's try it!
remove_content |
---|
{'meta': {'name': 'Lee Perry', 'ip_addr': '40.87.29.113'}} |
{'meta': {'name': '', 'email': 'biz@bizmarkie.com'}} |
{'meta': {'name': 'Jimmy'}} |
And now we can use it in a masking rule:
Finally we can export an anonymous dump of the table
with pg_dump_anon
:
Create a database named "boutique_anon" and transfer the entire database into it.
Pierre plans to extract general information from the metadata. For instance, he wants to calculate the number of unique visitors based on the different IP adresses. But an IP adress is an indirect identifier, so Paul needs to anonymize this field while maintaining the fact that some values appear multiple times.
Replace the remove_content
function with a better one
called clean_comment
that will:
HINT: Look at the
jsonb_set()
andjsonb_build_object()
functions
CREATE OR REPLACE FUNCTION my_masks.clean_comment(message JSONB) RETURNS JSONB VOLATILE LANGUAGE SQL AS $func$ SELECT jsonb_set( message, ARRAY['meta'], jsonb_build_object( 'name',anon.fake_last_name(), 'ip_address', md5((message->'meta'->'ip_addr')::TEXT), 'email', NULL ) ) - ARRAY['content']; $func$;
clean_comment |
---|
{'meta': {'name': 'Morgan', 'email': None, 'ip_address': '1d8cbcdef988d55982af1536922ddcd1'}} |
{'meta': {'name': 'Thomas', 'email': None, 'ip_address': None}} |
{'meta': {'name': 'Clarke', 'email': None, 'ip_address': None}} |
The main idea of generalization is to "blur" the original data. For example, instead of saying "Mister X was born on July 25, 1989", we can say "Mister X was born is the 80's". The information is still true, but it is less precise and it can't be used to reidentify the subject.
Paul hired dozens of employees over the years. He kept a record of their hair color, size and medical condition.
Paul wants to extract weird stats from these details. He provides generalized views to Pierre.
In this section, we will learn:
DROP TABLE IF EXISTS employee CASCADE;
CREATE TABLE employee ( id INT PRIMARY KEY, full_name TEXT, first_day DATE, last_day DATE, height INT, hair TEXT, eyes TEXT, size TEXT, asthma BOOLEAN, CHECK(hair = ANY(ARRAY['bald','blond','dark','red'])), CHECK(eyes = ANY(ARRAY['blue','green','brown'])) , CHECK(size = ANY(ARRAY['S','M','L','XL','XXL'])) );
This is awkward and illegal.
Loading the data:
count |
---|
16 |
full_name | first_day | hair | size | asthma |
---|---|---|---|---|
Luna Dickens | 2018-07-22 | blond | L | True |
Paul Wolf | 2020-01-15 | bald | M | False |
Rowan Hoeger | 2018-12-01 | dark | XXL | True |
Paul wants to find if there's a correlation between asthma and the eyes color.
He provides the following view to Pierre.
DROP MATERIALIZED VIEW IF EXISTS v_asthma_eyes;
CREATE MATERIALIZED VIEW v_asthma_eyes AS
SELECT eyes,
asthma
FROM employee;
eyes | asthma |
---|---|
blue | True |
brown | False |
blue | True |
Pierre can now write queries over this view.
SELECT eyes,
100*COUNT(1) FILTER (
WHERE asthma) / COUNT(1) AS asthma_rate
FROM v_asthma_eyes
GROUP BY eyes;
eyes | asthma_rate |
---|---|
green | 100 |
brown | 37 |
blue | 33 |
Pierre just proved that asthma is caused by green eyes.
The 'asthma' and 'eyes' are considered as indirect identifiers.
SECURITY LABEL
FOR anon ON COLUMN v_asthma_eyes.eyes IS 'INDIRECT IDENTIFIER';
SECURITY LABEL
FOR anon ON COLUMN v_asthma_eyes.asthma IS 'INDIRECT IDENTIFIER';
run-postgres: ‘INDIRECT IDENTIFIER’ is not a valid label for a column
k_anonymity |
---|
None |
The v_asthma_eyes has '2-anonymity'. This means that each quasi-identifier combination (the 'eyes-asthma' tuples) occurs in at least 2 records for a dataset.
In other words, it means that each individual in the view cannot be distinguished from at least 1 (k-1) other individual.
DROP MATERIALIZED VIEW IF EXISTS v_staff_per_month;
CREATE MATERIALIZED VIEW v_staff_per_month AS
SELECT anon.generalize_daterange(first_day, 'month') AS first_day,
anon.generalize_daterange(last_day, 'month') AS last_day
FROM employee;
first_day | last_day |
---|---|
[2018-07-01, 2018-08-01) | [2018-12-01, 2019-01-01) |
[2020-01-01, 2020-02-01) | (None, None) |
[2018-12-01, 2019-01-01) | [2018-12-01, 2019-01-01) |
Pierre can write a query to find how many employees were hired in november 2021.
SELECT COUNT(1) FILTER (
WHERE make_date(2019, 11, 1) BETWEEN lower(first_day) AND COALESCE(upper(last_day), now()) )
FROM v_staff_per_month;
count |
---|
4 |
Now let's check the k-anonymity of this view by declaring which columns are indirect identifiers.
SECURITY LABEL
FOR anon ON COLUMN v_staff_per_month.first_day IS 'INDIRECT IDENTIFIER';
SECURITY LABEL
FOR anon ON COLUMN v_staff_per_month.last_day IS 'INDIRECT IDENTIFIER';
SELECT anon.k_anonymity('v_staff_per_month');
run-postgres: ‘INDIRECT IDENTIFIER’ is not a valid label for a column
In this case, the k factor is 1 which means that at least one unique individual can be identified directly by his/her first and last dates.
v_staff_per_month
and decrease
granularityGeneralizing dates per month is not enough. Write another view called 'v_staff_per_year' that will generalize dates per year.
Also simplify the view by using a range of int to store the years instead of a date range.
How many people worked for Paul for each year between 2018 and 2021?
v_staff_per_year
viewWhat is the k-anonymity of ‘v_staff_per_month_years’?
DROP MATERIALIZED VIEW IF EXISTS v_staff_per_year;
CREATE MATERIALIZED VIEW v_staff_per_year AS
SELECT int4range(extract(YEAR
FROM first_day)::INT, extract(YEAR
FROM last_day)::INT, '[]') AS period
FROM employee;
‘[]’ will include the upper bound
period |
---|
[2018, 2019) |
[2020, None) |
[2018, 2019) |
SELECT YEAR,
COUNT(1) FILTER (
WHERE YEAR <@ period )
FROM generate_series(2018, 2021) YEAR,
v_staff_per_year
GROUP BY YEAR
ORDER BY YEAR ASC;
year | count |
---|---|
2018 | 4 |
2019 | 6 |
2020 | 9 |
2021 | 10 |
SECURITY LABEL
FOR anon ON COLUMN v_staff_per_year.period IS 'INDIRECT IDENTIFIER';
SELECT anon.k_anonymity('v_staff_per_year');
run-postgres: ‘INDIRECT IDENTIFIER’ is not a valid label for a column
run-postgres: database “boutique” is being accessed by other users DETAIL: There is 1 other session using the database.
run-postgres: role “jack” cannot be dropped because some objects depend on it DETAIL: 5 objects in database boutique
Static Masking : perfect for "once-and-for-all" anonymization
Dynamic Masking : useful when one user is untrusted
Anonymous Dumps : can be used in CI/CD workflows
RTFM -> Masking Functions
Other projects you may like
This is a free and open project!
labs.dalibo.com/postgresql_anonymizer
Please send us feedback on how you use it, how it fits your needs (or not), etc.
Sources are here: gitlab.com/dalibo/postgresql_anonymizer
Download the PDF Handout
:::