K-Anonymity (K-Member Micro-Aggregations)

For many use cases in an organization, de-identification is sufficient to meet privacy requirements (for instance because the data was collected lawfully, which of course it was...). Masking or (rotated) encryption hides the plaintext data while retaining the patterns and value inside your data. However, this gives no mathematical guarantee that a user can never be re-identified from patterns in the non-personal data.

Some cases require strict guarantees on how much personal information remains in the data (by default, or because policies inside the organization mandate it). By definition, removing personal information from data also removes potentially useful information.

If this is a trade-off you need to strike, you can extend your pipelines with K-MMA: K-Member Micro-Aggregations.

K-Member Micro-Aggregations and K-Anonymity

A common way to describe how hard it is to re-identify your data is k-anonymity. A dataset is said to be k-anonymous if and only if, for every data subject, at least k-1 other data subjects have the same characteristics. These characteristics are based on a set of quasi-identifiers (see definitions). With K-Member Micro-Aggregations we can guarantee that for each row, at least k-1 other rows exist with the same values for the quasi-identifiers you define.
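
The definition can be checked mechanically by counting how often each combination of quasi-identifier values occurs. The following is a minimal sketch in plain Python and not part of the platform; the function name `k_anonymity` and the sample rows are made up for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    This is the size of the smallest group of rows that share the same
    values for all quasi-identifiers (0 for an empty dataset).
    """
    groups = Counter(
        tuple(row[name] for name in quasi_identifiers) for row in rows
    )
    return min(groups.values(), default=0)


# Hypothetical rows: two quasi-identifiers plus one other column.
rows = [
    {"age": 30, "hairColor": "blonde", "email": "a@example.com"},
    {"age": 30, "hairColor": "blonde", "email": "b@example.com"},
    {"age": 30, "hairColor": "blonde", "email": "c@example.com"},
    {"age": 27, "hairColor": "brown", "email": "d@example.com"},
    {"age": 27, "hairColor": "brown", "email": "e@example.com"},
]

# The brown group has only two rows, so the dataset is 2-anonymous.
print(k_anonymity(rows, ["age", "hairColor"]))  # 2
```

A dataset satisfies a required minimum k when the returned value is at least that k.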

To illustrate this principle, consider the following example. We generated random data that resembles transactions on an e-commerce platform, and created a data contract with the following simple schema:

nodes:
  - type: STRING
    name: transactionId
    repeated: false
    required: true
  - type: STRING
    name: userId
  - type: STRING
    name: email
  - type: INTEGER
    name: age
  - type: STRING
    name: size
  - type: INTEGER
    name: transactionAmount
  - type: STRING
    name: items
  - type: STRING
    name: hairColor
  - type: INTEGER
    name: itemCount
  - type: STRING
    name: date
  - type: INTEGER
    name: purpose

A sample of the generated data:

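| transactionId, userId | email                     | age | size | hairColor | transactionAmount | items                        | itemCount | date                | purpose |
|-----------------------|---------------------------|-----|------|-----------|-------------------|------------------------------|-----------|---------------------|---------|
| 861200791533445       | jeffreypowell@hotmail.com | 33  | XS   | red       | 123               | [19063]                      | 1         | 2022-08-30 15:44:44 | 1       |
| 733970993468355       | forbeserik@gmail.com      | 16  | S    | brown     | 46                | [13342, 12309, 13755, 10134] | 4         | 2022-07-19 15:44:44 | 2       |
| 494723158553892       | wboone@gmail.com          | 64  | XS   | black     | 73                | [13342, 10773, 12442]        | 3         | 2022-06-18 15:44:44 | 2       |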

Data contract annotation for K-Anonymity

Micro-aggregations require additional information in the data contract. For each quasi-identifier, we distinguish between three statistical data types.

  1. Numerical: numerical data type with natural ordering, e.g. age in years
  2. Nominal: categorical data without ordering, e.g. a person's hair color
  3. Ordinal: categorical data with a specific order, e.g. clothing size (S, M, L, ...)

We define the quasi-identifiers in the FieldMetadata.

{
  "data_contract": {
    ...,
    "field_metadata": [
      {
        "field_name": "size",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "ORDINAL",
        "ordinal_values": [
          "XS",
          "S",
          "M",
          "L",
          "XL"
        ]
      },
      {
        "field_name": "age",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "NUMERICAL"
      },
      {
        "field_name": "transactionAmount",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "NUMERICAL"
      },
      {
        "field_name": "hairColor",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "NOMINAL"
      }
    ]
  }
}

For each quasi-identifier:

  1. Define the field_name, which corresponds to the header of the column
  2. Indicate that the field is a quasi-identifier (is_quasi_id)
  3. Define the statistical data type (NUMERICAL, NOMINAL, ORDINAL)
  4. If your column has ordinal data, define the ordinal_values in their respective order
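
Before submitting a contract, the four rules above can be checked locally. This is a rough sketch only: `check_field_metadata` is a hypothetical helper, not a platform function, and the platform may validate contracts differently.

```python
VALID_TYPES = {"NUMERICAL", "NOMINAL", "ORDINAL"}


def check_field_metadata(field_metadata):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for entry in field_metadata:
        name = entry.get("field_name", "<missing field_name>")
        if not entry.get("personal_data_config", {}).get("is_quasi_id", False):
            continue  # this sketch only checks quasi-identifiers
        data_type = entry.get("statistical_data_type")
        if data_type not in VALID_TYPES:
            problems.append(f"{name}: unknown statistical_data_type {data_type!r}")
        elif data_type == "ORDINAL" and not entry.get("ordinal_values"):
            problems.append(f"{name}: ORDINAL field without ordinal_values")
    return problems


# Two of the entries from the contract above, written as Python dicts.
field_metadata = [
    {
        "field_name": "size",
        "personal_data_config": {"is_quasi_id": True},
        "statistical_data_type": "ORDINAL",
        "ordinal_values": ["XS", "S", "M", "L", "XL"],
    },
    {
        "field_name": "age",
        "personal_data_config": {"is_quasi_id": True},
        "statistical_data_type": "NUMERICAL",
    },
]

print(check_field_metadata(field_metadata))  # []
```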

Hence, the micro-aggregations use only these four columns of the data. A sample is shown below.

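| age | hairColor | size | transactionAmount |
|-----|-----------|------|-------------------|
| 30  | blonde    | XL   | 44                |
| 30  | blonde    | XL   | 19                |
| 26  | blonde    | XL   | 31                |
| 27  | brown     | S    | 220               |
| 27  | brown     | S    | 257               |
| 32  | brown     | S    | 233               |
| 28  | black     | XL   | 172               |
| 31  | black     | XL   | 209               |
| 28  | black     | XL   | 216               |
| 28  | black     | XL   | 167               |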

Micro-Aggregations Batch Job

Now that the data contract is defined correctly (make sure the contract is active), define the configuration for the micro-aggregations batch job. We assume you have created a data connector in the same project. The configuration file should then look like this:

{
  "source_data": {
    "data_connector_ref": { "name": "dataConnectorName" },
    "path_prefix": "input",
    "file_name": "file.csv",
    "data_type": { "csv": { "charset": "UTF-8" } }
  },
  "target_data": {
    "data_connector_ref": { "name": "dataConnectorName" },
    "path_prefix": "output",
    "file_name": "file.csv",
    "data_type": { "csv": { "charset": "UTF-8" } }
  },
  "data_contract_ref": {
    "handle": "handle",
    "name": "name",
    "version": "xx.xx.xx"
  },
  "aggregation_config": {
    "minimum_k_anonymity": 3
  }
}

The configuration specifies:

  1. The source of the data: your data connector, the path prefix and file name, and the type of the data (only csv is currently available)
  2. The target location of the data, specified in the same way as the source data
  3. The reference to the data contract applied to the data
  4. The minimum k-anonymity you want to ensure
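
If you prefer to generate the configuration rather than write it by hand, it can be assembled with Python's standard json module. This is only a sketch: `build_config` and its parameters are hypothetical names, the keys simply mirror the example above, and the contract values are the same placeholders.

```python
import json


def build_config(connector, handle, name, version, minimum_k):
    """Assemble a batch job configuration with the same keys as the example."""
    if minimum_k < 1:
        raise ValueError("minimum_k must be a positive integer")

    def location(path_prefix):
        # Source and target share the same shape and differ only in the prefix.
        return {
            "data_connector_ref": {"name": connector},
            "path_prefix": path_prefix,
            "file_name": "file.csv",
            "data_type": {"csv": {"charset": "UTF-8"}},
        }

    return {
        "source_data": location("input"),
        "target_data": location("output"),
        "data_contract_ref": {"handle": handle, "name": name, "version": version},
        "aggregation_config": {"minimum_k_anonymity": minimum_k},
    }


config = build_config("dataConnectorName", "handle", "name", "xx.xx.xx", 3)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Saving the printed JSON to a file gives you a configuration you can pass to the batch job.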

Finally, you can run the micro-aggregations batch job by running the following command:

strm create batch-job --type micro-aggregation --file my_config.json

The results of our example are found below:

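Before:

| age | hairColor | size | transactionAmount |
|-----|-----------|------|-------------------|
| 30  | blonde    | XL   | 44                |
| 30  | blonde    | XL   | 19                |
| 26  | blonde    | XL   | 31                |
| 27  | brown     | S    | 220               |
| 27  | brown     | S    | 257               |
| 32  | brown     | S    | 233               |
| 28  | black     | XL   | 172               |
| 31  | black     | XL   | 209               |
| 28  | black     | XL   | 216               |
| 28  | black     | XL   | 167               |

After:

| age                | hairColor | size | transactionAmount  |
|--------------------|-----------|------|--------------------|
| 28.666666666666668 | blonde    | XL   | 31.33333333333336  |
| 28.666666666666668 | blonde    | XL   | 31.33333333333336  |
| 28.666666666666668 | blonde    | XL   | 31.33333333333336  |
| 28.666666666666668 | brown     | S    | 236.66666666666669 |
| 28.666666666666668 | brown     | S    | 236.66666666666669 |
| 28.666666666666668 | brown     | S    | 236.66666666666669 |
| 28.75              | black     | XL   | 191.0              |
| 28.75              | black     | XL   | 191.0              |
| 28.75              | black     | XL   | 191.0              |
| 28.75              | black     | XL   | 191.0              |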
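
In this example, each numerical value in the output equals the mean of that column within its group of rows. The sketch below reproduces that arithmetic in plain Python. It is not the batch job's algorithm: the job decides for itself which rows to group, whereas this sketch assumes the groups are already known and keys them on hairColor and size, which happens to match the groups in this example. The function name `aggregate` is made up for illustration.

```python
from collections import Counter, defaultdict

# The "Before" rows of the example: (age, hairColor, size, transactionAmount).
before = [
    (30, "blonde", "XL", 44),
    (30, "blonde", "XL", 19),
    (26, "blonde", "XL", 31),
    (27, "brown", "S", 220),
    (27, "brown", "S", 257),
    (32, "brown", "S", 233),
    (28, "black", "XL", 172),
    (31, "black", "XL", 209),
    (28, "black", "XL", 216),
    (28, "black", "XL", 167),
]


def aggregate(rows):
    """Replace the numerical columns of each row by the mean of its group.

    Assumption for this sketch: a group is all rows sharing hairColor and size.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row[1], row[2])].append(row)

    aggregated = []
    for _, hair_color, size, _ in rows:
        members = groups[(hair_color, size)]
        mean_age = sum(m[0] for m in members) / len(members)
        mean_amount = sum(m[3] for m in members) / len(members)
        aggregated.append((mean_age, hair_color, size, mean_amount))
    return aggregated


after = aggregate(before)

# Every combination of values now occurs at least three times.
smallest_group = min(Counter(after).values())
print(smallest_group)  # 3
```

The smallest group has three rows, which matches the minimum_k_anonymity of 3 in the configuration.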

To check whether your output data actually satisfies the minimum k-anonymity, you can use our privacy diagnostics Python package. The full example that we used can be found here.