K-Anonymity (K-Member Micro-Aggregations)

For many use cases in an organization, de-identification is enough to take care of privacy (for instance because the data was collected lawfully, which of course it was...). Masking or (rotated) encryption hides the plaintext data while retaining the patterns and value inside it. However, this gives no mathematical guarantee that a user can never be re-identified from patterns in the non-personal data.

There are cases where strict guarantees on how much personal information remains in the data are necessary (by default, or mandated by policies inside the organization). By definition, removing personal information from data also removes potentially useful information from it.

If this is a trade-off you need to strike, you can extend your pipelines with K-MMA: k-member micro-aggregations.

K-Member Micro-Aggregations and K-Anonymity

A common way to describe how hard it is to re-identify your data is by means of k-anonymity. A dataset is said to be k-anonymous if and only if, for any data subject, at least k-1 other data subjects have the same characteristics. These characteristics are based on a set of quasi-identifiers (see definitions). With k-member micro-aggregations we can guarantee that, for each row, at least k-1 other rows exist with the same values for the quasi-identifiers you define.
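The definition above can be turned into a small check. The following is a minimal sketch, not part of any STRM tooling: the function name `k_anonymity` and the in-memory list of dicts are assumptions made for illustration. It reports the size of the smallest group of rows that share the same quasi-identifier values, which is the largest k for which the data is k-anonymous.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share
    identical values for all quasi-identifiers (0 for empty input)."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical in-memory sample; column names follow the example schema.
rows = [
    {"age": 30, "hairColor": "blonde", "size": "XL"},
    {"age": 30, "hairColor": "blonde", "size": "XL"},
    {"age": 26, "hairColor": "blonde", "size": "XL"},
]

print(k_anonymity(rows, ["hairColor", "size"]))         # 3
print(k_anonymity(rows, ["age", "hairColor", "size"]))  # 1
```

Adding age as a quasi-identifier drops k to 1, because the row with age 26 is then unique. This is exactly the situation micro-aggregation is meant to repair.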

To illustrate this principle, consider the following example. We generated some random data that resembles an e-commerce platform and created a data contract with the following simple schema:

nodes:
  - type: STRING
    name: transactionId
    repeated: false
    required: true
  - type: STRING
    name: userId
  - type: STRING
    name: email
  - type: INTEGER
    name: age
  - type: STRING
    name: size
  - type: INTEGER
    name: transactionAmount
  - type: STRING
    name: items
  - type: STRING
    name: hairColor
  - type: INTEGER
    name: itemCount
  - type: STRING
    name: date
  - type: INTEGER
    name: purpose

A sample of the generated data:

transactionId	userId	email	age	size	hairColor	transactionAmount	items	itemCount	date	purpose
861200791	533445	jeffreypowell@hotmail.com	33	XS	red	123	[19063]	1	2022-08-30 15:44:44	1
733970993	468355	forbeserik@gmail.com	16	S	brown	46	[13342, 12309, 13755, 10134]	4	2022-07-19 15:44:44	2
494723158	553892	wboone@gmail.com	64	XS	black	73	[13342, 10773, 12442]	3	2022-06-18 15:44:44	2

Data contract annotation for K-Anonymity

Micro-aggregations require additional information in the data contract. For each quasi-identifier, we distinguish between three statistical data types:

Numerical: numerical data type with natural ordering, e.g. age in years
Nominal: categorical data without ordering, e.g. a person's hair color
Ordinal: categorical data with a specific order, e.g. clothing size (S, M, L, ...)
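The data type matters because each type calls for a different way of collapsing a group of values into one representative value. The sketch below is an assumption for illustration only; the source does not state which aggregates the batch job uses for nominal and ordinal fields. It uses the mean for numerical data, the most frequent value for nominal data, and the (lower) median by rank for ordinal data. The helper name `aggregate` is hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate(values, data_type, ordinal_values=None):
    """Collapse a group of values into a single representative value.

    The choice of aggregate per type is an assumption, not the
    documented behaviour of the batch job.
    """
    if data_type == "NUMERICAL":
        return mean(values)
    if data_type == "NOMINAL":
        # No ordering exists, so take the most frequent category.
        return Counter(values).most_common(1)[0][0]
    if data_type == "ORDINAL":
        # Use the position in the declared order and take the lower median.
        ranks = sorted(ordinal_values.index(v) for v in values)
        return ordinal_values[ranks[(len(ranks) - 1) // 2]]
    raise ValueError(f"unknown statistical data type: {data_type!r}")


SIZES = ["XS", "S", "M", "L", "XL"]

print(aggregate([28, 31, 28, 28], "NUMERICAL"))          # 28.75
print(aggregate(["black", "black", "brown"], "NOMINAL"))  # black
print(aggregate(["XS", "L", "S"], "ORDINAL", SIZES))      # S
```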

We define the quasi-identifiers in the FieldMetadata.

{
  "data_contract": {
    ...,
    "field_metadata": [
      {
        "field_name": "size",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "ORDINAL",
        "ordinal_values": [
          "XS",
          "S",
          "M",
          "L",
          "XL"
        ]
      },
      {
        "field_name": "age",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "NUMERICAL"
      },
      {
        "field_name": "transactionAmount",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "NUMERICAL"
      },
      {
        "field_name": "hairColor",
        "personal_data_config": {
          "is_quasi_id": true
        },
        "statistical_data_type": "NOMINAL"
      }
    ]
  }
}

For each quasi-identifier in the snippet above:

Define the field_name, which corresponds to the column header.
Set is_quasi_id to mark the field as a quasi-identifier.
Define the statistical data type (NUMERICAL, NOMINAL or ORDINAL).
If the column holds ordinal data, define the ordinal_values in their respective order.
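Before activating a contract, it can help to sanity-check these annotations locally. The following is a minimal sketch that works on the field_metadata list as plain Python dicts; the function `check_quasi_identifiers` is hypothetical and is not part of the STRM CLI or API. It only encodes the rules listed above.

```python
VALID_TYPES = {"NUMERICAL", "NOMINAL", "ORDINAL"}


def check_quasi_identifiers(field_metadata):
    """Return a list of problems found in the quasi-identifier annotations."""
    problems = []
    for field in field_metadata:
        name = field.get("field_name", "<missing field_name>")
        # Fields that are not quasi-identifiers need no statistical type.
        if not field.get("personal_data_config", {}).get("is_quasi_id"):
            continue
        data_type = field.get("statistical_data_type")
        if data_type not in VALID_TYPES:
            problems.append(f"{name}: unknown statistical_data_type {data_type!r}")
        elif data_type == "ORDINAL" and not field.get("ordinal_values"):
            problems.append(f"{name}: ORDINAL field needs ordinal_values")
    return problems


# Example: 'size' is declared ORDINAL but its order is missing.
metadata = [
    {
        "field_name": "size",
        "personal_data_config": {"is_quasi_id": True},
        "statistical_data_type": "ORDINAL",
    },
    {
        "field_name": "age",
        "personal_data_config": {"is_quasi_id": True},
        "statistical_data_type": "NUMERICAL",
    },
]

print(check_quasi_identifiers(metadata))
```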

Hence, the micro-aggregations only use these four columns of the data. A sample is shown below.

age	hairColor	size	transactionAmount
30	blonde	XL	44
30	blonde	XL	19
26	blonde	XL	31
27	brown	S	220
27	brown	S	257
32	brown	S	233
28	black	XL	172
31	black	XL	209
28	black	XL	216
28	black	XL	167

Micro-Aggregations Batch Job

Now that the data contract is defined correctly (make sure the contract is active), define the configuration for the micro-aggregations batch job. We assume you have created a data connector in the same project. The configuration file should then look like this:

{
  "source_data": { 
    "data_connector_ref": { "name": "dataConnectorName"},
    "path_prefix": "input",
    "file_name": "file.csv",
    "data_type": { "csv": { "charset": "UTF-8" } }
  },
  "target_data": {
    "data_connector_ref": { "name": "dataConnectorName"},
    "path_prefix": "output",
    "file_name": "file.csv",
    "data_type": { "csv": { "charset": "UTF-8" } }
  },
  "data_contract_ref": {
    "handle": "handle",
    "name": "name",
    "version": "xx.xx.xx"
  },
  "aggregation_config": {
    "minimum_k_anonymity": 3
  }
}

The configuration consists of:

The source of the data: your data connector, the path prefix, the file name and the data type (currently only csv is available).
The target location of the data, specified in the same way as the source.
The reference to the data contract that applies to the data.
The minimum k-anonymity (the value of k) you want to ensure.

Finally, run the micro-aggregations batch job with the following command:

strm create batch-job --type micro-aggregation --file my_config.json

The results of our example are found below:

Before

age	hairColor	size	transactionAmount
30	blonde	XL	44
30	blonde	XL	19
26	blonde	XL	31
27	brown	S	220
27	brown	S	257
32	brown	S	233
28	black	XL	172
31	black	XL	209
28	black	XL	216
28	black	XL	167

After

age	hairColor	size	transactionAmount
28.666666666666668	blonde	XL	31.33333333333336
28.666666666666668	blonde	XL	31.33333333333336
28.666666666666668	blonde	XL	31.33333333333336
28.666666666666668	brown	S	236.66666666666669
28.666666666666668	brown	S	236.66666666666669
28.666666666666668	brown	S	236.66666666666669
28.75	black	XL	191.0
28.75	black	XL	191.0
28.75	black	XL	191.0
28.75	black	XL	191.0
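The numerical columns in the result are the per-group means of the input. The sketch below reproduces them for this sample, up to floating-point rounding. It is an illustration under a simplifying assumption: the groups are taken to be the rows that share hairColor and size, which happens to match the groups in this sample. The actual batch job forms its groups itself; how it does so is not shown here. The function name `average_within_groups` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The "Before" sample: (age, hairColor, size, transactionAmount).
before = [
    (30, "blonde", "XL", 44),
    (30, "blonde", "XL", 19),
    (26, "blonde", "XL", 31),
    (27, "brown", "S", 220),
    (27, "brown", "S", 257),
    (32, "brown", "S", 233),
    (28, "black", "XL", 172),
    (31, "black", "XL", 209),
    (28, "black", "XL", 216),
    (28, "black", "XL", 167),
]


def average_within_groups(rows):
    """Replace age and transactionAmount by their mean within each
    (hairColor, size) group. The grouping is an assumption for this sample."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[1], row[2])].append(row)

    averaged = []
    for row in rows:
        members = groups[(row[1], row[2])]
        mean_age = mean([m[0] for m in members])
        mean_amount = mean([m[3] for m in members])
        averaged.append((mean_age, row[1], row[2], mean_amount))
    return averaged


after = average_within_groups(before)
for row in after:
    print(row)
```

Every row in a group now carries identical quasi-identifier values, and the smallest group has three rows, so the result meets the configured minimum_k_anonymity of 3.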

To check whether your output data actually satisfies the minimum k-anonymity, you can use our privacy diagnostics Python package. The full example that we used can be found here.
