Batch Jobs

STRM Privacy offers support for batch processing. This quickstart helps you to get started with Batch Jobs. To read more about the background for batch data pipelines, go here.

tip

Prefer to look at the example notebook directly? Find it here.

Create a STRM Privacy batch mode pipeline

With batch mode, you can set up data routines that grab data from a bucket, transform it according to a data contract, and make the result available for downstream processing.

This is a powerful way to quickly set up data pipelines that feed applications which process or need sensitive data in batch routines, without the usual overhead. In the real world, this means you align on the privacy implications with your security and/or legal counterparts first. Because privacy comes by design and is encoded into the pipeline and the data itself, you can then just go ahead and use the data.

This saves you a lot of trips to legal desks, and so improves your workflow considerably.

Quickstart outline

The following steps will be covered in this quickstart:

Create a data connection to retrieve and store the data
Define the data contract your data adheres to
Define a batch job in the CLI
Generate some data for demo purposes
Explore the transformed data for downstream consumption

Creating a data connector

First create a data-connector of the desired kind.

Define the data contract

The next step is to instruct STRM Privacy what your data looks like. This is done in the data contract, which combines the data shape (your fields) with the privacy implications. In this quickstart, the privacy demo data contract is used:

Data Contract (schema shown separately for brevity)
Schema Definition (Avro representation)

tip

View this data contract with the CLI using: strm get data-contract strmprivacy/example/1.3.0 -ojson

{
  "dataContract": {
    "id": "44cc99df-04f1-4e42-9345-28151b1139d0",
    "ref": { "handle": "strmprivacy", "name": "example", "version": "1.3.0" },
    "state": "ACTIVE",
    "isPublic": true,
    "keyField": "consistentValue",
    "piiFields": {
      "consistentValue": 2,
      "someSensitiveValue": 3,
      "uniqueIdentifier": 1
    },
    "validations": [ { "field": "consistentValue", "type": "regex", "value": "^.+$" } ],
    "metadata": { ... omitted ... },
    "schema": {
      "ref": { "handle": "strmprivacy", "name": "example", "version": "1.3.0", "schemaType": "AVRO" },
      "state": "ACTIVE",
      "isPublic": true,
      "definition": ... shown in separate tab for brevity ...,
      "fingerprint": "6093265390869578999",
      "metadata": { ... omitted ... },
      "simpleSchema": {},
      "id": "44cc99df-04f1-4e42-9345-28151b1139d0"
    },
    "projectId": "d995bd01-22ea-458b-a184-4fac5ba48535"
  },
  "checksum": "5321256876911080621"
}
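The piiFields map in the contract above assigns an integer to each sensitive field, which the rest of this quickstart refers to as a consent level or purpose. As a minimal sketch, the snippet below groups those fields by level so you can see at a glance what each level covers. The helper name group_pii_fields is hypothetical and not part of the STRM Privacy CLI or any SDK.

```python
from collections import defaultdict

# piiFields copied from the example data contract above.
PII_FIELDS = {
    "consistentValue": 2,
    "someSensitiveValue": 3,
    "uniqueIdentifier": 1,
}


def group_pii_fields(pii_fields: dict) -> dict:
    """Group field names by the level they are assigned in piiFields."""
    grouped = defaultdict(list)
    for field, level in pii_fields.items():
        grouped[level].append(field)
    # Sort levels and field names for stable, readable output.
    return {level: sorted(fields) for level, fields in sorted(grouped.items())}


if __name__ == "__main__":
    for level, fields in group_pii_fields(PII_FIELDS).items():
        print(level, fields)
```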

tip

View this schema definition with the CLI using: strm get data-contract strmprivacy/example/1.3.0 -ojson | jq '.dataContract.schema.definition | fromjson'

{
  "type": "record",
  "name": "DemoEvent",
  "namespace": "io.strmprivacy.schemas.demo.v1",
  "fields": [
    {
      "name": "strmMeta",
      "type": {
        "type": "record",
        "name": "StrmMeta",
        "fields": [
          { "name": "eventContractRef", "type": "string" },
          { "name": "nonce", "type": [ "null", "int" ], "default": null },
          { "name": "timestamp", "type": [ "null", "long" ], "default": null },
          { "name": "keyLink", "type": [ "null", "string" ], "default": null },
          { "name": "billingId", "type": [ "null", "string" ], "default": null },
          { "name": "consentLevels", "type": { "type": "array", "items": "int" } }
        ]
      }
    },
    { 
      "name": "uniqueIdentifier", "type": [ "null", "string" ], "default": null,
      "doc": "any value. For illustration purposes: use a value that is consistent over time like a customer or device ID."
    },
    {
      "name": "consistentValue", "type": "string",
      "doc": "any value. For illustration purposes: use a value that is consistent over a limited period like a session."
    },
    {
      "name": "someSensitiveValue", "type": [ "null", "string" ], "default": null,
      "doc": "any value. For illustration purposes: use a value that could identify a user over time based on behavior, like browsing behavior (e.g. urls)."
    },
    {
      "name": "notSensitiveValue", "type": [ "null", "string" ], "default": null,
      "doc": "any value. For illustration purposes: use a value that is not sensitive at all, like the rank of an item in a set."
    }
  ]
}
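In the Avro definition above, fields typed as [ "null", ... ] with a null default are optional, while strmMeta, strmMeta.eventContractRef, strmMeta.consentLevels and consistentValue have no default and are therefore required. The sketch below illustrates that distinction with plain Python dictionaries. It is only an illustration of the schema's nullability: the function name missing_required_fields and the example values are hypothetical, and it does not replace real Avro validation.

```python
# Fields without a default in the Avro definition above.
REQUIRED_TOP_LEVEL = ("strmMeta", "consistentValue")
REQUIRED_META = ("eventContractRef", "consentLevels")


def missing_required_fields(record: dict) -> list:
    """Return the required fields that are absent (or None) in the record."""
    missing = [name for name in REQUIRED_TOP_LEVEL if record.get(name) is None]
    meta = record.get("strmMeta") or {}
    missing += [f"strmMeta.{name}" for name in REQUIRED_META if meta.get(name) is None]
    return missing


if __name__ == "__main__":
    # Hypothetical example record; optional fields may simply be left out.
    record = {
        "strmMeta": {
            "eventContractRef": "strmprivacy/example/1.3.0",
            "consentLevels": [0, 1],
        },
        "consistentValue": "session-123",
        "notSensitiveValue": "rank-3",
    }
    print(missing_required_fields(record))
```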

You can use an existing data contract or create your own. Refer to this blog on creating data contracts. Furthermore, you can also use Simple Schemas, a much easier way to define your data shape than the underlying Avro serialization schema.

Define a batch job via the CLI

With the data connection and contract defined, the batch job itself can be defined. Batch jobs can be defined by providing a config JSON to the CLI. The reference can be found here.

$ strm create batch-job --help
Create a Batch Job

Usage:
  strm create batch-job [flags]

Flags:
  -F, --file string   The path to the JSON file containing the batch job configuration
  -h, --help          help for batch-job

The JSON simply details which data-connector to use, what contract to apply and how to write the data back. We are working to include GCP Storage, a visual interface and even a file upload in follow-on releases.

tip

Configure your editor (VS Code, IntelliJ) to validate the Batch Job JSON definition against the JSON Schema.

Indicate the data subject consent field

For each record (row) processed by a batch job, the data subject (data owner) may or may not have consented to their data being used for certain purposes. In other words, the legal ground under which the data was collected may differ per record. In streaming pipelines, the consent is provided by you and embedded in the metadata of each event. Similarly, for batch jobs, you need to indicate which field in your data contains the legal ground per row. This field does not necessarily need to be defined in your data contract.

In the definition file you need to set these three values:

what the default legal ground (consent level) is
the field that contains the legal ground in your data
how each of the field's values maps to consent levels

Here, consent levels are integer values referring to the respective purposes as defined in your purpose map.

About the default consent: it's safest to keep this at the integer value 0. This simply means the data was collected under the most basic consent or legal ground you use.

{
  // partial excerpt
  "consent": {
    "default_consent_levels": [ 0 ],
    "consent_level_extractor": {
      "field": "the field that indicates collection ground",
      "field_patterns": {
        "example, like legitimate interest": {
          "consent_levels": [ 1 ]
        },
        "example, like marketing": {
          "consent_levels": [ 2 ]
        }
      }
    }
  }
  // partial excerpt
}
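To make the mapping in the excerpt above more concrete, here is a minimal sketch of how a consent field could be resolved to consent levels per row. It rests on assumptions: the field values are matched exactly (the actual matching rules of field_patterns may differ), the field is named PrivacyPlane as in the example notebook, and the values "legitimate interest" and "marketing" are placeholders. The function consent_levels_for is hypothetical and only mimics the idea locally.

```python
# Mirrors the structure of the "consent" block above, with placeholder values.
CONSENT_CONFIG = {
    "default_consent_levels": [0],
    "consent_level_extractor": {
        "field": "PrivacyPlane",
        "field_patterns": {
            "legitimate interest": {"consent_levels": [1]},
            "marketing": {"consent_levels": [2]},
        },
    },
}


def consent_levels_for(row: dict, consent: dict) -> list:
    """Resolve the consent levels of one row, falling back to the default."""
    extractor = consent["consent_level_extractor"]
    value = row.get(extractor["field"])
    # Assumption: exact string match between the field value and a pattern key.
    match = extractor["field_patterns"].get(value)
    if match is None:
        return list(consent["default_consent_levels"])
    return list(match["consent_levels"])


if __name__ == "__main__":
    rows = [
        {"PrivacyPlane": "marketing"},
        {"PrivacyPlane": "legitimate interest"},
        {"PrivacyPlane": "something else"},
    ]
    for row in rows:
        print(row["PrivacyPlane"], consent_levels_for(row, CONSENT_CONFIG))
```

Rows whose value is not listed fall back to the default, which is why keeping the default at the most basic level is the cautious choice.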

An example of the full definition file is included in the demo notebook. Just swap the example values for your own data-connector names and preferred buckets.

Indicate the timestamp config

An important part of data that is processed in batch is the time that belongs to an individual record. This can be the time the data was recorded (i.e. the event time equivalent for streaming data), or the time the data was processed. Regardless of its meaning, the CSV data must contain a field that represents a date and time.

As can be seen in the Batch Job reference, the EncryptionConfig requires a TimestampConfig. The TimestampConfig defines how data is encrypted with respect to time: the privacy algorithm uses the concept of time to determine whether the same encryption key or a new one should be used.

In the TimestampConfig, the field entry names the column that holds the date and time, and the format entry defines the pattern used to parse it. To test the pattern, we advise using the following tool. Keep a reference to the patterns open as well.
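Before submitting a batch job, it can help to check locally that every value in your timestamp column parses with a single pattern. The sketch below does this with Python's datetime.strptime. Note the assumption: Python's strptime directives (such as %Y) are most likely not the same pattern syntax that the format field of the TimestampConfig expects, so this only illustrates the idea of a pre-flight check. The function name and sample values are hypothetical.

```python
from datetime import datetime


def unparseable_timestamps(values: list, fmt: str) -> list:
    """Return the values that cannot be parsed with the given strptime format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


if __name__ == "__main__":
    # Hypothetical sample of a timestamp column; the second value uses another layout.
    column = ["2022-05-01 12:30:00", "01/05/2022 12:30"]
    print(unparseable_timestamps(column, "%Y-%m-%d %H:%M:%S"))
```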

Define the derived data

The next step is to define the derived data - the privacy-transformed output. Just think of this as a folder on a disk that contains data that is ready for a specific purpose (like, in the example below, training a recommender).

First, let's dive a bit deeper into how transformations are applied. In this quickstart, we'll focus on a specific derived stream - in real-world applications you would probably have many different purposes and so a bunch of different derived streams.

Intermezzo: Privacy-transforming your data

Based on the data contract, data is processed and transformed in your batches. The level of privacy that can be achieved depends on the format of your source data.

A temporal (i.e. time-based) field in your data is used to achieve a fast but powerful way to apply the necessary transformations (through the keyField in the data contract) via encryption. It is therefore important to understand that data is pseudonymized at best, unless you have multiple rows per user that are closely spaced in time (like separate clicks or URL hits with context data).

We plan to extend the privacy transforms as we expand batch mode further, but for now this is an important limitation. In streaming mode, you usually have separate but closely spaced data points (e.g. events over multiple days).

danger

So are you planning to use batch mode for e.g. user profile info, where every row is just one user? That won't get you anonymized data currently.

A real-world use case for derived data: masking for recommenders

Imagine you have a batch job with clickstream data you plan to use to train or evaluate a recommender system.

Your data includes a PII field that you do not want or are not allowed to reveal, while you do need it for your data analysis. Recommendations are highly personal and therefore require linking previous behaviour (orders, movies etc.) to the same user.

The only "personality dimension" a basic recommender really needs is to know which data points belong to the same user. It does not necessarily need to know who the underlying customer is. This is where masking comes in.

By masking a field, the actual value (e.g. the customer ID) is replaced with a hash. This allows you to link multiple data points to a single user without revealing personal information. This can be done with derived streams.
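The linking property of masking can be shown with a few lines of Python. This is a sketch under stated assumptions: it uses a salted SHA-256 hash from the standard library as a stand-in, which is not necessarily the hashing scheme STRM Privacy applies, and the salt, the mask helper and the sample rows are all hypothetical.

```python
import hashlib


def mask(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a salted SHA-256 hex digest (stand-in for masking)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical clickstream rows: two clicks by the same customer, one by another.
    rows = [
        {"customer_id": "customer-1", "item": "movie-a"},
        {"customer_id": "customer-1", "item": "movie-b"},
        {"customer_id": "customer-2", "item": "movie-a"},
    ]
    masked = [{**row, "customer_id": mask(row["customer_id"])} for row in rows]

    # The same customer still maps to the same masked value, so rows can be linked.
    print(masked[0]["customer_id"] == masked[1]["customer_id"])
    # Different customers map to different masked values.
    print(masked[0]["customer_id"] == masked[2]["customer_id"])
```

The recommender can group on the masked column exactly as it would on the original ID, while the original value never appears in the derived data.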

In the snippet below, you will find the derived_data configuration of the batch job. This configuration shows the target data-connector and the file to write to, and the allowed consent levels (the purposes that should be decrypted).

The consent_level_type setting is deprecated. It should typically be set to GRANULAR, meaning that each desired purpose should be listed explicitly under the consent levels, and each record should explicitly specify consent for each corresponding purpose as well. The snippet below uses CUMULATIVE instead, which means that consent for purpose 2 includes consent for purpose 1 as well.

Finally, the snippet also shows the masked_fields. Within the data contract block "databert-handle/batch_job_public/1.0.0": { ... } you can see the column names, listed under field_patterns, of the fields to mask.

{
  // partial excerpt
  "derived_data": [
    {
      "target": {
        "data_connector_ref": {
          "billing_id": "your_billing_id",
          "name": "databert-demo"
        },
        "data_type": {
          "csv": {
            "charset": "UTF-8"
          }
        },
        "file_name": "databert-demo-derived.csv"
      },
      "consent_levels": [
        2
      ],
      "consent_level_type": "CUMULATIVE",
      "masked_fields": {
        "field_patterns": {
          "databert-handle/batch_job_public/1.0.0": {
            "field_patterns": [
              "Email",
              "UserName"
            ]
          }
        }
      }
    }
  ]
  // partial excerpt
}
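The difference between the two consent level types can be sketched as a small predicate that decides whether a record ends up in the derived data. This is an interpretation of the description in this quickstart, not the platform's implementation: under CUMULATIVE a record qualifies when its highest consent level is at least the highest level requested, and under GRANULAR every requested level must be present on the record. The function name is_included is hypothetical.

```python
def is_included(record_levels: list, derived_levels: list, level_type: str) -> bool:
    """Decide whether a record belongs in the derived data.

    derived_levels must be non-empty; record_levels may be empty.
    """
    if level_type == "CUMULATIVE":
        # Assumption: consent for a purpose implies consent for all lower purposes.
        return max(record_levels, default=-1) >= max(derived_levels)
    if level_type == "GRANULAR":
        # Assumption: every requested purpose must be consented to explicitly.
        return set(derived_levels) <= set(record_levels)
    raise ValueError(f"unknown consent level type: {level_type}")


if __name__ == "__main__":
    derived = [2]  # as in the derived_data excerpt above
    for levels in ([0], [1], [2], [3], [1, 2]):
        print(
            levels,
            is_included(levels, derived, "CUMULATIVE"),
            is_included(levels, derived, "GRANULAR"),
        )
```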

Generate the batch data

In a batch job, data is read, transformed and returned as soon as new files are found inside the bucket.

To simulate a data routine you already have or plan to set up, the example notebook includes a DataGenerator class that simulates some random user data (when we say random, we really mean "nonsensical"). Apart from the session, user and meta fields (like a timestamp), you will recognize the PrivacyPlane field as the consent field described in the consent configuration above.

Clone/fork/download the notebook and add/replace your own credentials in the AwsProperties() class and s3.json to quickly prepare a demo pipeline of your own.

Explore privacy-transformed data

The data shape and privacy implications (the data contract) are now defined, the batch job is defined, a derived stream with masking applied is defined, and some example data is generated.

Next, let's explore what happens to the data based on (1) the data contract and (2) the derived stream we defined.

Input data

The input data generated by the DataGenerator class looks as follows:

input data

Encrypted data

The next step is to look at the data that is just encrypted (per field).

Basically, all connections that might exist between rows are destroyed here: the PII fields Email, PrivateFieldA and PrivateFieldB, set in the data contract, are encrypted.

encrypted data

Derived Data

It becomes more interesting when looking at the derived data (as defined by the derived stream above). Remember, the goal was to apply masking instead of destroying any connection between rows that might exist.

Per the batch job configuration, the derived data is allowed to contain entries with a consent level of 2 or higher. From the input data it is known that there are 3 entries with a consent level of 2, which correspond to the three outputs below. In the table below, you can also see that the values for UserName and Email are hashed. This corresponds to the field_patterns that have been set in the masked_fields section of the derived_data configuration. The username has been masked, but the hashed username is consistent over all rows. The Email field is different for every entry and therefore the hashed field is too.

Records that did not include consent for purpose 2 have been excluded from the derived data.

derived data

Example notebook

To quickly see for yourself how Batch Mode works, copy or clone the example notebook from GitHub with your own S3 and STRM Privacy credentials and explore the data. It also includes the batch job definition file.

Wrapping up

So, to illustrate how to create batch jobs with privacy transformations, the following steps have been covered:

Created a data connection to retrieve and store the data
Defined the data contract your data adheres to
Defined a batch job in the CLI
Generated some data for demo purposes
Explored the transformed data for downstream consumption