Redacting PHI and PII in Data Streams with Phirestream on Google Cloud

prattk · 10-11-2021

As streaming data expands its reach into virtually every industry, we will inevitably encounter data that contains sensitive information, such as Protected Health Information (PHI) and Personally Identifiable Information (PII). This information must be kept secure and may be subject to regulations such as HIPAA and other industry rules, which often means redacting it. Redaction also allows the data to be used for secondary purposes, such as machine learning and analytics.

Identifying sensitive information such as PHI and PII in data is a complex task. PHI and PII can take many different forms, such as people's names, ages, addresses, unique identifiers, and any other information that could be used to identify an individual. This information often does not follow strict patterns, so simple pattern matching is not enough and a different approach is required.

Phirestream is an application that uses modern natural language processing techniques to identify sensitive information in streaming data. Once the information has been identified, Phirestream redacts it according to your configuration settings. For example, you could redact all ages within a certain range, only people's names and addresses, or zip codes whose population exceeds some threshold. With Phirestream we can redact PHI and PII in our streaming data and keep that sensitive information out of our downstream data processing tasks. Phirestream supports over 30 types of sensitive information, and users can add custom types.

Phirestream is available on the Google Cloud Marketplace and can be launched as a virtual machine into your cloud. Phirestream works by providing a subset implementation of the Apache Kafka REST Proxy API that redacts sensitive information as the data streams through. The redacted data is produced to an Apache Kafka topic, where it can be consumed by other applications. This means that if you are already using the Apache Kafka REST Proxy to produce messages to Apache Kafka, the only configuration change you likely need to make is to point the destination endpoint at Phirestream’s API endpoint. This process is illustrated in the diagram below:

Here’s how you can launch Phirestream and configure it for your streaming data. First, visit the Phirestream page on the Google Cloud Marketplace and click the Launch button.

If you need to enable any APIs, you will be prompted to do so; click Enable to continue. On the following page, provide your deployment name, select your zone, and set any other configuration options. The n2-standard-2 machine type is recommended for getting started. You will need to access the Phirestream virtual machine on ports 8080 and 22, so make sure those ports are open to you. Click Deploy to launch Phirestream.

After the Phirestream virtual machine launches, open an SSH connection to the virtual machine. Next, open Phirestream’s configuration file:

sudo nano /opt/phirestream/application.properties

Set the location of your Apache Kafka brokers:

kafka.bootstrap.servers=[kafka-broker-addresses]

You can also set any other required Kafka settings by prefixing the setting name with “kafka.” (including the trailing period). Now restart Phirestream for your changes to take effect:

sudo systemctl restart phirestream
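The prefix convention described above can be sketched in a few lines of Python. The helper name and the example settings below are illustrative assumptions, not part of Phirestream; the snippet only shows how standard Kafka client setting names would look once prefixed for application.properties. Check which settings your own brokers actually require.

```python
def to_phirestream_properties(kafka_settings):
    """Prefix each Kafka client setting with "kafka." (hypothetical helper)."""
    lines = []
    for name, value in kafka_settings.items():
        lines.append(f"kafka.{name}={value}")
    return "\n".join(lines)


# Example values are placeholders; substitute your own broker settings.
example = {
    "bootstrap.servers": "broker-1:9092,broker-2:9092",
    "security.protocol": "SSL",
}
print(to_phirestream_properties(example))
```

The printed lines can be pasted into the configuration file before restarting the service.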

Phirestream takes a few seconds to restart, after which it exposes an API on port 8080. Let’s send some data to Phirestream. The command below sends a single message to Phirestream:

curl -k -X POST \
  https://localhost:8080/topics/redacted \
  -H 'Content-Type: application/vnd.kafka.json.v2+json' \
  -d '{
    "records": [
      {
        "key": "key-1",
        "value": "George Washington was president."
      }
    ]
  }'

The message “George Washington was president.” was sent to Phirestream, which identified the person’s name George Washington as PII. After redaction, the message was produced to the Apache Kafka topic named redacted. You can change the destination Apache Kafka topic by modifying the topic name in the URL.
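If your producer is written in Python rather than shell, the same request can be built with the standard library. This is a minimal sketch, assuming the same endpoint, topic, and content type as the curl command; the function names are hypothetical, and certificate verification is disabled only to mirror the -k flag, which you would not want outside of testing.

```python
import json
import ssl
import urllib.request

# Assumed endpoint, matching the curl example above.
PHIRESTREAM_URL = "https://localhost:8080"


def build_request(topic, records, base_url=PHIRESTREAM_URL):
    """Build a POST request carrying records in Kafka REST Proxy JSON format."""
    body = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/topics/{topic}",
        data=body,
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


def send(request):
    """Send the request, skipping certificate checks like curl -k does."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.status, response.read().decode("utf-8")


request = build_request(
    "redacted",
    [{"key": "key-1", "value": "George Washington was president."}],
)
# send(request)  # uncomment once Phirestream is reachable
```

Building the payload with json.dumps also avoids hand-written JSON mistakes such as trailing commas.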

Now, if you consume from the redacted topic you will see the redacted message:

kafka-console-consumer.sh \
  --topic redacted \
  --bootstrap-server localhost:9092 \
  --from-beginning

The message “{{{REDACTED-ner}}} was president.” will be shown. The actual message shown will vary based on your redaction settings. To set the types of information Phirestream will redact, you can edit or create a filter profile. A filter profile gives you full control over how Phirestream identifies and redacts sensitive information. You can choose the types of sensitive information to redact and how to redact each type. For example, you can configure a filter profile to redact people’s names, anonymize phone numbers, and encrypt street addresses.
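When trying out a filter profile, it can help to check consumed messages for redaction placeholders. The sketch below assumes placeholders have the shape of the {{{REDACTED-ner}}} token shown above: three braces around REDACTED- and a type label. Because the actual output depends on your redaction settings, treat both the pattern and the helper name as assumptions.

```python
import re

# Assumed placeholder shape, inferred from the single example "{{{REDACTED-ner}}}".
PLACEHOLDER = re.compile(r"\{\{\{REDACTED-([^}]+)\}\}\}")


def redacted_types(message):
    """Return the type labels of all placeholders found in a message."""
    return PLACEHOLDER.findall(message)


print(redacted_types("{{{REDACTED-ner}}} was president."))  # ['ner']
```

An empty list for a message that should contain a name suggests the filter profile is not redacting that type.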

If you are already using the Apache Kafka REST Proxy, you simply need to modify your application to use the Phirestream endpoint (e.g., http://host:8080) instead of the Apache Kafka REST Proxy endpoint. Your data will then pass through Phirestream and be redacted, and the redacted data will be produced to the Apache Kafka topic named in the request URL.
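The endpoint switch described above amounts to replacing the scheme, host, and port of the URL your application already posts to, while leaving the /topics/ path untouched. Here is a minimal sketch; both hostnames and the helper name are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def point_at_phirestream(rest_proxy_url, phirestream_base):
    """Swap scheme and host:port, keeping the path (e.g. /topics/<name>)."""
    original = urlsplit(rest_proxy_url)
    target = urlsplit(phirestream_base)
    return urlunsplit(original._replace(scheme=target.scheme, netloc=target.netloc))


# Both hostnames below are hypothetical.
print(point_at_phirestream(
    "http://rest-proxy.example:8082/topics/redacted",
    "http://phirestream.example:8080",
))
```

In practice this is usually a one-line change to a configuration value rather than code, but the sketch shows that the request path, headers, and body stay the same.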

In summary, to get started with Phirestream and redact PHI and PII in your streaming data, go to Phirestream on the Google Cloud Marketplace, launch a Phirestream virtual machine, and follow the instructions in this blog post. Open-source Phirestream client SDKs are available on GitHub.

For questions or assistance please reach out to us at support@mtnfog.com or visit https://www.mtnfog.com.
