Data Governance and GDPR with Apache Kafka

Data Governance conformance with Data in Motion
Antonios Chalkiopoulos

Last December we announced our commitment to provide the capabilities that data streaming systems need for data-driven businesses to achieve GDPR compliance before the regulation's effective date (May 25, 2018). This post explains how Lenses delivers on that commitment, providing Data Governance capabilities and GDPR compliance by design.

The immutable nature of modern high-performance distributed systems provides competitive advantages to industries that want to load streams of events quickly and apply low-latency queries and scalable processing to data in motion. Regulations such as the GDPR, together with broader Data Governance challenges, make it non-trivial to build compliant applications on such systems as they become critical paths of the architecture.

Lenses supports and accelerates GDPR compliance

Many companies are looking to make GDPR compliance easier and faster for their streaming data architectures. Our team brings deep experience from financial services, high-frequency trading and top-tier investment banking, where regulation is treated as a first-class citizen, and we built Lenses to provide Data Governance by design.

Lenses is a streaming management platform for data discovery and for the management and operation of the real-time pipelines that move data across systems and processes. At the same time, it provides a secure default gateway for safely sharing data in motion among users and applications. Lenses embraces Apache Kafka and Kubernetes, and provides a number of open-source components to achieve this.

Data Governance Primer: Controlling Data in Flight

Lenses, with its Lenses SQL streaming engine already adopted by many industry-leading organisations, is well placed to support data protection compliance and help companies meet evolving data governance demands.

If we take a step back from GDPR, modern data operations that include streaming components like Kafka should ensure data availability, integrity and security, while protecting and governing the usage of Personally Identifiable Information (PII). What customer information do I hold? How is this data received? Who has access to the data, or has looked at it? How can this information be reported? And the list goes on!

Fine grained security

By treating security as a first-class citizen for all user access rights and processes, Lenses controls exactly who is allowed to use the data and enforces those access rights across all data in motion. Authentication and authorisation are supported via basic role-based access, LDAP and Active Directory, or Kerberos / TLS certificates. Effective access control is enhanced via blacklist and/or whitelist permissions assigned to user groups or individual users for particular datasets. In addition, Access Control Lists (ACLs) and Quotas make large multi-tenant environments safe and secure. Using Lenses you can govern your data in motion while finely controlling who has access (at read or write level) and how the data is used.
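Lenses manages these permissions through its own interface, but for context it builds on Kafka's native ACL and quota primitives, which look roughly like this at the command line (the principal, topic and client names below are illustrative):

# Allow the "analytics" principal to read a topic holding customer events:
kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
  --add --allow-principal User:analytics \
  --operation Read --topic customer-events

# Throttle that client's consumption rate with a quota (bytes/second):
kafka-configs.sh --zookeeper localhost:2181 --alter \
  --add-config 'consumer_byte_rate=1048576' \
  --entity-type clients --entity-name analytics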

Monitor and Report data access

Lenses continuously monitors every aspect of the system and of user activity, and captures a detailed record of every action and data-access pattern in an immutable audit log. Every action is tracked: creating, amending or deleting a topic, processor or connector, as well as admin changes such as ACLs, Quotas or configuration updates. Lenses also monitors data-access activity, whether it originates from the Lenses web interface, a REST / WebSocket call, the CLI or a BI tool. The audit log is protected with an ACL (Access Control List) rule, so only authorised users may access it, and it is also queryable, so we can easily answer WHO did WHAT and WHEN.
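Because the audit log is queryable, answering the WHO, WHAT and WHEN question can itself be a Lenses SQL query. As a sketch only (the dataset and field names below are illustrative, not the actual audit schema):

SELECT user, action, resource, timestamp
FROM audits
WHERE resource = "topicA"
  AND timestamp > "2018-05-25"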

We strongly encourage all actions to go through Lenses, so that role-based access, security checks and auditing are enforced, and we recommend firewalling other services in production environments. What happens if an operation is performed directly against the underlying systems? An audit event is still generated: Lenses periodically watches all resources and identifies such actions, along with their timestamps.

In addition to monitoring your data in motion, Lenses can trigger Alerts to notify you about particular activities, such as deleting or adding a dataset, or removing existing records. Alerts on specific user and system activity can further ensure data availability and integrity.

Right to Retrieve Data

One of the fundamental requirements of GDPR is the Right to Retrieve Personal Data.

With Lenses SQL this requirement can be covered via a set of simple but thorough queries against the topics that contain PII data:

SELECT * FROM topicA WHERE customer.id = "XXX"

Lenses will retrieve and deserialize the data from a binary format (e.g. Avro) into a human-readable format and provide full Control Execution.

Control Execution addresses the fact that streaming SQL operates on unbounded streams of events: left alone, a query would never terminate. In order to bring query-termination semantics to Apache Kafka we introduced 4 controls:

  • LIMIT 10000 - Force the query to terminate when 10,000 records are matched
  • max.bytes = 20000000 - Force the query to terminate once 20 MBytes have been retrieved
  • max.time = 60000 - Force the query to terminate after 60 seconds
  • max.zero.polls = 8 - Force the query to terminate after 8 consecutive polls are empty, indicating we have exhausted a topic

Thus, when requested to retrieve all data for a particular user under the new regulation, use a query such as:

SET `max.bytes` = 1000000000;
SET `max.time` = 3600000;
SELECT * FROM topicA WHERE customer.id = "XXX"
-- or WHERE _key.customer.id = "XXX"

The above deliberately includes no LIMIT clause, as we want to retrieve all matching records, and it sets the maximum execution time to 1 hour (3,600,000 ms) so that the search can be exhaustive even on topics with billions of messages, or when Quotas are in place. In practice the query will normally terminate much sooner, after 8 consecutive empty polls (adjustable via max.zero.polls), since there is no need to keep querying the topic once all available partitions and offsets have been exhausted.

If you want to automate the retrieval of PII data, you can query multiple topics via the relevant REST and WebSocket APIs, the Lenses CLI, or the JDBC, Python and Go libraries.
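For instance, such a retrieval request could be scripted over HTTP. The sketch below is illustrative only: the endpoint path, host and authentication header are assumptions, not the documented Lenses REST contract.

# Submit a Lenses SQL query over HTTP (illustrative endpoint and auth):
curl -X POST "https://lenses.example.com/api/sql" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: text/plain" \
  --data 'SET `max.bytes` = 1000000000; SELECT * FROM topicA WHERE customer.id = "XXX"'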

Records of data activities

GDPR introduces an interesting challenge here: we need to preserve detailed records of data-processing activities. In that respect, Lenses keeps records of:

1 - How data flows through your data pipelines

2 - The internal representation of Lenses streaming SQL processors

These two views contain important information such as:

  • Topology view with status and metrics with custom SLA support for the data pipelines
  • Processor topology to view details regarding data processing activity

Together, these form the records of processing activities that the GDPR requires. All Kafka Connectors and Lenses SQL / Kafka Streams applications are continuously monitored, and the topologies are fully interactive, so you can inspect every level of your data systems.

Deleting Personal Data

With Lenses SQL (LSQL) you get the ability to anonymize any data field while moving records through your pipelines:

INSERT INTO topicA SELECT ANONYMIZE(name), ANONYMIZE(surname), message
FROM topicA_staging

Options for physically deleting data from a largely immutable system like Apache Kafka are limited. Pushing a null value (a tombstone) for a key onto a compacted topic will eventually evict that key's records, and from Kafka 1.0 onwards we can delete all records on a partition before a particular offset.
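For the latter, Kafka ships a command-line tool that truncates partitions up to a given offset. A minimal sketch (the topic, partition and offset below are illustrative):

# delete.json - records before offset 12345 on partition 0 will be removed
{ "version": 1,
  "partitions": [ { "topic": "topicA", "partition": 0, "offset": 12345 } ] }

kafka-delete-records.sh --bootstrap-server localhost:9092 \
  --offset-json-file delete.json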

Removing specific records, however, is not supported by any version of Kafka, as it goes against the nature of such a system.

What you can do with Lenses SQL Processors is create a new topic holding a copy of the source topic's data, while excluding particular records:

INSERT INTO topicA_clean
  SELECT * FROM topicA
  WHERE customer.id NOT IN (1243, 4382)

You can then delete the old topic and start using the clean one. This is a workaround, and it takes some practice to apply safely today. Hopefully the capability to truly delete records from a Kafka topic will be introduced in a future release, fully enabling the seamless deletion of data from topics.
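The final step, removing the old topic, can be performed through Lenses or with Kafka's stock tooling, assuming delete.topic.enable=true is set on the brokers:

kafka-topics.sh --zookeeper localhost:2181 --delete --topic topicA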

Conclusions

We tackled the challenge of GDPR systematically and at design time. By installing and enabling Lenses on a Kafka cluster, you automatically gain data governance capabilities, along with industrial-grade tooling to manage and operate your data in motion. This means you can use streaming real-time data while ensuring enterprise governance and data compliance. It also means individuals can easily find the data they need and integrate their applications.

Our collective journey through GDPR is far from over; we will continue to innovate, to talk to and, most importantly, to hear from the community, and every future release will provide additional Data Governance capabilities.

You can try Lenses by contacting us and we will shortly send you the download link, or you can request a demo for your team.

Additional Resources

GDPR - Chapter 3 - Rights of the data subject

Lenses 2.0 - Documentation page

