There’s a common difficulty that companies run into when transitioning to Big Data, especially with Kafka. They’re coming from systems where everything is exposed as an RPC-style call (remote procedure call, REST call, etc.), and they’re transitioning to a data pipeline where everything is exposed as raw data.
For most of these companies, data pipelines are a brand new concept. With RPCs, the coupling was much tighter: the team changing a call knew exactly who was calling it and could coordinate the change. With a data pipeline, the coupling is very loose, and changes to the data will ripple through the organization in ways that are much harder to predict.
Here are questions that teams and organizations need to answer when using a data pipeline:
Organizationally
- How do we socialize that the data pipeline exists?
- How do we get other members of the organization to start adopting the data pipeline?
- How do we monetize the data or results from the analysis of the data?
- Which team is directly responsible for the data pipeline? (Hint: this is the reason a data engineering team needs to exist)
Security
- How do we lock down who has access to the data pipeline?
- How do we encrypt the data as it’s being sent around the data pipeline? (See the producer sketch after this list.)
- How do we mask PII from consumers of the data pipeline that don’t need that information?
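To make the last two security questions concrete, here is a rough sketch of what they can translate to in code: a Kafka producer in Java that connects to the brokers over TLS and replaces an email address with a one-way hash before the record ever enters the pipeline. The broker address, truststore path, topic name, and field choices are all hypothetical placeholders, not a prescription.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SecurePipelineProducer {

    // Replace a PII value with a one-way hash so downstream consumers
    // can still join on it without ever seeing the raw email address.
    static String maskPii(String value) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(value.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker address and truststore location -- substitute your own.
        props.put("bootstrap.servers", "broker1.example.com:9093");
        // Encrypt data in transit between the client and the brokers with TLS.
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Mask the PII before it leaves the producing service.
            String maskedEmail = maskPii("jane.doe@example.com");
            producer.send(new ProducerRecord<>("signups", maskedEmail,
                    "{\"plan\":\"pro\",\"source\":\"web\"}"));
        }
    }
}
```

Hashing instead of dropping the field keeps the value joinable across topics without exposing the raw address; whether that is sufficient depends on your compliance requirements.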
Technically
- How do we make sure that teams have the skills to use the data pipelines?
- How do we design the data pipeline to evolve as use cases increase? (See the schema compatibility sketch after this list.)
- What technologies make sense for our data pipeline given our use case?
- How do we notify other teams when the data changes?
- How do we decide when and how to change our data in the data pipeline?
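The last three questions are where the loose coupling bites hardest. One common answer is to treat the schema as the contract between teams: add new fields with defaults, avoid repurposing existing ones, and check compatibility before publishing a change. As a minimal sketch, the snippet below uses Avro’s built-in compatibility check on two versions of a made-up PageView event; a schema registry can enforce the same rule automatically, and the event name and fields here are purely illustrative.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
    public static void main(String[] args) {
        // Version 1 of a hypothetical "PageView" event already in the pipeline.
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
            + "{\"name\":\"userId\",\"type\":\"string\"},"
            + "{\"name\":\"url\",\"type\":\"string\"}]}");

        // Version 2 adds a field with a default value, so records written
        // with the old schema can still be read by upgraded consumers.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
            + "{\"name\":\"userId\",\"type\":\"string\"},"
            + "{\"name\":\"url\",\"type\":\"string\"},"
            + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"\"}]}");

        // Can a consumer reading with v2 still read data written with v1?
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType()); // COMPATIBLE
    }
}
```

If a check like this fails, that is the cue to version the topic or coordinate the change with downstream consumers before rolling it out, rather than letting them discover it when their jobs break.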