Reducing System Complexity with Event Sourcing
When I start working with a team, one of the first questions I ask is “how much time do you spend creating new features versus making sure those new features don’t break something else?” …
Saving Money with Apache Pulsar Tiered Storage
As companies start to look at rolling out real-time messaging systems, it’s important to look at the overall hardware costs. With some forward planning, companies can save as much as 85% on their overall storage costs. Before we start getting into the cost comparisons, let me briefly show how Apache Kafka and Apache Pulsar store […]
Q and A: Viewpoints on Open Source
There are diverse viewpoints on open source and its usage as a service. I’ve attempted to give a synopsis of the issues and some background, but that’s only my viewpoint. I’m bringing in other people to share their viewpoints for a more well-rounded picture. This stems from this Twitter thread. The […]
The Three Components of a Big Data Data Pipeline
There’s a common misconception in Big Data that you only need one technology to do everything that’s necessary for a data pipeline, and that’s incorrect. Data Engineering != Spark: The misconception that Apache Spark is all you’ll need for your data pipeline is common. The […]
Advice for Small Teams and Startups on Data Engineering
Small data engineering teams require different tactics. Much of my writing is geared towards larger companies and teams. How should a startup or small data engineering team in a big company be set up and work? What, if anything, should be done differently? Your First Data Engineer: Your first data engineering hire is a crucial […]
Creating a Data Engineering Culture
At DataEngConf Barcelona, I premiered a new talk about the importance of creating a data engineering culture. I share what a data engineering culture is and what management needs to do to be successful with Big Data.
Here is the video from the conferen…
Why You Can’t Do All of Your Data Engineering with SQL
There is a common misunderstanding in data engineering that you can do everything you need to create a Big Data data pipeline with SQL. This notion is being promoted by some vendors and companies. They’re wrong and you can’t do all of your data engineering with SQL. You will eventually need a programming language to […]
Thoughts on Cloudera Merging/Buying Hortonworks
Cloudera has merged with/purchased Hortonworks. As a former Clouderan, it’s interesting to see this move on several levels. I’m going to share my insights from the outside as a former insider. Full Disclosure: Although I’m a former Clouderan, I don’t own any shares of Cloudera or Hortonworks and don’t plan to purchase any in the short term. […]
Creating Work Queues with Apache Kafka and Apache Pulsar
A common use case for using Kafka and Pulsar is to create work queues. The two technologies offer different implementations for accomplishing this use case. I’ll discuss the ways of implementing work queues in Kafka and Pulsar as well as the relative strengths of doing each one. What are work queues? A work queue is […]
InfiniteConf Keynote – Why Real-time is the Future
Here is my keynote from InfiniteConf 2018. I talk about why real-time is gaining so much momentum, what it does for businesses, how it helps data science, and some common use cases.