Unapologetically Technical Episode 9 – Gunnar Morling
https://youtu.be/ayAGiPd2zq4?si=o0SHbsbT0-Pmdkqd This week on Unapologetically Technical, I had the wonderful pleasure of interviewing Gunnar Morling, the creator of the Billion Row Challenge and Senior Staff Software Engineer at Decodable. In this episode, we talk about why it is so important to stay in a position long enough to gain experience and see the success or failure of decisions. […]
Unapologetically Technical Episode 8 – Tom Scott
https://youtu.be/ZJF7zLvH6-w?si=YHBTQ7B7XB2zwkrq It has been quite a while, but we’re finally back to a new episode this year! In this episode of Unapologetically Technical, I interview Tom Scott, the Founder and CEO of Streambased. Join us as we talk about distributed systems and how he built what we call distributed Monte Carlo simulations. We […]
The Reasons for Data Mesh on Pulsar
Data mesh is quickly becoming a way for companies to roll out their data strategy. If you haven’t already learned about data mesh, I suggest doing so. It comes with organizational and technical changes. I think a crucial part of your data mesh revolves around the choice of publish/subscribe technologies.
At the crux of data mesh is a desire for flexibility. This flexibility extends from the creation of data products (publish) all the way to the consumption of data products (subscribe). Greater flexibility in data mesh means that teams creating and consuming data products need less coordination with each other.
To help us look at the pub/sub needs for data teams, we’ll compare Apache Pulsar and Apache Kafka.
From a cursory look, there isn’t a vast difference between publishing data in Pulsar or Kafka. They both have similar publish APIs. However, Kafka lacks a built-in schema registry, whereas Pulsar ships with one that prevents incorrect schema usage. Data mesh calls for discovery and schema services, but I still believe these should be built into the technology itself rather than added on.
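To make this concrete, here is a minimal sketch of producing with Pulsar’s built-in schema support. The broker URL, topic name, and OrderEvent class are assumptions for illustration; the point is that the client registers the schema with the broker, which then rejects incompatible producers and consumers.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class OrderProducer {
    // Hypothetical event type used only for illustration.
    public static class OrderEvent {
        public String orderId;
        public double amount;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Schema.JSON(...) derives a schema from the class and registers it
        // with the broker's built-in schema registry on first use.
        Producer<OrderEvent> producer = client
                .newProducer(Schema.JSON(OrderEvent.class))
                .topic("persistent://public/default/orders") // assumed topic
                .create();

        OrderEvent event = new OrderEvent();
        event.orderId = "o-123";
        event.amount = 42.0;
        producer.send(event);

        producer.close();
        client.close();
    }
}
```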
Another significant difference between Kafka and Pulsar that isn’t as well known is that once data is published into a Kafka topic, it can only be consumed by reading its partitions as they were laid out. This manifests when teams have to choose between an ordered topic and random ordering, such as a round-robin layout. Before a team can answer this question, it needs to know how the data will be consumed. In a data mesh scenario, the team producing the data product may not be able to talk to every consuming team or predict all the ways their data will be consumed in current or future use cases. In these situations, the team would be forced to publish its data product twice: once into an ordered topic and again into a random topic. The Kafka solution adds complexity both operationally, by doubling the data and load, and programmatically, by requiring every consumer to know which topic is ordered and which is random.
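A hedged sketch of that publish-time choice in Kafka, assuming a local broker and hypothetical topic names: ordering is fixed by how records are partitioned when they are produced, not by how a downstream team wants to consume them.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaPublishChoice {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Option 1: keyed record -> every event for "customer-42" lands in
            // the same partition, so consumers see them in order.
            producer.send(new ProducerRecord<>("orders-ordered", "customer-42", "{...}"));

            // Option 2: null key -> records are spread round-robin across
            // partitions, giving throughput but no ordering guarantee.
            producer.send(new ProducerRecord<>("orders-random", null, "{...}"));
        }
    }
}
```

Publishing the same data product both ways means two topics, twice the storage and load, and consumers that must know which one to read.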
With Pulsar, this choice is dramatically different. Pulsar allows different methods of consumption, chosen by the consumer. The producer can publish in an ordered manner, and each data product consumer can choose among several subscription modes that maintain order or distribute data randomly. Pulsar is flexible enough to let the downstream consuming team choose for themselves.
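Here is a sketch of two teams consuming the same Pulsar topic with different subscription types; the topic, subscription names, and broker URL are assumptions. No second topic is needed, because the delivery semantics are chosen per subscription rather than at publish time.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class OrderConsumers {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        // Team A wants strict ordering: a Failover subscription keeps a single
        // active consumer at a time, preserving publish order.
        Consumer<byte[]> ordered = client.newConsumer()
                .topic("persistent://public/default/orders") // assumed topic
                .subscriptionName("team-a-ordered")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        // Team B wants parallel, effectively unordered consumption: a Shared
        // subscription round-robins messages across its consumers.
        Consumer<byte[]> parallel = client.newConsumer()
                .topic("persistent://public/default/orders")
                .subscriptionName("team-b-parallel")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // ... consume with ordered.receive() / parallel.receive() ...

        ordered.close();
        parallel.close();
        client.close();
    }
}
```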
A common use case for pub/sub is messaging. With messaging, consumption requires the ability to acknowledge individually that a message was consumed and processed. Kafka’s messaging support entails workarounds, whereas Pulsar supports individual acknowledgments natively. Once again, Kafka requires more coordination and operational considerations, while Pulsar lets the consumer of the data product make the choice.
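A minimal sketch of Pulsar’s per-message acknowledgment, assuming a hypothetical tasks topic and processing step: each message is acknowledged or negatively acknowledged on its own, rather than by advancing an offset.

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class WorkQueueConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/tasks") // assumed topic
                .subscriptionName("workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive(5, TimeUnit.SECONDS);
            if (msg == null) {
                break; // no more work for now
            }
            try {
                process(msg.getData());
                consumer.acknowledge(msg); // ack just this message
            } catch (Exception e) {
                consumer.negativeAcknowledge(msg); // redeliver just this message
            }
        }

        consumer.close();
        client.close();
    }

    private static void process(byte[] payload) {
        // Hypothetical processing step for illustration.
    }
}
```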
I always recommend using managed services whenever you are running on the cloud. There are two main managed services for Pulsar on the major cloud providers. StreamNative has StreamNative Cloud and DataStax has DataStax Astra.
I think a critical feature of the technologies for data mesh is flexibility. We don’t want to complicate our technical landscape with the limitations of our core technologies, and one of those core technologies is our choice of pub/sub. Data teams won’t know ahead of time all of the ways their data products will be consumed. Pulsar makes data mesh far easier, while Kafka complicates it. When doing data mesh, I highly recommend Pulsar as the pub/sub of choice.
The Soldiers, Rogues, and Mages of Data Teams
Data teams are like role-playing games (RPGs). If you’re not familiar with RPGs, there is a person or group of characters all working together for a common goa…
KPIs Every Data Team Should Have
Data teams can be challenging to measure. Their KPIs (Key Performance Indicators) differ from those of other teams because the team’s value creation and performance are distinct. I want to share some KPIs.
Before emba…
Ten Years On – The Million Monkeys Project
I want to tell you a story about how my life changed. It wasn’t a cult, new religion, or programming language. A million monkeys changed my life.
Ten years ago, I randomly recreated…
Keeping Things Stupidly Simple With Pulsar and Kafka
During my interviews for another case study in Data Teams, I was introduced to a concept I teach but hadn’t heard so brilliantly stated. The case study was with Justin Coffey and Fr…
What Happens When Data Science Teams Add A Data Engineer
By Jesse Anderson and Mikio Braun
Organizations are gradually getting the message about the critical nature of data engineering. Data science teams are getting that message too. Sometime…
Analysis of Confluent’s S1
Confluent just filed their S1 to IPO. I worked with Confluent starting in March of 2015, and we eventually parted ways. At my company, we continue to work with streaming technologies, inc…
Why Data Science Teams Don’t Think They Need Data Engineering
Some of the most interesting consultations are when I help data science teams that don’t think they need data engineering. I’ve compiled a list of some of the more common reasons why data…