Apache Kafka and Google Cloud Pub/Sub

Some of the contenders for Big Data messaging systems are Apache Kafka, Google Cloud Pub/Sub, and Amazon Kinesis (not discussed in this post). While they are similar in many ways, there are enough subtle differences that a Data Engineer needs to know them. These can range from “nice to know” to “we’ll have to switch.” Cloud vs DIY […]

Kafka 0.10 Changes for Developers

Kafka 0.10 is out. Here are the changes that developers need to know about. Here is the new URL to the Kafka 0.10 JavaDoc. KafkaConsumer The KafkaConsumer had a minor change that allows you to specify a maximum number of messages to return. You can set this by setting the max.poll.records property to a […]

Question and Answers with the Apache Beam Team

Apache Beam just had its first release. Now that we’re working towards the second release, 0.2.0-incubating, I’m catching up with the committers and users to ask some of the common questions about Beam. Each committer and user is sharing their own opinion and not necessarily that of their company. Our interviewees are: Neville Li (NL) […]

Ability Gap – Why We Need Data Engineers

I had a conversation with another person in the Big Data field. We were discussing whether Data Engineer would become a more common job title and migrate out of Silicon Valley. I told him yes. Big Data is downright complicated on many levels. There are too many new technologies and changes within technologies where […]

Big Data’s Required and Recommended Technical Skills

A common question beginners ask about Hadoop is which technical skills are needed to get started. Knowing this helps level set what skills you need before you embark on a big data journey. For developers and administrators, I divide up the skills into those that are required and those that are nice to have or recommended. Developer Skills […]

My Big Data Journey

Everyone’s Big Data journey starts somewhere. We’re often given stories of outright mastery, but I want to tell you how I got started with Big Data. Each of these stories about mastery forgets or omits its humble beginnings. This is my story, starting from my humble beginnings. Distributed Systems My specialty in programming has always been […]

The Case for Heron

For the past few months, I’ve been teaching at companies that are heavy users of Apache Storm. They’re also undertaking massive projects to move off of Storm. During that time, I’d say that something new was coming that might convince them to consider an alternative. Now, I’m free to talk about that alternative. Twitter has […]

We Live, Eat, and Breathe This Stuff

The NFL ran a commercial a few years back. It featured various professional athletes from the NFL doing things you wouldn’t otherwise believe. One showed a quarterback shooting trap with his football instead of a shotgun. I’ve shot trap and it’s hard enough with a shotgun, much less a football. I see a similar […]

Spark and Java – Yes, They Work Together

Person who chases two rabbits catches neither. – Confucius This really applies to learning. Learning two new and different technologies at the same time makes you catch neither. I’ve seen so many students trying to learn Big Data and a new programming language at the same time. A few succeed where most fail. Why Two? […]

SSH With Google Cloud

Let’s just say that Google Cloud’s SSH instructions aren’t the greatest. Here are the steps to SSH into your instance. These steps assume that you’ve installed the gcloud program. These instructions are for Mac OS X and Linux. We start off by creating a new SSH key. $ ssh-keygen -t rsa -f ~/.ssh/google_compute_engine -C yourgooglecloudemailaddress@example.com The ssh-keygen […]