Crawl, Walk, Run with Big Data
Attacking a Big Data project with an all-or-nothing mindset leads to absolute failure. I highly suggest breaking the overall project into more manageable phases. These phases are called crawl, walk, and run. Crawling: In this phase, you’re doing the absolute minimum to start using Big Data. This might […]
Q and A: Big Data strategy
How good a Big Data strategy can be defined by someone that doesn’t know the technology behind it? Today’s blog post comes from a question from a subscriber to my mailing list. The question comes from André M: How good a Big Data strategy can be defined by someone that doesn’t know the technology behind […]
Is Big Data Cheap?
Companies and individuals often come into Big Data thinking everything is cheap. After all, the entire stack is open source, right? Well, some things are cheap and some things are more expensive. Software: One of the important distinctions with Hadoop is that it isn’t an open source knockoff of a better closed source framework. […]
Apache Kafka and Google Cloud Pub/Sub
Some of the contenders for Big Data messaging systems are Apache Kafka, Google Cloud Pub/Sub, and Amazon Kinesis (not discussed in this post). While similar in many ways, there are enough subtle differences that a Data Engineer needs to know. These can range from “nice to know” to “we’ll have to switch.” Cloud vs DIY […]
Kafka 0.10 Changes for Developers
Kafka 0.10 is out. Here are the changes that developers need to know about. Here is the new URL to the Kafka 0.10 JavaDoc. KafkaConsumer: The KafkaConsumer had a minor change that allows you to specify a maximum number of messages to return. You can set this by using the max.poll.records property to a […]
Question and Answers with the Apache Beam Team
Apache Beam just had its first release. Now that we’re working towards the second release, 0.2.0-incubating, I’m catching up with the committers and users to ask some of the common questions about Beam. Each committer and user is sharing their own opinion and not necessarily that of their company. Our interviewees are: Neville Li (NL) […]
Ability Gap – Why We Need Data Engineers
I had a conversation with another person in the Big Data field. We were discussing whether Data Engineer would become a more common job title and migrate out of Silicon Valley. I told him yes. Big Data is downright complicated on many levels. There are too many new technologies and changes within technologies where […]
Big Data’s Required and Recommended Technical Skills
A common question beginners ask about Hadoop is what technical skills are needed to get started. This helps level set what skills you need before you embark on a big data journey. For developers and administrators, I divide up the skills into those that are required and those that are nice to have or recommended. Developer Skills […]
My Big Data Journey
Everyone’s Big Data journey starts somewhere. We’re often given stories of outright mastery, but I want to tell you how I got started with Big Data. Each of these stories about mastery forgets or omits its humble beginnings. This is my story from my humble beginnings. Distributed Systems: My specialty in programming has always been […]
The Case for Heron
For the past few months, I’ve been teaching at companies that are heavy users of Apache Storm. They’re also undertaking massive projects to move off of Storm. During that time, I’d say that something new was coming that might convince them to consider an alternative. Now, I’m free to talk about that alternative. Twitter has […]