Writing your own distributed system shouldn’t be a task you undertake lightly. Too often, I see teams creating their own distributed systems. In my experience, this is because they don’t know about, or haven’t thought through, all of the ramifications of creating their own distributed system.
I say all of this as someone who’s created three distributed systems from scratch. I’ve also taught and mentored teams who’ve created their own systems. As part of my research, I always ask them, “Knowing what you know now, would you write your own?” The answer is always a resounding no. They had too many unknown unknowns to make an informed decision. I’m writing this post to remove those unknown unknowns.
Check Yourself – Before You Wrickity-Wreck Yourself
Let’s first consider the reasons why you’re even thinking about writing your own distributed system. These reasons will give you an initial gut check on what’s really driving the idea. Let’s look at some poor reasons to write your own.
One reason is that you don’t like how a particular system dotted its i’s or crossed its t’s. Every system in Big Data has to cheat in some way or another. You might not like the way a system cheats, or its cheat might prevent you from accomplishing your use case. There are many different systems out there, and I strongly suggest you keep looking for one that lets you accomplish your use case.
Another reason is that you’re really smart and could do it better. Your distributed system would be better adapted to your use case because you’re smarter and could write it better. You might have written some code in university that gives you a background in distributed systems. You might think your use case is so special that it requires its own distributed system. You might think that a general-purpose distributed system is overkill for the problem and yours would handle it more elegantly. These thoughts often come from a lack of production experience with distributed systems.
Still another reason is that you don’t feel you have enough time to look at the different systems out there and make a choice. There are soooo many that you think you could write it quicker yourself. My rule of thumb is that nothing is quick and easy with distributed systems. It’s foolish to think you’re going to do it faster or better.
A final reason is that your master’s or PhD thesis was around a distributed system. There’s this thought that you could simply improve the thesis code a little to make it production-worthy. There is a massive difference between production code and PhD thesis code. There is a massive difference between a system that works in production and a system that works in theory or academically.
Development Phases
The problems with writing your own distributed system manifest differently depending on the phase of development you’re in. Let’s talk a little about what happens at each phase.
Early Development
This is the stage where you make the fateful mistake of writing your own distributed system. It starts out simple. This system calls that system. It divides up the work…and done. Jesse must not know what he’s talking about after all. That was relatively simple.
Then you pass it off to QA and they start testing. They find this edge condition and you fix it. You find another exception and you fix it. Wash, rinse, and repeat. Things are getting more difficult. Your code is looking like a bunch of if statements and exception handling.
That’s just the beginning. QA won’t be able to find or test for everything.
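To make that concrete, here’s a minimal sketch of what the “simple” version tends to look like after only a few QA cycles. This is purely illustrative: the worker addresses, the wire format, and the helper names are all made up. The happy path is a handful of lines; everything else is already if statements and exception handling, and it only grows from here.

```python
# Hypothetical sketch of a hand-rolled work distributor after a few QA rounds.
# Worker addresses, wire format, and function names are all made up.
import json
import logging
import socket

WORKERS = ["10.0.0.11:9000", "10.0.0.12:9000"]  # hard-coded worker nodes

def send_chunk(worker, chunk, retries=3):
    """Send one chunk of work to a worker and wait for its result."""
    host, port = worker.split(":")
    for attempt in range(retries):
        try:
            with socket.create_connection((host, int(port)), timeout=5) as conn:
                conn.sendall(json.dumps(chunk).encode() + b"\n")
                raw = conn.makefile().readline()
                if not raw:  # worker closed the connection mid-response
                    raise ConnectionError("empty response")
                return json.loads(raw)
        except socket.timeout:
            logging.warning("worker %s timed out (attempt %d)", worker, attempt)
        except OSError:
            logging.warning("worker %s unreachable (attempt %d)", worker, attempt)
        except json.JSONDecodeError:
            logging.warning("worker %s returned garbage", worker)
    return None  # and now the caller needs its own special case...

def run_job(chunks):
    results = []
    for i, chunk in enumerate(chunks):
        result = send_chunk(WORKERS[i % len(WORKERS)], chunk)
        if result is None:
            # What now? Retry on another worker? Fail the whole job?
            # Each answer spawns more if statements.
            continue
        results.append(result)
    return results
```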
Production and Operations
This is the stage where your distributed system is put in production. Things are handed off to the operations team. Your part in this is done…or so you think.
One of the first questions the Ops team asks is how they should support this new system. You say, “Don’t worry. We’ve tested this extensively and it should run without any issues.”
Then comes the first support call at 3 AM. “Things are blowing up and we don’t know why,” the ops person says. You roll out of bed, turn on your laptop, and start looking at the problem. Looking at the stack trace, you say to yourself, “Huh, I didn’t think of that error.”
You realize that the problem can only be fixed with a new code push. Your company has draconian rules around releases, and you know getting a quick fix out will be a fight. You leave for the office to get the patch ready. Tired. Very tired.
The next support call comes a few days later at 2 AM. “Things are blowing up and we don’t know why,” the ops person says. You roll out of bed, turn on your laptop, and start looking at the problem. This time, there isn’t enough logging to figure out what happened. You really have no idea what broke or why it broke.
Trying to replicate the problem doesn’t work. There is some kind of interplay or state issue. This is going to take days to figure out and fix. You go back to bed. Tired. Very tired.
Every time there is a problem, it gets escalated to you. The Ops team has no resource other than you for figuring out problems. You’ve given them some steps to run before they call you, but those steps rarely work.
All of the knowledge about the system is tribal. It has to be transmitted directly from you to each member of the team. There’s no way that an Ops person can Google it or find a solution on Stack Overflow. There is no outside course or book that Ops can leverage to support themselves. You are the bottleneck to everything.
You are now spending your time figuring out and fixing problems. You don’t create features or write new code. You’re spending your time plugging holes.
Your code is now an unreadable mess. It’s 80% error and edge case handling. No one else can read it but you. It’s now difficult to make fixes because each fix may affect another workaround. The system is so brittle you can barely keep it going.
Post-release Development
This is the stage where you’ve handed off the code to the Operations team. The project should have gone into maintenance mode while you work on the next version or another project.
The problem is that you haven’t moved on. Months down the line, you’re still patching and fixing. Anytime you start working on the next version, you’re pulled back into a problem. At some point, you stop even trying to start something new because you’re just waiting for the next operational problem.
The entire team’s productivity goes down because you’re not coding. It becomes a vicious cycle: not being able to QA everything creates a problem in production, which takes developers off coding, which takes time away from fixing the actual problem or creating new features.
Your roadmap looks like nothing but delays. Management is starting to ask questions. The salespeople are having to explain to customers why this or that product is so delayed. There’s lots of unhappiness to go around.
It’s incredibly difficult to break yourself out of this downward cycle. It only comes after months of concerted effort and maybe a complete rewrite.
Will This Happen to You?
Maybe, maybe not. The preceding story is directly based on my personal experience at one company. This isn’t hypothetical or academic; this is the real world.
Does This Apply to You?
You might be thinking this is a great blog post for other people: you’re unique, different, and smarter than everyone else, so this doesn’t really apply to you.
You might think your use case is simple and would never grow into the monstrosity I’m talking about, and that a distributed system framework would be overkill for you.
Writing a distributed system is highly experience-based. Unless you and your team have extensive experience writing your own system from scratch and putting it into production, you probably don’t have the requisite skills to write your own.
What You Should Do
You should use existing distributed systems whenever possible. There are many different systems out there, and each one is appropriate for different use cases. You should spend as much time as necessary looking at frameworks. With distributed systems, an ounce of prevention isn’t worth a pound of cure – it’s more like a metric ton of cure.
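As one illustration only (I’m picking Apache Spark here as an example, and the paths and column names are hypothetical), the same “divide up the work across machines” job collapses to a few lines on top of an existing framework. Splitting the input, scheduling tasks across workers, retrying failures, and moving data between nodes are handled by code that someone else has already hardened in production.

```python
# Illustration only: a distributed aggregation on an existing framework
# (Apache Spark via PySpark). Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-aggregation").getOrCreate()

# The framework splits the input, schedules tasks across workers,
# retries failures, and shuffles intermediate data for you.
events = spark.read.json("hdfs:///data/events/")
counts = events.groupBy("event_type").agg(F.count("*").alias("n"))
counts.write.mode("overwrite").parquet("hdfs:///data/event_counts/")

spark.stop()
```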
As you look through frameworks, don’t immediately discount one as unable to handle your use case. You may not have the background to know for certain whether it will or won’t work. As part of your project plan or Gantt chart, there should be an entire chunk of time allocated to choosing a distributed system. Coming from non-distributed systems or small data can give you a false sense that this choice is easy or trivial. Don’t make the mistake of skipping this vital step or not putting enough time into it.
When You Should Write Your Own
There is a time and a place to write your own distributed system. These times, however, are few and far between. You should take an honest look at your team and its skills before embarking on this project.
I strongly suggest you get a second opinion from an outside expert. This person should help you work through any preconceived notions before advising on the right path. This is often the purview of a qualified Data Engineer.
Let’s say you’ve exhausted all of the other routes. No one sells a system similar to what you need. There isn’t an open source or closed source project that does what you need. Go back and look again!
Only after fully exhausting all other avenues and talking with an outside expert should you do it yourself.
Where to Go From Here
The most common reason a team writes its own distributed system is a lack of knowledge of what’s already out there. If you’re doing a Big Data project, I have an entire course dedicated to teaching these skills. I don’t just cover the APIs; I cover how to use the technologies and why a given technology is the right tool for the job. Save yourself some time and heartache. Get some help before you write your own distributed system.