The move from batch to real-time Big Data represents a significant change. It will entail using new technologies and concepts that you may not have dealt with before.
Batch Big Data
Let’s start off by defining batch Big Data. For batch, all data must be there when the processing starts.
Batch processes can run over fixed periods of time. Data is written out for a period of time and then processed. This period can be as short as 1 hour but is more often 24 hours. As a direct result, the data is anywhere from 1 to 2 times the time period old before it can be used. For example, if data is batched for 24 hours and it takes 24 hours to process the data, the oldest data will be 48 hours old before it can be used.
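The arithmetic in that example can be written out as a short Python sketch. The function name and parameters here are hypothetical, chosen only for illustration, and the sketch assumes processing starts the moment the batch window closes.

```python
def batch_data_age_hours(batch_window_hours, processing_hours):
    """Return (freshest, oldest) age of the data, in hours, at the moment
    the batch results become usable.

    Assumption: processing starts as soon as the batch window closes.
    """
    # The last record written in the window only waits for processing.
    freshest = processing_hours
    # The first record written waits for the whole window plus processing.
    oldest = batch_window_hours + processing_hours
    return freshest, oldest


# The example from the text: a 24-hour batch that takes 24 hours to process.
freshest, oldest = batch_data_age_hours(24, 24)
print(f"Data is between {freshest} and {oldest} hours old when it can be used")
```

When processing takes as long as the batch window, the data ends up between 1 and 2 times the window old, which is the range described above.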
Common technologies that are used for batch processing in Big Data are Apache Hadoop and Apache Spark.
Real-time Big Data
Real-time Big Data is processed as soon as the data is received. Not all of the data needs to be present before processing starts.
How long it takes before data is processed varies with the technologies used. To be clear, the definition of real-time is the software definition and not the hardware definition. The amount of time before data is available can range from a few milliseconds to tens of seconds.
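To make the contrast with batch concrete, here is a minimal sketch in plain Python. It is not how Spark Streaming, Flink, or Kafka are programmed; the function names are hypothetical and the "events" are just numbers being summed.

```python
from typing import Iterable, Iterator, List


def batch_total(events: List[float]) -> float:
    # Batch: the complete data set must exist before processing starts.
    return sum(events)


def streaming_totals(events: Iterable[float]) -> Iterator[float]:
    # Real-time: update the result and emit it as each event is received,
    # without waiting for the rest of the data.
    total = 0.0
    for value in events:
        total += value
        yield total


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(batch_total(data))                    # one answer, after all data is in
    for running_total in streaming_totals(iter(data)):
        print(running_total)                    # an updated answer per event
```

The batch function returns a single answer once everything has arrived, while the streaming function has a usable answer after every event.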
There have been open source real-time technologies around for a while. However, they didn’t scale up to the Big Data levels we needed.
Common technologies that are used for real-time in Big Data are Apache Spark Streaming, Apache Flink, and Apache Kafka.
Comparing Batch and Real-time
Real-time is used when you need data fast. You will know what’s happening in your data within moments of it arriving. Batch is used when you can wait for results or when you need to process a large aggregate. Ad hoc analysis is done in batch.
One of the biggest differences for real-time is how it will change data science. My clients are starting to train models in real-time and react to model drift faster. Best of all, they’re running their models in real-time and getting real-time scoring.
Compared to batch, your costs will go up for real-time. While you can use some dynamic resource allocation, your real-time cluster, processing, and storage won’t be 100% efficient. You aren’t doing real-time to be cost efficient; you are trading off cost for speed.
Real-time brings a new level of operational issues. Your SLA (service level agreement) times with real-time will need to be much shorter than with batch. In a worst case scenario, operational downtime with real-time will cause you to lose data. This makes it important to have the shortest possible SLA times with real-time. Real-time puts even more emphasis on the need for disaster recovery and overall system fault tolerance.
Batch Big Data is complex. Real-time is even more complex. This increase in complexity will show up in your system design, your code, and your operations.
Real-time and Batch Together
Real-time is almost always paired with batch. Data is offloaded from the real-time systems into a batch storage system such as HDFS, S3, or Google Cloud Storage. This enables long-term retention of the data. By having the data saved out, teams can do ad hoc analytics, model training, and other in-depth analysis.
Batch and real-time are better together. You can run your ETL in real-time and, by offloading the results, use the same ETL’d data in batch. For commonly run analytics, you can precompute in real-time and then summarize in batch.
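The precompute-then-summarize pattern can be sketched in plain Python. This is only an illustration under assumptions: the function names, the `(timestamp_seconds, key)` event shape, and the per-minute and per-hour granularities are all hypothetical, and a real system would use a streaming framework and a batch storage system rather than in-memory dictionaries.

```python
from collections import Counter, defaultdict


def precompute_minute_counts(events):
    """Real-time step: count events per (minute, key) as they arrive."""
    counts = Counter()
    for timestamp_seconds, key in events:
        minute = timestamp_seconds // 60
        counts[(minute, key)] += 1
    return counts


def summarize_hourly(minute_counts):
    """Batch step: roll the offloaded per-minute counts up into hourly totals."""
    hourly = defaultdict(int)
    for (minute, key), count in minute_counts.items():
        hourly[(minute // 60, key)] += count
    return dict(hourly)


if __name__ == "__main__":
    # Hypothetical events: (seconds since start, event key).
    events = [(0, "a"), (30, "a"), (61, "a"), (3600, "a"), (10, "b")]
    minute_counts = precompute_minute_counts(events)
    print(summarize_hourly(minute_counts))
```

The batch step reads the small precomputed counts instead of the raw events, so the expensive per-event work is done only once, in real-time.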
What Do You Need To Do?
Just like batch, successful real-time projects require forethought and planning. Organizations will need to understand the increase in complexity. There are new technologies and concepts that the teams need to understand.
You and your team will need help, including training on the real-time technologies. Make sure you set yourselves up for success by getting the right training.