About Professional Data Engineering

Big Data solutions are complex. They usually require five to ten or more complicated technologies, each applied correctly and all working together. That complexity is what makes Big Data such a different animal: it is roughly an order of magnitude more complex than small data.

This course is intended only for teams and companies facing real Big Data problems. We focus on practical use cases and real-world applications. The course doesn’t just show the technologies; it shows how they fit into data pipelines and how professional Data Engineers use them.

Duration: Online – Two weeks of concerted effort/Eight weeks of less effort

Intended Audience: Technical staff, including Software Engineers, QA, and Analysts

Prerequisites: Intermediate-Level Java

You Will Learn

What exists in the Big Data ecosystem so you can use the right tool for the right job.
An understanding of how HDFS works and how to interact with it.
An understanding of how MapReduce works and how each phase works.
An understanding of how Spark works and how each phase works.
What Java 8 lambdas are and how they make your Spark code humanly readable.
The basics of coding a Spark job with Java to build your Big Data foundation.
The various API methods in Spark and what they do.
How SQL can be used with a Spark job and when that vastly improves your productivity and code.
How to write Java code that runs as a function inside a Spark SQL command, so you can reuse existing Java code or run use case-specific queries.
The basics of coding a MapReduce job with Java to build your Big Data foundation.
The advanced features of the MapReduce API that only the true experts know.
How Apache Crunch gives you a very different, more Java-centric API than MapReduce.
How to use Apache Crunch to do the things that are painfully hard in raw MapReduce, like joining datasets and performing secondary sorts.
The simple and advanced SQL-like commands available in Hive.
How to extend Hive commands with custom non-Java code to do company or use case-specific queries.
How to move data out of and into relational databases like MySQL and Oracle from Hadoop/Spark using Apache Sqoop.
How to move files and network data from many different computers to Hadoop using Apache Flume.
What Hue is and how it aids in creating browser-based data products.
How Apache Oozie makes it possible to create repeatable workflows that enterprises need.
How all of these technologies come together as a solution for ETL, clickstream, and sessionization use cases.
The steps and iterations to take when creating a Big Data solution.
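
To give a flavor of the MapReduce phases mentioned in the list above, here is a minimal sketch of a word count split into map, shuffle, and reduce steps. It is plain Java running in a single process, not the Hadoop or Spark API, and the class and method names (WordCountPhases, mapPhase, shufflePhase, reducePhase) are made up for illustration. It also uses two Java 8 lambdas to show how little ceremony they need.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the three MapReduce phases.
// A real job would use the Hadoop MapReduce API and run across a cluster.
class WordCountPhases {

    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share a key, with keys in sorted order.
    static Map<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            // Java 8 lambda: create the value list the first time a key is seen.
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse each key's list of values into a single sum.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        // Java 8 lambda: one line per key instead of an anonymous inner class.
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Chain the three phases together.
    static Map<String, Integer> wordCount(List<String> lines) {
        return reducePhase(shufflePhase(mapPhase(lines)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data is big", "data pipelines move data");
        // prints {big=2, data=3, is=1, move=1, pipelines=1}
        System.out.println(wordCount(lines));
    }
}
```

In a real cluster the shuffle step is handled by the framework between mappers and reducers; it is written out here only to make the grouping visible.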

Course Outline

Thinking in Big Data
  Introducing Big Data
  What is Hadoop?
  The Ecosystem
  Introduction to HDFS
  Introduction to MapReduce
Coding with MapReduce
  Java API
  Streaming API
  Using Eclipse
  Regular Expressions
  Using Apache Maven
Advanced MapReduce
  Advanced MapReduce Classes
  Unit Testing
  Avro
  MapReduce and Avro
Using Parquet
  Columnar File Formats
  Coding With Parquet
Coding With Crunch
  Using Crunch
  Crunch API Pipelines
Advanced Crunch
  Joins
  Crunch Operations
  Secondary Sorts
  Unit Testing
Using Hive
  Hive Overview
  Hive Queries
  Advanced Queries
Augmenting Hive With UDFs and Transforms
  Hive Transforms
  Hive UDFs
Coding With Spark
  About Spark
  Using Eclipse
  Using Apache Maven
  Functional Programming
  Java API
  Built-In Transformations and Actions
Spark SQL
  Spark SQL
  Spark SQL API
  Spark SQL UDFs
Moving and Accessing Data
  Sqoop
  Flume
Creating Workflows
  Hue
  Oozie
  Hue and Oozie
Hadoop Architectures
  ETL
  Clickstream
  Other Architectures
Step 0 – Learning
  How To Learn
  Learning Strategies
  Habits of Successful Students
  Habits of Unsuccessful Students
  Applying Strategies
Pre-Big Data
  Simple Big Data
  Review and Application
Live Coding
  Simple Algorithms
D3.js
  Ways Of Visualizing Data
  Charting With Dimple and D3.js
  Importance of Visualization
  Creating the Right Visualizations
The Basics of HBase
  HBase Architecture
  HBase API
  Architecting HBase Solutions
Doing Data Science on the NFL Play by Play Dataset
  Enter the Query – The Hive Story
  Algorithms Alone – Lost in data
Million Monkeys
  The Project
  The Algorithm
  The Results
  Going Viral
Engineering Big Data Solutions
Kafka
  About Kafka
  Kafka Internals
  Kafka API

Technologies Covered

Apache Hadoop
Apache Spark
Apache Hive
Apache Pig
Apache HBase
Apache Impala
Apache Kafka
Apache Parquet
Apache Crunch
Apache Sqoop
Apache Flume
Hue
Apache Oozie
D3.js

Have a question?

Send us a message

or give us a call at +1 775.393.9122

© 2025 Big Data Institute

Privacy