Today’s blog post answers a question from a subscriber to my mailing list. The question comes from Alpesh D.:

I have been getting your emails and they all seem to make sense. However, did I understand correctly that you believe all big data engineers need to be able to use Java? I come from a heavy SQL, MPP data warehousing and BI background. Having done shell scripting in my days as a DBA, I am able to pick up Python and move ahead, but Java seems like a little too much. What are your thoughts?

I think your question could be restated as two questions:

Is a Data Engineer the same thing as a BI Developer or DBA?

A Data Engineer is someone who has specialized their skills in creating software solutions around data. Their skills are predominantly based around Hadoop, Spark, and the open source Big Data ecosystem. They usually program in Java, Scala, or Python. They have an in-depth knowledge of creating data pipelines. Data pipelines are how data is brought in, processed, and turned into some kind of business value. This business value usually takes the form of reports, analytics, and dashboards. More advanced examples are fraud analytics or predictive analytics pipelines.

A Data Engineer is not a DBA (Database Administrator), Business Intelligence Developer, Data Analyst, or ETL Developer. That’s not to say a person with one of these titles couldn’t become a Data Engineer. Rather, people with these titles will need training and probably entirely new skills to become one. Usually, they’ll need more programming skills and Big Data skills than most people with these titles have.

Data Engineers are tasked with creating data pipelines and data products. Complex data pipelines are often outside the abilities of non-programmers because they require custom code.
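To make "custom code" a little more concrete, here is a minimal sketch of the three pipeline stages described above (bring data in, process it, produce a report) in plain Java. Everything in it is an assumption made for illustration: the class name PipelineSketch, the method totalsByUser, the "user,amount" record format, and the rule for skipping bad records are all hypothetical. A real pipeline would read from and write to Big Data tools rather than an in-memory list.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PipelineSketch {

    // Process step: parse raw "user,amount" lines (a hypothetical format)
    // and total the amount per user. Records that do not parse are skipped.
    static Map<String, Double> totalsByUser(List<String> rawLines) {
        Map<String, Double> totals = new TreeMap<>();
        for (String line : rawLines) {
            String[] parts = line.split(",");
            if (parts.length != 2) {
                continue; // skip records without exactly two fields
            }
            try {
                double amount = Double.parseDouble(parts[1].trim());
                totals.merge(parts[0].trim(), amount, Double::sum);
            } catch (NumberFormatException e) {
                // skip records whose amount is not a number
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Ingest step: stand-in data instead of a real source.
        List<String> raw = List.of("alice,10.5", "bob,3", "garbage", "alice,2.5", "bob,abc");

        // Output step: the "business value", here a simple per-user report.
        totalsByUser(raw).forEach((user, total) -> System.out.println(user + " -> " + total));
    }
}
```

The point of the sketch is the shape rather than the details: the parsing and error-handling rules live in ordinary code, which is where a complex pipeline tends to need a programmer.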

Does a Data Engineer need to use Java?

A Data Engineer’s primary language needs to be Java. They’ll also need to know SQL, and I highly recommend they know at least one more language, such as Python (a dynamic language) or Scala.

If you look around the Big Data ecosystem, virtually every one of the projects has a Java API. Some projects support a Java API plus another language. That doesn’t mean everything in a data pipeline is limited to Java. Some pipelines will be a mix of Java, SQL, and a dynamic language.

I’ve trained at companies whose data teams knew only SQL. They were severely limited in what they could accomplish. You can do some interesting things with SQL, and I recommend using it for some operations. But when SQL is your only tool, you can’t use the ecosystem tools that lack a SQL interface. At those companies, if SQL couldn’t do it, it simply wasn’t done. They had no way to build anything else.

Join my mailing list and I might answer your question next time.