What is a Data Engineer?

Data engineer, data analyst, and data scientist: these are job titles you'll often hear mentioned together when people talk about the fast-growing field of data science. Today we will look at what a data engineer does and how you can become one.


Data engineers build and optimize the systems that allow data scientists and analysts to perform their work. Every company depends on its data to be accurate and accessible to individuals who need to work with it. The data engineer ensures that any data is properly received, transformed, stored, and made accessible to other users.

What does a Data Engineer do?

At larger organizations, data engineers can have different focuses such as leveraging data tools, maintaining databases, and creating and managing data pipelines. Whatever the focus may be, a good data engineer allows a data scientist or analyst to focus on solving analytical problems, rather than having to move data from source to source.

The data engineer’s mindset is often more focused on building and optimization. The following are examples of tasks that a data engineer might be working on:

Building APIs for data consumption.
Integrating external or new datasets into existing data pipelines.
Applying feature transformations for machine learning models on new data.
Continuously monitoring and testing the system to ensure optimized performance.

Data engineering skills like Python and SQL regularly rank among the highest-paying skills in Stack Overflow’s developer surveys. At the time of writing, there are around 70,000 results for the search term Data Scientist on LinkedIn, and around 112,500 results for the search term Data Engineer. On Glassdoor the difference is even more pronounced: around 22,500 for data scientists versus around 77,100 for data engineers (filtered for jobs posted in the last month).

Not only is there a large demand for data engineers, but that demand also keeps increasing! As of June 2019, the demand for data engineers had increased by 88% year over year (source).

And that’s not all! According to Statista, the global big data market is forecast to grow to 103 billion U.S. dollars by 2027, more than double its expected market size in 2019.

How about the pay?

According to IBM’s The Quant Crunch: How the Demand for Data Science Skills is Disrupting the Job Market, “Jobs specifying machine learning skills pay an average of $114,000. Advertised data scientist jobs pay an average of $105,000 and advertised data engineering jobs pay an average of $117,000.”

Where to begin?

If this job sparks something in you and you are full of enthusiasm, you can learn it, master all the needed skills, and become a real data engineering rock star. And, yes, you can do it even without a programming or other tech background. It’s hard, but it’s possible!

First of all, Data Engineering is primarily related to computer science. To be more specific, you should have an understanding of efficient algorithms and data structures. Secondly, since data engineers deal with data, an understanding of the operation of databases and the structures underlying them is a necessity.

For example, conventional SQL databases are built on the B-tree structure, while many modern distributed data stores rely on LSM-trees and hash-table-based structures.

1. Algorithms and Data Structures

Using the right data structure can drastically improve the performance of an algorithm. Ideally, we would all learn data structures and algorithms at school, but they are rarely covered. Anyway, it’s never too late.
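To make the point about data structures concrete, here is a minimal sketch that answers the same membership question with a list and with a set. The workload (looking up IDs in 100,000 known IDs) and the names are made up for illustration, and the timings will vary from machine to machine.

```python
import timeit

# Hypothetical workload: check whether some IDs appear among 100,000 known IDs.
known_ids_list = list(range(100_000))
known_ids_set = set(known_ids_list)
lookups = [99_999, 50_000, -1]  # a late hit, a middle hit, and a miss


def count_hits(collection, ids):
    """Count how many of the given ids are present in the collection."""
    return sum(1 for i in ids if i in collection)


# Both structures give the same answer...
assert count_hits(known_ids_list, lookups) == count_hits(known_ids_set, lookups)

# ...but the list is scanned element by element (linear time),
# while the set uses hashing (constant time on average).
list_time = timeit.timeit(lambda: count_hits(known_ids_list, lookups), number=20)
set_time = timeit.timeit(lambda: count_hits(known_ids_set, lookups), number=20)
print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The code that does the lookup is identical in both cases; only the data structure behind it changes.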

Some free courses for learning data structures and algorithms:

Easy to Advanced Data Structures (Udemy)

Algorithms, Part I (Coursera)

Algorithms, Part II (Coursera)

2. Learn SQL

Our whole life is data. To extract this data from a database, you need to “speak” with it in its own language. SQL (Structured Query Language) is the lingua franca of the data world. No matter what anyone says, SQL is alive and will stay alive for a very long time.

If you have been in development for a long time, you have probably noticed that rumors about the imminent death of SQL appear periodically. The language was developed in the early 1970s and is still wildly popular among analysts, developers, and enthusiasts alike.

How to learn SQL? Practice. Start with the excellent tutorial from Mode Analytics, which is free, by the way.

Intermediate SQL

Joining Data in SQL
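To give a taste of what “speaking” SQL looks like, here is a small sketch you can run with nothing but Python's built-in sqlite3 module. The customers and orders tables and their contents are invented for the example. The query joins the two tables and aggregates the orders per customer.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 15.5);
""")

# LEFT JOIN keeps customers without orders; COALESCE turns their NULL sum into 0.
query = """
    SELECT c.name, COUNT(o.id) AS order_count, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC, c.name
"""
rows = conn.execute(query).fetchall()
for name, order_count, total in rows:
    print(name, order_count, total)
conn.close()
```

Try replacing LEFT JOIN with JOIN and see which customer disappears from the result.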

3. Programming in Python and Java / Scala

To understand how big data tools work, you need to know the languages in which they are written. The functional approach of Scala lets you solve parallel data processing problems effectively. Python, unfortunately, cannot boast of speed or parallel processing. On the whole, knowing several languages and programming paradigms broadens the range of approaches you can take to a problem.
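The functional idea is not tied to Scala. As a rough sketch, here is a word count written in Python as a pure "map" step and an associative "reduce" step, which is the shape that lets a framework split work across workers. The input lines and function names are made up, and a thread pool stands in for a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(chunk):
    """Pure 'map' step: depends only on its input, so chunks can run independently."""
    return Counter(word.lower() for line in chunk for word in line.split())


def merge_counts(left, right):
    """Associative 'reduce' step: partial results can be merged in any grouping."""
    return left + right


lines = [
    "spark reads data",
    "kafka streams data",
    "spark writes data",
    "kafka stores logs",
]
# Split the input into chunks of two lines each.
chunks = [lines[i:i + 2] for i in range(0, len(lines), 2)]

# Each chunk is counted by a separate worker; the order of completion does not matter.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

totals = reduce(merge_counts, partial_counts, Counter())
print(totals.most_common(3))
```

In standard CPython, threads will not speed up CPU-bound work like this because of the global interpreter lock, so treat the pool purely as an illustration of the structure. The same map and reduce functions could be handed to a process pool or expressed in a framework such as Spark.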

To dive into the Scala language, you can read Programming in Scala by the author of the language. Twitter has also published a good introductory guide, Scala School.

As for Python, before you begin writing your own code, take some time to read the following resource from the official Python 3 documentation.

4. Big Data Tools

You can find more information on big data building blocks in this awesome interactive environment. The most popular tools are Spark and Kafka. They are worth exploring, preferably to the point of understanding how they work on the inside.

In 2013, Jay Kreps (co-creator of Kafka) published a monumental essay, The Log: What every software engineer should know about real-time data’s unifying abstraction. Its core ideas, by the way, were used in the creation of Apache Kafka.
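The central abstraction of that essay, an append-only sequence of records addressed by offset, fits in a few lines. The following is a toy in-memory sketch with invented class names and events. It is not Kafka's API and leaves out everything that makes a real log hard (persistence, partitioning, replication).

```python
class AppendOnlyLog:
    """Toy in-memory log: records are only appended and are addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own position, so readers are independent."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        """Read the next batch and advance this consumer's offset."""
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = AppendOnlyLog()
for event in ["signup:alice", "login:alice", "signup:bob"]:
    log.append(event)

analytics = Consumer(log)
billing = Consumer(log)

first_batch = analytics.poll(max_records=2)  # analytics reads the first two records
log.append("login:bob")                      # a new record arrives in the meantime
rest = analytics.poll()                      # analytics continues where it left off
everything = billing.poll()                  # billing starts from the beginning
```

Because the log never changes existing records, two consumers reading at different speeds still see the same events in the same order.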

A good introduction to Hadoop is A Complete Guide to Mastering Hadoop (free).

In my opinion, the most comprehensive guide to Apache Spark is Spark: The Definitive Guide.

5. Cloud Platforms

Knowledge of at least one cloud platform is among the basic requirements for a Data Engineer position. Employers tend to prefer Amazon Web Services, with Google Cloud Platform in second place and Microsoft Azure rounding out the top three.

You should be comfortable with Amazon EC2, AWS Lambda, Amazon S3, and DynamoDB.
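To show how two of these services typically meet, here is a hedged sketch of a Python AWS Lambda handler that reacts to an S3 event notification. The handler only lists the objects named in the event. The bucket name, the object key, and the shape of the return value are made up for the example, and the sample event is a hand-written subset of a real notification so the function can be tried locally without an AWS account.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """List the S3 objects referenced by an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append(f"s3://{bucket}/{key}")
    # The return shape is an assumption made for this sketch.
    return {"processed": len(objects), "objects": objects}


# Local dry run with a hand-written event; bucket and key are hypothetical.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-raw-data"},
                "object": {"key": "events/2019/june+report.csv"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
print(result)
```

A real handler would go on to read the object and write results somewhere, for example to DynamoDB, using the AWS SDK.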

6. Distributed Systems

Working with big data implies the presence of clusters of independently working computers, the communication between which takes place over the network. The larger the cluster, the greater the likelihood of failure of its member nodes. To become a cool data expert, you need to understand the problems and existing solutions for distributed systems. This area is old and complex.
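A quick back-of-the-envelope calculation shows why cluster size matters. The sketch below assumes, purely for illustration, that every node fails independently with the same probability p during some period. Under that assumption the chance that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def prob_any_failure(node_failure_prob, node_count):
    """P(at least one of n independent nodes fails) = 1 - (1 - p)^n."""
    return 1 - (1 - node_failure_prob) ** node_count


# Hypothetical figure: each node has a 1% chance of failing in the period considered.
p = 0.01
for n in (1, 10, 100, 1000):
    print(f"{n:>5} nodes -> {prob_any_failure(p, n):.1%}")
```

With these made-up numbers, a 100-node cluster already has about a 63% chance of seeing at least one failure, which is why distributed systems treat failure as a normal case and not an exception.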

Andrew Tanenbaum is considered to be a pioneer in this realm. For those who aren’t afraid of theory, check out his book Distributed Systems. It may seem difficult for beginners, but it will really help you brush up your skills.

7. Data Pipelines

Data pipelines are something you can’t live without as a Data Engineer.

Much of the time, a data engineer builds so-called data pipelines, that is, processes that deliver data from one place to another. These can be custom scripts that call an external service API or run a SQL query, enrich the data, and put it into centralized storage (a data warehouse) or storage for unstructured data (a data lake).
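The steps described above can be sketched as three small functions: extract, transform, and load. Everything here is a stand-in chosen for illustration. The extract function returns a hard-coded payload where a real script would call an external API, and an in-memory SQLite database plays the role of the data warehouse.

```python
import sqlite3


def extract():
    """Stand-in for a call to an external service API (hypothetical payload)."""
    return [
        {"user_id": 1, "country": "fr", "amount_cents": 1250},
        {"user_id": 2, "country": "de", "amount_cents": 990},
        {"user_id": 3, "country": None, "amount_cents": 400},
    ]


def transform(records):
    """Clean and enrich: drop incomplete rows, normalise country, convert cents to units."""
    return [
        (r["user_id"], r["country"].upper(), r["amount_cents"] / 100)
        for r in records
        if r["country"] is not None
    ]


def load(rows, conn):
    """Write rows to the central store; SQLite stands in for a data warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one easy to test on its own and to swap out, for example replacing the hard-coded payload with a real API call.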

The journey to becoming a Data Engineer is not as easy as it might seem. It is unforgiving and frustrating, and you have to be ready for this. Some moments on this journey will make you want to throw in the towel. But this is what real work and learning look like.

Just don’t sugarcoat it right from the beginning. The whole point of the journey is to learn as much as you can and be prepared for new challenges.