udacity capstone project data engineer


Udacity Data Engineering Nanodegree Capstone Project



Project Summary

The objective of this project was to create an ETL pipeline for the I94 immigration, global land temperatures, and US demographics datasets to form an analytics database on immigration events. A use case for this analytics database is to find immigration patterns to the US. For example, we could try to find answers to questions such as: do people from countries with warmer or colder climates immigrate to the US in large numbers?

Data and Code

All the data for this project was loaded into S3 prior to commencing the project. The exception is the i94res.csv file, which was loaded into the Amazon EMR HDFS filesystem.

In addition to the data files, the project workspace includes:


The project follows these steps:

Step 1: Scope the Project and Gather Data

Step 2: Explore and Assess the Data

Step 3: Define the Data Model

Project Scope

To create the analytics database, the following steps will be carried out:

The technologies used in this project are Amazon S3 and Apache Spark. Data will be read and staged from the customer's repository using Spark.

Refer to the Jupyter notebook for exploratory data analysis.

3.1 Conceptual Data Model

Database schema

The country dimension table is made up of data from the global land temperatures by city and the immigration datasets. The combination of these two datasets allows analysts to study correlations between global land temperatures and immigration patterns to the US.

The US demographics dimension table comes from the demographics dataset and links to the immigration fact table at US state level. This dimension allows analysts to get insights into migration patterns into the US based on demographics as well as the overall population of states. We could ask questions such as: do populous states attract more visitors on a monthly basis? One envisions a dashboard that could be designed based on the data model with drill-downs into granular information on visits to the US. Such a dashboard could foster a culture of data-driven decision making within tourism and immigration departments at state level.

The visa type dimension table comes from the immigration datasets and links to the immigration fact table via the visa_type_key.

The immigration fact table is the heart of the data model. This table's data comes from the immigration datasets and contains keys that link to the dimension tables. The data dictionary of the immigration dataset contains detailed information on the data that makes up the fact table.

3.2 Mapping Out Data Pipelines

The pipeline steps are as follows:

Step 4: Run Pipelines to Model the Data

4.1 Create the Data Model

Refer to the Jupyter notebook for the data dictionary.

4.2 Running the ETL pipeline

The ETL pipeline is defined in the etl.py script, and this script uses the utility.py and etl_functions.py modules to create a pipeline that creates final tables in Amazon S3.

spark-submit --packages saurfang:spark-sas7bdat:2.0.0-s_2.10 etl.py
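As a hypothetical sketch (the actual code lives in etl.py, utility.py and etl_functions.py, which are not shown here), the pipeline's S3 output locations might be derived with a small helper like the one below; the function name and path layout are assumptions for illustration only:

```python
# Hypothetical sketch of a helper an etl.py-style pipeline might use to build
# the S3 destinations for its final tables. Names and layout are assumptions,
# not the project's actual code.

def output_path(bucket: str, table: str) -> str:
    """Return the S3 location where a final table is written as parquet."""
    return "s3a://{}/tables/{}".format(bucket, table)

# The pipeline would then write each fact/dimension table to its own prefix:
locations = [output_path("analytics-bucket", t)
             for t in ("immigration_fact", "country_dim", "visa_type_dim")]
```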


Udacity Data Engineering Capstone Project

Project Summary

The project follows these steps:

Step 1: Scope the Project and Gather Data

Step 2: Explore and Assess the Data

Step 3: Define the Data Model

Step 4: Run Pipelines to Model the Data

Step 5: Complete Project Write Up

The project is one provided by Udacity to showcase what the student has learned throughout the program. Four datasets, described below, are used to complete the project.

The project builds a data lake using PySpark that can help support the analytics department of the US immigration department by extracting data from all the sources and making it queryable. The conceptual data model is a factless, transactional star schema with dimension tables. Examples of the information that can be queried from the data model include the number of visitors by nationality, visitors' main country of residence, their demographics, and flight information.

Python is the main language used to complete the project. The libraries used to perform ETL are Pandas, PyArrow and PySpark, and the environment used is the workspace provided by Udacity. Immigration data was transformed from SAS format to parquet format using PySpark. These parquet files were ingested using PyArrow and explored using Pandas to gain an understanding of the data before building a conceptual data model. PySpark was then used to build the ETL pipeline: the source data was cleaned, transformed to create new features, and the resulting tables saved as parquet files. The two notebooks with all the code and output are as follows:

1. exploringUsingPandas.ipynb

2. exploringUsingPyspark.ipynb

Describe and Gather Data

Immigration data.

“Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” ( https://en.wikipedia.org/wiki/Form_I-94 ). It lists the traveler’s immigration category, port of entry, date of entry into the United States, status expiration date, and had a unique 11-digit identifying number assigned to it. Its purpose was to record the traveler’s lawful admission to the United States ( https://i94.cbp.dhs.gov/I94/ ).

This is the main dataset; there is a file for each month of the year 2016 available in the directory ../../data/18-83510-I94-Data-2016/ . It is in the SAS binary database storage format sas7bdat. This project uses the parquet files available in the workspace, in the folder called sap_data. The data is for the month of April 2016, which has more than three million records (3,096,313). The fact table is derived from this table.

World Temperature Data

Data is from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. The original dataset from Kaggle includes several files ( https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data ), but for this project only GlobalLandTemperaturesByCity was analyzed. The dataset covers a long period of the world’s temperatures (from 1743 to 2013). However, since the immigration dataset only has data for the year 2016, the vast majority of the data here is not suitable.

Airports Data

“Airport data includes IATA airport codes. An IATA airport code, also known as an IATA location identifier, IATA station code or simply a location identifier, is a three-letter geocode designating many airports and metropolitan areas around the world, defined by the International Air Transport Association (IATA). IATA codes are used in passenger reservation, ticketing and baggage-handling systems ( https://en.wikipedia.org/wiki/IATA_airport_code )”. It was downloaded from a public domain source ( http://ourairports.com/data/ ).

U.S. City Demographic Data

This dataset contains information about the demographics of all US cities and census-designated places with a population greater or equal to 65,000. This data comes from the US Census Bureau’s 2015 American Community Survey. This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. The US City Demographics is the source of the STATE dimension in the data model and grouped by State.

Gather Data

The notebook exploringUsingPandas.ipynb shows the workings to assess and explore the data. The main findings and necessary cleaning steps are as follows:

3.1 Conceptual Data Model

Map out the conceptual data model and explain why you chose that model

For this project, a star schema is deployed in a relational database management system as dimensional structures. Star schemas characteristically consist of fact tables linked to associated dimension tables via primary/foreign-key relationships.

3.2 Mapping Out Data Pipelines

The project involved four key decisions during the design of a dimensional model:

The business process for the immigration department is to allow valid visitors into the country. The process generates events and captures performance metrics that translate into facts in a fact table.

The grain establishes exactly what a single fact table row represents. In this project a record is created as the event of a visitor entering the USA occurs. The grain is declared before choosing the fact and dimension tables and becomes a binding contract on the design. This ensures uniformity across all dimensional designs, which is critical to BI application performance and ease of use.

Dimension tables provide the “who, what, where, when, why, and how” context surrounding a business process event. Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts. In this project, a dimension is single valued when associated with a given fact row. Every dimension table has a single primary key column. This primary key is embedded as a foreign key in the associated fact table where the dimension row’s descriptive context is exactly correct for that fact table row.

Dimension tables are wide, flat denormalized tables with many low cardinality text attributes. It is designed with one column serving as a unique primary key. This primary key is not the operational system’s natural key because there will be multiple dimension rows for that natural key when changes are tracked over time. These surrogate keys are simple integers, assigned in sequence. The tables also denormalize the many-to-one fixed depth hierarchies into separate attributes on a flattened dimension row. Dimension denormalization supports dimensional modeling’s twin objectives of simplicity and speed.
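The surrogate-key idea described above can be sketched in a few lines of Python; the visa codes below are invented purely for illustration:

```python
# Assign simple sequential surrogate keys to dimension rows, independent of
# the operational system's natural key (visa codes below are invented).

natural_keys = ["B2", "B1", "F1"]  # natural keys as they arrive from the source

# Surrogate keys are plain integers assigned in sequence.
surrogate_keys = {key: i + 1 for i, key in enumerate(sorted(natural_keys))}
```

In a real pipeline the mapping would be persisted so that re-runs and slowly changing dimensions keep their keys stable.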

The fact table focuses on the results of a single business process. A single fact table row has a one-to-one relationship to a measurement event as described by the fact table’s grain. Thus a fact table design is entirely based on a physical activity and is not influenced by the demands of a particular report. Within a fact table, only facts consistent with the declared grain are allowed. In this project, the information about the visitor is the fact. The fact table is transactional, with each row corresponding to a measurement event at a point in space and time. It is also a factless fact table, as the event merely records a set of dimensional entities coming together at a moment in time. Factless fact tables can also be used to analyze what didn’t happen. These queries always have two parts: a factless coverage table that contains all the possibilities of events that might happen, and an activity table that contains the events that did happen. When the activity is subtracted from the coverage, the result is the set of events that did not happen. Each row corresponds to an event. The fact table contains foreign keys for each of its associated dimensions, as well as date stamps. Fact tables are the primary target of computations and dynamic aggregations arising from queries.
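The coverage-minus-activity pattern described above can be illustrated with plain Python sets; the table contents here are made up for the example:

```python
# Factless fact tables: "what didn't happen" = coverage minus activity.

# Coverage table: all (visa_type, port) combinations that might occur (made up).
coverage = {("B1", "NYC"), ("B1", "LAX"), ("B2", "NYC"), ("B2", "LAX")}

# Activity table: combinations that actually occurred in the period (made up).
activity = {("B1", "NYC"), ("B2", "LAX")}

# Subtracting the activity from the coverage yields the non-events.
did_not_happen = coverage - activity
```

In SQL the same result would typically come from an anti-join between the coverage and activity tables.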

( http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf )

Step 4: Run Pipelines to Model the Data

4.1 Create the Data Model

Build the data pipelines to create the data model.

4.2 Data Quality Checks

Explain the data quality checks you’ll perform to ensure the pipeline ran as expected. These could include:

exploringUsingPyspark.ipynb contains the workings for tasks 4.1 and 4.2.

4.3 Data Dictionary

datadictionary.md has the data model.


How to Become a Data Engineer

Nanodegree program.

Understand the latest AWS features used by data engineers to design and build systems for collecting, storing, and analyzing data at scale.


Estimated time

At 5-10 hrs/week

March 22, 2023

Get access to classroom immediately on enrollment

Skills acquired

Apache Airflow, AWS Glue, Apache Spark, Redshift, Amazon S3

What you will learn


Data Engineering with AWS

You’ll master the AWS data engineering skills necessary to level up your tech career. Learn data engineering concepts like designing data models, building data warehouses and data lakes, automating data pipelines, and managing massive datasets.

Prerequisite knowledge

It is recommended that learners have intermediate Python, intermediate SQL, and command line skills.

Data Modeling

Learners will create relational and NoSQL data models to fit the diverse needs of data consumers. They’ll also use ETL to build databases in Apache Cassandra.

Cloud Data Warehouses

In this data engineering course, learners will create cloud-based data warehouses. They will sharpen their data warehousing skills, deepen their understanding of data infrastructure, and be introduced to data engineering on the cloud using Amazon Web Services (AWS).

Spark and Data Lakes

Learners will build a data lake on AWS and a data catalog following the principles of data lakehouse architecture. They will learn about the big data ecosystem and the power of Apache Spark for data wrangling and transformation. They’ll work with AWS data tools and services to extract, load, process, query, and transform semi-structured data in data lakes.

Automate Data Pipelines

This data engineer training dives into the concept of data pipelines and how learners can use them to accelerate their careers. This course will focus on applying the data pipeline concepts learners will learn through an open-source tool from Airbnb called Apache Airflow. This course will start by covering concepts including data validation, DAGs, and Airflow and then venture into AWS quality concepts like copying S3 data, connections and hooks, and Redshift Serverless. Next, learners will explore data quality through data lineage, data pipeline schedules, and data partitioning. Finally, they’ll put data pipelines into production by extending Airflow with plugins, implementing task boundaries, and refactoring DAGs.

All our programs include

With real-world projects and immersive content built in partnership with top-tier companies, you’ll master the tech skills companies want.

Our knowledgeable mentors guide your learning and are focused on answering your questions, motivating you, and keeping you on track.

You’ll have access to Github portfolio review and LinkedIn profile optimization to help you advance your career and land a high-paying role.

Flexible learning program

Tailor a learning plan that fits your busy life. Learn at your own pace and reach your personal goals on the schedule that works best for you.

Program offerings


Class content


Student services


Succeed with personalized services.

We provide services customized for your needs at every step of your learning journey to ensure your success.


Get timely feedback on your projects.


Learn with the best.


Amanda Moran

Amanda is a developer advocate for DataStax after spending the last 6 years as a software engineer on 4 different distributed databases. Her passion is bridging the gap between customers and engineering. She has degrees from the University of Washington and Santa Clara University.


Ben Goldberg

In his career as an engineer, Ben Goldberg has worked in fields ranging from computer vision to natural language processing. At SpotHero, he founded and built out their data engineering team, using Airflow as one of the key technologies.


Valerie Scarlata

Valerie is a curriculum manager at Udacity who has developed and taught a broad range of computing curriculum for several colleges and universities. She was a professor and software engineer for over 10 years specializing in web, mobile, voice assistant, and social full-stack application development.


Matt Swaffer

Matt is a software and solutions architect focusing on data science and analytics for managed business solutions. In addition, Matt is an adjunct lecturer, teaching courses in the computer information systems department at the University of Northern Colorado where he received his PhD in Educational Psychology.


Sean Murdock

Sean currently teaches cybersecurity and DevOps courses at Brigham Young University Idaho. He has been a software engineer for over 16 years. Some of the most exciting projects he has worked on involved data pipelines for DNA processing and vehicle telematics.

Top student reviews


Get started today

Learn the high-impact AWS skills that a data engineer uses on a daily basis.

Average Time

On average, successful students take 4 months to complete this program.

Benefits include

Related programs

AWS Cloud Architect

Build confidence planning, designing, and creating high availability cloud infrastructure.

Data Streaming

Learn how to stream data to unlock key insights in real-time.

Program details

Program overview: Why should I take this program?

Why should I enroll?

The data engineering field is expected to continue growing rapidly over the next several years, and there’s huge demand for data engineers across industries. Udacity has collaborated with industry professionals to offer up-to-date learning content that can advance your data engineering career.

By the end of the Nanodegree program, you will have an impressive portfolio of real-world projects and valuable hands-on experience.

What jobs will this program prepare me for?

This program is designed to teach you how to become a data engineer. These skills will prepare you for jobs such as analytics engineer, big data engineer, data platform engineer, and others. Data engineering skills are also helpful for adjacent roles, such as data analysts, data scientists, machine learning engineers, or software engineers.

How do I know if this program is right for me?

This Nanodegree program offers an ideal path for experienced programmers to advance their data engineering careers. If you enjoy solving important technical challenges and want to learn to work with massive datasets, this is a great way to get hands-on practice.

Enrollment and admission

Do I need to apply? What are the admission criteria?

There is no application. This Nanodegree program accepts everyone, regardless of experience and specific background.

What are the prerequisites for enrollment?

The Data Engineering with AWS Nanodegree program is designed for learners with intermediate Python, intermediate SQL, and command line skills.

In order to successfully complete the program, learners should be comfortable with the following concepts:

If you need to sharpen your prerequisite skills, try one of the programs below:

If I do not meet the requirements to enroll, what should I do?

To prepare for this program learners are encouraged to enroll in one of the following programs:

Tuition and term of program

How is this Nanodegree program structured?

The Data Engineering with AWS Nanodegree program has 4 courses with 4 projects. We estimate that students can complete the program in 4 months working 5-10 hours per week.

Each project will be reviewed by the Udacity reviewer network. Feedback will be provided and if you do not pass the project, you will be asked to resubmit the project until it passes.

How long is this Nanodegree program?

Access to this Nanodegree program runs for the length of time specified above. If you do not graduate within that time period, you will continue learning with month-to-month payments. See the Terms of Use and FAQs for other policies regarding the terms of access to our Nanodegree programs.

Can I switch my start date? Can I get a refund?

Please see the Udacity Program FAQs for policies on enrollment in our programs.

Software and hardware: What do I need for this program?

What software and versions will I need in this program?

There are no software and version requirements to complete this Nanodegree program. All coursework and projects can be done via Student Workspaces in the Udacity online classroom.


A site about data science, machine learning, big data and various applications.

... Udacity Data Engineering capstone project.

In this long post I present the project I developed for Udacity's Data Engineering Nanodegree (DEND). What to build was the developer's free choice, provided certain criteria were met, for example working with a database of at least 3 million records.

This is the first notebook of the project; the second contains examples of queries that can be run on the data lake.

Data lake with Apache Spark ¶

Data engineering capstone project ¶

Project summary ¶

The Organization for Tourism Development ( OTD ) wants to analyze migration flux in the USA, in order to find insights to significantly and sustainably develop tourism in the USA.

To support their core idea they have identified a set of analysis/queries they want to run on the raw data available.

The project deals with building a data pipeline, to go from raw data to the data insights on the migration flux.

The raw data are gathered from different sources, saved in files and made available for download.

The project shows the execution and decisional flow, specifically:

1. Scope of the Project ¶

The OTD wants to run pre-defined queries on the data at periodic intervals.

They also want to maintain the flexibility to run different queries on the data, using BI tools connected to an SQL-like database.

The core data is the dataset provided by US government agencies from requests for admission into the USA (the I94 form).

They also have other, lower-value data available that are not part of the core analysis and whose use is unclear; these are therefore stored in the data lake for possible future use.

1.1 What data ¶

Following datasets are used in the project:

1.2 What tools ¶

Because of the nature of the data and of the analysis to be performed (not time-critical; monthly or weekly batches), the choice fell on a cheaper S3-based data lake with on-demand, on-the-fly analytical capability: an EMR cluster with Apache Spark , and optionally Apache Airflow for scheduled execution (not implemented here).

The architecture shown below has been implemented.


1.3 The I94 immigration data ¶

The data are provided by the US National Tourism and Trade Office . It is a collection of all I94 forms that were filed in 2016.

1.3.1 What is an I94? ¶

To give some context, it is useful to explain what an I94 form is.

From the government website : “The I-94 is the Arrival/Departure Record, in either paper or electronic format, issued by a Customs and Border Protection (CBP) Officer to foreign visitors entering the United States.”

1.3.2 The I94 dataset ¶

Each record contains these fields:

More details in the file I94_SAS_Labels_Descriptions.SAS

1.3.3 The SAS date format ¶

A SAS date represents any date D0 as the number of days between D0 and the 1st of January 1960.
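The conversion can be sketched with the Python standard library:

```python
from datetime import date, timedelta

# SAS dates count days from the reference date 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)

def sas_to_date(days: int) -> date:
    """Convert a SAS numeric date into a Python date."""
    return SAS_EPOCH + timedelta(days=int(days))
```

For example, a SAS date of 0 is 1960-01-01, and 366 is 1961-01-01 (1960 being a leap year).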

1.3.4 Loading I94 SAS data ¶

The package saurfang:spark-sas7bdat:2.0.0-s_2.11 and the dependency parso-2.0.8 are needed to read SAS data format.

To load them, use the config option spark.jars and give the URLs of the repositories, as Spark itself wasn’t able to resolve the dependencies.
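A sketch of the session setup, with the jar URLs supplied explicitly; the `<repo>` placeholders stand in for the actual repository locations, and this configuration fragment is shown for illustration rather than as tested code:

```python
from pyspark.sql import SparkSession

# Pass the jar locations directly via spark.jars, since automatic dependency
# resolution was not working in this setup. <repo> is a placeholder for the
# real repository URL.
spark = (
    SparkSession.builder
    .config("spark.jars",
            "https://<repo>/spark-sas7bdat-2.0.0-s_2.11.jar,"
            "https://<repo>/parso-2.0.8.jar")
    .getOrCreate()
)

# The sas7bdat files can then be read with the saurfang data source:
df = spark.read.format("com.github.saurfang.sas.spark").load("i94_apr16_sub.sas7bdat")
```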

1.4 World temperature data ¶

The dataset is from Kaggle. It can be found here .

The dataset contains temperature data:

land temp

1.5 Airport codes data ¶

This is a table of airport codes and information on the corresponding cities, like GPS coordinates, elevation, country, etc. It comes from the Datahub website .

airport codes

1.6 U.S. City Demographic Data ¶

The dataset comes from OpenSoft. It can be found here .

us city demo

2. Data Exploration ¶

In this chapter we proceed to identify data quality issues, like missing values, duplicate data, etc.

The purpose is to identify the flow in the data pipeline to programmatically correct data issues.

In this step we work on local data.

2.1 The I94 dataset ¶

2.2 I94 SAS data load ¶

To read SAS data format I need to specify the com.github.saurfang.sas.spark format.

Most columns contain categorical data, meaning the information is coded; for example, in I94CIT=101 , 101 is the country code for Albania.

Other columns represent integer data.

It appears clear that there is no need to store data that are defined as double => let’s change those fields to integer

Verifying the schema is correct.

These fields come in a simple string format. To be able to run time-based queries they are converted to date type

A date in SAS format is simply the number of days between the chosen date and the reference date (01-01-1960)

2.3 Explore I94 data ¶

I want to know the 10 most represented nations

The i94res code 135, where the highest number of visitors come from, corresponds to the United Kingdom, as can be read in the accompanying file I94_SAS_Labels_Descriptions.SAS

New York City port registered the highest number of arrivals.

2.4 Cleaning the I94 dataset ¶

These are the steps to perform on the I94 database:

The number of nulls equals the number of rows, meaning there is at least one null in each row of the dataframe.

There are many nulls in many columns.

The question is, if there is a need to correct/fill those nulls.

Looking at the data, it seems like some fields have been left empty for lack of information.

Because these are categorical data there is no use, at this step, in assigning arbitrary values to the nulls.

The nulls are not going to be filled a priori, but only if a specific need comes up.
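In the notebook the per-column null counts come from Spark aggregations; the same idea can be illustrated in plain Python with made-up rows:

```python
# Plain-Python illustration of the per-column null count used to assess the
# I94 data (the rows below are invented examples).
rows = [
    {"i94cit": 101, "gender": None},
    {"i94cit": None, "gender": "M"},
    {"i94cit": 135, "gender": "F"},
]

# For each column, count the rows where the value is missing.
null_counts = {
    col: sum(1 for row in rows if row[col] is None)
    for col in rows[0]
}
```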

Dropping duplicate rows

Checking if the number changed

No row has been dropped => no duplicated rows

This gives confidence in the consistency of the data

2.5 Store I94 data as parquet ¶

I94 data are stored in parquet format in an S3 bucket, partitioned using the fields year and month

2.6 The Airport codes dataset ¶

A snippet of the data

How many records?

There are no duplicates

We discover there are some null fields:

The nulls are in these columns:

No action taken to fill the nulls

Finally, let’s save the data in parquet format in our temporary folder mimicking the S3 bucket.

3. The Data Model ¶

The core of the architecture is a data lake , with S3 storage and EMR processing.

The data are stored into S3 in raw and parquet format.

Apache Spark is the tool elected for analytical tasks, therefore all data are loaded into Spark dataframe using a schema-on-read approach.

For SQL-style queries on the data, Spark temporary views are generated.

3.1 Mapping Out Data Pipelines ¶

data lineage

4. Run Pipeline to Model the Data ¶

4.1 Provision the AWS S3 infrastructure ¶

Reading credentials and configuration from file

Create the bucket if it doesn’t already exist

4.2 Transfer raw data to S3 bucket ¶

Transfer the data from current shared storage (currently Udacity workspace) to S3 lake storage.

A naive metadata system is implemented. It uses a json file to store basic information on each file added to the S3 bucket:
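A minimal sketch of such a metadata record; the field names are assumptions for illustration, since the actual file layout is not shown in the post:

```python
import json

# Hypothetical metadata entry recorded for each file added to the S3 bucket.
# Field names are assumptions, not the project's actual schema.
def make_metadata(filename: str, s3_key: str, rows: int) -> dict:
    return {"file": filename, "s3_key": s3_key, "rows": rows}

record = make_metadata("i94_apr16_sub.sas7bdat",
                       "raw/i94/i94_apr16_sub.sas7bdat",
                       3096313)
meta_json = json.dumps(record)  # this JSON line is appended to the metadata file
```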

These datasets are moved to the S3 lake storage:

4.3 EMR cluster on EC2 ¶

An EMR cluster on EC2 instances with Apache Spark preinstalled is used to perform the ELT work.

A 3-node cluster of m5.xlarge instances is configured by default in the config.cfg file.

If the performance requires it, the cluster can be scaled up to use more nodes and/or bigger instances.

After the cluster has been created, the steps that execute the Spark cleaning jobs are added to the EMR job flow; the steps are in separate .py files. These steps are added:

The cluster is set to auto-terminate by default after executing all the steps.

4.3.1 Provision the EMR cluster ¶

Create the cluster using the code in emr_cluster.py [Ref. 3] and emr_cluster_spark_submit.py , and set the steps to execute spark_script_1 and spark_script_2 .

These scripts have already been previously uploaded to a dedicated folder in the project’s S3 bucket, and are accessible from the EMR cluster.

The file spark_4_emr_codes_extraction.py contains the code for following paragraphs 4.3.1

The file spark_4_emr_I94_processing.py contains the code for following paragraphs 4.3.2, 4.3.3, 4.3.4

4.3.2 Coded fields: I94CIT and I94RES ¶

I94CIT, I94RES contain codes indicating the country where the applicant is born (I94CIT), or resident (I94RES).

The data is extracted from I94_SAS_Labels_Descriptions.SAS . This can be done sporadically or every time a change occurred, for example a new code has been added.

The conceptual flow below was implemented.

data transform

The first steps are to define credentials to access S3a, then load the data into a dataframe, in a single row

Find the section of the file where I94CIT and I94RES are specified.

It starts with I94CIT & I94RES and finishes with the semicolon character.

To match the section, it is important to have the complete text in a single row; I did this using the option wholetext=True in the previous dataFrame read operation

Now I can split it into a dataframe with multiple rows

I filter the rows with the structure <code> = <country>

And then create two different columns, with code and country

I can finally store the data in a single file in json format
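The matching logic above can be sketched on a miniature stand-in for I94_SAS_Labels_Descriptions.SAS ; the snippet below is an invented excerpt with the same shape, not the real file:

```python
import re

# Invented miniature excerpt shaped like I94_SAS_Labels_Descriptions.SAS.
sample = """value i94cntyl
   /* I94CIT & I94RES - country codes */
   101 = 'ALBANIA'
   135 = 'UNITED KINGDOM'
;"""

# Keep only the rows shaped like:  <code> = '<country>'
codes = dict(re.findall(r"(\d+)\s*=\s*'([^']+)'", sample))
```

In the real job the same kind of pattern is applied by Spark to the section of the labels file matched between `I94CIT & I94RES` and the closing semicolon.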

4.3.3 Coded field: I94PORT ¶

Similarly to extract the I94PORT codes

The complete code for codes extraction is in spark_4_emr_codes_extraction.py

4.3.4 Data cleaning ¶

The cleaning steps have already been shown in section 2; here they are only summarized

4.3.5 Save clean data (parquet/json) to S3 ¶

The complete code, refactored and modularized, is in spark_4_emr_I94_processing.py

As a side note, saving the test file as parquet takes about 3 minutes on the provisioned cluster. The complete script execution takes 6 minutes.

4.3.6 Loading, cleaning and saving airport codes ¶

4.4 Querying data on-the-fly ¶

The data in the data lake can be queried in place; that is, the Spark cluster on EMR operates directly on the S3 data.

There are two possible ways to query the data:

We see examples of both programming styles.

These are some typical queries that are run on the data:

The queries are collected in the Jupyter notebook Capstone project 1 – Querying the data lake.ipynb

4.5 Querying data using the SQL querying style ¶

4.6 Data quality checks ¶

The query-in-place concept implemented here uses a very short pipeline: data are loaded from S3 and, after a cleaning process, saved as parquet. Quality of the data is guaranteed by design.

5. Write Up

The project has been set up with scalability in mind. All components used, S3 and EMR, offer a high degree of scalability, both horizontal and vertical.

The tool used for the processing, Apache Spark, is the de facto tool for big data processing.

To achieve such a level of scalability we sacrificed processing speed. A data warehouse solution with a Redshift database, or an OLAP cube, would answer the queries faster. However, nothing prevents adding a DWH to stage the data in case of a more intensive, real-time-responsive usage of the data.

An important part of an ELT/ETL process is automation. Although it has not been touched on here, I believe the code developed here lends itself to automation with reasonably small effort. A tool like Apache Airflow can be used for the purpose.

Scenario extension

In an increased-data scenario, the EMR hardware needs to be scaled up accordingly. This is done by simply changing the configuration in the config.cfg file. Apache Spark is built for big data processing, and is already used as the project's analytics tool.

In this case an orchestration tool like Apache Airflow is required: a DAG that triggers the Python scripts and Spark job executions needs to be scheduled for daily execution at 7 am.

The results of the queries for the dashboard can be saved in a file.

A proper database was not used; instead, Amazon S3 stores the data, which is queried in place. S3 is designed with massive scale in mind and is able to handle sudden traffic spikes. Therefore, having the data accessed by many people should not be an issue.

As programmed, the project provisions an EMR cluster for any user who plans to run queries. 100+ EMR clusters is probably going to be expensive for the company, so a more efficient sharing of processing resources must be realized.

6. Lessons learned

EMR 5.28.1 uses Python 2 as default

Adding jar packages to Spark

For some reason, adding the packages in the Python program when instantiating the SparkSession does not work (error message: package not found). This does not work:

The packages must be added in the spark-submit:
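A sketch of the two approaches (the spark-sas7bdat coordinates are an assumption, a package commonly needed to read the I94 SAS data; substitute whatever your job requires):

```shell
# Inside the program, this failed on EMR with "package not found":
#   spark = SparkSession.builder \
#       .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
#       .getOrCreate()

# Passing the packages to spark-submit works instead:
spark-submit --packages saurfang:spark-sas7bdat:2.0.0-s_2.11 spark_4_emr_I94_processing.py
```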

Debugging Spark on EMR

Even when everything works locally, it does not necessarily mean it is going to work on the EMR cluster. Debugging the code is easier over SSH on EMR.

Reading an S3 file from Python is tricky

While reading with Spark is straightforward (one just needs to give the address s3://….), from plain Python boto3 must be used.
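A minimal boto3 sketch, assuming AWS credentials are already configured; the bucket and key are whatever your job needs:

```python
def read_s3_text(bucket, key):
    """Read an S3 object into a string with boto3."""
    import boto3  # imported inside so the sketch can be defined without an active AWS session
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read().decode("utf-8")

# e.g. text = read_s3_text("my-bucket", "i94res.csv")
```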

Transferring files to S3

During the debugging phase, when the code on S3 must be changed many times, using the web interface is slow and impractical (every replacement means confirming a permanent delete). Memorize this command: aws s3 cp <local file> <s3 folder>

Removing the content of a directory from Python

import shutil

# Spark's embedded Hive metastore leaves a local 'metastore_db' directory behind;
# shutil.rmtree removes the directory and all of its contents.
dirPath = 'metastore_db'
shutil.rmtree(dirPath)

7. References


Dai Dao

Jul 26, 2019

What I learned from finishing the Udacity Data Engineering Capstone

This is my first experience with real Data Engineering. My tasks were to start with a lot of raw data from different sources, seemingly unrelated to each other, and somehow come up with a use case and design a data pipeline that could support those use cases. Time to put my data creativity hat on. The data potential is HUGE in this age.

Sources of data provided by Udacity:
- US Immigration
- World Temperature
- US Demographics
- International Airport Codes

The majority of Data Science work depends on what kind of data you have access to, and on how to draw out the most relevant data attributes that can support your use cases and let the ML algorithms do their magic. I’ve read about a lot of very creative uses of data to do crazy inferences, like using Google house images and street views to infer potential car accidents at certain locations. Another crazy use case is combining traffic data with car purchases, weather patterns, and traffic light patterns to predict traffic. What I’m most interested in is Cybersecurity and Privacy use cases, which I will talk about in my next project. Note that all these data are in different forms and come from different sources. It would require a lot of Data Engineering work to extract, transform and load them into a usable data warehouse for analytics to happen.

The data sources provided by Udacity are not that interesting to me per se, but they make very good practice and preparation for my next project. Using the Snowflake schema, I designed 8 normalized tables, such as airports, immigrants, state code, weather, etc … I then used these normalized tables to design 2 analytics tables that are useful to consume, either for analytics or decision making, such as airport weather and immigration demographics. During the process I had to deal with missing data, duplicates, malformed data, and so on.

Now that we have got the data modelling out of the way, we’ll move on to my favourite topic: technologies. Particularly AWS, Spark and Airflow.

AWS Infrastructure

First off, I’m particularly impressed at AWS CloudFormation, it’s such an easy and convenient way to deploy infrastructure with minimal effort, without spending hours clicking around multiple services trying to link them together. The configuration syntax is very intuitive and easy to learn.

An EC2 instance is used to host the Airflow server, and an RDS instance is used as the Airflow database, to store metadata about DAG runs, task completion, etc … Airflow is used to orchestrate the creation / termination of the EMR cluster. The basic workflow is: at the start of the DAG, create the EMR cluster and wait for completion. Once completed, start submitting Spark jobs to the cluster using the Apache Livy REST API. When all jobs are completed, terminate the cluster.
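The Livy submission step can be sketched roughly like this (the host, bucket, and script path are placeholders; only the payload construction is exercised here):

```python
import json
from urllib import request

def build_batch_payload(app_file, args=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": app_file}
    if args:
        payload["args"] = args
    return payload

def submit_spark_job(livy_host, payload):
    """POST a batch job to Livy (8998 is Livy's default port); returns the batch id."""
    req = request.Request(
        f"http://{livy_host}:8998/batches",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # needs a reachable Livy server
        return json.load(resp)["id"]

# Hypothetical job: a PySpark script already uploaded to S3
payload = build_batch_payload("s3://my-bucket/etl_job.py", args=["2019-07-26"])
```

An Airflow task can call submit_spark_job, and a sensor can then poll Livy's GET /batches/{id}/state endpoint until the job finishes.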

I divided the workflow into 3 separate DAGs, which communicate with each other through Airflow Variables and SensorOperators.

Few things to note:

There are lots of nuances in configuring Airflow and AWS, which I’ve learnt the hard way through doing this project. It’s really good preparation for me to embark on my own data journey. Stay tuned for my next project.

Project repo: https://github.com/dai-dao/udacity-data-engineering-capstone


Udacity Data Engineering Nanodegree Review in 2023- Pros & Cons


Are you planning to enroll in the Udacity Data Engineering Nanodegree? If yes, read this latest Udacity Data Engineering Nanodegree Review with its pros and cons. This review will help you decide whether this Nanodegree program is good for you or not.

So, without further ado, let’s get started-

Udacity Data Engineering Nanodegree Review

Before we dive into the Udacity Data Engineering Nanodegree Review, I would like to clear up one thing-

Udacity Data Engineering Nanodegree is not a beginner-level program; it is an intermediate-to-advanced data engineering course. So, if you don’t have previous Python and SQL knowledge, don’t directly enroll in this program.

In this case, you can check Programming for Data Science with Python .

Now, let’s see the Pros and Cons of the Udacity Data Engineering Nanodegree-

Pros and Cons of Udacity Data Engineering Nanodegree

So these are the pros and cons of the Udacity Data Engineering Nanodegree. Now let’s see what the content of the Udacity Data Engineering Nanodegree is like and what projects are covered throughout the Nanodegree program.

How are the Content & Projects of the Udacity Data Engineering Nanodegree?


The Udacity Data Engineering Nanodegree has 5 courses and 6 projects. Each course has 3-4 lessons and 1-2 course projects. You need to submit these guided projects after completing the course, and reviewers contracted by Udacity review your projects.

Due to its practical approach, you will get to learn various new things, because when you implement something by yourself, your understanding becomes stronger.

These are the 5 courses in Udacity Data Engineering Nanodegree –

Course 1. Data Modeling


This is the first course where you will learn how to create NoSQL and relational data models to fill the needs of data consumers. You will also learn how to choose the appropriate data model for a given situation. Each course has some lessons. There are three lessons in the first course.

In the first lesson, you will learn the fundamentals of data modeling and how to create a table in Postgres and Apache Cassandra .

In the second lesson, concepts of normalization and denormalization will be introduced with hands-on projects. And you will also know the difference between OLAP and OLTP databases.

The third lesson of this course will teach you when to use NoSQL databases and how they differ from relational databases. You will also learn how to create a NoSQL database in Apache Cassandra .


Project Details

There are two projects in this first course: Data Modeling with Postgres and Data Modeling with Apache Cassandra.

In these projects, you have to model user activity data for a music streaming app called Sparkify . For this, you have to create a database and ETL pipeline , in both Postgres and Apache Cassandra , designed to optimize queries for understanding what songs users are listening to.

Course 2. Cloud Data Warehouses


The second course is focused on data warehousing, specifically on AWS. You will also learn various concepts like Kimball, Inmon, hybrid architectures, OLAP vs OLTP, data marts, etc. Some of the AWS tools that you’ll be using here are IAM, S3, EC2, and RDS instances.

There are three lessons in this course. In the first lesson, you will understand Data Warehousing architecture , how to run an ETL process to denormalize a database (3NF to Star) , how to create an OLAP cube from facts and dimensions , etc.

The second lesson will help you to understand cloud computing and teach you how to create an AWS account and understand their services , and how to set up Amazon S3, IAM, VPC, EC2, and RDS PostgreSQL .

In the third lesson, you will learn how to implement Data Warehouses on AWS . You will also learn how to identify components of the Redshift architecture, how to run the ETL process to extract data from S3 into Redshift , and how to set up AWS infrastructure using Infrastructure as Code (IaC).

In this course, there is one project where you have to build a cloud data warehouse to find insights into what songs their users are listening to. And for this, you have to build an ELT pipeline that extracts their data from S3, stages them in Redshift, and transforms data into a set of dimensional tables.

Course 3. Spark and Data Lakes


This course provides an introduction to Apache Spark and Data Lakes. In this course, you will learn how to use Spark to work with massive datasets and how to store big data in a data lake and query it with Spark . You will also learn concepts such as distributed processing, storage, schema flexibility, and different file formats.

There are four lessons. In the first lesson, you will learn more about Spark and understand when to use Spark and when not to use it.

The second lesson will teach you data wrangling with Spark and how to use Spark for ETL purposes. In the third lesson, you will learn about debugging and optimization and how to troubleshoot common errors and optimize your code using the Spark WebUI.

The fourth lesson is all about data lakes and teaches you how to implement data lakes on Amazon S3, EMR, Athena, and AWS Glue. You will also understand the components and issues of data lakes.

Project Details

In this course, there is one project where you have to create an ETL pipeline for a data lake, using data stored in AWS S3 in JSON format. And for this, you have to load data from S3 , process the data into analytics tables using Spark, and load them back into S3.

Course 4. Automate Data Pipelines


In the fourth course, you’ll use all the technologies learned in the above 3 courses. This is an exciting course where you will get an introduction to Apache Airflow and how to schedule, automate, and monitor data pipelines using Apache Airflow.

There are three lessons in this course. In the first lesson, you will learn how to create data pipelines with Apache Airflow , how to set up task dependencies , and how to create data connections using hooks .

In the second lesson, you will learn about data quality such as partitioning data to optimize pipelines, writing tests to ensure data quality , tracking data lineage, etc.

There is one project in this course, where you have to build data pipelines with Airflow. In this project, you will work on the music streaming company’s data infrastructure by creating and automating a set of data pipelines. You have to configure and schedule data pipelines with Airflow and monitor and debug production pipelines.

5. Udacity Data Engineering Capstone Project

The last course is the capstone project, where you will combine all the technologies learned throughout the program and build a data engineering portfolio project.

In this Udacity Data Engineering Capstone Project , you have to gather data from several different data sources , transform, combine, and summarize it, and create a clean database for others to analyze. Throughout the project guidelines, suggestions, tips, and resources will be provided by Udacity.

Now, let’s see whether you should enroll in the Udacity Data Engineer Nanodegree program or not-

Should You Enroll in the Udacity Data Engineer Nanodegree?

Udacity Data Engineering Nanodegree is not a beginner-level program. This is the only intermediate-advanced data engineering course out there. This Nanodegree program requires the following skills before enrolling in the program-

Python –

If you are a beginner in Python, then don’t directly enroll in this program. To get 100% from the Udacity Data Engineering Nanodegree program, you need to know the following concepts-

Along with Python knowledge, you should be familiar with SQL concepts such as joins, aggregations, subqueries, and table definition and manipulation (CREATE, UPDATE, INSERT, ALTER).

If you meet these prerequisites, then you can enroll in the Udacity Data Engineering Nanodegree. If not, first learn Python and SQL.

Now let’s see the price and duration of the Udacity Data Engineering Nanodegree  program-

How Much Time and Money Do You Have to Spend on the Udacity Data Engineering Nanodegree?

According to Udacity, the  Udacity Data Engineering Nanodegree  program will take 5 months to complete if you spend 5-10 hours per week.

And for 5 months it costs more than $800. But Udacity offers two options: either pay the complete amount upfront, or pay monthly installments of $399/month.

So this is according to Udacity, but here I would like to tell you  how you can complete the full Udacity Data Engineering Nanodegree program in less time .

Excited to know…How?

So, let’s see-

How to Complete the Udacity Data Engineering Nanodegree In Less Time?

To complete the  Udacity Data Engineering Nanodegree  program in less time, you need to manage your time productively.

You need to plan your day before and create a to-do list for each day. And you need to spend a good amount of time daily on the program.

According to Udacity, you need to spend 10 hours per week to complete the whole program in 5 months.

That means you need to spend about 1.5 hours daily, but if you double the time and give 3 hours daily, then you can complete the whole Nanodegree program in less than 3 months.

For managing your time and avoiding any distractions, you can use the  Pomodoro  technique to increase your learning.

As I mentioned earlier, after each course you have to work on a project, and each project has a set of rubrics. So before starting a section, I would suggest you study the project’s rubrics first. The rubrics will give you a rough idea of which topics and lectures are important for the project, so that you can take notes while watching those lectures.

And you can also implement the project phases right after watching the related lecture. This way, you can save time by not watching one video two times: once at the time of learning and a second time when working on the project.

I hope these tips will help you to complete the Udacity Data Engineering Nanodegree program in less time. But you can also get a Udacity Scholarship.

Do you want to know…How?

So, let me tell you how to get Udacity Scholarship.

How to Get Udacity Scholarship?

To apply for a Udacity Scholarship, you need to go to their Scholarship page, which looks like this-


On this page, you have to find the scholarship for the program you want to enroll in. If you find your Nanodegree program on the list, then you need to apply for the scholarship by filling out these details-


After filling out these details, you need to click on the “Save and Submit“ button. By doing so, you have applied for a Udacity Scholarship. If you are selected, you will be notified via email.

But if your program is not listed in the scholarship section, then you can fill out this form on the Scholarship page-

how to get udacity scholarship

So, when a new scholarship becomes available, you will be notified. I hope you now understand how to apply for a Udacity scholarship.

The next important thing you need to know is who will teach you and what their qualifications are. So, let’s look at the instructors-

Are Instructors Experienced?

Learning from such experienced and knowledgeable instructors is amazing and helpful. This is the reason I personally love Udacity. Udacity also has its forums, where you can ask the instructors your doubts, and they will answer your queries.

Now I would like to mention some more Pros and Cons of the Udacity Data Engineering Nanodegree .

Pros of Udacity Data Engineering Nanodegree

Cons of Udacity Data Engineering Nanodegree

So the next and most important question is-

Is Udacity Data Engineer Nanodegree Worth It?

Yes, Udacity Data Engineering Nanodegree is Worth it because you will get a chance to advance your skills as a Data Engineer and work on various Real-world problems such as Data Modeling with Postgres and Apache Cassandra , building data pipelines with Airflow , etc. Along with that, you will get One-to-One Mentorship and a Personal career coach .

But I would suggest buying the Udacity Data Engineering Nanodegree when they are offering a discount on the program, because it is expensive compared to other online courses.

Most of the time, Udacity offers some discounts. When they offer a discount, it appears something like this-


When you click on “New Personalized Discount”, you will be asked to answer 2 questions.


After answering these two questions, press the “Submit Application” button. And then you will get a discount with a unique Coupon Code. Simply copy this code and paste it at the time of payment.

Udacity Discount

Now you can save money on the Udacity Data Engineering Nanodegree program. If you get the Udacity Data Engineering Nanodegree at a discount, then I would say it is totally worth it.

Final Thought


Udacity Data Engineering Nanodegree is good for you if you have intermediate-level Python and SQL knowledge and want to advance your skills as a Data Engineer, and if you want hands-on practice and believe in learning “how” to do things like ETL and data warehousing.

Student Reviews

I’ve learned, and applied, skills that I’d always wanted to get to grips with. After hearing so much about relational and NoSQL databases, it’s satisfying to become confident building them with PostgreSQL and Apache Cassandra in only the first two projects! — Rob R.

My first project was very challenging for me, but the reviewers and my mentor helped me constantly in improving my code and documentation and guided me well towards finishing the project. It was a great learning experience and I am looking forward to the other projects as well. — Nitheesha T.

The program is good, but you need to look at the dataset in the first project. I’m not completely new to ETL, but to have a requirement “Sparkify are interested in the songs their users are listening to”, then a dataset that doesn’t get near that must be confusing for real newbies. — Nicole S.

Now it’s time to wrap up this Udacity Data Engineering Nanodegree Review .

I hope this Udacity Data Engineering Nanodegree Review helped you decide whether to enroll in the Udacity Data Engineering Nanodegree or not.

If you found this Udacity Data Engineering Nanodegree Review helpful, you can share it with others. And if you have any doubts or questions, feel free to ask me in the comment section.

All the Best!

Yes, Many Nanodegree graduates have gotten jobs. Udacity has surveyed over 4,200 Udacity students and the survey results showed that nearly  70%  of Udacity students surveyed indicated that a Nanodegree program helped them advance their careers. And you can check  Top Companies that Hired Udacity Graduates here .

The  Dice 2020 Tech Job Report  labeled data engineer as the fastest-growing job in technology in 2019, with a 50% year-over-year growth in the number of open positions. The report also found it takes an average of 46 days to fill data engineering roles and predicted that the time to hire Data Engineers may increase in 2020  “as more companies compete to find the talent they need to handle their sprawling data infrastructure.”

According to  Indeed , the average salary of a Data Engineer is  $129,001  per year in the United States and a $5,000 cash bonus per year.

Yes, you can mention Udacity projects in your resume. That’s the main objective of Udacity’s projects: to make your portfolio stronger so that you can get jobs.




Udacity Data Engineering Nanodegree Review 2023: Be a Successful Data Engineer


In this Udacity Data Engineering nanodegree review, I will share my experience of taking this course along with its pros and cons.


Planning to enrol in Udacity’s Data Engineering Nanodegree? A few months back I had the same thoughts. I hope my feedback on Udacity’s Data Engineering nanodegree will help you make the right choice. Here’s my story.

I studied Computer Science Engineering and during college I got interested in the world of Data Science. So I oriented my internships toward this topic and learned the valuable and essential skills required for the job profiles of Data Scientist, Data Analyst, and Data Engineer, the three magic roles in the world of Data Science.

After I graduated from college, I received a job opportunity as a Data Engineer in a biotechnology company. Working in that company, I came to know about various tools and techniques that were used by them such as Airflow, Spark, Apache Cassandra, etc.

These tools and software attracted me all the more since my background was mostly in Software Engineering and Development. I realized that I had a knowledge gap in data modelling, data processing engines, and some of the best practices with data.

Since all of these topics were perfectly covered in the Data Engineering Nanodegree from Udacity, I decided to apply. 

In this Udacity Data Engineering nanodegree review, I am going to talk about the syllabus in detail and my project experience.

So keep reading..

If you buy the course through the links in this article, we may earn some affiliate commission. This helps us keep this blog up and running for your benefit.

Udacity data engineering nanodegree certificate

Table of Contents

How Much Did I Pay For Udacity’s Data Engineering Nanodegree?

Talking about the pricing, well, I was not that lucky. At the time I wanted to enroll, there was no valid offer or discount available. That’s the reason why I paid the full price for a 5-month subscription.

It was €1,400, which is around $1,600. Without an offer it is very overpriced, since you are paying for only 5 months, assuming you don’t finish earlier.

You have access to paid services, but I don’t recommend paying such a high price. If you get a discount of 50% or even 70%, go for it! The outcome will be worth it.

Cost: $399/month

Duration: 5 months (5 hours/week)

Also Read: My experience with Udacity Nanodegree

Let me shed some light on the course structure and projects in this Udacity Data Engineering Nanodegree review

Which Topics Are Covered In Udacity’s Data Engineer Nanodegree?

In general, I liked the overall syllabus and structuring of the modules with a realistic project.

They go from the basics to the most demanding and complex projects and you are always connected to the main topic.

Lesson 1: Introduction to Data Modeling

In this module, you will be learning the main and key differences between a relational and non-relational data model. It is a great start to the NoSQL world.

This is very important. You need to fully understand this concept to make good decisions when working as a Data Engineer in a team. 

Udacity data engineering nanodegree project 1

The required task was to create a table for our database. As you can see, the first query follows standard SQL: here we have to build up a table to fit our data model.

While in the second query, the objective is to create a table that fits the query. And that’s one of the main differences between SQL and NoSQL databases.
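As an illustration of the contrast (hypothetical tables, not the actual course queries): the relational DDL mirrors the data model, while the Cassandra DDL is shaped around one specific query, here "songs played in a session".

```python
# Relational style: the table follows the data model (normalized).
postgres_ddl = """
CREATE TABLE songplays (
    songplay_id SERIAL PRIMARY KEY,
    user_id INT,
    song_id TEXT,
    session_id INT
);
"""

# NoSQL style: the table follows the query -- the partition key is chosen
# so that "songs played in a session" is a single-partition read.
cassandra_ddl = """
CREATE TABLE songs_by_session (
    session_id INT,
    item_in_session INT,
    song TEXT,
    PRIMARY KEY (session_id, item_in_session)
);
"""
```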

Also, see – Udacity Data Analyst Nanodegree Review

Project 1: Data Modeling with Postgres

This is the first project of the Nanodegree. From this project onwards, you are required to model the data using a schema pattern common in Business Intelligence, where you have 2 dimension tables and one fact table.

This project uses a relational data model. It is a beginner-friendly project where you can get familiar with Udacity’s working methodology.

Data Modeling with Postgres, project workspace

If you have experience working with SQL and Python, this project will be pretty straightforward and simple for you. You are just required to read the documentation of the Python library psycopg2.

Project 2: Data Modelling with Apache Cassandra

So, the second project is Data Modelling with Apache Cassandra. In this project, you need to do the same task that you performed in the previous one, but here you will have to follow and implement the best practices for creating a NoSQL database.

The results of these two projects might be the same. You will learn how to do the same task using the best practices and in different ways.

I liked the fact that the difficult part of the projects was not the database or the data. It was the steps you need to follow to manipulate the data and to change your mindset: “It’s NoSQL, don’t think in SQL”.

Data modeling with Apache Cassandra, Project workspace

Lesson 2: Cloud Data Warehouses

So, lesson 2 is about the Cloud and Data Warehouses. In this lesson you will get in touch with cloud computing using AWS Infrastructure. You will learn about EC2 machines, S3 buckets, and RedShift.

Later, at some points when I was learning on my own, I would end up experimenting with AWS myself, thanks to what I did and learned from this course.

Also, I would like to let you know in this Udacity data engineer review that Udacity gives you $25 in credits to use inside AWS. Cool, isn’t it?

Implementing Data Warehouses on AWS

Project: Build a Cloud Data Warehouse

So, the project is to build a cloud data warehouse on AWS. To do so, you will need to orchestrate the interaction between S3 buckets and your Redshift database.

Let me be honest in this Udacity data engineer nanodegree review: this project was groundbreaking for me, because after this stage I was no longer afraid of building my own cloud data warehouse for my projects.

Figure 6 – Data Warehouse, project workspace

You can also check out the screenshots above. You will need to configure and connect the AWS instance to your project workspace.

Lesson 3: Spark and Data Lakes

Lesson three is Spark and Data Lakes. In this lesson, you will learn about the advantages of using Spark as your data processing engine.

You will learn to use it for cleaning and aggregating data. It is one of the most popular tools for working with data.

I was not aware of technologies like Spark and Hadoop. It was great to learn about them and the state-of-the-art technologies one should know in the Data Science industry.

The power of Spark

Project: Build a Data Lake

The project is to build a data lake and the task itself is quite simple. You will have to load data from S3 into in-memory tables and then back to S3 after modeling the data. In this project, you will experience the power of Spark.

I found this project quite useful, but you can barely see how fast Spark is, or the big difference between plain Python and Spark, when modeling big datasets.

As you can see in the images, the project template description specifies that you will have to use Spark. This technology is placed on top of the other technologies and tools that you learned in the previous lectures. By the end, you will have built a different mindset for structuring and developing your future projects.

Data lake,project instructions

Lesson 4: Data Pipelines with Airflow

In this lesson you will learn about Airflow. It is a simple interface for monitoring and designing pipelines for our automated tasks.

This tool is critical and essential for you when you are focusing on designing pipelines, their development, and maintenance.

Data Pipelines

Project: Data Pipelines with Airflow

So, in this project you will need to use Airflow to structure and monitor your pipelines. You will also need to build a data warehouse similar to the project in lesson 2, but with the extra complexity of using Airflow.

I found this project quite challenging because of the additional knowledge required from the previous projects, and because you have to add Airflow on top of Redshift and S3 buckets. But in the end, it was very pleasing to complete this overall project.

Also, see – Review of Udacity’s Data Streaming Nanodegree

Data pipelines with airflow

Lesson 5: Capstone Project

Project: Data Engineering Capstone Project

This was the final project, the Capstone project, and I chose the topic myself: for it, I decided to create a report with agricultural data. This project covered an entire closed-loop scenario. For simplicity, I used just one Python notebook. Following the capstone template, the project can be divided into five steps: scope the project and gather data, explore and assess the data, define the data model, run the ETL to model the data, and complete the project write-up.

And the result is below.

Udacity data engineering nanodegree capstone project

So this is all about the lessons and the projects in this Udacity Data Engineering Nanodegree review.

How was my project experience?

Each project built on the previous one, adding one more step in complexity.

Personally, the most challenging part of the lessons and projects was the Airflow project, because it requires you to combine your knowledge of Apache Airflow with Redshift and S3 buckets. You can also add Spark to it if you wish.

How much time did it take me to complete this Nanodegree?

It took me a bit longer than expected to complete this Data Engineering Nanodegree. I enrolled in October 2020, but because of work I was not able to focus on the course until January. I finished the Nanodegree in March, paying about €80 (with an offer) for one extra month.

Honestly, it took me around 2 to 3 months of actual study to complete the Nanodegree. After my experience, I think that if someone is willing to go 100% for the course, they should choose a monthly payment plan instead of the 5-month subscription.

If you are studying this Nanodegree in parallel with anything else, then I strongly recommend the 5-month subscription, as speedrunning the projects will only make you struggle with the project requirements.

Also, see – My experience of Data Scientist Nanodegree

Udacity Features

Well, there are several features of Udacity that I should tell you about in this Udacity Data Engineering Nanodegree review.

First, the mentorship. I did not approach a mentor directly for advice, but I did have the chance to submit my GitHub repository and LinkedIn profile for review, and I don't think that is anything really special.

If you want to improve your GitHub repo or your LinkedIn profile, you can do 1 to 2 hours of research and visit more than 20 profiles; you will learn as much as a mentor would teach you. Still, the overall mentorship was very straightforward, and I appreciate the effort that the mentors put in.

The project reviewers were outstanding communicators. They reviewed our projects with a very positive attitude, marking the areas where improvement was required, especially when you did not pass the requirements.

I was crystal clear about what I wanted to do before, during, and after the course so I did not opt for the career services. 

Pros and Cons of Udacity’s Data Engineering Nanodegree

My experience with the Udacity Data Engineering Nanodegree was filled with positivity: the positivity that radiates when you are communicating with your mentor, and, of course, let's not forget the tools and technologies that Udacity lets you work with.

Nowadays, keeping the current situation and circumstances in mind, most of us are at home, learning from our bedrooms through online tutorials or platforms that mostly offer just a bunch of videos.

It is very reassuring to have personal reviews and monitoring, to ensure we are on the right path and not messing around with random tools.

Well, this Nanodegree is quite expensive if we think about the service itself: we are essentially buying videos and predefined guidelines.

In the end, it's worth it because of the knowledge you gain, the flexibility of studying, and the personal monitoring. Thinking just about the content, though, I am still a bit skeptical.

Is Udacity’s Data Engineer Nanodegree Worth it?

Before concluding this Udacity Data Engineering review, I want to ask you a few basic questions:

If yes, then you should go for this Data Engineering Nanodegree.

But you should search for an offer if you don't want to pay the full price of the Nanodegree, because $1,400 is too expensive. Even before finishing, I could see the difference when applying to Data Engineer job offers.

Hope you find this Udacity Data Engineer Nanodegree Review useful.


Alberto Barnes

I am a Data Engineer and a computer science student passionate about data science. I have 2 years of experience working for public and private entities.


Can I get a job with Udacity's Data Engineering Nanodegree?

I can say this Nanodegree has played a vital role in shaping my career as a Data Engineer. Today I am working at Glovo as a Data Engineer. If I can find a job, surely you can too.

What is the salary of a Data Engineer?

On average, data engineers earn up to $129,000 per year in the United States (source: Indeed).

Can I complete Udacity’s Data engineering nanodegree in a month?

It is possible to complete this Nanodegree in a month, provided you are already familiar with data engineering. A beginner should dedicate 3–4 months.

Udacity Data Engineer vs. Data Scientist Nanodegree: which one should I choose?

It all depends on which role you want to go with. I suggest first looking at the respective job roles and then deciding. Both are among Udacity's most-enrolled Nanodegrees.


Udacity’s Data Engineering Nanodegree Program – Ratings and Review!


In Udacity's Data Engineering Nanodegree Program, learn to design and build production-ready data infrastructure, an essential skill for advancing your data career.

Udacity's Data Engineering Nanodegree Program

Enroll in Udacity’s Data Engineering Nanodegree Program today!

Overview of the Udacity’s Data Engineering Nanodegree Program

In Udacity's Data Engineering Nanodegree Program, you'll learn to design data models, build data warehouses and data lakes, automate data pipelines, and work with massive datasets. At the end of the program, you'll combine your new skills by completing a capstone project.

Udacity’s Data Engineering Nanodegree Program Syllabus

1. Data Modeling

Learn to create relational and NoSQL data models to fit the diverse needs of data consumers. Use ETL to build databases in PostgreSQL and Apache Cassandra.

2. Cloud Data Warehouses

Sharpen your data warehousing skills and deepen your understanding of data infrastructure. Create cloud-based data warehouses on Amazon Web Services (AWS).

3. Spark and Data Lakes

Understand the big data ecosystem and how to use Spark to work with massive datasets. Store big data in a data lake and query it with Spark.

4. Data Pipelines with Airflow

Schedule, automate, and monitor data pipelines using Apache Airflow. Run data quality checks, track data lineage, and work with data pipelines in production.

5. Capstone Project

Combine what you’ve learned throughout the program to build your own data engineering portfolio project.

Instructors of Udacity’s Data Engineering Nanodegree Program

Here is a list of instructors associated with the Udacity Nanodegree Program:

1. Amanda Moran (Developer Advocate at Datastax)

Amanda is a Developer Advocate for DataStax after spending the last 6 years as a software engineer on 4 different distributed databases. Her passion is bridging the gap between customers and engineering. She has degrees from the University of Washington and Santa Clara University.

2. Ben Goldberg (Staff Engineer at Spothero)

In his career as an engineer, Ben Goldberg has worked in fields ranging from Computer Vision to Natural Language Processing. At SpotHero, he founded and built out their Data Engineering team, using Airflow as one of the key technologies.

3. Sameh El-Ansary (CEO at Novelari And Assistant Professor at Nile University)

Sameh is the CEO of Novelari and a lecturer at Nile University and the American University in Cairo (AUC), where he has lectured on security, distributed systems, software engineering, blockchain, and big data engineering.

4. Olli Iivonen (Data Engineer at Wolt)

Olli works as a Data Engineer at Wolt. He has several years of experience on building and managing data pipelines on various data warehousing environments and has been a fan and active user of Apache Airflow since its first incarnations.

5. David Drummond (VP of Engineering at Insight)

David is VP of Engineering at Insight where he enjoys breaking down difficult concepts and helping others learn data engineering. David has a PhD in Physics from UC Riverside.

6. Judit Lantos (Data Engineer at Split)

Judit was formerly an instructor at Insight Data Science helping software engineers and academic coders transition to DE roles. Currently, she is a Data Engineer at Split where she works on the statistical engine of their full-stack experimentation platform.

7. Juno Lee (Instructor)

As a data scientist, Juno built a recommendation engine to personalize online shopping experiences, computer vision and natural language processing models to analyze product data, and tools to generate insight into user behavior.

Review: “Definitely worth enrolling in!”

After successfully completing the Nanodegree program, you will find that the program is very comprehensive and teaches you everything promised by the program.

Plus, the certification is also really helpful since this Udacity Nanodegree Program is well-recognized amongst multiple industries.
