Udacity Data Engineering Nanodegree Capstone Project
Project Summary
The objective of this project was to create an ETL pipeline for the I94 immigration, global land temperatures, and US demographics datasets to form an analytics database on immigration events. A use case for this analytics database is to find immigration patterns to the US. For example, we could try to find answers to questions such as: do people from countries with warmer or colder climates immigrate to the US in large numbers?
Data and Code
All the data for this project was loaded into S3 prior to commencing the project. The exception is the i94res.csv file, which was loaded into the Amazon EMR HDFS filesystem.
In addition to the data files, the project workspace includes:
- etl.py - reads data from S3, processes that data using Spark, and writes processed data as a set of dimensional tables back to S3
- etl_functions.py and utility.py - these modules contain the functions for creating fact and dimension tables, data visualizations, and cleaning.
- config.cfg - contains configuration that allows the ETL pipeline to access AWS EMR cluster.
- Jupyter notebooks - the notebooks used for building the ETL pipeline.
- AWS EMR cluster
- Apache Spark
- configparser - the Python 3 configparser module is needed to run the Python scripts.
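Reading the configuration with Python's configparser can be sketched as follows; the [AWS] section and key names are assumptions for illustration, since the actual layout of config.cfg is not shown here:

```python
import configparser

# A minimal sketch of reading the AWS settings. The [AWS] section and the
# key names below are assumptions, not the project's actual config.cfg
# contents.
sample = """
[AWS]
AWS_ACCESS_KEY_ID = YOUR_KEY
AWS_SECRET_ACCESS_KEY = YOUR_SECRET
"""

config = configparser.ConfigParser()
config.read_string(sample)  # in etl.py this would be config.read("config.cfg")

aws_key = config["AWS"]["AWS_ACCESS_KEY_ID"]
aws_secret = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```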
The project follows these steps:
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up
To create the analytics database, the following steps will be carried out:
- Use Spark to load the data into dataframes.
- Exploratory data analysis of I94 immigration dataset to identify missing values and strategies for data cleaning.
- Exploratory data analysis of demographics dataset to identify missing values and strategies for data cleaning.
- Exploratory data analysis of global land temperatures by city dataset to identify missing values and strategies for data cleaning.
- Perform data cleaning functions on all the datasets.
- Create an immigration calendar dimension table from the I94 immigration dataset; this table links to the fact table through the arrdate field.
- Create a country dimension table from the I94 immigration and the global temperatures datasets. The global land temperatures data was aggregated at country level. The table links to the fact table through the country of residence code, allowing analysts to understand the correlation between the climate of the country of residence and immigration to US states.
- Create a US demographics dimension table from the US cities demographics data. This table links to the fact table through the state code field.
- Create fact table from the clean I94 immigration dataset and the visa_type dimension.
The technologies used in this project are Amazon S3 and Apache Spark. Data will be read and staged from the customer's repository using Spark.
Refer to the jupyter notebook for exploratory data analysis
3.1 Conceptual Data Model
The country dimension table is made up of data from the global land temperatures by city and the immigration datasets. The combination of these two datasets allows analysts to study correlations between global land temperatures and immigration patterns to the US.
The US demographics dimension table comes from the demographics dataset and links to the immigration fact table at US state level. This dimension would allow analysts to get insights into migration patterns into the US based on demographics as well as the overall population of states. We could ask questions such as: do populous states attract more visitors on a monthly basis? One envisions a dashboard that could be designed based on the data model, with drill-downs into granular information on visits to the US. Such a dashboard could foster a culture of data-driven decision making within tourism and immigration departments at state level.
The visa type dimension table comes from the immigration dataset and links to the immigration fact table via the visa_type_key.
The immigration fact table is the heart of the data model. This table's data comes from the immigration datasets and contains keys that link to the dimension tables. The data dictionary of the immigration dataset contains detailed information on the data that makes up the fact table.
3.2 Mapping Out Data Pipelines
The pipeline steps are as follows:
- Load the datasets
- Clean the I94 Immigration data to create Spark dataframe for each month
- Create visa_type dimension table
- Create calendar dimension table
- Extract clean global temperatures data
- Create country dimension table
- Create immigration fact table
- Load demographics data
- Clean demographics data
- Create demographic dimension table
Step 4: Run Pipelines to Model the Data
4.1 Create the Data Model
Refer to the Jupyter notebook for the data dictionary.
4.2 Running the ETL pipeline
The ETL pipeline is defined in the etl.py script, and this script uses the utility.py and etl_functions.py modules to create a pipeline that creates final tables in Amazon S3.
```
spark-submit --packages saurfang:spark-sas7bdat:2.0.0-s_2.10 etl.py
```
Udacity Data Engineering Capstone Project
- Post published: July 4, 2020
- Post category: Data Engineering / Machine Learning
The project follows these steps:
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up
The project is one provided by Udacity to showcase the learnings of the student throughout the program. There are four datasets as follows to complete the project.
- i94 Immigration Sample Data: sample data from the US National Tourism and Trade Office. This table is used for the fact table in this project.
- World Temperature Data (world_temperature): this dataset contains temperature data for various cities from the 1700s to 2013. It came from Kaggle. This table is not used because its data only runs up to 2013.
- U.S. City Demographic Data (us-cities-demographics): this dataset contains population details of all US cities and census-designated places, including gender and race information. It came from OpenSoft. The table is grouped by state to get aggregated statistics.
- Airport Codes: a simple table of airport codes and corresponding cities. Only the rows with IATA codes available are selected for this project.
The project builds a data lake using PySpark that can help support the analytics department of the US immigration department to query information extracted from all the sources. The conceptual data model is a factless, transactional star schema with dimension tables. Some examples of the information that can be queried from the data model include the number of visitors by nationality, visitors' main country of residence, their demographics, and flight information. Python is the main language used to complete the project. The libraries used to perform ETL are Pandas, PyArrow, and PySpark. The environment used is the workspace provided by Udacity. Immigration data was transformed from SAS format to parquet format using PySpark. These parquet files were ingested using PyArrow and explored using Pandas to gain an understanding of the data before building a conceptual data model. PySpark was then used to build the ETL pipeline. The data sources provided have been cleaned, transformed to create new features, and the resulting tables saved as parquet files. The two notebooks with all the code and output are as follows:
Describe and Gather Data
“Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” ( https://en.wikipedia.org/wiki/Form_I-94 ). It lists the traveler's immigration category, port of entry, date of entry into the United States, status expiration date, and a unique 11-digit identifying number assigned to it. Its purpose is to record the traveler's lawful admission to the United States ( https://i94.cbp.dhs.gov/I94/ ).
This is the main dataset. There is a file for each month of 2016 available in the directory ../../data/18-83510-I94-Data-2016/, in the SAS binary database storage format sas7bdat. This project uses the parquet files available in the workspace folder called sas_data. The data is for the month of April 2016, which has more than three million records (3,096,313). The fact table is derived from this table.
World Temperature Data
Data is from a newer compilation put together by Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. The original dataset from Kaggle includes several files ( https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data ), but for this project only GlobalLandTemperaturesByCity was analyzed. The dataset covers a long period of the world's temperature history (from 1743 to 2013). However, since the immigration dataset only has data for 2016, the vast majority of the data here is not suitable.
“Airport data includes the IATA airport code. An IATA airport code, also known as an IATA location identifier, IATA station code, or simply a location identifier, is a three-letter geocode designating many airports and metropolitan areas around the world, defined by the International Air Transport Association (IATA). IATA codes are used in passenger reservation, ticketing and baggage-handling systems” ( https://en.wikipedia.org/wiki/IATA_airport_code ). It was downloaded from a public domain source ( http://ourairports.com/data/ ).
U.S. City Demographic Data
This dataset contains information about the demographics of all US cities and census-designated places with a population greater or equal to 65,000. This data comes from the US Census Bureau’s 2015 American Community Survey. This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. The US City Demographics is the source of the STATE dimension in the data model and grouped by State.
Explore the Data: exploringusingpandas.ipynb shows the workings to assess and explore the data. The main findings and necessary cleaning steps are as follows:
- The dataset is for 30 days in the month of April and year 2016.
- Most of the people used air as mode of travel. Some people do not report their mode of transport.
- Males immigrated more than females
- i94 has missing values. These rows need to be dropped.
- There are no duplicate gender and address values for each cicid.
- Immigration was to 243 different cities for multiple states.
- Immigration was from 229 different cities.
- For some records the departure date is missing or earlier than the arrival date; these visitors are presumably still in the country.
- airline and fltno are also missing in some rows where the mode of transport was different.
- The i94 form supports the gender values O, M, and F. Null values are considered invalid.
- Some arrival and departure records don't have a matching flag (matflag).
- There is a minimum age of -3; only records with an age greater than zero are selected.
- The dates are stored in SAS date format, which is a value that represents the number of days between January 1, 1960, and a specified date. We need to convert the dates in the dataframe to a string date format in the pattern YYYY-MM-DD.
- insnum can be dropped as it is for US residents or citizens
- Count, dtadfile, admnum, i94res, dtaddto, occup, visapost can be dropped as these do not provide any extra information or have high missing values.
- The demographics dataset does not have many missing values, but has data for only 48 states.
- Most of the iata_code values are missing. Almost 50% of local codes are also missing.
- Select only rows where IATA codes are available for US airports and the airport type is large, medium, or small.
- Extract ISO regions and drop the continent column.
- Rename the columns of the dataset to more meaningful names.
- Convert the data types of the columns.
- Remove city and race from the demographics data.
- Group the data to provide aggregated statistics per US state.
- Drop duplicates.
3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model
For this project, a star schema is deployed in a relational database management system as dimensional structures. Star schemas characteristically consist of fact tables linked to associated dimension tables via primary/foreign key relationships.
3.2 Mapping Out Data Pipelines
The project involved four key decisions during the design of a dimensional model:
- Select the business process.
The business process for the immigration department is allowing valid visitors into the country. The process generates events and captures performance metrics that translate into facts in a fact table.
- Declare the grain.
The grain establishes exactly what a single fact table row represents. In this project, a record is created as the event of a visitor entering the USA occurs. The grain is declared before choosing the fact and dimension tables and becomes a binding contract on the design. This ensures uniformity across all dimensional designs, which is critical to BI application performance and ease of use.
- Identify the dimensions.
Dimension tables provide the “who, what, where, when, why, and how” context surrounding a business process event. Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts. In this project, a dimension is single-valued when associated with a given fact row. Every dimension table has a single primary key column. This primary key is embedded as a foreign key in the associated fact table where the dimension row's descriptive context is exactly correct for that fact table row.
Dimension tables are wide, flat denormalized tables with many low cardinality text attributes. It is designed with one column serving as a unique primary key. This primary key is not the operational system’s natural key because there will be multiple dimension rows for that natural key when changes are tracked over time. These surrogate keys are simple integers, assigned in sequence. The tables also denormalize the many-to-one fixed depth hierarchies into separate attributes on a flattened dimension row. Dimension denormalization supports dimensional modeling’s twin objectives of simplicity and speed.
- Identify the facts
The fact table focuses on the results of a single business process. A single fact table row has a one-to-one relationship to a measurement event as described by the fact table’s grain. Thus a fact table design is entirely based on a physical activity and is not influenced by the demands of a particular report. Within a fact table, only facts consistent with the declared grain are allowed. In this project, the information about the visitor is the fact. The fact table is transactional with each row corresponding to a measurement event at a point in space and time. It is also Factless Fact Tables as the event merely records a set of dimensional entities coming together at a moment in time. Factless fact tables can also be used to analyze what didn’t happen. These queries always have two parts: a factless coverage table that contains all the possibilities of events that might happen and an activity table that contains the events that did happen. When the activity is subtracted from the coverage, the result is the set of events that did not happen. Each row corresponds to an event. The fact table contains foreign keys for each of its associated dimensions, as well as date stamps. Fact tables are the primary target of computations and dynamic aggregations arising from queries.
( http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf )
Step 4: Run Pipelines to Model the Data
4.1 Create the Data Model
Build the data pipelines to create the data model.
4.2 Data Quality Checks
Explain the data quality checks you’ll perform to ensure the pipeline ran as expected. These could include:
- Integrity constraints on the relational database (e.g., unique key, data type, etc.)
- Unit tests for the scripts to ensure they are doing the right thing
- Source/Count checks to ensure completeness
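A minimal source/count completeness check might look like the following; it is written generically against any table object exposing count(), such as a Spark DataFrame, and the function name is chosen for this sketch:

```python
def table_count_check(df, table_name):
    """Source/count completeness check: fail if a table ended up empty.

    df can be any object exposing count(), e.g. a Spark DataFrame.
    Returns True when the check passes, False otherwise.
    """
    total = df.count()
    if total == 0:
        print(f"Data quality check failed: {table_name} returned no records")
        return False
    print(f"Data quality check passed: {table_name} has {total} records")
    return True
```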
exploringUsingPyspark.ipynb contains the workings for tasks 4.1 and 4.2.
4.3 Data Dictionary
datadictionary.md contains the data model.
The write-up also considers how the approach would change under the following scenarios:
- The data was increased by 100x:
  1. Use Redshift ( https://aws.amazon.com/redshift/ ). It allows querying petabytes of structured and semi-structured data across the data warehouse.
  2. Use Cassandra ( http://cassandra.apache.org/ ). It offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.
- The data populates a dashboard that must be updated by 7am every day:
  1. For small datasets, a cron job will be sufficient.
  2. Use Airflow ( https://airflow.apache.org/docs/stable/macros.html ).
- The database needed to be accessed by 100+ people:
  1. Use Redshift with auto-scaling capabilities and good read performance.
  2. Use Cassandra with pre-defined indexes to optimize read queries.
  3. Use Elastic MapReduce ( https://aws.amazon.com/emr/ ). It allows provisioning one, hundreds, or thousands of compute instances to process data at any scale.
How to Become a Data Engineer
Understand the latest AWS features used by data engineers to design and build systems for collecting, storing, and analyzing data at scale.
Skills covered: Apache Airflow, AWS Glue, Apache Spark, Redshift, Amazon S3.

What you will learn
Data Engineering with AWS
You’ll master the AWS data engineering skills necessary to level up your tech career. Learn data engineering concepts like designing data models, building data warehouses and data lakes, automating data pipelines, and managing massive datasets.
It is recommended that learners have intermediate Python, intermediate SQL, and command line skills.
Learners will create relational and NoSQL data models to fit the diverse needs of data consumers. They’ll also use ETL to build databases in Apache Cassandra.
Cloud Data Warehouses
In this data engineering course, learners will create cloud-based data warehouses. They will sharpen their data warehousing skills, deepen their understanding of data infrastructure, and be introduced to data engineering on the cloud using Amazon Web Services (AWS).
Spark and Data Lakes
Learners will build a data lake on AWS and a data catalog following the principles of data lakehouse architecture. They will learn about the big data ecosystem and the power of Apache Spark for data wrangling and transformation. They’ll work with AWS data tools and services to extract, load, process, query, and transform semi-structured data in data lakes.
Automate Data Pipelines
This data engineer training dives into the concept of data pipelines and how learners can use them to accelerate their careers. This course will focus on applying the data pipeline concepts learners will learn through an open-source tool from Airbnb called Apache Airflow. This course will start by covering concepts including data validation, DAGs, and Airflow, and then venture into AWS quality concepts like copying S3 data, connections and hooks, and Redshift Serverless. Next, learners will explore data quality through data lineage, data pipeline schedules, and data partitioning. Finally, they'll put data pipelines into production by extending Airflow with plugins, implementing task boundaries, and refactoring DAGs.
All our programs include
- Real-world projects from industry experts
With real-world projects and immersive content built in partnership with top-tier companies, you’ll master the tech skills companies want.
- Technical mentor support
Our knowledgeable mentors guide your learning and are focused on answering your questions, motivating you, and keeping you on track.
- Career services
You’ll have access to Github portfolio review and LinkedIn profile optimization to help you advance your career and land a high-paying role.
Flexible learning program
Tailor a learning plan that fits your busy life. Learn at your own pace and reach your personal goals on the schedule that works best for you.
- Content Co-created with Insight
- Real-world projects
- Project reviews
- Project feedback from experienced reviewers
- Student community
- Github review
- Linkedin profile optimization
Succeed with personalized services.
We provide services customized for your needs at every step of your learning journey to ensure your success.
- Experienced Project Reviewers
- Technical Mentor Support
Get timely feedback on your projects.
- Personalized feedback
- Unlimited submissions and feedback loops
- Practical tips and industry best practices
- Additional suggested resources to improve
Learn with the best.
Amanda is a developer advocate for DataStax after spending the last 6 years as a software engineer on 4 different distributed databases. Her passion is bridging the gap between customers and engineering. She has degrees from the University of Washington and Santa Clara University.
In his career as an engineer, Ben Goldberg has worked in fields ranging from computer vision to natural language processing. At SpotHero, he founded and built out their data engineering team, using Airflow as one of the key technologies.
Valerie is a curriculum manager at Udacity who has developed and taught a broad range of computing curriculum for several colleges and universities. She was a professor and software engineer for over 10 years specializing in web, mobile, voice assistant, and social full-stack application development.
Matt is a software and solutions architect focusing on data science and analytics for managed business solutions. In addition, Matt is an adjunct lecturer, teaching courses in the computer information systems department at the University of Northern Colorado where he received his PhD in Educational Psychology.
Sean currently teaches cybersecurity and DevOps courses at Brigham Young University Idaho. He has been a software engineer for over 16 years. Some of the most exciting projects he has worked on involved data pipelines for DNA processing and vehicle telematics.
Top student reviews
Get started today
Learn the high-impact AWS skills that a data engineer uses on a daily basis.
On average, successful students take 4 months to complete this program.
Related programs

AWS Cloud Architect
Build confidence planning, designing, and creating high availability cloud infrastructure.
Learn how to stream data to unlock key insights in real-time.
Program overview: Why should I take this program?

Why should I enroll?
The data engineering field is expected to continue growing rapidly over the next several years, and there’s huge demand for data engineers across industries. Udacity has collaborated with industry professionals to offer up-to-date learning content that can advance your data engineering career.
By the end of the Nanodegree program, you will have an impressive portfolio of real-world projects and valuable hands-on experience.
What jobs will this program prepare me for?
This program is designed to teach you how to become a data engineer. These skills will prepare you for jobs such as analytics engineer, big data engineer, data platform engineer, and others. Data engineering skills are also helpful for adjacent roles, such as data analysts, data scientists, machine learning engineers, or software engineers.
How do I know if this program is right for me?
This Nanodegree program offers an ideal path for experienced programmers to advance their data engineering careers. If you enjoy solving important technical challenges and want to learn to work with massive datasets, this is a great way to get hands-on practice.
Enrollment and admission
Do I need to apply? What are the admission criteria?
There is no application. This Nanodegree program accepts everyone, regardless of experience and specific background.
What are the prerequisites for enrollment?
The Data Engineering with AWS Nanodegree program is designed for learners with intermediate Python, intermediate SQL, and command line skills.
In order to successfully complete the program, learners should be comfortable with the following concepts:
- Strings, numbers, and variables
- Statements, operators, and expressions
- Lists, tuples, and dictionaries
- Conditions, loops
- Procedures, objects, modules, and libraries
- Troubleshooting and debugging
- Algorithms and data structures
- Table definition and manipulation (Create, Update, Insert, Alter)
- Run scripts from the command line
If you need to sharpen your prerequisite skills, try one of the programs below:
- Intermediate Python
- Programming for Data Science with Python
If I do not meet the requirements to enroll, what should I do?
To prepare for this program learners are encouraged to enroll in one of the following programs:
Tuition and term of program
How is this Nanodegree program structured?
The Data Engineering with AWS Nanodegree program has 4 courses with 4 projects. We estimate that students can complete the program in 4 months working 5-10 hours per week.
Each project will be reviewed by the Udacity reviewer network. Feedback will be provided and if you do not pass the project, you will be asked to resubmit the project until it passes.
How long is this Nanodegree program?
Can I switch my start date? Can I get a refund?
Please see the Udacity Program FAQs for policies on enrollment in our programs.
Software and hardware: What do I need for this program?
What software and versions will I need in this program?
There are no software and version requirements to complete this Nanodegree program. All coursework and projects can be done via Student Workspaces in the Udacity online classroom.
In this long post I present the project I developed for the Udacity Data Engineering Nanodegree (DEND). What to build was the developer's free choice, provided certain criteria were met, for example working with a database of at least 3 million records.
This is the first notebook of the project; the second contains examples of queries that can be run on the data lake.
Data Lake with Apache Spark

Data Engineering Capstone Project

Project Summary
The Organization for Tourism Development (OTD) wants to analyze migration flux in the USA, in order to find insights to significantly and sustainably develop tourism in the USA.
To support their core idea they have identified a set of analysis/queries they want to run on the raw data available.
The project deals with building a data pipeline, to go from raw data to the data insights on the migration flux.
The raw data are gathered from different sources, saved in files and made available for download.
The project shows the execution and decisional flow, specifically:
- Describe the data and how they have been obtained
- Answer the question “how to achieve the target?”
- What infrastructure (storage, computation, communication) has been used and why
- Explore the data
- Check the data for issues, for example null, NaN, or other inconsistencies
- Why this data model has been chosen
- How it is implemented
- Load the data from S3 into the SQL database, if any
- Perform quality checks on the database
- Perform example queries
- Documentation of the project
- Possible scenario extensions
- 1. Scope of the Project
- 1.1 What data
- 1.2 What tools
- 1.3 The I94 immigration data
- 1.3.1 What is an I94?
- 1.3.2 The I94 dataset
- 1.3.3 The SAS date format
- 1.3.4 Loading I94 SAS data
- 1.4 World Temperature Data
- 1.5 Airport Code Table
- 1.6 U.S. City Demographic Data
- 2. Data Exploration
- 2.1 The I94 dataset
- 2.2 I94 SAS data load
- 2.3 Explore I94 data
- 2.4 Cleaning the I94 dataset
- 2.5 Store I94 data as parquet
- 2.6 Airport codes dataset: load, clean, save
- 3. The Data Model
- 3.1 Mapping Out Data Pipelines
- 4. Run Pipelines to Model the Data
- 4.1 Provision the AWS S3 infrastructure
- 4.2 Transfer raw data to S3 bucket
- 4.3 EMR cluster on EC2
- 4.3.1 Provision the EMR cluster
- 4.3.2 Coded fields: I94CIT and I94RES
- 4.3.3 Coded field: I94PORT
- 4.3.4 Data cleaning
- 4.3.5 Save clean data (parquet/json) to S3
- 4.3.6 Loading, cleaning and saving airport codes
- 4.4 Querying data on-the-fly
- 4.5 Querying data using the SQL querying style
- 4.6 Data Quality Checks
- Lesson learned
1. Scope of the Project
The OTD wants to run pre-defined queries on the data on a periodic schedule.
They also want to maintain the flexibility to run different queries on the data, using BI tools connected to an SQL-like database.
The core data is the dataset, provided by US government agencies, of requests for access to the USA (I94 forms).
They also have other, lower-value data available that is not part of the core analysis and whose use is unclear; it is therefore stored in the data lake for possible future use.
1.1 What data
Following datasets are used in the project:
- I94 immigration data for year 2016 . Used for the main analysis
- World Temperature Data
- Airport Code Table
- U.S. City Demographic Data
1.2 What tools
Because of the nature of the data and the analysis to be performed (not time-critical; monthly or weekly batches), the choice fell on a cheaper S3-based data lake with on-demand, on-the-fly analytical capability: an EMR cluster with Apache Spark, and optionally Apache Airflow for scheduled execution (not implemented here).
The architecture shown below has been implemented.
- Starting from a common storage solution (currently the Udacity workspace) where both the OTD and its partners have access, the data is first ingested into an S3 bucket in raw format
- To ease future operations, the data is immediately processed, validated, and cleansed using a Spark cluster and stored into S3 in parquet format. Raw and parquet data formats coexist in the data lake.
- By default, the project doesn't use a costly Redshift cluster; instead, data is queried in place on the S3 parquet data.
- The EMR cluster serves the analytical needs of the project. SQL-based queries are performed using Spark SQL directly on the S3 parquet data
- A Spark job can be triggered monthly, using the parquet data. The data is aggregated to gain insights into the evolution of the migration flows
1.3 The I94 immigration data ¶
The data are provided by the US National Tourism and Trade Office . The dataset is a collection of all I-94 forms filed in 2016.
1.3.1 What is an I94? ¶
To give some context, it is useful to explain what an I-94 form is.
From the government website : “The I-94 is the Arrival/Departure Record, in either paper or electronic format, issued by a Customs and Border Protection (CBP) Officer to foreign visitors entering the United States.”
1.3.2 The I94 dataset ¶
Each record contains these fields:
- CICID, unique number of the record
- I94YR, 4 digit year of the application
- I94MON, Numeric month of the application
- I94CIT, country of citizenship (birth) of the applicant
- I94RES, country of residence of the applicant
- I94PORT, location (port) where the application is issued
- ARRDATE, arrival date in USA in SAS date format
- I94MODE, how the applicant arrived in the USA
- I94ADDR, US state where the port is
- DEPDATE, departure date from the USA
- I94BIR, age of applicant in years
- I94VISA, what kind of VISA
- COUNT, used for summary statistics, always 1
- DTADFILE, date added to I-94 Files
- VISAPOST, Department of State office where the visa was issued
- OCCUP, occupation that will be performed in U.S.
- ENTDEPA, arrival Flag
- ENTDEPD, departure Flag
- ENTDEPU, update Flag
- MATFLAG, match flag
- BIRYEAR, 4 digit year of birth
- DTADDTO, date to which admitted to U.S. (allowed to stay until)
- GENDER, non-immigrant sex
- INSNUM, INS number
- AIRLINE, airline used to arrive in USA
- ADMNUM, admission Number
- FLTNO, flight number of Airline used to arrive in USA
- VISATYPE, class of admission legally admitting the non-immigrant to temporarily stay in USA
More details in the file I94_SAS_Labels_Descriptions.SAS
1.3.3 The SAS date format ¶
A SAS date represents any date D0 as the number of days between D0 and 1 January 1960.
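A minimal pure-Python conversion (independent of Spark) illustrates the rule:

```python
from datetime import date, timedelta

# The SAS epoch: day 0 is 1 January 1960.
SAS_EPOCH = date(1960, 1, 1)

def sas_to_date(sas_days):
    """Convert a SAS date (days since 1960-01-01) into a calendar date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))

print(sas_to_date(0))      # 1960-01-01
print(sas_to_date(20454))  # 2016-01-01
```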
1.3.4 Loading I94 SAS data ¶
The package saurfang:spark-sas7bdat:2.0.0-s_2.11 and the dependency parso-2.0.8 are needed to read SAS data format.
To load them use the config option spark.jars and give the URL of the repositories, as Spark itself wasn’t able to resolve the dependencies.
1.4 World temperature data ¶
The dataset is from Kaggle. It can be found here .
The dataset contains temperature data:
- Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv)
- Global Average Land Temperature by Country (GlobalLandTemperaturesByCountry.csv)
- Global Average Land Temperature by State (GlobalLandTemperaturesByState.csv)
- Global Land Temperatures By Major City (GlobalLandTemperaturesByMajorCity.csv)
- Global Land Temperatures By City (GlobalLandTemperaturesByCity.csv)
1.5 Airport codes data ¶
This is a table of airport codes and information on the corresponding cities, like GPS coordinates, elevation, country, etc. It comes from the Datahub website .
1.6 U.S. City Demographic Data ¶
The dataset comes from OpenSoft. It can be found here .
2. Data Exploration ¶
In this chapter we identify data quality issues, like missing values, duplicate data, etc.
The purpose is to define the steps the data pipeline needs to programmatically correct these issues.
In this step we work on local data.
2.1 The I94 dataset ¶
- How many files are in the I94 dataset?
- What is the size of the files?
2.2 I94 SAS data load ¶
To read SAS data format I need to specify the com.github.saurfang.sas.spark format.
- Let’s see the schema Spark applied on reading the file
Most columns hold categorical data, meaning the information is coded; for example, in I94CIT=101 , 101 is the country code for Albania.
Other columns represent integer data.
Clearly there is no need for fields defined as double => let's change those fields to integer
Verify that the schema is correct.
- convert string columns dtadfile and dtaddto to date type
These fields come as simple strings. To be able to run time-based queries, they are converted to date type
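Under the assumption that dtadfile is a compact YYYYMMDD string and dtaddto is MMDDYYYY (the exact layouts should be verified against I94_SAS_Labels_Descriptions.SAS), the conversion boils down to strptime:

```python
from datetime import datetime

# Assumed layouts -- verify against I94_SAS_Labels_Descriptions.SAS:
#   dtadfile -> YYYYMMDD, dtaddto -> MMDDYYYY
def parse_dtadfile(s):
    return datetime.strptime(s, "%Y%m%d").date() if s else None

def parse_dtaddto(s):
    return datetime.strptime(s, "%m%d%Y").date() if s else None

print(parse_dtadfile("20160430"))  # 2016-04-30
print(parse_dtaddto("10292016"))   # 2016-10-29
```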
- convert columns arrdate and depdate from SAS-date format to a timestamp type.
A date in SAS format is simply the number of days between the chosen date and the reference date (01-01-1960)
- print final schema
2.3 Explore I94 data ¶
- How many rows does the I94 database have?
- Let’s see the gender distribution of the applicants
- Where are the I94 applicants coming from?
I want to know the 10 most represented nations
The i94res code 135, where the highest number of visitors come from, corresponds to the United Kingdom, as can be read in the accompanying file I94_SAS_Labels_Descriptions.SAS
- What port registered the highest number of arrivals?
New York City port registered the highest number of arrivals.
2.4 Cleaning the I94 dataset ¶
These are the steps to perform on the I94 database:
- Identify null and NaN values. Remove duplicates ( quality check ).
- Find errors in the records ( quality check ), for example dates not in the year 2016
- Count how many NaN values there are in each column, excluding the date-type columns dtadfile , dtaddto , arrdate , depdate , because the isnan function works only on numerical types
- How many rows of the I94 database have null value?
The number of nulls equals the number of rows. This means there is at least one null in each row of the dataframe.
- Now we can count how many nulls there are in each column
There are many nulls in many columns.
The question is whether there is a need to correct or fill those nulls.
Looking at the data, it seems some fields have been left empty for lack of information.
Because these are categorical data, there is no use, at this stage, in assigning arbitrary values to the nulls.
The nulls will not be filled a priori, but only if a specific need arises.
- Are there duplicated rows?
Dropping duplicate rows
Checking if the number changed
No row has been dropped => there are no duplicated rows
- Verify that all rows have i94yr column equal 2016
This gives confidence in the consistency of the data
2.5 Store I94 data as parquet ¶
I94 data are stored in parquet format in an S3 bucket, partitioned by the fields year and month
2.6 The Airport codes dataset ¶
A snippet of the data
How many records?
There are no duplicates
We discover there are some null fields:
The nulls are in these columns:
No action taken to fill the nulls
Finally, let’s save the data in parquet format in our temporary folder mimicking the S3 bucket.
3. The Data Model ¶
The core of the architecture is a data lake , with S3 storage and EMR processing.
The data are stored into S3 in raw and parquet format.
Apache Spark is the tool chosen for analytical tasks, therefore all data are loaded into Spark dataframes using a schema-on-read approach.
For SQL-style queries on the data, Spark temporary views are generated.
3.1 Mapping Out Data Pipelines ¶
- Provision the AWS S3 infrastructure
- Transfer data from the common storage to the S3 lake storage
- Provision an EMR cluster. It runs 2 steps and then auto-terminates: 3.1 run a Spark job to extract codes from the file I94_SAS_Labels_Descriptions.SAS and save them to S3; 3.2 clean the data (find NaN, null and duplicate values) and save the clean data to parquet files
- Generate reports using Spark query on S3 parquet data
- On-the-fly queries with Spark SQL
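The two Spark steps in the pipeline above can be sketched as EMR step definitions of the kind boto3's add_job_flow_steps accepts; the bucket name and failure policy below are placeholders, not the project's actual values:

```python
def spark_step(name, script_s3_uri):
    """Build one EMR step definition that runs a PySpark script via command-runner."""
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }

# The two cleaning/extraction steps named in this project:
steps = [
    spark_step("extract_codes", "s3://<bucket>/scripts/spark_4_emr_codes_extraction.py"),
    spark_step("process_i94", "s3://<bucket>/scripts/spark_4_emr_I94_processing.py"),
]
```

These dictionaries would then be submitted with the EMR client's add_job_flow_steps call.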
4. Run Pipeline to Model the Data ¶
4.1 Provision the AWS S3 infrastructure ¶
Reading credentials and configuration from file
Create the bucket if it does not already exist
4.2 Transfer raw data to S3 bucket ¶
Transfer the data from current shared storage (currently Udacity workspace) to S3 lake storage.
A naive metadata system is implemented. It uses a JSON file to store basic information about each file added to the S3 bucket:
- file name: file being processed
- added by: user logged as | aws access id
- date added: timestamp of date of processing
- modified on: timestamp of modification time
- notes: any additional information
- access granted to (role or policy): admin | anyone | I94 access policy | weather data access policy |
- expire date: 5 years (default)
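One entry of this naive metadata system could be assembled as in the sketch below; the helper and its field values are illustrative, not the project's actual code:

```python
import json
from datetime import datetime

def make_metadata(file_name, added_by, notes="", access="admin", retention_years=5):
    """Build one metadata entry for a file added to the S3 bucket (sketch)."""
    now = datetime.utcnow()
    return {
        "file name": file_name,
        "added by": added_by,
        "date added": now.isoformat(),
        "modified on": now.isoformat(),
        "notes": notes,
        "access granted to": access,
        # naive: same calendar day, retention_years later
        "expire date": now.replace(year=now.year + retention_years).isoformat(),
    }

entry = make_metadata("i94_apr16_sub.sas7bdat", "admin", access="I94 access policy")
print(json.dumps(entry, indent=2))
```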
These datasets are moved to the S3 lake storage:
- I94 immigration data
- airport codes
- US cities demographics
4.3 EMR cluster on EC2 ¶
An EMR cluster on EC2 instances with Apache Spark preinstalled is used to perform the ELT work.
A 3-node cluster of m5.xlarge instances is configured by default in the config.cfg file.
If the performance requires it, the cluster can be scaled up to use more nodes and/or bigger instances.
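By way of illustration, the default sizing could live in config.cfg as a fragment like the one below; the section and key names are assumptions for this sketch, not necessarily the project's actual keys:

```ini
; Hypothetical config.cfg fragment -- key names are illustrative
[EMR]
release_label = emr-5.28.1
instance_type = m5.xlarge
instance_count = 3
auto_terminate = true
```

Scaling up then means editing instance_count or instance_type and re-provisioning.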
After the cluster has been created, the steps that execute the Spark cleaning jobs are added to the EMR job flow; the steps live in separate .py files. These steps are added:
- extract I94res, i94cit, i94port codes
- save the codes in a json file in S3
- load I94 raw data from S3
- change schema
- data cleaning
- save parquet data to S3
The cluster is set to auto-terminate by default after executing all the steps.
4.3.1 Provision the EMR cluster ¶
Create the cluster using the code emr_cluster.py [Ref. 3] and emr_cluster_spark_submit.py , and set the steps to execute spark_script_1 and spark_script_2 .
These scripts have already been previously uploaded to a dedicated folder in the project’s S3 bucket, and are accessible from the EMR cluster.
The file spark_4_emr_codes_extraction.py contains the code for following paragraphs 4.3.1
The file spark_4_emr_I94_processing.py contains the code for following paragraphs 4.3.2, 4.3.3, 4.3.4
4.3.2 Coded fields: I94CIT and I94RES ¶
I94CIT and I94RES contain codes indicating the country where the applicant was born (I94CIT) or is resident (I94RES).
The data is extracted from I94_SAS_Labels_Descriptions.SAS . This can be done sporadically, or every time a change occurs, for example when a new code is added.
The conceptual flow below was implemented.
The first steps are to define credentials to access S3A, then load the data into a dataframe as a single row
Find the section of the file where I94CIT and I94RES are specified.
It starts with I94CIT & I94RES and finishes with the semicolon character.
To match the section, it is important to have the complete text in a single row; I did this using the option wholetext=True in the previous dataframe read operation
Now I can split it into a dataframe with multiple rows
I filter the rows with structure \ = \
And then create 2 different columns, code and country
I can finally store the data in a single file in JSON format
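The extraction logic can be sketched in plain Python on a small excerpt shaped like the labels file; the regex mirrors the `code = 'value'` structure described above (the excerpt is illustrative, the real file holds hundreds of codes):

```python
import re

# Excerpt shaped like the I94CIT & I94RES section of I94_SAS_Labels_Descriptions.SAS
excerpt = """
value i94cntyl
   101 =  'ALBANIA'
   135 =  'UNITED KINGDOM'
;
"""

# Match every  code = 'country'  pair inside the section.
pairs = re.findall(r"(\d+)\s*=\s*'([^']*)'", excerpt)
codes = {int(code): country.strip() for code, country in pairs}
print(codes)  # {101: 'ALBANIA', 135: 'UNITED KINGDOM'}
```

In the project the same pattern is applied with Spark, after reading the file with wholetext=True.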
4.3.3 Coded field: I94PORT ¶
The I94PORT codes are extracted similarly.
The complete code for codes extraction is in spark_4_emr_codes_extraction.py
4.3.4 Data cleaning ¶
The cleaning steps have already been shown in section 2; here they are only summarized:
- Load the dataset
- Numeric fields: double to integer
- Fields dtadfile and dtaddto : string to date
- Fields arrdate and depdate : SAS to date
- Handle nulls: no fill is set by default
- Drop duplicates
4.3.5 Save clean data (parquet/json) to S3 ¶
The complete code, refactored and modularized, is in spark_4_emr_I94_processing.py
As a side note, saving the test file as parquet takes about 3 minutes on the provisioned cluster. The complete script execution takes 6 minutes.
4.3.6 Loading, cleaning and saving airport codes ¶
4.4 Querying data on-the-fly ¶
The data in the data lake can be queried in place, that is, the Spark cluster on EMR operates directly on the S3 data.
There are two possible ways to query the data:
- using Spark dataframe functions
- using SQL on tables
We see examples of both programming styles.
These are some typical queries that are run on the data:
- For each port, in a given period, how many arrivals are there each day?
- Where are the I94 applicants coming from, in a given period?
- In the given period, what port registered the highest number of arrivals?
- Number of arrivals in a given city for a given period
- Travelers' genders
- Is there a city where the difference between male and female travelers is highest?
- Find most visited city (the function)
The queries are collected in the Jupyter notebook Capstone project 1 – Querying the data lake.ipynb
4.5 Querying data using the SQL querying style ¶
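The SQL querying style registers the dataframe as a temporary view and then runs plain SQL on it. The query pattern can be demonstrated portably with SQLite (used here only so the example runs anywhere; the project itself runs Spark SQL), with table and column names following the I94 schema:

```python
import sqlite3

# Tiny in-memory stand-in for the I94 temporary view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE i94 (i94port TEXT, arrdate TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO i94 VALUES (?, ?, ?)",
    [("NYC", "2016-04-01", "F"), ("NYC", "2016-04-01", "M"), ("LOS", "2016-04-02", "F")],
)

# Which port registered the highest number of arrivals?
rows = conn.execute(
    """SELECT i94port, COUNT(*) AS arrivals
       FROM i94
       GROUP BY i94port
       ORDER BY arrivals DESC"""
).fetchall()
print(rows)  # [('NYC', 2), ('LOS', 1)]
```

In Spark the same statement would run via spark.sql(...) after df.createOrReplaceTempView("i94").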
4.6 Data Quality Checks ¶
The query-in-place concept implemented here uses a very short pipeline: data are loaded from S3 and, after a cleaning process, saved as parquet. Quality of the data is guaranteed by design.
5. Write Up ¶
The project has been set up with scalability in mind. All components used, S3 and EMR, offer a high degree of scalability, both horizontal and vertical.
The tool used for the processing, Apache Spark, is the de facto tool for big data processing.
To achieve this level of scalability we sacrificed processing speed. A data warehouse solution with a Redshift database or an OLAP cube would have answered the queries faster. However, nothing prevents adding a DWH to stage the data in case of more intensive, real-time-responsive usage of the data.
An important part of an ELT/ETL process is automation. Although it has not been addressed here, I believe the code developed here could be automated with reasonably small effort. A tool like Apache Airflow can be used for this purpose.
Scenario extension ¶
- The data was increased by 100x.
In an increased-data scenario, the EMR hardware needs to be scaled up accordingly. This is done simply by changing the configuration in the config.cfg file. Apache Spark is the tool for big data processing, and it is already used as the project's analytics tool.
- The data populates a dashboard that must be updated on a daily basis by 7am every day.
In this case an orchestration tool like Apache Airflow is required. A DAG that triggers Python scripts and Spark job executions needs to be scheduled for daily execution at 7am.
The results of the queries for the dashboard can be saved in a file.
- The database needed to be accessed by 100+ people.
A proper database wasn't used; instead, Amazon S3 is used to store the data and query it in place. S3 is designed with massive scale in mind and is able to handle sudden traffic spikes. Therefore, having many people access the data shouldn't be an issue.
As programmed, the project provisions an EMR cluster for every user that plans to run queries. 100+ EMR clusters would probably be expensive for the company; a more efficient sharing of processing resources must be devised.
6. Lessons learned ¶
EMR 5.28.1 uses Python 2 as default ¶
- As a consequence, important Python packages like pandas are not installed by default for Python 3.
- Install packages for Python 3 with: python3 -m pip install \
Adding jar packages to Spark ¶
For some reason, adding the packages in the Python program when instantiating the SparkSession doesn't work (error message: package not found). This doesn't work:
The packages must be added in the spark-submit:
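For example, a spark-submit invocation of this kind should work; the repository URL and script name here are illustrative, not taken from the project:

```shell
spark-submit \
  --packages saurfang:spark-sas7bdat:2.0.0-s_2.11 \
  --repositories https://repos.spark-packages.org \
  etl.py
```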
Debugging Spark on EMR ¶
While everything works locally, it doesn't necessarily mean it is going to work on the EMR cluster. Debugging the code is easier over SSH on EMR.
Reading an S3 file from Python is tricky ¶
While reading with Spark is straightforward (one just needs to give the s3://…. address), from plain Python boto3 must be used.
Transferring files to S3 ¶
During the debugging phase, when the code on S3 must be changed many times, using the web interface is slow and impractical ( permanently delete ). Memorize this command: aws s3 cp <local file> <s3 folder>
Removing the content of a directory from Python ¶
import shutil
dirPath = 'metastore_db'
shutil.rmtree(dirPath)
7. References ¶
- AWS CLI Command Reference
- EMR provisioning is based on: Github repo Boto-3 provisioning
- Boto3 Command Reference
Jul 26, 2019
What I learned from finishing the Udacity Data Engineering Capstone
This is my first experience with real Data Engineering. My tasks were to start with a lot of raw data from different sources, seemingly unrelated to each other, and somehow come up with a use case and design a data pipeline that could support those use cases. Time to put my data creativity hat on. The data potential is HUGE in this age.
Sources of data provided by Udacity:
- US Immigration
- World Temperature
- US Demographics
- International Airport Codes
The majority of Data Science work depends on what kind of data you have access to, and how to draw out the most relevant data attributes that can support your use cases and let the ML algorithms do their magic. I’ve read about a lot of very creative uses of data to do crazy inferences, like using Google house images and street views to infer potential car accidents at certain locations. Another crazy use case is combining traffic data with car purchases, weather patterns, and traffic-light patterns to predict traffic. What I’m most interested in is Cybersecurity and Privacy use cases, which I will talk about in my next project. Note that all these data are in different forms and come from different sources. It would require a lot of Data Engineering work to extract, transform and load these into a usable data warehouse for analytics to happen.
The data sources provided by Udacity are not that interesting to me per se, but they make for very good practice and preparation for my next project. Using the Snowflake schema, I designed 8 normalized tables, such as airports, immigrants, state codes, weather, etc. I then used these normalized tables to design 2 analytics tables that are useful to consume, either for analytics or decision making, such as airport weather and immigration demographics. During the process I had to deal with missing data, duplicates, malformed data, and so on.
Now that we’ve got the data modelling out of the way, we’ll move on to my favourite topic: technologies. Particularly AWS, Spark and Airflow.
First off, I’m particularly impressed at AWS CloudFormation, it’s such an easy and convenient way to deploy infrastructure with minimal effort, without spending hours clicking around multiple services trying to link them together. The configuration syntax is very intuitive and easy to learn.
An EC2 instance is used to host Airflow server, and an RDS instance is used as Airflow database, to store metadata about DAG runs, task completion, etc … Airflow is used to orchestrate the creation / termination of the EMR cluster. The basic workflow is: At the start of the DAG, create the EMR cluster and wait for completion. Once completed, start submitting Spark jobs to the cluster using Apache Livy REST API. When all jobs are completed, terminate the cluster.
I divided the workflow into 3 separate DAGs, which communicate with each other through Airflow Variables and SensorOperators:
- dag_cluster: start the EMR cluster, wait until all data transformation is finished, then terminate the cluster
- dag_normalize: wait for EMR cluster to be ready, then use Apache Livy REST API to create interactive Spark session on the cluster, submit a Spark script to read data from S3, do transformation and write the output to S3
- dag_analytics: wait for the EMR cluster to be ready and for the normalized tables to be processed, then read the normalized tables to create analytics tables and write them to S3. This DAG handles immigration data and runs at monthly intervals. Data is partitioned into 12 months, from Jan 2016 to Dec 2016.
- create_spark_session script:
- submit_state script: (to submit pyspark script to the session)
- Example Spark script:
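The create_spark_session script was not reproduced above; a minimal sketch of what it has to do, using only the standard library against Livy's POST /sessions endpoint (the master DNS is a placeholder and the helper name is hypothetical), might look like:

```python
import json
import urllib.request

LIVY_URL = "http://<emr-master-dns>:8998"  # placeholder host

def create_spark_session(livy_url=LIVY_URL):
    """Ask Livy for a new interactive PySpark session and return its metadata."""
    req = urllib.request.Request(
        livy_url + "/sessions",
        data=json.dumps({"kind": "pyspark"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Subsequent statements would then be POSTed to /sessions/{id}/statements, which is what the submit_state script does.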
Few things to note:
- We can pass args from Airflow to the Spark script, such as execution date to partition the data
- We can read the Spark session logs into Airflow
There are lots of nuances in configuring Airflow and AWS, which I’ve learnt the hard way through doing this project. It’s really good preparation for me to embark on my own data journey. Stay tuned for my next project.
Project repo: https://github.com/dai-dao/udacity-data-engineering-capstone
Udacity Data Engineering Nanodegree Review in 2023- Pros & Cons
Are you planning to enroll in Udacity Data Engineering Nanodegree ? If yes, read this latest Udacity Data Engineering Nanodegree Review and its Pros and Cons . This Udacity Data Engineering Nanodegree review will help you to decide whether this Nanodegree program is good for you or not.
So, without further ado, let’s get started-
Udacity Data Engineering Nanodegree Review
Before we dive into a Udacity Data Engineering Nanodegree Review , I would like to clear one thing-
Udacity Data Engineering Nanodegree is not a beginner-level program. This is an intermediate-advanced data engineering course . So, If you don’t have previous Python and SQL knowledge, don’t directly enroll in this program .
In this case, you can check Programming for Data Science with Python .
Now, let’s see the Pros and Cons of the Udacity Data Engineering Nanodegree-
Pros and Cons of Udacity Data Engineering Nanodegree
- Provides hands-on labs to practice throughout each lesson.
- Covers the industry best practice to write code, documentation, and real-time projects.
- Provides a better knowledge about OLAP vs. OLTP, normalization, denormalization, and how to implement it into practice.
- The projects not only assess your knowledge of the subject being covered but continually enforce good programming practices.
- The content is well-developed and intuitive.
- Provides a good explanation of SQL vs. NoSQL.
- Discuss Postgres and Apache Cassandra commands.
- Provides perfect exposure to skills required in the data engineering industry.
- Focuses on hands-on practice and believes in “ how ” to do things like ETL and Data Warehousing.
- Good explanation of distributed file systems and cluster computing.
- Clears up the difference between PySpark DataFrames and PySpark SQL.
- Provides Technical mentor support .
- Great community to help.
- Some of the lectures are not very polished .
- Data Modeling exercises have bugs.
- The demonstration code sample is not available to students.
- A bit of revision of Python and SQL is required.
- After completing the Nanodegree program, you will not get lifetime access to the course material.
So these are the Pros and Cons of the Udacity Data Engineering Nanodegree . Now let’s see how is the content of Udacity Data Engineering Nanodegree and what projects are covered throughout the Nanodegree program.
How is the Content & Projects of Udacity Data Engineering Nanodegree ?
The Udacity data engineering Nanodegree has 5 courses and 6 projects. Each course has 3-4 lessons and 1-2 Course Projects . You need to submit these guided projects after completing the course. And the contractor hired by Udacity reviews your projects.
Due to its practical approach , you will get to learn various new things. Because when you implement it by yourself, your understanding becomes stronger.
These are the 5 courses in Udacity Data Engineering Nanodegree –
- Data Modeling
- Cloud Data Warehouses
- Spark and Data Lakes
- Automate Data Pipelines
- Capstone Project
Course 1. Data Modeling
This is the first course where you will learn how to create NoSQL and relational data models to fill the needs of data consumers. You will also learn how to choose the appropriate data model for a given situation. Each course has some lessons. There are three lessons in the first course.
In the first lesson, you will learn the fundamentals of data modeling and how to create a table in Postgres and Apache Cassandra .
In the second lesson, concepts of normalization and denormalization will be introduced with hands-on projects. And you will also know the difference between OLAP and OLTP databases.
The third lesson of this course will teach you when to use NoSQL databases and how they differ from relational databases. You will also learn how to create a NoSQL database in Apache Cassandra .
Check the current Discount on-> Udacity Data Engineering Nanodegree
There are two projects in this first-course Data Modeling with Postgres and Data Modeling with Apache Cassandra .
In these projects, you have to model user activity data for a music streaming app called Sparkify . For this, you have to create a database and ETL pipeline , in both Postgres and Apache Cassandra , designed to optimize queries for understanding what songs users are listening to.
Course 2. Cloud Data Warehouses
The second course is focused on data warehousing, specifically on AWS . You will also learn various techniques like Kimball, Inmon, Hybrid, OLAP vs OLTP, Data Marts, etc. Some of the AWS tools that you’ll be using here are IAM, S3, EC2, and RDS instances.
There are three lessons in this course. In the first lesson, you will understand Data Warehousing architecture , how to run an ETL process to denormalize a database (3NF to Star) , how to create an OLAP cube from facts and dimensions , etc.
The second lesson will help you to understand cloud computing and teach you how to create an AWS account and understand their services , and how to set up Amazon S3, IAM, VPC, EC2, and RDS PostgreSQL .
In the third lesson, you will learn how to implement Data Warehouses on AWS . You will also learn how to identify components of the Redshift architecture, how to run the ETL process to extract data from S3 into Redshift , and how to set up AWS infrastructure using Infrastructure as Code (IaC).
In this course, there is one project where you have to build a cloud data warehouse to find insights into what songs their users are listening to. And for this, you have to build an ELT pipeline that extracts their data from S3, stages them in Redshift, and transforms data into a set of dimensional tables.
Course 3. Spark and Data Lakes
This course provides an introduction to Apache Spark and Data Lakes. In this course, you will learn how to use Spark to work with massive datasets and how to store big data in a data lake and query it with Spark . You will also learn concepts such as distributed processing, storage, schema flexibility, and different file formats.
There are four lessons. In the first lesson, you will learn more about Spark and understand when to use Spark and when not to use it.
The second lesson will teach you data wrangling with Spark and how to use Spark for ETL purposes . In the third lesson, you will learn about debugging and optimization and how to troubleshoot common errors and optimize code using the Spark WebUI .
The fourth lesson is all about data lakes and teaches you how to implement data lakes on Amazon S3, EMR, Athena, and AWS Glue. You will also understand the components and issues of data lakes.
In this course, there is one project where you have to create an ETL pipeline for a data lake, using data stored in AWS S3 in JSON format. And for this, you have to load data from S3 , process the data into analytics tables using Spark, and load them back into S3.
Course 4. Automate Data Pipelines
In the fourth course, you’ll use all the technologies learned in the above 3 courses. This is an exciting course where you will get an introduction to Apache Airflow and how to schedule, automate, and monitor data pipelines using Apache Airflow.
There are three lessons in this course. In the first lesson, you will learn how to create data pipelines with Apache Airflow , how to set up task dependencies , and how to create data connections using hooks .
In the second lesson, you will learn about data quality such as partitioning data to optimize pipelines, writing tests to ensure data quality , tracking data lineage, etc.
There is one project in this course, where you have to build data pipelines with Airflow. In this project, you will work on the music streaming company’s data infrastructure by creating and automating a set of data pipelines. You have to configure and schedule data pipelines with Airflow and monitor and debug production pipelines.
5. Udacity Data Engineering Capstone Project
The last course is the capstone project , where you will combine all the technologies learned across the program and build a data engineering portfolio project.
In this Udacity Data Engineering Capstone Project , you have to gather data from several different data sources , transform, combine, and summarize it, and create a clean database for others to analyze. Throughout the project guidelines, suggestions, tips, and resources will be provided by Udacity.
Now, let’s see whether you should enroll in the Udacity Data Engineer Nanodegree program or not-
Should You Enroll in Udacity Data Engineer Nanodegree?
Udacity Data Engineering Nanodegree is not a beginner-level program. This is the only intermediate-advanced data engineering course out there. This Nanodegree program requires the following skills before enrolling in the program-
If you are a beginner in Python , then don’t directly enroll in this program . To get 100% from Udacity Data Engineering Nanodegree Program, you need to know the following concepts-
- Strings, numbers, and variables; statements, operators, and expressions;
- Lists, tuples, and dictionaries; Conditions, loops;
- Procedures, objects, modules, and libraries;
- Troubleshooting and debugging; Research & documentation;
- Problem-solving; Algorithms and data structures
Along with Python knowledge, you should be familiar with SQL programs such as Joins, Aggregations, Subqueries , Table definition, and manipulation (Create, Update, Insert, Alter) .
If you meet these prerequisites, then you can enroll in the Udacity Data Engineering Nanodegree . If not, first learn Python and SQL .
Now let’s see the price and duration of the Udacity Data Engineering Nanodegree program-
How Much Time and Money Do You have to Spend on Udacity Data Engineering Nanodegree ?
According to Udacity, the Udacity Data Engineering Nanodegree program will take 5 months to complete if you spend 5-10 hours per week.
And for 5 months it costs more than $800 . Udacity offers two options: either pay the complete amount upfront, or pay monthly installments of $399/month .
So this is according to Udacity, but here I would like to tell you how you can complete the full Udacity Data Engineering Nanodegree program in less time .
Excited to know…How?
So, let’s see-
How to Complete the Udacity Data Engineering Nanodegree In Less Time?
To complete the Udacity Data Engineering Nanodegree program in less time, you need to manage your time productively.
You need to plan your day before and create a to-do list for each day. And you need to spend a good amount of time daily on the program.
According to Udacity, you need to spend 10 hours per week to complete the whole program in 5 months.
That means you need to spend 1.5 hours daily; but if you double that and give 3 hours daily, then you can complete the whole Nanodegree program in less than 3 months.
For managing your time and avoiding any distractions, you can use the Pomodoro technique to increase your learning.
As I mentioned earlier, after each course, you have to work on a project. And each project has a set of rubrics . So before starting a section, I would suggest you first study the rubric of the project. The rubric will give you a rough idea of which topics and lectures are important for the project, so that you can make notes while watching these lectures.
You can also implement the project phases right after watching the related lecture. This way, you avoid watching each video twice: once at the time of learning and a second time while working on the project.
I hope these tips will help you to complete the Udacity Data Engineering Nanodegree program in less time. But you can also get Udacity Scholarship .
Do you want to know…How?
So, let me tell you how to get Udacity Scholarship.
How to Get Udacity Scholarship?
To apply for Udacity Scholarship, you need to go to their Scholarship page , which looks like that-
On this page, you have to find the scholarship for the program you want to enroll in. If you find your Nanodegree program on the list, then apply for the scholarship by filling out these details-
- Background Information
- Prerequisite Knowledge
- Additional Questions
After filling out these details, you need to click on the “Save and Submit” button. By doing so, you have applied for a Udacity Scholarship, and if you are selected, you will be notified via email.
But if your program is not listed in the scholarship section, then you can fill out the form in the Scholarship page section, and you will be notified when a new scholarship becomes available. I hope you now understand how to apply for a Udacity scholarship.
The next important thing you need to know is who will teach you and what their qualifications are. So, let's look at the instructors:
Are Instructors Experienced?
- Amanda Moran – She is a developer advocate at DATASTAX.
- Ben Goldberg – Staff Engineer at SPOTHERO
- Sameh El-Ansary – CEO of Novelari & Assistant professor at Nile University
- Olli Iivonen – Data Engineer at WOLT
- David Drummond- VP of Engineering at Insight
- Judit Lantos- Data Engineer at Split
- Juno Lee- Instructor
Learning from such experienced and knowledgeable instructors is amazing and helpful. This is the reason I personally love Udacity. Udacity also has its forums, where you can ask the instructors your questions, and they will answer your queries.
Now I would like to mention some more Pros and Cons of the Udacity Data Engineering Nanodegree .
Pros of Udacity Data Engineering Nanodegree
- The structure of the course is perfect if you are focused on hands-on practice and believe in “how” to do things like ETL and Data Warehousing.
- You will get Technical mentor support and the mentor will guide you from the start of your Nanodegree program until you finish the whole program.
- Provides good background information on data modeling, and traditional data schemas.
- The Udacity Data Engineering Nanodegree highlights data quality and data governance and how to introduce tests within your data pipeline.
- Udacity provides a great community of help. They have a Stackoverflow-style Q&A forum for people who are stuck with assignments, but it also has a pretty large slack, with channels for individual assignments and nano-degrees.
- Udacity provides a highly flexible learning program, so you can learn at your own pace and from the comfort of your smartphone.
Cons of Udacity Data Engineering Nanodegree
- All of the projects (except the capstone) were based on the same problem domain (a song streaming startup), with the same data, using the same schema, and using different tools. So if you are not good at learning a new API, it will be difficult for you.
- Some of the lectures were not very polished, had very little post-editing, and were not rehearsed.
So the next and most important question is-
Is Udacity Data Engineer Nanodegree Worth It?
Yes, the Udacity Data Engineering Nanodegree is worth it because you will get a chance to advance your skills as a Data Engineer and work on various real-world problems such as Data Modeling with Postgres and Apache Cassandra, building data pipelines with Airflow, etc. Along with that, you will get one-to-one mentorship and a personal career coach.
But I would suggest buying Udacity Data Engineering Nanodegree when they are providing any discount on the program. Because Udacity Data Engineering Nanodegree is expensive compared to other online courses.
Most of the time, Udacity offers some discounts. When they offer a discount, it appears something like this:
When you click on the “New Personalized Discount” , you will be asked to answer 2 questions.
After answering these two questions, press the “Submit Application” button. And then you will get a discount with a unique Coupon Code. Simply copy this code and paste it at the time of payment.
Now you can save money on the Udacity Data Engineering Nanodegree program. If you get the Nanodegree at a discount, then I would say it is totally worth it.
The Udacity Data Engineering Nanodegree is good for you if you have intermediate-level Python and SQL knowledge and want to advance your skills as a Data Engineer, and if you want hands-on practice and believe in “how” to do things like ETL and Data Warehousing.
I’ve learned, and applied, skills that I’d always wanted to get to grips with. After hearing so much about Relational and NoSQL databases, it’s satisfying to become confident building them with PostgreSQL and Apache Cassandra in only the first two projects! Rob R.
My first project was very challenging to me but the reviewers and my mentor helped me constantly in improving my code and documentation and guided me well towards finishing the project. It was a great learning experience and looking forward to the other projects as well. Nitheesha T.
The program is good, but you need to look at the dataset in the first project. I’m not completely new to ETL, but to have a requirement “Sparkify are interested in the songs their users are listening to”, then a dataset that doesn’t get near that must be confusing for real newbies. Nicole S.
Now it’s time to wrap up this Udacity Data Engineering Nanodegree Review .
I hope this Udacity Data Engineering Nanodegree Review helped you decide whether or not to enroll in the Udacity Data Engineering Nanodegree.
If you found this Udacity Data Engineering Nanodegree Review helpful, you can share it with others. And if you have any doubts or questions, feel free to ask me in the comment section.
All the Best!
Yes, many Nanodegree graduates have gotten jobs. Udacity surveyed over 4,200 of its students, and nearly 70% of those surveyed indicated that a Nanodegree program helped them advance their careers. You can check the top companies that hired Udacity graduates here.
The Dice 2020 Tech Job Report labeled data engineer as the fastest-growing job in technology in 2019, with a 50% year-over-year growth in the number of open positions. The report also found it takes an average of 46 days to fill data engineering roles and predicted that the time to hire Data Engineers may increase in 2020 “as more companies compete to find the talent they need to handle their sprawling data infrastructure.”
According to Indeed , the average salary of a Data Engineer is $129,001 per year in the United States and a $5,000 cash bonus per year.
Yes, you can mention Udacity projects in your resume. That’s the main objective of Udacity to work on projects so that your portfolio becomes stronger and you can get jobs.
Udacity Data Engineering Nanodegree Review 2023: Be a Successful Data Engineer
In this Udacity Data Engineering nanodegree review, I will share my experience of taking this course along with its pros and cons.
Planning to enroll in Udacity's Data Engineering Nanodegree? A few months back I had the same thought. I hope my feedback on Udacity's Data Engineering Nanodegree will help you make the right choice. Here's my story.
I studied Computer Science Engineering, and during my college years I got interested in the world of Data Science. So I oriented my internships toward this topic and learned the valuable and essential skills required for the Data Scientist, Data Analyst, and Data Engineer job profiles, the three magic roles in the world of Data Science.
After I graduated from college, I received a job opportunity as a Data Engineer in a biotechnology company. Working in that company, I came to know about various tools and techniques that were used by them such as Airflow, Spark, Apache Cassandra, etc.
These tools and software attracted me all the more since my background was mostly in Software Engineering and Development. I realized that I had a knowledge gap in Data Modelling, data processing engines, and some of the best practices with data.
Since all of these topics were perfectly covered in the Data Engineering Nanodegree from Udacity, I decided to apply.
In this Udacity Data Engineering Nanodegree review, I am going to talk about the syllabus in detail and my project experience.
So keep reading…
If you buy the course through the links in this article, we can earn some affiliate commission. This helps us to keep this blog up and running for your benefit.
Table of Contents
How Much Did I Pay for Udacity's Data Engineering Nanodegree?
Talking about the pricing, well, I was not that lucky. At the time I wanted to enroll, there was no valid offer or discount available. That's why I paid the full price for a 5-month subscription.
It was 1400€, which is around $1600. Without an offer it is very overpriced, since you are paying for just 5 months of access (unless you finish earlier).
You have access to paid services, but I don’t recommend paying such a high price. If you get a discount of 50% or even 70%, go for it! The outcome will be worth it.
Duration: 5 months (5 hours/week)
Also Read: My experience with Udacity Nanodegree
Let me shed some light on the course structure and projects in this Udacity Data Engineering Nanodegree review.
Which Topics Are Covered in Udacity's Data Engineer Nanodegree?
In general, I liked the overall syllabus and structuring of the modules with a realistic project.
They go from the basics to the most demanding and complex projects and you are always connected to the main topic.
Lesson 1: Introduction to Data Modeling
In this module, you will be learning the main and key differences between a relational and non-relational data model. It is a great start to the NoSQL world.
This is very important. You need to fully understand this concept to make good decisions when working as a Data Engineer in a team.
The required task was to create a table for our database. As you can see, the first query follows standard SQL: we build a table to fit our data model.
In the second query, the objective is instead to create a table that fits the query. That's one of the main differences between SQL and NoSQL databases.
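To make that difference concrete, here is a toy sketch in Python. I use the standard library's sqlite3 so it runs anywhere, and all table and column names are my own invention, not the course's actual schema:

```python
import sqlite3

# Toy contrast between the two modeling styles. All table and column
# names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Relational (SQL) style: normalize the DATA and let JOINs serve
# whatever queries come later.
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE plays (user_id INTEGER, song TEXT)")
cur.execute("INSERT INTO users VALUES (1, 'Ada')")
cur.executemany("INSERT INTO plays VALUES (?, ?)",
                [(1, "Song A"), (1, "Song B")])

# "Which songs did Ada play?" needs a JOIN at query time.
rows = cur.execute("""
    SELECT u.name, p.song FROM users u
    JOIN plays p ON u.user_id = p.user_id
    ORDER BY p.song""").fetchall()

# NoSQL (Cassandra) style: model the QUERY. One denormalized table is
# created up-front for exactly this access pattern -- no JOINs exist.
cur.execute("CREATE TABLE plays_by_user (name TEXT, song TEXT)")
cur.executemany("INSERT INTO plays_by_user VALUES (?, ?)", rows)
answer = cur.execute(
    "SELECT song FROM plays_by_user WHERE name = 'Ada' ORDER BY song"
).fetchall()
print(answer)  # [('Song A',), ('Song B',)]
```

In Cassandra the second table would carry a partition key chosen for that one query; the point is that you design the table after the query, not before it.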
Also, see – Udacity Data Analyst Nanodegree Review
Project 1: Data Modeling with Postgres
This is the first project of the Nanodegree. From this project onwards, you model the data using a schema pattern common in Business Intelligence, with dimension tables around one fact table.
This project uses a relational data model. It is a beginner-friendly project where you can get familiar with Udacity's working methodology.
If you have experience working with SQL and Python, this project will be pretty straightforward and simple for you. You are just required to read the documentation of the Python library psycopg2.
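If you want a feel for what the project's code looks like, here is a minimal sketch of the insert pattern. I use sqlite3 as a stand-in so the snippet is self-contained; psycopg2 exposes the same DB-API 2.0 interface, except its placeholder is `%s` instead of `?`. Table and column names below are illustrative, not the project's exact schema:

```python
import sqlite3

# Sketch of a star-schema ETL insert pattern. sqlite3 stands in for
# psycopg2 (same DB-API 2.0 shape; psycopg2 uses %s placeholders
# instead of ?). Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE songs
               (song_id TEXT PRIMARY KEY, title TEXT, duration REAL)""")
cur.execute("""CREATE TABLE songplays
               (start_time TEXT, user_id INTEGER, song_id TEXT)""")

# Each parsed record becomes a parameterized INSERT -- never
# string-format raw values into SQL.
cur.execute("INSERT INTO songs VALUES (?, ?, ?)",
            ("S1", "Imaginary Song", 210.5))
cur.execute("INSERT INTO songplays VALUES (?, ?, ?)",
            ("2018-11-01 21:01:46", 8, "S1"))
conn.commit()

count = cur.execute("SELECT COUNT(*) FROM songplays").fetchone()[0]
print(count)  # 1
```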
Project 2: Data Modelling with Apache Cassandra
So the second project is Data Modelling with Apache Cassandra . In this project, you need to do the same task that you have performed in the previous one. But in this project, you will have to follow and implement the best practices for creating a NoSQL database.
The results of these two projects might be the same, but you will learn how to do the same task following best practices and in different ways.
I liked the fact that the difficult part of the projects was not on the database or the data. It was on the steps that you need to follow to manipulate and change your mindset. “It’s NoSQL, Don’t think in SQL”.
Lesson 2: Cloud Data Warehouses
So, lesson 2 is about the Cloud and Data Warehouses. In this lesson you will get in touch with cloud computing using AWS Infrastructure. You will learn about EC2 machines, S3 buckets, and RedShift.
Even at later points, when I was learning on my own, I would end up testing things on AWS, thanks to what I did and learned in this course.
Also, I would like to let you know in this Udacity Data Engineer review that Udacity gives you $25 in credits to use inside AWS. Cool, isn't it?
Project: Build a Cloud Data Warehouse
So, the project is to build a cloud data warehouse. Here you will need to work on building a data warehouse on AWS cloud. To do so you will need to orchestrate the interaction between S3 buckets and your RedShift database.
Let me be honest in this Udacity Data Engineer Nanodegree review: this project was groundbreaking for me, because after this stage I was no longer afraid of building a cloud data warehouse for my own projects.
You can also check out the screenshots above. You will need to configure and connect the AWS instance to your project workspace.
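The S3-to-Redshift orchestration follows a common two-step pattern: `COPY` the raw files from S3 into staging tables, then `INSERT INTO ... SELECT` to reshape them into the star schema. Here is a rough sketch; the bucket, IAM role, and table/column names are placeholders, not the project's actual values:

```python
# Two-step load pattern for a Redshift warehouse fed from S3.
# Every identifier below (bucket, role ARN, table/column names) is a
# placeholder, not the project's actual value.
staging_copy = """
COPY staging_events
FROM 's3://example-bucket/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS JSON 'auto';
"""

# Reshape the staged rows into the star schema inside the warehouse,
# so Redshift does the heavy lifting rather than the client.
fact_insert = """
INSERT INTO songplays (start_time, user_id, song_id)
SELECT e.ts, e.user_id, s.song_id
FROM staging_events e
JOIN staging_songs s ON e.song_title = s.title;
"""

for statement in (staging_copy, fact_insert):
    print(statement.strip().splitlines()[0])
```

Doing the transform inside Redshift (rather than row-by-row from Python) is the whole point of the staging-table approach.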
Lesson 3: Spark and Data Lakes
Lesson three is Spark and Data Lakes. In this lesson, you will learn about the advantages of using Spark as your data processing engine.
You will learn to use it for cleaning and aggregating data. It is one of the most popular tools for working with data.
I was not aware of technologies like Spark and Hadoop. It was great to learn about them and the state-of-the-art tools that one should know in the Data Science industry.
Project: Build a Data Lake
The project is to build a data lake and the task itself is quite simple. You will have to load data from S3 into in-memory tables and then back to S3 after modeling the data. In this project, you will experience the power of Spark.
I found this project quite useful, but at this scale you can barely see how fast Spark is, or the big difference between plain Python and Spark when modeling big datasets.
As you can see in the images, the project template specifies that you will have to use Spark. This technology is layered on top of the tools you learned in the previous lectures, and by the end you will have built a different mindset for structuring and developing your future projects.
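In the project itself this flow is written in PySpark (roughly `spark.read.json(...)` to load, then `df.write.partitionBy(...).parquet(...)` to write back). To show just the shape of the pipeline without a Spark cluster, here is a plain-Python imitation with invented data:

```python
import json
import os
import tempfile
from collections import defaultdict

# Plain-Python imitation of the data-lake flow: read raw records,
# hold them as an in-memory "table", then write them back partitioned
# by a column -- the shape Spark gives you with spark.read.json(...)
# and df.write.partitionBy("year").parquet(...). Data is invented.
raw = [{"song": "A", "year": 2018},
       {"song": "B", "year": 2019},
       {"song": "C", "year": 2018}]

partitions = defaultdict(list)
for record in raw:                      # "transform": group by partition key
    partitions[record["year"]].append(record)

out_dir = tempfile.mkdtemp()
for year, rows in partitions.items():   # "write back": one dir per key
    part_dir = os.path.join(out_dir, f"year={year}")
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part-0000.json"), "w") as fh:
        json.dump(rows, fh)

print(sorted(os.listdir(out_dir)))  # ['year=2018', 'year=2019']
```

The `year=2018`-style directory names mirror the Hive-partitioning layout Spark produces, which is what lets later queries skip irrelevant partitions.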
Lesson 4: Data Pipelines with Airflow
In this lesson you will learn about Airflow. It is a simple interface for monitoring and designing pipelines for our automated tasks.
This tool is critical and essential for you when you are focusing on designing pipelines, their development, and maintenance.
Project: Data Pipelines with Airflow
So, in this project you will need to use Airflow to structure and monitor your pipelines. You will also need to build a data warehouse similar to the project in lesson 2, but adding extra complexity for using Airflow.
I found this project quite challenging because of the knowledge it requires from the previous projects, and because you have to add Airflow on top of RedShift and S3 buckets. But, in the end, it was very pleasing to complete.
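Conceptually, what Airflow manages is a DAG of tasks executed in dependency order (its own API expresses this declaratively, e.g. `stage_task >> load_task`). Here is a toy scheduler that runs a warehouse-style pipeline in topological order; the task names are illustrative, not the project's actual operators:

```python
# Conceptual sketch of what Airflow manages: tasks form a DAG and run
# in dependency order. Task names mirror the warehouse project's shape
# (stage -> load fact -> quality check) but are illustrative only.
deps = {
    "stage_events": [],
    "load_fact": ["stage_events"],
    "quality_check": ["load_fact"],
}

def topo_order(deps):
    """Return tasks so every task runs after its upstream dependencies."""
    done, order = set(), []
    def visit(task):
        for upstream in deps[task]:
            if upstream not in done:
                visit(upstream)
        if task not in done:
            done.add(task)
            order.append(task)
    for task in deps:
        visit(task)
    return order

run_order = topo_order(deps)
print(run_order)  # ['stage_events', 'load_fact', 'quality_check']
```

Airflow adds scheduling, retries, and monitoring on top of this ordering, which is why it becomes essential once pipelines run unattended.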
Also, see – Review of Udacity’s Data Streaming Nanodegree
Lesson 5: Capstone Project
Project: Data Engineering Capstone Project
This was the final project, the Capstone project, and I chose its topic personally. I decided to create a report with agricultural data, covering an entire closed-loop scenario. For simplicity, I used just one Python notebook. The project can be divided into 5 steps:
- Gather data from public sources (in my case, http://www.fao.org/faostat/en/#data )
- Load the tabular data, clean it, rearrange it and create the data model in-memory
- Load the data model into the Redshift data warehouse using the in-memory tables and the S3 buckets for batch processing the fact table
- Ensure the process was done correctly by running data quality checks on your DB data model
- Create a visual report using a reporting tool (I chose Power BI)
And the result is below.
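The data quality checks in step 4 usually boil down to two assertions per table: the table is non-empty, and key columns contain no NULLs. A minimal sketch (table and column names here are invented for illustration, not my actual data model):

```python
import sqlite3

# Minimal data-quality checks of the kind used in step 4: the table
# must be non-empty, and key columns must not contain NULLs.
# Table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE crop_facts (country TEXT, crop_yield REAL)")
cur.executemany("INSERT INTO crop_facts VALUES (?, ?)",
                [("IT", 4.2), ("FR", 5.1)])

n_rows = cur.execute("SELECT COUNT(*) FROM crop_facts").fetchone()[0]
n_null = cur.execute(
    "SELECT COUNT(*) FROM crop_facts WHERE country IS NULL").fetchone()[0]

assert n_rows > 0, "quality check failed: table is empty"
assert n_null == 0, "quality check failed: NULL keys present"
print(f"quality checks passed ({n_rows} rows, {n_null} null keys)")
```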
So this is all about the lesson and the projects in this Udacity Data Engineering Nanodegree review.
How was my project experience?
Every project added one more step on top of the previous one.
Personally, the most challenging part of the lesson and projects for me was to complete the Airflow project. Because in that project you are required to combine and use the knowledge of Apache Airflow with RedShift and S3 Buckets. You can also add Spark to it as per your wish.
How much time did I take to complete this Nanodegree?
It took me a bit longer than expected to complete this Data Engineering Nanodegree. I enrolled in October 2020, but because of work I was not able to focus on the course until January. I finished the Nanodegree in March, paying for one extra month with an offer (+80€).
Honestly, the actual work took me around 2 to 3 months. After my experience, I think that if someone is willing to go 100% for the course, they should choose a monthly payment plan instead of the 5-month subscription.
If you are studying this Nanodegree in parallel to anything else then I strongly recommend you to choose the 5-month subscription as speedrunning the projects will only make you struggle on project requirements.
Also, see – My experience of Data Scientist Nanodegree
Well there are several features of Udacity that I should tell you in this Udacity Data Engineering Nanodegree review.
First, the mentorship. I did not approach a mentor directly for advice, but I had the chance to submit my GitHub repository and LinkedIn profile for review, and I don't think that is something really special.
If you want to improve your GitHub repo or your LinkedIn profile, you can do 1 to 2 hours of research and visit more than 20 profiles; you will learn as much as a mentor would teach you. Still, the overall mentorship was very straightforward, and I appreciate the effort the mentors put in.
The project reviewers were outstanding communicators and reviewed our projects with a very positive attitude, marking the areas where improvement was required, especially when you did not pass the requirements.
I was crystal clear about what I wanted to do before, during, and after the course so I did not opt for the career services.
Pros and Cons of Udacity’s Data Engineering Nanodegree
My experience with Udacity Data Engineering Nanodegree was filled with positivity, the positivity that radiates when you are communicating with your mentor, and of course, let’s not forget about the tools and technologies that Udacity lets you work with.
Nowadays keeping the situation and circumstances in mind, most of us are at home and learning from our bedroom. Learning from online tutorials or just from other platforms that are mostly focused on just a bunch of videos.
It is very reinforcing to have personal reviews and monitoring, to ensure we are on the right path and not messing around with random tools.
Well, this Nanodegree is quite expensive if we think about the service itself: we are essentially buying videos and predefined guidelines.
In the end, it's worth it because of the knowledge you gain, the flexibility for studying, and the personal monitoring. Thinking just about the content, though, I am still a bit skeptical.
Is Udacity’s Data Engineer Nanodegree Worth it?
Before concluding this Udacity Data Engineering review, I want to ask you a few basic questions:
- Is your professional background lacking specific knowledge of technologies, techniques, and software that are widely used nowadays?
- Do you feel that you want to work as a Data Scientist, Data Analyst, or Data Engineer?
If yes, then you should go for this Data Engineering Nanodegree .
But you should search for an offer if you don't want to pay the full price of the Nanodegree, because $1400 is too expensive. Even before finishing, I could see the difference when applying to Data Engineer job offers.
Hope you find this Udacity Data Engineer Nanodegree Review useful.
I am a Data Engineer and Computer science student passionate about Data Science. I have 2 years of experience working for public and private entities.
Can I get a job with Udacity's Data Engineering Nanodegree?
I can say this Nanodegree has played a vital role in shaping my career as a Data Engineer. Today I am working at Glovo as a Data Engineer. If I could find a job, surely you can too.
What is the salary of a Data Engineer?
On average, a Data Engineer earns around $129,000/year in the United States (Source: Indeed).
Can I complete Udacity’s Data engineering nanodegree in a month?
Its possible to complete this nanodegree in a month provided you are familiar with the data engineering. A beginner must dedicate 3-4 months.
Udacity Data Engineer vs Data Scientist Nanodegree: which one should I choose?
It all depends on which role you want to go with. I suggest you first look at the job roles and decide. Both are among Udacity's most-enrolled Nanodegree programs.
Udacity’s Data Engineering Nanodegree Program – Ratings and Review!
In Udacity’s Data Engineering Nanodegree Program , learn to design and build production-ready data infrastructure, an essential skill for advancing your data career.
Enroll in Udacity’s Data Engineering Nanodegree Program today!
Overview of the Udacity’s Data Engineering Nanodegree Program
In Udacity’s Data Engineering Nanodegree Program , you’ll learn to design data models, build data warehouses and data lakes, automate data pipelines, and work with massive datasets. At the end of the program, you’ll combine your new skills by completing a capstone project.
Udacity’s Data Engineering Nanodegree Program Syllabus
1. Data Modeling
Learn to create relational and NoSQL data models to fit the diverse needs of data consumers. Use ETL to build databases in PostgreSQL and Apache Cassandra.
2. Cloud Data Warehouses
Sharpen your data warehousing skills and deepen your understanding of data infrastructure. Create cloud-based data warehouses on Amazon Web Services (AWS).
3. Spark and Data Lakes
Understand the big data ecosystem and how to use Spark to work with massive datasets. Store big data in a data lake and query it with Spark.
4. Data Pipelines with Airflow
Schedule, automate, and monitor data pipelines using Apache Airflow. Run data quality checks, track data lineage, and work with data pipelines in production.
5. Capstone Project
Combine what you’ve learned throughout the program to build your own data engineering portfolio project.
Instructors of Udacity’s Data Engineering Nanodegree Program
Here is a list of instructors associated with the Udacity Nanodegree Program:
1. Amanda Moran (Developer Advocate at Datastax)
Amanda is a Developer Advocate for DataStax after spending the last 6 years as a Software Engineer on 4 different distributed databases. Her passion is bridging the gap between customers and engineering. She has degrees from the University of Washington and Santa Clara University.
2. Ben Goldberg (Staff Engineer at Spothero)
In his career as an engineer, Ben Goldberg has worked in fields ranging from Computer Vision to Natural Language Processing. At SpotHero, he founded and built out their Data Engineering team, using Airflow as one of the key technologies.
3. Sameh El-Ansary (CEO at Novelari And Assistant Professor at Nile University)
Sameh is the CEO of Novelari, lecturer at Nile University, and the American University in Cairo (AUC) where he lectured on security, distributed systems, software engineering, blockchain and BigData Engineering.
4. Olli Iivonen (Data Engineer at Wolt)
Olli works as a Data Engineer at Wolt. He has several years of experience on building and managing data pipelines on various data warehousing environments and has been a fan and active user of Apache Airflow since its first incarnations.
5. David Drummond (VP of Engineering at Insight)
David is VP of Engineering at Insight where he enjoys breaking down difficult concepts and helping others learn data engineering. David has a PhD in Physics from UC Riverside.
6. Judit Lantos (Data Engineer at Split)
Judit was formerly an instructor at Insight Data Science helping software engineers and academic coders transition to DE roles. Currently, she is a Data Engineer at Split where she works on the statistical engine of their full-stack experimentation platform.
7. Juno Lee (Instructor)
As a data scientist, Juno built a recommendation engine to personalize online shopping experiences, computer vision and natural language processing models to analyze product data, and tools to generate insight into user behavior.
Review: “Definitely worth enrolling in!”
After successfully completing the Nanodegree program, you will find that it is very comprehensive and teaches you everything it promises.
Plus, the certification is also really helpful since this Udacity Nanodegree Program is well-recognized amongst multiple industries.