After a long wait, the Microsoft Azure Data Engineering exams and certification were released in January 2019.
The certification is Microsoft Certified: Azure Data Engineer Associate, and it requires passing both exams, DP-200 and DP-201.
I just sat the DP-200 exam, which is still in beta. Working with a lot of Microsoft partners as part of my job, I regularly get requests to assist with technical certifications and preparation, and most often to advise on whether an exam really helps a technical resource get ready for real-world projects. So my primary intention was to experience and examine the content first hand.
General Exam format:
Exam DP-200 consists of various types of questions but is, in general, very similar to other Microsoft Azure exams. The number of questions usually varies slightly between exam instances, but in my experience the beta version consisted of 58 questions overall.
- Case study questions (about 4–5). These present you with a comprehensive case study, and each one consists of multiple questions. The primary goal of these questions is to examine the candidate’s overall architecture skills.
- Scenario questions: each presents a scenario, and based on it asks either
- a multiple-choice question with a single answer,
- a multiple-choice question with multiple answers,
- a drag-and-order question where you arrange multiple steps to achieve a solution, or
- a drag-and-drop question where you fill in the gaps (usually in PowerShell cmdlets, JSON documents, or Azure CLI commands).
There was plenty of time to answer the 58 questions, and I ended up with about 60 minutes to spare.
What is covered in the exam
I encourage you to review the exam objectives on the exam page (especially since, for a new exam, the objectives may change), but below are the areas I suggest focusing your studies on, from a product perspective:
- Azure SQL DW: Clearly you’d need a good understanding of the product, but the areas below are where I’d expect candidates to have lower proficiency in general:
- Security, encryption and monitoring Azure SQL DW
- How to load data from Blob to SQL DW
- External tables in SQL DW
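To make the Blob-to-SQL DW load pattern concrete, here is a minimal sketch that composes the PolyBase-style T-SQL statements as Python strings. All object names, columns, and the storage URL are hypothetical placeholders, and a real load would also need a database-scoped credential.

```python
# A sketch of the PolyBase load pattern for SQL DW: define an external data
# source over Blob Storage, an external table over the files, then load the
# data in parallel with CREATE TABLE AS SELECT. Names here are hypothetical.

def polybase_load_statements(container_url: str, table: str) -> list[str]:
    """Compose the T-SQL statements (as strings) for a Blob -> SQL DW load."""
    return [
        # 1. Point SQL DW at the Blob container (credential omitted for brevity).
        f"CREATE EXTERNAL DATA SOURCE BlobSource "
        f"WITH (TYPE = HADOOP, LOCATION = '{container_url}');",
        # 2. Describe the file layout.
        "CREATE EXTERNAL FILE FORMAT CsvFormat "
        "WITH (FORMAT_TYPE = DELIMITEDTEXT);",
        # 3. External table: a schema over the files; no data is moved yet.
        f"CREATE EXTERNAL TABLE ext_{table} (id INT, value NVARCHAR(100)) "
        f"WITH (LOCATION = '/data/', DATA_SOURCE = BlobSource, FILE_FORMAT = CsvFormat);",
        # 4. CTAS performs the actual parallel load into a distributed table.
        f"CREATE TABLE {table} WITH (DISTRIBUTION = HASH(id)) "
        f"AS SELECT * FROM ext_{table};",
    ]

statements = polybase_load_statements(
    "wasbs://files@myaccount.blob.core.windows.net", "sales")
```

The external table (step 3) is only metadata over the files, which is why a query against it rescans storage every time; the CTAS in step 4 is what materializes the data inside the warehouse.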
- Azure SQL DB: this is one of the most important offerings in the Azure data portfolio, so I recommend familiarizing yourself with the product and paying extra attention to the topics below.
- What is SQL DB managed instance?
- What is SQL DB Hyperscale?
- What are the different types of SQL DB you can provision, when would you use one or the other?
- Moving SQL Server on-prem to Azure
- Security, TDE and data masking.
- Monitoring and auditing.
- Elastic pools.
- Backup and recovery.
- High Availability and Disaster Recovery options and architecture in Azure SQL DB.
- PowerShell or CLI commands to provision an instance.
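As a rough illustration of the provisioning topic, this sketch assembles an `az sql db create` command as an argument list; the resource names are placeholders, and you should confirm the flags against `az sql db create --help`.

```python
# Hypothetical sketch of the Azure CLI call the exam may ask you to compose
# for provisioning a single Azure SQL Database (the logical server must
# already exist). Resource names below are placeholders.

def az_sql_db_create(name: str, server: str, resource_group: str,
                     service_objective: str = "S0") -> list[str]:
    """Assemble the `az sql db create` command as an argument list."""
    return [
        "az", "sql", "db", "create",
        "--name", name,
        "--server", server,
        "--resource-group", resource_group,
        "--service-objective", service_objective,  # pricing/performance tier
    ]

cmd = az_sql_db_create("appdb", "my-sql-server", "my-rg")
# In a real script you would execute it with subprocess.run(cmd, check=True).
```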
- Azure Data Factory:
- The different integration runtime options
- Linked services
- Security options for ADF
- How it can be implemented in a Hybrid Cloud scenario
- There may even be questions around how to compose JSON for pipelines or linked services.
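For the JSON-composition questions, it helps to have the general shape of a linked service document in your head. The sketch below builds one as a Python dict; the property names follow the ADF schema, while the service name and connection string are placeholders. The `connectVia` reference to a self-hosted integration runtime is also what enables the hybrid scenario.

```python
import json

# A rough sketch of the JSON shape of an ADF linked service, the kind of
# document the drag-and-drop questions ask you to complete. The name and
# connection details are placeholders, not real credentials.

linked_service = {
    "name": "AzureBlobLinkedService",  # hypothetical name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": (
                "DefaultEndpointsProtocol=https;"
                "AccountName=<account>;AccountKey=<key>"
            )
        },
        # A self-hosted integration runtime is the piece that lets ADF reach
        # on-premises data stores in a hybrid cloud scenario.
        "connectVia": {
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}

document = json.dumps(linked_service, indent=2)
```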
- Blob Storage and Azure Data Lake Storage Gen2:
- What is the difference, when to use one or the other?
- How to optimize analytical workloads by using ADLS Gen2
- Security Options and configurations.
- Redundancy options and how to architect an HA/DR solution.
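One way to internalize the Blob vs. ADLS Gen2 difference: in a flat blob namespace, "directories" are just name prefixes, so renaming one rewrites every blob under it, whereas the Gen2 hierarchical namespace makes it a single metadata operation. A toy simulation of the flat case:

```python
# Toy model of why the ADLS Gen2 hierarchical namespace matters for analytics
# workloads. In a flat namespace, a "directory" rename is really a copy-and-
# delete of every blob under the prefix, so its cost grows with file count;
# with a hierarchical namespace it is a single metadata operation.

def rename_prefix_flat(blobs: dict[str, bytes], old: str, new: str) -> int:
    """Rename a 'directory' in a flat namespace; returns blobs touched."""
    moved = {name: data for name, data in blobs.items() if name.startswith(old)}
    for name, data in moved.items():
        blobs[new + name[len(old):]] = data  # copy under the new prefix
        del blobs[name]                      # delete the original
    return len(moved)  # cost is O(number of files), not O(1)

store = {"raw/2019/a.csv": b"", "raw/2019/b.csv": b"", "curated/c.csv": b""}
touched = rename_prefix_flat(store, "raw/", "staging/")
```

Pipelines that stage data by moving or renaming whole folders are exactly where this difference shows up, which is why ADLS Gen2 is the recommended store for analytical workloads.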
- Azure Stream Analytics:
- Inputs and outputs in an ASA job.
- Parallelism of ASA jobs.
- Optimizing ASA Streaming Units.
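The parallelism topic boils down to one idea: an ASA job is "embarrassingly parallel" when the input partitions (for example, Event Hubs partitions) line up with the query's PARTITION BY clause, so each partition can be aggregated independently. A toy model of that idea:

```python
from collections import defaultdict

# Simulated partition-aligned streaming aggregation: because each event
# carries a partition id and the aggregation is keyed by that same id,
# every partition could be processed by its own worker with no
# cross-partition shuffle -- the shape ASA rewards with more throughput
# per Streaming Unit. Event payloads here are made up.

def count_per_partition(events: list[tuple[int, str]]) -> dict[int, int]:
    """events are (partition_id, payload); each partition aggregates alone."""
    counts: dict[int, int] = defaultdict(int)
    for partition_id, _payload in events:
        counts[partition_id] += 1  # no data crosses partition boundaries
    return dict(counts)

events = [(0, "a"), (1, "b"), (0, "c"), (2, "d")]
result = count_per_partition(events)
```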
- CosmosDB: Once again, a very important part of many modern data platforms, so not only for the exam but also to help you in your projects, I strongly recommend getting a good understanding of CosmosDB.
- Consistency levels in CosmosDB (there are five consistency levels; make sure you know the differences and when to use each one)
- The different APIs CosmosDB provides (make sure you have a fair understanding of the APIs and how they organize data, i.e. documents, wide-column tables, key-value pairs, etc.)
- Security and access control
- HA and DR options.
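For the consistency levels, a toy model of the two extremes may help. The replica lag below is simulated and is not how CosmosDB is actually implemented: a strong read always reflects the latest committed write, while an eventual read may answer from a stale replica.

```python
# Toy model of the two extremes among CosmosDB's five consistency levels
# (strong, bounded staleness, session, consistent prefix, eventual).
# Values are monotonically increasing versions of one item.

def strong_read(replicas: list[dict]) -> int:
    # Strong: the read is only acknowledged once replicas agree, so you
    # always observe the newest committed value.
    return max(r["value"] for r in replicas)

def eventual_read(replicas: list[dict], which: int) -> int:
    # Eventual: any single replica answers, possibly a stale one.
    return replicas[which]["value"]

# Replica 0 has applied the latest write (2); replica 1 lags behind (1).
replicas = [{"value": 2}, {"value": 1}]
value_strong = strong_read(replicas)
value_stale = eventual_read(replicas, 1)
```

The in-between levels (bounded staleness, session, consistent prefix) trade read latency and availability against how stale that second read is allowed to be, which is exactly the trade-off the exam questions probe.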
- Azure HDInsight: HDI is the PaaS Hadoop offering in Azure. In my experience of the exam there were quite a large number of questions around HDI. In real life I don’t recommend HDI as a component of a solution, with some very rare exceptions, but judging from the number of questions, the exam authors considered it an important part of data solutions. At least for the sake of passing the exam, I’d recommend getting familiar with HDI and putting slightly more focus on the topics below.
- Monitoring a cluster from Azure portal or through Apache Ambari UI.
- The different types of clusters you can provision in HDI and the use case for each, especially Spark clusters and Interactive Query clusters.
- General understanding of Apache Hive and Apache Spark (there may be questions regarding Jupyter Notebooks as well)
- Integration of HDI with Azure Data Lake Storage, Blob Storage, SQL Data Warehouse and Event Hubs.
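The classic first exercise on an HDI Spark or Hive cluster is a word count. Here is a pure-Python sketch of the same map/shuffle/reduce shape that Spark's flatMap and reduceByKey (or a Hive GROUP BY) would express, with made-up input lines:

```python
from collections import Counter
from itertools import chain

# Pure-Python analogue of the distributed word count: split each line into
# words (the "flatMap" step), then count occurrences per word (the
# "reduceByKey" / GROUP BY step). On a cluster, the counting step is where
# data is shuffled so that all copies of a word land on the same worker.

def word_count(lines: list[str]) -> dict[str, int]:
    words = chain.from_iterable(line.split() for line in lines)  # flatMap
    return dict(Counter(words))                                  # reduceByKey

counts = word_count(["big data on hdi", "spark on hdi"])
```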
Preparation for the exam
Resources from Microsoft Learn
Microsoft Learn is a free learning portal with very comprehensive courses and learning paths to prepare for the DP-200 exam. The learning paths below cover most of the exam objectives. The courses also include hands-on labs with a free Azure practice sandbox, so no Azure subscription is required!
Duration of the course is 9 Hours and 39 Minutes
- Azure for the Data Engineer (1 hour 51 minutes)
- Implement a Data Warehouse with Azure SQL Data Warehouse (3 hours 21 minutes)
- Large Scale Data Processing with Azure Data Lake Storage Gen2 (1 hour 49 minutes)
- Implement a Data Streaming Solution with Azure Streaming Analytics (1 hour 14 minutes)
- Work with relational data in Azure (1 hour 34 minutes)
- Work with NoSQL data in Azure Cosmos DB (2 hours 27 minutes)
- Store data in Azure (3 hours 39 minutes)
Microsoft courses on edX.org
If you are looking for a deeper understanding of particular Azure technologies and prefer an on-demand video course, here is a list of such courses on edX.org published by Microsoft.
- Databases in Azure (Length: 4 weeks, Effort: 2 to 4 hours per week)
- Microsoft Azure Storage (Length: 6 weeks, Effort: 2 to 4 hours per week)
- Azure HDInsight (Length: 5 weeks, Effort: 3 to 5 hours per week). Note: if you have no experience with Hadoop, this course is strongly recommended to help you with the HDI objectives of the exam, though it may not be the most useful in your real-world projects.
- Processing Big Data with Azure Data Factory (Length: 4 weeks, Effort: 3 to 4 hours per week)
- Azure Stream Analytics (Length: 4 weeks, Effort: 3 to 4 hours per week). Note: I strongly recommend this course to learn how to do real-time stream processing in Azure.
- Securing Data in Azure and SQL Server (Length: 4 weeks, Effort: 2 to 3 hours per week)
- Using Apache Spark in HDInsight (Length: 6 weeks, Effort: 3 to 4 hours per week). Note: instead of this I’d recommend the self-paced learning courses from Databricks, but they are around $75 each.
Verdict and Summary
Overall, exam DP-200 covers most of the important skills a cloud data engineer (or perhaps architect) needs to be effective and successful in their role. Having said that, there are gaps in my opinion:
- Primarily, I would have preferred to see a bigger focus on Azure Databricks rather than Azure HDInsight.
- The other missing skill is event processing and serverless architecture. Usually cloud infrastructure engineers are proficient in these areas, but unfortunately, in my experience, it is not a strength for data engineers.
- It would have been great to include best practices for data architecture in the cloud as one of the objectives.
- The exam (like most other certification exams) is completely product-focused, but in my experience there is a large skills gap around how to implement a modern data warehouse in the cloud. At the end of the day, having deep technical knowledge of how products work is very valuable, but it’s the right architecture that defines a solution and determines whether it lasts long enough to return the investment.