
Azure End-To-End Data Engineering Project for Beginners (FREE Account) | API Tutorial

By AI Luke

Summary

## Key takeaways

- **Earthquake Data's Business Case**: Earthquake data is vital for government agencies, research institutions, and insurance companies to understand seismic events, mitigate risks, assess the likelihood of future events, and make data-driven decisions such as whether to insure a house. [01:57], [02:16]
- **Medallion Architecture Layers**: Data from the API lands in the bronze container as raw JSON; transformations in Databricks move it to silver after cleaning, then to gold with business-ready data such as country codes and significance classes. [03:14], [16:14]
- **RBAC Requires Two-Way Doors**: Services like Databricks and storage can't communicate automatically; you must open the door on both sides: assign the managed identity a role on the storage account's IAM and create external locations/credentials in the Databricks catalog. [27:14], [36:46]
- **Passing Parameters Orchestrates Flow**: Data Factory passes dynamic dates and paths between notebooks using dbutils.widgets.get for input and dbutils.notebook.exit to serialize output JSON, enabling bronze→silver→gold chaining. [01:28:31], [01:34:11]
- **Synapse Queries Parquet Directly**: Use OPENROWSET to query gold Parquet files in the data lake as tables, enabling aggregations like counting low/moderate/high-significance earthquakes grouped by country code. [01:40:02], [01:41:16]
- **Library Management Avoids Pip Chaos**: Install external libraries like reverse-geocoder at the cluster level via Compute → Libraries → PyPI to avoid per-notebook pip installs that break on cluster restarts or team changes. [01:00:39], [01:01:27]

Topics Covered

  • Full Video

Full Transcript

If you want to be an Azure data engineer, this is the perfect tutorial for you. In today's video we're going to create an end-to-end data engineering project in real time in Azure. You don't need any prior experience; this is all from scratch, and I'm going to teach you exactly how services talk to one another, why security is so important, and the main services asked about in every Azure data engineering job. We're going to look at Azure Databricks for distributed data processing and transformations, Azure Data Factory for orchestrating data workflows, Azure Data Lake Storage for scalable data storage with bronze, silver and gold containers leveraging a medallion architecture, and Azure Synapse Analytics for querying and analyzing the processed data. At the end I'll also show you a little on how to use Power BI so that you can really visualize and understand this data. Now let's get into the video.

So welcome to the tutorial. We're going to be working from this GitHub repo here, which is my repo. Within it we have a README, which is what you're looking at, and then a detailed step-by-step guide, which is what I'm going to be following on the other screen; it tells you everything you need to do, so you could literally just follow the guide without watching the video and get on with it. Additionally we have the notebooks, the Python notebooks we're going to use for the transformations within Azure Databricks. So without further ado, let's get into it.

We're going to create a data engineering pipeline specifically for earthquake data. If you think about it, whenever you do projects like this, especially if you're going to use them to try and get a job, you really want to be creating a project that has some sort of business use case. So how does an earthquake data pipeline have a business use case? Well, earthquake data is vital for understanding seismic events and mitigating risks. Stakeholders such as government agencies, research institutions and insurance companies rely on accurate, up-to-date information for emergency planning, risk assessment and making data-driven decisions. Imagine, for example, you're a government: you want to know the likelihood of an earthquake, how bad it's going to be, and how it has happened in the past so that you can predict the future. Or an insurance company: if you're going to insure a house, how likely is it to collapse due to an earthquake? So there is a business case here.

As for the pipeline architecture, what are we actually going to do here? Well, we're going to use an API endpoint to get the data. The data is going to be ingested into an Azure Data Lake, which is a type of storage within Azure. Within Azure you get a storage account (don't worry, we'll go into this), and within your storage account you get containers. One of the containers we're going to use is the bronze layer, because we're going to use a medallion architecture; again, we'll get to that in a bit. Basically, we get data from the API, ingest it into the bronze container, do some transformations in Databricks into the next container, transform again into the next, and gold is, boom, the perfect data for the business. This is automated using Azure Data Factory, and then we load the data into Synapse Analytics so that we're abstracting the calculations, like the SQL queries, up into a dedicated platform. From there we can create views which we can publish, so we can access the data for visualization, say in Fabric, Power BI or Tableau; I say Fabric because Fabric has Power BI integrated with it. That's the general architecture, and we're going to be creating these resources in the cloud, so we'll get to that soon. Right, prerequisites: set up an Azure account.

First things first, I'm going to work on this screen and we can follow along together. If you don't have an Azure account, you can set one up; just go to the link in the guide, or search "get started with Azure" and click "Try Azure for free". For me it will just try to log me in, but when you click on that (give it a moment to load) you'll get $200 of credits, which makes this tutorial completely free. Additionally you'll get free monthly amounts for some services, always-free services, etc., and obviously no upfront commitment, cancel whenever you want. So go and sign up for Azure; if you already have an account, consider making another one with a different email. I don't know, I didn't say that, but you could.

So anyway, now you have Azure, what do we do? Let's set up some resources. If you know the channel, I don't like clicky-clicky, but it's the best way to learn; "click-ops" we call it, where you click to make things. As a cloud consultant or cloud engineer you're taught to always use infrastructure as code, which is code that does all of this for you, but to learn it you click first. So we'll do this tutorial by clicking, and then in the next tutorial I think we'll automate it using Terraform, so we'll have the entire thing where you just click a button and, bam, the whole infrastructure is up and running. So anyway, let's create the environment.

First let's set up Databricks, because this will probably take the longest to create. Go to the top and search for Databricks, click into Azure Databricks, and then either click Create here, or if it's your first time, click that button. From here you'll have project details to fill in. Subscription refers to which Azure subscription you're using; a subscription is like a logical grouping and you'll probably only have one, most likely the free one. If you're working for a company they will often have several subscriptions, because it's a logical grouping for management, security, billing and so on, so you could have a subscription for different applications or different divisions. Under that we have resource groups, which are a way to logically group resources within Azure. For example, all of the resources we're going to create today will go in the same resource group, because that way, if you want to find one, we just go to the resource group and we can see them all; and if we want to delete the entire project at the end (you don't want to leave it running, because it will just start accumulating cost), we can delete the entire resource group in one go and it deletes every resource in the group. If you click here you can select one; we're not going to, because you won't have one, or you'll probably only have that Network Watcher resource group and nothing else. So click Create new and we'll call it "datalake". Click OK, then name the workspace; we'll also call it "datalake". This is just the name of the Databricks workspace, so it doesn't really matter what you call it.

For region we'll just leave the default, and we'll choose Premium so we can use role-based access controls. Within Azure, and we'll get to this in a bit, I'll open up Excalidraw so we can sketch it; Excalidraw is a nice, handy online whiteboard (this board is from when I was brainstorming before), check it out. Anyway, you have role-based access control, which you'll sometimes hear called RBAC. Essentially, say you have a person (that drawing looks like a mushroom, but it's a person), or say you have a service, which we'll draw as a box. When you want the person or the service to get access to something, they have to go through a door, and this door will only let in those that have a specific role, say an "allow in" role. So if these two down here want to go through this door to use this service, where this is you the user, or this is Databricks, whatever, and you want to connect to something else, then role-based access control means you give them a role, this "allow in" role for example, and then they're allowed in; whereas if you remove the role, they're not allowed in. The best way to manage this is often with groups, not individual users, because if you assign it to this person, and then that person, suddenly you have to undo this one and undo that one; whereas if you assign it to a group, call it "devs", then anyone in the devs group is automatically allowed in. Simple, right? That's RBAC in a nutshell. Let's delete all that, we don't need it. So anyway, back to this: choose Premium for role-based access control, and from here just jump straight to Review + create

and then Create, because we don't need to worry about networking, encryption, security or tags for this specific tutorial. Now it'll go to this screen where your deployment is in progress; it will just sit here and list the resources it creates underneath, because Databricks creates a bunch of resources, not just a Databricks workspace. It has to create the underlying services; if you think about it, with Databricks you write code, and that code has to run somewhere, so it runs on compute which is allocated by Databricks. We'll have a look at that in a bit. Anyway, this is the Azure screen while it's creating resources, and you can see up here we have a notification icon loading because the deployment is in progress; that will tell us when it's done. So in the meantime, let's create the storage account.

If you recall from the diagram, if we go back to the GitHub repo, we can see from the architecture that these three containers all live in a storage account, because a storage account is where you keep all of your storage-related stuff in Azure, within reason; if you're going to use a SQL-specific service, that doesn't live in a storage account, but within your storage account you can have files, queues, tables, containers, etc. So I've just clicked Create on a new storage account. Again, same subscription; for resource group we're going to use the same one. It doesn't have to be the same, but we'll make it the same (you could have a resource group just for your storage accounts, if you wanted). For the storage account name, let's try the same sort of name; I wonder if that's available. No, because a storage account needs a globally unique name, since you can reach storage accounts via URL, so we'll tweak it; there we go, that's fine. Region: use the same region, that's fine. Primary service: we'll pick Azure Data Lake Storage Gen2, because that's what these containers are, as I just mentioned, ADLS Gen2, Azure Data Lake Storage Gen2. Performance: Standard, that's fine for us. Then redundancy.

For resources in the cloud, part of the benefit is the redundancy you can leverage. Say you have a "data center" at home: you have your laptop with your stuff on it; the laptop breaks, the stuff's gone forever. You could have a backup laptop in your house, but if something happens to your house, both are gone. You could keep the backup laptop in someone else's house, but if both houses are on the same street and something happens to the street, they're both gone. Instead, you could have one on a laptop in someone else's house in another city or another country. Those are the layers of redundancy you get within Azure: locally redundant storage (LRS), zone-redundant, geo-redundant and geo-zone-redundant. For us we'll just leave it as GRS; we don't really care because it's just a tutorial, but in practice you would choose this relative to how much failover you need, how much downtime you can accept, and what you want to hedge against. So we'll go with that, that's fine, and click Next. This time we're going to go to the Advanced tab, because we want to enable the hierarchical namespace; that's what facilitates Data Lake Storage, so tick "Enable hierarchical namespace". Now we'll go straight to Review + create and click Create when it validates; you'll see it's now initializing the deployment.

We've now got a notification that the Databricks workspace has been successfully deployed, that's good, so let's go to the resource and check it out. In the Databricks resource we can see that we have this managed resource group; don't worry about the other stuff, we'll get to that when we get to the Databricks portion. This is what I was referring to: Databricks requires its own resources, and they live in a managed resource group that's managed by Databricks for us, so we don't have to worry about it. This is what it creates: a managed identity, a storage account, an access connector (which we'll be using), a network security group and a virtual network. That's just for your information; you can look into it more if you want. Oh, there we go, the storage account is now up too, so go to the resource for that one.

You'll see this is your storage account. Don't get overwhelmed; once you've seen it a handful of times it will all seem pretty obvious. You have the essential stuff up top: resource group, location, primary/secondary region, subscription ID, performance, replication and so on, a bunch of stuff we don't really need to know about for this tutorial. On the left-hand side you can see the Data storage section, and within it the types of data storage available in the storage account: containers, file shares, queues and tables. We're going to create containers; as you can tell from the logo, it's the same icon as in the diagram, because it's containers we're going to use. You'll see it already has a default logs container. We want to create three containers: bronze, silver, gold. So click + Container, name it bronze (oops), and click Create; that's your bronze container created. Do the same for silver, create, and the same for gold, create. These are the containers we're going to be putting information into and taking information out of with Databricks: bronze is the raw data straight from the API, so the JSON the API gives us goes in exactly as it is; we then clean it, tidy it up and move it into silver; and then, based on that clean data, we derive what we want for the business case and put that into gold.

Before we do that, though, another thing we have to create, which takes ages, is Synapse Analytics. So search for Azure Synapse Analytics, click it, and click Create; the reason I'm doing this now is so that when we come to use it, it will already be there ready for us. Same as before: subscription and resource group, we'll just do the same as usual, that's fine; managed resource group, let it use its own if it wants one. For the workspace name we'll just call it "datalake", for no particular reason. Then, which Data Lake do you want to attach to the Synapse workspace? We're going to attach the storage account we just created, and for the file system we'll create a new one and call it "synapsefs". We also want to assign ourselves the Storage Blob Data Contributor role; if you recall the RBAC explanation, we want that role so that Synapse can interact with the data lake using our identity, since we'll be the ones working on it. Then Review + create, and click Create. Do note the 5 USD here: that's the cost per terabyte processed by the serverless SQL pool when you run queries. If you're using this for the first time you have your $200 of credits, so don't worry, you could use $195 worth and be fine, because that would be a lot of queries. And even if you're not on credits, still don't worry, because at most it's going to cost you a couple of cents to run a day's worth of the queries we're doing, probably more like a cent. So let's just create it; I only flag it because when you see the 5 USD you might think "wait, this is going to cost me a fortune to start", but no, it's compute: the cloud model is pay-as-you-go, operational expenditure; you normally don't pay to create the thing, you pay to use the thing. So that's deploying, that's cool.

Right, let's jump into Databricks. Up top, just search for it, or it'll be in your recents; you'll see the workspace we created, that's the one we want, so click on it. From here we want to launch the workspace, then just pick your account and it will log you in, because Azure Databricks operates like a separate app, as does Data Factory (which we'll create when we get to it, no reason to create it now, it doesn't take long) and Synapse Analytics; they're essentially their own applications. When you come into Databricks, this is what you'll see first. Again, don't worry too much about what's on screen; we're only going to cover a couple of things, mainly in the sidebar. We have Workspace, which is where we're going to store the notebooks we'll use. We then have Catalog, which is where we're going to connect the Data Lake Storage, because we want Databricks to be able to talk to and communicate with the data lake so we can manipulate data, push data, pull data, etc. We then have Compute, where we're going to create the compute for the notebooks to run on, because as I said, a notebook is just code, it has to run somewhere, and it doesn't just run "on Databricks"; that managed resource group you saw for Databricks essentially runs the Databricks application itself, as far as I'm aware, so you have to actually create the compute you want to use for your notebooks.

If you're doing a massive data pipeline, you may want to leverage distributed compute. If you have, I don't even know, a billion rows, or some number with a load of zeros after it, you don't want to just process them top to bottom one by one, you'd be there all day; you want to split them up across different computers so that each computer processes a batch, a portion of them, and that speeds things up. If you think about it, with one person doing calculations on a calculator and another person also doing calculations on a calculator, it takes roughly half as long as one person doing it all, and it's a similar principle with distributed computing. That's where Spark and Databricks work so well: when you're scaling up, they can scale up with you. For what we're doing you don't really need it; it works perfectly fine in Microsoft Fabric (I've already done this tutorial in Microsoft Fabric, so you can go and try that out if you want, it's much simpler), but these are the tools you're going to be using if you want to be an Azure data engineer. So now that we're here, we want to create a compute instance, so click Create compute.

This is the screen you'll be given. From here you can actually switch the top right from UI to JSON if you want to configure it programmatically, but we'll just go with the UI because it's simpler, especially if you're not too familiar with coding. "My cluster" as the name is fine; policy Unrestricted is fine. We want it to be single node, because I don't think multi node will work with your free trial, so: Single node, access mode Single user, that's fine. For the Databricks runtime version we want a long-term supported (LTS) release, so 15.4 LTS is fine. Photon acceleration we're not going to need. Node type is essentially the spec of the computers that run your workload; for us, we'll use general purpose and the smallest one they have, because we don't really need any processing power: 14 GB memory, 4 cores, click on that. This is what I mean about scaling up: if you scroll through the list you can see GPU-accelerated compute, memory-optimized and so on, but we're just going to use the standard one. Then we have "Terminate after X minutes of inactivity". Because you're spinning up clusters of compute resources, allocating them and doing calculations on them, this costs money. Say you only run your pipeline Monday to Friday: you don't want this cluster sitting there all weekend hogging some super expensive compute, especially if you're using one of those optimized GPU options, because that would cost a lot of money. So instead you terminate after so many minutes of inactivity. For this tutorial, if you go and get a cup of coffee or pop to the toilet, you're not going to be longer than 30 minutes for sure, maybe longer than 15 but not 30, so we'll set 30, which means it will terminate the compute instances after that. The reason this is worth noting is that the cluster can take around 5 to 15 minutes to start up, because Databricks has to go get the resources, allocate them, sort them out and get its ducks in a row; so you don't want it terminating while you're away getting a coffee and then have to wait another 15 minutes before you can run your notebooks. So anyway, let's click Create compute to create the cluster.

While the cluster starts spinning up, let's go back to Excalidraw. You have your storage account with your containers, and then you have your Databricks instance. Now, although they both live in Azure, they can't automatically communicate with one another, and if they could, think of the issues you would have: a bunch of storage accounts and a bunch of Databricks notebooks all able to talk to each other in some dodgy topology (I'm not even going to try to draw it, but you get the gist). As you can imagine that causes a load of issues, because say this one was dev, this was production, this was QA, this was actually a completely different application, and this was user data or something; it all gets a bit messy if you let everything communicate automatically. In my previous tutorial someone said Azure seems like a bit of a mess, and the reason is exactly this: you don't want services to automatically communicate with one another, because you're not using one entire Azure subscription for one application, and even if you were, not all the components of an application need connections to one another. So you want the principle of least privilege, which is a standard security principle: you should only have as much privilege as is absolutely necessary to do your job, and no more. If you need to read and write, you don't need to manage who else can read and write; should you really get admin privileges, or should you have a specific role that only allows you to read and write? So anyway, you have this storage account and you have your Databricks, and they must talk to one another.

One thing to note in Azure is that when services have to talk to one another, there's a door on either side. Not literally, but you can think of it this way: if on the Databricks side you connect to the storage account and say "cool, storage is connected here", that doesn't mean it's going to work, because you also have to connect it from the storage account side. On the Databricks side you're saying "my door's open for the storage account", but then the storage account's door has to be open to let Databricks get the information. So how does this work in practice? We have our cluster, that's cool; now we want to connect to the storage account. What we do is go to Catalog, and from there to External data. You can see I've got a bunch here; you won't have any, but I've done this before and it saves the configuration. First we go to Credentials and create a credential. The credential type is Azure managed identity, because both services are in Azure, so you just let Azure manage their identities, basically their ID cards. For the credential name we'll call it "datalake"; actually it won't let me because I've already got one called that, so "datalake2" it is. Then the access connector ID: this is what I was referring to earlier. If we go back to the portal, to the Databricks resource, you can see we have that managed resource group; go into it and you'll see the Unity Catalog access connector. You can also find it by searching "Access Connector for Azure Databricks" at the top, the exact same thing comes up, but we'll go via the resource group so I know which one I'm using. Click on it and you'll see it has a resource ID; that is the connector ID, so hover over it, click the little copy icon, go back to Databricks, paste it in, and that's your access connector ID. Click Create and this will create a new credential that can be used by that connector. So where do we want this credential to be used?

We want it for the external locations: go back to External data, then External locations, specifically these bronze, silver and gold locations. I'll go in and edit mine, but first I'll show you how to create one, and then you can create one each for bronze, silver and gold. Click Create external location. The external location name is whatever makes sense, so for us it would just be "bronze". The URL has to be in the format abfss://<container>@<storage account>.dfs.core.windows.net, so for the bronze container on our storage account it's something like abfss://bronze@datalake123.dfs.core.windows.net, and that's the endpoint we're going to use to get to the container. After you've written that in, you pick the storage credential, and this is where we use the credential we just made, which for me is datalake2. Then you click Create.

It will create it for you, and now you do the same for bronze, silver and gold. You may get a complaint initially; I'll go in and edit mine now to show you, because it might come up for me. If you ever have to edit one, click on it, then click the three dots and Edit. From here I'm just going to change the URL to point at my storage account, copy that in, change the storage credential to my new one and hit Update, and we'll see if the complaint comes up. Yep, I thought something like this might happen: it's just double-checking the path, bronze at... let me check, ah, it should be datalake123, the storage account we just made. Update again, and it still comes back with an error, but this is just because your container is empty; as it says, this may be due to the location being empty, so just force the update and that's that. Then do the same for silver and gold: the same storage account, the same credential, update, force it if needed, and finally the same again for gold. There we go, all three sorted.

Now that we have these sorted out, it's time to create the notebooks and then run them. Actually, I'll show you that it won't work yet, because we've only opened one door, and then I'll show you how to open the second door. In reference to the diagram: on the Databricks side you've said "yep, connected, all good", however in front of this door (I should probably draw this in a different colour, let's do red) you have IAM, the Azure RBAC thing we were talking about, the roles. So we're going to have to go to the storage account's IAM and give this Databricks access connector a role. So go to storage accounts, open our storage account, and from here go to Access control (IAM). You can see options here: check access and so on, and Role assignments, which is the thing we care about. Click on Role assignments and you'll see some roles already assigned; you may not have many, maybe only you as Owner plus a few. We want to add a new role assignment, so click Add at the top and then Add role assignment. From here we can select one of the roles, and what we want is Storage Blob Data Contributor, because ultimately the storage we're leveraging is blob storage (binary large objects), and Storage Blob Data Contributor allows read, write and delete access, which is what we want Databricks to be able to do: to write, to read and possibly delete.

So click on that, then click Next. Now we have to assign it to someone: we have user, group, service principal or managed identity. If you recall from the explanations, we're using managed identities for these services, so click Managed identity, then Select members. It's from our subscription, and the managed identity is the Access Connector for Azure Databricks. Here I have two (you'll only have one), which isn't very handy because it gives me two almost identical entries. How do I know which is which? The difference is the resource ID, so if I open this up, just so you know if you ever hit this issue, go to the Databricks resource, then the managed resource group, just to make sure we pick the right connector, because they have the same name. Click on the connector and you'll see the resource ID; select the matching one. Then Review + assign, Assign, and that's it, the role is assigned. You'll see it comes up twice in the list; that's fine, just refresh, it's a quirk of Azure, and it comes back down to one. So all good, we now have the IAM assignment on this side too, which means the two-way communication channel should be approved from both sides. Now let's finally get into the meaty bit, or should I, as a vegan, say the main portion:

bronze, silver, gold. Go to Workspace and let's set up the notebooks. From the workspace you can organize things however you want; for me, we'll just go into Workspace, Users, then me. You could create a folder in here if you want, say "earthquake data", but we're just going to go straight in, no folders, straight to the notebooks. So click Notebook and it'll take you to an IPython notebook, which you'll probably recognise from Jupyter notebooks and the like. An IPython notebook is an interactive Python notebook: because it's interactive, it's not just code you run with a python3 command or something, you can run it directly in the notebook. From here I'm going to open the code up on another page and we'll copy lines in one at a time. Now, I would strongly suggest you don't just go to the GitHub repo, download the file and upload it, because you won't learn anything. Ideally you should read what I write on screen, listen to what I'm saying, and then type the code out yourself to get the muscle memory in your hands; additionally, try to remember entire lines or entire chunks of code and then write them out in one go, because memory comes from recall, not from doing per se. You could sit and transcribe audio all day and never remember a word of it; whereas if you listen, stop, try to remember it and then write it, you'll remember it much better. That's actually a study tip: at school or at uni, if I saw a slide on screen I'd try to remember the entire slide and write it all out, or as much as I humanly could, before looking up again, so that I was doing active recall as I was essentially reading and writing it. So anyway, back to the bronze notebook; let's pull it in one part at a time.

Actually, first let me give you a quick rundown of a Databricks Python notebook. This here is your notebook. Up top we can rename it, so let's rename this one to "Bronze Notebook"; just click on the name. As you can see it's a Python notebook; you can have Python, SQL, Scala or R, but we're just going to do Python, which is why it's an IPython notebook (you'll see that in file names as .ipynb, an interactive Python notebook). From the File menu you can create a new notebook, import, export it as an IPython notebook (that's how you get it into GitHub, for example), edit things and so on. Under Run, and this is handy to know, you can run all the cells (I'll get to cells in a moment), run selected cells, run the cells above or below where you are, etc., and you can clear the outputs as well so it's in a nice format if you're going to save it, because when you run something the output appears directly underneath it. Within the notebook we have cells; a cell can be code or text, whatever you want it to be, and normally you do one thing in it, one step that moves you one state forward, one state in one direction. By using separate code cells like this, you can divide things up so that each cell runs logically distinct from the others, meaning you can track errors, change things much faster, try things on the fly, debug, etc., and if you want to run everything, you just click Run all. Up here you also have Connect, which is where you attach the cluster you want this to run on.

you have connect this is where you connect the cluster where you want this to run so first things first our first chunk that we're going to add in is this

here which we're going to mount the Azure data storage now we're going to do that via this same path so that AB FSS

path you see I've just separate that we've got bronze silver gold and then we just inject into here and for tier and te and then that creates like an array

of that in here so this is a Warner for that that and that now that's just so that we don't have repeatability or we don't have repeated code because standard software principle dry don't

repeat yourself if you are going to write the same thing or same portion of a thing again and you would want them if you were to change one it would change

the other then you should always varize it hence this here is that if I'm going to change this inste I hav to change that three times for bronze silver and

gold just going to change it once for here so if we run this block notifications it's h getting those paths

and then it's putting those paths in as variables here and then showing them out here but we don't have anything in them so nothing to
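For reference, here's a minimal sketch of what that cell might look like, assuming a storage account called datalake123 and variable names like bronze_adls (your names, and the repo's exact code, may differ):

```python
# Build the abfss paths once for all three medallion tiers (DRY).
# "datalake123" is an assumed storage account name; replace with your own.
storage_account = "datalake123"
tiers = ["bronze", "silver", "gold"]

adls_paths = {
    tier: f"abfss://{tier}@{storage_account}.dfs.core.windows.net/"
    for tier in tiers
}

# Convenience variables used by the rest of the notebook
bronze_adls = adls_paths["bronze"]
silver_adls = adls_paths["silver"]
gold_adls = adls_paths["gold"]

print(bronze_adls, silver_adls, gold_adls)
```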

The next part we're going to add is a couple of imports: requests, json and datetime. We'll just run that, and I'll show you why in a second. Then we have this section on understanding the API: the API returns detailed seismic event data for a specified start and end date. If we copy and paste the API URL into the browser, you'll see it's the documentation for the USGS Earthquake Catalog, and they have this query endpoint, an API for submitting data requests. You give it a URL in that format, the earthquake.usgs.gov query endpoint with format=geojson and a start and end time, and it gives you back all the information about earthquakes, or seismic events, that happened during that period. That's the kind of query we're going to use, so let's get back to Databricks now that that makes sense. The next piece of code is the start and end date, because as you recall the API requires both. For that we set the start date as today minus one day (that's why we pass a timedelta of one day) and the end date as today, so from yesterday to today, get all those events. Let's run what we have so far: run that, run that, and you see there's no output, but the beauty of using a notebook like this is that you can just write something like start_date, end_date in a cell, hit play and it will show you the values; you can see how handy that is for day-to-day work. So those are the start and end dates, that's cool.
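A rough sketch of those two cells, assuming the variables are simply named start_date and end_date:

```python
# Imports used by the bronze notebook, then a one-day window: yesterday to today.
import requests
import json
from datetime import date, timedelta

start_date = date.today() - timedelta(days=1)  # yesterday
end_date = date.today()                        # today

start_date, end_date  # shown as the cell output in the notebook
```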

The next chunk we'll add is the one that gets the data. It looks a bit confusing, but don't worry, that's mainly because it has a try/except; it's handling errors, essentially. We first construct the API URL, so let's actually put that in its own little code block: we have url equals an f-string, and we inject variables into the string. The f denotes that we're going to inject variables, and we do it with curly braces with the variable name inside. If we click play on that, it assigns and initializes the variable, and if we just put url in a cell and play it, we can see it's created the URL with the start and end dates, which is perfect. Next we want to use that URL and try to get data back: we hit the endpoint and see whether it gives us data. The try/except basically says, if it doesn't work, let me know, don't just kill the program; tell me it failed and try to give me the error message. The main thing we're concerned about is the data variable, which holds the JSON response we get from the HTTP GET request to that URL. If there's no data, it says so; otherwise we build a file path for where we want to store the data. This is our bronze layer, so we want the bronze ADLS endpoint (scroll up, the one with bronze in it), and in that bronze container we create a new file named with the start date, because if this runs every day we want raw data for every day, and you may want to keep the historical data. Then we save the data: json_data equals json.dumps of the data, and dbutils.fs.put (dbutils is the Databricks utilities, fs is the file system module) creates a file at that path with the JSON data, and if it already exists it just overwrites it, because we're going to make a new one each day; so if we fetch again the same day with more up-to-date data, we just overwrite it.
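As a hedged sketch of that fetch-and-save cell (assuming the bronze_adls, start_date and end_date variables from earlier, and that the notebook keeps the GeoJSON "features" array; the repo's exact code may differ):

```python
url = (
    "https://earthquake.usgs.gov/fdsnws/event/1/query?"
    f"format=geojson&starttime={start_date}&endtime={end_date}"
)

try:
    response = requests.get(url)
    response.raise_for_status()                 # surface HTTP errors
    data = response.json().get("features", [])  # list of earthquake events

    if not data:
        print("No earthquake data returned for this date range.")
    else:
        # One raw JSON file per day in the bronze container.
        # dbutils is only available inside Databricks notebooks.
        file_path = f"{bronze_adls}{start_date}_earthquake_data.json"
        dbutils.fs.put(file_path, json.dumps(data, indent=4), overwrite=True)
        print(f"Wrote {len(data)} events to {file_path}")
except requests.exceptions.RequestException as e:
    print(f"Error fetching data from the API: {e}")
```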

So if we go to the portal, open up our storage account, then Data storage on the left, Containers, and open the bronze container: there's nothing here yet. Go back to Databricks, hit play on that cell, and that's it, done; it wrote a few hundred thousand bytes and saved successfully. If we go back to the container and refresh, there you go, the JSON file now exists, and if we click on it we could download and open it. Or, within Databricks, we can just add a cell to look at the data that came back: if we put data in a cell and click play, you can see this is everything that came back, all the data. We just want, say, the first item so we can validate what one record looks like, so we do data[0], indexing the array to get the first element, and there it is. As you can see this is all the data coming back, which is perfect, exactly what we're after. Now that we've done that, the bronze notebook is sorted.

back which is perfect that's what we're after so now that we've done that that's the bronze notebook sorted we will come

back and change this so that every day this can be dynamically changed from the data Factory now it doesn't necessarily

need to be done but it's just to show you how you could inject variables into here and get them back out but this is looks good as is we'll come back and modify it when we move it a data Factory

so bronze perfect let's now go back to what space click create notebook and then we'll do the silver notebook we could also for reference file

then new notebook so 6 and half dozen anyway silver notebook so the first thing we do for the silver

notebook is we're going to assign from date time we're going to import that and then we're going to assign the start date the bronze and

silver DLS so as you can see it's like this now I'm copying code if this changes that one changes I shouldn't be doing this no I shouldn't and that's why

in data Factory we're going to leverage passing variables from one notebook into the next into the next so that we don't have this repeatability so that we can just write

it once and reference it thereafter but for the tutorial right now so to get each of these notebooks to run independently of one another we don't

reference the values from the other one so we have say we have notebook one two

and three when you then put these into Data Factory you can then pass the variables from there to here and from here to here and from the first one all

the way to here however when you're just working straight from data H bricks it's much easier for us to just look at this as an isolated block and run it on its

own so that's what we're doing here is running that one as a block running this one as block Etc so we won't be uh doing this once the environment's up and

running this is just so that we can follow on the code and actually see what it brings back so we assign those variables that's cool and then we do some more Imports

Then we do some more imports, and this time we're getting into PySpark, so let's import the SQL functions from pyspark.sql. You'll see "automatic compute resource connection successful": when you open a new notebook you have to connect it to some compute, and since we're only using the one cluster for every notebook, it automatically detects that we only have one and assigns it, so that's fine. Now that we have Spark up, this is where we start using things like DataFrames. A DataFrame is essentially just a frame that holds your data; it stores a schema plus the data, essentially, and I'll show you in a moment what that looks like. So df equals spark.read with the multiline option set to true, reading a JSON file, and this is the file name and location. If we play that cell... oh, I haven't played the first one. Oh well, good example: we can just click Run all and it runs from the top down, because it has to set the variables first before it gets to here. When it's done you can see it's created this DataFrame. The way to actually look at it is to write df in a cell and click play: this is the DataFrame, it has a struct of coordinates and a bunch of other stuff. To see an example of the actual contents we can do df.head() to get the first row; click play and that will give us back the actual values.

actual data and see the data frame so that's perfect so the first thing I want to do here is reshape the data because if you

just look at this it's a bit of a mess not only because there are loads but like I don't even really know what it means so like coordinates array double

type string I mean I get it it's a array of type double but what are coordinates so if I

mean from like if I was a business perspective would be just looking at coordinates so a way that we can reshape the data is here we are going to that

geometry geometry column geometry coordinates which is this here then we're getting item zero because it's an array the index and starts at zero not

one so this is the zero item this is the the one item and two Oops There You Go hack it let's get rid

of that um and then Alias and we're going to call it longitude latitude and elevation much easier to look at and then we have properties. tile instead of

that we don't need properties. Title

it's just title properties. place just

Place sig. Sig mag mag type time updated Etc so now if we run that see that it's reassigning that same data frame as that

same data frame but selecting only these columns and renaming them so now if we do DF again and run that you can see that it matches what we've recreated so

we've reshaped the data from this let's just drag it down so you can see reshaped the data from this to this much more manageable and it's got only what

we need and nothing more perfect so we can now do that DF doad again and we can see that the data is

now looking like this there you go have an ID a longitude latitude elevation the title Place description significance magnitude mag timing updated so looking

much better but still don't really like that time thing so we'll get to that in a minute but it looks much better much more readable than if we drag this one down this here so now you can see the

power of um of the engineering is this is a nightmare to work with this 100 times better so next thing we want to do is we
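Roughly, the reshape looks like this; the column names follow what's described above, and the repo's notebook may differ slightly:

```python
from pyspark.sql.functions import col

# Flatten the nested GeoJSON structure into plain, readable columns
df = df.select(
    col("id"),
    col("geometry.coordinates").getItem(0).alias("longitude"),
    col("geometry.coordinates").getItem(1).alias("latitude"),
    col("geometry.coordinates").getItem(2).alias("elevation"),
    col("properties.title").alias("title"),
    col("properties.place").alias("place_description"),
    col("properties.sig").alias("sig"),
    col("properties.mag").alias("mag"),
    col("properties.magType").alias("mag_type"),
    col("properties.time").alias("time"),
    col("properties.updated").alias("updated"),
)
```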

The next thing we want to do is check for null values, so now we're validating the data. It's the same kind of idea: df equals df.withColumn. We're not doing a select this time; a select is basically "if you have a table with 100 columns and you only want columns A, B, C, D, the new DataFrame only has those columns". Here we're just saying: take these columns, and when the value is null, change it to zero, otherwise leave it as it is. This runs through the DataFrame and affects every single row. Click play on that and it makes sure there are no nulls. Hmm, what have I done here... accept and run... I don't see any difference, I must have pasted something wrong, but that's fine: no nulls now, which is what we want.
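A minimal sketch of that validation step, assuming it's the coordinate columns being defaulted to zero:

```python
from pyspark.sql.functions import col, when

# Replace missing numeric values with 0 so downstream steps don't choke
df = (
    df.withColumn("longitude", when(col("longitude").isNull(), 0).otherwise(col("longitude")))
      .withColumn("latitude",  when(col("latitude").isNull(), 0).otherwise(col("latitude")))
      .withColumn("elevation", when(col("elevation").isNull(), 0).otherwise(col("elevation")))
)
```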

Now we're getting to the good bit: let's fix that time column. So df equals df.withColumn on "time": take that time column, divide it by 1000 and cast it to a timestamp; that converts it from a raw epoch value in milliseconds into an actual time. Before we run it, let's add a cell and do df.head(): there's that ugly time value. Now run the conversion, add another cell, do df.head() again, click play, and we'll see we now have a datetime object with the actual year and so on, 100 times better to work with. Perfect.
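The conversion itself is a one-liner per column; a sketch, assuming the updated field is handled the same way as time:

```python
from pyspark.sql.functions import col

# time/updated arrive as epoch milliseconds: divide by 1000, cast to timestamp
df = (
    df.withColumn("time", (col("time") / 1000).cast("timestamp"))
      .withColumn("updated", (col("updated") / 1000).cast("timestamp"))
)

df.head()  # time and updated now display as real datetimes
```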

Now that we have this updated DataFrame, we want to store it as a table. First let's define the output path: we want it to go into the silver container, into a folder called earthquake_events_silver. Click play; that's just defining the output path, it's not actually moving anything yet. Then on the next line we append the DataFrame to that location using the output path, in Parquet format, which is a columnar format: it stores the data in columns rather than row by row, which makes for efficient lookups, and it stores the schema alongside the data, which helps if the schema changes over time. Click play and this will write the DataFrame, because it's df.write to this path, in append mode. As you can see, the path doesn't have a date in it: the raw file has a date because the raw file is new every day, but the Parquet output we want to keep continually updating, because we want to be able to look at the data across time, say if we're visualizing it. So here we just append, adding the new rows into the Parquet data, and that's that, that's the silver notebook.

Now if we go back over to the storage account, go to Containers, then silver, you'll see we now have this folder, and if we open it up, this is our Parquet data; there's the Parquet file plus the handful of extra files Parquet likes to have in order to work. From here we want to create our final notebook, the gold notebook, so create it and rename it "Gold Notebook". The first thing I want to do, again, is add these path variables in, and as we've discussed, we'll remove them when we get to Data Factory so that we can pass the variables from one notebook to the next. Click play on that to do the import, then run these, and again it just automatically connects to the only cluster we have.

Next up, run this cell, and you'll see we have a library we're not used to having: reverse_geocoder. If we click play here you get an error, because it can't find reverse_geocoder; it's not a library that automatically exists within Python, it's an external library. To access this external library we could, because it's Python, just pip install reverse-geocoder in the notebook. What would be the problem with that? Well, it means every time the notebook runs it has to install it. And do you really want to manage libraries like that? Say someone comes in, changes the code, and the install line is gone; what happens now? Do we even know whether it's installed on the cluster? We'd have to check. It just becomes a bit of a nightmare. So the way we manage this is to go to Compute, click on our cluster, and at the top you'll see the notebooks this cluster is attached to, and also Libraries. This is how we manage libraries without dealing with pip installs inside the notebook directly. Click Install new, and for the library source we'll use PyPI, the Python Package Index; the package is reverse_geocoder, then click Install. This will install the reverse-geocoder library onto our cluster; it may take a couple of minutes, five minutes max, but then it will be available on the compute.

So if we go back to the workspace, back to the gold notebook, we'll just start filling out the code and we can run it once the library is installed. The next thing we do is read in the Parquet file, because that's the data, the DataFrame essentially, that we're going to interact with, manipulate, and turn into the final gold DataFrame. So we read in the silver data using this variable, which is the path to the Parquet output. Then I add the next code block, which applies a limit: we don't want the entire DataFrame, only the first 100 rows. The reason is that when I tried to run this with no limit it was taking ages, possibly because we're using a cheap, low-spec compute, possibly due to a bottleneck in this external library, I'm not quite sure; all I know is that in Fabric it ran faster, so I think it may be something to do with how Databricks breaks up the DataFrame and runs it. Anyway, we're limiting it to 100 so we can do this much faster. That's also a useful trick if you're just trying to prove something: instead of running the entire data set, you can prove your changes on a little bit of data so it's much faster.

So what are we going to do? We're going to create a user-defined function, and I'll explain what that means in a moment; first, what the function is. The function is get_country_code: we pass in a latitude and a longitude, and the docstring says it returns the country code for the given lat and long parameters, with an example. What it does is set up the coordinates and then get the result by running the search function from rg, which, if you recall, is because we import reverse_geocoder as rg. So it's reverse_geocoder's search, we pass in the coordinates, and it gives us back the country code. We have that function defined, but it's a user-defined function, and when you have a user-defined function you have to register it with Spark to say "I'm going to use this on a DataFrame". So we have get_country_code_udf, a UDF wrapping the get_country_code function, with a string return type. There's an example we can try in a minute: get_country_code with some random coordinates. Let's see if the library has installed yet... there we go, yep, installed, and we didn't even have to restart the cluster, perfect. Let's just click Run all and it'll run right through to the bottom. There you go: you pass those coordinates into get_country_code and it passes back "FR", for France.

But we don't want to just call it manually like that; we want to use it within the DataFrame. So we have a new DataFrame, df_with_location, built with withColumn: a new column, country_code, whose value is get_country_code_udf, the user-defined function, passing in the latitude and longitude columns; whatever the function returns is what gets assigned in the column. Let's just run it first: click play, add a new cell underneath, and if we do df.printSchema() (there's another one for you to know) you can see the schema, these are the columns essentially. If we do df_with_location.printSchema() and play it, we'll see there's a new column, country_code, from that withColumn. So if we do df.head() there won't be a country code, but if we do df_with_location.head() it will now have the country code assigned at the end; there we go, "us".

and see based on the country the number of earthquakes for example so now that we have that we want
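Roughly, that cell and the schema checks look like this (the latitude and longitude column names are assumptions based on the silver schema):

```python
from pyspark.sql.functions import col

# Apply the UDF row by row to derive a country_code column from the coordinates
df_with_location = df.withColumn(
    "country_code",
    get_country_code_udf(col("latitude"), col("longitude")),
)

df.printSchema()                # original columns only
df_with_location.printSchema()  # same columns plus country_code at the end
```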

Now we want to classify the significance, so again we're going to add a column, this time called sig_class: is the earthquake low, moderate or high significance? When the significance value is less than 100 it's low, when it's between 100 and 500 it's moderate, and otherwise, more than 500, it's high. One thing to know about how this runs: looking at it you might ask, where's the for-each loop? Shouldn't you do this for every item in the data frame? That's not how it works; you just create the new column and Spark automatically applies the expression to every single row. Click play, and then we can check the schema and see that sig_class now exists. If we run head again on the data frame with location and sig_class, it will also have the low, moderate or high value. So now we can see which countries have low, moderate or high significance earthquakes, and we can run queries like "how many highs does this country have", "how many lows are within this country", or "which country has the most highs or lows". It's much better for aggregation and grouping now.
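A minimal sketch of that classification, assuming the silver data carries the significance value in a column called sig (check the repo for the exact column name and thresholds):

```python
from pyspark.sql.functions import when, col

# Bucket the significance value into low / moderate / high bands
df_with_location_sig_class = df_with_location.withColumn(
    "sig_class",
    when(col("sig") < 100, "Low")
    .when((col("sig") >= 100) & (col("sig") < 500), "Moderate")
    .otherwise("High"),
)

df_with_location_sig_class.printSchema()
```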

Now that we have this data, which is much more useful for the user, let's save it. We have the gold output path: the gold ADLS path with earthquake_events_gold as the folder we're going to use, named exactly the same way as the silver one. Then we add another block where we write it out in append mode. Just so you can see, if we go back to the storage account and open the gold container, nothing is in there yet; when we run this (I cancelled it by accident the first time), you can follow the details of what's happening, but it's so fast there's little point, although the run information is there if you want to look around. Now if we go back to the gold container and refresh... there it is, perfect. As you can see, the size is a lot smaller, and that's because we're only taking the first 100 rows, as you recall. If you were doing this for real you'd run it on all of the data, because otherwise you've truncated the data and any counts you do will be completely inaccurate; here it's just for demonstration purposes, because we want it to run fast.
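The write-out itself is a one-liner along these lines, assuming gold_output_path was defined earlier in the notebook and points at the earthquake_events_gold folder in the gold container:

```python
# Append the business-ready DataFrame to the gold container as Parquet
df_with_location_sig_class.write.mode("append").parquet(gold_output_path)
```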

With that done, let's go back to the portal and this time open up Azure Data Factory. Again, create from here, or use Create if it's your first time. Same subscription, same resource group, and for the name we'll use the same naming we've been using today; the rest of the defaults are fine, so Review + create, then Create. As I've spoken about, within Data Factory we're going to have notebook one, notebook two and notebook three; we're going to pass parameters between them and automate them to run using Data Factory. You can make it run based on a bunch of triggers, but the one we'll use is a schedule: every day at midnight, rerun this pipeline. Once the Data Factory is deployed, go to the resource and Launch Studio. As I said before, it's kind of like its own application that opens in a new window (you can see it's adf.azure.com), and here's the Data Factory. There's a lot to look at here; if you do a couple of tutorials like this you'll learn a fair amount, but if there are specific things you want to learn and you don't know where to start, go to the Learning Center, which gives you quick starts, tutorials, videos, templates and so on. I always find Microsoft's learning material well documented, which is part of why I was drawn to Azure when starting out on my data engineering journey.

Down the left you have Home, Author, Monitor, Manage and the Learning Center. The Learning Center I've just shown you; we're not really going to do anything in Manage or Home. Author and Monitor are the ones we care about: Author is where we create the pipeline (under Pipelines there's nothing there yet), and Monitor is where you'd see your pipelines running in action. If this had run daily for the past month, you could go in and see when it succeeded, when it didn't, what the parameters were, and so on. So, back in Author, click the three dots next to Pipelines and choose New pipeline. This is your pipeline: the middle is a canvas where you drag and drop things, and below it are the settings, variables and so on; right now they're relative to the entire pipeline, but if you select a specific activity they become relative to that activity, and on the side there are some properties too. On the left of the canvas are the activities you can add, things like a Machine Learning "update resource" activity (we're not going to use that, but it's pretty cool what you can do), Databricks notebooks, which we will use, Synapse, and Copy data; in the last tutorial we used Copy data to get data from a SQL Server into the data lake. Today it's nice and simple: we're just going to use the Notebook activity. Drag and drop it onto the canvas and you can see its settings at the bottom. For the name, let's call it Bronze Notebook. That's everything from the General tab; now let's go to the Azure Databricks tab, where we need to create a new Databricks linked service.

If we click the linked service dropdown there's nothing there, so click Create new. The reason we're doing this is the same idea as before: Data Factory wants to talk to Databricks, not to the storage account. The storage account talks to Databricks and Databricks talks to the storage account, so Data Factory only needs to communicate with Databricks, and Databricks handles the rest. Again, it's the principle of least privilege: if everything had open access we'd have connections we don't actually need. So once more it's a two-door setup, a door on each side: on the Databricks side we'll go and create an access token in a moment, and on the Data Factory side we create this linked service to link over.

Within the new linked service: the default name (AzureDatabricks1) is fine; Connect via integration runtime can stay on AutoResolve, since everything is within Azure; for the account selection method choose From Azure subscription, pick our subscription and our workspace; and we'll use an existing interactive cluster, which for me is my own cluster and for you will be yours. For the authentication type we're going to use an access token; here you can either paste in an access token or use an Azure Key Vault. For simplicity's sake we'll just use an access token, but if you were to use Key Vault you would go and create a Key Vault within Azure.

Within that Key Vault you would then store a secret. I'll show you roughly what that looks like, just to get you up to speed, because Key Vaults are used a lot for secrets in the real world: whenever you have a username, a password, or keys you're going to use multiple times (which can itself be bad practice), you want a Key Vault. So click Create key vault. You don't have to do this, per se; it's just for me to show you, and if you want to skip it and just paste in the token, jump to the timestamp on screen. We'll use the same resource group, give the Key Vault the same name as everything else today, leave the rest as defaults, then Review + create and Create. What it creates is essentially key-value storage within Azure, which means you can store your passwords there instead of in plain text. The different ways you can store secrets are basically: use a managed service such as HashiCorp Vault, Key Vault here, or Secrets Manager on AWS; store them in plain text in your GitHub repo, which is a sure-fire way to get hacked and have someone use your resources to mine Bitcoin or whatever, costing you a bunch of money, so don't ever do that; or export them as environment variables on the machine itself, but the problem there is that if your compute changes every time, the variable disappears with it. So the best way, if you're going to reuse a secret, is to store it in Azure. If we go to the Key Vault resource, on the left under Objects you can store keys, secrets and certificates. For keys you'd generate or import, say, an RSA key, an access key with a public and private part. Secrets are where you would store an access token: choose manual upload, give it a name (it doesn't really matter what), paste in your secret value and click Create, and then you can keep reusing that secret value. As far as I can see, though, that's not the best fit here, because your Databricks access token should really be scoped to a single piece of access, so let's just go and get the access token.

In Databricks, go to the top right, click your profile icon, go to Settings, then down to Developer. From Developer go to Access tokens, click Manage, and generate a new token. What's this token for? It's for this project, so give it the same name as everything else we've used today and click Generate. Copy the token; you won't be able to see it ever again, so once you close that dialog it's gone forever. Then paste it into the linked service. If you were doing the Key Vault approach, this is the value you'd paste in there instead, but we're not going to do that. Now we can choose which existing cluster we want the linked service to use (for me, just my cluster) and click Create, and we have a Databricks linked service linking Data Factory to Databricks. Before we go to Settings, we can test the connection to make sure it's linked: click Test connection, connection successful, perfect.

Just for reference, if you were to use the Key Vault option, one thing you would have to do is assign an IAM role. Back on the diagram: a Key Vault is another service within Azure, and you want Data Factory to talk to it. The way it works is that the door on the Data Factory side is always open; it's happy to reach out to Key Vault. But the Key Vault side isn't, so to let Data Factory actually fetch the secret out of Key Vault, it's exactly the same as with the storage account: you'd go to the Key Vault's IAM (Access control), assign a role, and once you've assigned it you get the big green tick. The assignment would be for the Data Factory's managed identity, and the role would be Key Vault Secrets User (to read the secrets) or Key Vault Secrets Officer (to read and manage them). Anyway, back to the pipeline: let's discard those Key Vault changes and go to Settings, where we want to set the notebook path.

Click Browse, and this is what I was telling you about before with the workspace structure: ours are under Users (not Shared, I get that wrong every time), then my email, then the bronze notebook. Click OK and it's now linked to the bronze notebook. Then let's copy and paste that activity twice (Cmd+C, Cmd+V), rename one copy to Silver Notebook and the other to Gold Notebook. The Databricks linked service on each is fine as-is, but we need to change the Settings: browse again and point these at the silver and gold notebooks under Users. You can also type the path in directly, but I've had issues with that before, I think because of the space in the path and how it gets treated, so just to make sure it works I go through Browse. So we now have these three notebook activities; but if we tried to run this, in what order would they run?

I could just leave them sitting side by side, but how would it know the order? That's what the little connectors on each activity are for: there's one for "on skip", one for "on success", one for "on failure" and one for "on completion", and they let you orchestrate the flow of the data pipeline. For our nice simple pipeline it's always "on success": just click the success connector and drag it to the next activity. Now we can Validate to make sure it all looks good, and then Debug. Debugging runs the pipeline but doesn't log it as a run in Monitor; it just runs it for you to debug. Be aware, though, that because the notebooks write to the data lake, a debug run still modifies the data in the lake. It really is running; it just doesn't store the run. This is what it will look like if you run it properly, too: on the Add trigger option you can Trigger now or add a new trigger. We'll add the daily trigger in a moment, but you can trigger it now and you'll see it run here.

The output is successful, so let me talk you through it a little. Get rid of the properties pane and collapse things so we can see better: we have the bronze notebook, and after it succeeds we go into the silver notebook and then the gold. If you hover over a run, a few little icons pop up. The first is the input: for now the only input is the notebook path, but once we change things to pass parameters from bronze and silver, those will show up in the input too. After the output appears, a glasses icon pops up (these glasses are free to watch, essentially) which gives you a Run page URL that sends you to Databricks, where under Job runs you can actually watch the run live; right now you can see the last one has a status of Running. That's really useful for debugging: if a run fails, you can get there via that link, or just go to Job runs and select one, and follow it through. You can also see the job ID, the task run ID, who ran it, when it started and ended, how long it took and whether it succeeded, plus the permissions, the compute and which notebook it was, and you can export it all, so it's very handy to come and have a look in here. As you can see, the run has succeeded, which means our static pipeline (well, it's kind of dynamic, but not fully) works. The third icon that appears is the output: it gives you the Run page URL again plus a bunch of other information which we don't actually need here, but it's good to know about, especially because if there's an error the output will tell you what it is, which saves you having to go digging in Databricks. So that's all working.

Now let's change this so that we're passing variables from one notebook into another. The first thing to do is head over to Databricks (we'll close one tab so we only have one open), go to the workspace, and start with the bronze notebook. Scroll down to the cell marked "remove this before running the Data Factory pipeline", the one with the hard-coded values, and delete it, then confirm. You'll now see a red (or orange) squiggly line under the variable name; it's hard to spot because of the underscore, but it's there, because the notebook is saying "what is this? There's no variable declared." The way we fix that is by reading the base parameters: we use the Databricks utilities widgets, dbutils.widgets.text for start_date and end_date to declare the widgets, and then .get to read each value and assign it to the start_date and end_date variables. So the notebook declares these parameters and then pulls their values in, and the values come from Data Factory. Back in Data Factory, on the bronze notebook activity's Settings, we have Base parameters.

From Base parameters click New and add one named start_date, then click New again and add another for end_date. For the value you could just type something in, a static value, but we want to Add dynamic content, because the value should change every day: it's a date. For start_date I'll paste in the expression and talk you through it: we format a date-time, take UTC now, add minus one day (in other words subtract a day), and output it in year-month-day format, because that's the format the API expects. Click OK and that gives us the start date. The end date is similar: paste the same expression but remove the add-days part so it's just UTC now, formatted the same way, and click OK. That's the start and end dates done. What this expression builder gives you is a little bit of coding: you can use parameters, system variables, functions (there's a whole bunch of them) and any pipeline variables you've defined (we don't have any, but you could) to do a bit of logic and inject the values you want to pass in.
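Putting the two sides together, the bronze notebook's parameter cell looks roughly like this. The Data Factory expressions in the comments are my reconstruction of what's described above (the exact format string is an assumption), so double-check them against the step-by-step guide:

```python
# Declare the widgets that Data Factory fills in via Base parameters,
# then read their values into plain Python variables.
#
# On the Data Factory side, the dynamic content is along the lines of:
#   start_date: @formatDateTime(adddays(utcnow(), -1), 'yyyy-MM-dd')
#   end_date:   @formatDateTime(utcnow(), 'yyyy-MM-dd')
dbutils.widgets.text("start_date", "")
dbutils.widgets.text("end_date", "")

start_date = dbutils.widgets.get("start_date")
end_date = dbutils.widgets.get("end_date")
```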

That's how the values get passed in. Now, at the end of the bronze notebook, after we've done everything, we want to pass some of these values on so they can be used in the silver and gold notebooks: things like the silver ADLS path for the silver notebook, the gold path for the gold notebook, and so on. To do that, add a code cell where we create a dictionary, output_data, holding the start and end dates and the bronze, silver and gold paths. A notebook can only return a string, so we serialize that dictionary into a JSON string and return it from the notebook, pushing it out as the output, with dbutils.notebook.exit. So that's the bronze notebook done: we read the parameters in at the top and emit the values we want to share at the bottom.
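A sketch of that final bronze cell (the dictionary keys and path variables are assumptions based on the walkthrough; match them to the repo's notebook):

```python
import json

# Bundle everything the downstream notebooks need and return it as the
# notebook's output. dbutils.notebook.exit only accepts a string, so the
# dictionary is serialized to JSON first.
output_data = {
    "start_date": start_date,
    "end_date": end_date,
    "bronze_adls": bronze_adls,
    "silver_adls": silver_adls,
    "gold_adls": gold_adls,
}

dbutils.notebook.exit(json.dumps(output_data))
```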

If we then go to the silver notebook, we remove the hard-coded cell and paste in the new one. We import json and assign bronze_params by reading the bronze parameters, which will get passed in through the base parameters (we'll wire that up in a moment). Then we json.loads that value to turn it back from the serialized string into an object, which means we can call .get on it and pass in the key we're after; it returns the value for that key, the key being start_date, for example, and that only works because we've turned it back into a JSON object. Additionally, at the very bottom of the silver notebook, scroll down, add a code cell and call dbutils.notebook.exit again, this time passing the silver output path, because we'll need it in the gold notebook; if you look at the top of the gold notebook you can see it wants that silver output path. So in the gold notebook, again remove the hard-coded cell and paste in something similar: import json, get the bronze and silver parameters, and json.loads the bronze one back into an object, whereas the silver params don't need it because the value is just a string as-is. At the very end of gold we don't have to output anything, because it's the last notebook of them all. That's the notebooks done; let's go back over to Data Factory.
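For reference, the top of the silver notebook would look something like this sketch (widget and key names assumed to match the bronze output above):

```python
import json

# Read the serialized bronze output that Data Factory passes in as a
# base parameter, then deserialize it back into a dictionary.
dbutils.widgets.text("bronze_params", "")
bronze_data = json.loads(dbutils.widgets.get("bronze_params"))

start_date = bronze_data.get("start_date")
end_date = bronze_data.get("end_date")
silver_adls = bronze_data.get("silver_adls")

# ...at the very bottom of the notebook, after writing the cleaned data:
# dbutils.notebook.exit(silver_output_path)
```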

Now, how does this map between the notebooks in terms of parameters? The base parameters for the bronze notebook are fine: start_date and end_date are all it requires. For the silver notebook, what do we pass in? Go to its Base parameters, click New, and this time the name is bronze_params. The value is dynamic again: we pass in the bronze notebook activity's output, something along the lines of @activity('Bronze Notebook').output.runOutput, making sure we take the run output specifically and wrap the whole thing in string() so it comes through as a string. Click OK and that passes the bronze parameters into the silver notebook. For the gold notebook we want the same again: add bronze_params and paste the same expression in, and also add silver_params, pasting it in again but changing the activity reference to the silver notebook's output. Click OK, and that's how to pass the parameters through.

Now let's Publish all to save the pipeline; that deploys the changes to the factory. Then we'll add a trigger and get it to run. Click Add trigger: you can choose existing triggers if you've made any (you probably haven't), so click New. Change the name to Daily Trigger. For the type you can have a schedule, tumbling window, storage events or custom events, which are worth exploring if you want, but we'll use a normal schedule: start date today, the time is fine, recurrence once a day, executing at 00:00, so bang on midnight, and start the trigger now. Click OK, OK again, then Publish all to publish the trigger. It didn't appear to fire straight away, I'm not sure why, so instead, just to show you, go to Add trigger, Trigger now, click OK, and you'll see it start running. Click View pipeline run and it takes you to Monitor, under Pipeline runs; that's because I triggered it manually, whereas if it had been fired at 00:00 by the schedule it would appear under Triggered runs. It comes up looking much the same as before: click into it and it opens up exactly as we saw in the Author view, and you can follow it from here just as normal. So that's how to add a trigger to schedule your pipeline. And there we go, the run has now completed.

You've done amazing so far. What we have done is: ingested data with Python in Databricks, pulling it from an API endpoint into the bronze layer of a medallion-architecture storage container inside an Azure storage account; used Databricks to transform and clean that data into the silver container and then the gold container so it can be used directly for business purposes; and brought those notebooks into Azure Data Factory, abstracted the variables out of the notebooks into Data Factory, and orchestrated the whole pipeline from there. So we've done amazing. If you've got this far, comment "pineapple" or something down below, and please like, comment and subscribe; it helps the channel grow, helps me learn more by making more tutorials, and in turn helps you. If you'd like to help decide which tutorial to make next, consider joining the channel as a member. Anyway, let's get back to it.

Now that that's done, let's load the data into Synapse so we can run some queries against it. My initial Synapse deployment didn't seem to work, so I've redeployed it; it may take a moment or two, but as soon as it's ready we'll jump right in. Right, now that it's complete, go to the resource group, find the Synapse workspace, click on it, and from there open Synapse Studio. It pops out into its own window, same as I was saying before: it's like a separate application, and it takes a moment to load. Once it loads, we're going to link to the data and run queries against it. This is the Synapse Analytics home screen, and you can do a lot from here; you could probably do this entire tutorial inside Synapse Analytics, to be fair, because there's a lot of overlap between different Azure services. For us, go to Data on the left, then Linked, then the ADLS Gen2 account, because that's what we linked when we created the workspace. Expand it and you can see our containers: bronze, silver and gold. Click on gold.

Now we can start writing queries against it. Up top you'll see New SQL script, then Select TOP 100 rows; the file type is Parquet, so click Apply, and we can run the script. It retrieves the query results and puts them below; it takes a second to run, and there we go, you can see it's getting back all of this information. Notice that the query is FROM OPENROWSET: normally you'd have FROM a table name, but Parquet isn't a table, it's a file, and OPENROWSET lets us open a file in Parquet format as if it were a table, so we can SELECT from it. The cool thing about Synapse Analytics is that you can write quite complex queries against it. For example, paste this one in: we select the country code and count how many low, how many moderate and how many high significance events there are, from that same OPENROWSET over the data lake, grouped by country code, so for each country we see low, moderate and high, and then run it. Building it up by hand: we want the country code, then a count of the low classifications using a CASE WHEN on sig_class being low, and the same again for moderate and high, aliased as the moderate count and the high count (add the commas and tidy it up), and then GROUP BY country code. What this does is take the TOP 100 of these and, for each country code, count the lows, moderates and highs; we keep the TOP 100 just so it runs a bit faster. And there you go: for these country codes, the US has 258 low, nine moderate and no high, and you can see the variety for each country. That shows just how powerful the Synapse Analytics workspace is, because we can now query our data lake as if it were a SQL database.
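The video builds this query interactively in Synapse Studio, but if you'd rather run the same aggregation from Python (for example, from a local script), a rough sketch against the serverless SQL endpoint looks like this; the server, UID, storage URL and column names are all placeholders or assumptions, so substitute your own:

```python
import pyodbc

# Connect to the Synapse serverless SQL endpoint (placeholders throughout)
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
    "UID=<you>@<your-tenant>;"
)

# OPENROWSET treats the gold Parquet files as a table; the CASE expressions
# count events per significance band for each country code.
sql = """
SELECT
    country_code,
    SUM(CASE WHEN sig_class = 'Low'      THEN 1 ELSE 0 END) AS low_count,
    SUM(CASE WHEN sig_class = 'Moderate' THEN 1 ELSE 0 END) AS moderate_count,
    SUM(CASE WHEN sig_class = 'High'     THEN 1 ELSE 0 END) AS high_count
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/gold/earthquake_events_gold/**',
    FORMAT = 'PARQUET'
) AS gold
GROUP BY country_code;
"""

for row in conn.cursor().execute(sql):
    print(row.country_code, row.low_count, row.moderate_count, row.high_count)
```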

The final thing you can do is a visualization: you can try to visualize this data in Power BI. Synapse Analytics is a way to interact with the data and run SQL queries to get information back, but if you really want to see it, you want it in Tableau or Power BI. Power BI doesn't run on Mac, and I'm on a Mac, so I can't run it here; I could possibly try it in Microsoft Fabric, but it doesn't look like I'd be able to get the data from Synapse without downloading a hard copy, and it would just be a bit of a faff. So if you do want to put this into Power BI, I have a video which will be carded here and also linked at the end; go and check that out and you can do a simple visualization. It's the exact same steps, and I show you how to do it there. I do hope you've enjoyed this; any comments, questions or queries, let me know in the comments below, and until next time, take care. Welcome back, my name is Luke, and in today's episode we're covering part five: the Power BI

visualization of our data engineering end-to-end project in Azure. In previous episodes we covered the business request (why you would even do a project like this), the basic architecture, setting up an on-prem SQL database, the services for the pipeline (Data Factory, Data Lake, Databricks and Synapse for the extract, transform and load), and the setup of Power BI, which was actually in the first episode. If you want to follow along, go and watch those episodes, carded above or linked in the description. Today we're focusing on Power BI, and tomorrow (subscribe for that next episode) we'll be doing the triggering for the pipeline, so how to make it run automatically every day with the most up-to-date data, and we'll also look into Active Directory so that the relevant people have the relevant levels of access depending on their team, role or group. So without further ado, let's get into Power BI.

In episode one I showed you how to set up Power BI, because you do need a work email, although there is a workaround, linked in the description of that video. Note as well that app.powerbi.com, the web version, only lets you change things like filters and interact with dashboards; you can't actually build dashboards from scratch. For that you need the desktop app, and for the app you need a Windows machine. You can try to virtualize it, but it's a bit of a nightmare, so honestly if you don't have a Windows machine you're probably better off using Tableau. The best thing to do is install the app and sign up via the link in that video.

Anyway, in Power BI, open it up; this is the home screen, and we want to open a blank report. For this blank report we need data, so go to the top ribbon, Get data, and select More. When that loads, search for "Azure" and you'll see the variety of sources we can connect to; the one we want is Azure Synapse Analytics SQL, because that's where our data lives. Click it, click Connect, and some settings come up: which server, and which database (which is optional; for us it's the database from the previous episode). For the server name, go back to the browser, open the resource group, go to Synapse, and it's the same SQL endpoint as last time: Settings, Properties, Serverless SQL endpoint, copy it, go back to Power BI and paste it in. Click OK, and here we want to authenticate with our Microsoft account, so sign in with whichever email you're using for the portal, click Connect, and it will load the data up.

As you can see, it's loading all of the tables. You can either load this data or transform it first: if you transform it, you're essentially defining a series of steps the data will always go through whenever it gets refreshed, so you could rearrange columns, change types, rename columns, whatever it may be, which can be quite handy. For us right now, though, we'll just select all of these tables and load them in. Now, a very brief overview of Power BI while we're here: Report view is your dashboard view; Table view lets you look at your tables; Model view is where you can see the data model, which is actually really important and we'll get to in a second; and lastly there's the DAX query view, where you can write functions to query your data, DAX being Power BI's library of functions and operators.

In the model view you can see that Power BI has automatically created a model. What it does is look for things like matching column names and infer "this must link to that", so that when you run queries or filters it can match things together. This matters because, for example, Address is linked to the CustomerAddress table, so if you filter by city, that can be reflected in your customers. It's pretty messy to look at here, but the key things to know are: these are your tables, the lines are the relationships, the arrow shows the filter direction, and the symbols tell you whether it's one-to-one, one-to-many or many-to-many, the star indicating "many" and the 1 indicating "one". For instance, the relationship between Customer and Address here is one-to-one: one customer can have one address and one address can have one customer. Does that sound right? Possibly not, because multiple customers can share an address if they live at the same place. So go to the top and open Manage relationships, which is a much better way to see them (there's also an Autodetect button, but clicking it won't change anything because these were all auto-detected). Take the CustomerAddress-to-Address one: it's one-to-one, and in our case we'd probably say it should actually be one-to-many, and the cross-filter direction can be Both. That's just the filter direction: you could make it one-way, meaning a filter here is reflected there but not vice versa. Click Save, and that's fine for us.

One more thing to note before we move on, an important concept for data engineering that we won't cover in depth here since this series is more about using Azure, is fact and dimension tables. Fact tables contain the quantitative data for analysis, while dimension tables contain descriptive attributes related to the data in the fact table. There's a good example of it here, where the fact table holds the foreign keys and the dimension tables hold the primary keys: there are different locations, the fact row just says "this item is in this location", and the location dimension tells you the contact person, where it actually is, and so on. You get the gist.

That's why these relationships are so important. Now, back in the report view, let's actually build out a little report. First a quick look at the UI: here are all the visuals you can drag and drop into your report, different types of charts and so on, plus a slicer, which is a filter; and over here is all your data. If you expand any of these tables it shows all the columns, with an indicator for the key. The first thing we'll add is a card. Click it and a card pops onto the canvas, and in the pane you have the fields the visual needs, the formatting options, and the analytics options. With the card selected, drag Product ID into the Fields slot. You can see it has automatically chosen Count; you can pick different aggregations, like Sum, but summing an ID doesn't make any sense, whereas counting it does, so you get Count of Product ID.

In the Format pane, switch off the category label (that "Count of..." text under the number), go to General, turn on a Title and call it "Number of Products"; you'll see the title pop up, and you can tweak all the usual things from here: colours, background, which heading style, and so on. Now that we have that card, Ctrl+C, Ctrl+V it, and in the copy let's show the sum of sales instead: clear the existing field, go to SalesOrderDetail on the left, find Line Total and drag it in. It shows Sum of Line Total, so it's getting the gist of what the column is. Then under General change the title to "Total Sales". So we have two cards, Number of Products and Total Sales, which you can resize and arrange however you like; this is just an example of what you can do. Now let's open the context back up.

If we come back to the business request to see what's actually driving this dashboard: it asks for a dashboard giving insight into sales by gender and product category, showing total products, total sales and a clear gender split amongst customers, with the ability to filter by product category and gender, all in a user-friendly interface. So we've shown the totals; now we want to be able to filter by category and gender, and they also want to see a clear gender split amongst customers. Let's do the gender split first.

From the visualizations pane pick a donut chart. For the split, go to the Customer table; we can try to infer gender from the Title column, so drag Title into Values and again into Details, and the chart now shows the different titles. You can quite clearly see that, going by titles, the likelihood is we have a majority of male customers. But we have a variety of products, 295 of them, so which products? This is where we add our filters. Click out of the chart and create a slicer, in fact two of them. We'll expand and resize things as we go; this isn't a lesson in visualization best practice, it's just so you can see what's happening. In the first slicer, drag Title into the field; then Ctrl+C, Ctrl+V for a second slicer, clear its field, and from ProductCategory drag in Name, so now we have the list of product categories. Move the Title slicer up and keep it small, and make the category slicer bigger since there are a load of categories.

Now in this visualization we can ask: what about brakes? What's the split for brakes? You can see that selecting Brakes affects the cards (there are two products for brakes and a total of about 830, dollars or pounds), but it's not affecting the donut chart. That's because of the importance of the model view: if we go back to it, I suspect there's no connection between the customer tables and the category tables, and sure enough, in the relationships you can see Customer sitting all alone without a relationship to anything, which is why updating the customer filter doesn't affect the chart and vice versa. So go back to Manage relationships (yours might look different, because last time mine looked completely different to this); what we're going to do is scrap all of them and create some new relationships. You can pause the screen and check yours look like these, or yours might work out of the box, or pause and create them yourself. There we go; click Close.

Back in the report, make sure the slicer is back on category: if we click Brakes, it's now reflected in all three visuals, the product count, the sales and the chart, and the same up top with Title; change the title and everything updates. Let's pick something at random: Pumps, two products, no sales; Panniers, no sales; Locks, not selling much of those either; Mountain Bikes, there we go, and there's a split: 87.5% have the title "Mr.", so likely a majority male, as inferred from the titles. And up top we can change it too: say we want the total sales of products bought by customers with the title "Ms."; click that and we see around 150K in total sales.

So we've now loaded the data into Power BI and created a very simple dashboard. Now that we have the dashboard we want, we can save it (it saves as a Power BI workbook), and we can also share, export or publish it. If you share it, someone else can access and edit the file. If you export it, you can produce a PDF to send to people. And if you publish it to the Power BI service, you get a link you can follow to the dashboard, where people can interact with it. So if you want someone to have an interactive copy, send them the published link; if you just want them to have a static copy, export the PDF; and if you want them to be able to edit it, share the file. In the next episode we're going to go to the pipeline and create a trigger so that this runs automatically, because, as the request says, we need up-to-date, accurate data; so we'll edit the pipeline to add a schedule that runs each day, and then go to Active Directory and make sure only specific people can access specific resources. I'll show you that mainly as an overview, because we won't strictly need it for this project unless you're going to use it as the basis to build something bigger. Anyway, any questions, comments or queries, as usual put them in the comments below, and I'll see you in the next episode.
