Data pipeline vs Dataflow vs Shortcut vs Notebook in Microsoft Fabric
- Published 28 Jun 2024
- FREE 40-minute Fabric fundamentals course: www.skool.com/microsoft-fabri...
So you want to move your analytics workloads from a Power BI-centric model to a Fabric-centric model? But how do you do that?
This video is the third in the series where I discuss different options for getting data into Microsoft Fabric, including Data Pipelines, Dataflow Gen2, Fabric Notebooks, OneLake Shortcuts, and Database Mirroring.
We talk about the pros and cons of each to help you make better architectural decisions in Microsoft Fabric.
Catch up on the Power BI to Microsoft Fabric Transition Guide series here: • Power BI to Fabric Tra...
Timeline
0:00 Intro
1:08 Series review
1:29 The problem
3:07 Overview of the different methods
5:19 Data ingestion principles
7:29 Dataflow overview
7:54 Dataflow: when to use / when not to use
10:09 Dataflow: implementation notes
13:11 Data pipeline overview
13:27 Pipelines: when to use / when not to use
15:35 Pipelines: implementation notes
17:38 Notebooks overview
18:29 Notebooks: when to use / when not to use
20:16 File/database replication overview
22:30 Shortcuts overview
22:52 Internal shortcut limitations
24:05 Shortcuts: implementation notes
25:17 Database mirroring overview
27:53 What to consider when choosing a method
#powerbi #microsoftfabric #dataanalytics
Hey everyone, I'm back! If you found this video helpful, please do give it a like or share it with colleagues, it really helps grow the channel 😊 THANK YOU!
Awesome video. Thoughtful considerations and nicely done.
Great video thank you
😊Thank you! Very helpful
Thanks for sharing this kind of content, super useful!
This is my favorite video and reshaping my knowledge framework about Fabric! Thank you, Will !
Great to hear, glad you’re finding it useful!
Greetings from Guatemala, Central America. Thanks, I'm learning a lot with your videos ⌛
⏳watched it all the way through. Thanks for the detail. Loads of options for loading data!
You have such an inclusive approach- how could we resist watching until the end? Thank you for such helpful content!
Thanks for the lovely comment, glad you’re finding the videos useful 🙌
Thank you. I love the way you explained each concept. ⏳
"How to get the data into MS Fabric?" - A great explanation of the possible options for getting data into MS Fabric, i.e. Data Pipelines, Dataflows, internal/external Shortcuts, file/database replication, Fabric Notebooks, and Database Mirroring. Keep it up Will :)
Thanks!
great video, nice pacing and really useful framing of the topics covered. Thank you!
Thanks a lot for your comments, glad you enjoyed the vid 🙌
I really enjoyed your approach to describing the features of Fabric. Thank you. ⌛️
⏳ Loving the super helpful videos. Especially this series of Fabric videos, which are really useful even if you are not coming from Power BI. Thanks for your content 🙏🏻
Glad you're enjoying, thanks for watching 🙌
Very well explained ! Thanks !
thanks for watching!
I really liked the way you encapsulated the methods. Much better and more high-level than Microsoft Learn. And hopefully I have enough ⏳ to finish these before the exam.
Thanks a lot for all your videos. I’m new to fabric and they have been a huge help in getting my feet wet. I really like your presentation style and pacing
Thanks for watching and for the lovely comment, glad you're enjoying!! A lot more to come :)
If it makes sense for your channel, I would love to see a video or series that goes into some best practices or design patterns for incremental processing of data through a medallion architecture. I am seeing use cases where users need to periodically drop a file every so often like after a weekly payroll or after a month end. Users may need to add files, delete files, or replace a file in case a mistake happened. I’m coming up with some creative processes to manage that without reprocessing the entire set through the silver and gold layer each time but I’m not sure if there are better patterns out there.
Thank you for this explanation of data ingestion in Fabric environments ⏳
No problem, glad you enjoyed!
⌛️ Brilliant series! I'm learning Fabric so I can ingest Azure Resource Graph queries from Azure tenants for use in Power BI dashboards.
Awesome, good luck with that!
⏳ - superb, thanks Will!
Thanks for watching James, glad you enjoyed 🙌
Very helpful ! Thank you⌛️
Ah I'm so glad you found it helpful, thanks for watching and for commenting, it really means a lot!!!
Perfect explanation
Thanks for watching!!
Really useful information thanks for sharing ⏳
Thanks for watching!! Really appreciate it
⌛️
Thanks so much for such insightful content. Waiting for the capacity and costing considerations video.
Thanks for watching! Here's some previous content that you might have missed:
On capacities: th-cam.com/video/H6SmdqFbhE0/w-d-xo.html
On costing: th-cam.com/video/w481BSXk0Bw/w-d-xo.html
This is very helpful stuff as I study for the DP-600, and I'm also eyeing an on-prem Postgres database that I'd like to get into a Fabric lakehouse. Thank you! ⏳
Glad you found it useful! Got more DP-600 related stuff on the way soon 👍
⏳ Thank you for sharing your knowledge and experience.
Thanks for watching!!
Great work.
Wow thank you, that’s very generous! I really appreciate your support 🙌🏽
Great video... Please do publish videos to prepare for the DP-600 certification.
Thanks a lot for watching and for commenting, I really appreciate it! And yes... DP-600 is coming v soon (after I finish this series, that will most likely be the next one) 😊
Are you currently preparing for the exam?
Thanks!
Wow very generous - really appreciate it, thanks 🙌🙌🙌
⌛️very useful. Thanks a lot
No problem, thanks for watching!!
⏳ Thank you
⌛Thanks, nice overview
Thanks for watching!! 🙌
⌛Enjoyed the pragmatic approach. Development and engineering require knowing what tools to use and when to use them and data ingestion methods are vital in this arena. This video did not have any fluff but encapsulated the following quote: "Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." - Antoine de Saint-Exupéry
😂 very kind, thanks for watching! 🙌
This is by far the best Fabric series that I have ever seen (even the learning materials from MS are not so well organized). By the way, I would like to ask:
1. Can we use a shortcut between two ADLS Gen2 accounts, or must the destination of the shortcut be Fabric storage (data warehouse or lakehouse)?
2. Why did you mention that database mirroring prevents data duplication? It actually duplicates the whole set of data in Delta format, right?
Also, I am really looking forward to the cost and VNet related videos. For example, how does Microsoft bill the different ingestion methods? Which ingestion method is more cost effective? Most of the time, cost is the deal breaker in how people choose a product (Fabric or others) or a method (dataflow or Spark).
Cheers.
Thanks, glad you’re enjoying 🙌🏽
1. no
2. Yes, mirroring is duplication. Not sure why I wrote that, and how I missed it when recording. Sorry about that.
And yes interesting to hear you talking about cost and capacity usage of different methods, I’d love to do some benchmarking in the future 👍
Very helpful! ⌛️
Excellent tutorial videos Will! Question: when you say small or large datasets, what sizes are we considering for each?
Good Content !
Thanks!
⌛️ love your videos!
Glad you like them! Thanks for watching! 😊
Thanks⌛
Thanks a lot for watching!
Thank you. ⌛️
Thanks a lot for watching!! Really appreciate it
⌛excellent video!
Thanks, glad you enjoyed 🙌🏽🙌🏽
⏳Thank you!
thanks for watching!
Thanks again Will, I can't thank you enough for what you are providing the Fabric community. Here is the hourglass you asked for ⌛⌛⌛⌛⌛⌛ 😄😄😄😄😄😄. I have a question about the difference between shortcuts + database mirroring on one side and the three data ingestion tools you showed on the other. You mentioned on the File/database replication slide that shortcuts and database mirroring don't include an ETL process, but in the next point you said that Fabric would create a Delta table 'cache' of the full dataset the first time. Isn't the latter an extract-from-source and load-to-OneLake process, or at least a copy process? Thanks in advance.
Haha thanks 🙌 mirroring is database replication yes. With shortcuts, by default the data stays in the source location until query time, BUT there is a setting in the admin portal (for Amazon S3 and Google Cloud Storage) to Enable Caching - this creates a cache of the data in Fabric, which can help reduce your egress costs
UPDATE: the following features have now been released:
Database Mirroring (Public Preview) has now been released, you can read more here: learn.microsoft.com/en-us/fabric/database/mirrored-database/overview
On-premise data gateway for data pipelines: learn.microsoft.com/en-us/fabric/data-factory/how-to-access-on-premises-data
Thanks for sharing, any reference for configuring Apache Sedona in Microsoft Fabric?
Thanks for watching! It's not something I've used I'm afraid, and I haven't seen anyone talk about Sedona/Fabric integration! Good luck though ☺️
⌛️ great overview!
Thank you! And thanks for watching!
Great video - I captured all your graphics and organized them into my Fabric OneNote notebook, so much easier than taking notes. Hope that is ok? ⏳
Thanks Christopher - and no problem, whatever helps you learn! 😊
⌛
Thank you for the video!
Do you know if Azure Database for PostgreSQL is also in private preview for database mirroring?
Hey Sergio! Thanks for watching 😊
I believe currently in Private Preview are Azure Cosmos DB, Azure SQL DB, and Snowflake. But they do mention they are working on SQL Server, Azure PostgreSQL, Azure MySQL, and MongoDB to be released sometime later this year. Read more here: blog.fabric.microsoft.com/en-us/blog/introducing-mirroring-in-microsoft-fabric/
Is that a feature you'll be waiting for I'm guessing??
Thanks so much Will for the very good and detailed content, easy to follow at this pace. I found your channel because many people shared it at FabCon24 in Vegas. One question please: in the video you mentioned that the on-premises gateway is not available with Data Pipelines, but I can see it now in Fabric. Is it something new since the video was published, or did I misunderstand that part? Thanks again
Yup that one was announced and released at FABCON (after I recorded), I have added it to the comments section ☺️ thanks! Hope you enjoyed FABCON!
Great videos !!!
Thanks for sharing these details and comparisons.
Talking about CDC and Delta time travel, a question came to mind. Are the transactions replicated individually in CDC, or is it a batch process? If it is batch, we're not really getting a time-travel option on the Delta side where we can land on any previous time or state.
That's an interesting question, I assume you're talking about when the mirroring process does the first lift-and-shift of the current state when you first set it up? I don't know the answer, but Database Mirroring is now in Public Preview so we can test it out
@LearnMicrosoftFabric
Yes that’s a good idea, thank you so much. Can you also create a video comparing databricks connected to unity catalog and Microsoft Fabric.
⌛
Great content again, Thank you Will!
- I do see parameters in Dataflow Gen2, and they seem to work OK; perhaps I am misunderstanding you.
- Yes, unfortunately I am also not seeing looping or other control blocks. Before Fabric, I implemented some transformation logic using Synapse mapping dataflows that utilize them (particularly your example of consuming a paginated API response). Well, I will have to do it with a notebook as you are saying (I prefer to have the looping or control encapsulated in the transformation activity). Although the name is misleading, I guess this is more of a direct replacement for Power Query (in Excel/Power BI) than for Synapse/Azure dataflows.
- I believe pipelines CAN be run with the Run button without having to schedule them.
- I am very used to being able to make quick changes (in Synapse and other Azure places) by looking at the JSON code behind objects; dataflows and pipelines give me a "View JSON Code" that is unfortunately read-only :-(
- I really liked the "metadata driven" approach for pipelines. I do not have a use case for that at the moment, but it is a very interesting concept. What I have used pipeline hierarchy for before is reusability of common components. One example: in Synapse pipelines there is no "email" activity, so I encapsulated email logic in a reusable pipeline (with several parameters) that in turn calls an HTTP Azure Logic App that sends the email.
Hi Ricky, thanks for watching and for the great questions!!
- Parameters: sorry, yes, I probably wasn't clear enough. You can create 'parameters', but these are just constants; currently you cannot pass dynamic input parameters to a dataflow from a data pipeline (like you can with a notebook).
- Looping: you can use the Until activity to loop through a list in a Data Pipeline. You can also use the Pagination setting in a Copy Data activity for that specific use case (but yes, personally I prefer managing it in Fabric Notebook code, as the pagination logic is sometimes not 'obvious' or regular for some APIs).
- Yes, pipelines can be run manually using the Run button.
- JSON code is currently view only, that's correct
- here's two resources if you want to read more about metadata-driven pipelines:
1: Using Lakehouse: techcommunity.microsoft.com/t5/fasttrack-for-azure/metadata-driven-pipelines-for-microsoft-fabric/ba-p/3891651
2: Using Data Warehouse: techcommunity.microsoft.com/t5/fasttrack-for-azure/metadata-driven-pipelines-for-microsoft-fabric-part-2-data/ba-p/3906749
Thanks for the great questions!!
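To make the pagination discussion above concrete, here is a minimal Python sketch of the kind of loop you might run in a Fabric notebook. The page-number query parameters, the empty-page stopping condition, and the Lakehouse table name are assumptions for illustration, not taken from the video; real APIs often use cursors or link headers instead.

```python
# Hypothetical sketch: looping through a page-numbered API in a Fabric notebook.

def fetch_all_pages(get_page, max_pages=100):
    """Collect records until an empty page is returned.

    get_page: callable taking a 1-based page number and returning a list of records.
    """
    records = []
    for page in range(1, max_pages + 1):
        batch = get_page(page)
        if not batch:          # empty page -> assume no more data
            break
        records.extend(batch)
    return records

def make_http_fetcher(base_url, page_size=100):
    """Build a page fetcher using the requests library (query param names assumed)."""
    import requests  # imported lazily so the loop above works without it

    def get_page(page):
        resp = requests.get(base_url, params={"page": page, "per_page": page_size})
        resp.raise_for_status()
        return resp.json()     # assumed: the API returns a JSON list per page
    return get_page

# In a notebook you might then land the records in a Lakehouse table, e.g.:
#   records = fetch_all_pages(make_http_fetcher("https://api.example.com/items"))
#   spark.createDataFrame(records).write.mode("append").saveAsTable("raw_api_data")
```

Separating the loop from the HTTP call like this also makes the pagination logic easy to unit test with a fake fetcher before pointing it at a real API.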
Good Job !
Thanks a lot for watching!! Have you implemented any of these yet?
@LearnMicrosoftFabric I'm attempting to utilize a Microsoft Fabric solution with Power BI, but there's currently a lack of comprehensive documentation on this topic! I eagerly anticipate your future videos, as they have proven to be immensely beneficial
Nice video. For data validation and data quality testing in notebooks, which methods do you suggest for doing it in an automated way?
Thanks! I have a 1hr+ tutorial on data validation in Fabric with demos and notebooks coming out on Friday so watch out for that one :)
Hi Will, really useful video, thanks. ⌛ You refer to shortcuts as a live sync of the source. Are you sure about that? My understanding was that it was a live link and wasn't actually copying or moving data anywhere. For ADLS and internal shortcuts at least. Of course database mirroring is different, which I'm glad to hear is in the roadmap. Cheers.
Thanks for watching Mark and for the great question! You’re right, I could have been a bit more accurate with my description on this, I will clarify in the next video ☺️ in short, for external shortcuts to ADLS, and internal shortcuts, it’s a live ‘link’ I.e. no data copy, but for database mirroring and external shortcuts to S3, a local cache inside fabric is needed.
Great, thanks for clarifying.
We are exploring ways to synchronize live data from our on-premises Oracle database to Fabric. Could anyone share their experiences or suggest the best practices for implementing this? Any insights on tools or methods that work well with Microsoft technologies would be greatly appreciated.
Not sure about that one tbh!
⌛
great video
Thanks a lot for watching!! Whch of these do you think you'll be using out of interest?
⏳great content
Thanks, and thanks for watching!
At 12:00 you mention looping through a paginated API. Is this the same as looping through multiple CSV files that all have the same headers?
Hey! It's a different use case, but you could use the ForEach loop to solve both problems yes!
⏳ good job
Thanks a lot 🙌
Is it possible to import a shortcut into a semantic model and add calculated columns or even make transformations in the data?
I think it would be possible to achieve something similar to that. In general, you can't edit shortcut data. Also you can't import a shortcut into a semantic model, only into a Lakehouse/ KQL database, and then create your semantic model using that shortcut table. You can however build on top of shortcuts, so you could create a SQL view on top of the shortcut that added additional logic/ calculated columns. OR you could do it in the Power Query editor of Power BI Desktop but I wouldn't necessarily recommend that!
Does that make sense?
@LearnMicrosoftFabric Amazing, thanks!
Compared to the Power Platform environment, what would be the direct benefits besides scalability? And compared to Power BI Pro, how much more do you need to pay? TY
Hi, Fabric is quite a different platform from the Power Platform tbh. Feel free to check out the 38-minute fundamentals video on my channel for a full outline of its capabilities, plus my video on pricing for those details 👍
Question: You said that Database mirroring does not "replicate data," however, it seems to me it is. Original copy in a Azure SQL database, and a second copy in Fabric, with automatic updating. Am I seeing this wrong?
Hmm not sure I said that did I? database mirroring definitely is a form of data replication 👍
At 27:17 in the video, the graphic says "Prevents Data Duplication." Maybe I'm interpreting the graphic wrong.
⌛⌛!! thanks man!
Thanks for watching!
⌛⌛ Very Informative video - thank you for this! ⌛⌛
Regarding using Notebooks to write scripts that fetch data from external APIs: how can we store the credentials we need to authenticate against the APIs in a secure place?
Also, besides the requests library, what other libraries are available? Can we also download and use any library we want, from sources like pip's requirements.txt?
Thanks for watching! On your first point, you can use Azure Key Vault to store keys securely and then access them using the azure identity & azure key vault python packages, like this: learn.microsoft.com/en-us/azure/key-vault/secrets/quick-create-python?tabs=azure-cli
I've mostly only used 'requests' for Python API calls, which is the industry standard. But you can use any Python library you like! Either you can install the package in a notebook cell using %pip install {package_name} or you can install libraries at the workspace level, in Workspace Settings.
Does that answer your questions?
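As a minimal Python sketch of the Key Vault pattern described above: the vault name, secret name, and API URL below are placeholders (my assumptions, not from the video), and it assumes the azure-identity and azure-keyvault-secrets packages, which you could install in a notebook cell with `%pip install azure-identity azure-keyvault-secrets`.

```python
# Sketch: reading an API key from Azure Key Vault inside a Fabric notebook.
# All resource names below are placeholders.

def vault_url(vault_name):
    """Key Vault endpoints follow a fixed URL pattern."""
    return f"https://{vault_name}.vault.azure.net"

def get_secret(vault_name, secret_name):
    """Fetch a secret value from Key Vault.

    Requires the azure-identity and azure-keyvault-secrets packages and an
    identity that has been granted secret-read access on the vault.
    """
    # Imported lazily so the URL helper above works without the packages installed.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(vault_url=vault_url(vault_name),
                          credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value

# Example usage (placeholder names):
#   api_key = get_secret("my-vault", "my-api-key")
#   import requests
#   resp = requests.get("https://api.example.com/data",
#                       headers={"Authorization": f"Bearer {api_key}"})
```

DefaultAzureCredential tries several authentication mechanisms in turn, so the same code can work interactively and in a scheduled run, depending on what identity is available to the notebook environment.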
@LearnMicrosoftFabric Yes! Thank you so much Will!
On the first one -
I assume we have to create a managed identity and assign it to the Fabric resource, and that it will automatically be picked up when authenticating against Key Vault. I am not sure if this is possible yet... any idea?
P.S I am looking forward to more of your videos! ✌
⌛😄 Well done!
Thanks! And thanks for watching 😊
⏳!
Thanks for watching!!
⏳
It lacks the "Eventstream" ingestion. :-(
I know! Yes sorry about that. The video was already quite long and I want to cover Eventstreams in more detail (because they are quite different to the others!). I'll be doing a whole series on real-time/ KQL/ eventstreams 👍
⌛
Thanks for watching Mark!
⏳
⌛
⏳
⏳
Thanks for watching!!
⏳
Thanks for watching! Appreciate it 🙌
⏳
Thanks for watching!!
⏳
Thanks Carlos 🙌