I work as a PM in data enablement; this video was amazing for understanding each component in a data pipeline.
3:13 typo *AWS Glue.
Love these vids, thanks!
bruh had me googling whats AWS glow
00:01 Data pipelines automate data collection, transformation, and delivery.
00:38 A data pipeline involves stages like collect, ingest, store, compute, and consume.
01:18 A data pipeline captures live data feeds for real-time tracking.
01:56 A data pipeline involves batch and stream processing of ingested data.
02:40 Data pipeline tools like Apache Flink and Google Cloud are used for real-time processing of data streams.
03:23 Data is transformed for analysis in the storage phase.
04:05 Data pipelines enable various end users to leverage data for predictive modeling and business intelligence tools.
04:47 Data pipelines enable continuous learning and improvement using machine learning models.
Crafted by Merlin AI.
The best animated introduction to data pipelines in just five minutes.
Great video. Showed me the fundamentals of data pipelines and processes from collection to consumption. There are so many tools/applications extensively used for data processing at various stages that I have never heard of, or have only encountered in job descriptions; since I am not a data specialist, I had no idea about them! Thanks for putting these short summaries online. Helpful for people like myself!
Loved how simply you explained this complicated concept! Also, what are your thoughts on Irys, billed as the world's only provenance layer for data integrity and accountability?
Spark is widely used in stream processing too, not only batch; see Spark Structured Streaming.
For stream processing, Apache Flink is more suited. Even though both can do stream and batch processing.
I love the short video format, as I can dive deeper on topics and terms I am interested in on my own time :)
Amazing explanation, by far the easiest-to-digest video about data pipelines.
Your channel is a blessing.
0:49 Shouldn't the last one be 'Consume'?
Yeah... it's an error, but the video has already been published. They can't go back and edit it from the beginning.
Love the presentation. Do you recommend a resource for learning to make it?
This video was amazing for understanding, thank you 🤗🤗
I like your presentations. What do you use to make them?
I also want to know what he uses to create the presentation illustrations. They look neat
@@chrisalmighty Adobe Illustrator and After Effects
Great video. Small remark: the AWS service for ETL is called AWS Glue, not Glow
Video illustrations look neat. What tool did you use to create the presentation illustrations?
I think you meant AWS Glue 3:18. Appreciate these informative videos
💯 Looking like your channel is on track for 1 million subscribers by year end! Great stuff! 😎✌️
Suppose we have 100 microservices deployed as different AWS Lambda functions. Out of these, more than 30 Lambda functions need to write data to MongoDB Atlas. Each of these 30 functions is triggered simultaneously via SNS (Simple Notification Service), and each function will be invoked 200,000 times to process different data.
Given this setup, the MongoDB Atlas connection limit will likely be exhausted due to the large number of simultaneous requests.
What would be the best approach to handle this scenario without running into connection problems with MongoDB Atlas? Could you create a video for this scenario, sir?
Fantastic video and graphics, what program do you use to animate your graphics? It's great stuff.
Thank you very much. Mind-blowing explanation.
Maybe some examples of a simplified pipeline for a specific application would make this video even better.
Why do we mostly talk about data pipelines for BI or ML when many times we also need it for functional applications?
Those functional applications should likely use the same data platform; the only difference is how you're serving the transformed result. What difference do you think should be talked about, then?
Functional applications most likely consume a very small amount of data, while BI and AI/ML models require far more, likely GBs to TBs of data to work with.
There's no practical way to load 1 GB of data into your web app or query it with SQL; it just makes your app clogged and slow.
Because more and more non-traditionally technical business roles are leveraging data for business intelligence - so the demand for understanding these concepts is greater there (than in complex application architectures where more traditional technical skill accumulates).
Just call it messaging and you’re good to go
@@manishshaw1002 This isn't always true. At the health insurance company I work at, we have functional applications that internal users and providers use to view data about members, and there are vast amounts of data streaming to and from these applications.
thanks for the knowledge you share
I want to learn system design for data pipelines
Could you please suggest how to proceed? What books?
Very good discussion
Very informative!! But how do you do all these animations? What product do you use?!
Which tool do you use to create these animated presentations?
Trade secret 😂
What do you use to create these animations/infographics?
I think it could be either Figma or Canva.
@@Biostatistics Is there a video out there that shows how that is done in PowerPoint? I see these data-style infographics a lot these days.
@@user-data_junkie It says in the description of this video that he used Adobe Illustrator and After Effects. 😊
@@Biostatistics thanks. I did check at the time and did not see anything. Appreciate the update
Why is Apache Flink not an option for batch processing? As I understand it, it makes more sense to use the same computation frameworks when doing both, so why not use Flink for both given Flink can support batch jobs?
Important information about refunds: what a joy
This was very useful. Thanks for sharing.
One-stop-shop video. Loved it ♥
Very Good Video!! Easy to get!
Top quality work as always
Thanks!
Love it. This jargon is clear now.
Amazing video. Thanks for your great efforts!
Intimidating!
Thank you for doing this!
No mention of Apache Iceberg and similar technologies?
Is GA4 considered a data stream? And is BigQuery a storage and transform tool?
Why would ETL here be considered real time, when ETL is slower because you need to transform every single extraction before you load it into a data warehouse?
Your diagram had 'compute' arrows twice, while you verbally said 'compute' and 'consume' for the last two phases.
AWS Glow or AWS Glue?
Thanks
3:13, what is AWS Glow? Typo??
Looks like your examples are only AWS or Google stack. Why not cover examples from MS Azure stack as well?
Tomorrow i have an interview :)
Didn't make a mention of the data lakehouse.
Always so so good
May I ask how these beautiful illustrations are drawn? Amazing!
I don't know why, but the gain of the microphone is too high; there is a little background noise and it's a bit noticeable. Keep it in check.
Great video, as always on this channel.
You have an error in the diagram: two 'compute' labels. It should be compute and consume.
Bravo!
So basically a data pipeline is similar to a system flowchart?
So I need to build a way to retrieve many emails, categorize them with an ML model, and then save them in the right system. Do I build this with Kafka and PySpark? Or how can this be done easily?
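Kafka and PySpark are not the only option for this; at modest volumes the same fetch → classify → route flow fits in a single process. Below is a minimal sketch of that shape, with a keyword rule standing in for the ML model and plain dicts standing in for the target systems; every name here is hypothetical. You would swap in `imaplib` (or an email API) for fetching, a trained classifier for `classify`, and real sinks for `systems`, and reach for Kafka/Spark only when volume demands it.

```python
# Minimal single-process email pipeline: fetch -> classify -> route.
# A keyword rule stands in for the ML model; the "systems" are dicts.

def classify(email: dict) -> str:
    """Toy stand-in for an ML model: route on subject keywords."""
    subject = email["subject"].lower()
    if "invoice" in subject:
        return "billing"
    if "error" in subject or "outage" in subject:
        return "support"
    return "general"


def run_pipeline(emails: list) -> dict:
    """Categorize each email and deliver it to the matching system."""
    systems = {"billing": [], "support": [], "general": []}
    for email in emails:
        systems[classify(email)].append(email)
    return systems


# Example inputs standing in for fetched messages.
emails = [
    {"subject": "Invoice #42 attached", "body": "..."},
    {"subject": "Outage in region eu-west", "body": "..."},
    {"subject": "Lunch on Friday?", "body": "..."},
]
routed = run_pipeline(emails)
print({k: len(v) for k, v in routed.items()})
# {'billing': 1, 'support': 1, 'general': 1}
```

The same three stages map directly onto a Kafka topic (ingest), a Spark job (classify), and per-system sinks (route) if the workload later outgrows one process.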
Kafka dear
Leaving out all Azure tools... really a shame
Maybe it's intentional. Many serious data scientists aren't fond of the Azure UI for big data pipelines.
Microsoft training has that covered
AWS Glow or Glue?
AWS Glue*
❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤
AWS Glue, not Glow
"Trade Secret" name of the tool used to create the animations ...😂
The Apache Hive logo is on acid.
REST API
😎🤖
I like your content a lot, but you have a lot of mistakes. Not only in this video but also in the others.
Mislabeling, duplications. It might get quite confusing for a beginner. Similarly, if you are using acronyms, I would recommend explaining them or at least stating the full name.
Looks like you need to change the mic you are currently using. There is some crackling noise when you talk.
Hadoop is dead
Why? What's the reason?
They said that about mainframe computers 30 years ago, but they are still here, in production. Large organizations are not going to adopt the latest solutions for all their data needs (for instance, data that isn't accessed that often, or specific use cases; or they might have support staff that is more familiar with legacy tools and they don't see the need to adopt the latest methods at the moment). So I can guarantee Hadoop is NOT completely dead.
Lol it’s not dead at all, and its ecosystem tools are still widely used
😂 Most use HDFS as a data lake. When you say Hadoop is dead, be precise and say MapReduce is dead, because the Hadoop ecosystem is large and still functioning.
@@shilashm5691 Most use AWS S3 as storage for their data lake; others use Azure Data Lake Storage. MapReduce is dead, and HDFS is on the brink of obscurity as well. I pity those who still have to work with some in-house HDFS from the darkest and most painful era of data engineering (the Hadoop era).