Introduction to Kafka really starts at 17:36.
Comments like this are helping this world become a better place
At a speed of 1.25 too
@@mwandulu You can go to 1.5 too with very little difference. :)
For me the whole talk was pretty good. See no reason to skip.
No reason to skip. The preamble puts things into context.
Very good introduction to streaming ETL architecture and Kafka. Misleading title. Streaming ETL is just another way of implementing ETL. Traditional batch-oriented ETL doesn't have to be totally replaced by Streaming ETL.
I think instead of saying ETL is dead, just say "I have no clue". I've never in my life recreated two streams to process the same data into different destinations (12:59). I'd do exactly the same as at (13:29), but with ETL tools.
I agree. If step 2 is the same for both destinations, why would you repeat it for Cassandra? You'd just add that destination to the load logic of the existing ETLs.
I wish the speaker had given a better example of where we might end up doing this and how streaming would have helped.
I believe streaming is advantageous from a cost perspective (ETL tools are super expensive) and for very large real-time volumes, where ETL tools cannot scale. I'm also not sure whether streaming really solves this problem; I've yet to work with streaming technologies.
Absolutely true. No one is dumb enough to run the computation twice when we have the option of loading the data into multiple destinations.
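To make the point in this thread concrete, here is a minimal sketch of "transform once, load into several destinations". It is not taken from the talk: the function names, the PII field names, and the two in-memory lists standing in for a warehouse and Cassandra are all assumptions for illustration.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Sink = Callable[[Record], None]

def remove_pii(record: Record) -> Record:
    """The shared "T" step: drop fields assumed (hypothetically) to be PII."""
    pii_fields = {"email", "phone"}  # assumed field names, not from the talk
    return {k: v for k, v in record.items() if k not in pii_fields}

def run_pipeline(source: Iterable[Record], sinks: List[Sink]) -> int:
    """Transform each record once, then hand the same result to every sink."""
    count = 0
    for raw in source:
        clean = remove_pii(raw)   # computed once per record
        for sink in sinks:
            sink(clean)           # fan-out: one result, many destinations
        count += 1
    return count

warehouse: List[Record] = []   # stand-in for a data warehouse load
cassandra: List[Record] = []   # stand-in for a Cassandra load
rows = [{"id": 1, "email": "a@example.com", "page": "/home"}]
processed = run_pipeline(rows, [warehouse.append, cassandra.append])
```

Adding a destination here means appending one more sink to the list; the transformation itself is not repeated. Whether that fan-out lives in an ETL tool's load step or in a stream that several consumers subscribe to is exactly what the commenters are debating.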
ETLs are not dead, they have just transformed. The KEY is not Apache Kafka, the key is DATA ARCHITECTURE; without it, Kafka will just add more mess.
This was a well-thought-out presentation: a brief introduction to existing systems and their limitations, then a transition to the need for Kafka, the way it is designed, and how Kafka addresses those limitations. Good one.
"event and batch have tradeoffs. now ignore the trade offs and try to use streams for everything" :/
2 observations:
1) History of ETL - missed the entire evolution of the data warehouse from MIS systems
2) Example of old and new “T”. You applied “remove PII fields” at the streaming platform. Who will identify which common transformations have to be applied at the streaming platform?
One benefit is: one higher level of abstraction
Very informative, precise and to-the-point introductory talk on data streams. It gives enough information that one knows why and when to look for streaming solutions, and which specific areas to dig into once one decides to go for such a solution.
So many angry comments. It's just an attractive title, not an actual PhD thesis.
Really? Data integration and application integration are not the same. ETL and EAI solve two totally unrelated problems. And how can one say that MQ does not scale when anyone who wants to scale can choose DDS or some other messaging technology?
ETL and EAI probably address different problems compared to streaming. As I see it, streaming is more about using the capabilities of the platform to integrate rather than using a tool to do ETL or real-time work; it handles the data transfer logic so we can avoid tools. Correct me if I'm wrong.
Not all data is big data, and not all data ever will be. There will always be a huge place for standard ETL.
ETL is outdated? That's news to any company that has no need to process terabytes of data in real time.
This is the problem with keynotes from super giant companies. They only speak from the perspective of a super giant company. The overwhelming majority of enterprises do not have scale problems in this category, but people from such companies walk out of the keynotes thinking "yeah, this is what we should do!". No, you probably shouldn't.
This is an intelligent and articulate overview of how Kafka in particular manages increasing volume, velocity and variety of "big data" using real-time streams. It may not resonate with everyone; not everyone needs this. Excellent for those getting started with streaming data and transitioning away from messaging queues or redundant ETL processes.
The "messy" diagram can simply be redrawn to match the Kafka-based diagram. Lots of good information, but the real differentiator is not the integration patterns. Anyway, Kafka is a great product.
Great architecture overview of Kafka Streams. Convinced me to look deeper into the Streams API and capabilities.
the T in ETL has nothing to do with scrubbing ('data cleaning') or normalization. if you're using ETL to scrub you're already too late in the pipeline and using a hammer as a screwdriver when you want a paintbrush. it's gibberish. ETL is for data snapshots to move between environments where you want only a subset of the data but it is transactionally stable. ETL is how you leave the house. Kafka is the road you drive on to deliver the payload from said house. different topics.
Kafka should be viewed as a simd replacement for amqp/zmq or as she has presented it a comparison vs elk for log processing as a limited use case. the streams discussion should be compared with apache storm for analytical capability or a distributed replacement for memcached performance counters. local state is a poor way of saying cache locality and migration.
this talk is all over the place. no mention of the problem of dealing with subaggregation and priority dependency issues inherent in kafka/storm without explicit payload tagging or reentrant use of the architecture in general as befits any simd speedup discussion.
if you are familiar with the concepts of noshared architectures for data presentation and want a messaging solution with the same principles then kafka may interest you. do not expect magic.
April 2021, Batch is still more popular than stream.
ETL is dead. Long live ETL!
ETL is not only meant for data integration.. what about business intelligence and analytics apps..
I think they are saying ETL in the form of ingestion of data INTO some tool. Not Spark or Hadoop jobs. In that case you could just subscribe to Kafka.
Kafka is not an ETL replacement; it is a streaming/message broker. ETL is a platform that offers adapters for receiving and writing data from/to multiple source/destination types (files, DBs, queue systems); it's a centralized mapper tool (say, XML to CSV) and supports various integration patterns (best practices). So ETL can typically be used to read/write from/to Kafka while it performs mapping, so that the destination system understands what the source system is trying to send, in real time. EAI systems, another platform type she mentioned, are written specifically for event/real-time purposes, so they are a more suitable platform for this type of work: they support transactional behavior and unified monitoring of what is flowing through them, in addition to adapters and centralized mapping.
How this woman managed to compare oranges and apples without receiving more down votes is beyond me.
6 years later, ETL is still alive
Nice presentation... I would like to know what vendors of ETL tools like Informatica, DataStage, etc. have to say about their products in light of this briefing, because these two are quite busy coming up with new versions.
Wonderful talk... thank you so much, Neha, for the presentation.
Thanks a lot for explaining clearly about: what happened yesterday, what's the pain point, what's the new requirement, and HOW.
First 15 minutes are more like a pitch deck dumbed down for a VC.
Misguiding. Why would batch processing be dead if it is just enough to do batch processing of your data?
7:30 "ETL (Extract Transform Load) and EAI (Enterprise Application Integration) are outdated"
Very clarifying explanation about Kafka, helped me a lot to understand the concept.
all Principles should be implemented in any streaming data to be in compliance at all time.
36:50 Come on. You just took the stream processing Java app and the dashboard app and put them inside one application. So the database is inside Kafka, and the job processing and dashboard are merged. There should have been 2 boxes, not 1.
data integration and service integration layer are handled by different products on the market. that's the main problem. and it is good to see them in a convergent approach. that's why Kafka is on the spot.
this convergence brings organizational effectiveness to the enterprise, because you can now combine BI's ETL team and the middleware team and get holistic integration capabilities, which also creates an advantage for transformation.
on the other hand, scalability is a relative concept. in an enterprise, EAI or ESB is scalable. ETL is batch oriented but it is feasible for an enterprise's near realtime concerns.
31:56 link to video please. Unfortunately can't hear names clearly
Can parallel processing in bandwidth fill multiple packages help in big data and distributed database and buffering work in hand in hand help in streaming. Also 3d volume fill data storage and extracting data format
Share distribute processing by using percentage of iddling resource in cloud sharing processing network
Awesome presentation skills and a clear explanation of ETL changing from batch to real-time
Great presentation and talk. Now I want to explore streaming platforms in detail.
Great video and gave overall idea on what Kafka is and how to play with it in real use cases. Excellent and kudos!
The speaker has no inkling of what ETL is or what a data warehouse is and how they are architected, designed, developed, provisioned and sustained. Apache Kafka is a great open source tool for integrating streaming data into your data lake; it is not a paradigm that will replace the technology-agnostic paradigm named ETL. I have used Spark SQL to realize an ETL-based solution. Again, Spark SQL is a tool and not a paradigm.
You confuse two loosely connected areas. Kafka is NOT the successor to ETL. ETL is a completely different group of products with a completely different application. Kafka may be the next generation of ESB. In addition, you must know that in the vast majority of companies around the world their "event-driven architecture" is MS Excel. Why do companies like Google or Facebook have their power? Because they are really unique. Meanwhile, most companies do things like they did 20 years ago. For them ETL is a miracle. They do not need any Kafka. It's beyond their perception.
Though it is not a paradigm shift, the approach given here eventually leads to a modern EDW with real-time streams.
By the 5:00 mark you'd figure out all the shenanigans regarding streams, data integration and why these corporate tech lords created Kafka. Good presentation.
I'm only using this to identify any hackers uploading anything of any kind. Number one, it was without my permission. Hacking is a federal offense, and it violates my privacy rights.
Really helpful.. Nice explanation..
A very well structured talk. Thanks for it. :-)
this butthurted all ETL folks..
E --(k)-- T --(k)-- L: this is where Kafka fits in ETL. ETL will never die, but Kafka is a good stream to use in ETL processes.
It's a very harsh statement to say ETL is dead. No, ETL is not dead.
Catchy title
Great presentation. She makes it look simple... Does anyone know the program used to create the presentation? I like the look, as though it was drawn "free-hand"... very sharp
Great content... useful information
ETL is not dead and if you want to be taken seriously in the world of data, I recommend you drop this suggestion...
Very helpful video! Quality content and great presentation. Stellar job
great thanks for information..
Great and it is very helpful, thank you
Just use dma.
This is not at all comparable; both are meant for different purposes. I doubt she has ever looked at DWH code and design. And I bet you can't show me one single implementation which includes a complete fact table design to solve a customer's business problem.
Saurabh, you are right. If you look at her work history, she worked for just 1 company (LinkedIn) and took Kafka out as a new company; she is just trying to make money out of that... She has no idea why facts and dimensions are needed. Whatever stream you add, someone needs to transform it into data which data analysts or data scientists can use.
Do you even know what "dead" means? ETL is used in many companies. Therefore ETL is not dead. Floppy disk is dead.
Great talk.
Trolling title... I'm not sure the speaker would approve of this title. It opens her idea to ridicule.
It was a nice presentation, but the majority of data generated by user actions is still stored in databases (SQL, Oracle), and thus ETL tools like SSIS are still needed to read it and send processed data to destinations. Some data could be in flat files, but that is not seen too often these days unless we are gathering from multiple public sources. Whenever I try to read the minds of speakers in YouTube presentations to see why they are using Kafka or Spark, all they give is a 'word count' example, which is sad. Take Spark as an example: sure, it can do distributed computing, but so can a lot of other tools if you have an array of cheap servers.
Hello, did you use chroma key in this video?
From what I’ve heard from this lady, I assume she has very little experience in ETL development (manual validation, for example); she just follows the modern fashion.
Igor Andriychuk yup, lets see how much longer silicon valley supports such scam artists in the name of VC funding....
elsewhere:
th-cam.com/video/4CkRewmRnRc/w-d-xo.html
I love how americans can make acronyms out of the most important words in a title and just assume everyone knows what they abbreviate. It always amazes me how they try to get thoughts across an audience with a bunch of these 3 letter, context specific, magic words flying around.
you don't have to use this kind of name to attract viewers
bullshit ...I don't know which platform give these people to open their mouth even they don't have clear knowledge this shows the quality of Indian IT managers and Leaders
Completely agree, Kafka is a nice technology, but this person doesn’t seem to have any idea about enterprise architecture or the problems ETL tries to solve...
It's a shame that someone who knows just one tiny topic thinks she knows how it works and applies to everything. On the final diagram there is an icon of a DWH; I wonder how she explains how that DWH gets populated without ETL. Oh... she probably thinks it is readymade and available for her to stream from. lol.
You would subscribe to multiple topics and process those messages using the Streams API, which can potentially do the job.
I vant to try her curry
Crap talks. ETL is a concept that is always there.
She doesn't deny it in the talk actually. She says that the batching ETL is dead / outdated, and now the streaming ETL is a way to go. Though I agree that part of the title is a bit misleading.
Shitty comment. U dont even understand what she is talking about.
People always have a tendency to resist change, as a result, they don't listen carefully
She mentioned ETL tools, not the concept
Shankar K the title says ETL is dead, she is dumb as a rock....
click bait title... kafka is great but this talk is a disaster
I think you are HR rather Technical ...from you scrap it is clear that you don't know both hadoop and ETL
Think again.
She is clueless i am shocked she is even allowed to talk at a summit
This talk is a disaster
I wish this presentation was given by some techie guy.
Darshan Sangodkar really? I’d like to actually hear your tech talk some day. Do pick a deeply technical topic please. And an original one while you’re at it. If you struggle with that though, drop me a word. Happy to share some tips.
@@tinameh I guess many are not aware that she was amongst ppl who built Kafka, I have seen her other talks and I found those enlightening and also built a unicorn startup.
Not worth......
U r relly pretty
Rubbish material, holy cow letter from india
Lol,
A silica nigre....
😉
Your mum should have advice to you for dont play with the Hammer...
Yes,
Tyrannosaurus burgers were greeeeeeat.
BullSsssssssst