Anirvan Decodes
India
Joined Jan 14, 2022
Hi All,
I am Anirvan Sen, and I love to talk about SQL, Python, Apache Spark, and Data Engineering.
My passion is teaching complex software engineering topics in a very simplified way.
That is why the channel is called "Anirvan Decodes": we decode complex topics and make them simple.
I am starting this YouTube channel to build a community where we learn from each other and grow.
Join me to learn different software engineering skills in a fun and simple way.
See you in my videos!
Handling Late Arriving Data in Spark Structured Streaming with Watermarks
Handling Late Arriving Data in Spark Structured Streaming with Watermarks
In this video, we explore what late-arriving data is, how Spark Structured Streaming tracks it in the state store (including the RocksDB backend), and how to handle late data using watermarks.
📽️ Chapters to Explore
0:00 Introduction
0:13 What is late data
3:29 State store
5:36 RocksDB state store
5:59 Handle late data using watermarks
💻 Code is available in GitHub: github.com/anirvandecodes/Spark-Structured-Streaming-with-Kafka
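A minimal sketch of the watermark idea from the chapters above (not the repo's exact code; it assumes a streaming DataFrame `events` with an `event_time` timestamp column):
```python
from pyspark.sql import functions as F

# Tolerate events up to 10 minutes late; anything older is dropped,
# and the matching window state can be evicted from the state store.
late_tolerant = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"))
          .count()
)
```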
🌟 Stay Connected and Level Up Your Data Engineering Skills!
🔔 Subscribe Now: www.youtube.com/@anirvandecodes?sub_confirmation=1
🤝 Let's Connect: www.linkedin.com/in/anirvandecodes/
🎥 Explore Playlists Designed for You:
🚀 Spark Structured Streaming with Kafka: th-cam.com/play/PLGCTB_rNVNUNbuEY4kW6lf9El8B2yiWEo.html
🛠️ DBT (Data Build Tool): th-cam.com/play/PLGCTB_rNVNUON4dyWb626R4-zrLtYfVLa.html
🌐 Apache Spark for Everyone: th-cam.com/play/PLGCTB_rNVNUOigzmGI6zN3tzveEqMSIe0.html
📌 Love the content? Show your support! Like, share, and subscribe to keep learning and growing in data engineering. 🚀
Song: Dawn
License: Creative Commons (CC BY 3.0) creativecommons.org/licenses/by/3.0
open.spotify.com/artist/5ZVHXQZAIn9WJXvy6qn9K0
Music powered by BreakingCopyright: breakingcopyright.com
Views: 19
Videos
Streaming Aggregates in Spark : Tumbling vs Sliding Windows with Kafka
25 views · 14 days ago
Streaming Aggregates in Spark - Tumbling vs Sliding Windows with Kafka
In this video, we break down the concept of streaming aggregates in Spark Structured Streaming and explain the difference between tumbling and sliding windows. Using Kafka as the data source, we demonstrate how to effectively process and aggregate real-time data. 📽️ Chapters to Explore 0:00 Introduction 0:20 Use Case for str...
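A rough sketch of the two window types (assuming a streaming DataFrame `events` with an `event_time` column, not the video's exact code):
```python
from pyspark.sql import functions as F

# Tumbling: fixed 10-minute buckets that never overlap.
tumbling = events.groupBy(F.window("event_time", "10 minutes")).count()

# Sliding: 10-minute windows starting every 5 minutes, so each event
# can land in two overlapping windows.
sliding = events.groupBy(F.window("event_time", "10 minutes", "5 minutes")).count()
```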
Spark Structured Streaming Sinks and foreachBatch
34 views · 14 days ago
Spark Structured Streaming Sinks and foreachBatch Explained
This video explores the different sinks available in Spark Structured Streaming and how to use the powerful foreachBatch sink for custom processing. 📽️ Chapters to Explore 0:00 Introduction 0:25 Types of sinks 0:50 Memory Sink 2:08 Kafka Sink 4:00 Delta Sink 4:30 toTable does not support update mode 4:53 How to use foreachBatch 💻 Code ...
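A minimal foreachBatch sketch (table and checkpoint names are made up, `stream_df` is an assumed streaming DataFrame):
```python
def write_batch(batch_df, batch_id):
    # Inside foreachBatch, each micro-batch is a plain DataFrame,
    # so any batch writer (JDBC, Delta MERGE, multiple sinks) works here.
    batch_df.write.format("delta").mode("append").saveAsTable("events_bronze")

query = (
    stream_df.writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "/tmp/checkpoints/foreach-demo")
             .start()
)
```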
Spark Structured Streaming Output Modes | Append | Update | Complete
28 views · 14 days ago
Spark Structured Streaming Output Modes Explained
In this video, we explore the output modes in Spark Structured Streaming, a crucial concept for controlling how processed data is written to sinks. Understanding these modes helps in designing efficient and accurate streaming pipelines. 📽️ Chapters to Explore 0:00 Introduction 0:45 Complete Mode 2:30 Update Mode 4:24 Append Mode 5:32 How to use ...
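The gist of the three modes in code — a sketch assuming an aggregated streaming DataFrame `counts`:
```python
# append   : emit only newly finalized rows
# update   : emit rows changed since the last trigger
# complete : emit the entire aggregate result every trigger
query = (
    counts.writeStream
          .outputMode("update")
          .format("console")
          .start()
)
```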
Spark Structured Streaming Checkpoint
28 views · 14 days ago
Understanding Spark Structured Streaming Checkpoints
In this video, we dive deep into checkpoints in Spark Structured Streaming and their critical role in ensuring fault-tolerant and stateful stream processing. 📽️ Chapters to Explore 0:00 Introduction 0:20 Why Checkpoint is required? 1:50 How to define checkpoint 2:15 Content of a checkpoint folder 4:26 Kafka offset information in checkpoint 5:...
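A minimal sketch of wiring a checkpoint into a streaming write (the path and table name are hypothetical):
```python
query = (
    stream_df.writeStream
             # Offsets and operator state are tracked under this path,
             # so the query can resume exactly where it left off.
             .option("checkpointLocation", "/tmp/checkpoints/orders")
             .toTable("orders_bronze")
)
```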
Spark Structured Streaming Trigger Types
45 views · 14 days ago
In this video, we dive into Spark Structured Streaming trigger modes, a key feature for managing how your streaming queries process data. Whether you're working with real-time data pipelines, ETL jobs, or low-latency applications, understanding trigger modes is essential to optimize your Spark jobs. 📽️ Chapters to Explore 0:00 Introduction 0:40 Why Do We Need Trigger Types? 1:50 Default Trigger ...
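A sketch of the common trigger options (`stream_df` is assumed; only one trigger can be active, alternatives are commented out):
```python
query = (
    stream_df.writeStream
             .format("console")
             .trigger(processingTime="30 seconds")  # micro-batch every 30 seconds
             # .trigger(availableNow=True)          # drain available data, then stop
             # .trigger(once=True)                  # a single micro-batch (legacy)
             .start()
)
```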
Spark Structured Streaming Introduction
61 views · 21 days ago
Welcome to this introduction to Spark Structured Streaming! In this video, we’ll break down the basics of Spark Structured Streaming and explain why it’s one of the most powerful tools for real-time data processing. Song: Dawn License: Creative Commons (CC BY 3.0) creativecommons.org/licenses/by/3.0 open.spotify.com/artist/5ZVHXQZAIn9WJXvy6qn9K0 Music powered by BreakingCopyright: breakingcopyr...
Databricks Setup for Spark Structured Streaming
56 views · 21 days ago
In this tutorial, we’ll guide you through setting up Databricks for Spark Structured Streaming, enabling you to start building and running real-time streaming applications with ease. Databricks offers a powerful platform for big data processing, and Spark Structured Streaming makes it easy to process streaming data with Spark’s DataFrame API. By the end of this video, you’ll have your environme...
Kafka Consumer Tutorial - Complete Guide with Code Example
541 views · 21 days ago
In this in-depth Kafka Consumer tutorial, we’ll walk through everything you need to know to start building and configuring Kafka consumer applications. From understanding core concepts to exploring detailed configurations and implementing code, this video is your one-stop guide to Kafka consumers. Here's what you'll learn: Kafka Consumer Basics: Get an overview of Kafka consumers, how they work...
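A minimal confluent-kafka consumer sketch (broker, group, and topic names are placeholders):
```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",  # start from the beginning if no committed offset
})
consumer.subscribe(["demo-topic"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            print(msg.error())
            continue
        print(msg.key(), msg.value())
finally:
    consumer.close()
```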
Kafka Producer Tutorial - Complete Guide with Code Example
44 views · 21 days ago
Welcome to this comprehensive Kafka Producer tutorial! In this video, we’ll dive deep into the fundamentals of Kafka producers and cover everything you need to know to get started building your own producer applications. Here's what we'll cover: Kafka Producer Basics: Learn what a Kafka producer is and how it fits into the Kafka ecosystem. Producer Workflow: Understand the steps for sending mes...
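A minimal producer sketch in the same spirit (broker and topic are placeholders):
```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Invoked asynchronously once the broker acks (or rejects) the message.
    if err:
        print("delivery failed:", err)
    else:
        print("delivered to", msg.topic())

producer.produce("demo-topic", key="user-1", value="hello", callback=on_delivery)
producer.flush()  # block until all queued messages are delivered
```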
Setting up Confluent Cloud for Kafka and walkthrough
124 views · 1 month ago
☁️ Setting Up Confluent Cloud for Kafka | Spark Structured Streaming Series ☁️ In this video, we’re walking through the steps to set up Confluent Cloud for a seamless Kafka experience! Confluent Cloud offers a fully managed Kafka service, making it easier than ever to get started with real-time streaming without the hassle of self-managing Kafka infrastructure. Join me as we cover everything yo...
Kafka Fundamentals Part-2
35 views · 1 month ago
In this video, we’ll dive into the essential roles of Kafka Producers and Consumers, the backbone of any Kafka-powered streaming application. Whether you're just starting with Kafka or brushing up on streaming concepts, this session will break down how data is sent to and retrieved from Kafka, making real-time streaming possible. What We’ll Cover: Kafka Producers: Learn how Kafka Producers send ...
Kafka Fundamentals Part-1
90 views · 1 month ago
🦅 Bird's Eye View of Kafka Components | Spark Structured Streaming Series 🦅 In this video, we’ll take a high-level look at the critical components that make Apache Kafka a robust and reliable platform for real-time data streaming. Whether you're new to Kafka or want to solidify your understanding of its architecture, this video will provide a clear overview of Kafka’s inner workings and how eac...
Understanding Apache Kafka for Real-Time Streaming
96 views · 1 month ago
🎬 Understanding Apache Kafka for Real-Time Streaming | Spark Structured Streaming Series 🎬 In this video, we’ll explore the basics of Apache Kafka to understand why it’s become the go-to solution for real-time data streaming. Whether you’re new to streaming or looking to expand your data engineering skills, this session will introduce the core concepts of Kafka and how it powers modern streamin...
Spark Structured Streaming with Kafka playlist launch
167 views · 1 month ago
🔥 Welcome to my new YouTube series: “Spark Structured Streaming with Kafka!” 🔥 In this series, we’re diving deep into the powerful combination of Apache Kafka and Spark Structured Streaming to master real-time data processing. 🚀 Get ready to learn all about building scalable, fault-tolerant streaming applications for real-world scenarios like financial transactions, fraud detection, and more! Wh...
DBT Tutorial: Snapshot - SCD type 2 in DBT
441 views · 3 months ago
DBT Tutorial: DBT Tests | Generic and Singular Tests
973 views · 9 months ago
DBT Tutorial: How to generate automatic documentation in DBT
585 views · 9 months ago
DBT Tutorial: How to use Target | Deploy project to different environment
359 views · 9 months ago
DBT Tutorial: Project and Environment variables
1.1K views · 9 months ago
DBT Tutorial: Incremental Model | Updates, Appends, Merge
4.1K views · 9 months ago
DBT Tutorial: How to structure a DBT project
631 views · 10 months ago
DBT Tutorial: How does DBT run your query?
656 views · 10 months ago
DBT Tutorial: Everything you need to know about Sources and Models
886 views · 10 months ago
DBT Tutorial: Setting up your first dbt project
1.2K views · 10 months ago
Spark Skewed Data Problem: How to Fix it Like a Pro
1.1K views · 1 year ago
I need a complete video on PySpark.
Can you create a project video where IoT device data is processed using Kafka streams in real time? That would be great. Thanks in advance 😊
Thanks for the idea! Will try to create some project videos.
I have a query that you might be able to solve. I am saving data from Kafka to TimescaleDB, but for each offset's messages I need to query the DB to get the userId associated with the IoT sensor. So one query is executed per offset processed, causing a max-connections error. Any solution for that? (For now I added Redis + connection pooling, but I don't think it will solve it for the long term.) 2. As the single table grows to 30-40 GB of data, inserts get slower in TimescaleDB. What should we do to make them fast?
Thanks in advance
You should try to batch your queries (the goal is to minimize DB calls), or you can copy the data from the DB to Databricks. Check out this one: th-cam.com/video/n0RS7DB_y9s/w-d-xo.html
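A rough sketch of the batching idea with a psycopg2-style connection (all table and column names are hypothetical): resolve every sensor id in the batch with one query instead of one query per message.
```python
def resolve_user_ids(conn, messages):
    # One round trip for the whole batch instead of one per message.
    sensor_ids = list({m["sensor_id"] for m in messages})
    with conn.cursor() as cur:
        cur.execute(
            "SELECT sensor_id, user_id FROM sensor_owner WHERE sensor_id = ANY(%s)",
            (sensor_ids,),
        )
        return dict(cur.fetchall())  # {sensor_id: user_id}

# conn = psycopg2.connect(...)  # assumed existing TimescaleDB connection
```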
Just subscribed to your channel.
Thank you! Please share the playlist with your LinkedIn network; it will help this channel grow.
Nice video bro
Thank you so much
It's short and sweet and very descriptive. I installed as per the video, but I encountered an error: Error from git --help: Could not find command, ensure it is in the user's PATH and that the user has permissions to run it: "git". Please let me know how to resolve this error.
Thank you. The git path is not set properly; check out this one: th-cam.com/video/lt9oDAvpG4I/w-d-xo.html. Please share the playlist with your network; it will help this channel grow.
@@anirvandecodes Thank you so much
Great content, thank you for sharing! Special respect for the GitHub link.
Thank you! Please share the playlist with your LinkedIn network so that it reaches a wider audience.
Hi, how do I set up Kafka? Is there any video on this?
Yes, check out this video to set up Kafka on Confluent Cloud: th-cam.com/video/miN4WLiJnRE/w-d-xo.html Playlist link: th-cam.com/play/PLGCTB_rNVNUNbuEY4kW6lf9El8B2yiWEo.html
Hey, great series, thanks. How can I make the producer produce faster?
There are a few configuration changes you can make. Batching: set linger.ms (5-50 ms) and increase batch.size (32-128 KB). Serialization: opt for efficient formats like Avro or Protobuf. Partitions & proximity: add more partitions and deploy near the Kafka brokers. In production, people generally use more scalable solutions than a plain Python producer app; check out this: docs.confluent.io/platform/current/connect/index.html Do share the playlist with your LinkedIn community.
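Those batching settings as a confluent-kafka config sketch (illustrative values only; tune against your own throughput tests):
```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "linger.ms": 20,                        # wait up to 20 ms so batches can fill
    "batch.size": 65536,                    # allow 64 KB batches
    "compression.type": "lz4",              # smaller payloads, fewer round trips
})
```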
14 completed ❤
You are making great progress! Please share with your friends and colleagues.
13 completed ❤
12 completed ❤
11 completed ❤
10 completed ❤
9 completed ❤
8 completed ❤
6 completed ❤
5 completed ❤
4 completed ❤
3 completed ❤
3 completed ❤
2 completed ❤
1 completed ❤
I have copied the yml file into the staging and marts folders, and I am getting a conflict asking me to rename the yml sources. How do we effectively define sources in the models?
Can you share the complete error text and project structure?
@@anirvandecodes So in your video you pasted the yml file containing sources into all 3 folders. Since the source is the same for all 3 files, I kept the model sql files inside the folders and moved the yml file outside, which resolved the error. I believe with the new dbt version you cannot have two yml files referencing the same source table at the same folder level. Currently my folder structure looks like:
models
- staging
- - staging_employee_details.sql
- intermideate
- - intermideate_employee_details.sql
- marts
- - marts_employee_details.sql
- employee_source.yml
In the video you paste the yml file into each of the 3 folders (staging, intermideate, marts), which gives a naming_conflict_error. Your videos have been very informative; I went through the whole playlist while struggling to install dbt on my system and understand it. Thank you so much! 😄😄
@ I think you might have the same source name mentioned in two places; take a look into that.
@@anirvandecodes dbt found two sources with the name "employee_source_EMPLOYEE". Since these resources have the same name, dbt will be unable to find the correct resource when looking for source("employee_source", "EMPLOYEE"). To fix this, change the name of one of these resources: - source.dbt_complete_project.employee_source.EMPLOYEE (models\marts\marts_employee_source.yml) - source.dbt_complete_project.employee_source.EMPLOYEE (models\staging\stg_employee_source.yml)
Should the source name always be unique?
Bro, what if I don't want to share my data with Confluent? Can we do the Confluent Kafka setup on premises?
Absolutely. They call it self-managed Kafka; check this out: www.confluent.io/get-started/?product=self-managed
Best dbt playlist, man! I searched a lot throughout YouTube; no one comes close in clarity of explanation!
Made my day, thank you! Do share with your network.
@@anirvandecodes you deserve it man!
How do I see the column lineage?
dbt Core does not have any out-of-the-box column-level lineage. You can explore column lineage in dbt Cloud or check out this one: tobikodata.com/column_level_lineage_for_dbt.html
Hi @anirvan, thanks for your detailed explanation of dbt concepts, which has helped me a lot.
Glad to hear that! Please share the content with your network.
Completed the tutorials, I loved them. Please create more tutorial playlists on more topics.
Thank you for the support. Yes, I will be publishing content on Spark Structured Streaming with Kafka.
Loved your video; it cleared my doubts about sources and models and how we create sources.
Glad it was helpful! Do share with your network.
Hi, I ran dbt debug from the command prompt and it worked well. When running from PyCharm I get an error: The term 'dbt' is not recognized as the name of a cmdlet, function, script file
Looks like this is some PyCharm path-related issue. Check whether the PATH is set properly in PyCharm, or select a different terminal such as Git Bash; you can get more info on Google.
This error generally comes when the path is not added in the system. Try Stack Overflow or ChatGPT, and you can also try Git Bash.
Hey my man, just wanna say thanks for this whole series you did. Extremely helpful to people who are specifically looking for guidance in this new tool. Really appreciate your hard work man.
Thank you so much, it really made my day :)
Hello, I have a question: dbt doesn't recognize my model as incremental. I am using incremental modeling to take a snapshot of a table's row count and insert it to build a time-series table containing the row count for every day.
I will upload one video on snapshots soon; check that out!
How do I achieve incremental inserts in dbt without allowing duplicates, based on specific columns?
You can apply DISTINCT in SQL to remove the duplicates, or use any other deduplication strategy.
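A hedged dbt sketch of that idea (model, source, and column names are made up):
```sql
{{ config(materialized='incremental', unique_key='order_id') }}

-- distinct removes duplicates inside the batch; unique_key makes dbt
-- merge on order_id instead of appending a second copy.
select distinct order_id, customer_id, amount, updated_at
from {{ source('shop', 'orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```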
Please teach Snowflake-dbt integration and how dbt works for the entire SCD type 2 process. Thank you.
Sure, will create one video on it.
Man, amazing work. Can't wait....Subscribed! Do keep the videos coming, please?
Thanks! Will do!
I was looking for this and you are like a saviour. Thanks
Glad I could help
Hiii
hello
Are you teaching dbt?
Yes, I have a complete dbt playlist here: th-cam.com/play/PLGCTB_rNVNUON4dyWb626R4-zrLtYfVLa.html
Can you please share the dbt models as well?
Sorry, I lost the model file.
Awesome tutorials, keep the good work going! When can we expect tutorials on other tools like Airflow, Airbyte, etc.?
Thank you so much. I have two more dbt videos to complete the playlist; will plan after that.
Nice explanation
Keep watching
Very nicely explained.
Thank you so much 🙂
How to deploy code from DEV to QA to PRD? Please make a video on this. Thank you!
Yes, I am in the process of making a video on how to deploy a dbt project on the cloud. Stay tuned!
Hey Anirvan, thanks for clearly explaining. I am currently learning dbt and came across this question: can we keep multiple where conditions in an incremental load?
Yes, definitely. Think of it as a SQL query with which you are filtering the data.
@@anirvandecodes Hello Anirvan, any code snippet or format suggestion from your end?
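Not the author's own snippet, but a minimal illustration (all names invented) of stacking several conditions in the incremental branch:
```sql
{{ config(materialized='incremental') }}

select *
from {{ source('shop', 'events') }}
where event_type is not null            -- always applied
{% if is_incremental() %}
  and loaded_at > (select max(loaded_at) from {{ this }})  -- only new rows
  and region = 'EU'                     -- any extra filter can be stacked here
{% endif %}
```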
cool thanks
No problem!
Hello, I have a question: how do I do insert, update, and delete based on a column other than a date? I am loading from Excel into Postgres and generating a hashed column; every time there is a new or updated record, a new hash key is generated for that column. I am trying to do an incremental update. Here is the select statement: SELECT id, "name", alternate_name, description, email, website, tax_status, tax_id, year_incorporated, legal_status, logo, uri, parent_organization_id, hashed_value FROM organization; My CDC is based on hashed_value. Let's say the name is changed in Excel; when I load the data into Postgres I get a new hash key in the hashed_value column, and similarly for a new record. How do I do my incremental load? Any suggestion?
That concept is called change data capture. You first need to find the records that have changed. There are different techniques; in SQL you can do SELECT * FROM table1 EXCEPT SELECT * FROM table2 to find which rows have changed, then insert only those records.
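Applied to the hashed-value setup above, the EXCEPT approach might look like this (the staging table name is assumed):
```sql
-- Rows whose (id, hashed_value) pair is new or changed since the last load.
select id, hashed_value
from staging_organization
except
select id, hashed_value
from organization;
```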
It’s not 1.29 seconds. we have to also include other steps right like when we add salt key tf to both dataframe. That should be considered the full time
Hi, in this technique we are not adding salt to both dataframes, which is not needed. We add salt to one dataframe and explode the other dataframe to do the join at the end.
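A compact sketch of that exact pattern (bucket count, DataFrame names, and join key are illustrative):
```python
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # pick based on how skewed the key is

# Salt only the large, skewed side with a random bucket id...
large_salted = large_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# ...and explode the small side so every bucket still finds its match.
small_exploded = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = large_salted.join(small_exploded, on=["join_key", "salt"])
```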
Thank you for the video, very helpful!
Thank you for watching; I am glad it helped you understand the concept.
So we can use:
'table' -- for truncate and load
'incremental' -- for append or insert
'incremental' with 'unique_key' -- for upsert or merge
Is the above right?
Yes, that is correct. Keep watching!
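For reference, the merge case from that list as a minimal dbt sketch (the key and ref names are hypothetical):
```sql
-- 'incremental' + unique_key => dbt merges (upserts) on that key.
{{ config(materialized='incremental', unique_key='id') }}

select id, status, updated_at
from {{ ref('stg_orders') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```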