Thank you for this detailed, simple explanation. You are a godsend.
Glad that it was useful for you
Please Subscribe and share. Happy Learning 👍
Thanks a million for the detailed explanation. Please keep doing the great stuff.
Thank you, keep supporting. Happy learning
Very precise explanation, thank you sir😊
Thank you Prashant
Please Subscribe and share
Happy Learning 😊
One of the best SCD videos 🙂
Thank you, happy learning. Please Subscribe and share
Very useful... please bring more and more such content.
Thank you for the support
We use full load in the staging layer, right?
Generally yes
But is that what we call truncate and load?
@junedkhan-ge5mi when we do a full load it is usually truncate and load, unless we want to keep the existing data for some specific case
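Just as a rough illustration (the table names below are made up, not from any real project), a full load of a staging table usually comes down to two steps:

# Hypothetical full load of a staging table: wipe it, then reload everything.
full_load_steps = [
    "TRUNCATE TABLE stg_orders",                        # or DELETE FROM on engines without TRUNCATE
    "INSERT INTO stg_orders SELECT * FROM src_orders",
]

def run_full_load(conn):
    # 'conn' is assumed to expose execute() directly (sqlite3-style); adapt for your driver.
    for sql in full_load_steps:
        conn.execute(sql)
    conn.commit()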
Great explanation. Thank you
Thank you for your support. Please subscribe and Happy Learning 😊🎉
Hello, the interviewer asked what kind of transformations you have done in your project and whether you can explain them. What are the transformations we can apply in a project?
It clearly depends on the project and the domain, but generally transformations are applied to clean the data and improve data quality. Beyond that it depends on whether it is ETL or ELT, but in both cases transformations are applied by joining multiple tables and selecting only the columns required for the final report.
If you have any further questions, please feel free to mail yt.the.data.channel@gmail.com
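Just as a rough illustration of what such a cleanup-plus-join transformation can look like (all table and column names here are hypothetical, not from any specific project):

# Hypothetical transformation: join two source tables, clean values,
# and keep only the columns the final report needs.
transform_sql = """
SELECT
    c.customer_id,
    TRIM(UPPER(c.customer_name)) AS customer_name,   -- basic cleanup
    COALESCE(o.order_amount, 0)  AS order_amount,    -- handle NULLs
    o.order_date
FROM customers c
JOIN orders    o ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'                   -- row-level filter
"""

def run_transform(conn):
    # 'conn' is assumed to expose execute() directly (sqlite3-style).
    return conn.execute(transform_sql).fetchall()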
What if records are deleted from the source permanently and the same records need to be deleted from the target table? How can we do this? Please explain this as well. Thank you.
Good question. If there is a delete scenario, the logic would be a little different. If a record is deleted upstream, we can either delete the record downstream as well, or keep it with an additional column like delete_indicator and flag it as Y when the record is deleted upstream.
OK, but can you give me the logic for how to delete the records from the target table? And is it possible to implement an incremental load without using any variables or parameters in PC? This is an interview question. Thank you in advance @thedatachannel878
This answered a lot of questions I had about delta loads. Thank you
Thank you, please Subscribe and Happy Learning 👍
@thedatachannel878 what is a preprocessor while doing ingestion?
Preprocessing here means, for example, that you want to ingest only 100 records out of 1,000 based on a certain WHERE condition.
Hope this is clear; if not, please let me know, happy to explain in more detail if needed
@thedatachannel878 actually my team and I used to do daily prod ingestion. In KT one of the seniors explained that first you need to run the preprocessor for all tables, then you can start the incremental. We used to activate the preprocessor in the watcher; if the preprocessor completed, then we could start the incremental load for that table.
@vanshchauhan3910 preprocessing simply means you can apply row-level filters in the WHERE clause or column-level filters in the SELECT clause while selecting columns, or sometimes it can even mean joining two or more tables. But the important point to note when we say preprocess in ETL is that these filters are applied before the ingestion actually happens, and the compute for this is on your source system.
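A minimal sketch of the idea, assuming a sqlite3-style connection to the source system (table and column names are made up for illustration). The point is that the filters run on the source before anything is pulled across:

# Hypothetical preprocessing: the row and column filters are evaluated
# by the source system, so only the reduced result set is ingested.
preprocess_sql = """
SELECT order_id, customer_id, order_amount   -- column-level filter
FROM   orders
WHERE  order_status = 'COMPLETED'            -- row-level filter
  AND  order_date >= :load_date
"""

def extract_with_preprocessing(source_conn, load_date):
    return source_conn.execute(preprocess_sql, {"load_date": load_date}).fetchall()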
Super explanation, thanks bro
Thank you, please subscribe and share with your friends.
Happy Learning 💐
Sir, in a typical medium-sized project like a telecom domain project, at what volume is data loaded and how many records are generally loaded from source to target?
It completely depends from project to project, but I can say somewhere from 1 million to 10 million records on average
@thedatachannel878 Thank you sir
Very well explained, Thanks.
Thank you Sridhar
Hello, how do we validate incremental loading? The interviewer asked me this question, can you explain it?
There might be multiple ways of doing it. One of them would be to write a script/program which takes the count from the source and the count from the target system, then performs a count validation. You can also take some KPIs, like the sum of an amount column or the average of some numeric columns, from both source and target and compare those KPIs; if they match, your incremental load is validated to be working as expected.
Hope this helps
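A rough sketch of such a validation script, assuming sqlite3-style connections to both systems (table and column names are placeholders):

# Hypothetical validation: compare row count and simple KPIs
# (sum / average of a numeric column) between source and target.
def validate_incremental(source_conn, target_conn, table, amount_col):
    checks = {
        "row_count":  f"SELECT COUNT(*) FROM {table}",
        "amount_sum": f"SELECT SUM({amount_col}) FROM {table}",
        "amount_avg": f"SELECT AVG({amount_col}) FROM {table}",
    }
    results = {}
    for name, sql in checks.items():
        src = source_conn.execute(sql).fetchone()[0]
        tgt = target_conn.execute(sql).fetchone()[0]
        results[name] = {"source": src, "target": tgt, "match": src == tgt}
    return results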
Great learning material!
Thank you! Cheers!
Informative 🙌
Thank you...keep learning...Happy Learning.
very well explained
Thank you....keep supporting and keep learning...😊 Happy Learning 👍
Awesome explanation! Could you please also explain merge concepts?
Sure, watch this space, we will soon explain merge.
Happy Learning 🥳
Great explanation ❤
Thank you, please Subscribe and share with your friends.
Happy Learning 👍
it was really helpful... thank you
Glad to know it is helpful.. Kindly SUBSCRIBE and that will motivate us to bring more quality content like this...Thank you ...!
What if some records get deleted in the source table, how will that be handled in the incremental load?
Usually this process is only for inserts or updates. If old records get deleted and that's the scenario, I would recommend running a separate scheduled housekeeping job which checks all the records in the target, sees if they still exist in the source, and if not, deletes them from the target
You can choose to schedule this daily or hourly based on how frequent or critical the deletes are.
Hope that helps, Happy Learning 👍
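A rough sketch of such a housekeeping job, assuming sqlite3-style connections and illustrative table/column names:

# Hypothetical housekeeping job, run on a schedule.
def housekeeping_delete(source_conn, target_conn, table, key_col):
    # Keys still present in the source vs. keys currently in the target.
    src_keys = {row[0] for row in source_conn.execute(f"SELECT {key_col} FROM {table}")}
    tgt_keys = {row[0] for row in target_conn.execute(f"SELECT {key_col} FROM {table}")}
    # Anything in the target that is no longer in the source was deleted upstream.
    for key in sorted(tgt_keys - src_keys):
        target_conn.execute(f"DELETE FROM {table} WHERE {key_col} = ?", (key,))
    target_conn.commit()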
Well Explained 🙂
Thank you John
Could you post a practical code snippet for this example, please? Thanks in advance!
Sure, will bring up more practical examples
How can we use a control file instead of a control table?
A file can be used as the control instead of a table; however, it is highly recommended not to use a file-based control as it leads to a huge performance impact, especially when the file is big.
Even so, if there is still a need to use a file-based control, the recommended file format is JSON as it is easy to read and update and has performance benefits as well
@thedatachannel878 thank you for the response. Do you have any blog or reference for this?
@thedatachannel878 most data is moving to the cloud these days and I don't see a performance hit for a control file. Since the same row gets hit every day, growth of the data would be small. Do you see any other performance cost in this case? Much appreciate your response
Yes, it will still have a performance issue, and when I say that I mean the performance cost, not the performance time. Also it is not an efficient way of doing it, as every time your process has to do I/O operations on the file
But still, if you want to do it, just read the JSON control file from Python as a dictionary and write it back as a dictionary
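A minimal sketch of that read/update cycle (the file name and keys are made up for illustration):

import json

CONTROL_FILE = "control.json"   # hypothetical location of the control file

def read_control():
    # Load the whole control file into a plain dictionary.
    with open(CONTROL_FILE) as f:
        return json.load(f)

def update_control(table, last_loaded_ts):
    # Update the watermark for one table and write the dictionary back.
    control = read_control()
    control[table] = {"last_loaded": last_loaded_ts}
    with open(CONTROL_FILE, "w") as f:
        json.dump(control, f, indent=2)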
Good explanation
Thank you, please subscribe and share,
Happy Learning 🎉
What about if data gets deleted, how do we handle that in an incremental load?
One way is to have an additional column called active_indicator, which is Y if the record is active and which you set to N if the record gets deleted in the source; alternatively, you can delete it in the target as well. However, this is very rare in the case of incremental loads, as incrementals are mostly transactional in nature and generally don't get updated
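A rough sketch of the soft-delete variant, assuming the source keys are visible to the same connection as the target (all table and column names are illustrative only):

# Hypothetical soft delete: flag target records whose key has
# disappeared from the source, instead of physically deleting them.
soft_delete_sql = """
UPDATE target_orders
SET    active_indicator = 'N'
WHERE  active_indicator = 'Y'
  AND  order_id NOT IN (SELECT order_id FROM source_orders)
"""

def apply_soft_deletes(conn):
    # 'conn' is assumed to expose execute() directly (sqlite3-style).
    conn.execute(soft_delete_sql)
    conn.commit()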
Please explain an end-to-end data pipeline for data engineering
Sure Vinod, we will upload more such videos, please keep Following us
Thank you for good content
Thank you Marco ❤️
Thank you so much sir
You are welcome thank you for the support, Happy Learning
Welcome, Happy Learning 👍
Where is the lady