DP-203: 09 - Data lake structure - Raw layer

  • Published on Feb 2, 2025

Comments • 35

  • @christianraouldjatio1738 · 1 year ago +16

    Hello,
    I am preparing for the DP-203 and your channel is simply magical.
    You explain complex concepts very simply. I really like your method with the whiteboard and the hand drawings.
    Thank you very much for the quality content on your channel.
    I know there's a lot of preparation work behind the final video.😊🙏

  • @TheGutxD · 12 days ago

    I liked that class so much! I'm happy to review and learn some concepts and understand real-time scenarios.
    It's so important. Thank you!

    • @TybulOnAzure · 8 days ago

      My pleasure, I'm glad that those videos were useful.

  • @prabhuraghupathi9131 · 11 months ago

    Great content on how to structure/organize our data in the raw layer!!

  • @HamzaBray · 1 month ago

    Thank you, now I understand the landing zone, raw zone, etc. I had used them before, but now I finally get how and why they are there!

    • @TybulOnAzure · 1 month ago

      Glad it was useful!

  • @EmilianoEmanuelSosa · 1 month ago +1

    Hi Tybul! Question: when you said that the bronze layer is immutable, does that rule out the tables having some kind of incremental update? For example, I had to extract from an on-premises database via CDC and schedule the whole Azure CDC (preview) resource for it. Would that be wrong? In addition to that process, I only update the update date of the bronze tables in that layer.

    • @TybulOnAzure · 1 month ago

      Hi, the approach depends on how you manage incremental updates. If you ingest only the changed data (the increments) without any immediate processing, it should go into the raw, immutable layer. From there, you would process the data to apply the changes - such as insert, update, or delete - and store the result in another layer. This resulting layer would essentially be a 1:1 copy of your source data, stored in Delta Lake format, assuming you are implementing SCD Type 1 and do not need to maintain historical records.
      Alternatively, if your ingestion process directly updates the target data, it would be better to store the data directly in the next layer since it is being continuously updated.
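
      To make the flow above concrete, here is a minimal PySpark sketch of applying one raw increment to the next layer as SCD Type 1 in Delta Lake format; the storage paths, the order_id key and the operation column are assumptions for illustration only:

      from delta.tables import DeltaTable
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Read one day's increment exactly as it was landed in the immutable raw layer.
      increment = spark.read.parquet(
          "abfss://raw@mydatalake.dfs.core.windows.net/sales/orders/2025/02/02/")

      # Merge it into the 1:1 Delta copy in the next layer (SCD Type 1, so no history is kept).
      target = DeltaTable.forPath(
          spark, "abfss://base@mydatalake.dfs.core.windows.net/sales/orders")

      (target.alias("t")
          .merge(increment.alias("s"), "t.order_id = s.order_id")
          .whenMatchedDelete(condition="s.operation = 'DELETE'")         # CDC deletes
          .whenMatchedUpdateAll(condition="s.operation <> 'DELETE'")     # CDC updates
          .whenNotMatchedInsertAll(condition="s.operation <> 'DELETE'")  # CDC inserts
          .execute())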

  • @mdpdurawix1834 · 5 months ago

    Hi Piotr,
    It would be nice to see a video about how to verify the quality of the data in the different layers before it reaches the end user.
    Great video as always, please keep it up!

    • @TybulOnAzure · 5 months ago +1

      As a "Data Engineer" member of my channel, you’ll have the special privilege of suggesting topics for new videos and voting on them. If you have a topic in mind, I’d love for you to join as a member. I’ll be setting up the first poll once I complete the DP-203 course.

  • @listen_learn_earn · 6 months ago

    Hi Tybul,
    The content you are delivering is awesome! Could you also please make a video on data partitioning, its types, and its implementation?

    • @TybulOnAzure · 5 months ago

      What do you have in mind?

    • @listen_learn_earn · 5 months ago

      You explain complex concepts in a nice way, so I thought it would be great to hear about the partitioning concept from you, because I found it somewhat confusing when I started learning it by myself.

  • @qaz56q · 8 months ago

    15:58 You mention that we can process all data from scratch. Is it also possible to easily process data from a certain point? For example, all data from the last 2 weeks.

    • @TybulOnAzure · 8 months ago +1

      It is possible to process only a subset of the data - I mention this in the "Dynamic ADF" episode.
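
      As a rough illustration of processing only a recent slice, assuming the raw layer is partitioned by ingestion date in a yyyy/MM/dd folder structure (the container and dataset names below are hypothetical), you could build the list of paths for the last 14 days and read only those:

      from datetime import date, timedelta
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Folders for the last two weeks of ingestion dates; days with no ingested
      # files would need to be filtered out before reading.
      paths = [
          f"abfss://raw@mydatalake.dfs.core.windows.net/sales_orders/{d:%Y/%m/%d}/"
          for d in (date.today() - timedelta(days=i) for i in range(14))
      ]

      recent = spark.read.parquet(*paths)  # anything older than two weeks is simply never read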

  • @chgeetanjali7919 · 10 months ago +1

    Hi Tybul, nice explanation. I have a query regarding PII: can we anonymise the PII in the raw data itself, or do we anonymise the PII during transformations?

    • @TybulOnAzure · 10 months ago +1

      It depends on your requirements and what your legal team says; e.g. you might not be able to store PII data in the raw layer at all. Then what? I can see three basic options:
      1. Don't ingest PII data at all (if possible).
      2. Get rid of PII data on the fly before writing the data to the raw layer.
      3. Add an additional zone (raw-PII) with tight security measures, dump your raw data there, then read from it, get rid of the PII data and save the outcome in the regular raw layer. Optionally, set up automatic removal of files from the raw-PII zone after a few days or so.
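
      A minimal sketch of option 2, assuming hypothetical "ssn" and "email" columns in the incoming files: PII is dropped or hashed in flight, so only sanitized data ever lands in the raw layer.

      from pyspark.sql import SparkSession, functions as F

      spark = SparkSession.builder.getOrCreate()

      incoming = spark.read.json(
          "abfss://landing@mydatalake.dfs.core.windows.net/customers/2025/02/02/")

      sanitized = (incoming
          .drop("ssn")                                        # PII we don't need at all
          .withColumn("email", F.sha2(F.col("email"), 256)))  # non-reversible token, still joinable

      sanitized.write.mode("append").parquet(
          "abfss://raw@mydatalake.dfs.core.windows.net/customers/2025/02/02/")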

    • @chgeetanjali7919 · 10 months ago

      @TybulOnAzure Thanks for the detailed explanation.

  • @amataratsu006-xs6hv · 4 months ago

    Hi Tybul. I am training to become a data engineer on Azure and I was planning on joining the club in the "Junior section". However, I could not find what I was looking for.
    For a fee, would you be able to run interviews for real job scenarios? Would that be something you would consider making part of your service package?
    Your tutorials are great and they give me confidence, great work!

    • @TybulOnAzure · 4 months ago +2

      Due to YouTube's membership policy, I can't offer 1:1 meetings. However, I'm thinking about introducing a new membership tier that would include a monthly group call. In these sessions, we could cover different topics, brainstorm ideas, do live training or interviews, consult, or just have a casual chat. Please note, though, that it would be a group setting.

    • @amataratsu006-xs6hv · 4 months ago

      @TybulOnAzure Thanks for replying. That group setting would be a good start.

  • @yakupbilen7612 · 7 months ago

    Hello Sir,
    My question may not be relevant to the topic of the video. Is there a specific reason to partition our Sales Orders dataset by the ingestion date?

    • @TybulOnAzure · 7 months ago

      Yes - just to know when a given set of data was ingested from the source.
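
      For illustration, a raw layer partitioned by ingestion date could use a folder layout like the hypothetical one below, so every load can be traced back to the day it arrived and reprocessed in isolation:

      raw/
        sales_orders/
          2025/02/01/orders_20250201.csv
          2025/02/02/orders_20250202.csv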

  • @TheMapleSight · 8 months ago

    Is the raw layer also called 'staging'? I think that term is used for the silver layer.

    • @TybulOnAzure · 8 months ago +1

      You can call it whatever you want, e.g. staging, raw or bronze. The important thing is to make everyone aware of what it means and what kind of data it stores.
      I talked more about data lake zones in the 30th episode.

  • @tecain · 8 months ago

    Hello Tybul, this course is very good. It was just what I needed to complement my master's in data architecture. I'm really not clear on how to load the same database every day without repeating the same data over and over again, with increasing daily cost. Can you give a real example of how to face and solve this problem?

    • @TybulOnAzure · 8 months ago +1

      Sure. Basically you would write your data extraction SQL queries in an incremental way.
      Take a look here (learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview) for more details.
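
      The core idea behind that tutorial is a watermark: each run extracts only the rows that changed since the previous run, so the raw layer grows by the daily delta instead of a full copy. A rough Python sketch, with a hypothetical table, watermark column and connection:

      import pyodbc

      conn = pyodbc.connect("DSN=source_db")   # hypothetical source connection
      old_watermark = "2025-02-01T00:00:00"    # highest LastModified value loaded so far

      rows = conn.execute(
          """
          SELECT OrderId, CustomerId, Amount, LastModified
          FROM dbo.SalesOrders
          WHERE LastModified > ?               -- only rows changed since the last run
          """,
          old_watermark,
      ).fetchall()

      # After the copy succeeds, store MAX(LastModified) as the new watermark for the next run.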

    • @tecain · 8 months ago

      @TybulOnAzure Thanks, Tybul!

  • @PamTiwari · 4 months ago

    Wonderful

  • @smithapisharath9610 · 6 months ago

    Can you please explain the medallion architecture?

    • @TybulOnAzure · 6 months ago

      It is mentioned in future episodes.

  • @zouhair8161 · 1 year ago

    I agree with Christian.

  • @LATAMDataEngineer · 9 months ago

    🤙 Thanks

  • @Aryaveer_Ki_Duniya · 1 month ago

    Wonderful