20: Distributed Job Scheduler | Systems Design Interview Questions With Ex-Google SWE

Comments • 62

  • @rajeev9293
    @rajeev9293 24 days ago +2

    Excellent stuff and a lot of detail covered in a short time. I always need to watch your videos multiple times to grasp all the intricacies, since your content covers so much depth.👏

  • @venkatadriganesan475
    @venkatadriganesan475 22 days ago +2

    One of the best system design videos I have ever seen; it touched on all the concepts in 30 minutes.

  • @HimanshuPatel-wn6en
    @HimanshuPatel-wn6en months ago +3

    Your videos are gems; many so-called paid courses don't have this level of quality.

  • @hazardousharmonies
    @hazardousharmonies months ago +5

    Another Jordan classic - great learning material as always! Thank you Sir!

  • @viralvideoguy1988
    @viralvideoguy1988 months ago +3

    I'm a chronic procrastubater myself. Thanks for taking the time to create this Jordan.

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago +2

      Thanks for taking the time to watch it, hopefully it didn't stop you from beating the wood for too long

  • @owenmorris1725
    @owenmorris1725 9 days ago +1

    Just wanna say I really like the addition of the initial high level design! Definitely wouldn’t say it was incomprehensible before (I think your other videos are great too, thanks for all the content!), but this style definitely feels a little more like interview style and helps to better understand where your deeper explanations fit in the system.

  • @charlesliu1439
    @charlesliu1439 months ago +1

    Thanks Jordan for these wonderful videos!

  • @LeoLeo-nx5gi
    @LeoLeo-nx5gi months ago +1

    Amazing one Jordan, learned a lot from this!!

  • @oskarelvkull8800
    @oskarelvkull8800 months ago +1

    Great content! One question about the "cron table": is it used in your final solution? I can't understand when it is used, except maybe for the first scheduling, since you are rescheduling the heads of the DAGs by putting them as the dependencies of the tails. Am I missing something?

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      Hah, yeah. You basically want to ensure that if the cron schedule changes you can update that in the scheduling table, so tasks should read from the cron table when they schedule their next instance
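A minimal sketch of that flow, with my own assumed table shapes (an interval stands in for a real cron expression, and the in-memory dicts stand in for database tables): on completion, the task re-reads the cron table so schedule changes take effect before it writes its next run into the scheduling table.

```python
# Hypothetical in-memory stand-ins for the cron and scheduling tables;
# a real system would use a database and real cron expressions.
cron_table = {"daily_report": 24 * 3600}  # job_id -> run interval (seconds)
scheduling_table = {}                     # job_id -> next run timestamp


def schedule_next(job_id: str, completed_at: int) -> int:
    """On task completion, re-read the cron table (picking up any schedule
    changes) and write the next run time into the scheduling table."""
    interval = cron_table[job_id]
    next_run = completed_at + interval
    scheduling_table[job_id] = next_run
    return next_run
```

If an operator edits the cron table entry, the very next completion picks up the new schedule without touching in-flight rows.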

  • @anindita71
    @anindita71 months ago +1

    Thank you, Jordan! Your videos are really helpful. I have a request for one of Amazon's most asked HLD system design interview questions: a traffic control system. It would be really helpful if you could make a video on this🙏

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      Hopefully at some point I'll have time to do so!

  • @stephanies4064
    @stephanies4064 months ago +1

    Thanks Jordan! Very nice video!

  • @WallaceSui
    @WallaceSui 29 days ago +1

    Thanks Jordan for your video! One question though: will DAG jobs and cron jobs have some overlap? I understand that to simplify the design, in most cases DAG jobs rely on their dependencies finishing while cron jobs rely on time. But is it possible that some DAG jobs are also cron jobs? If so, does that mean we need more columns in the cron table, or maybe an extra table? Thanks a lot.

    • @jordanhasnolife5163
      @jordanhasnolife5163 28 days ago +1

      Typically the first nodes in the DAG will be on some cron schedule, so yeah, I agree there would be additional logic to do there! I don't know that we'd need more logic in the cron table for this; I think it's more so just what timestamp you throw on the DAG job when you put it in the scheduler table (for the next time that it should run)

  • @CompleteAbsurdist
    @CompleteAbsurdist months ago +1

    Thanks Jordan! For writing notes, do you just use Apple Notes, or is this a different app?

  • @nahianalhasan5151
    @nahianalhasan5151 months ago +1

    In the slide starting at 6:00, I'm curious what the best strategy for the database logic is to schedule a job based on its dependencies, e.g. for job 3 when 1: 1 and 2: 1. Is the logic dependent on the epochs of the parent nodes becoming unequal and then equal again to trigger job 3?
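One way the epoch check being asked about could work (my own sketch; the per-run counter semantics and table layout are assumptions, not from the video): each parent bumps its epoch on completion, and a child becomes schedulable once every parent's epoch is exactly one ahead of its own.

```python
# Hypothetical DAG table: task_id -> current epoch and dependency list.
dag_table = {
    1: {"epoch": 1, "deps": []},       # parents already finished this run
    2: {"epoch": 1, "deps": []},
    3: {"epoch": 0, "deps": [1, 2]},   # child still on the previous run
}


def ready_to_run(task_id: int) -> bool:
    """A task is schedulable once every dependency's epoch is one ahead
    of its own, i.e. all parents completed the current run."""
    task = dag_table[task_id]
    return all(dag_table[d]["epoch"] == task["epoch"] + 1 for d in task["deps"])
```

Under this reading, the parents' epochs do become briefly unequal mid-run (one parent finishes before the other) and the child fires only once they are equal again, one ahead of the child's own epoch.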

  • @deadlyecho
    @deadlyecho months ago +1

    Hi Jordan, I am a newbie to system design and I have a couple of questions. I assume that the executor is the pool of cron jobs scheduled to run every minute, and also that only one cron job will pull the scheduled tasks eligible for running. My questions are:
    1. What if we have many tasks scheduled at a particular interval and all of them get picked up? What is the likelihood of this scenario, and should we even care about throttling the executor?
    2. Is running the task exactly at the specified time a non-functional requirement, or do we allow a margin?

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago +1

      1) The executor is basically a bunch of random nodes responsible for running a task that is passed to them from the message broker. I'm not sure what you mean by this question; we'll absolutely have a lot of tasks scheduled at once.
      2) I suppose that's up to your interviewer. The more you partition those scheduling tables, the faster you can get jobs in the queue, but this doesn't guarantee when they'll be run if there aren't enough executors available.

  • @jianchengli8517
    @jianchengli8517 months ago +1

    Do you think it makes more sense to just create schedules whenever we get to the scheduled time? The executor could possibly take a long time to execute a heavy job, so the scheduling will be delayed and users might be confused about why the job was not kicked off in its scheduling window.

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      Not entirely sure what you mean here, feel free to elaborate. When the job gets to the executor has nothing to do with the scheduling time; once the job gets to the executor, we'll increase the retry timestamp as well

  • @soumik76
    @soumik76 16 days ago +1

    Hi Jordan,
    If a DAG update isn't needed (as in, it's a simple cron job), does the executor directly update the schedules table, since there won't be CDC in this case?

  • @Anonymous-ym6st
    @Anonymous-ym6st months ago +1

    At 22:12, about indexing: if we index by status, then when we want to record that the delivery has succeeded, don't we need to search for that job id without the index (which would take a lot of time)?

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      Fair point! I think this might be a good use case for a local secondary index on the job id
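To make the two access paths concrete, here's a small SQLite sketch (column and index names are illustrative, not from the video): one index on (status, run_ts) serves the poller's scan, and a secondary index on job_id serves the point update when a job completes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedules (job_id TEXT, run_ts INTEGER, status TEXT)")
# Index for the scheduler's range scan over pending work.
conn.execute("CREATE INDEX idx_status_ts ON schedules (status, run_ts)")
# Secondary index so completion updates can seek straight to the row.
conn.execute("CREATE INDEX idx_job ON schedules (job_id)")

conn.execute("INSERT INTO schedules VALUES ('j1', 100, 'PENDING')")

# Poller: scan by status + due time.
due = conn.execute(
    "SELECT job_id FROM schedules WHERE status = 'PENDING' AND run_ts <= ?",
    (100,),
).fetchall()

# Completion: point update by job_id, served by the secondary index.
conn.execute("UPDATE schedules SET status = 'DONE' WHERE job_id = ?", ("j1",))
```

The trade-off is the usual one: each extra index speeds up one read path but adds write amplification on every insert and update.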

  • @zy3394
    @zy3394 4 days ago +1

    Is it a good idea to serialize the DAG in application code (topological sort) and treat it as a single task (containing a bunch of subtasks, which are the serialized DAG tasks), with one worker executing the subtasks in order?

    • @jordanhasnolife5163
      @jordanhasnolife5163 a day ago

      Probably not because people may still have other constraints to starting subtasks such as a time, so then the worker has to sit idle. Plus they may have different CPU requirements.

  • @nisarggogate8952
    @nisarggogate8952 26 days ago +1

    Bro this was next level! Love you bruh

    • @nisarggogate8952
      @nisarggogate8952 25 days ago

      Got this in an Amazon interview today. It was LLD though, but your overall video helped a lot!

  • @ravipradeep007
    @ravipradeep007 20 days ago +1

    Excellent video Jordan!
    I have a few questions on how the system would scale.
    R1. For a high-priority job scheduled at 2pm, I want it to get executed within 200ms of the scheduled time.
    Constraint: the S3 binary for the job itself might be 100 MB, and downloading it would take 5 seconds.
    Here is my high-level approach, with three components:
    1. Resource manager
    2. Execution planner
    3. Executor
    The execution planner starts at 1:30pm and sees which tasks are planned for 2:00pm. It categorizes them into high, medium, and low resource (and by how much), and talks to the resource manager to pre-identify appropriate workers and pre-warm the nodes, e.g. pre-download the S3 binary. It creates a task-execution-to-worker-node mapping, and any changes (e.g. cancellations) are communicated to the worker nodes.
    Now at 2:00pm this can again result in a thundering-herd problem where the database gets inundated with queries. To avoid that, we can push the jobs to the workers beforehand along with a local cron job, so each runs exactly at 2:00pm, since the binary is already downloaded.

    • @jordanhasnolife5163
      @jordanhasnolife5163 19 days ago +1

      Seems fairly reasonable to me. I think if any tasks came in like this you could just ensure that they were split into a binary pre-cache step and a run step. You'd either then have to ensure that those steps run on the same physical node, or the physical node would basically have to remain idle from 1:30 to 2

    • @ravipradeep007
      @ravipradeep007 19 days ago

      @@jordanhasnolife5163 Thanks, that should be better IMO: using the existing system, just divide the job into two parts and preschedule with a constraint like job schedule time < T+30min, then schedule.

    • @ravipradeep007
      @ravipradeep007 19 days ago +1

      A lot of other SD YouTubers and coaches never go into the depths you do. With so little experience, this is definitely L6-level stuff.

  • @kevinding0218
    @kevinding0218 months ago +1

    Thank you, Jordan!
    I still have some clarifications to get a better understanding:
    1. What does "step" mean in the context of updating the run_timestamp each time we process the job? For example, if we update the job's run_timestamp from 2:01 to 2:06, is this just a one-time update, or do we continue to update it at subsequent steps, say from 2:06 to 2:11?
    2. I'm struggling to understand the need for the run_timestamp, as in "increase the run_ts to reflect how much time we should wait before rescheduling the job", especially when we already have a status column. Typically, we can determine which jobs to queue by checking the status field, for example, moving jobs from "READY" to "PROCESSING". For scenarios involving failure and retry, if a job fails and the executor is still operational, we could simply update the status to "FAILED". If the executor fails, it seems another executor could pick up the job via a message queue and handle the status updates accordingly?
    3. Concerning priority scheduling, is there a risk of resource wastage, especially since it appears that long-running jobs might subsequently occupy all executor resources connected to the low-, mid-, and high-level message queues, since we always start any job from the lowest level?

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago +1

      1) Steps: the job is read by the scheduling cron, the job gets put in the message queue, the job reaches the executor. And nope, we'll continue to update it in the future if we retry a job!
      2) If we don't have a run timestamp, we will just constantly retry the job every time we poll our scheduling table. If we instead use some sort of enum, like a status saying whether a job is completed, in progress, or failed, then we may not retry the job if the node running it goes down and can never tell us that it failed.
      3) Yes, but that's typically why the lowest queues have a pretty small timeout. In theory, we could also have users submit a minimum priority to run at when they submit a job.
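The run_timestamp mechanics from point 2 can be sketched as follows (the 5-minute retry delay, row layout, and function names are my own illustrative assumptions): the poller pushes run_ts forward whenever it hands a job out, so an executor that silently dies just means the row comes due again later, while completion flips a flag (or deletes the row) so the job is never retried.

```python
RETRY_DELAY = 300  # seconds to wait before assuming the executor died

# Hypothetical scheduling-table rows.
schedules = [{"job_id": "j1", "run_ts": 0, "done": False}]


def poll(now: int) -> list[str]:
    """Return due jobs and bump their run_ts, so a job whose executor
    silently dies simply becomes due again after RETRY_DELAY."""
    due = []
    for row in schedules:
        if not row["done"] and row["run_ts"] <= now:
            row["run_ts"] = now + RETRY_DELAY  # tentative retry time
            due.append(row["job_id"])
    return due


def complete(job_id: str) -> None:
    """Mark the job finished so the poller never retries it."""
    for row in schedules:
        if row["job_id"] == job_id:
            row["done"] = True  # or delete the row outright
```

Note that no node ever has to report a failure for the retry to happen: not hearing a completion before run_ts comes due is itself the retry signal.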

    • @kevinding0218
      @kevinding0218 months ago

      @@jordanhasnolife5163 Thanks a lot!!

    • @rakeshvarma8091
      @rakeshvarma8091 19 days ago +1

      @@jordanhasnolife5163
      Continuing on this, when exactly do we update the run_timestamp? If we do it every time, won't we end up running the job again even though it finished in an earlier run?

    • @jordanhasnolife5163
      @jordanhasnolife5163 19 days ago

      @@rakeshvarma8091 The run timestamp is updated to our restart time if we reach it. In the case of finishing the job, we can remove our entry from the table upon completion, or use a separate status column to say "don't run it again".

  • @ajayreddy9176
    @ajayreddy9176 22 days ago +1

    Basically a Jenkins master-and-slave setup deployed on Kubernetes for scalability

  • @yaoxianqu9014
    @yaoxianqu9014 months ago +2

    If we make the root node dependent on its child nodes, wouldn’t this make the graph no longer acyclic? How would we be able to figure out which one is the root node in this case?

    • @parthsolanki7878
      @parthsolanki7878 months ago +1

      Yeah. Came to the comments section to ask the same. 1->2->4->1

    • @adityasoni1207
      @adityasoni1207 months ago +1

      The node will still have a higher epoch number, I presume, but yeah, not sure what issues it could create. We could take a look at how the Argo scheduler works and probably use that idea as well.

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      The root nodes have a non-null cron schedule, so they should be fairly easy to identify for a given DAG

  • @aa-kj5xi
    @aa-kj5xi months ago +1

    I propose using Temporal to simplify and abstract away all the retry logic and locking, and to ensure idempotency.

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      This is new to me - I'll take a look, thanks!

  • @xiangchen-nh3px
    @xiangchen-nh3px months ago +1

    Thanks for sharing! Could you please offer the content doc?

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      Yeah I've been procrasturbating, will likely upload everything in batch in like 8 weeks when this series is done

    • @rajatahuja6546
      @rajatahuja6546 months ago

      @@jordanhasnolife5163 what's the next series you're planning?

  • @rajatahuja6546
    @rajatahuja6546 months ago +1

    Can you share the notes via a Google Drive link, iCloud, or some other way?

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago

      I will do this eventually, but it will realistically be a couple of months

    • @rajatahuja6546
      @rajatahuja6546 months ago

      @@jordanhasnolife5163 what do you plan to start once this series is over?

  • @user-gn6fj2ri1z
    @user-gn6fj2ri1z months ago +1

    I still prefer reading to watching videos for tech stuff. I wonder whether you could also publish your content in written form somewhere. There are platforms where writers get paid for their content, or perhaps a book like Alex Xu's.

    • @jordanhasnolife5163
      @jordanhasnolife5163 months ago +1

      I will likely do this eventually! Though as you alluded to, I may try and get paid for it lol

  • @martinwindsor4424
    @martinwindsor4424 months ago +1

    Thought I'd be clapping cheeks on a weekend, but I'm making notes from Jordans videos. fml.

  • @shivanand0297
    @shivanand0297 months ago +1

    helll yeah

  • @priteshacharya
    @priteshacharya months ago +1

    On the DAG table, you mentioned "when all dependency tasks have an equal epoch for a given row, schedule that task". By epoch, do you mean just a counter?
    If we use an actual Unix epoch time (the number of seconds elapsed since 1 January 1970), they won't be the same, because the two tasks will finish at different times.

      @jordanhasnolife5163 months ago +1

      Yes, just a counter. Linux time is "millis since the epoch", where the epoch means 1970, but yeah, I just mean a monotonically increasing sequence number.
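A tiny illustration of the difference (my own example, not from the video): because the "epoch" here is a per-run counter bumped on completion, all parents of a node converge on the same value no matter when each finishes, which wall-clock timestamps never would.

```python
# Per-task run counters, not wall-clock time.
epochs = {"task_1": 0, "task_2": 0}


def finish(task: str) -> None:
    """Bump the task's epoch by one on each completed run."""
    epochs[task] += 1  # monotonically increasing sequence number
```

Two parents finishing minutes apart still end the run on the same counter value, so the "all dependency epochs equal" check works.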