Most Common Data Science SQL Interview Question from DoorDash [window functions & partitions]

แชร์
ฝัง
  • เผยแพร่เมื่อ 2 ก.ค. 2024
  • This is the most common data science SQL interview question from DoorDash, a food delivery company. This question tests your ability to split your data into percentiles using window functions. I’ll walk you through solving the question like we’re in an interview and give you some tips on how to approach the solution.
    Go to the question through the link below and follow along with me.
    Link to the question: platform.stratascratch.com/co...
    ______________________________________________________________________
    👉 Subscribe to my channel: bit.ly/2GsFxmA
    👉 Playlist for more data science interview questions and answers: bit.ly/3jifw81
    👉 Playlist for data science interview tips: bit.ly/2G5hNoJ
    👉 Practice more real data science interview questions: platform.stratascratch.com/co...
    ______________________________________________________________________
    Timeline:
    Intro: (0:00​​​)
    Interview Question: (0:35​​​)
    Framework to solve the problem: (1:40​​​)
    Understand your data: (3:58​​​)
    Formulate your approach: (6:40​​​)
    Code Execution: (8:55​​​)
    Code Optimization: (14:35​​​)
    Conclusion: (17:08​​​)
    ______________________________________________________________________
    About The Platform:
    I'm using StrataScratch (platform.stratascratch.com/co..., a platform that allows you to practice real data science interview questions. There are over 1000+ interview questions that cover coding (SQL and python), statistics, probability, product sense, and business cases.
    So, if you want more interview practice with real data science interview questions, visit platform.stratascratch.com/co.... All questions are free and you can even execute SQL and python code in the IDE, but if you want to check out the solutions from me or from other users, you can use ss15 for a 15% discount on the premium plans.
    ______________________________________________________________________
    Contact:
    If you have any questions, comments, or feedback, please leave them here!
    Feel free to also email me at nathan@stratascratch.com
    ______________________________________________________________________
    #DataScience #SQLInterviewQuestion

ความคิดเห็น • 43

  • @sophie20324940
    @sophie20324940 หลายเดือนก่อน

    I believe the choice between using ntile(100) with num

  • @joaopedroreissilva7075
    @joaopedroreissilva7075 2 ปีที่แล้ว

    Thank you, Nate!
    Impressive content, like always.

  • @grainofsalt2113
    @grainofsalt2113 2 ปีที่แล้ว +1

    I love how methodical you are. I totally stole this for my own approach lol

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว

      I hope you do and apply it to your interviews!

  • @shipragupta4382
    @shipragupta4382 3 ปีที่แล้ว +2

    Hi Nate, Your videos are really helpful. Thanks. Can you make a video on " How to transpose variable from rows to columns, using case when and pivot"?

  • @PawanSharma-gs7cy
    @PawanSharma-gs7cy 2 ปีที่แล้ว

    Thank you !! keep providing such informative SQL concepts..

  • @techiewithcamera
    @techiewithcamera ปีที่แล้ว

    Please validate my approach, I think this can be done in this way too :)
    My Solution:
    with filtered_dataset as
    (SELECT
    '1' as partition_no,
    restaurant_id,
    sum(order_total) as revenue
    FROM doordash_delivery
    where to_char(customer_placed_order_datetime, 'MM-YYYY')='05-2020'
    group by 2)
    select *
    from
    (select
    restaurant_id,
    revenue,
    sum(revenue) over (partition by partition_no) total_revenue,
    (revenue / sum(revenue) over (partition by partition_no))*100 as revenue_percentage
    from
    filtered_dataset)a
    where revenue_percentage

  • @AAMIRKHAN1602
    @AAMIRKHAN1602 2 ปีที่แล้ว

    Very good explanation and it is helping me to think clearly and give solutions.
    Your videos are great!
    Thanks

  • @moutaouakkilyoussef5825
    @moutaouakkilyoussef5825 2 ปีที่แล้ว +1

    Thank you for the clean tutorials!! Should've much more views considering the quality and information in the channel

  • @lecryptojames
    @lecryptojames 2 ปีที่แล้ว

    Great! I landed a DE Consultant role where my potential client is DoorDash.

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว +1

      Congrats! Hope these videos are helpful!

  • @v.sprasad8399
    @v.sprasad8399 2 ปีที่แล้ว

    Really helpful Nate

  • @deepshikasharma8436
    @deepshikasharma8436 3 ปีที่แล้ว +1

    Your videos are hands down the best resource I have come across so far in answering DS interview questions! Could you also do a series on how to explain advanced SQL functions/concepts?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      Thanks for watching! Most of my videos on DS interview questions do contain some level of explanation on advanced functions I use. But what I can do is add the advanced concept/function in the title of the video and also provide a more in-depth explanation on the concept. I think that would help.

    • @deepshikasharma8436
      @deepshikasharma8436 3 ปีที่แล้ว

      @@stratascratch Thanks for responding! That would help immensely! The reason I asked, is because you mentioned that while answering a code question, we should explain which operators we're using and why. I get tripped up when I try to explain the (advanced) operator I'm using in non-technical/simple terms while answering the question.

  • @InYourHorror
    @InYourHorror 3 ปีที่แล้ว +1

    Keep up great work, your videos helped me get a pretty good starter job right after uni.

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      That’s great! I’m glad I was able to help in some way. Good luck in your new job.

  • @poisson12376
    @poisson12376 2 ปีที่แล้ว

    Man, your contents are really good! Have got much chance to work on SQL in my current job, need to practice more.

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว

      Thanks for watching! Keep practicing!

  • @vrajpatel8256
    @vrajpatel8256 3 ปีที่แล้ว

    Great video

  • @owenren1626
    @owenren1626 2 ปีที่แล้ว

    Great Video, just a question, when would we use PERCENT_RANK()? Is it because the question says evenly distributed buckets?

  • @nadiaennab4586
    @nadiaennab4586 2 ปีที่แล้ว +1

    Thank you, Nate! Could you please clarify the following, I am a little confused on these:
    1. How does the window function work with the group by clause? I have never used a window function with group by. Does the query execute the [SELECT restaurant_id,
    SUM(order_total) as total_order FROM doordash_delivery GROUP BY retaurant_id] and then assign ntiles to the grouped rows? Or, is something else going on?
    2. Why do you have to do a subquery/CTE instead of using a having clause after group by? I guess this is related to my first question regarding order of operations. Is the ntile part not executed before group by such that it cannot be referenced in a having statement?
    3. This is trivial, but when you used the ntile statement, you did not assign it an alias name. Does it automatically become called "ntile" such that you were able to say ntile=1 in the where statement?

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว +1

      Thanks for watching the video and trying out a solution! Would you be able to copy and paste your question on the forum attached to the question on the platform? Someone from my team will answer each of your questions within 3 days. It's the best way to get a timely answer! Thanks again!

  • @poojagupta-dy1gv
    @poojagupta-dy1gv 2 ปีที่แล้ว

    Hi Nate, Can we create 100 bucket and choose two for bottom 2% ?

  • @gajjelapriyanka5301
    @gajjelapriyanka5301 3 ปีที่แล้ว

    Hi Nate, how can we find creation date of jobs from June 2020 to June 2021 in sap bods? I need a SQL query for this?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      In sap bods? What do you mean?

    • @gajjelapriyanka5301
      @gajjelapriyanka5301 3 ปีที่แล้ว

      @@stratascratch need an SQL query for how to know the list of jobs name which was created in between the 2020june and 2021 june

  • @ashimahorra9623
    @ashimahorra9623 2 ปีที่แล้ว

    @strataScratch Why cant we use PERCENTILE_CONT here?

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว

      I don't know...does it give you the same output if you use it? How is PERCENTILE_CONT different than what I'm using? There could be an edge case that you're not considering or I'm not considering.

  • @hariprasath9997
    @hariprasath9997 2 ปีที่แล้ว

    suppose we can split the dataset into two means then how the query will be ? any one guide me please
    those 2 tables name was : table_one and table_two

    • @nikhilgarakapati566
      @nikhilgarakapati566 2 ปีที่แล้ว

      You can use the CTE concept here.
      If the number of rows are even, you can put filter like from start to count(rows)/2, that would give you first half table and (count(rows)/2)+1 to count(rows), that would give you the second half. You can use this as a subquery, but I would prefer a CTE as it can be used anytime whereas subquery, we have to run it even though it wont be needed.
      If the number of rows are odd, we can cannot evenly split. In that case, you write you condition accordingly.

  • @priyankalad7789
    @priyankalad7789 3 ปีที่แล้ว

    why we are not doing NITILE(50) over(partition by sum(order_total) asc) instead ?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      I get where you're going with this. The result is just 1 NTILE when implemented for some reason. Did you have code on the platform that worked? Can you copy and paste it here?

    • @priyankalad7789
      @priyankalad7789 3 ปีที่แล้ว

      @@stratascratch select a.restaurant_id,
      a.total
      from
      (select restaurant_id,
      sum(order_total) as total,
      ntile(50) over (partition by sum(order_total) asc) as bucket
      from doordash_delivery
      where customer_placed_order_datetime between '2020-05-01' and '2020-05-31'
      group by restaurant_id)a
      where a.bucket = 1
      order by 2 desc

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      @@priyankalad7789 This isn't executing for me on the platform. I get an error. But either way, I'm not sure what the PARTITION is supposed to do to help solve the problem. Once I have it in 50 buckets, I just need to find ntile = 1 to get to the bottom 2% since each ntile is worth 2% of the distribution.

    • @priyankalad7789
      @priyankalad7789 3 ปีที่แล้ว

      @@stratascratch ok yea. Partition doesn't make sense here. This is first of this kind in your series of videos. Thanks

  • @papatingle9964
    @papatingle9964 2 ปีที่แล้ว +1

    I would like to challenge the usage of NTILE function in this question. While it is a possibility things need to be clarified in front of the interviewer but in my opinion NTILE is the least valid option out of the 3 possible here.
    1. NTILE as used in the video
    2. ROW_NUMBER() OVER(ORDER BY SUM(order_total))/(SELECT COUNT(*) FROM doordash_delivery)*100
    3. RANK() OVER(ORDER BY SUM(order_total))/(SELECT COUNT(*) FROM doordash_delivery)*100
    To present my arguement let's assume there are 101 restaurants and we were tasked with giving the bottom 2%
    Let's also assume the bottom 5 restaurants have the exact same revenue. We'd get the following results when querying for the bottom 2% (AKA ntile=1 or row_number()

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว +1

      Hi papa tingle,
      That's a great comment, thank you for the very detailed and easy to follow argumentation.
      You detected a real problem - what to do in case revenue/score of different entities are tied. Do we want to output all the tied entities, or we want to limit ourselves to return as closely to 5% of data points as possible?
      If we want to just return 5% of data, then NTILE and row_number are good solutions for us.
      If we want to return all the tied entities - we'd go with the RANK option. One alternative to the RANK option that is more intuitive to us is PERCENTILE_CONT/PERCENTILE_DISC function, e.g. "SELECT percentile_disc(0.05) within group (order by summed_total_order) as percentile_05". This gets you the score that is at 5% and then you have the flexibility to do what you want with it.
      Always discuss with your interviewer what is expected from you in such cases. When we think about it, returning all the tied entities makes more sense. But someone could have a different opinion.