Facebook and Microsoft Data Science SQL interview question walkthrough (advanced)

แชร์
ฝัง
  • เผยแพร่เมื่อ 7 ก.ค. 2024
  • This data science SQL interview question is from Facebook and Microsoft that tests your ability to find and segment users as well as join aggregated tables together. Let's walk through solving this question as we do in the interview.
    Link to the question: platform.stratascratch.com/co...
    ______________________________________________________________________
    👉 Subscribe to my channel: bit.ly/2GsFxmA​​​
    👉 Playlist for more data science interview questions and answers: bit.ly/3jifw81​​​
    👉 Playlist for data science interview tips: bit.ly/2G5hNoJ​​​
    👉 Practice more real data science interview questions: platform.stratascratch.com/co...
    ______________________________________________________________________
    Timeline:
    Intro: (0:00​​​)
    Interview Question: (0:32​​​)
    5 Step Framework: (1:20​​​)
    Explore the Data: (2:06​​​)
    List Assumptions: (2:40​​​)
    Outline Approach: (5:20​​​)
    Code in Increments: (7:32​​​)
    Calculate New Users: (7:50​​​)
    Calculate All Users: (11:02)
    Join Tables: (11:36​​​)
    Calculate User Share: (12:54​​​)
    Optimize Code: (14:15​​​)
    Conclusion: (16:20​​​)
    ______________________________________________________________________
    About The Platform:
    I'm using StrataScratch (platform.stratascratch.com/co..., a platform that allows you to practice real data science interview questions. There are over 1000+ interview questions that cover coding (SQL and python), statistics, probability, product sense, and business cases.
    So, if you want more interview practice with real data science interview questions, visit platform.stratascratch.com/co.... All questions are free and you can even execute SQL and python code in the IDE, but if you want to check out the solutions from me or from other users, you can use ss15 for a 15% discount on the premium plans.
    ______________________________________________________________________
    Contact:
    If you have any questions, comments, or feedback, please leave them here!
    Feel free to also email me at nathan@stratascratch.com
    ______________________________________________________________________
    #SQLInterview

ความคิดเห็น • 61

  • @TheMHud
    @TheMHud 2 ปีที่แล้ว

    I don’t know how you do it but after watching one of your videos I always come up with a solution for one of my coding problems. I am new on my job as a business analyst and am setting up dashboards for the company. Thank you so much for this quality content - keep up the good work.

  • @PATRICKCHUAD
    @PATRICKCHUAD 3 ปีที่แล้ว

    Very Clear explanation. Thanks Nate for creating a video for this problem.

  • @mohitnisar3866
    @mohitnisar3866 2 ปีที่แล้ว

    Hey, just found out about your channel, been going through videos almost all day and wanted to give my thanks! The content is amazing.

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว

      Thanks so much! More to come!

  • @oliviaou1940
    @oliviaou1940 3 ปีที่แล้ว

    great video! I've learned a lot through your videos!! Thanks. Please keep posting this kind of videos

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      I definitely will! Thank you for watching my videos

  • @leonardlau4951
    @leonardlau4951 3 ปีที่แล้ว

    i love your videos narrating your mind map of creating queries, which is very helpful for sql learners!

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว +1

      Thank you! Hope you keep watching!

  • @joaopedroreissilva7075
    @joaopedroreissilva7075 2 ปีที่แล้ว

    Impressive.
    The people who work with you and especially the team you lead are certainly very lucky.

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว +1

      Thanks, buddy! Really appreciate it

  • @Bustarhymesisnotmyfather
    @Bustarhymesisnotmyfather 2 ปีที่แล้ว +1

    Videos are great. Thanks.

  • @ismafoot11
    @ismafoot11 2 ปีที่แล้ว +1

    You dont need the count(distinct user_id) at 9:47. you did a min in your inner query which keeps once instance of that column per user_id. you said a user could show up multiple times in the same month but the min takes care of that

    • @paul_devos
      @paul_devos 2 ปีที่แล้ว

      that's correct, when using a GROUP BY here you don't even need a DISTINCT as the "Map -> Reduce" like feature of the Group By would do all the work. And depending on the query engine underneath, you rarely ever want to use DISTINCT as it could lead to a cartesian product (n!) timespace. Altho, I do think most query engines underneath, now know to substitute the group by for distinct anyway.

  • @ameyraj4947
    @ameyraj4947 3 ปีที่แล้ว

    Awesome loved the video🎉🎉

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      Thanks so much! check out the other ones on my channel!

  • @myrandomandboringvideos
    @myrandomandboringvideos 3 ปีที่แล้ว +1

    Your videos are helpful. Insights on how to break a problem statement are very detailed.

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว +1

      Thanks for the kind words! I'll keep doing the best I can with these videos!

    • @myrandomandboringvideos
      @myrandomandboringvideos 3 ปีที่แล้ว

      @@stratascratch Hey Nate, I was solving this question 'Finding User Purchases' and wrote this query but this is not getting accepted as the right solution. if it is okay, could you please let me know why?
      select
      distinct user_id
      from
      (
      Select
      user_id,
      created_at - last_purchase_date as days_difference
      from
      (
      SELECT
      user_id,
      created_at,
      lag(created_at,1) over (partition by user_id order by created_at) as last_purchase_date
      FROM AMAZON_TRANSACTIONS
      order by user_id asc, created_at asc
      ) a
      ) b
      where days_difference >1 and difference < 8

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว +1

      @@myrandomandboringvideos Thanks for giving it a try and definitely can help. But can you post this code and your question on the user forum of the platform? Either myself or one of my teammates will help debug the code for you =)

  • @jerrygong8976
    @jerrygong8976 3 ปีที่แล้ว +1

    good video. only catch is that you don't need the 'distinct' keyword in the first CTE because it's already grouped by user_id and user_id will always be distinct in that case

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      Thanks so much for the insight and glad to see you're watching the video! You're definitely right about now needing distinct.

  • @Nadyusa
    @Nadyusa 2 ปีที่แล้ว

    I have been doing similar task at work utilising my colleagues shorter approach. In cte I count new user date, then in the main query I calculate shares like this: sum(iif(new user date>=month,1,0))/count(user_id), and group by month. Less subqueries, would it be quicker as well? And thanks for great videos, looking forward to solving many problems with window functions you taught.

  • @sujayshashank1198
    @sujayshashank1198 3 ปีที่แล้ว +1

    Why we need to use distinct user_id ?? When we already have filtered out the data in subquery and did a group by on user_id and fetched the min(time_id).
    So even if user would be having multiple entries in the same month but will result only one row when we did min(time_id)

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว +1

      absolutely true. No reason to use a distinct but it just is a fail safe in case there's a duplicate user somehow. As you mentioned, there shouldn't be.

  • @maheshchandra3833
    @maheshchandra3833 3 ปีที่แล้ว

    Just came across your videos and I really like your explanations! I was checking out StrataScratch and was wondering how the difficulty level maps to the difficulty level in Leetcode Database. Are your “Hard” questions harder than Leetcode “Hard”?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      Sort of...StrataScratch has data science type of questions so the approach and solution is much different than Leetcode which tends to be more algorithm based. The LC hard database questions are of the level of the SS hard questions in terms of syntax and functions needed in the solution, but I think the hard questions for SS are much harder since there's a data science aspect to the question.

    • @maheshchandra3833
      @maheshchandra3833 3 ปีที่แล้ว

      @@stratascratch Thanks for the reply. For a new grad SQL interview, would you say "Hard" level SS questions are overkill? Or you think "Hard" level questions are quite common?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      ​@@maheshchandra3833 No it's not quite overkill but you probably won't get too many of those questions. Make sure you know to do the medium ones for sure. That's most common. You'll get 1-2 hard ones during the later rounds.

  • @kamakshijoshi3655
    @kamakshijoshi3655 3 ปีที่แล้ว

    Hello Nate, quiet good video, just wanted to know that the question you share, are these type of question useful for a Data Engineer Sql rounds as well?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว +1

      Yes, some are useful for DE sql interviews as well. Later in the year I'll do better to distinguish DS with DE questions and release some DE specific questions.

  • @ajaxaj6749
    @ajaxaj6749 2 ปีที่แล้ว

    we can use lag function instead

  • @SuperLOLABC
    @SuperLOLABC 3 ปีที่แล้ว

    Great video, Nate! I am a new grad and I have a question; whenever I try to solve a medium or hard problem, I can’t seem to think beyond ‘joining’ tables. I can’t seem to think of using CTEs or sub queries. Can you point me to any resources or use cases where I can learn to use CTEs/subqueries intuitively? Also are CTEs and subquiries have the same use case?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว +2

      I think it really just takes more practice. I have an entire playlist dedicated to these questions and many of them require CTEs/subqueries. With more practice, you'll get an understanding of how to think about your approach to the problem and organically will start using ctes/subqueries in your solution. So suggest doing more problems on StrataScratch and Leetcode to get comfortable with solving these types of questions. Hackerrank is another source that could help. But I don't know many other platforms that give you a lot of practice problems.

    • @SuperLOLABC
      @SuperLOLABC 3 ปีที่แล้ว +1

      @@stratascratch Thanks for your reply. Practice does make one perfect I guess. Great work on the platform by you and your team!

  • @vivekkamisetti4
    @vivekkamisetti4 2 ปีที่แล้ว +1

    Hi Nate, I think there's one thing wrong about this query. What if for a month there are 0 new users? By doing an inner join you're missing out on that logic.
    In my opinion, the right way to do it is doing a left join between all users and new users and then assign a 0 wherever there's a null(in new users table).
    Let me know what you think.

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว

      I think you're correct with this edge case. I like that logic!

  • @priyankalad7789
    @priyankalad7789 3 ปีที่แล้ว

    @Nate Could you please post some more data science or data engineer SQL questions?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      Yes! I will keep posting both ds and de style sql questions in the future. I have a variety of other types of content coming out over the next few months but coding questions is something I plan to keep doing.

    • @priyankalad7789
      @priyankalad7789 3 ปีที่แล้ว

      @@stratascratch Great Thanks for your contribution

  • @SuperLOLABC
    @SuperLOLABC 3 ปีที่แล้ว

    Hey Nate, is it necessary to know the time complexity of queries during the interview? I understand it is expected in a coding interview but is it necessary for SQL interviews too?

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      Time complexities no. But understanding what can make execution time longer in your queries is definitely fair game. You just need to point out parts of your code where you could potentially optimize to speed execution time.

    • @SuperLOLABC
      @SuperLOLABC 3 ปีที่แล้ว

      @@stratascratch Thank you for your prompt reply! I have an interview coming up in two weeks and you along with your platform are a godsend!

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      @@SuperLOLABC Good luck! Hope you do well!

  • @anumitamondal1320
    @anumitamondal1320 3 ปีที่แล้ว

    Nate could you post some tutorial about Data Modelling question

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      Yea, I'll add it to the queue. I have a few months of videos on the list right now but we can do some modeling questions afterwards. I'm assuming these are for data engineering interviews?

    • @anumitamondal1320
      @anumitamondal1320 3 ปีที่แล้ว

      @@stratascratch Yes Nate. For Data Engineer , there's a section for Data Modelling and forming the SQL on top of it. If you can make some video on that..it would be nice. Waiting for it.

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      @@anumitamondal1320 I'll see if I can find any questions on data modeling. If I don't have one in my current repository, I will try to find one. I plan to add some DE questions to this channel and stratascratch later in the year so there will be some more content around in later. Thanks for offering this suggestion.

  • @VinodKumar-nn7go
    @VinodKumar-nn7go 2 ปีที่แล้ว

    Hi Nate, In the problem statement it says ratio of new users to "existing users". I was wondering whether we should exclude the new users from all users and then get the existing users?? Thanks

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว

      Yes, you can do that! i think that makes sense. I think it just depends on what the interviewer defines as "existing users".

  • @ajeetvishwakarma7964
    @ajeetvishwakarma7964 3 ปีที่แล้ว +1

    “New users are defined as users who started using services in the current month.” But where are we checking that condition? Your solution checks min value of the time_id column. As per my understanding the current month should the month in which we exceute the query. Could you please explain.

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว

      We are checking that condition with the min() function as you mentioned. In the outer query, we extract the month from the min(time_id) of the user to get to when the user executed the query.

    • @pranavlal2289
      @pranavlal2289 2 ปีที่แล้ว

      @@stratascratch Thanks Nate, but I agree with Ajeet here, the question clearly mentions that the user who starts in the current month is the new user, we need to check the month(min(time_id)) to be the same as the month of current_date and that should be the users who are new users. What do you think?

  • @its_me7363
    @its_me7363 3 ปีที่แล้ว +2

    waiting for API videos

    • @stratascratch
      @stratascratch  3 ปีที่แล้ว +2

      It's right here (th-cam.com/video/fklHBWow8vE/w-d-xo.html). I walk through using the requests library for python to extract data from the TH-cam API. But you can extend the request library for any API. In about 2 weeks, I'll have the next step of the process which is saving the data to a database in the cloud.

  • @oscararmandocisnerosruvalc8503
    @oscararmandocisnerosruvalc8503 ปีที่แล้ว

    why are we using 1 - and then the formula ???.
    regards

  • @techiewithcamera
    @techiewithcamera ปีที่แล้ว

    My Solution:
    with month_dataset as
    (select
    to_char(time_id, 'MM-YYYY') as month_year,
    string_agg(user_id, ';#') as user_id_list
    from fact_events
    group by 1),
    user_status as
    (select distinct
    to_char(t1.time_id, 'MM-YYYY') as month_year,
    t1.user_id,
    case
    when t2.user_id_list ''
    then 'existing user'
    else 'new user'
    end as user_status
    from fact_events t1
    left join month_dataset t2 on
    to_char(t1.time_id, 'MM-YYYY') > t2.month_year
    and t2.user_id_list like '%' || t1.user_id ||'%')
    select
    month_year,
    round(sum(existing_user)/ (sum(existing_user)+sum(new_user))::decimal, 2) share_existing_users,
    round(sum(new_user)/ (sum(existing_user)+sum(new_user))::decimal, 2) share_new_users,
    sum(existing_user)+sum(new_user) all_users
    from
    (select
    month_year,
    case
    when user_status='existing user'
    then 1
    else 0
    end as existing_user,
    case
    when user_status='new user'
    then 1
    else 0
    end as new_user
    from user_status)x
    group by 1
    order by 1;

  • @dannyrodin1151
    @dannyrodin1151 2 ปีที่แล้ว

    I don't get the "with" at the beginning. Too bad you never explained it.

    • @stratascratch
      @stratascratch  2 ปีที่แล้ว +1

      Sorry about that. I assume there's a level of experience people have with SQL since these vids focus on how to approach the solution as if you're on an interview. WITH is the start of a CTE (common table expression) that's popular in postgres. Not all db engines have this function. MySQL doesn't allow for this for example. But you can always create temp tables or subqueries in place of CTEs.