Facebook Statistics Interview Question | Google Data Scientist | DataInterview

  • Published on 30 Jun 2024
  • 🚀 Land your dream data job using datainterview.com/.
    ====== ✅ Details ======
    🤔 "If you sample 10,000 users multiple times, what would the distribution of false positives look like?"
    Dan (Ex-Google/PayPal Data Scientist) explores how to address this interview question posed in the statistics round of a data scientist interview at Facebook.
    He covers the following:
    1. Preview of the solution in the Case-In-Point course.
    2. Walkthrough of the solution along with a self-assessment.
    3. Demonstration of the Central Limit Theorem on Colab.
    You can access the Colab here:
    colab.research.google.com/dri...
    If you want more prep content including AB testing courses, case course, product SQL course, Slack group and much more, make sure to check out datainterview.com/
    👍 And, lastly, feel free to subscribe, like and share this video!
    ====== ⏱️ Timestamps ======
    0:00 Intro
    01:19 DataInterview Case Course
    04:12 Solution Steps Overview
    05:03 Step 1 - Clarification
    06:33 Step 2 - Single Sample
    08:07 Step 3 - Multiple Samples
    12:52 Self-Assessment
    13:51 Colab Simulation
    ====== 📚 Other Useful Contents ======
    1. Principles and Frameworks of Product Metrics | YouTube Case Study
    / principles-and-framewo...
    2. How to Crack the Data Scientist Case Interview
    / principles-and-framewo...
    3. How to Crack the Amazon Data Scientist Interview
    / crack-the-data-scienti...
    ====== Connect ======
    📗 LinkedIn (Dan) - / danleedata
    📘 Medium (DataInterview) - / datainterview
  • Science & Technology
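
The repeated-sampling idea in the question can be sketched in a few lines. This is not the linked Colab, only a minimal stand-in under stated assumptions: every user is independently a false positive with probability `ALPHA = 0.05`, and `N_SAMPLES = 300` repetitions is an arbitrary choice. Only the 10,000 users per sample comes from the question.

```python
import math
import random
import statistics

ALPHA = 0.05        # assumed per-user false positive probability
N_USERS = 10_000    # users per sample, as stated in the question
N_SAMPLES = 300     # assumed number of repeated samples


def false_positive_rate(n_users, alpha, rng):
    """Fraction of n_users flagged when each is a false positive with probability alpha."""
    hits = sum(rng.random() < alpha for _ in range(n_users))
    return hits / n_users


rng = random.Random(0)  # fixed seed so the run is reproducible
rates = [false_positive_rate(N_USERS, ALPHA, rng) for _ in range(N_SAMPLES)]

observed_mean = statistics.fmean(rates)
observed_sd = statistics.stdev(rates)
theoretical_se = math.sqrt(ALPHA * (1 - ALPHA) / N_USERS)

# Share of sample rates inside alpha +/- 1.96 standard errors
# (close to 95% if the rates are roughly normal).
inside = sum(abs(r - ALPHA) <= 1.96 * theoretical_se for r in rates) / N_SAMPLES

print(f"mean of rates:  {observed_mean:.5f} (alpha = {ALPHA})")
print(f"sd of rates:    {observed_sd:.5f} (theory = {theoretical_se:.5f})")
print(f"within 1.96 SE: {inside:.1%}")
```

The rates should cluster around alpha with a spread near sqrt(alpha(1-alpha)/n), which is the behaviour the video attributes to the Central Limit Theorem.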

Comments • 25

  • @chessplayer0106
    @chessplayer0106 2 years ago +6

    I think the thing that tripped me up was "sample users", because sampling users does not mean we *classify* the users. And it's in the classification of the users (for example, classifying them as "bot" vs "not a bot") that we obtain false positives. Once I stopped tripping over that, the problem made sense!

  • @AnshumanPanwar
    @AnshumanPanwar 1 year ago +6

    "False positive" is a term used when there is prediction involved. Using the wrong terminology makes a straightforward question unnecessarily complicated.

  • @kennethleung4487
    @kennethleung4487 2 years ago

    Excellent video with clear explanations, looking forward to more of your content on Statistics

  • @hello-pd7tc
    @hello-pd7tc 2 years ago +2

    This is so helpful! Thank you!! Please upload more videos like this for Facebook Statistics questions :)

  • @zhili4558
    @zhili4558 2 years ago +1

    Hi Dan, I was asked a distribution question like this as well. The question was: for those samples at the median or the 95th percentile, if we check back in 2 months, what would it look like? Not sure how to answer it.

  • @mahmoudabdelsattar8860
    @mahmoudabdelsattar8860 1 year ago

    Very nice and helpful, need more vids.

  • @korchageen
    @korchageen 2 years ago

    Excellent.. keep up your good work.

  • @hhliu7287
    @hhliu7287 2 years ago +2

    p(1-p)/n is the binomial distribution variance, which is for sampling 10,000 people one time. If we sample multiple times, it becomes a normal distribution; will it still have the same variance? Do we still use the same n = 10,000? Shouldn't we use n = the number of repetitions, such as 1,000?
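
One way to check which n belongs in p(1-p)/n is to vary the two quantities separately. The sketch below is an illustration under assumptions (independent users, alpha = 0.05, arbitrary repetition counts of 50 and 200); `rate_sd` and `theoretical_se` are hypothetical helper names, not anything from the video.

```python
import math
import random
import statistics


def rate_sd(n_users, n_repeats, alpha=0.05, seed=0):
    """Standard deviation of the false positive rate across n_repeats samples."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_repeats):
        hits = sum(rng.random() < alpha for _ in range(n_users))
        rates.append(hits / n_users)
    return statistics.stdev(rates)


def theoretical_se(n_users, alpha=0.05):
    """sqrt(alpha * (1 - alpha) / n_users), with n_users the size of ONE sample."""
    return math.sqrt(alpha * (1 - alpha) / n_users)


# Same sample size, different number of repetitions: the spread stays put.
sd_50_repeats = rate_sd(10_000, 50, seed=1)
sd_200_repeats = rate_sd(10_000, 200, seed=2)

# Smaller sample size, same number of repetitions: the spread roughly doubles.
sd_small_sample = rate_sd(2_500, 200, seed=3)

print(f"n=10,000,  50 repeats: {sd_50_repeats:.5f}")
print(f"n=10,000, 200 repeats: {sd_200_repeats:.5f}")
print(f"n= 2,500, 200 repeats: {sd_small_sample:.5f}")
print(f"theory n=10,000: {theoretical_se(10_000):.5f}, n=2,500: {theoretical_se(2_500):.5f}")
```

Under these assumptions the number of repetitions only affects how well the histogram is filled in; the width of the distribution is set by the 10,000 users inside each sample.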

  • @lizlawler9027
    @lizlawler9027 2 years ago +4

    Will the normal distribution of false positives always have a mean of the alpha value you set? It seems so from running a couple of tests in your colab. Intuitively that makes sense, since the alpha value you set is the cut off probability in whatever probability distribution you set, so if you pick alpha = 0.0n, you lock yourself in to getting false positives 0.0n proportion of the time (or n% if you want to look at it that way). And so you'd be sampling with a chance of getting a false positive 0.0n of the time, which would inherently imply that your mean would be your alpha value (0.0n).
    Another question off the stats for a bit: is it normal for companies to provide questions devoid of any context? I would think your ability to set an appropriate alpha value would be informed by the actual hypothesis being tested, and if the alpha value will inherently set your probability distribution for false positives, doesn't it become more important to set an appropriate p-value?

    • @DataInterview
      @DataInterview 2 years ago

      Yes to your first question. Whatever alpha you set, the sampling distribution of the false positive rate is centered on alpha, and with a large sample size the CLT makes that distribution approximately normal. On your second question, yes: in reality, many projects start from ambiguous situations, and companies want to see how you frame the problem, understand it, and solve it.
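
Why the mean lands on alpha can be written out in one line. The notation here is an assumption introduced for illustration: X_i is 1 if user i is a false positive and 0 otherwise, each with probability alpha, and p-hat is the false positive rate of one sample of n users.

```latex
\begin{align*}
\hat{p} &= \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad X_i \sim \mathrm{Bernoulli}(\alpha) \\
\mathbb{E}[\hat{p}] &= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i]
  = \frac{1}{n}\cdot n\alpha = \alpha
\end{align*}
```

This step uses only linearity of expectation, so it holds for any n; the large sample size matters for the normal shape, not for the mean.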

  • @chiranjeeveemohapatra
    @chiranjeeveemohapatra 1 year ago

    The problem I had was I jumped straight to normal distribution using CLT. But never figured out binomial distribution. Well I need to study more.

  • @popo-je8ze
    @popo-je8ze 1 year ago

    I think the point is this (am I right? 🤔 help me check 🙏):
    X is the false positive rate
    after sampling many times
    we get X1, X2, …, Xn
    X1: the false positive rate from sampling 10,000 users the first time
    ……
    based on the C.L.T.
    the distribution of the "mean" of the false positive rate is a normal distribution with mu asymptotically equal to alpha

  • @abhirama
    @abhirama 1 year ago

    Something is off with the sample variance. Isn't the sample variance npq for a binomial distribution?

  • @hanzhuzhao7167
    @hanzhuzhao7167 2 years ago +3

    Hi Dan, I'm confused about the expected value and variance. A single sample out of 10,000 should have a probability of p (0.05), and its distribution is a Bernoulli distribution with expected value p. When we draw 10,000 samples, the distribution is a binomial distribution. According to the formula for the binomial distribution, the expected value should be n*p (10000*0.05 = 500), and the variance should be n*p*(1-p) = 475. I'm confused why your formula and numbers are different from these. Can you help explain? Thanks! Btw, all your videos are super helpful!!

    • @NnamdiNw
      @NnamdiNw 2 years ago

      The binomial distribution tells you the probability you’re gonna get a false positive n number of times from running the experiment. Instead, we are looking at the distribution of the false positive rates for a random sample.

    • @renatofillinich305
      @renatofillinich305 2 years ago +1

      I think he’s making quite a bit of confusion, as he’s using the Bernoulli distribution instead of the Binomial. The two are related, and depending on the question you can conceptualize it as either, but the graph and formulas used come from the Bernoulli distribution, which makes sense if you are talking about proportions, and based on how he’s approached the problem.
      Each observation comes from a Bernoulli process with E(X) = p, Var(X) = p(1-p), and what he is doing is estimating the p by taking the mean (the proportion is really a mean in disguise) of the 10k sample. So for each sample you’d get a proportion that is an estimation of p. Then the question is what happens if we repeat this process, i.e. what does a distribution of estimated means look like? The CLT tells us that the distribution of sample means will (asymptotically) tend to a Normal, with mean = p (same as the single distribution) and variance = p(1-p) / n (or standard error sqrt[ p(1-p) / n ] ). As you can see, no Binomial distribution is involved.
      Note: as I mentioned, the two are related and the problem can be conceptualized with either
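
The numbers in this thread line up once the count of false positives and the rate of false positives are kept apart. A small arithmetic sketch, using n = 10,000 and p = 0.05 from the comments above (the variable names are only for illustration):

```python
import math

n = 10_000   # users per sample
p = 0.05     # false positive probability per user

# Count of false positives in one sample: Binomial(n, p)
count_mean = n * p                # 500 false positives expected
count_var = n * p * (1 - p)       # 475
count_sd = math.sqrt(count_var)

# Rate of false positives in one sample: count / n
rate_mean = count_mean / n        # back to p
rate_var = count_var / n**2       # same as p * (1 - p) / n
rate_se = math.sqrt(rate_var)

print(f"count: mean={count_mean:.0f}, var={count_var:.0f}, sd={count_sd:.2f}")
print(f"rate:  mean={rate_mean:.4f}, var={rate_var:.3e}, se={rate_se:.5f}")
```

Both rows describe the same random quantity on two scales, which is why the formulas differ by factors of n and n squared.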

  • @walter_ullon
    @walter_ullon 2 years ago

    Thank you so much for this!

  • @oliviazhang2922
    @oliviazhang2922 2 years ago

    shouldn't the sample variance for Binomial Distribution be n*p*(1-p), instead of what you showed at 12:29 of p*(1-p)/n?

    • @eraheem
      @eraheem 2 years ago +2

      It's the variance of the sampling distribution of p.
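
The step from one formula to the other, assuming X is the binomial count of false positives among n users and p-hat = X/n is the sample proportion:

```latex
\begin{align*}
X &\sim \mathrm{Binomial}(n, p), \qquad \hat{p} = \frac{X}{n} \\
\mathrm{Var}(\hat{p}) &= \frac{\mathrm{Var}(X)}{n^{2}}
  = \frac{n\,p\,(1-p)}{n^{2}} = \frac{p\,(1-p)}{n}
\end{align*}
```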

  • @ritikadhiman7459
    @ritikadhiman7459 2 years ago

    What is statistics 101?

  • @leonardocasarsadeazevedo7022
    @leonardocasarsadeazevedo7022 4 months ago +1

    This is such an ill-posed question, it makes me angry 😂

  • @xinxinli8779
    @xinxinli8779 2 years ago

    It's not normal distribution but approximate normal, right? The random variable is not a continuous variable even as the sample size increases, i.e. at the minimum it's 1/10K but not lower than that.
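
The discreteness point can be made concrete by comparing the exact binomial probabilities with the normal curve. This is a rough sketch assuming n = 10,000 and p = 0.05; `binom_pmf` and `normal_pdf` are hypothetical helpers written with the standard library, and the pmf is computed on the log scale because 0.05**500 underflows in floating point.

```python
import math

n, p = 10_000, 0.05


def binom_pmf(k, n, p):
    """Exact Binomial(n, p) probability of k, computed via logs to avoid underflow."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


mean = n * p
sd = math.sqrt(n * p * (1 - p))
step = 1 / n  # smallest possible gap between two observable false positive rates

print(f"possible rates are multiples of {step}")
for k in (450, 500, 550):
    exact = binom_pmf(k, n, p)
    approx = normal_pdf(k, mean, sd)
    print(f"rate {k / n:.4f}: exact={exact:.6f}, normal approx={approx:.6f}")
```

The rate can only take values k/10,000, so the normal curve is an approximation to a discrete distribution; near the center the two agree closely, and the gap grows somewhat in the tails.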