Amazon Data Science Business Case | FAANG Interview Prep

  • Published Oct 2, 2024

Comments • 20

  • @datamar2891
    @datamar2891 2 years ago +7

    Hey Dan, my name is Sam. I'm new to this and early in my DS/MLE journey. I've only watched one of your case videos and am going to take a stab at it using the template you used in that one. I would definitely appreciate some feedback:
    Clarifying Questions:
    1. What does the current process look like?
    - How does Amazon currently calculate CLV of their shoppers?
    - What's the average lifespan of an Amazon customer?
    - What's Amazon's profit percentage from a single sale?
    2. What's the business objective?
    - Is the goal to improve CLV, considering the latter half of the prompt?
    3. Data sources:
    - Customer-specific data: number of purchases, whether they were a Prime member (month-to-month vs annual), total spent, time duration, address (including city and country)
    Fleshing out the ML Pipeline:
    1) EDA:
    - Observe distributions of the features
    - Correlation analysis between Xs and Y using something like Pearson's correlation
    2) Data Pre-processing
    - Handle missing values: 1) If they account for over 50% of the data, remove the feature. 2) Build a classifier and use that feature as your target variable. 3) You can use clustering.
    - Encode categorical features: 1) If a feature has few categories, you can do one-hot/dummy encoding. 2) If it has a high number of categories, you can do numerical target encoding.
    - Normalize data (no need to if using a decision-tree-based model like Random Forest)
    3) Feature Engineering
    - If we had more detailed purchase history with timestamps, we could break it down with some time decomposition: month - day - hour - minute - second
    - Can't think of anything else
    4) Feature Selection
    - To avoid curse of dimensionality we can try PCA, Random Forest variable Importance or L1 Regression
    5) Model Selection
    - Since we are predicting a continuous, real-valued target, we will be doing Regression
    - Random Forest Regressor comes to mind or XGBoost Regressor given their robust reputation
    - Hyperparameter tuning: (depth of trees, number of trees, min number of sample per leaf, pruning, learning rate in case of XGBoost)
    6) Model Evaluation
    - MSE, and possibly MAE since it is less sensitive to outliers and we don't necessarily want our model to be sensitive to them
    - K-fold cross validation to make sure our model generalizes well
    7) Productionize
    - Not too experienced here other than using a REST API using one of the cloud service providers like AWS, Azure or even something like Databricks which I know has that functionality
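    The evaluation step Sam describes (K-fold cross-validation scored with MAE) can be sketched with no dependencies. The mean-predictor baseline and the toy spend values below are illustrative stand-ins for a real regressor and real CLV data:

```python
# Minimal K-fold cross-validation with MAE, using a mean-predictor
# baseline as a stand-in for a real regressor (RF/XGBoost).

def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to outliers than MSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_scores(y, k=5):
    """Score a mean-baseline predictor with K-fold CV on target y."""
    fold_size = len(y) // k
    scores = []
    for i in range(k):
        # Hold out fold i, "train" (here: average) on the rest.
        test = y[i * fold_size:(i + 1) * fold_size]
        train = y[:i * fold_size] + y[(i + 1) * fold_size:]
        prediction = sum(train) / len(train)  # "model" = global mean
        scores.append(mae(test, [prediction] * len(test)))
    return scores

# Toy customer spend values (illustrative only).
spend = [120.0, 95.0, 200.0, 80.0, 150.0, 110.0, 90.0, 175.0, 130.0, 105.0]
scores = k_fold_scores(spend, k=5)
print(sum(scores) / len(scores))  # average MAE across folds
```

    In practice you would swap the mean baseline for the fitted RF/XGBoost model and compare the per-fold scores to check generalization.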

    • @DataInterview
      @DataInterview  2 years ago +3

      Hey Sam👋. Welcome aboard! Interviewing is challenging, but with practice, you will succeed =D. Thanks for your solution. I placed my feedback in quotations below👇
      Clarifying Questions:
      1. What does the current process look like?
      - How does Amazon currently calculate CLV of their shoppers?
      - What's the average lifespan of an Amazon customer?
      - What's Amazon's profit percentage from a single sale?
      👉 “Great questions to start off. The only thing I would advise is that I would generally avoid questions that the interviewer would presume you already know the answer to. This is to demonstrate expertise from the get-go. For instance, if you are a practitioner in the field, or you’ve already done some research, you would anticipate that CLV is calculated over some fixed horizon, let’s say 1 month, 3 months, 6 months, 12 months. I would provide some info up front, then confirm with the interviewer.”
      2. What's the business objective?
      - Is the goal to improve CLV, considering the latter half of the prompt?
      3. Data sources:
      - Customer-specific data: number of purchases, whether they were a Prime member (month-to-month vs annual), total spent, time duration, address (including city and country)
      👉 “Great list of signals!”
      Fleshing out the ML Pipeline:
      1) EDA:
      - Observe distributions of the features
      - Correlation analysis between Xs and Y using something like Pearson's correlation
      2) Data Pre-processing
      - Handle missing values: 1) If they account for over 50% of the data, remove the feature. 2) Build a classifier and use that feature as your target variable. 3) You can use clustering.
      - Encode categorical features: 1) If a feature has few categories, you can do one-hot/dummy encoding. 2) If it has a high number of categories, you can do numerical target encoding.
      - Normalize data (no need to if using a decision-tree-based model like Random Forest)
      3) Feature Engineering
      - If we had more detailed purchase history with timestamps, we could break it down with some time decomposition: month - day - hour - minute - second
      - Can't think of anything else
      4) Feature Selection
      - To avoid curse of dimensionality we can try PCA, Random Forest variable Importance or L1 Regression
      👉 “Agreed, but remember interpretation matters for the latter case. So, I’d stick to L1 Regression or RF.”
      5) Model Selection
      - Since we are predicting a continuous, real-valued target, we will be doing Regression
      - Random Forest Regressor comes to mind or XGBoost Regressor given their robust reputation
      - Hyperparameter tuning: (depth of trees, number of trees, min number of samples per leaf, pruning, learning rate in case of XGBoost)
      👉 “+1”
      6) Model Evaluation
      - MSE, and possibly MAE since it is less sensitive to outliers and we don't necessarily want our model to be sensitive to them
      - K-fold cross validation to make sure our model generalizes well
      👉 “+1”
      7) Productionize
      - Not too experienced here other than using a REST API using one of the cloud service providers like AWS, Azure or even something like Databricks which I know has that functionality
      👉 “Depending on the level of depth the interviewer expects, follow-up questions may vary. But generally the framework works like this: you need to build an ETL pipeline, establish a cron job (not sure what Amazon readily uses, but you could use Airflow), and wrap your model in a REST API that includes preprocessing, training and prediction. The results should be stored in DBs for prediction, monitoring and so on.”
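      The preprocess → train → predict pattern Dan describes, with results persisted for monitoring, can be sketched as a plain-Python service class. The class name, the mean-baseline "model" and the in-memory dict are hypothetical stand-ins for a real model, REST layer and database:

```python
# Sketch of the serving-side wrapper: preprocessing, training and
# prediction behind one interface, with predictions stored for later
# monitoring. The in-memory dict stands in for a real DB.

class CLVService:
    def __init__(self):
        self.model_mean = None      # trivial stand-in for RF/XGBoost
        self.prediction_store = {}  # stand-in for a predictions DB

    def preprocess(self, record):
        """Normalize a raw input record to model features."""
        return {"total_spent": float(record.get("total_spent", 0.0)),
                "n_purchases": int(record.get("n_purchases", 0))}

    def train(self, records):
        """'Train' by averaging historical spend (baseline model)."""
        spends = [self.preprocess(r)["total_spent"] for r in records]
        self.model_mean = sum(spends) / len(spends)

    def predict(self, customer_id, record):
        """Predict CLV and persist the result for monitoring."""
        feats = self.preprocess(record)  # a real model would use feats
        pred = self.model_mean
        self.prediction_store[customer_id] = pred
        return pred
```

      A REST framework would then expose `train` and `predict` as endpoints, and the cron job (e.g. an Airflow DAG) would call them on a schedule.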
      Hope this helps! 😉 If you have any questions, feel free to reach out at dan@datainterview.com. Happy interviewing!

    • @datamar2891
      @datamar2891 2 years ago +1

      @@DataInterview Thank you for your reply, Dan. I appreciate the feedback.

  • @huanchenli773
    @huanchenli773 7 days ago

    I don't understand why all your business cases have to relate to ML

  • @DeepakRajput-wl2pi
    @DeepakRajput-wl2pi 2 years ago +4

    Please do more of these

  • @KS-df1cp
    @KS-df1cp 2 years ago +3

    Thank you! One thing that I learnt is your style! I jumped straight into defining the target, getting features and talking about metrics. I think my approach is robotic and the impact is missing. Will definitely practice some on your channel.
    There is no way my thoughts can occur so fast though!! Is that acceptable? How can I think faster in a system design interview? Is practice the only way? Thank you.

    • @DataInterview
      @DataInterview  2 years ago +1

      Thanks! Many of us start out awkward when trying something new for the first time. But with practice and experience you will get better! - Dan

    • @KS-df1cp
      @KS-df1cp 2 years ago

      @@DataInterview Thank you :) this is really helpful.

  • @Aidan_Au
    @Aidan_Au 2 years ago +1

    Thank you Dan for walking us through and providing commentary in another question.
    Whoever doesn't leave a comment for your feedback is missing out!

  • @TheMISBlog
    @TheMISBlog 2 years ago +1

    Very informative, thanks Dan

  • @chetnamohapatra5181
    @chetnamohapatra5181 2 months ago

    Hey Dan! This is exactly what I wanted. Not just the template, but going into the details of each step and explaining why we chose something over the alternative. This is amazing! I am going to subscribe to practice more of these.

  • @anthonykinruizcalvo7516
    @anthonykinruizcalvo7516 2 years ago +1

    Nice video! Can you please explain how you do numerical encoding when there is a large number of products? I couldn't understand how you avoid adding too many features if you are going to get the avg/total per product. Thanks!

    • @DataInterview
      @DataInterview  2 years ago +2

      Hey Anthony, thanks for the question. The numerical encoding works like this:
      For each categorical value, you essentially try to have a numerical representation of it.
      Suppose you have a category of product items. Instead of applying one-hot encoding to each product, which is going to cause your feature space to explode, you aggregate each product on some continuous variables like historical sales, volume, inventory count and such.
      So now you have a dense representation for each product item, and you merge those columns into your model data.
      Hope this clarifies!

    • @anthonykinruizcalvo7516
      @anthonykinruizcalvo7516 2 years ago

      @@DataInterview Thank you very much!

  • @rezarafieirad
    @rezarafieirad 2 years ago

    Just another perfect video. Thanks!

  • @mariullom8105
    @mariullom8105 2 years ago

    I absolutely love your videos.

  • @mohammadrahmaty521
    @mohammadrahmaty521 1 year ago

    Amazing! Thanks!

  • @sonug2924
    @sonug2924 1 year ago

    Great work ! Thanks Dan