Building a Recommendation System in Python

  • Published on Jul 8, 2024
  • ===== Likes: 652 👍: Dislikes: 21 👎: 96.88% : Updated on 01-21-2023 11:57:17 EST =====
    Ever wonder how the recommendation algorithms behind large tech companies (Facebook, Google, Apple, Netflix, Amazon, etc.) work? Look no further! I explain how recommendation systems work and how to create your own using matrix factorization and K-means clustering.
    I create a recommendation system for movies. So, stay tuned! ;)
    Github for code: github.com/SpencerPao/Data_Sc...
    Data Citation:
    F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1-19:19.
    Data Link: (MovieLens)
    grouplens.org/datasets/moviel...
    0:00 - Why do we care about Recommendation Algorithm & System?
    1:22 - Game Plan!
    1:38 - Collaborative Filtering and Content-Based Filtering & Objective
    3:39 - Google Colab Setup & Data
    7:18 - Matrix Factorization Model Initialization & Training / Tuning Model
    10:30 - Kmeans Clustering & Movie Recommendations
  • Science & Technology

Comments • 90

  • @nayibahued5955
    @nayibahued5955 2 years ago +111

    deepest data scientist voice in the world

  • @user-jj3we9jv9i
    @user-jj3we9jv9i 8 months ago +3

    Holy cow! That is a really good recommendation system! Humbling tutorial as well!

  • @icequeen2778
    @icequeen2778 1 year ago

    Would love to see more of this type of video!

  • @ayushthombare9235
    @ayushthombare9235 2 years ago

    Very informative and useful video.... Thank you so much

  • @dan7582
    @dan7582 2 years ago

    Nice video, keep up the good work!!

  • @naderkhaled9410
    @naderkhaled9410 2 years ago +14

    Dude I know this is off topic, but ur voice is insanely satisfying !!

  • @ea1766
    @ea1766 10 months ago

    Easily the best video on this subject; all the other videos were so boring and mundane. I wish YouTube promoted this video more to the top.

  • @vincent_hall
    @vincent_hall 1 year ago

    Thank you sir.
    I have forked it and shall have a go collaborating with a friend.

  • @stmasanti
    @stmasanti 8 months ago

    Great video!

  • @Agent7155
    @Agent7155 2 years ago +6

    Ended up searching up for movies to watch at the end xD

  • @folahan
    @folahan 1 year ago +1

    The first time I will follow a training using my own dataset and I didn't get any error from start to finish.

  • @gauravpoudel7288
    @gauravpoudel7288 10 months ago

    Thanks for the awesome content.
    BTW Is that really your voice?

  • @Bjorn_R
    @Bjorn_R 5 months ago

    Hello Spencer, I'm split between collaborative recommender systems and a confirmation tree project for my master's thesis. Which would be most beneficial?

  • @marcelomlr
    @marcelomlr 3 months ago

    Hey man, nice video, and thanks for the tutorial. I'm actually trying to build a recommendation system for online courses, like udemy, but I can't find any datasets for user reviews to make the collaborative filtering. So I decided to manually create a dataset, and thought of choosing like 4 subjects and putting some users to rate like 10-15 courses of each subject. Do you know if something like that can work, or have any tips you can give me?

  • @NobixLee
    @NobixLee 1 year ago

    Great video, but how do we then get scores for the User_ID? Something like there is this much probability that User_ID 2 will be in cluster 2? Thank you.

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago +1

      One way that you can go about this:
      You'd need more data to do this accurately, since the dataset I am using in this video has only 4 features: userID, movieID, rating, and timestamp. However, with the approach from the video, you can take the average of the ratings that each user has applied to the movies in each cluster. Normalize across all clusters and sort by highest rating per cluster for the user. Whichever movies in the cluster have not been seen by the user should be recommended to the user. I am open to hearing your thoughts on this!
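
A minimal sketch of the per-cluster averaging described above, in plain Python. The toy `ratings` triples, the `movie_cluster` mapping, and both function names are hypothetical stand-ins rather than code from the video, and the normalized values are relative preference shares, not calibrated probabilities.

```python
from collections import defaultdict

# Hypothetical toy data: (user_id, movie_id, rating) triples and a
# movie -> cluster assignment such as the one K-means produces.
ratings = [
    (1, "A", 5.0), (1, "B", 4.0), (1, "C", 1.0),
    (2, "C", 5.0), (2, "D", 4.5), (2, "A", 2.0),
]
movie_cluster = {"A": 0, "B": 0, "E": 0, "C": 1, "D": 1, "F": 1}


def cluster_scores(user_id, ratings, movie_cluster):
    """Return {cluster: share} built from the user's mean rating per cluster."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for uid, movie, rating in ratings:
        if uid == user_id and movie in movie_cluster:
            cluster = movie_cluster[movie]
            totals[cluster] += rating
            counts[cluster] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    norm = sum(means.values())
    # Normalize so the shares sum to 1 across the clusters the user has rated.
    return {c: m / norm for c, m in means.items()} if norm else {}


def recommend_unseen(user_id, ratings, movie_cluster):
    """List unseen movies from the user's highest-scoring cluster."""
    scores = cluster_scores(user_id, ratings, movie_cluster)
    if not scores:
        return []
    best = max(scores, key=scores.get)
    seen = {movie for uid, movie, _ in ratings if uid == user_id}
    return sorted(m for m, c in movie_cluster.items() if c == best and m not in seen)


print(cluster_scores(1, ratings, movie_cluster))
print(recommend_unseen(1, ratings, movie_cluster))
```

A user with no ratings gets an empty result here, so a fallback such as globally popular movies would still be needed for new users.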

  • @vinayvajrala4366
    @vinayvajrala4366 8 days ago

    A big like for that voice

  • @vaiterius
    @vaiterius 8 months ago +1

    How do you know which libraries/functions to use to make these algorithms? I’m trying to make a videogame recommendation system from a Steam games dataset, similar to what you’re doing here

    • @hamzak5674
      @hamzak5674 5 months ago

      Hey, I’m making something similar using the RAWG dataset. Did you manage to get anywhere? I’m planning to start in the next few days

    • @SpencerPaoHere
      @SpencerPaoHere 4 months ago

      Python typically wraps a lot of the theoretical machinery implemented in C/C++. When it comes to a recommendation system base, tensorflow/keras are the building blocks and are quite effective when building something from scratch or something fine-tunable.

  • @user-cn4co9mt4p
    @user-cn4co9mt4p 1 month ago

    Dude is not only learning deep learning but deep voice. Damn

  • @dustinvo6097
    @dustinvo6097 2 years ago

    Hi Spencer. Nice video as always. I am working on a problem where the users interact with a banking website and app. So I have userid, the interaction name, timestamps and some demographic variables. I'm trying to cluster them into some "personas" based on their interactions and timestamps for business use. Do you have any ideas how to do that? Thanks.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      Glad you enjoyed it! That use case can definitely be quite tricky. You'd first need to decide which personas you are trying to bucket users into. Based on those personas, what actions (i.e., features) would link them to said persona?
      I'd suspect that a lot of A/B testing would be required to test your hypotheses. But if it's literally just something related to money management via banking, I'd probably look at it from the angle of on-time payments, quantity, frequency, tiered users, time of withdrawal from ATM, fees encountered, zip codes, and features related to that (excluding PII unless the TOS states as such).

    • @dustinvo6097
      @dustinvo6097 2 years ago

      @@SpencerPaoHere Thanks for the advice. Another question: if I focus on just userid and interactionname, how can I cluster the userids based on the interactions (withdraw, request credit score, ...) when they are repeated categorical measurements? Is K-modes a good choice?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      @@dustinvo6097 I think I have just the video for you :) th-cam.com/video/NKQpVU1LTm8/w-d-xo.html
      (If you haven't seen it already)

  • @elisama2936
    @elisama2936 1 year ago +1

    Hello! :) Ty for the video. I have a question regarding the line " def __init__(self, n_users, n_items, n_factors=20)". Can you explain why 20?

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago +2

      Number of latent factors was arbitrary! Though, you could optimize for that value.
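
One hedged way to "optimize for that value" is to train with several candidate sizes and compare the error on ratings held out from training. Only the `n_users`, `n_items`, and `n_factors=20` parameter names come from the signature quoted above; this sketch swaps the video's model for a dependency-free SGD matrix factorization, and `train_mf`, `mse`, and the toy data are hypothetical.

```python
import random


def train_mf(ratings, n_users, n_items, n_factors=20, epochs=200, lr=0.01, seed=0):
    """Fit user and item factor vectors with plain SGD on squared error."""
    rng = random.Random(seed)
    users = [[rng.uniform(0, 0.05) for _ in range(n_factors)] for _ in range(n_users)]
    items = [[rng.uniform(0, 0.05) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(pu * qi for pu, qi in zip(users[u], items[i]))
            err = r - pred
            for f in range(n_factors):
                pu, qi = users[u][f], items[i][f]
                users[u][f] += lr * err * qi  # gradient step for the user factor
                items[i][f] += lr * err * pu  # gradient step for the item factor
    return users, items


def mse(ratings, users, items):
    """Mean squared error of the dot-product predictions."""
    errors = [
        (r - sum(pu * qi for pu, qi in zip(users[u], items[i]))) ** 2
        for u, i, r in ratings
    ]
    return sum(errors) / len(errors)


# Hypothetical toy data: (user_index, item_index, rating).
train = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
         (2, 2, 5.0), (2, 3, 4.0), (3, 1, 2.0), (3, 3, 5.0)]
held_out = [(0, 2, 1.0), (1, 1, 4.0), (2, 0, 1.0), (3, 2, 5.0)]

# Compare a few candidate sizes on ratings the model never trained on.
for k in (2, 5, 20):
    users, items = train_mf(train, n_users=4, n_items=4, n_factors=k)
    print("n_factors =", k, "held-out MSE =", round(mse(held_out, users, items), 3))
```

On a dataset this small the comparison is only illustrative; with real data the held-out error is what indicates whether more factors help or just overfit.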

    • @elisama2936
      @elisama2936 1 year ago +1

      @@SpencerPaoHere Thank you for your answer!

  • @erick388
    @erick388 1 year ago

    Heyo, and thanks for the video! This was incredibly helpful to learn and understand how to make something rudimentary (even if I imagine a full fledged system would be SO much more complex in how you measure input from the user and live data to form a more robust recommendation). I do have one quick question though, since when I tried making my own slight version (mostly changing the dataset and some small aspects), I came across a slight issue regarding the loading aspect.
    To attempt to make this run faster, I used pandas to merge the ratings and movies CSVs, and then I shuffled and split them to have an even distribution with fewer values (this is for a class of mine more than anything, and 100k entries is a lot to run during a presentation). The columns remain the same, the headers remain the same, and all that has 'shifted' is the order in which the rows appear (which is to say it's not a bunch of Toy Story reviews in a row, not a bunch of Star Wars reviews in a row, etc.), and I got this error.
    self.ratings.movieId = ratings_df.movieId.apply(lambda x: self.movieid2idx[x])
    self.ratings.userId = ratings_df.userId.apply(lambda x: self.userid2idx[x])
    It processes movieid correctly. But when we reach the application of the lambda to the userid it proceeds to return.
    Key Error, NaN.
    Given that the CSV is the same, save for the alteration to the order of the rows but not the headers, and the values are all indeed numeric, what would be a feasible way to fix and remove this error? Or could it be the way that I shuffled the dataset that's causing it to assume that the numeric values are NaN, and that there's a particular way I have to shuffle the values?
    Also on a fun sidenote, I've run this both with and without CUDA installed. I didn't particularly find anything that changed, but maybe that's just me. It runs regardless, though I presume that will create its own problems when it comes down to it.

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago

      Glad you enjoyed it! This might be an issue with how you are shuffling the data together. There could be many reasons why this is the case. Though, I'd recommend taking a small subset of your dataset and running the cleaning algorithm from there. (It'd be easier to debug.)
      It seems you are attempting to combine 2 datasets based on movieId. Have you attempted to use join statements (an inner join, to be specific)? Also double-check that the casting is appropriate. You may be getting a null value due to the userID somehow becoming a string.
      Otherwise, could you provide an example of what the current dataset looks like and what you are trying to achieve?
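
A small sketch of the two checks suggested above (missing values and casting), written in plain Python so it runs without the notebook. The names `userid2idx` and `movieid2idx` come from the lines quoted in the question; `clean_ids`, `build_index`, and the toy rows are hypothetical.

```python
import math


def clean_ids(rows):
    """Drop rows whose userId or movieId is missing, and cast the ids to int."""
    cleaned = []
    for row in rows:
        try:
            user = float(row["userId"])
            movie = float(row["movieId"])
        except (TypeError, ValueError):
            continue  # None, empty string, or non-numeric text
        if math.isnan(user) or math.isnan(movie):
            continue  # NaN, for example from a join row with no match
        cleaned.append({**row, "userId": int(user), "movieId": int(movie)})
    return cleaned


def build_index(rows, key):
    """Map each distinct id to a contiguous index, built from the same rows."""
    return {value: idx for idx, value in enumerate(sorted({r[key] for r in rows}))}


# Hypothetical rows showing the two failure modes mentioned in the reply.
rows = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": "2", "movieId": 10.0, "rating": 3.5},         # id stored as a string
    {"userId": float("nan"), "movieId": 20, "rating": 5.0},  # missing after a join
]
rows = clean_ids(rows)
userid2idx = build_index(rows, "userId")
movieid2idx = build_index(rows, "movieId")
print(userid2idx, movieid2idx)
```

Building both lookup tables from the same cleaned rows that are later mapped means every id has a key. In pandas, the rough equivalents would be an inner `merge`, `dropna(subset=[...])`, and `astype(int)`.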

    • @erick388
      @erick388 1 year ago

      @@SpencerPaoHere Yeah I got it working. I think it was a messed up join on my end which prematurely ended my experimenting with the dataset, so all's good!
      On another sidenote, as I'm still learning some machine learning stuff, I have friends who keep talking about accuracy for machine learning algorithms, and the more I look into it I begin to wonder how that may apply here, or if it's even possible to quantify here. I know that MSE calculates the error between predicted values and actual rating values (do correct me if I'm wrong), which makes me question if 'accuracy' or 'error' are actual aspects of this algorithm, or if that's related to other forms of algorithms that are more specific with their goal?
      Regardless! Big thanks for the help and awesome video. This was honestly a pretty good starting point as it helped me get curious about a lot of topics I had never got to touch before.

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago

      @@erick388 Glad you enjoyed the content!
      Regarding the accuracies, there are actually several metrics you can go about optimizing for. A great optimizer function would be Adam. Accuracy by itself is not that 'accurate'; you need precision as well. Take a look into F1 scores. That'll help.
      Increasing "accuracy" comes down to additional features, more data, and different ML algorithms, or tuning algorithms. That's essentially the world of Data Science.
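
One possible way to get precision, recall, and F1 for a recommendation list is sketched below. It assumes that a movie counts as "relevant" when the user's held-out rating is at least 4.0. That threshold, the toy data, and the function name are assumptions for illustration and are not from the video.

```python
def precision_recall_f1(recommended, relevant):
    """Score a list of unique recommendations against the set of liked items."""
    recommended = list(recommended)
    relevant = set(relevant)
    # True positives: recommended items the user actually liked.
    true_positives = sum(1 for item in recommended if item in relevant)
    precision = true_positives / len(recommended) if recommended else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical held-out ratings for one user, hidden from the model in training.
held_out = {"A": 5.0, "B": 2.0, "C": 4.5, "D": 4.0}
relevant = {movie for movie, rating in held_out.items() if rating >= 4.0}

# Hypothetical recommendations produced by the model for the same user.
recommended = ["A", "B", "E", "C"]

precision, recall, f1 = precision_recall_f1(recommended, relevant)
print(precision, recall, f1)
```

Under this convention a recommended movie that the user rated low, or never rated at all, counts as a false positive, and a liked movie that was not recommended counts as a false negative. Treating unrated movies as misses is a simplification, since the user may simply not have seen them yet.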

    • @erick388
      @erick388 1 year ago

      @@SpencerPaoHere Gotcha, I'll look into that too. It's a lot to take in but it's always fun and interesting to learn. Appreciate all the advice!

    • @erick388
      @erick388 1 year ago

      @@SpencerPaoHere Actually, I suppose one final question is how I would qualify something as a false positive, or a true positive (or really any of the prerequisite information) for the calculations of F1 Scores (such as the requirements for Precision, Recall, etc). I'm not quite sure how to do that given that in this example here we're giving a recommendation of ten movies based on their overall rating, and I don't really know what would quantify as a false positive (or a true positive).

  • @obi666
    @obi666 8 months ago

    I'm not sure what these clusters are (for example Cluster #1 and the printed titles). Are they some sort of groups of similar movies?

    • @SpencerPaoHere
      @SpencerPaoHere 4 months ago +1

      Yep! Each cluster represents a group of data points that are similar.

  • @sachamallet5157
    @sachamallet5157 11 months ago

    Hi, I would like to know if the Mac mini M2 Pro with only 16 GB of RAM is enough for analyzing 8 GB of data. Thank you so much for your feedback

    • @SpencerPaoHere
      @SpencerPaoHere 11 months ago

      Yeah, it should be good for smaller datasets. Though you never know until you try! (Maybe try 2 GB, see how long that takes, and approximate from there.)

  • @ryderthewatermelon611
    @ryderthewatermelon611 15 days ago

    If I were to adapt this methodology to recommend songs based on user song selection, and used a dataset with song parameters, how would I do that?

  • @abi_xyz
    @abi_xyz 1 year ago

    great

  • @bhadauriaji
    @bhadauriaji 2 years ago +1

    Hi Spencer. I was working on a similar problem where I have users who have listened to a set of songs, and based on their listen history I have to recommend new songs to the user, about 10. How do I do that?
    Also, I don't have ratings for songs; I have a listen count for each song, and the listen count is per user.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +4

      You'll probably need additional features such as length of listen, genre, artist, etc. for a better recommendation algorithm.
      You could do the frequentist approach (to start) where you recommend the song that has been listened to the most, and slowly make your application more advanced once you've accumulated more focused data.

    • @bhadauriaji
      @bhadauriaji 2 years ago

      @@SpencerPaoHere The problem is I can't have more features. My dataset has UserId, SongID, listen count, artist, song title, and date of the song only. I have to build a recommendation engine using only that. Also, I tried using Kmeans and some brute-force filtering techniques but I'm not getting good accuracy.

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +2

      @@bhadauriaji Unfortunately, those features aren't going to be doing recommendations justice. You could, however, do a weighted sampling song recommendation based on hits. It's not perfect, but it may be what you are looking for.
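
A minimal sketch of weighted sampling by hits, assuming the data is available as `(user_id, song_id, listen_count)` triples with positive counts. The function name, the toy `listens` list, and the choice to exclude songs the user has already heard are assumptions made for illustration.

```python
import random
from collections import Counter


def recommend_by_hits(listens, user_id, k=10, seed=None):
    """Sample up to k unheard songs, weighted by their total listen count."""
    rng = random.Random(seed)
    heard = {song for uid, song, _ in listens if uid == user_id}
    hits = Counter()
    for _, song, count in listens:
        if song not in heard:
            hits[song] += count  # total listens across all users
    candidates = dict(hits)
    picks = []
    while candidates and len(picks) < k:
        songs = list(candidates)
        weights = [candidates[s] for s in songs]  # assumed to be positive
        choice = rng.choices(songs, weights=weights, k=1)[0]
        picks.append(choice)
        del candidates[choice]  # remove it so no song is picked twice
    return picks


# Hypothetical toy data: (user_id, song_id, listen_count).
listens = [
    (1, "s1", 10), (1, "s4", 5),
    (2, "s1", 3), (2, "s2", 7),
    (3, "s2", 2), (3, "s3", 1),
]
print(recommend_by_hits(listens, user_id=1, k=10, seed=0))
```

Popular songs are more likely to be drawn early, while less popular ones still get an occasional chance, which keeps the list from being identical for every user.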

    • @bhadauriaji
      @bhadauriaji 2 years ago

      @@SpencerPaoHere Thanks a lot for the info, will try that surely. 🤗

  • @casewhite5048
    @casewhite5048 2 years ago

    How do you set up a rating system for the output of movies? Let's say it recommends a movie you never want to watch, like Fried Green Tomatoes recommending Avengers: Endgame. Can you tell it to rate it 10/10, train it to find more clusters with higher ratings, and train it to find more of these over time as more movies come out?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      There are many ways that you can go about doing this: I'd check out the ELO/FIDE rating system. The user manually clicks either "Yes" or "No" depending on whether they like the recommendation. You can use this system to tailor prediction output to the customer.
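
For reference, the standard Elo update can be sketched as below. How "Yes"/"No" clicks are turned into match results is not specified in the reply; this sketch assumes a "Yes" on one title and a "No" on another is treated as the first title winning a head-to-head match. The `k=32` step size and the starting rating of 1500 are conventional illustrative values, not settings from the video.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b); score_a is 1.0 for a win, 0.0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Hypothetical per-user movie ratings, all starting at the same value.
movie_elo = {"Avengers: Endgame": 1500.0, "Fried Green Tomatoes": 1500.0}

# Assumed feedback: "Yes" on the first title, "No" on the second.
movie_elo["Avengers: Endgame"], movie_elo["Fried Green Tomatoes"] = elo_update(
    movie_elo["Avengers: Endgame"], movie_elo["Fried Green Tomatoes"], score_a=1.0
)
print(movie_elo)
```

Sorting candidate movies by these per-user ratings would then push titles the user keeps accepting toward the top of the list.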

  • @kain5244
    @kain5244 2 years ago

    thanks

  • @user-vo2lc5he9m
    @user-vo2lc5he9m 1 month ago

    Hello brother, can I use any movie dataset from Kaggle?

  • @christianmoreno7390
    @christianmoreno7390 1 year ago

    dang bro do you practice retention ??

  • @lilyh4573
    @lilyh4573 2 years ago +1

    I'm sorry I was distracted by your good looks xD

  • @maximshidlovski23
    @maximshidlovski23 1 year ago

    Hi Spencer, thanks for the video. I am currently working on the problem of creating a tag-based recommendation system. The user has a list of tags of interest, and the system needs to recommend content based on those tags and on words that are hyperonyms and hyponyms of these tags. I have the user's UserId and FavoriteUserTagsIds, and the content's ContentID and ContentTagsIds. Do you have any ideas how to do that? What is the best way to create a tag-based recommendation system?
    Thanks.

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago +1

      This seems like an NLP type problem! You can check out a generalized large language model to see if your keywords exist within its vocabulary. Then, using its word embeddings, you can perhaps utilize the distances between the vectors as a gauge behind the meaning. Then, you can plug in the output of the NLP model to a recommendation system.
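
The "distances between the vectors" idea can be sketched with cosine similarity. The three-dimensional `embeddings` below are hand-made placeholders rather than output from any real language model, and `rank_content` is a hypothetical helper that scores each content item by its best-matching tag.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_content(user_tags, content_tags, embeddings):
    """Order content ids by the best similarity between any user tag and any content tag."""
    scores = {}
    for content_id, tags in content_tags.items():
        sims = [
            cosine_similarity(embeddings[u], embeddings[t])
            for u in user_tags
            for t in tags
            if u in embeddings and t in embeddings  # skip out-of-vocabulary tags
        ]
        scores[content_id] = max(sims, default=0.0)
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder vectors; a real system would look these up in a trained model.
embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "animal": [0.9, 0.1, 0.0],
    "car": [0.0, 1.0, 0.0],
}
content_tags = {"c1": ["animal"], "c2": ["car"]}

print(rank_content(["dog"], content_tags, embeddings))
```

Because a broader or narrower term usually sits near its related term in embedding space, content tagged "animal" can rank highly for a user who follows "dog" even though the tags do not match exactly.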

    • @maximshidlovski23
      @maximshidlovski23 1 year ago

      @@SpencerPaoHere Thanks, I came up with a similar solution yesterday, now I'm working on implementing it.

  • @sospixs
    @sospixs 1 year ago

    Hi Spencer
    Thanks for your video.
    I've arranged the code, but got stuck at the tqdm for loop:
    len(losses) = 0
    for it in tqdm(range(num_epochs)):
    ....
    ....
    ZeroDivisionError Traceback (most recent call last)
    Input In [59], in ()
    11 optimizer.step()
    12 #print(loss.item())
    ---> 13 print("iter #{}".format(it), "Loss:", sum(losses) / len(losses))
    ZeroDivisionError: division by zero
    any ideas ?

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago

      Yeah. Whatever is populating your losses is not doing so correctly, or there is a divergence issue. Here len(losses) == 0, so you'd need to figure out why the length is zero.
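
A guarded version of the averaging step, as a sketch. One possible cause of an empty `losses` list is that the inner loop over the batches never runs (for example, an empty data loader), though that is a guess about this particular notebook. `mean_loss`, `run_epochs`, and the list of pre-computed batch losses are hypothetical stand-ins for the real training step.

```python
def mean_loss(losses):
    """Average the batch losses, or return None when no batch ran."""
    return sum(losses) / len(losses) if losses else None


def run_epochs(batch_losses, num_epochs):
    """Report the mean loss per epoch without dividing by zero."""
    for it in range(num_epochs):
        losses = []
        for batch_loss in batch_losses:  # stand-in for the real training step
            losses.append(batch_loss)
        avg = mean_loss(losses)
        if avg is None:
            print("iter #{}".format(it), "no batches were processed; check the data loader")
        else:
            print("iter #{}".format(it), "Loss:", avg)


run_epochs([1.0, 3.0], num_epochs=2)  # normal case
run_epochs([], num_epochs=1)          # empty loader: prints a hint instead of crashing
```

The guard only hides the crash; the underlying question of why no loss was recorded still needs to be answered.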

    • @sospixs
      @sospixs 1 year ago

      @@SpencerPaoHere Yep,
      I'm using Jupyter on my PC, and it is running with GPU: False.
      I think that's the problem.

  • @ujjwal.kandel
    @ujjwal.kandel 2 years ago

    How would I pass a movie title to the recommender and get a list of recommendations?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      Great question! You might have to change the model itself to be more 'linear' to return a movie title that is most similar to the input.
      With the Kmeans algorithm, you can technically "Pass in a movie title" and the list would be the cluster associated with that movie title. You can then sort by shortest distance and get the top most rated movie. Some additional coding will be required to do that.
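
A rough sketch of the "pass in a movie title" idea: look up the title's cluster, then sort the other members by distance to it. It assumes the learned item embeddings and the K-means labels are already available keyed by title. The `embeddings` and `labels` dictionaries, their toy values, and `recommend_similar` are hypothetical.

```python
import math


def recommend_similar(title, embeddings, labels, top_n=10):
    """Return up to top_n titles from the same cluster, nearest first."""
    if title not in embeddings:
        raise KeyError("unknown title: {}".format(title))
    target = embeddings[title]
    cluster = labels[title]
    candidates = [t for t in embeddings if t != title and labels[t] == cluster]
    # Shortest Euclidean distance in embedding space comes first.
    candidates.sort(key=lambda t: math.dist(embeddings[t], target))
    return candidates[:top_n]


# Hypothetical 2-D item embeddings and cluster labels.
embeddings = {
    "Toy Story": (0.0, 0.0),
    "A Bug's Life": (0.1, 0.0),
    "Shrek": (0.5, 0.5),
    "Alien": (5.0, 5.0),
}
labels = {"Toy Story": 0, "A Bug's Life": 0, "Shrek": 0, "Alien": 1}

print(recommend_similar("Toy Story", embeddings, labels, top_n=10))
```

To favor highly rated movies as well, the sort key could combine distance with each movie's average rating.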

    • @ujjwal.kandel
      @ujjwal.kandel 2 years ago +2

      @@SpencerPaoHere I could really use that extra code you're talking about. I'm doing a recommender for my final year project with zero experience in machine learning. Half this code is gibberish to me lol. I just need 10 recommendations for any list of movies. That's all I ask for😭

  • @aumasandra9307
    @aumasandra9307 1 year ago

    Why do I keep getting KeyError: 46970 in the code train_set = Loader()
    And how do I solve this error

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago

      Is this my code? Did you run through all the cells? If so, check out the loader(Dataset) class and provide some logging statements to see which lines are throwing that error.

  • @sssaturn
    @sssaturn 1 year ago

    Is there a reason you don't split the dataset?

    • @SpencerPaoHere
      @SpencerPaoHere 1 year ago

      I just wanted to highlight the recommendation aspect (not necessarily the training aspect)
      Though, in an ML model, you definitely want to do the typical 60/20/20 split!
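
A 60/20/20 train/validation/test split can be sketched with the standard library alone. The function name, the fixed seed, and the toy rows are illustrative assumptions.

```python
import random


def split_60_20_20(rows, seed=42):
    """Shuffle a copy of the rows and split it into train, validation, and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 60 // 100
    n_val = len(rows) * 20 // 100
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # the remainder, roughly 20%
    return train, val, test


# Hypothetical rating rows: (user_id, movie_id, rating).
rows = [(u, m, 3.0) for u in range(5) for m in range(2)]
train, val, test = split_60_20_20(rows)
print(len(train), len(val), len(test))
```

With a purely random split, a user or movie can end up only in the validation or test part, so an id-to-index mapping built from the training rows alone would be missing some keys.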

    • @sssaturn
      @sssaturn 1 year ago

      @@SpencerPaoHere cool, thank you spencer!

  • @appyviral8753
    @appyviral8753 2 years ago +2

    How much u charge for making a video recommendation system for Android app?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      If it's highly interesting, $0.00.

    • @appyviral8753
      @appyviral8753 2 years ago +1

      @@SpencerPaoHere it will be! how to contact u?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +1

      @@appyviral8753 You can send me a message at business.inquiry.spao@gmail.com

    • @seankirbycordova3937
      @seankirbycordova3937 1 year ago +1

      Can I ask for the source code? I'm building a library system, and I have no idea how to implement the collaborative filtering algorithm. Thank you if you can help me 😊

  • @nazrulabuzhar2210
    @nazrulabuzhar2210 2 years ago +9

    What is your skincare routine sir? You're looking good

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago +5

      😂😂😂
      Comment made my day!
      Cleanser + Moisturizer

  • @hmhm2903
    @hmhm2903 2 years ago

    dataset link pls

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      You can try here: files.grouplens.org/datasets/movielens/ml-20m-README.html

  • @brahimsabiri3116
    @brahimsabiri3116 2 years ago

    Could you share the code plz

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      github.com/SpencerPao/Data_Science/tree/main/Recommendation%20Systems

  • @guitar300k
    @guitar300k 2 years ago

    How to solve big scale problem, you guys?

    • @SpencerPaoHere
      @SpencerPaoHere 2 years ago

      It depends on the use case, but there are many ways to scale a problem, all of which are somewhat unique. For deployment on a website, for example, Kubernetes is quite popular.

  • @mainguyenhoang2667
    @mainguyenhoang2667 3 years ago

    can you share the code sir?

    • @SpencerPaoHere
      @SpencerPaoHere 3 years ago

      As requested, here is my code: github.com/SpencerPao/Data_Science/tree/main/Recommendation%20Systems

    • @mainguyenhoang2667
      @mainguyenhoang2667 3 years ago

      @@SpencerPaoHere thank you

  • @rabiaedaylmaz1198
    @rabiaedaylmaz1198 2 years ago

  • @umershabir7045
    @umershabir7045 2 months ago +1

    is your voice AI generated?

  • @phatle-248
    @phatle-248 1 year ago

    I can't hear that "deep" voice clearly