Recommendation Engines Using ALS in PySpark (MovieLens Dataset)

  • Published Nov 3, 2024

Comments • 114

  • @abrahamsanchez8610 · 6 months ago · +1

    Love it. A simple explanation, just what I needed.

  • @parimargu845 · 5 years ago · +3

    Very good explanation within 9 minutes. Superb!!!

  • @hemanthkumar-ez9zz · 4 years ago · +14

    This is 2020 and this is still the best video for ALS in pyspark :)

  • @rhettshipp1 · 7 years ago · +5

    Loved this. Very intuitive explanation of ALS.

  • @LuisMorales-bc7ro · 1 year ago

    great explanation!

  • @evertoncicerolongobresqui1386 · 4 years ago · +1

    Thank you very much, I'm from Brazil.

  • @hencheung413 · 5 years ago · +1

    Thank you for this very straightforward video!

  • @wagdynaeem · 6 years ago · +1

    Love it... simple, and to the point

    • @jamenlong1 · 6 years ago

      Thanks Wagdy. :)

  • @penniesshillings · 5 years ago · +3

    Define irony: I upvoted this video. :-)

    • @jamenlong1 · 5 years ago

      Haha. Thanks Jaco.

    • @kapilricky · 5 years ago · +1

      me too :D

  • @kaapiglass · 4 years ago · +2

    great explanation!!!

  • @ccleahyy · 5 years ago · +1

    Thanks for making this amazing video!

  • @kapilricky · 5 years ago

    Brilliant! Without much knowledge, I got the much-needed details :)

  • @ihomry684 · 2 years ago

    First, thank you for the investment; I can clearly see you put in the effort. But I have a problem with the logic behind it all.
    I fail to see the overall contribution of the prediction column in this case;
    beyond what you have said so far, the question remains.
    Here's an example:
    let's say user_1 and user_2 share a population of movies seen by both, and one day user_2 sees (and rates) a movie that user_1 has not seen.
    Q1: What is the added contribution of a value in the prediction column (say 3.73) for user_2 on a certain movie that user_1 has not yet seen?
    Following that logic:
    let's say we take user_2's top-rated movies that are not in user_1's list (not yet seen by user_1). How does the prediction help us understand how user_1 would rate them? We already know how user_2 rated them.
    To sum up my example: I FAIL to see how knowing "prediction" values for movies that I have seen (and you have not),
    or the other way around (that you have seen and I haven't),
    helps any analyst make connections or find insights based on one side's prediction values. (Remember, one of us has seen a movie that the other hasn't.)
    Q2: Am I better off taking user_2's rating for the movie unseen by user_1 and recommending it directly to user_1?
    Much appreciated for the video, and for any answer you might give me.

  • @asadulhaqmshani4737 · 5 years ago · +2

    Thanks a lot!

  • @faisal.fs1 · 5 years ago · +1

    Thanks!

  • @xudongliu2333 · 6 years ago · +3

    Thanks for this simple and clean code! I just love this script! I have one question about ParamGridBuilder: why are there 3 parameters? There are only the P and U matrices, so shouldn't it be 2? Or is it because the cross-validation uses 3 folds, so there should be 3 values in addGrid? Looking forward to your answer!

    • @jamenlong1 · 6 years ago · +3

      Hi Xudong. The ParamGridBuilder is what allows you to tell Spark which hyperparameters to tune. ALS has 3 hyperparameters for explicit ratings models:
      - rank: the number of latent features you want the two factor matrices to have. This number is the same for both matrices, so it makes up only one hyperparameter.
      - maxIter: the number of times you want ALS to alternate between matrices U and P to decrease the RMSE.
      - regParam: the regularization parameter, or lambda. This keeps the model from converging too quickly so as to prevent overfitting.
      For the ALS model, the number of factor matrices is always 2 and therefore isn't a hyperparameter that needs to be tuned.
      Because this video was made before the Spark CrossValidator could be used with ALS, it uses the TrainValidationSplit ("tvs"), which only allows 1 fold. In the description of the video, I've provided further instructions on how to use the CrossValidator with more folds than the TrainValidationSplit allows. A sketch of the setup follows below.
      Let me know if this doesn't answer your question.
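
      A minimal sketch of that setup (assuming the MovieLens column names userId, movieId and rating, and a training DataFrame named train; the grid values are illustrative, not necessarily the ones used in the video):

      from pyspark.ml.recommendation import ALS
      from pyspark.ml.evaluation import RegressionEvaluator
      from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

      als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                nonnegative=True, coldStartStrategy="drop")

      # the three tunable hyperparameters described above
      param_grid = (ParamGridBuilder()
                    .addGrid(als.rank, [10, 50, 100])
                    .addGrid(als.maxIter, [5, 50, 100])
                    .addGrid(als.regParam, [0.01, 0.05, 0.1])
                    .build())

      evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                      predictionCol="prediction")

      tvs = TrainValidationSplit(estimator=als, estimatorParamMaps=param_grid,
                                 evaluator=evaluator)
      best_model = tvs.fit(train).bestModel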

    • @xudongliu2333 · 6 years ago

      Hi Jamenlong. Thanks so much for your explanation! I think I understand the hyperparameters now. Also, when you do the cross-validation, did you just create 3 folds within the training dataset and keep the test dataset alone? Thanks~~

    • @jamenlong1 · 6 years ago · +1

      Yes, that is correct. You call the cross-validator on the training set, and it will create the folds within the training dataset for you and not do anything with the test set. When the cross-validator is finished running and the results are satisfactory, then you run the model on your test set to confirm that it will perform equally well on unseen data.
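
      A sketch of that flow (assuming the als, param_grid and evaluator objects from the snippet above, and a split such as train, test = movie_ratings.randomSplit([0.8, 0.2])):

      from pyspark.ml.tuning import CrossValidator

      cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid,
                          evaluator=evaluator, numFolds=3)
      cv_model = cv.fit(train)          # folds are created inside the training set only
      best_model = cv_model.bestModel
      rmse = evaluator.evaluate(best_model.transform(test))  # final check on unseen data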

    • @xudongliu2333 · 6 years ago

      @jamenlong1 Awesome! Thanks heaps for your explanation.

    • @xudongliu2333 · 6 years ago

      Hi Jamenlong, I just found another point I'm misunderstanding. Could you please explain how the RMSE is calculated for blank values? As far as I know, the final evaluation output (RMSE) is based on the prediction value minus the real value, plus a regularization term. For example, say there is an online purchase history for one user: the quantity of some items is 2, 4, 11, and the items this user never bought are NaN. How can the model calculate the RMSE for the NaN values? If using the cold start strategy for this, what would the NaN value look like? 0? Thanks!

  • @chzigkol · 5 years ago · +1

    Very informative video indeed. I have a question regarding the function that returns the top recommendations. Does it return only unrated items or not? If not, is it possible to make it return only such items? Thanks

    • @jamenlong1 · 5 years ago · +1

      Hi Christos. Good question. It turns out that the output of the ALS algorithm is a set of predictions for ALL movies, whether the users have seen them or not. So the highest returned value(s) might be for movies already seen. For that reason you'll need to filter the results so that only recommendations for unseen movies are provided. This course talks about this topic specifically and provides instruction on how to do it: www.datacamp.com/courses/recommendation-engines-in-pyspark.

    • @chzigkol · 5 years ago · +1

      @jamenlong1 Thanks, I will check it out, but I feel like a join operation would do the job :)

    • @jamenlong1 · 5 years ago · +1

      @chzigkol You're right. That's basically the solution offered in the course.
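
      A sketch of that join (assuming a flat DataFrame of predictions with userId, movieId and prediction columns, plus the original movie_ratings DataFrame):

      # keep only predictions for (user, movie) pairs with no existing rating
      unseen_recs = predictions.join(movie_ratings.select("userId", "movieId"),
                                     on=["userId", "movieId"],
                                     how="left_anti")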

    • @Pacal_II · 2 years ago · +1

      @jamenlong1 @Christos Zigkolis I also had the same thought; the problem is the algorithm will only recommend movies you've seen, so if you filter out the movies the users have seen, you'll get nothing.

    • @ihomry684 · 2 years ago

      I agree. Same question as my top-level comment above: I fail to see what the prediction column adds when we already know how the other user rated the movie. Wouldn't it be better to take user_2's rating for the movie unseen by user_1 and recommend it directly to user_1?

  • @shrujalambati3730 · 3 years ago

    Can you post the whole code, including the ending part where it displays multiple recommendations for one user?

  • @congly5708 · 2 years ago

    I'm building a recommendation system using ALS. Besides RMSE, I want to get a hit-rate metric for ALS. How can I implement this?

  • @oilbarn2218 · 7 years ago · +1

    Great video! If there was any feedback I'd give it would be to have Buckethead in here somewhere being awesome.

    • @jamenlong1 · 7 years ago

      That's a great suggestion. I'll keep that in mind for my next video.

  • @azizilyosov2801 · 6 years ago

    Thanks for such a useful video. I have a question: when I searched for Bayesian recommender systems that use metadata, I read that they used matrix factorization. I then thought about using Spark ALS to take advantage of metadata in a recommender system. Questions: 1) Is it possible? 2) How? Thanks in advance.

  • @esthermdzitiro31 · 4 years ago

    Great video, thank you so much. How do you save the predictions to a CSV file? I've been getting an unusual error.

  • @ujjwal.kandel · 2 years ago

    How would I implement an item-based collaborative filtering model? And could you please share the full source code?

  • @NickKartha · 5 years ago · +1

    lol, everyone I know has seen this video. Now I need to rework my code :D

  • @deadania9902 · 5 years ago

    Thank you for uploading this video. I have a question, if you don't mind answering: why do predicted ratings have values greater than 5?

    • @jamenlong1 · 5 years ago · +2

      Hi Dea. There's nothing to limit a prediction from being higher than 5. In the original matrix R you have ratings that go from 1 to 5; however, when performing a factorization, ALS will do its best to create matrices U and P such that, when they are multiplied back together, they come as close to the original matrix R as possible. UxP will only be an approximation of R, though, so naturally there will be some variation between what it predicts and the original ratings. Because of this, a prediction may be slightly lower or higher than an original value, meaning a prediction can be higher than 5 or lower than 1. There's nothing really limiting what the output of UxP will be. The only restriction with ALS (as non-negative matrix factorization) is that everything must be positive.
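
      A tiny self-contained illustration of that point (numpy with a truncated SVD standing in for the U x P factorization, since both produce a low-rank approximation; this is not code from the video):

      import numpy as np

      # toy 4x4 rating matrix on a 1-5 scale (fully observed, for simplicity)
      R = np.array([[5.0, 1.0, 5.0, 1.0],
                    [4.0, 2.0, 5.0, 1.0],
                    [1.0, 5.0, 1.0, 5.0],
                    [2.0, 4.0, 1.0, 4.0]])

      # best rank-2 approximation of R
      U, s, Vt = np.linalg.svd(R)
      R_hat = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

      # nothing constrains the reconstruction to stay within the 1-5 scale
      print(R_hat.min(), R_hat.max())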

  • @shrujalambati3730 · 3 years ago

    Also, what is the purpose of the get_recs_for_user function, and what parameter should be entered for user_id?

  • @ongonoolingajeangalbert7526 · 5 years ago

    Hi, thank you for this work. It is very fluent and useful. I have a question: I wrote the same code as you did, and I am still wondering why my PySpark code cannot print the Rank, MaxIter and RegParam. Do you have any idea or explanation? Thanks a lot for your answer.

    • @jamenlong1 · 5 years ago

      Hi. What version of Spark are you using?

  • @AnuragHalderEcon · 1 year ago

    Hi Jamen, is there a way to fetch the U and P matrices once the model is fit?

  • @benisonpang824 · 5 years ago · +1

    Hi Jamen, I am trying this approach and code on a different dataset, but the prediction output is 0.0 for every single user + item. I've checked to make sure my data fits the format required for ALS, but nothing seems to be working. Do you have any thoughts?

    • @jamenlong1 · 5 years ago

      Hi Benison. Sorry it took so long to reply.
      My first thought is that if your data includes 0's for the items that have not been rated, then your output will be heavily influenced by those 0's. And if it's a sparse data frame like the MovieLens dataset, then almost every item will receive a 0 prediction. If this is the case with your data, you'll want to remove all rows that did not receive any sort of rating, so that there are no zeros and no blanks (the ALS algorithm in PySpark doesn't need to be told what is blank; it figures out on its own which user/item pairs have no rating). Let me know if this doesn't work.
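
      A one-line sketch of that cleanup (assuming a ratings DataFrame named df with a rating column):

      from pyspark.sql.functions import col

      # drop unrated rows so the 0 placeholders don't dominate the factorization
      df = df.filter(col("rating") > 0)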

  • @esthermdzitiro31 · 4 years ago · +1

    How do I save the actual matrix?

  • @sairam4630 · 5 years ago

    I'm looking for a classification and clustering video or website for this dataset.

  • @eminedefneozcan2229 · 6 years ago

    Thanks for your video. I want to ask a question: I ran the ALS algorithm on my local PC and on Dataproc with exactly the same dataset and the same initial ALS parameters, yet I get different prediction results for the same users on Dataproc and on my local PC. Do you know the reason for this?

    • @jamenlong1 · 6 years ago

      Hi Emine. There is an element of randomness in how ALS begins filling in the two factor matrices, which are subsequently adjusted iteratively. The differences could be attributed to these random initial values and slightly different subsequent iterations. However, upon convergence, unless the original data is far too sparse for ALS to identify any meaningful patterns, the results should at least be similar between what you're running on Dataproc and locally on your PC. How different are the results? Are they only off by a fraction, or are you seeing differences that are significant? How do the RMSEs differ between the two? And is it safe to assume that the difference does not come after cross-validating separately, meaning you're simply running one model with one set of model parameters? Am I understanding that correctly?

  • @ahmedbenamor814 · 2 years ago

    How did you display the table at 7:32?

  • @mohamedbidewy3567 · 6 years ago

    Will the recommender model provide an API if it's built using Spark MLlib?
    And if I'm dealing with a large database, how can I connect to it? And what are the inputs and outputs of this model? Will they be JSON?

    • @jamenlong1 · 6 years ago

      The Spark ML and Spark MLlib output of an ALS model won't provide an API as far as I know, and it doesn't appear that either has the ability to export an ALS model as a PMML. If you're asking because of the mention of APIs in the Collaborative Filtering Spark documentation, that refers to Spark's Python API, which allows Python users to build Spark code in a familiar way. It sounds like you're interested in providing recommendations in real time, in which case you may need to simply generate/update recommendations for all users on a nightly frequency and configure your production code to extract recommendations from the recommendation matrix on an as-needed basis. As far as connecting your Spark environment to a large database, it will depend on where your data is stored (AWS, GCP or other), so I can't tell you specifically, though there are likely other videos that might be more helpful than I can be with regards to connecting to your data sources. As far as the inputs and outputs go, both are PySpark data frames. The output can be converted to whatever format you need.
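
      A sketch of that nightly batch idea (the output path is a placeholder, and JSON is just one possible serving format):

      # generate top-10 recommendations for every user and persist them
      # where production code can read them on demand
      user_recs = best_model.recommendForAllUsers(10)
      user_recs.write.mode("overwrite").json("path/to/nightly_recommendations")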

  • @MohamedMohamed-wk4de · 3 years ago

    Please do a recommendation system using R.

  • @fatfat1380 · 4 years ago

    Thank you for your video, but I have encountered an issue:
    'ALSModel' object has no attribute 'recommendUsersForProducts'
    even though I use the same libraries. Do you know what the issue is? Thanks

    • @jamenlong1 · 4 years ago · +1

      You’d have to post all of your code in order to know what’s going on. Are you following my code or are you doing it differently?

    • @fatfat1380 · 4 years ago

      @jamenlong1 Thanks for your feedback. Yes, I'm following it, using Spark 2.0.0. An error also comes up when I create the ALS model: __init__() got an unexpected keyword argument 'coldStartStrategy'. The model can be created if I get rid of 'coldStartStrategy', though.

    • @jamenlong1 · 4 years ago

      @fatfat1380 Were you able to figure this out? If not, feel free to paste the few lines of code that contain the model with hyperparameters.

  • @fxsignal1830 · 5 years ago

    Please: you made ALS with implicitPrefs=false (the default value), OK? So my question is: does the "rank" param apply with implicitPrefs=false, or only with implicitPrefs=true?

    • @jamenlong1 · 5 years ago

      Hi. Yes, the “rank” parameter can and must be used for both implicitPrefs=true and implicitPrefs=false. If you’re interested in a full, in-depth review of the ALS algorithm for all kinds of ratings, I put together a full course on DataCamp.com.
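
      For illustration (a minimal sketch, not code from the video):

      from pyspark.ml.recommendation import ALS

      # rank is used in both modes; implicitPrefs only changes how ratings are interpreted
      explicit_als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                         rank=10, implicitPrefs=False)  # explicit 1-5 ratings (the default)
      implicit_als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
                         rank=10, implicitPrefs=True)   # implicit feedback (views, clicks)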

  • @jamespaz4333 · 1 year ago

    Great explanation! I was wondering if we can use ALS for a multi-class problem.

  • @bouhmid95 · 4 years ago · +1

    Hi, if someone can help me: is the Alternating Least Squares (ALS) algorithm supervised or unsupervised learning?

    • @jamenlong1 · 4 years ago

      Hi. It’s supervised as indicated by having an error metric (RMSE).

    • @bouhmid95 · 4 years ago

      @jamenlong1 Thank you for the reply, but I want to make sure, because many posted articles say it is unsupervised learning.

    • @jamenlong1 · 4 years ago

      king bouhmid Interesting. Let me know what you conclude.

    • @bouhmid95 · 4 years ago

      @jamenlong1 mapr.com/blog/apache-spark-machine-learning-tutorial/

    • @jamenlong1 · 4 years ago

      king bouhmid Interesting. It does indeed say that ALS is an unsupervised algorithm. It says, “Unsupervised learning, also sometimes called descriptive analytics, does not have labeled data provided in advance.” But I still disagree that ALS, in the context of making predictions as illustrated in this video, is unsupervised. The ratings, provided by users, serve as labels, which are then used in training, testing and validating in conjunction with a defined loss function (RMSE). Another use of the underlying ALS algorithm, otherwise known as non-negative matrix factorization, is to classify items such as news articles, books, etc. or uncover latent features. In that use case, the algorithm serves a purpose more akin to clustering algorithms, which are indeed unsupervised. In my opinion, using user-provided labels (ratings), a loss function and cross-validation to generate predictions (recommendations) means it's a supervised algorithm. I'm curious which way you're leaning after researching this.

  • @asadulhaqmshani4737 · 5 years ago · +1

    BTW, what is the data type of movie_ratings?

    • @jamenlong1 · 5 years ago · +1

      movie_ratings is a PySpark data frame. `userId` and `movieId` are integer datatypes and `rating` is a float datatype.

    • @asadulhaqmshani4737 · 5 years ago

      @jamenlong1 nice bro

  • @Sandraskarshaug · 6 years ago

    Do you have a workaround for how I can convert my userids (which are strings) and itemids (which are strings) to ints? Other than that, I have a format which is exactly the same as yours.

    • @jamenlong1 · 6 years ago

      Hi Sandraskarshaug. When you say that the format is the same as mine, I'm assuming that the strings in the userid and itemid columns are number values, but in string format. If this is the case, then this shouldn't be too difficult to resolve. You'll have to do this one column at a time:
      from pyspark.sql.functions import col
      from pyspark.sql.types import IntegerType
      df = df.withColumn("userid", col("userid").cast(IntegerType()))
      df = df.withColumn("itemid", col("itemid").cast(IntegerType()))
      df.printSchema()  # should show you the new datatypes

    • @Sandraskarshaug · 6 years ago

      Oh, no, it is the same format, as a dataframe with columns userid, itemid and rating, but the ids for the users and items are unique strings, like "cx:2fs9x8i7". I need to convert these into unique int ids, but I'm not sure how to approach the problem.

    • @jamenlong1 · 6 years ago

      I see what you're saying. What I've done in the past is use monotonically_increasing_id, which generates unique integer ids. You'll need to import it from pyspark.sql.functions, then add a column of monotonically increasing ids to your data frame. Then you'll need to have them persist in the data frame. The truth is that monotonically_increasing_id is a bit of a tricky function. The ids it generates can change as you do different, seemingly unrelated things to the data frame, so using the persist function causes them to persist through all subsequent data frame manipulations. Here's the code that I've used in the past:
      from pyspark.sql.functions import monotonically_increasing_id
      df = df.withColumn("new_id_name", monotonically_increasing_id())
      df = df.persist()
      Another, perhaps more familiar approach would be to leverage the SQL qualities of Spark:
      df = spark.sql("SELECT ROW_NUMBER() OVER(PARTITION BY... ORDER BY...) AS new_id_name")
      Let me know if this doesn't make sense.

    • @jamenlong1 · 6 years ago

      All of that assumes that each id in your dataset occurs only once. Another thing to note is that monotonically_increasing_id will indeed generate unique ids in monotonically increasing order; however, it will only do this uniquely across each partition of your data. So it is best to repartition your data to only one partition before using the monotonically_increasing_id method. After that, you can repartition as needed for your subsequent functions.
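
      A sketch combining those steps into an id lookup table (the names are illustrative; the single partition keeps the generated ids stable):

      from pyspark.sql.functions import monotonically_increasing_id

      # map each distinct string userid to one integer id, then join it back
      user_lookup = (df.select("userid").distinct()
                       .repartition(1)
                       .withColumn("userid_int", monotonically_increasing_id())
                       .persist())
      df = df.join(user_lookup, on="userid")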

  • @keerthireddy6534 · 3 years ago

    What value should be given for recs?

  • @ITCertAcademy · 6 years ago

    Did you put the working code on GitHub?

  • @pferrel · 5 years ago · +1

    Though very nicely explained (thanks!), this is a bit out of date. Ratings are not very useful in predicting what users will like - period. None of the big e-commerce players use ratings anymore; Netflix doesn't even allow you to rate. The thing e-commerce people DO use is conversions (movie views in this example). ALS works with these, but unfortunately that is not discussed here. Furthermore, ALS is fundamentally flawed in that it does not account for any user behavior except rating (or converting). That means that if you do not convert, it cannot recommend. Try Correlated Cross-Occurrence (CCO) instead if you want a more modern way to recommend. It uses many types of user behavior to increase the quality of recommendations. developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

    • @jamenlong1 · 5 years ago · +1

      To be clear, this video is intended for people who specifically want to learn the logic of the ALS algorithm, not a case for where it ranks amongst recommender algorithms. Additionally, it should be clear from the context of the video that it is not exhaustive of all aspects of ALS. If you consider the various things an exhaustive course would cover from implicit vs explicit ratings and binary vs non-binary ratings, to addressing the pseudo-class imbalance via user- or item-oriented weighting, it’s just far too much to include in a sub-9 minute video. The goal of the video was to provide intuition and a starting point.
      But to comment more directly on your remarks... I take a bit more of a nuanced perspective. The best methodology really depends on the data available to you, the context of that data, the user base, the user experience, product application, the talent and time available to build a recommender amongst a thousand other things including production testing. Plenty of companies still use ratings and find them to be helpful. I mean Apache actually took the time to not only build ALS for both implicit and explicit ratings into Spark but also enhanced it across subsequent releases. Even the website you linked to refers to the use of likes (e.g. ratings) as a means to develop a “decent recommender”. To say that they aren't very useful "period" is rather dichotomous. But beyond that, ratings do include conversion. After all, you can’t rate something if you haven’t consumed it (i.e. converted). About the cold start problem, there are many ways to get around it when using ratings. For example, many companies require users to explicitly state their category preferences as part of their registration process or ask users to rate movies/products they've already seen/consumed in order to make initial recommendations. It’s been years since I started my subscription with Netflix, but I believe they still do that during new user registration. Ultimately, however, choosing the best approach comes down to trying several different algorithms, approaches and datasets and seeing which one performs best. But, even if ALS with explicit ratings wasn't useful, learning it can be extremely useful beyond just the recommender context. The intuition is fairly straightforward and provides a good place to start before delving into more complex matrix factorization concepts such as SVD. But beyond that, ALS is also applicable in non-recommender situations. I've used it to infer various characteristics of a base of customers as well as to predict customer behavior without actually using the output as recommendations. So even if other algorithms perform better in a given situation, I still think it’s extremely useful to understand.

    • @pferrel · 5 years ago · +2

      @jamenlong1 Fair points, but why not cover the lens through which history sees RMSE and ratings as methods for recommendation? Caveats and editorials are what people need when looking at a complicated mathematical algorithm. As I said, it was a very nice tutorial for ALS with ratings.

    • @ihomry684 · 2 years ago

      Same question as my top-level comment above: I fail to see what the prediction column adds when we already know how the other user rated the movie. Wouldn't it be better to take user_2's rating for the movie unseen by user_1 and recommend it directly to user_1? @Pat Ferrel, any answer you might give me?

  • @seymatas · 4 years ago · +1

    Why did you put only one video? Make others!

    • @jamenlong1 · 4 years ago

      seyma tas Thank you. I actually put together a whole course on ALS. It’s on DataCamp.com.

  • @羅彥廷-z9m · 6 years ago

    What Spark version is this?

    • @jamenlong1 · 6 years ago

      The latest version I've run this in is Spark 2.2.1. But I've also run it in previous versions, as early as 2.0 I believe.

    • @羅彥廷-z9m · 6 years ago

      Thanks. What is the input to the function get_recs_for_user? A row of user_recs?

    • @羅彥廷-z9m · 6 years ago

      In my case I want to recommend items to users that they didn't rate before, but recommendForAllUsers(10) seems to return just the top 10 items.
      So I need to filter the top 10 items again...
      Is there a better way to do this?

    • @jamenlong1 · 6 years ago

      I'm glad you mentioned this. I see a small gap in this part. The input of the "get_recs_for_user" function should be the recommendations for one user. The way you get the recommendations for one user is by taking user_recs, which is the output of best_model.recommendForAllUsers(), and filtering for the user for whom you'd like to make a recommendation. In other words, let's say you want to get the recommendations for user 42; then you would do this:
      from pyspark.sql.functions import col
      user_42_recs = user_recs.filter(col("userId") == 42)
      get_recs_for_user(user_42_recs)

    • @jamenlong1 · 6 years ago

      The input for recommendForAllUsers is the number of recommendations you want returned for each user. In the video example I used 10, but you could input 1 or 5 or 100, and it should return a data frame of the top x number of movies/items that each user has not yet seen that they are most likely going to rate highly. So the output of recommendForAllUsers should only return recommendations for movies/items that each user hasn't yet rated. It is assumed that they haven't seen a movie if they haven't provided a rating for it.
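
      For example (a minimal sketch using best_model from the video):

      # the argument is the number of recommendations to return per user
      user_recs = best_model.recommendForAllUsers(10)
      user_recs.show(5, truncate=False)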

  • @petrkoutny7887 · 3 years ago

    Being a beginner in terms of ML knowledge, this did not help me much.