Building a Recommendation System in Python

Spencer Pao

83K views

Published: Dec 1, 2024
Science

Comments • 100

@nayibahued5955 2 years ago ⁺¹⁶²
deepest data scientist voice in the world
@umeshkumarasamy6608 8 months ago ⁺³
He learnt deep
@naderkhaled9410 2 years ago ⁺²³
Dude I know this is off topic, but ur voice is insanely satisfying !!
@SpencerPaoHere 2 years ago ⁺²
😂
@AbhishekChandraShukla 1 year ago ⁺⁴
Holy cow! That is a really good recommendation system! Humbling tutorial as well!
@Agent7155 2 years ago ⁺¹¹
Ended up searching up for movies to watch at the end xD
@SpencerPaoHere 2 years ago ⁺¹
😂
@ea1766 1 year ago ⁺¹
easily the best video on this subject, all the other videos were so boring and mundane. I wish YouTube promoted this video more to the top.
@icequeen2778 1 year ago ⁺¹
Would love to see more of this type of video!
@folahan 1 year ago ⁺¹
The first time I will follow a training using my own dataset and I didn't get any error from start to finish.
@dan7582 2 years ago
Nice video, keep up the good work!!
@vincent_hall 2 years ago
Thank you sir.
I have forked it and shall have a go collaborating with a friend.
@l3o_pl4ys51 3 months ago ⁺²
Couldn't totally figure out what he used in this video. The only things that I could get were that he separated the users and movies to their ratings, after this he weighted all of them to put them on a sort of scale, and then he made the predictions into clusters. Did I miss something?
@robbillington1603 4 months ago
Jaba ah voice! Great video
@ahmadjunaidi21-l6l 6 months ago
Dude is not only learn deep learning but deep voice. damn
@ayushthombare9235 3 years ago
Very informative and useful video.... Thank you so much
@elisama2936 1 year ago ⁺¹
Hello! :) Ty for the video. I have a question regarding the line "def __init__(self, n_users, n_items, n_factors=20)". Can you explain why 20?
@SpencerPaoHere 1 year ago ⁺²
Number of latent factors was arbitrary! Though, you could optimize for that value.
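One way to "optimize for that value" is to train the same factorization with a few candidate sizes and compare the error on held-out ratings. The sketch below is a plain-Python stand-in for the model in the video, not the author's code; the toy ratings, the candidate sizes, and the names `train_mf` and `mse` are all assumptions made for illustration.

```python
import random

def train_mf(ratings, n_users, n_items, n_factors=20,
             epochs=200, lr=0.02, reg=0.02, seed=0):
    """SGD matrix factorization: rating ~ dot(user_vector, item_vector)."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(p * q for p, q in zip(P[u], Q[i]))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def mse(ratings, P, Q):
    """Mean squared error between predicted and actual ratings."""
    errors = [(r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2
              for u, i, r in ratings]
    return sum(errors) / len(errors)

# Made-up (user_index, item_index, rating) triples.
train = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
         (2, 1, 5.0), (2, 2, 2.0), (3, 0, 1.0), (3, 2, 5.0)]
valid = [(0, 2, 1.0), (1, 1, 4.0), (3, 1, 2.0)]

# Try several latent sizes and keep the one with the lowest validation error.
results = {}
for k in (1, 2, 5, 20):
    P, Q = train_mf(train, n_users=4, n_items=3, n_factors=k)
    results[k] = mse(valid, P, Q)

best_k = min(results, key=results.get)
print(results, "best n_factors:", best_k)
```

On a dataset this small the winner means little; the point is only the selection loop, which carries over to a real validation split.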
@elisama2936 1 year ago ⁺¹
@@SpencerPaoHere Thank you for your answer!
@ryderthewatermelon611 5 months ago
If i was to adapt this methodology to recommend songs based on user song selection, and used a dataset with parameters of a song, how would i do that?
@tactical_savant01 25 days ago
Hi Spencer, the github link for the code is not working. can you pls resolve it. Thank you
@stmasanti 1 year ago
Great video!
@marcelomlr 7 months ago
Hey man, nice video, and thanks for the tutorial. I'm actually trying to build a recommendation system for online courses, like udemy, but I can't find any datasets for user reviews to make the collaborative filtering. So I decided to manually create a dataset, and thought of choosing like 4 subjects and putting some users to rate like 10-15 courses of each subject. Do you know if something like that can work, or have any tips you can give me?
@vaiterius 1 year ago ⁺¹
How do you know which libraries/functions to use to make these algorithms? I’m trying to make a videogame recommendation system from a Steam games dataset, similar to what you’re doing here
@hamzak5674 10 months ago
Hey, I’m making something similar using the RAWG dataset. Did you manage to get anywhere? I’m planning to start in the next few days
@SpencerPaoHere 9 months ago
Python typically wraps a lot of the theoretical machinery implemented in C/C++. When it comes to a recommendation system base, TensorFlow/Keras are the building blocks and are quite effective when building something from scratch or something fine-tunable.
@vinayvajrala4366 5 months ago
A big like for that voice
@Bjorn_R 10 months ago
Hello Spencer im split between collaborative recommender systems and a confirmation tree project for my master thesis. What would be most beneficial?
@sachamallet5157 1 year ago
Hi, I would like to know if the mac mini M2 pro with only 16gb of RAM is enough for 8 GB of data analysis. Thank you so much for your feedback
@SpencerPaoHere 1 year ago
Yeah it should be good for smaller datasets. Though you never know until you try ! (Maybe try 2 gb and see how long that’ll take - and approximate from there)
@obi666 1 year ago
I'm not sure what these clusters are (for example Cluster #1 and printed titles), are they some sort of groups of similar movies?
@SpencerPaoHere 9 months ago ⁺¹
Yep! Each cluster represents a group of data points that are similar.
@bhadauriaji 2 years ago ⁺¹
Hi Spencer. Was working on a similar problem where i have users who have listened to a set of songs and based on their listen history. I have to recommend new songs to the user. Almost 10. How to do that?
Also I don't have ratings for songs I have listen count for each song. And listen count is in relation to user.
@SpencerPaoHere 2 years ago ⁺⁴
You'll probably need additional features such as length of listen, genre, artist, etc for a better recommendation algorithm.
You could do the frequentist approach (to start) where you recommend the song that has been listened the most and slowly make your application more advanced once you've accumulated more focused data.
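As a rough illustration of that starting point, the sketch below ranks songs by total listen count and skips anything the user has already played. The rows and the function name `most_popular` are made up for the example; only the idea of recommending the most-listened songs comes from the reply above.

```python
from collections import Counter

# Hypothetical (user_id, song_id, listen_count) rows.
listens = [
    ("u1", "s1", 12), ("u1", "s2", 3),
    ("u2", "s1", 7), ("u2", "s3", 9),
    ("u3", "s2", 1), ("u3", "s3", 4), ("u3", "s4", 2),
]

def most_popular(listens, user_id, n=10):
    """Return up to n songs the user has not played, most listened first."""
    totals = Counter()
    heard = set()
    for user, song, count in listens:
        totals[song] += count
        if user == user_id:
            heard.add(song)
    ranked = [song for song, _ in totals.most_common() if song not in heard]
    return ranked[:n]

print(most_popular(listens, "u1"))  # ['s3', 's4']
```

Every user with the same listening history gets the same list, which is the main limitation of a pure popularity baseline.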
@bhadauriaji 2 years ago
@@SpencerPaoHere The problem is I can't have more features. My dataset has UserId, SongID, listen count, artist, song title, and date of the song only. I have to build a recommendation engine using that only. Also I tried using Kmeans and some brute force filtering techniques but not getting accuracy.
@SpencerPaoHere 2 years ago ⁺²
@@bhadauriaji Unfortunately, those features aren't going to be doing recommendations justice. You could, however, do a weighted sampling song recommendation based on hits. It's not perfect, but it may be what you are looking for.
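A weighted sampling recommendation based on hits could look something like the sketch below: songs are drawn at random, without replacement, with probability proportional to their total listen count. The data and the name `sample_by_hits` are hypothetical, and this is one possible reading of the suggestion rather than a reference implementation.

```python
import random
from collections import Counter

# Hypothetical (user_id, song_id, listen_count) rows.
listens = [
    ("u1", "s1", 12), ("u1", "s2", 3),
    ("u2", "s1", 7), ("u2", "s3", 9),
    ("u3", "s2", 1), ("u3", "s3", 4), ("u3", "s4", 2),
]

def sample_by_hits(listens, user_id, n=10, seed=None):
    """Sample up to n unheard songs, weighted by total listen count."""
    rng = random.Random(seed)
    totals = Counter()
    heard = set()
    for user, song, count in listens:
        totals[song] += count
        if user == user_id:
            heard.add(song)
    candidates = {s: c for s, c in totals.items() if s not in heard}
    picks = []
    while candidates and len(picks) < n:
        songs = list(candidates)
        weights = [candidates[s] for s in songs]
        choice = rng.choices(songs, weights=weights, k=1)[0]
        picks.append(choice)
        del candidates[choice]  # sample without replacement
    return picks

print(sample_by_hits(listens, "u1", n=2, seed=0))
```

Compared with always returning the top hits, sampling gives less popular songs an occasional chance to be shown.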
@bhadauriaji 2 years ago
@@SpencerPaoHere Thanks a lot for the info, will try that surely. 🤗
@simyixiang3358 3 months ago
Notebook setting use T4-GPU ?
@alexhort__ 4 months ago
How would you do it from a real-time database, with real users?
@Historyiswatching 2 years ago ⁺¹
I'm sorry I was distracted by your good looks xD
@NobixLee 2 years ago
Great video, but how do we then get scores for the User_ID? Something like there is this much probability that User_ID 2 will be in cluster 2? Thank you.
@SpencerPaoHere 2 years ago ⁺¹
One way that you can go about this:
You'd need more data to do this more accurately, since the dataset I am using in this video has only 4 features: userID, movieID, rating, timestamp. However, with the way that I have done this in the video, you can take the average of the ratings that each user has applied to the movies in each cluster. Normalize across all clusters and sort by highest rating per cluster for the user. Whichever movies in the cluster have not been seen by the user should be recommended to the user. I am open to hearing your thoughts on this!
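The steps described above can be sketched roughly as follows. The ratings, the movie-to-cluster mapping, and the function names are invented for the example, and the normalized values are relative shares that sum to 1, not calibrated probabilities.

```python
from collections import defaultdict

# Hypothetical data: (user_id, movie_id, rating) and a movie -> cluster map,
# e.g. the labels a clustering step would produce.
ratings = [(2, "m1", 5.0), (2, "m2", 4.0), (2, "m3", 2.0),
           (1, "m4", 3.0), (1, "m5", 4.0)]
movie_cluster = {"m1": 0, "m2": 0, "m3": 1, "m4": 0, "m5": 1}

def cluster_scores(ratings, movie_cluster, user_id):
    """Average the user's ratings per cluster, then normalize to sum to 1."""
    sums, counts = defaultdict(float), defaultdict(int)
    for user, movie, rating in ratings:
        if user == user_id and movie in movie_cluster:
            cluster = movie_cluster[movie]
            sums[cluster] += rating
            counts[cluster] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    total = sum(means.values())
    return {c: m / total for c, m in means.items()} if total else {}

def recommend_unseen(ratings, movie_cluster, user_id):
    """Recommend unseen movies from the user's highest-scoring cluster."""
    scores = cluster_scores(ratings, movie_cluster, user_id)
    if not scores:
        return []
    best = max(scores, key=scores.get)
    seen = {m for u, m, _ in ratings if u == user_id}
    return sorted(m for m, c in movie_cluster.items()
                  if c == best and m not in seen)

print(cluster_scores(ratings, movie_cluster, 2))
print(recommend_unseen(ratings, movie_cluster, 2))  # ['m4']
```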
@aumasandra9307 1 year ago
Why do I keep getting KeyError: 46970 in the code train_set = Loader()
And how do I solve this error
@SpencerPaoHere 1 year ago
Is this my code? Did you run through all the cells? If so, check out the Loader(Dataset) class and add some logging statements to see which lines are throwing that error.
@gauravpoudel7288 1 year ago
Thanks for the awesome content.
BTW Is that really your voice?
@simyixiang3358 3 months ago
bcs when i run the code at 9:10 in the video ,the output error
@nazrulabuzhar2210 3 years ago ⁺⁹
What is your skincare routine sir? You're looking good
@SpencerPaoHere 3 years ago ⁺⁵
😂😂😂
Comment made my day!
Cleanser + Moisturizer
@JaisonSimon-h7p 6 months ago
hello brother, can i use any movie dataset from kaggle?
@erick388 2 years ago
Heyo, and thanks for the video! This was incredibly helpful to learn and understand how to make something rudimentary (even if I imagine a full fledged system would be SO much more complex in how you measure input from the user and live data to form a more robust recommendation). I do have one quick question though, since when I tried making my own slight version (mostly changing the dataset and some small aspects), I came across a slight issue regarding the loading aspect.
To attempt to make this run faster, I had used pandas to fuse both the ratings and movies csv's together, and then I shuffled, and split them to have an even distribution with fewer values (this is for a class of mine more than anything, and 100k entries is a lot to run during a presentation). The columns remain the same, and headers remain the same, and all that has 'shifted' is the order in which the rows appear (which is to say it's not a bunch of toy story reviews in a row, not a bunch of star wars reviews in a row, etc) and I acquired this error.
self.ratings.movieId = ratings_df.movieId.apply(lambda x: self.movieid2idx[x])
self.ratings.userId = ratings_df.userId.apply(lambda x: self.userid2idx[x])
It processes movieid correctly. But when we reach the application of the lambda to the userid it proceeds to return.
Key Error, NaN.
Given that the csv is the same, save for the alteration to the order of the rows but not the headers, and the values are all indeed numeric, what would be a feasible way to fix and remove this error? Or could it be the way that I shuffled the dataset that's causing it to assume that the numeric values are NaN and that there's a peculiar way I have to shuffle the values?
Also on a fun sidenote, I've run this both with and without CUDA installed. I didn't particularly find anything that changed, but maybe that's just me. It runs regardless, though I presume that will create its own problems when it comes down to it.
@SpencerPaoHere 2 years ago
Glad you enjoyed it! This might be an issue with how you are shuffling the data together. There could be many reasons why this is the case. Though, I'd recommend obtaining a small subset of your dataset and running the cleaning algorithm from there. (It'd be easier to debug)
It seems you are attempting to combine 2 datasets together based on movieId. Have you attempted to do join statements? (inner join, to be specific). Also double check that the casting is appropriate. You may be getting a null value due to the userID somehow becoming a string.
Otherwise, could you provide an example on what the current dataset looks like and what you are trying to achieve?
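To make the join and casting advice concrete, here is a small pandas sketch with toy frames (the values are made up; the column names follow the MovieLens-style `userId`, `movieId`, `rating`, `title`). It shows how an outer join can introduce a NaN `userId`, and how an inner join plus an explicit cast keeps the id-to-index lookup from hitting a missing key. Whether this was the actual cause of the error above is not known.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies CSVs.
ratings = pd.DataFrame({"userId": [1, 1, 2],
                        "movieId": [10, 20, 10],
                        "rating": [4.0, 3.5, 5.0]})
movies = pd.DataFrame({"movieId": [10, 20, 30],
                       "title": ["A", "B", "C"]})

# An outer join keeps movie 30, which nobody rated, so its userId is NaN
# and the whole userId column is silently converted to float.
outer = movies.merge(ratings, on="movieId", how="outer")
n_missing_users = int(outer["userId"].isna().sum())
print("rows with NaN userId after outer join:", n_missing_users)

# An inner join keeps only rows present in both frames; cast ids back to int.
inner = ratings.merge(movies, on="movieId", how="inner")
inner["userId"] = inner["userId"].astype(int)
inner["movieId"] = inner["movieId"].astype(int)

# Shuffle (and optionally subsample with frac < 1) after the join.
sample = inner.sample(frac=1.0, random_state=0).reset_index(drop=True)

# Build the lookup from the same frame it is applied to, then verify
# that every id has an entry before mapping.
userid2idx = {uid: idx for idx, uid in enumerate(sample["userId"].unique())}
missing = set(sample["userId"]) - set(userid2idx)
print("ids missing from the lookup:", missing)
sample["user_idx"] = sample["userId"].map(userid2idx)
print(sample)
```

Building `userid2idx` from the frame that is later indexed is the key design choice: a lookup built from the full ratings file and applied to a differently joined or subsampled frame is where mismatched keys tend to appear.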
@erick388 2 years ago
@@SpencerPaoHere Yeah I got it working. I think it was a messed up join on my end which prematurely ended my experimenting with the dataset, so all's good!
On another sidenote, as I'm still learning some machine learning stuff, I have friends who keep talking about accuracy for machine learning algorithms, and the more I look into it I begin to wonder how that may apply here, or if it's even an actual possible thing to quantify here. I know that MSE calculates the error between predicted values, and actual rating values (do correct me if I'm wrong), which makes me question if 'accuracy' or 'error' are actual aspects of this algorithm, or if that's related to other forms of algorithms that are more specific with their goal?
Regardless! Big thanks for the help and awesome video. This was honestly a pretty good starting point as it helped me get curious about a lot of topics I had never got to touch before.
@SpencerPaoHere 2 years ago
@@erick388 Glad you enjoyed the content!
Regarding the accuracies, there are actually several metrics you can go about optimizing for. A great optimizer function would be adam. Accuracy by itself is not that 'accurate'; you need precision as well. Take a look into F1 scores. That'll help.
Increasing "accuracy" comes down to additional features, more data, and different ML algorithms, or tuning algorithms. That's essentially the world of Data Science.
@erick388 2 years ago
@@SpencerPaoHere Gotcha, I'll look into that too. It's a lot to take in but it's always fun and interesting to learn. Appreciate all the advice!
@erick388 2 years ago
@@SpencerPaoHere Actually, I suppose one final question is how I would qualify something as a false positive, or a true positive (or really any of the prerequisite information) for the calculations of F1 Scores (such as the requirements for Precision, Recall, etc). I'm not quite sure how to do that given that in this example here we're giving a recommendation of ten movies based on their overall rating, and I don't really know what would quantify as a false positive (or a true positive).
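One common convention for top-N recommendations, offered here as an assumption rather than something stated in the thread, is to hold out some of each user's ratings and call a movie "relevant" when its held-out rating meets a threshold (4.0 below). A recommended movie that is relevant counts as a true positive; a recommended movie that is not relevant, or has no held-out rating, counts as a false positive; a relevant movie that was not recommended counts as a false negative.

```python
def precision_recall_f1_at_k(recommended, held_out_ratings, k=10, threshold=4.0):
    """Precision, recall and F1 for one user's top-k recommendations.

    recommended: list of movie ids, best first.
    held_out_ratings: dict of movie id -> rating hidden from training.
    Unrated recommendations are treated as misses, a pessimistic choice.
    """
    top_k = recommended[:k]
    relevant = {m for m, r in held_out_ratings.items() if r >= threshold}
    true_pos = sum(1 for m in top_k if m in relevant)
    precision = true_pos / len(top_k) if top_k else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example: 4 recommendations, 3 held-out ratings.
recommended = ["a", "b", "c", "d"]
held_out = {"a": 5.0, "c": 2.0, "e": 4.5}
p, r, f1 = precision_recall_f1_at_k(recommended, held_out, k=4)
print(p, r, round(f1, 4))  # 0.25 0.5 0.3333
```

Averaging these values over all users gives a single score for the recommender; the threshold and the treatment of unrated items should be reported alongside it, since both change the numbers.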
@nikhilsastry6631 4 months ago
Deepest Learning
@SpokenTruth-e4r 2 days ago
Hello Spencer I was wondering if you have an email I can reach you at b/c I believe my old boss out of all people has poisoned my recommendation systems but I had wanted to confirm for sure if this is what's going on? Thanks
@casewhite5048 2 years ago
How do you set a rating system for the output of movies lets say it recommends a movie you never want to watch like Fried Green Tomatoes recommends Avengers: Endgame tell it to rate it 10/10 and train it to find more clusters with higher ratings and train it to find more of these over time as more movies come out
@SpencerPaoHere 2 years ago
There are many ways that you can go about doing this: I'd check out the ELO/FIDE rating system. Based on user input, they manually click either "Yes" or "No" depending on whether they like the recommendation. You can use this system to tailor prediction output to the customer.
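A minimal sketch of an Elo-style update driven by "Yes"/"No" clicks is shown below. The expected-score and update formulas are the standard Elo ones; treating each click as a "match" between a movie's score and a fixed per-user bar, the starting value of 1500, and the K-factor of 32 are assumptions made for this example.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_item_rating(item_rating, user_bar, liked, k=32.0):
    """Move a movie's score up after a 'Yes' click and down after a 'No'."""
    expected = expected_score(item_rating, user_bar)
    actual = 1.0 if liked else 0.0
    return item_rating + k * (actual - expected)

# Per-user scores for recommended movies, all starting at 1500.
scores = {"movie_a": 1500.0, "movie_b": 1500.0}
user_bar = 1500.0  # hypothetical baseline the movies are compared against

# The user clicks "Yes" on movie_a and "No" on movie_b.
scores["movie_a"] = update_item_rating(scores["movie_a"], user_bar, liked=True)
scores["movie_b"] = update_item_rating(scores["movie_b"], user_bar, liked=False)

# Recommend in order of the updated scores.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because surprising outcomes move the score more than expected ones, a movie that is repeatedly rejected sinks quickly and stops being recommended to that user.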
@dustinvo6097 2 years ago
Hi Spencer. Nice video as always. I am working on a problem where the users interact with banking website and app. So I have userid, the interaction name, timestamps and some demographic variables. I'm trying to cluster them into some "personas" based on their interaction and timestamp for biz use. Do you have any ideas how to do that? Thanks.
@SpencerPaoHere 2 years ago ⁺¹
Glad you enjoyed it! That use case can definitely be quite tricky. You'd first need to categorize what personas you are trying to bucket users in. Based on those personas, what actions (i.e. features) would link them to said persona?
I'd suspect that a lot of A/B testing would be required to test your hypotheses. But, if it's literally just something related to money management via banking, I'd probably look at it from the angle of on-time payments, quantity, frequency, tiered users, time of withdrawal from ATM, fees encountered, zip codes, and features related to that. (excluding PII unless TOS states as such)
@dustinvo6097 2 years ago
@@SpencerPaoHere thanks for the advice. Another question: if I try to focus on just userid and interactionname, how can I cluster the userid based on the interactions (withdraw, request credit score,...) when they are repeated categorical measurements? Is K-modes a good one?
@SpencerPaoHere 2 years ago ⁺¹
@@dustinvo6097 I think I have just the video for you :) ruclips.net/video/NKQpVU1LTm8/видео.html
(If you haven't seen it already)
@appyviral8753 2 years ago ⁺²
How much u charge for making a video recommendation system for Android app?
@SpencerPaoHere 2 years ago ⁺¹
If it's highly interesting, $0.00.
@appyviral8753 2 years ago ⁺¹
@@SpencerPaoHere it will be! how to contact u?
@SpencerPaoHere 2 years ago ⁺¹
@@appyviral8753 You can send me a message at business.inquiry.spao@gmail.com
@seankirbycordova3937 2 years ago ⁺¹
Can I ask for the source code? I'm building a library system and have no idea how to implement the collaborative filtering algo. Thank you if you can help me 😊
@sospixs 2 years ago
Hi Spencer
Thanks for your vdo .
I've arrange the code , but got stuck in section for loop tqdm
len(losses) = 0
for it in tqdm(range(num_epochs)):
....
....
ZeroDivisionError Traceback (most recent call last)
Input In [59], in ()
11 optimizer.step()
12 #print(loss.item())
---> 13 print("iter #{}".format(it), "Loss:", sum(losses) / len(losses))
ZeroDivisionError: division by zero
any ideas ?
@SpencerPaoHere 2 years ago
Yeah. Whatever is populating your losses is not doing so correctly, or there is a divergence issue. The len(losses) == 0. You'd need to figure out why the length is zero.
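A small guard can turn the bare ZeroDivisionError into a message that points at the likely culprit. The sketch below uses plain lists as a stand-in for the notebook's data loader and a fake training step, so every name in it is hypothetical; it does not diagnose why the list was empty in the case above.

```python
def mean_loss(losses, epoch):
    """Average the per-batch losses, failing loudly when there are none."""
    if not losses:
        raise RuntimeError(
            f"epoch {epoch}: no batch losses were recorded; check that the "
            "data loader yields batches and that each loss is appended"
        )
    return sum(losses) / len(losses)

def run_epoch(batches, train_step):
    """Run one pass over the batches and collect the loss of each step."""
    losses = []
    for batch in batches:
        losses.append(train_step(batch))
    return losses

def fake_step(batch):
    # Stand-in for forward pass, loss.backward() and optimizer.step().
    return sum(batch) / len(batch)

batches = [[0.9, 1.1], [0.4, 0.6]]  # an empty list here reproduces the error
for epoch in range(2):
    losses = run_epoch(batches, fake_step)
    print("iter #{}".format(epoch), "Loss:", mean_loss(losses, epoch))
```

If the guard fires, printing the length of the dataset and of the loader before the loop usually narrows the problem down quickly.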
@sospixs 2 years ago
@@SpencerPaoHere Yep,
I'm using jupyter in my PC , And Is running on GPU: False
I think that the problem
@ujjwal.kandel 3 years ago
How would I pass a movie title to the recommender and get a list of recommendations?
@SpencerPaoHere 3 years ago ⁺¹
Great question! You might have to change the model itself to be more 'linear' to return a movie title that is most similar to the input.
With the Kmeans algorithm, you can technically "Pass in a movie title" and the list would be the cluster associated with that movie title. You can then sort by shortest distance and get the top most rated movie. Some additional coding will be required to do that.
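The "additional coding" could be sketched along these lines: look up the cluster of the given title, then sort the other titles in that cluster by distance to it in embedding space. The titles, the 2-d vectors, and the cluster labels below are placeholders for whatever the trained model and the k-means step produce.

```python
import math

# Hypothetical learned movie embeddings and their k-means cluster labels.
embeddings = {
    "Movie A": (0.90, 0.10),
    "Movie B": (0.85, 0.15),
    "Movie C": (0.70, 0.20),
    "Movie D": (0.10, 0.90),
}
clusters = {"Movie A": 0, "Movie B": 0, "Movie C": 0, "Movie D": 1}

def similar_titles(title, embeddings, clusters, n=10):
    """Return up to n titles from the same cluster, nearest first."""
    if title not in embeddings:
        raise KeyError(f"unknown title: {title!r}")
    target = embeddings[title]
    cluster_id = clusters[title]
    neighbours = [
        (math.dist(target, embeddings[other]), other)
        for other, cid in clusters.items()
        if cid == cluster_id and other != title
    ]
    neighbours.sort()
    return [other for _, other in neighbours[:n]]

print(similar_titles("Movie A", embeddings, clusters))  # ['Movie B', 'Movie C']
```

For a list of input movies, one simple extension is to call this per title and merge the results, keeping the smallest distance seen for each candidate.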
@ujjwal.kandel 2 years ago ⁺³
@@SpencerPaoHere I could really use that extra code you're talking about. I'm doing a recommender for my final year project with zero experience in machine learning. Half this code is gibberish to me lol. I just need 10 recommendations for any list of movies. That's all I ask for😭
@guitar300k 2 years ago
How to solve big scale problem, you guys?
@SpencerPaoHere 2 years ago
It depends on the use case, but there are many ways to scale a problem, all of which are somewhat unique. For deployment on a website, for example, Kubernetes is quite popular.
@christianmoreno7390 2 years ago
dang bro do you practice retention ??
@abi_xyz 1 year ago
great
@maximshidlovski23 2 years ago
Hi Spencer, thanks for the video. I am currently working on the problem of creating a tag-based recommendation system. The user has a list of tags of interest to him and needs to recommend content based on tags and words that are hyperonyms and hyponyms of these tags. I have the user's UserId, FavoriteUserTagsIds and the content's ContentID and ContentTagsIds. Do you have any ideas how to do that? What is best way to create tag-based recommendation system?
Thanks.
@SpencerPaoHere 2 years ago ⁺¹
This seems like an NLP type problem! You can check out a generalized large language model to see if your keywords exist within its vocabulary. Then, using its word embeddings, you can perhaps utilize the distances between the vectors as a gauge behind the meaning. Then, you can plug in the output of the NLP model to a recommendation system.
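A toy version of the embedding-distance idea is sketched below. The 3-d vectors are invented stand-ins for embeddings that would come from a pretrained language model, and scoring each piece of content by the best cosine similarity between a user tag and any content tag is just one possible choice.

```python
import math

# Made-up vectors standing in for real word embeddings.
tag_vectors = {
    "dog": (0.90, 0.10, 0.00),
    "puppy": (0.85, 0.20, 0.05),
    "animal": (0.70, 0.30, 0.10),
    "finance": (0.00, 0.10, 0.95),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_content(user_tags, content_tags, vectors):
    """Mean, over user tags, of the best similarity to any content tag."""
    best = []
    for user_tag in user_tags:
        if user_tag not in vectors:
            continue  # tag missing from the vocabulary
        sims = [cosine(vectors[user_tag], vectors[ct])
                for ct in content_tags if ct in vectors]
        if sims:
            best.append(max(sims))
    return sum(best) / len(best) if best else 0.0

content = {"c1": ["puppy", "animal"], "c2": ["finance"]}
user_tags = ["dog"]
ranked = sorted(content,
                key=lambda cid: score_content(user_tags, content[cid], tag_vectors),
                reverse=True)
print(ranked)  # ['c1', 'c2']
```

Related terms such as broader or narrower words tend to sit close together in embedding space, which is why a user interested in "dog" is matched to content tagged "puppy" or "animal" without an exact tag match.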
@maximshidlovski23 2 years ago
@@SpencerPaoHere Thanks, I came up with a similar solution yesterday, now I'm working on implementing it.
@sssaturn 2 years ago
is there a reason you dont split the data set?
@SpencerPaoHere 2 years ago
I just wanted to highlight the recommendation aspect (not necessarily the training aspect)
Though, in an ML model, you definitely want to do the typical 60/20/20 split!
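For reference, a 60/20/20 train/validation/test split can be done with a shuffle and two cut points. The helper name `split_60_20_20` and the fixed seed are choices made for this sketch.

```python
import random

def split_60_20_20(rows, seed=42):
    """Shuffle rows and split them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = n * 60 // 100
    n_valid = n * 20 // 100
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # remainder, so no row is dropped
    return train, valid, test

# Ten placeholder rating rows, identified here only by an index.
train, valid, test = split_60_20_20(range(10))
print(len(train), len(valid), len(test))  # 6 2 2
```

For recommenders it is often worth splitting per user or by timestamp instead, so that every user appears in training and the test set reflects later behaviour; the random split above is only the simplest variant.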
@sssaturn 2 years ago
@@SpencerPaoHere cool, thank you spencer!
@kain5244 2 years ago
thanks
@umershabir7045 6 months ago ⁺¹
is your voice AI generated?
@SpencerPaoHere 6 months ago
😂
@hmhm2903 3 years ago
dataset link pls
@SpencerPaoHere 3 years ago
You can try here: files.grouplens.org/datasets/movielens/ml-20m-README.html
@brahimsabiri3116 3 years ago
Could you share the code plz
@SpencerPaoHere 3 years ago
github.com/SpencerPao/Data_Science/tree/main/Recommendation%20Systems
@mainguyenhoang2667 3 years ago
can you share the code sir?
@SpencerPaoHere 3 years ago
As requested, here is my code: github.com/SpencerPao/Data_Science/tree/main/Recommendation%20Systems
@mainguyenhoang2667 3 years ago
@@SpencerPaoHere thanks you
@phatle-248 1 year ago
I can't hear that "deep" voice clearly