Unbelievable! Well done, you were spot on with your prediction of +3 for the Chiefs!
Talk about a great prediction!
You're one of the best data science channels on YouTube. Thanks for the video.
Cool applications of these models, I would be interested to see what would happen if you weighted each game by the ranking of the opposing team!
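One lightweight version of that weighting idea, sketched here with entirely made-up results and opponent ratings, is to compute recent form as a strength-weighted win rate, so a win over a strong opponent counts for more than a win over a weak one:

```python
# Toy sketch: recent form weighted by opponent strength.
# Results and strength ratings below are hypothetical, not real NFL data.

def weighted_form(results, opponent_strengths):
    """results: 1 for a win, 0 for a loss; strengths: higher = tougher opponent."""
    total = sum(opponent_strengths)
    return sum(r * s for r, s in zip(results, opponent_strengths)) / total

# Last 5 games: W L W W L, against opponents of varying strength.
results = [1, 0, 1, 1, 0]
strengths = [0.9, 0.2, 0.7, 0.4, 0.8]  # hypothetical ratings in [0, 1]

print(round(weighted_form(results, strengths), 3))  # → 0.667
```

Both losses here came against strong opponents, so the weighted form (0.667) reads better than the raw 3-of-5 win rate (0.6) would suggest on its own.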
Thanks for creating. Interesting hearing about the thought process you take and your approach to the problem.
Wow, YOU won 🎉🥇
Glad to meet a Taylor fan data scientist!
Awesome video!
Would love a video on Hamiltonian Monte Carlo and how this links to Metropolis-Hastings and Markov Chains.
Keep up the great content :)
One thing I think is missing from your approach to the team's history is that you aren't telling the model which of the teams they played against were good and which weren't (beating the best team and beating the worst team obviously tell you different things about how good your team is). I've had some reasonable success predicting the other football using two models trained together: the first was an LSTM that was fed the outcomes of the matches between ALL teams over the previous year and produced a status vector for each of them, and the second, given the status vectors of the two teams I cared about plus some other information about the match, predicted the final result.
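A much simpler way to fold opponent quality into the features, as a point of comparison to this LSTM approach (not the commenter's actual model), is a basic Elo-style rating: every result updates both teams, so beating a strong team moves your rating more than beating a weak one. Team names and the toy season below are hypothetical.

```python
# Minimal Elo-style rating sketch (hypothetical teams and results).

def expected_score(r_a, r_b):
    """Probability that A beats B implied by current ratings."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(ratings, team_a, team_b, a_won, k=32):
    """Zero-sum rating update after one game."""
    e_a = expected_score(ratings[team_a], ratings[team_b])
    s_a = 1.0 if a_won else 0.0
    ratings[team_a] += k * (s_a - e_a)
    ratings[team_b] -= k * (s_a - e_a)

ratings = {"A": 1500.0, "B": 1500.0, "C": 1500.0}
# Toy season: A beats B, B beats C, A beats C.
for a, b, a_won in [("A", "B", True), ("B", "C", True), ("A", "C", True)]:
    update(ratings, a, b, a_won)

print(sorted(ratings, key=ratings.get, reverse=True))  # → ['A', 'B', 'C']
```

The resulting ratings (or rating differences between the two matchup teams) could then feed the downstream model, much like the status vectors described above, without needing a neural net at all.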
This sounds interesting! Have you tried treating the two things as separate variables? E.g. clustering the teams and using that as a measure of how good they are, and the sequence of the last games independently of team quality? If so, did it yield good results?
Congratulations on your model being right! Amazing!
Holy cow! What a prediction!!!
Wow your model was right!
Love it! What about doing some feature engineering on the last 5 games of each team instead of using the RNN for that, and then fitting a much simpler model? That could help balance the size of the dataset against the complexity of the model.
well done! would you mind explaining how you derived the baseline probability of winning for each example in the validation set?
There's a lot that's wrong with this model. You would have benefited from looking at the published literature on prediction models for American sports. Specifically, what's been found to work best is either a decision tree or a boosted decision tree. In any case, two points:
1) In selecting the last five games or the last ten games, you aren't following an optimization method to remove bias from your model. You have to implement a cross-validation method. That is, you need a randomized, combinatorial approach that selects, for instance, the most recent game, the first game, the third game, etc. Bias in the model can only be removed if (almost) all random combinations are compared against each other.
2) In selecting the initial features for your logistic model, you didn't implement LASSO. That is, the first step in selecting your features is to test your assumption that this is the correct number of features. A common method is to use LASSO to see how many are independent and therefore how many are predictive. (It could also be that you aren't using enough features, which LASSO would likewise make evident.)
Hey, first off, thank you for taking the time to write up your comment.
I should have mentioned in passing the other models I tried, but every tree-based model (decision tree, random forest, gradient-boosted decision tree) gave weaker performance on the validation set than the logistic regression.
I definitely agree with your point about needing more robust lag-window selection. I basically tried 5 and 10 as a gut check, but we should iterate over all combinations (as well as all the other hyperparameters available to us) if we were trying to get the best performance possible.
I would agree with the feature-selection point if our feature space were a lot bigger (in the dozens, hundreds, or more), since that might interfere with the model learning properly. In fact, richer features are probably the main limitation of this model. But with the 6 features we're using, I think there's an argument that the RNN would learn to ignore those that are ineffective at solving the task. Of course, we'd need to look under the hood at the learned weights of the RNN to know for sure.
Again, really appreciate your insight into how to make this model even better, cheers!
@@ritvikmath Yes, you're right that RNNs do work as an implicit form of LASSO; but this is why it should have been implemented when you were developing your logistic model. That is, because you've chosen to map many features to a single outcome, whether a team won or lost, you've essentially lost the predictive ability of RNNs by not using sequential data that can be predictive. In other words, because you don't know whether multicollinearity affects your features, you won't learn this from an RNN.
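The two suggestions debated in this thread, randomized lag-window selection via cross-validation and L1 (LASSO-style) feature selection, could be sketched roughly as below. The data here is synthetic (only the 5 most recent "games" carry signal by construction), so this is an illustration of the procedure, not the video's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: one row per matchup, one column per recent game
# (most recent first). Only columns 0-4 are informative by construction.
X = rng.normal(size=(400, 10))
y = (X[:, :5].mean(axis=1) + 0.3 * rng.normal(size=400) > 0).astype(int)

# 1) Randomized combinatorial lag selection: instead of hard-coding
#    "last 5" vs "last 10", score random game subsets with cross-validation.
best_subset, best_score = None, -1.0
for _ in range(30):
    cols = sorted(rng.choice(10, size=5, replace=False).tolist())
    score = cross_val_score(LogisticRegression(), X[:, cols], y, cv=5).mean()
    if score > best_score:
        best_subset, best_score = tuple(cols), score

# 2) L1-penalized logistic regression: uninformative coefficients shrink
#    to exactly zero, giving a quick read on how many features to keep.
lasso_lr = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
n_kept = int((lasso_lr.coef_ != 0).sum())

print(best_subset, round(best_score, 3), n_kept)
```

On this toy data the best-scoring subsets tend to be the ones overlapping the informative columns, and the L1 fit zeroes out most of the pure-noise columns, which is the diagnostic the commenter is describing.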
Both teams in the Super Bowl had to go undefeated in the playoffs to get to the game, so their recent win–loss ratios should be the same apart from whether they got a bye week. Point spreads in playoff games might be more predictive.
The issue with predicting the unpredictable is that the odds, or the price, are always the best indicator.
Could you make a series specifically about implementing data science portfolio projects? I still have no idea what kind of DS projects I should build to apply for a DS job, or what those projects really look like in detail. 🙏
I think the next obvious improvement is taking injuries into consideration. If a team loses 3 key players over their last two games, their next game is not going to go as well. I'm sure injury reports are well documented, and the feature could simply be the number of starting players injured before the game.
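That feature really could be a one-line count per team per game. A minimal sketch, with a completely hypothetical injury report (team codes and players invented):

```python
# Hypothetical pre-game injury report rows: (team, player, is_starter).
injury_report = [
    ("KC", "Player A", True),
    ("KC", "Player B", False),
    ("SF", "Player C", True),
    ("SF", "Player D", True),
]

def starters_out(report, team):
    """Count of injured starters for `team` going into the game."""
    return sum(1 for t, _, is_starter in report if t == team and is_starter)

print(starters_out(injury_report, "KC"), starters_out(injury_report, "SF"))  # → 1 2
```

The resulting integer (or the difference between the two teams' counts) would slot straight into the logistic regression alongside the existing features.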
I'm not into sports (at all), but I do fancy the idea of using data science to make money betting on sports.
Question 1: Is home-field advantage of such significance that it deserves to be featured so strongly in the model?
Question 2: Why stop at only 5 or 10 games? Why not track the performance of the two teams since the last Super Bowl?
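Question 1 can be sanity-checked directly from historical results before deciding how much weight it deserves. A rough sketch with a made-up tally (the 140-of-250 figure is illustrative, not real NFL data):

```python
import math

# Hypothetical tally: home team won 140 of 250 games.
home_wins, games = 140, 250
p_hat = home_wins / games

# Normal-approximation z-score against the "no home advantage" null p = 0.5.
z = (p_hat - 0.5) / math.sqrt(0.25 / games)

print(round(p_hat, 3), round(z, 2))  # → 0.56 1.9
```

A z-score near 2 on a sample like this would suggest a real but modest edge, which is one way to decide whether the feature has earned a strong role in the model.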
Now you made me interested in the outcome. So far I was only interested in the female performer in the break 😂
Interesting video. I wonder, do you use Bayesian inference or Bayesian neural networks in your work?
hey I didn't happen to use those here but will look into them for next time! cheers!
@@ritvikmath He wants to know about your actual day job, I think.
Nice prediction but also lucky hahahaha! Nice job
I wonder when we'll see a video on gen AI.
Can you share the code?
nice
Thanks 🙏
Wow I cannot believe that worked so well!