The web scraping demo here is fantastic, very clear and easy to apply to other aspects of the website. Top man!
Thanks!
Thank you for taking the time to do this! Been wanting to learn it for a while but lacked the basic skills to start and run it step by step. It'd be great if there was a way to just pick a team and start scraping their data from each game for a specific time period... Maybe there's already more work on this as well. Either way I appreciate it!
Thanks for the shout out :)
You bet
Serhii I read your blog post. That was awesome man, thanks for putting that out there .
@@GuardianApe thanks 🙏😊
Hey man, excellent video!! I started a master's in data science and I wanted to practice with something related to football. I will use this for my FPL team
Just coming across and had to click that subscribe button. You're so informative I wish you were my prof 😂 awesome work man!
Thank you! Welcome aboard!
Great Video! Congrats! You could get the entire json converted directly to dataframe by doing:
import ast
pd.read_json(json.dumps(ast.literal_eval(str(data_json['h']))))
yep! i didn't learn about ast until after this video but it's a great package. Thanks for pointing that out!
Superb content man! Btw I have good memories of Barcelona, my team (Internacional) defeated them in 2006 with Adriano Gabiru's goal.
Excellent video. Keep up the good work!
Thanks!
Thanks McKay, learned a lot from this!
My man! Unreal, helping me a ton rn!!
Glad to help man!
@@McKayJohns hey man, what does the x and y coordinates run to and from on understat?
100 x 100 is the scale
thanks man you saved few hours of my coding
Don't know if this has already been posted, but the nested for loops can be replaced with the following code:
for shot_event in data_home:
    x.append(shot_event['X'])
    y.append(shot_event['Y'])
    xg.append(shot_event['xG'])
    team.append(shot_event['h_team'])
And the same for the away team.
Much cleaner imo this way - No nested loops and no multiple ifs.
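To avoid writing the loop twice, the same idea can be wrapped in a small function. This is only a sketch: `collect_shots` is a made-up helper name, the sample rows are invented, and it assumes the away side is labelled with an 'a_team' key the same way the home side uses 'h_team'.

```python
def collect_shots(events, team_key):
    """Pull X, Y, xG and the team name out of a list of shot dicts."""
    x, y, xg, team = [], [], [], []
    for shot_event in events:
        x.append(shot_event['X'])
        y.append(shot_event['Y'])
        xg.append(shot_event['xG'])
        team.append(shot_event[team_key])
    return x, y, xg, team

# Invented sample rows in the shape used in the video (values are strings).
data_home = [{'X': '0.88', 'Y': '0.50', 'xG': '0.31', 'h_team': 'Home FC', 'a_team': 'Away FC'}]
data_away = [{'X': '0.75', 'Y': '0.40', 'xG': '0.05', 'h_team': 'Home FC', 'a_team': 'Away FC'}]

home_cols = collect_shots(data_home, 'h_team')
away_cols = collect_shots(data_away, 'a_team')  # 'a_team' is an assumption
```

Each call returns four lists in the order x, y, xg, team, so the home and away results can be concatenated before building the DataFrame.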
Great tutorial, cheers McKay. Instant new sub!
Welcome!
great video lesson
This had to be done , thanks for sharing your knowledge.
You bet
Great work man. Appreciate it.
Glad to help
This has been a great help. Thanks
Glad it helped!
This is an awesome tutorial! Thanks so much!
This is really helpful especially for someone starting with football analysis and getting stuck at the initial step of finding the right data. Is there a way to get pass or any event data in general from understat?
Unfortunately, understat only provides the shot locations.
Great video! Have you found a way to iterate over the competitions to retrieve all match urls for each competition/season? Or given the structure of Understat we have to manually collect all of them?
Thank you so much for this video
Genius! Really helpful!
Glad it was helpful!
Amazing content❤️❤️
Thank you 🙌
Amazing content. You should be very proud of what you are doing for the community, especially the people who are just beginning in the field of data visualisation for football. Can I check if the method you have used is easily transferable to other sites so we can easily scrape data? Also, in one of your other videos you mentioned that you wouldn't recommend scraping the data for any of the analysis. What is the reason for that?
I appreciate that! But yes you can use this method to scrape other websites, you will just need to adjust the tags you are looking for. Some things may not be in JSON so you will have to adjust accordingly.
As well, I meant to say that you shouldn't scrape the data and then use it to make a ton of money. If you scraped understat and then put up your own for-profit version of understat without their consent, that would probably be a no.
@@McKayJohns Thank you for your response and 100% with you on "ethical" scraping. I really struggled to scrape data from whoscored using this method, do you know what could be wrong? Also, would it be possible to do an article using whoscored as the reference site?
@@sehgaldeepika fully agree, that will be dope.
As far as the transformation from json to pd.DataFrame is concerned that one also works :
# Combine 'h' and 'a' dictionaries into a single list
combined_data = data['h'] + data['a']
# Create a DataFrame from the combined data
df = pd.DataFrame(combined_data)
# Display the DataFrame
df
So, it does really create a full data frame from json, having that home/away parameter as a column. Then anyone could try his own cleaning wrangling or usage of understat data himself.
Thank you so much broooooo 😍
Welcome 😊
What is that x and y? if those are the x,y coordinates then why does it range from 0-1. Then it will be a square...
please someone help me out with this..
Please do a video on scraping data and saving it to a CSV file for pizza, radar and other charts.
🙏
Where can I download updated scraped data from the understat website? On github someone shared a package with csv files but last updated 3 years ago. I'm not familiar with Python and can't update the data myself.
Nice. Where can I learn football analytics?
And is it possible to land a job in football analytics?
Bro this was so helpful, but how can i segregate other data like shot start/end or defensive actions?
I don't believe understat provides that data... they are more focused on just shot location, xG and other stats.. You would probably need to get data from places such as wyscout or whoscored to do that
Thank you very much, man! It is helpful for my graduation work in university
Guys I get the following error json_data = json_data.encode('uft8').decode('unicode_escape')
LookupError: unknown encoding: uft8. Do you know why I get this error? And how can I solve it
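The error comes from the codec name: it is spelled 'uft8' in that line, and Python only knows 'utf8'. Here is a rough offline sketch of the decode step with the spelling fixed. The `script_text` string is an invented stand-in for what sits inside the script tag, and the regex is just one way to grab the part between the quotes, not necessarily how the video does it.

```python
import json
import re

# Invented stand-in for the text of the <script> tag on a match page.
script_text = (
    r"var shotsData = JSON.parse('\x7B\x22h\x22\x3A\x5B\x7B\x22X\x22\x3A"
    r"\x220.88\x22\x7D\x5D,\x22a\x22\x3A\x5B\x5D\x7D');"
)

# Take everything between JSON.parse(' and ')
match = re.search(r"JSON\.parse\('(.*?)'\)", script_text)
json_data = match.group(1)

# The codec is 'utf8', not 'uft8'. unicode_escape turns \x7B into { and so on.
json_data = json_data.encode('utf8').decode('unicode_escape')
data = json.loads(json_data)
```

After this, `data` is a normal dict with 'h' and 'a' keys, same as in the video.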
By converting everything to strings, surely that means we can't manipulate the numbers, since there aren't any numbers, just strings?
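The values can be turned back into numbers before doing any maths on them. A small sketch in plain Python, with invented sample rows and a made-up helper name `to_numbers`; the choice of which keys are numeric is an assumption.

```python
# Keys assumed to hold numeric values stored as strings.
NUMERIC_KEYS = ('X', 'Y', 'xG')

def to_numbers(shot, numeric_keys=NUMERIC_KEYS):
    """Return a copy of the shot with the numeric fields converted to float."""
    return {key: float(value) if key in numeric_keys else value
            for key, value in shot.items()}

# Invented sample rows.
shots = [
    {'X': '0.88', 'Y': '0.50', 'xG': '0.25', 'player': 'Player A'},
    {'X': '0.75', 'Y': '0.40', 'xG': '0.50', 'player': 'Player B'},
]

clean = [to_numbers(shot) for shot in shots]
total_xg = sum(shot['xG'] for shot in clean)
```

If the data is already in a DataFrame, `df[['X', 'Y', 'xG']].astype(float)` does the same conversion on whole columns.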
could you explain better the coordinate system that these dataframe has? i can't understand where is located the origin (x,y)=(0,0), because these coordinates are always positive (>0). Great video btw GJ
😀
Great stuff man, which club do you support? Please don't say arsenal
Barcelona 😂
How would you plot this for the shot map
Sorry, I would like to ask, I am a beginner: what exactly is the aim of scraping football data from understat?
they have data you can use to analyze things such as shots, xg, etc.
can you do this method on the page
b e t 3 6 5 ?
I couldn't with the instruction in this video
Delete the spaces between the words
Hi, thanks for the video. I scraped the shots data from understat, but I am not sure how to convert the X and Y values into X-coordinate, Y-coordinate values to create a shot map. Can you please give an idea?
If you watch some other videos they explain how to do this!
@@McKayJohns Thanks for the response. I did try, I have been searching for almost a week, but I couldn’t find anything. Almost all the videos I’ve seen on YouTube or elsewhere start with a ready-made file where the X and Y coordinates are already there. Or I guess I don’t know how to search exactly 😬
You just have to multiply the X and Y values by the dimensions of the pitch that you have selected. So you could create a new column, something like NewX = X*120.
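A quick sketch of that scaling. The 120 x 80 pitch is an assumption (swap in whatever dimensions your plotting setup uses), `to_pitch` is a made-up helper name, and the inputs are the string fractions that come out of the JSON.

```python
PITCH_LENGTH = 120  # assumed pitch length, as in the NewX = X*120 suggestion
PITCH_WIDTH = 80    # assumed pitch width

def to_pitch(x, y, length=PITCH_LENGTH, width=PITCH_WIDTH):
    """Scale fractional X/Y values (strings or floats) to pitch units."""
    return float(x) * length, float(y) * width

new_x, new_y = to_pitch('0.5', '0.25')
```

With a DataFrame the same thing is one line per column, for example `df['NewX'] = df['X'].astype(float) * 120`.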
Hey, I was wondering: if I want to scrape multiple pages, what kind of timeout should I be using between each request? Thanks for the very helpful video
It depends on how good the site is at slowing down / blocking requests. Technically you can hit it as fast as you want, but for understat's sake maybe slow it down by half a second between requests so they don't get overwhelmed.
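One way to add that half-second pause, as a sketch. `fetch_politely` is a made-up helper that takes the fetching function as an argument, so it can be tried offline with a fake fetcher; in real use you would pass `requests.get`.

```python
import time

def fetch_politely(urls, fetch, delay=0.5):
    """Call fetch(url) for each url, sleeping `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Offline demo with a fake fetcher and no delay; real use would look like
# fetch_politely(list_of_urls, requests.get)
pages = fetch_politely(['page-1', 'page-2'], str.upper, delay=0)
```

The 0.5 default only mirrors the suggestion above; nothing here says it is the right value for any particular site.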
How can I get data manually from a football match, please?
Great video! I'm trying to do this in Java, do you know how to do the encode & decode in Java? I'm talking about this line:
encode('utf8').decode('unicode_escape')
Thank you!
Do you know how I can scrape multiple matches/pages on that website?
Hi, can you help convert the third script on the page, called "rostersData"? I changed the scripts index from 1 to 2, but even after changing the variables it doesn't work. It seems a bit different from the shotsData one... thanks!
I had this working a while back, but went to run another game, and I'm getting this error:
NameError Traceback (most recent call last)
in ()
1 res = requests.get(url)
----> 2 soup = BeautifulSoup(res.content, 'lxml')
3 scripts = soup.find_all('script')
NameError: name 'BeautifulSoup' is not defined
Nothing else changed but the match id. Thank you for your tutorials
Can I ask what the x and y have for meaning in the match?
it means the coordinate of the player on the pitch
Did you make modifications to your scraper based on my feedback on Twitter?
I've looked at it yes!
But i dont think understat has any international or CL data right? Just the leagues ig
Yeah top 5 leagues
Am a Real Madrid fan and I subscribed!😁😁😁...thanks for sharing...I will be visiting again!
Awesome! Thank you! Even tho you are a madrid fan ;)
Hi mate. Is there a way to visualise the data at the end
Ya i have a lot of tutorials on my channel which show how to make shotmaps or xG charts for example
@@McKayJohns Hi McKay. I have had a look but I think for those visualisations you need the corresponding minutes with the data, which isn't in the understat data. Is there any way you can do it without the minutes?
Any recommendations on how to scrape Sofascore data?
I personally have never done it. Probably would just need to use requests and BeautifulSoup
Thank you for providing this tutorial! If I have a list of the match IDs I want to scrape (instead of 1 by 1), what are the necessary modifications to the code? I guess that an additional for loop should be written, but I don't know where.
You would need to loop through your list of match IDs, use the next match ID on each pass, and then aggregate all of that data into a single dataframe.
Put the for loop at the beginning of the code and it should work out.
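Roughly, the loop could look like the sketch below. It assumes the single-match code from the video has been wrapped in a function that takes a match ID and returns the dict with 'h' and 'a' lists; `scrape_matches`, `fake_get_match_data` and the extra 'side' and 'scraped_id' keys are all made up for illustration.

```python
def scrape_matches(match_ids, get_match_data):
    """Run the single-match scrape for every ID and pool all the shots."""
    all_shots = []
    for match_id in match_ids:
        data = get_match_data(match_id)  # expected: {'h': [...], 'a': [...]}
        for side in ('h', 'a'):
            for shot in data[side]:
                # Copy the shot and record where it came from.
                all_shots.append({**shot, 'side': side, 'scraped_id': match_id})
    return all_shots

# Fake scraper so the example runs without touching the website.
def fake_get_match_data(match_id):
    return {'h': [{'xG': '0.30'}], 'a': [{'xG': '0.10'}, {'xG': '0.05'}]}

rows = scrape_matches([101, 102], fake_get_match_data)
```

`pd.DataFrame(rows)` then gives the single dataframe with every match in it.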
Hey man, fantastic videos, really great stuff!
I just have one quick question: is this method easily transferable to scraping a player's data rather than data from a single game?
I have tried it and gotten along nicely until it came to around the 16 minute mark in this video. Where you have inputted "data_away = data ['a']" and "data_home = data ['h']", I am struggling to figure out what to put as obviously there isn't any home/away data to separate. I hope I am making sense when I am explaining this, I'm probably not though!
Anyways, great work man
Appreciate it!
So to get an individual player's data, you will need to switch the url to something like understat.com/player/2097 (that is for Messi), and you can find the shots for the player if you look through the json data.
If you are just wanting to get an individual player's shots you won't need that part of the code.
If you have any other questions reach out to me on twitter!
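As a sketch of what skipping the home/away split might look like: this assumes the player's shots come back as one flat list, and that each shot carries 'h_a', 'h_team' and 'a_team' keys. Those key names are assumptions, so check them against your own JSON. The rows below are invented.

```python
def shot_team(shot):
    """Return the team the shooter was playing for (key names are assumed)."""
    return shot['h_team'] if shot['h_a'] == 'h' else shot['a_team']

# Invented rows standing in for the player's shot list.
player_shots = [
    {'X': '0.90', 'Y': '0.48', 'xG': '0.40', 'h_a': 'h',
     'h_team': 'Barcelona', 'a_team': 'Getafe'},
    {'X': '0.80', 'Y': '0.60', 'xG': '0.07', 'h_a': 'a',
     'h_team': 'Sevilla', 'a_team': 'Barcelona'},
]

# One pass over the flat list, no data['h'] / data['a'] step needed.
x = [shot['X'] for shot in player_shots]
y = [shot['Y'] for shot in player_shots]
xg = [shot['xG'] for shot in player_shots]
team = [shot_team(shot) for shot in player_shots]
```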
THANKYOUUU
Awesome video bro... help me write a program to alert me when my variable of choice (team) scores or gets a yellow card or wins a corner kick etc. I need to be able to punch in the id of the team and the id of the variable I want to keep an eye on, hook it up to the internet and let it scrape while I wait for the program to alert me if id (goal, corner, yellow card, penalty, odd) is True...
U get the idea....
Hello brother, thanks for the video. i want a scraping project done. Are you able to help please? we can talk privately.
Hi, this is a great video, can you please scrape lotto data?
Does the GitHub file still exist?
github.com/mckayjohns/youtube-videos hey sorry, I'll update it, but here's where the files are at now
Wowwww