Web Scraping NBA Games With Python [Full Walkthrough W/Code]

Dataquest

34K views

Published: Jan 25, 2025

Comments • 89

@joaothomazlemos13 2 years ago ⁺⁴
Hello!
I'm trying to get my first data scientist job, and I want a personal project for my portfolio built around the NBA, which I recently became a big fan of. I wanted to do the project from zero, which means scraping. The thing is, scraping was the one thing I had zero knowledge of.
I found this video, and it is absolute pure gold. I'm on Windows, so I had to use sync mode and change a few things. It's working! I also tried to personalize a few things, and I commented the whole code. I'd love to get in touch with you for some insights going forward, so the project is not a copy of yours, per se.
Thank you for the video, this kind of knowledge is much needed! Cheers from Brazil.
@Dataquestio 2 years ago ⁺¹
Glad it helped you!
@shukkkursabzaliev1730 1 year ago ⁺³
Hey! Thanks for the amazing tutorial. There's one thing I can't understand.
We are preparing all these features for the ML model to train on, but if we want to predict future games, these features won't be available. So what will the inputs be for the trained model?
@kennethcolombe5579 1 year ago ⁺⁵
If you are coming from this with some knowledge about basketball, the "standings" mentioned are not actually standings, but the game schedule for that month. It was throwing me off a bit when continuously referenced...not sure it bothers anyone else but thought it worth mentioning.
@mkzzzzzzzzzz1 2 years ago ⁺³
28:50 How can you run await outside of a function? I don't really use Jupyter. I tried something like z = [await scrape_season(x) for x in SEASONS] and scrape_season(z), but neither worked. Any help appreciated.
@Dataquestio 2 years ago ⁺⁶
You can use await inside a Jupyter notebook since everything in Jupyter is already running inside an async event loop.
I would recommend stripping out async if you're writing a regular Python script outside of Jupyter. You'll use the Playwright sync API (instead of the async API). You'll have to replace the import of playwright with `from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout`.
Then you'll need to remove all the `async` and `await` keywords in the code, and write `with sync_playwright() as p:` inside the `get_html` function. This will remove the need for async entirely. But it won't work in a Jupyter notebook, only in a regular Python file.
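For readers who would rather keep the async code in a regular `.py` file, one alternative (not the approach recommended in the reply above) is to wrap the top-level `await` calls in a coroutine and pass it to `asyncio.run`. This is a minimal sketch: `scrape_season` here is a hypothetical stand-in for the tutorial's coroutine so the example runs on its own, and the `SEASONS` range is the one quoted elsewhere in this thread.

```python
import asyncio

SEASONS = list(range(2016, 2024))


async def scrape_season(season):
    # Hypothetical stand-in: the real coroutine would drive Playwright here.
    await asyncio.sleep(0)
    return season


async def main():
    # `await` is legal here because we are inside an async function.
    results = []
    for season in SEASONS:
        results.append(await scrape_season(season))
    return results


if __name__ == "__main__":
    # asyncio.run creates the event loop that Jupyter would otherwise provide.
    print(asyncio.run(main()))
```

Note that `asyncio.run` cannot be called from inside Jupyter, because a loop is already running there.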
@keithravid5235 2 years ago
@Dataquestio I tried this and got:
"
Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.
"
As far as I can tell I'm not using an asyncio loop anymore since I made the changes you mentioned.
@Dataquestio 2 years ago
If you're running in Jupyter, you need to use async (like in the video). If you're writing a regular .py file and running from the command line, then you can use the sync API like I mentioned in the comment above.
You wouldn't get the error that you shared if you were running a regular Python script (create an `x.py` file, run it using `python x.py` from the command line).
Jupyter by default wraps code in an asyncio loop. So anything you run in Jupyter is already running async!
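A quick way to see which situation you are in is to ask asyncio whether a loop is already running. The helper below is a small sketch using only the standard library; its name is made up for illustration.

```python
import asyncio


def running_loop_name():
    """Return the class name of the running event loop, or None if there is none."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # Raised when no event loop is running in this thread,
        # which is the case for a plain `python x.py` script.
        return None
    return type(loop).__name__


print(running_loop_name())
```

Run from a plain script, this should report no loop, so the sync API is usable. If it reports a loop class instead, you are inside an event loop and the Playwright sync API will refuse to start.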
@mkzzzzzzzzzz1 2 years ago
@keithravid5235 Forwarding the message because he replied directly to me, so you won't see it. Check above/below.
@FlisB 2 years ago ⁺²
Nice tutorial. I am just curious what the purpose of opening a browser with Playwright is. Why not just use the requests library to get the HTML?
@birasafabrice 2 years ago
A new sub is gained, thank you for this tutorial!
@mkoller 2 years ago ⁺²
Nice tutorial! I’m still waiting for the data to download.
Impressive if you actually did the entire project with Jupyter Notebooks.
I had the Windows Playwright issue everyone is talking about, so I used PyCharm. It ran out of memory, so I had to run from the command line.
Curious: why did you make an opp column? You had rows of the same data without it, no?
@kinetiksports 10 months ago ⁺¹
I need help!! Can I email you with an error message I keep getting?
@OBBBB17 11 months ago
I'm trying to scrape with playwright but PlaywrightTimeout isn't working and I keep getting invalid syntax.
Cell In[5], line 13
except PlaywrightTimeout:

SyntaxError: invalid syntax
@michealwillis7500 3 months ago
Hey, I keep getting an error at (next(iter(done)).result())
Also, in the 3rd line of code, yours highlights playwright.async_api in blue. Mine doesn't. Is that an issue too? Please help.
@tomkmb4120 2 years ago ⁺¹
Just noticed your responses below. I'll have to try the code as a PyCharm file and see how I get on.
@Slicneil1 1 year ago
Thanks for the video, I follow all your work. The issue I am having is a continuous timeout error when trying to scrape the data. Any ideas on how to get around it?
@kirillprokhodtsev6249 11 months ago
Hi! I'm trying to get the 'line score' table, but without success. Table not found.
The Selenium method isn't suitable here, because it takes a lot of time, and I'm afraid my laptop will blow up. Has anyone met the same problem and solved it?
@garymichalske2274 1 year ago ⁺²
Thanks for the detailed explanation. Since I'm on Windows, I couldn't use Jupyter to run the code, so I've been trying your first option of using a Python IDE (I'm using PyCharm). I imported "from playwright.sync_api import sync_playwright" and eliminated the "async" and "await" keywords throughout the code. I was able to get all the standings pages (after a few timeouts) and was getting excited about the success! But I am having issues with the boxscore pages. The code starts with the April 2016 standings file and successfully saves three of the boxscore files, but it starts timing out on the fourth one and eventually throws this error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u010d' in position 38876: character maps to <undefined>". When it does, the related .html file is blank. Firefox seems to work a little better than Chrome, as it doesn't time out as often. Any idea how to get this to work?
@nemanjatamindzija58 1 year ago ⁺²
Add encoding="utf-8" to this line:
" with open(save_path, "w+", encoding="utf-8") as f:
      f.write(html) "
It is at the end of the scrape_game function. Hope it helps.
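The effect of that fix can be reproduced without scraping anything. The sketch below uses a hypothetical HTML snippet containing '\u010d' (the character named in the error) and a temporary file in place of the tutorial's `save_path`. It first shows that cp1252, the legacy Windows codec that Python reports as 'charmap', cannot encode the character, then writes and re-reads the file with an explicit UTF-8 encoding.

```python
import os
import tempfile

# Hypothetical sample HTML containing '\u010d' (č), the character from the error.
html = "<td>Don\u010di\u0107</td>"

# cp1252 has no mapping for this character, which is what triggers the error
# when open() falls back to the platform default encoding on Windows.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit encoding works regardless of the platform default.
save_path = os.path.join(tempfile.mkdtemp(), "box_score.html")
with open(save_path, "w+", encoding="utf-8") as f:
    f.write(html)

# Read it back with the same encoding to confirm nothing was lost.
with open(save_path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == html)
```

Any later code that reads the saved files back should pass the same `encoding="utf-8"` argument.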
@edsonarthurzancheta3052 1 year ago
@nemanjatamindzija58 Thank you, I was having the same problem.
@andresjacome3315 1 year ago
Hi! Can you please share the final code that you have for this project? I have the same Windows problem. @garymichalske2274
@f1ip_br 2 years ago ⁺²
Trying on both Jupyter and PyCharm and getting the same error on the parse_data part.
When running it, it throws a ValueError with
----> 6 line_score = read_line_score(soup)
In the box_scores loop, tracing back to
1 def read_line_score(soup):
----> 2 line_score = pd.read_html(str(soup), attrs = {'id': 'line_score'})[0]
Ending message is "ValueError: No tables found"
I have checked and double-checked the code, including running the version on GitHub, but there is no way to get it to work.
Any ideas?
Thank you for an excellent tutorial.
@jerryli2276 8 months ago
I've got the same problem going on. Have you figured it out, bro?
@f1ip_br 8 months ago
@jerryli2276 Not really. I started doing some different stuff to learn Python, forgot about this project, and never went back.
@ling6701 2 years ago
Beautiful project, thank you.
@jerryli2276 8 months ago
Hi, during the parsing part, when I run the code up to if len(games) % 100 == 0: print(f"{len(games)} / {len(box_scores)}"), it keeps giving me the error "html5lib not installed", even though I have installed it myself. Could you help me with it?
@nishchay89 1 year ago
Hi!!
I have one query.
Why did we take the max of each stat? What is the purpose behind it?
@ScottRachelson777 1 year ago
How is Playwright different from BeautifulSoup which also grabs HTML from website pages?
@herreramoralesjoseroman1504 2 years ago ⁺¹
Help! I couldn't instantiate the browser in the "get_html" function. I already changed p.firefox.launch() to p.chromium.launch(). Is it necessary to run a command beforehand to install the browsers for the "playwright" library?
@Dataquestio 2 years ago ⁺²
I showed it in the video - you need to run `playwright install` in the command line, or `!playwright install` in a Jupyter notebook, to install the browsers.
@Jollyjoky 8 months ago
Hi, I keep getting a NotImplementedError when trying to do this project on Windows:
"""Create subprocess transport."""
--> 524 raise NotImplementedError
Could someone help me? How do I correct it? I'm running on Windows and VS Code.
@meechmiliyan8965 2 years ago ⁺¹
Awesome stuff!! I am looking to parse box scores for player data. I would like to get player stat averages and, ideally, averages for opponent defensive stats. Could you suggest next steps?
@FlisB 2 years ago
You want average stats for players? I want to do something similar. Well I want to get moving averages of players, so that I will predict their points scored in the next games.
@SharpeLocks 8 months ago
@FlisB Did you ever figure out how to do this?
@nicksteele6578 1 year ago
I can't get all the data to scrape.
Any suggestions?
@tomphillips5513 1 year ago
Hello! I am doing a similar project but for the NFL. Firstly, is it okay to scrape data from the football reference website? Their T&Cs are rather unclear. Also, unlike the basketball reference website, where you have to iterate through the months to get all the games, you do not have to do this on the football reference website, so I am wondering how I would have to amend that part of the code. Any help would be very much appreciated, as this is for my final-year project (dissertation) at university. Have a great day, one and all.
@thebinarybin 8 months ago
I never could get Chromium to work. I looked everywhere for a solution for a very long time, so I ended up using Firefox as well. Does anyone have a solution to the Chromium issue they could point me to? I really want to figure that out. Great job with the video! Very intuitive. Wish more content was on the regular.
@cap_smok3r 2 years ago ⁺¹
Hello! Thanks for the great tutorial. I am an NBA fan and data nerd myself, and was wondering why you did not use 'nba_api' to get the most up-to-date game data?
And if someone does use it, is there a way to build a ticker to predict the win probability of your favorite team(s) in the next game of an ongoing season?
Thanks again for the great content!
@AS-rg9ly 1 year ago
The NBA API might only be for private use. I know the NFL API is.
@cap_smok3r 1 year ago
@AS-rg9ly It is not. I have tried it myself.
@ChristianJustinDeGuzman 20 days ago
Getting a NotImplementedError, any help?
@billybarnes6961 18 days ago
See my comment above if my last comment didn't go through. It's not showing for some reason.
@jordankasowski284 2 years ago
How do you get past the cookie wall? I can't download the proper HTML because of cookies.
@jomagabarsa 10 months ago
I tried to replicate it, but it didn't work for some reason. When I call the get_html function, I get a NotImplementedError, but it doesn't say anything else. Nice tutorial though.
@unclexphil7874 1 year ago
I’m having trouble installing Playwright. Can anyone help?
@hybridinc1035 2 years ago
This whole thing is not working. I tried like a thousand times and kept getting the same error. I can send a screenshot if possible.
@peperecabarren4536 2 years ago
Hello boss, is it normal to have to run scrape_season a few times to gather all the data if some of the requests time out? Thank you for your time.
@nicksteele6578 1 year ago ⁺³
I had this issue. I upped the request retries and the timeout time, and I have all the data now.
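The idea of raising the retry count can be illustrated without a browser. The following is a generic sketch, not the tutorial's code: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the built-in `TimeoutError` stands in for Playwright's own timeout exception.

```python
import time


def fetch_with_retries(fetch, url, retries=5, sleep=0.0):
    """Call fetch(url) up to `retries` times, pausing longer before each attempt."""
    for attempt in range(1, retries + 1):
        time.sleep(sleep * attempt)  # back off a bit more on every attempt
        try:
            return fetch(url)
        except TimeoutError:
            # With Playwright you would catch its own TimeoutError class here instead.
            print(f"Timeout error on {url} (attempt {attempt} of {retries})")
    return None  # every attempt timed out


# Hypothetical flaky fetcher: times out twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError(url)
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, "https://example.com/page")
```

With `retries` set below 3, the same fetcher would return `None`, which is the situation that leaves pages missing.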
@peperecabarren4536 1 year ago
@nicksteele6578 Thanks, homie.
@tenienteale 1 year ago
Hello, thanks for the tutorial!
@billybarnes6961 18 days ago
NOTE FOR ANYONE GETTING A NOT IMPLEMENTED ERROR:
You need to install WSL and run Jupyter Lab from it. Go to the directory where your lab files are, type wsl in the command prompt, then run jupyter lab. You may have to do something extra to make Jupyter Lab open automatically from WSL, but if you don't care, you can just copy and paste the address it gives you into Chrome and it'll open.
Windows doesn't support async Playwright, hence the error.
@coconutnut21 1 year ago ⁺²
code gives this result "NotImplementedError:
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings..."
@Laochuang1 1 year ago ⁺²
I have the same problem. Did you manage to solve it?
@ryanbiancavilla9921 1 year ago ⁺¹
Also getting this error.
@RICHARDKOVARLIETZBTW-ye7gu 3 months ago
You need to run Jupyter on WSL if you're using Jupyter on Windows. Playwright async doesn't work with Windows.
@noamsolow2825 1 month ago
Me too.
Any chance you solved it?
@julesdrums6167 2 years ago
Trying to re-run just get_data.ipynb in Jupyter Lab on my local machine. I have changed p.chromium.launch() to p.firefox.launch() in get_html() and am still getting the "Timeout error on {url}" message when I run
`for season in SEASONS:
    await scrape_season(season)`
Any tips?
@julesdrums6167 2 years ago ⁺²
Update: a couple of nifty tricks to get this part to work. Change `retries` in get_html to at least 5. You will probably still run into the timeout issue, which causes either BeautifulSoup or f.write(html) to error out, so what you need to do is keep running the code over and over again and keep an eye on the standings directory. As it populates with each season's monthly HTML files, modify the SEASONS variable to exclude those years (e.g. change it from SEASONS = list(range(2016,2024)) to SEASONS = list(range(2017,2024)) and keep moving that lower bound up as needed).
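Instead of editing `SEASONS` by hand between runs, a script can check which pages are already on disk and request only the missing ones. This is a different technique from the manual one described above, shown as a sketch under stated assumptions: the helper names, URLs, and file names are hypothetical, each page is assumed to be saved under the last segment of its URL, and a zero-byte file is treated as a failed download.

```python
import os
import tempfile


def is_saved(path):
    """A page counts as saved only if the file exists and is not empty."""
    return os.path.exists(path) and os.path.getsize(path) > 0


def pending_urls(urls, out_dir):
    """Return the URLs whose output file is missing or blank in out_dir."""
    return [
        url for url in urls
        if not is_saved(os.path.join(out_dir, url.split("/")[-1]))
    ]


# Hypothetical demo: one page saved, one blank from a failed run, one missing.
standings_dir = tempfile.mkdtemp()
with open(os.path.join(standings_dir, "NBA_2016_games-april.html"), "w") as f:
    f.write("<html></html>")
open(os.path.join(standings_dir, "NBA_2016_games-may.html"), "w").close()

urls = [
    "https://example.com/leagues/NBA_2016_games-april.html",
    "https://example.com/leagues/NBA_2016_games-may.html",
    "https://example.com/leagues/NBA_2016_games-june.html",
]
todo = pending_urls(urls, standings_dir)
print(todo)
```

Re-running the scrape over `todo` rather than the full list avoids repeating requests that already succeeded.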
@dimz130588 1 year ago
Same here.
@joeguerby 2 years ago
Hello, thanks for this amazing tutorial. Has anyone else had an error while installing Playwright? I got the "playwright is not recognized as an internal or external command" error message both in the command line and in a Jupyter notebook. Can anybody help me, please?
@Dataquestio 2 years ago ⁺¹
You would need to run `pip install playwright` in the command line, or `%pip install playwright` in Jupyter. (remove the `, that's just to show which part is the command).
@joeguerby 2 years ago
@Dataquestio I did it, but '!playwright install' failed and I don't know why (you said that we must also run this in Jupyter or the command line). This is the error message I got: 'playwright' is not recognized as an internal or external command,
operable program or batch file.
Another request: can you add the current season's results to the CSV files available in the project files, please?
@optimist4472 2 years ago ⁺⁴
the program shows NotImplementedError after executing "html = await get_html....." and I have done every step as you have shown
@Dataquestio 2 years ago ⁺¹
It looks like there is an issue with Playwright and Jupyter on certain versions of Windows/Python (see the issue at github.com/scrapy-plugins/scrapy-playwright/issues/7).
Your options:
* Put the code into a regular `.py` file and run it as a Python script (not in a Jupyter notebook) (easiest)
* Install Windows Subsystem for Linux and run Jupyter notebook using WSL
* Try to upgrade your version of Python/Jupyter and see if that works
@davidichoho5788 2 years ago
@Dataquestio Please, I'm having the same issue and I'm using Windows.
@marcgold424 2 years ago ⁺¹
Using Windows 10 and VS Code, I get: html = await get_html(url, "#content .filter")
SyntaxError: 'await' outside function. We can't make html a global variable or put it in the function, huh? Can we use something else besides Playwright? 😞
@Dataquestio 2 years ago ⁺⁶
If you write your code in a regular Python file (no Jupyter notebook), then you can use the Playwright sync API (instead of the async API). You'll have to replace the import with `from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout`.
Then you'll need to remove all the `async` and `await` keywords in the code, and write `with sync_playwright() as p:` inside the `get_html` function. This will remove the need for async entirely. But it won't work in a Jupyter notebook, only in a regular Python file.
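Put together, the changes described above might look roughly like the sketch below. It is a reconstruction, not the video's exact code. The parameter names, the `time.sleep(sleep * i)` back-off, and the Firefox launch come from tracebacks quoted in this thread, and the timeout message from another comment. The default values and the `goto` / `inner_html` steps are assumptions. The import sits inside the function only so that the file still loads on a machine without Playwright; at the top of a real script it would work the same way.

```python
import time


def get_html(url, selector, sleep=5, retries=3):
    # Sync API import, as given in the reply above.
    from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

    html = None
    for i in range(1, retries + 1):
        time.sleep(sleep * i)  # wait longer before each successive attempt
        try:
            with sync_playwright() as p:  # no `async` / `await` anywhere
                browser = p.firefox.launch()
                page = browser.new_page()
                page.goto(url)
                html = page.inner_html(selector)  # assumed extraction step
                browser.close()
        except PlaywrightTimeout:
            print(f"Timeout error on {url}")
            continue
        else:
            break  # success, stop retrying
    return html
```

Save this in a `.py` file and call `get_html(...)` without `await`. As the reply notes, it will not run inside Jupyter.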
@Migzee34 2 years ago
@Dataquestio I will try this when I get home later. A few questions, if you see this:
Are you using Windows? Also, would you recommend I just substitute Selenium, or run it as a Python script as you said?
Thanks for the content, love the channel.
@AkachiIsGod 2 years ago
SyntaxError: 'await' outside function
@pirrisynho 2 years ago
same here, I scraped with requests and it works...
@zakyvids6566 2 years ago
Please make a Python crash course.
@sohanverma3255 2 years ago ⁺¹
cool
@jonathanschild4092 2 years ago
I keep getting the error below after running the for season in SEASONS loop. I'm writing it in a regular Python script rather than Jupyter, in case that's a factor.
playwright._impl._api_types.Error: NS_ERROR_UNKNOWN_HOST
@Laochuang1 1 year ago
Did you solve it?
@manjunathreddy5566 2 years ago ⁺⁴
Task exception was never retrieved
future:
Traceback (most recent call last):
File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_connection.py", line 224, in run
await self._transport.connect()
File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_transport.py", line 133, in connect
raise exc
File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\site-packages\playwright\_impl\_transport.py", line 121, in connect
self._proc = await asyncio.create_subprocess_exec(
File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\asyncio\subprocess.py", line 218, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1652, in subprocess_exec
transport = await self._make_subprocess_transport(
File "C:\Users\manjunatha.reddy\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 493, in _make_subprocess_transport
raise NotImplementedError
NotImplementedError
Kindly help me with the above error.
@주희룡 2 years ago ⁺¹
I also got the same error. What's wrong?
Task exception was never retrieved
future:
Traceback (most recent call last):
File "C:\Users\JU HEE RYONG\anaconda3\lib\site-packages\playwright\_impl\_connection.py", line 224, in run
await self._transport.connect()
File "C:\Users\JU HEE RYONG\anaconda3\lib\site-packages\playwright\_impl\_transport.py", line 133, in connect
raise exc
File "C:\Users\JU HEE RYONG\anaconda3\lib\site-packages\playwright\_impl\_transport.py", line 121, in connect
self._proc = await asyncio.create_subprocess_exec(
File "C:\Users\JU HEE RYONG\anaconda3\lib\asyncio\subprocess.py", line 236, in create_subprocess_exec
transport, protocol = await loop.subprocess_exec(
File "C:\Users\JU HEE RYONG\anaconda3\lib\asyncio\base_events.py", line 1630, in subprocess_exec
transport = await self._make_subprocess_transport(
File "C:\Users\JU HEE RYONG\anaconda3\lib\asyncio\base_events.py", line 491, in _make_subprocess_transport
raise NotImplementedError
NotImplementedError
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
in
----> 1 html = await get_html(url, "#content .filter")
in get_html(url, selector, sleep, retries)
4 time.sleep(sleep * i)
5 try:
----> 6 async with async_playwright() as p:
7 browser = await p.firefox.launch()
8 page = await browser.new_page()
~\anaconda3\lib\site-packages\playwright\async_api\_context_manager.py in __aenter__(self)
44 if not playwright_future.done():
45 playwright_future.cancel()
---> 46 playwright = AsyncPlaywright(next(iter(done)).result())
47 playwright.stop = self.__aexit__ # type: ignore
48 return playwright
~\anaconda3\lib\site-packages\playwright\_impl\_connection.py in run(self)
222 self.playwright_future.set_result(await self._root_object.initialize())
223
--> 224 await self._transport.connect()
225 self._init_task = self._loop.create_task(init())
226 await self._transport.run()
~\anaconda3\lib\site-packages\playwright\_impl\_transport.py in connect(self)
131 except Exception as exc:
132 self.on_error_future.set_exception(exc)
--> 133 raise exc
134
135 self._output = self._proc.stdin
~\anaconda3\lib\site-packages\playwright\_impl\_transport.py in connect(self)
119 env.setdefault("PLAYWRIGHT_BROWSERS_PATH", "0")
120
--> 121 self._proc = await asyncio.create_subprocess_exec(
122 str(self._driver_executable),
123 "run-driver",
~\anaconda3\lib\asyncio\subprocess.py in create_subprocess_exec(program, stdin, stdout, stderr, loop, limit, *args, **kwds)
234 protocol_factory = lambda: SubprocessStreamProtocol(limit=limit,
235 loop=loop)
--> 236 transport, protocol = await loop.subprocess_exec(
237 protocol_factory,
238 program, *args,
~\anaconda3\lib\asyncio\base_events.py in subprocess_exec(self, protocol_factory, program, stdin, stdout, stderr, universal_newlines, shell, bufsize, encoding, errors, text, *args, **kwargs)
1628 debug_log = f'execute program {program!r}'
1629 self._log_subprocess(debug_log, stdin, stdout, stderr)
-> 1630 transport = await self._make_subprocess_transport(
1631 protocol, popen_args, False, stdin, stdout, stderr,
1632 bufsize, **kwargs)
~\anaconda3\lib\asyncio\base_events.py in _make_subprocess_transport(self, protocol, args, shell, stdin, stdout, stderr, bufsize, extra, **kwargs)
489 extra=None, **kwargs):
490 """Create subprocess transport."""
--> 491 raise NotImplementedError
492
493 def _write_to_self(self):
NotImplementedError:
@NotOnVerg 2 years ago ⁺¹
@주희룡 Same problem here.
@md.shafaatjamilrokon8587 2 years ago ⁺¹
Did you solve it?
@ScottRachelson777 1 year ago ⁺³
@주희룡 Yep, I got the same error. This is why programming can be so frustrating.
@Digital-Light 1 year ago ⁺²
The same problem :(