I found this tweet while researching an error I kept getting when trying to scrape data from Basketball Reference:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">sports research just got a bit more difficult: basketball-reference has implemented recaptcha and is enforcing a 3 second crawl delay.</p>&mdash; alex cardazzi (@ACardazzi) <a href="https://twitter.com/ACardazzi/status/1608126560870998016?ref_src=twsrc%5Etfw">December 28, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

I figured this would happen eventually. The [Sports Reference terms of service](https://www.sports-reference.com/termsofuse.html) (sec. 5i) say we shouldn't scrape from them without permission, and plenty of others and I have been skating by for years.

Per Basketball Reference's [robots.txt](https://www.basketball-reference.com/robots.txt), you should throw in `Sys.sleep(3)` calls liberally while scraping in R, so that consecutive requests are at least three seconds apart. I will incorporate that into any code I publish on here in the future.
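For anyone who wants a starting point, here is a rough sketch of what that looks like in a loop. The function name `scrape_politely` and its arguments are my own invention for illustration, and I'm assuming base R's `readLines()` is enough to grab a page; swap in whatever you normally use to read a page (for example `rvest::read_html`) through the `fetch` argument.

```r
# Hypothetical helper (not from any package): fetch each URL in turn,
# pausing `delay` seconds between requests to respect the crawl delay.
scrape_politely <- function(urls,
                            fetch = function(u) readLines(u, warn = FALSE),
                            delay = 3) {
  results <- vector("list", length(urls))
  names(results) <- urls
  for (i in seq_along(urls)) {
    # If a request fails (e.g. a 403), store NA and keep going
    # instead of aborting the whole run.
    results[[i]] <- tryCatch(fetch(urls[i]), error = function(e) NA)
    # No need to wait after the final page.
    if (i < length(urls)) Sys.sleep(delay)
  }
  results
}
```

Call it with a character vector of page addresses and you get back a named list with one element per page. Any page that failed comes back as `NA`, so you can see what to retry later.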

I'd started getting a 403 error after requesting 10-15 pages, but with the delays in place I just ran a script that pulls about 100 pages. While it was slower than I'm used to because of the delays, it never kicked me out, so I'm satisfied. As long as we can keep pulling data from there, that's all that matters to me.