Sports Reference has started to enforce a 3-second crawl limit
I found this tweet while I was researching an error I was getting while trying to scrape data from Basketball Reference:
I figured this would happen eventually. The Sports Reference terms of service (sec. 5i) say we shouldn't scrape the site without permission, and plenty of others and I have been skating by for years.
Per Basketball Reference's robots.txt, you should throw in Sys.sleep(3) commands liberally while scraping in R. I will incorporate that into any code I publish on here in the future.
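Here is a rough sketch of what that could look like in a loop. The function name scrape_politely and its arguments are made up for illustration, and it uses base R's readLines() as the default fetcher only so the example has no package dependencies; if you use rvest, you could pass read_html as the fetch argument instead. The 3-second default is the delay mentioned above.

```r
# Hypothetical helper: fetch each URL in turn, pausing between requests.
# `fetch` is any function that takes a single URL and returns the page;
# readLines() is just a dependency-free default, not a recommendation.
scrape_politely <- function(urls, fetch = readLines, delay = 3) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    # Wait before every request except the first one
    if (i > 1) Sys.sleep(delay)
    results[[i]] <- fetch(urls[i])
  }
  names(results) <- urls
  results
}

# Example usage (placeholder URLs, not run here):
# pages <- scrape_politely(c("https://example.com/page1",
#                            "https://example.com/page2"))
```

Putting the pause inside the loop means you can't forget it, and for n pages the delays alone add roughly 3 * (n - 1) seconds, so a 100-page pull takes about five minutes longer than it would without them.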
I'd started getting a 403 error message after requesting 10-15 pages, but I just ran a script that pulls about 100 pages. While it was slower than I'm used to because of the delays, it never kicked me out, so I'm satisfied. As long as we can keep pulling data from there, that's all that matters to me.