So, for those in the class, you might remember that I said I bombarded the congress.gov website with requests while trying to get the URLs for all the XML files, and that their server kicked me off. Well, I’m now redoing my project and getting all the data since the 103rd Congress. This time, I didn’t want to make their server mad, so I added a pause to my download.
Python’s time library has a function called sleep, and with time.sleep you can have your program hold for a given number of seconds. In my new download script, I wrote it so that a request would only be made to their server once every 7.5 seconds. With 91043 requests to be made, that would take almost 8 days. I thought, okay, I can wait. I’ll just use my hosting service to do this, and in about 8 days, I’ll go get all those files.
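Here is a minimal sketch of what a paused download loop like that could look like. It is not my actual script: fetch_all, fetch and the sleep parameter are names made up for illustration, and the only real library call is the standard library's time.sleep.

```python
import time

def fetch_all(urls, fetch, delay_seconds=7.5, sleep=time.sleep):
    """Call fetch(url) for every URL, waiting delay_seconds between requests.

    fetch is a hypothetical callable that downloads one URL; sleep is
    injectable only so the loop can be tried out without really waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

Passing the pause in as a parameter means the delay can be changed later without touching the loop itself.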
Well, as my hosting was downloading everything, I came across a video with a lot of advice about web scraping, and it turns out that many large websites have a file in their root called robots.txt. The congress.gov website has one, for example, at http://congress.gov/robots.txt . Robots.txt files tell spiders, crawlers and other web scraping programs what they are allowed to do, literally listing which directories are allowed and disallowed. http://en.wikipedia.org/robots.txt has a pretty extensive one. Some websites also include a Crawl-delay directive, which says how many seconds you should wait between requests.
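If you would rather not read the file by hand, Python's standard library can parse it with urllib.robotparser. This is just a sketch: the robots.txt text below is a made-up sample, not the real congress.gov file, and example.com is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# Made-up sample for illustration; NOT the actual congress.gov robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Seconds to wait between requests, or None if the file has no Crawl-delay.
delay = parser.crawl_delay("*")

# Whether a given user agent is allowed to fetch a given URL.
public_ok = parser.can_fetch("*", "http://example.com/public/page")
private_ok = parser.can_fetch("*", "http://example.com/private/page")

print(delay, public_ok, private_ok)
```

For a live site, the same parser can load the file itself with set_url() followed by read() instead of parse().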
Anyways, after reading the congress.gov robots.txt, it turns out that I only have to wait 2 seconds between requests. So, I quickly changed my code to use time.sleep at 2.15 seconds… because I’m a nice guy. Now I have a little over 2 days to finish requesting all these webpages.
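For anyone who wants to check the waiting times, here is the arithmetic as a quick sketch. estimated_days is just a helper name I made up, and it only counts the pauses, not the time each request itself takes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def estimated_days(num_requests, delay_seconds):
    """Rough total wait in days if every request costs one pause."""
    return num_requests * delay_seconds / SECONDS_PER_DAY

slow = estimated_days(91043, 7.5)   # the original 7.5 second pause
fast = estimated_days(91043, 2.15)  # the 2.15 second pause

print(f"{slow:.2f} days at 7.5 s, {fast:.2f} days at 2.15 s")
```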