scheduling a cron job in python?

atreides

Graduate Student
Joined
7/4/08
Messages
420
Points
38
I was wondering if someone here has experience doing cron jobs in Python. I have a script that basically scrapes data from a website and writes the data to a file.

The script should write about 60K records to file if everything goes smoothly, but this hasn't happened yet. After about 1,000 records have been written, the script either hangs or raises a URL error saying 'connection has been reset by peer'.
EDIT: It seems my connection times out.
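One common way to soften connection resets like this is to retry the request with a growing delay instead of failing on the first error. Here's a minimal sketch; the function name and parameters are my own, and `fetch` stands in for whatever call actually performs the request (e.g. `urllib2.urlopen` in the setup described above):

```python
import random
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=2.0):
    """Call fetch(url), retrying on connection errors with exponential
    backoff plus random jitter so repeated runs don't hammer the server
    in lockstep."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except (IOError, OSError):
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the error
            # wait 2s, 4s, 8s, ... scaled by up to 2x random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

This doesn't fix the underlying timeout, but it turns a one-off 'connection reset by peer' into a recoverable hiccup.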

I'm currently using time.sleep to make the script pause for a few seconds (the exact number is chosen by a random function) after writing every 100 records, but even with this it eventually stalls. I think what I need to figure out is a way to exit the script completely after writing, say, 100 records to file, and re-enter the script a few seconds later at the exact point where it left off.
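The exit-and-resume idea above is essentially a checkpoint: save how far you got to a small state file, exit, and let cron re-run the script until everything is done. A rough sketch, with the file name, function names, and batch size all illustrative:

```python
import json
import os

STATE_FILE = "scrape_state.json"  # hypothetical path for saved progress

def load_state():
    """Return the index of the next record to process, or 0 on first run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["next_record"]
    return 0

def save_state(next_record):
    """Persist progress so the next cron run can resume from here."""
    with open(STATE_FILE, "w") as f:
        json.dump({"next_record": next_record}, f)

def run_batch(records, batch_size=100):
    """Process up to batch_size records starting from the saved position,
    save progress, and return how many were handled this run."""
    start = load_state()
    batch = records[start:start + batch_size]
    for record in batch:
        pass  # write the record to the output file here (placeholder)
    save_state(start + len(batch))
    return len(batch)
```

Each cron invocation then does one batch and exits cleanly, so a hang or reset only ever costs you one batch of work.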

The problem probably isn't the page source, because I check that the page HTML is well formed before I even attempt to scrape it.

Would appreciate any thoughts

I'm on a mac/linux env
 
Do you have a single Python script? You could put #!/usr/bin/python at the top and call the script from cron. Also, you need to figure out why the script hangs. It might be easier to get the data from the URL using wget or curl and then scrape the result with your Python script locally.
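Both suggestions can be sketched as shell commands; the paths, schedule, and URL here are all illustrative:

```shell
# make the script executable (its first line is #!/usr/bin/python)
chmod +x /home/me/scraper.py

# edit the crontab with `crontab -e` and add a line like this one,
# which re-runs the script every 10 minutes and logs its output:
#   */10 * * * * /home/me/scraper.py >> /home/me/scraper.log 2>&1

# alternative: fetch the pages first with wget (which has retries and
# a timeout built in), then scrape the local copy with the script
wget --recursive --level=5 --tries=3 --timeout=30 \
     --directory-prefix=site_copy http://example.com/start-page
```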
 

Yes, it's a single Python script. I'm following links from the main page and going five levels deep to get the data. I'm currently using urllib2 and BeautifulSoup to get the links and eventually reach the data tags, five levels down. The URL links are actually dynamic, so I don't know exactly what they are before I start scraping. Will look into wget/curl, thanks.
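One thing worth noting about that setup: `urllib2.urlopen` with no timeout can block indefinitely on a dead connection, which matches the "script hangs" symptom. Setting a default socket timeout makes the call raise an error instead, so a retry or resume strategy can kick in. The 30-second value below is just an example:

```python
import socket

# Every socket created after this call (including the ones urllib2
# opens under the hood) will raise socket.timeout after 30 seconds
# of inactivity instead of hanging forever.
socket.setdefaulttimeout(30)
```

(In Python 3 the same module lives at `urllib.request`, and `urlopen` also accepts a per-call `timeout=` argument.)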
 