When writing a Python crawler, how to guarantee consistency between two requests with urllib.reque

I am writing a Douban crawler and ran into a problem with the verification code (CAPTCHA) when logging in.

The login page generates a new verification code image URL on every request. I call urllib.request.urlopen().read() twice: the first time to fetch the verification code image so it can be recognized, and the second time to submit the form. But by the time of the second request, the verification code has already changed.
How can I solve this? How do I submit the form against the page I already opened with urlopen()?

Here is the code:
import http.cookiejar
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

# Use a CookieJar so that cookies persist across requests
cj = http.cookiejar.LWPCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
login_path = 'http://www.douban.com/accounts/login'
urllib.request.install_opener(opener)
# Fetch the login page and download the verification code image
log_page = urllib.request.urlopen(login_path)
soup_log = BeautifulSoup(log_page, 'html.parser')
im = soup_log.find(id='captcha_image')
im_url = im['src']
print(im_url)
im_data = urllib.request.urlopen(im_url).read()
with open('image.jpeg', 'wb') as f:
    f.write(im_data)
# The script pauses here until the code has been read manually and written to captcha.txt
input()
data = {"form_email": "xxxxxxx", "form_password": "xxxxxxx"}
with open('captcha.txt', 'r') as f:
    captcha = f.readline().strip()  # drop the trailing newline
# Submit the form
data['captcha-solution'] = captcha
post_data = urllib.parse.urlencode(data)
post_data = post_data.encode('utf-8')
html = urllib.request.urlopen(login_path, post_data)

Started by Beacher at November 20, 2016 - 12:11 PM

You need three urlopen calls:
The first fetches the HTML (to get the verification code link, because it is dynamically generated)
The second fetches the verification code image from that link
The third submits the form

Look at this fragment of their HTML source:
<img id="captcha_image" src="http://www.douban.com/misc/captcha?id=BDkxjj2eexLJRWnPJUO4LbaL&amp;size=s" alt="captcha" class="captcha_image"/>
<div class="captcha_block">
    <span id="captcha_block"  class="pl">Please enter the above words</span>
    <input type="text" id="captcha_field" name="captcha-solution" tabindex="3"/>
    <input type="hidden" name="captcha-id" value="BDkxjj2eexLJRWnPJUO4LbaL"/>
</div>

The captcha-id should correspond to the id inside the image URL.
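
To check that correspondence from a script, here is a minimal sketch using only the standard library. It parses the HTML fragment quoted above; the class name CaptchaParser and the variable names are made up for illustration, and the real login page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

# The fragment quoted above, trimmed to the two tags that matter
SAMPLE_HTML = '''
<img id="captcha_image" src="http://www.douban.com/misc/captcha?id=BDkxjj2eexLJRWnPJUO4LbaL&amp;size=s" alt="captcha" class="captcha_image"/>
<input type="hidden" name="captcha-id" value="BDkxjj2eexLJRWnPJUO4LbaL"/>
'''

class CaptchaParser(HTMLParser):
    """Hypothetical helper: collects the image URL and the hidden captcha-id."""

    def __init__(self):
        super().__init__()
        self.image_url = None
        self.captcha_id = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == "captcha_image":
            self.image_url = attrs.get("src")
        elif tag == "input" and attrs.get("name") == "captcha-id":
            self.captcha_id = attrs.get("value")

parser = CaptchaParser()
parser.feed(SAMPLE_HTML)

# Pull the id parameter out of the image URL's query string
url_id = parse_qs(urlparse(parser.image_url).query)["id"][0]
print(parser.captcha_id, url_id == parser.captcha_id)
```

If the two values match, the captcha-id read from the first response identifies the image that was downloaded, so it can be kept and sent along with the solution later.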

Never mind, I just tested it. I opened the login page and deliberately failed the login a few times until the verification code appeared. Then I opened the login page again in another tab and refreshed it repeatedly, which produced a new verification code (each captcha-id corresponds to one URL). But when I went back to the old page and opened its verification code link in a separate tab, the image was still available.

Then I entered the old code, and the login succeeded right away.

Posted by Burton at December 04, 2016 - 1:04 PM

Thank you. Yes, the old page still works in the browser, but how does a Python script tie the submission to a particular page? How do I submit the form for the page that the first urlopen opened?

Posted by Beacher at December 11, 2016 - 1:48 PM

Thank you very much. It works now: posting the form with the captcha-id included did it!
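
For anyone who finds this later, a rough sketch of what the form data looks like with the ID included. The field names come from the HTML and code quoted in this thread; build_login_form is a made-up helper name, and the values are placeholders.

```python
from urllib.parse import urlencode, parse_qs

def build_login_form(email, password, captcha_solution, captcha_id):
    """Hypothetical helper: return the URL-encoded POST body as bytes."""
    fields = {
        "form_email": email,
        "form_password": password,
        # strip() removes the newline left over from reading captcha.txt
        "captcha-solution": captcha_solution.strip(),
        # The hidden field from the same page the image came from
        "captcha-id": captcha_id,
    }
    return urlencode(fields).encode("utf-8")

post_data = build_login_form("xxxxxxx", "xxxxxxx", "words\n",
                             "BDkxjj2eexLJRWnPJUO4LbaL")

# Decode again only to show what will be sent
decoded = parse_qs(post_data.decode("utf-8"))
print(decoded["captcha-id"], decoded["captcha-solution"])
```

The resulting post_data is what gets passed as the second argument to urllib.request.urlopen(login_path, post_data), using the same opener that fetched the login page.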

Posted by Beacher at December 22, 2016 - 2:38 PM