#1
Newbie
Join Date: Jan 2006
Posts: 13
Rep Power: 0
Newbie to python: trying to submit forms
I'm trying to automate some form submission on a MediaWiki site. I've never done anything like this before, and I don't know anything about internet-related programming except what I've learned trying to do this.
I am aware of the existence of the Python Wikipedia robot framework, but I can't figure out how to use it for the simple tasks (or at all). I'd prefer to understand what I'm doing anyway. From the wikipedia.py file in that framework, I found this code:

def putPage(self, text, comment = None, watchArticle = False, minorEdit = True, newPage = False, token = None, gettoken = False, sysop = False):
    """
    Upload 'text' as new contents for this Page by filling out the edit
    page.
    Don't use this directly, use put() instead.
    """
    safetuple = () # safetuple keeps the old value, but only if we did not get a token yet could
    # TODO: get rid of safetuple
    if self.site().version() >= "1.4":
        if gettoken or not token:
            token = self.site().getToken(getagain = gettoken, sysop = sysop)
        else:
            safetuple = (text, comment, watchArticle, minorEdit, newPage, sysop)
    # Check whether we are not too quickly after the previous putPage, and
    # wait a bit until the interval is acceptable
    put_throttle()
    # Which web-site host are we submitting to?
    host = self.site().hostname()
    # Get the address of the page on that host.
    address = self.site().put_address(self.urlname())
    # If no comment is given for the change, use the default
    if comment is None:
        comment=action
    # Use the proper encoding for the comment
    comment = comment.encode(self.site().encoding())
    # Encode the text into the right encoding for the wiki
    text = text.encode(self.site().encoding())
    predata = [
        ('wpSave', '1'),
        ('wpSummary', comment),
        ('wpTextbox1', text)]
    # Except if the page is new, we need to supply the time of the
    # previous version to the wiki to prevent edit collisions
    if newPage:
        predata.append(('wpEdittime', ''))
    else:
        predata.append(('wpEdittime', self._editTime))
    predata.append(('wpStarttime', self._startTime))
    # Pass the minorEdit and watchArticle arguments to the Wiki.
    if minorEdit:
        predata.append(('wpMinoredit', '1'))
    if watchArticle:
        predata.append(('wpWatchthis', '1'))
    # Give the token, but only if one is supplied.
    if token:
        predata.append(('wpEditToken', token))
    # Encode all of this into a HTTP request
    data = urlencode(tuple(predata))
    if newPage:
        output('Creating page %s' % self.aslink())
    else:
        output('Changing page %s' % self.aslink())
    # Submit the prepared information
    conn = httplib.HTTPConnection(host)
    conn.putrequest("POST", address)
    conn.putheader('Content-Length', str(len(data)))
    conn.putheader("Content-type", "application/x-www-form-urlencoded")
    conn.putheader("User-agent", "PythonWikipediaBot/1.0")
    if self.site().cookies():
        conn.putheader('Cookie', self.site().cookies(sysop = sysop))
    conn.endheaders()
    conn.send(data)

It appears to submit a page to Wikipedia, but I don't understand how, and can't do a simple similar operation myself. I think that if I can understand how to use the POST method with Python I can figure it out. The thread here seems to show how to do this, but I can't make it work. I googled and found this, which again seems to tell me exactly what to do, but I can't make it work. Here is my attempt using the interpreter:
Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> params = urllib.urlencode({'wpTextbox1': 'test1', 'wpCommment': 'This is the first test', 'wpSave':1})
>>> f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=edit", params)
>>> print f.read()
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.or
g/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859
-1">
<TITLE>ERROR: The requested URL could not be retrieved</TITLE>
<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-family:verdana,san
s-serif}PRE{font-family:sans-serif}--></STYLE>
</HEAD><BODY>
<H1>ERROR</H1>
<H2>The requested URL could not be retrieved</H2>
<HR noshade size="1px">
<P>
While trying to retrieve the URL:
<A HREF="http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=
edit">http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=edi
t</A>
<P>
The following error was encountered:
<UL>
<LI>
<STRONG>
Access Denied.
</STRONG>
<P>
Access control configuration prevents your request from
being allowed at this time. Please contact your service provider if
you feel this is incorrect.
</UL>
<P>Your cache administrator is <A HREF="mailto:wikidown@bomis.com">wikidown@bomi
s.com</A>.
<BR clear="all">
<HR noshade size="1px">
<ADDRESS>
Generated Mon, 02 Jan 2006 16:12:47 GMT by mayflower.knams.wikimedia.org (squid/2.5.STABLE12)
I expected it to replace the Sandbox content with the text "test1", with "This is the first test" in the summary box. It didn't work, and I don't know why. Should there be some reference to the name of the form ("editform")? I imagine that once this works I'll be able to log in and append code to multiple pages without having to do it manually in Firefox: say, add [[Category:users]] to each page in a list. Any and all pointers are welcome, even if you think using Python is the wrong way to go about doing this. Thanks.
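For reference, the POST mechanics asked about above can be sketched without touching the network. This is a minimal sketch in Python 3, where the Python 2 `urllib.urlencode` and `urllib.urlopen` used in the session live at `urllib.parse.urlencode` and `urllib.request.urlopen`. The field names are taken from the pywikipedia snippet, the URL is the one from the session, and `build_edit_request` is a hypothetical helper name. The request is only built, never sent, and nothing here shows that the wiki would accept it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_edit_request(url, text, summary):
    """Build (but do not send) a form-encoded POST request."""
    # Field names taken from the pywikipedia snippet quoted above.
    fields = [
        ("wpSave", "1"),
        ("wpSummary", summary),
        ("wpTextbox1", text),
    ]
    # urlencode() percent-escapes each value and joins the pairs with '&'.
    body = urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    request = Request(url, data=body)
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


req = build_edit_request(
    "http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=edit",
    "test1",
    "This is the first test",
)
print(req.get_method())
print(req.data.decode("ascii"))
```

Printing the body before sending makes it easy to check the field names by eye against the form's `name` attributes.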
#2
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
For a self-proclaimed newbie to Python, you've gone about trying to solve the problem in a very intelligent and sensible way. For the most part, you appear to be doing everything perfectly correctly.
The only thing I can spot that might be wrong with it is the URL you send the POST data to. I took a look at the source for the Sandbox edit page, and noticed this line:
<form id="editform" name="editform" method="post" action="/w/index.php?title=Wikipedia:Sandbox&action=submit" enctype="multipart/form-data">
Try changing your urllib call to:
f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=submit", params)
{{Please leave this line alone (sandbox heading)}}
urllib.urlencode({
    'wpTextbox1': "{{Please leave this line alone (sandbox heading)}}\ntest1",
    'wpCommment': 'This is the first test', 'wpSave':1})
#3
Newbie
Join Date: Jan 2006
Posts: 13
Rep Power: 0
Quote:
Quote:
I tested the script on another form here, and it worked:
Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> params = urllib.urlencode({
... 'name' : 'My name is Sir Launcelot of Camelot.',
... 'quest' : 'To seek the Holy Grail.',
... 'color' : 'Blue',
... 'swallow' : 'african',
... 'text' : 'Oh, thank you. Thank you very much.',
... 'here' : 1})
>>> f = urllib.urlopen("http://cgi-lib.berkeley.edu/ex/perl5/simple-form.cgi", params)
>>> f.read()
'<html>\n<head>\n<title>cgi-lib.pl demo form output</title>\n</head>\n<body>\n<h
1>cgi-lib.pl demo form output</h1>\n\nYou, My name is Sir Launcelot of Camelot.,
whose favorite color is Blue are on a\nquest which is To seek the Holy Grail.,
and are looking for the weight of an\nafrican swallow. And this is what you hav
e to say for\nyourself:<P> Oh, thank you. Thank you very much.<P>\n\n<HR>And her
e is a list of the variables you entered...<P>\n<dl compact>\n<dt><b>color</b>\n
<dd>:<i>Blue</i>:<br>\n<dt><b>here</b>\n <dd>:<i>1</i>:<br>\n<dt><b>name</b>\n
<dd>:<i>My name is Sir Launcelot of Camelot.</i>:<br>\n<dt><b>quest</b>\n <dd>:<
i>To seek the Holy Grail.</i>:<br>\n<dt><b>swallow</b>\n <dd>:<i>african</i>:<br
>\n<dt><b>text</b>\n <dd>:<i>Oh, thank you. Thank you very much.</i>:<br>\n</dl>
\n</body>\n</html>\n'
>>>
So I knew I was doing everything right. I went back to try again, but this time simply opening the URL, and not trying to submit any form:
>>> f = urllib.urlopen("http://en.wikipedia.org")
>>> f.read()
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.o
rg/TR/html4/loose.dtd">\n<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="te
xt/html; charset=iso-8859-1">\n<TITLE>ERROR: The requested URL could not be retr
ieved</TITLE>\n<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-fam
ily:verdana,sans-serif}PRE{font-family:sans-serif}--></STYLE>\n</HEAD><BODY>\n<H
1>ERROR</H1>\n<H2>The requested URL could not be retrieved</H2>\n<HR noshade siz
e="1px">\n<P>\nWhile trying to retrieve the URL:\n<A HREF="http://en.wikipedia.o
rg/">http://en.wikipedia.org/</A>\n<P>\nThe following error was encountered:\n<U
L>\n<LI>\n<STRONG>\nAccess Denied.\n</STRONG>\n<P>\nAccess control configuration
prevents your request from\nbeing allowed at this time. Please contact your se
rvice provider if\nyou feel this is incorrect.\n</UL>\n<P>Your cache administrat
or is <A HREF="mailto:wikidown@bomis.com">wikidown@bomis.com</A>. \n\n\n<BR clea
r="all">\n<HR noshade size="1px">\n<ADDRESS>\nGenerated Tue, 03 Jan 2006 00:43:4
0 GMT by hawthorn.knams.wikimedia.org (squid/2.5.STABLE12)\n</ADDRESS>\n</BODY><
/HTML>\n'
>>>
...followed by:
>>> import urllib2
>>> f = urllib2.urlopen("http://en.wikipedia.org")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "C:\Python22\lib\urllib2.py", line 138, in urlopen
return _opener.open(url, data)
File "C:\Python22\lib\urllib2.py", line 328, in open
'_open', req)
File "C:\Python22\lib\urllib2.py", line 307, in _call_chain
result = func(*args)
File "C:\Python22\lib\urllib2.py", line 824, in http_open
return self.do_open(httplib.HTTP, req)
File "C:\Python22\lib\urllib2.py", line 818, in do_open
return self.parent.error('http', req, fp, code, msg, hdrs)
File "C:\Python22\lib\urllib2.py", line 354, in error
return self._call_chain(*args)
File "C:\Python22\lib\urllib2.py", line 307, in _call_chain
result = func(*args)
File "C:\Python22\lib\urllib2.py", line 406, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>> f.read()
''
So I'm getting a 403, but I can't tell why. Next I tried the code on this page:
>>> params = urllib.urlencode({
... 'wpTextbox1': 'This test is the best',
... 'wpSummary': 'Comment for Clarity', 'wpSave':1})
>>> headers = {"Content-type": "application/x-www-form-urlencoded",
... "Accept": "text/plain"}
>>> conn = httplib.HTTPConnection("en.wikipedia.org:80")
>>> conn.request("POST", "/w/index.php?title=Wikipedia:Sandbox&action=submit", params, headers)
>>> response = conn.getresponse()
>>> print response.status, response.reason
200 OK
>>> data = response.read()
>>> conn.close()
>>> response
<httplib.HTTPResponse instance at 0x00B870B0>
>>> data
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/ ...
(etc. The entire, long source code of the page.)
>>> params = urllib.urlencode({
... 'wpTextbox1': 'This test is the best',
... 'wpSummary': 'Comment for Clarity'})
I have no idea what to do next. I'll have another look at the code in the pyWikipedia files later.
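One visible difference between the pywikipedia code in the first post and the attempts above is the `User-agent` header, which pywikipedia sets explicitly. Whether that is what triggers the 403 here is an assumption; nothing in the thread confirms it. A minimal Python 3 sketch of attaching such a header follows. The agent string is copied from the pywikipedia snippet, `with_user_agent` is a hypothetical helper name, and nothing is sent.

```python
from urllib.request import Request


def with_user_agent(url, agent, data=None):
    """Return a Request carrying an explicit User-Agent header."""
    request = Request(url, data=data)
    # Without this, urllib identifies itself with its own default
    # "Python-urllib/x.y" agent string when the request is opened.
    request.add_header("User-Agent", agent)
    return request


req = with_user_agent("http://en.wikipedia.org/", "PythonWikipediaBot/1.0")
print(req.get_header("User-agent"))
```

Passing the resulting object to `urllib.request.urlopen` would send the request with that header in place of the default.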
#4
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
I'll take a look at this tonight, if I have time. If it helps, here's the edit form with all non-form tags removed. It might show you something you're missing.
<form id="editform" name="editform" method="post" action="/w/index.php?title=Wikipedia:Sandbox&action=submit" enctype="multipart/form-data">
  <input type='hidden' value="" name="wpSection" />
  <input type='hidden' value="20060103120224" name="wpStarttime" />
  <input type='hidden' value="20060103120149" name="wpEdittime" />
  <input type='hidden' value="" name="wpScrolltop" id="wpScrolltop" />
  <textarea tabindex='1' accesskey="," name="wpTextbox1" id="wpTextbox1" rows='25' cols='80' >[page text]</textarea>
  <input tabindex='2' type='text' value="" name='wpSummary' id='wpSummary' maxlength='200' size='60' />
  <input tabindex='5' id='wpSave' type='submit' value="Save page" name="wpSave" accesskey="s" title="Save your changes [alt-s]"/>
  <input tabindex='6' id='wpPreview' type='submit' value="Show preview" name="wpPreview" accesskey="p" title="Preview your changes, please use this before saving! [alt-p]"/>
  <input tabindex='7' id='wpDiff' type='submit' value="Show changes" name="wpDiff" accesskey="v" title="Show which changes you made to the text. [alt-d]"/>
</form>
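The hidden fields in that form change on every page load, so a script would need to read them from a freshly fetched edit page rather than hard-code them. A minimal sketch of pulling them out with the standard library's `html.parser` (Python 3); `HiddenFieldCollector` is a hypothetical name, and the sample HTML is a shortened copy of the form above, not a live page.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<input ... />) are also routed here by default.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") == "hidden":
            name = attributes.get("name")
            if name:
                self.hidden[name] = attributes.get("value") or ""


SAMPLE_FORM = """
<form id="editform" name="editform" method="post">
<input type='hidden' value="" name="wpSection" />
<input type='hidden' value="20060103120224" name="wpStarttime" />
<input type='hidden' value="20060103120149" name="wpEdittime" />
<input tabindex='2' type='text' value="" name='wpSummary' id='wpSummary' />
</form>
"""

collector = HiddenFieldCollector()
collector.feed(SAMPLE_FORM)
print(collector.hidden)
```

The collected dictionary could then be merged with `wpTextbox1`, `wpSummary` and `wpSave` before encoding the POST body.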
#5
Newbie
Join Date: Jan 2006
Posts: 13
Rep Power: 0
30 minutes isn't a long time to be able to edit posts for, but when I wrote this:
Quote:
Thanks for that form info. I tried putting wpStarttime and wpEdittime in the params as well, but still with no luck.
>>> params = urllib.urlencode({
... 'wpTextbox1': 'the test',
... 'wpComment': 'the comment',
... 'wpSave': 1,
... 'wpStarttime': '20060103161918',
... 'wpEdittime': '20060103161523'})
#6
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
Maybe try changing wpSave from 1 to "Save page"?
#7
Professional Programmer
Join Date: Apr 2005
Location: London, England
Posts: 459
Rep Power: 4
A good tip if you use Firefox is to use the live HTTP headers extension so you can see exactly what POST data and headers are being sent to each page, so you can copy the full string and manipulate the bits as need be.
#8
Newbie
Join Date: Jan 2006
Posts: 13
Rep Power: 0
Quote:
Quote:
I might have to go back to square one on this. I have some links that I'll look at when I get the chance:
http://effbot.org/librarybook/httplib.htm (there might be a new way to approach this in there, by putting different headers in, etc.)
http://comments.gmane.org/gmane.scie...echnical/21150
Quote:
Incidentally, I tried submitting the login form with Python, but with no luck. The following code, saved in an HTML file and opened in Firefox, allows me to log in. However, similar code to submit the edit page form (which I can't seem to recreate) didn't work.
<form name="userlogin" method="post" action="http://en.wikipedia.org/w/index.php?title=Special:Userlogin&action=submitlogin&type=login">
  <table>
    <tr>
      <td align='right'><label for='wpName1'>Username:</label></td>
      <td align='left'> <input type='text' class='loginText' name="wpName" id="wpName1" value="Steveire" size='20' /> </td>
    </tr>
    <tr>
      <td align='right'><label for='wpPassword1'>Password:</label></td>
      <td align='left'> <input type='password' class='loginPassword' name="wpPassword" id="wpPassword1" value="" size='20' /> </td>
    </tr>
    <tr>
      <td></td>
      <td align='left' style="white-space:nowrap"> <input type='submit' name="wpLoginattempt" id="wpLoginattempt" value="Log in" /> <input type='submit' name="wpMailmypassword" id="wpMailmypassword" value="E-mail new password" /> </td>
    </tr>
  </table>
</form>
Once again, I'm stumped. Have you ever tried to do this in Python or any other language? I never imagined a few batch operations would prove so difficult once the internet is involved... :/
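The login form above could be mirrored in Python roughly as follows. This is only a sketch under assumptions: it is written for Python 3, the field names and URL are copied from the HTML form, the credentials are placeholders, `build_login` is a hypothetical helper, and the thread does not show whether the server accepts a login sent this way. The cookie jar is included because a login is only useful if the session cookies are replayed on later requests.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

LOGIN_URL = ("http://en.wikipedia.org/w/index.php"
             "?title=Special:Userlogin&action=submitlogin&type=login")


def build_login(username, password):
    """Return (opener, request) for the login form; nothing is sent."""
    jar = CookieJar()
    # The opener stores cookies from responses and replays them on later
    # requests made through the same opener.
    opener = build_opener(HTTPCookieProcessor(jar))
    body = urlencode([
        ("wpName", username),
        ("wpPassword", password),
        ("wpLoginattempt", "Log in"),  # value of the submit button
    ]).encode("ascii")
    return opener, Request(LOGIN_URL, data=body)


# Placeholder credentials, for illustration only.
opener, request = build_login("ExampleUser", "example-password")
print(request.data.decode("ascii"))
# Sending would be: opener.open(request)
```

Any later edit request would have to go through the same `opener` for the cookies to be included.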
#9
Professional Programmer
Join Date: Apr 2005
Location: London, England
Posts: 459
Rep Power: 4
I regularly use Python to automate form submissions. I'm sorry to say I've never run into major problems like this. Just compare what you're sending, make sure the server isn't doing anything funky because of your user agent (I can't imagine Wikipedia browser sniffing though, to be honest), and it normally Just Works.
Live HTTP Headers doesn't show you anything you couldn't figure out from the page source, provided you can be bothered to trace it through and you understand what headers the browser sends to the server. But that is much more time-consuming than just submitting the page and looking at the headers and the GET or POST data that were sent.
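The "compare what you're sending" step can be done mechanically once both bodies are available as strings. A minimal Python 3 sketch, assuming both are form-encoded; `diff_form_bodies` is a hypothetical helper, and the browser body below is made up from the field names in the edit form earlier in the thread, not captured from a real browser.

```python
from urllib.parse import parse_qsl


def diff_form_bodies(browser_body, script_body):
    """Compare two form-encoded bodies field by field."""
    # keep_blank_values keeps empty fields such as wpSection= in the result.
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    script = dict(parse_qsl(script_body, keep_blank_values=True))
    return {
        "missing_from_script": sorted(set(browser) - set(script)),
        "extra_in_script": sorted(set(script) - set(browser)),
        "different_values": sorted(
            name for name in set(browser) & set(script)
            if browser[name] != script[name]
        ),
    }


# Illustrative browser body (an assumption, not a real capture).
browser_body = ("wpSection=&wpStarttime=20060103120224&wpEdittime=20060103120149"
                "&wpTextbox1=test1&wpSummary=first+test&wpSave=Save+page")
# Body equivalent to the first attempt in the thread.
script_body = "wpTextbox1=test1&wpCommment=This+is+the+first+test&wpSave=1"

report = diff_form_bodies(browser_body, script_body)
for key, names in report.items():
    print(key, names)
```

A misspelled field name shows up as one entry under `extra_in_script` and a matching gap under `missing_from_script`.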
#10
Programming Guru
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 5
You could also create a quick Python program that echoes all TCP data it receives to STDOUT. Point your form submission program at localhost and capture its output. Then open the Wikipedia edit page in your browser, edit your hosts file to alias en.wikipedia.org to 127.0.0.1, and press the save button. The browser's request should be caught by your TCP logger. You can then compare the HTTP request from the browser with the HTTP request you're sending from Python.
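A TCP logger of that kind might look like the following. It is a minimal Python 3 sketch, not a full proxy: it accepts a single connection, never replies, and stops reading when the client closes or goes quiet. `make_listener` and `log_one_connection` are hypothetical names. Catching a browser request for an aliased en.wikipedia.org would mean listening on port 80, which usually needs administrator rights.

```python
import socket
import sys


def make_listener(port=0):
    """Listen on 127.0.0.1; port 0 lets the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", port))
    listener.listen(1)
    return listener


def log_one_connection(listener, out=sys.stdout, timeout=2.0):
    """Accept one connection, echo what it sends to out, return the bytes."""
    conn, _addr = listener.accept()
    chunks = []
    with conn:
        conn.settimeout(timeout)
        try:
            while True:
                chunk = conn.recv(4096)
                if not chunk:  # client closed the connection
                    break
                chunks.append(chunk)
        except socket.timeout:
            pass  # client went quiet; treat the request as complete
    data = b"".join(chunks)
    # latin-1 maps every byte to a character, so decoding cannot fail.
    out.write(data.decode("latin-1"))
    out.flush()
    return data


# Usage (blocks until a client connects):
#   listener = make_listener(8080)
#   log_one_connection(listener)
```

Running it once for the browser and once for the script gives two raw requests that can be compared line by line.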