Programming Forums
Old Jan 2nd, 2006, 10:39 AM   #1
Steveire
Newbie
 
Join Date: Jan 2006
Posts: 13
Rep Power: 0 Steveire is on a distinguished road
Newbie to python: trying to submit forms

I'm trying to automate some form submission on a MediaWiki site. I've never done anything like this before, and I don't know anything about internet-related programming except what I've learned while trying to do this.

I am aware of the existence of the Python Wikipedia robot framework, but I can't figure out how to use it, even for simple tasks. I'd prefer to understand what I'm doing anyway.

From the wikipedia.py file in that framework, I found this code:
def putPage(self, text, comment = None, watchArticle = False, minorEdit = True, newPage = False, token = None, gettoken = False, sysop = False):
    """
    Upload 'text' as new contents for this Page by filling out the edit
    page.

    Don't use this directly, use put() instead.
    """
    safetuple = () # safetuple keeps the old value, but only if we did not get a token yet could
    # TODO: get rid of safetuple
    if self.site().version() >= "1.4":
        if gettoken or not token:
            token = self.site().getToken(getagain = gettoken, sysop = sysop)
        else:
            safetuple = (text, comment, watchArticle, minorEdit, newPage, sysop)
    # Check whether we are not too quickly after the previous putPage, and
    # wait a bit until the interval is acceptable
    put_throttle()
    # Which web-site host are we submitting to?
    host = self.site().hostname()
    # Get the address of the page on that host.
    address = self.site().put_address(self.urlname())
    # If no comment is given for the change, use the default
    if comment is None:
        comment=action
    # Use the proper encoding for the comment
    comment = comment.encode(self.site().encoding())
    # Encode the text into the right encoding for the wiki
    text = text.encode(self.site().encoding())
    predata = [
        ('wpSave', '1'),
        ('wpSummary', comment),
        ('wpTextbox1', text)]
    # Except if the page is new, we need to supply the time of the
    # previous version to the wiki to prevent edit collisions
    if newPage:
        predata.append(('wpEdittime', ''))
    else:
        predata.append(('wpEdittime', self._editTime))
    predata.append(('wpStarttime', self._startTime))
    # Pass the minorEdit and watchArticle arguments to the Wiki.
    if minorEdit:
        predata.append(('wpMinoredit', '1'))
    if watchArticle:
        predata.append(('wpWatchthis', '1'))
    # Give the token, but only if one is supplied.
    if token:
        predata.append(('wpEditToken', token))
    # Encode all of this into a HTTP request
    data = urlencode(tuple(predata))
    if newPage:
        output('Creating page %s' % self.aslink())
    else:
        output('Changing page %s' % self.aslink())
    # Submit the prepared information
    conn = httplib.HTTPConnection(host)
    conn.putrequest("POST", address)
    conn.putheader('Content-Length', str(len(data)))
    conn.putheader("Content-type", "application/x-www-form-urlencoded")
    conn.putheader("User-agent", "PythonWikipediaBot/1.0")
    if self.site().cookies():
        conn.putheader('Cookie', self.site().cookies(sysop = sysop))
    conn.endheaders()
    conn.send(data)


It appears to submit a page to wikipedia, but I don't understand how, and can't do a simple similar operation myself.

I think that if I can understand how to use the POST method with python I can figure it out. The thread here seems to show how to do this, but I can't make it work. I googled and found this, which, again seems to tell me exactly what to do, but I can't make it work. Here is my attempt using the interpreter:
Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> params = urllib.urlencode({'wpTextbox1': 'test1', 'wpCommment': 'This is the first test', 'wpSave':1})
>>> f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=edit", params)
>>> print f.read()
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The requested URL could not be retrieved</TITLE>
<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-family:verdana,sans-serif}PRE{font-family:sans-serif}--></STYLE>
</HEAD><BODY>
<H1>ERROR</H1>
<H2>The requested URL could not be retrieved</H2>
<HR noshade size="1px">
<P>
While trying to retrieve the URL:
<A HREF="http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&amp;action=edit">http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&amp;action=edit</A>
<P>
The following error was encountered:
<UL>
<LI>
<STRONG>
Access Denied.
</STRONG>
<P>
Access control configuration prevents your request from
being allowed at this time.  Please contact your service provider if
you feel this is incorrect.
</UL>
<P>Your cache administrator is <A HREF="mailto:wikidown@bomis.com">wikidown@bomis.com</A>.


<BR clear="all">
<HR noshade size="1px">
<ADDRESS>
Generated Mon, 02 Jan 2006 16:12:47 GMT by mayflower.knams.wikimedia.org (squid/2.5.STABLE12)

I expected it to replace the SandBox content with the text "test1", with "This is the first test" in the summary box. So, it didn't work, but I don't know why. Should there be some reference to the name of the form ("editform")?
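
On the form-name question: in an ordinary HTML form submission the form's name is not sent at all; only the name=value pairs of its fields go into the POST body. Here is a minimal sketch of what that body looks like. It is written for Python 3's urllib.parse rather than the Python 2.2 urllib used in the session above (that switch is an assumption made so the example runs today), and build_post_body is a hypothetical helper name. The field names are the ones that appear in this thread.

```python
from urllib.parse import urlencode


def build_post_body(fields):
    """Encode (name, value) pairs as an application/x-www-form-urlencoded body."""
    # A list of pairs keeps the field order stable; values are converted with str().
    return urlencode(fields).encode("ascii")


body = build_post_body([
    ("wpTextbox1", "test1"),
    ("wpSummary", "This is the first test"),
    ("wpSave", 1),
])
print(body)
```

Printing the body before sending anything makes it easy to check field names and values by eye.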

I imagine if I do this I'll be able to log in, and append code to multiple pages without having to do it manually in Firefox. Say, add [[Category:users]] to each page in a list.
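
The text-editing half of that batch job needs no networking at all, so it can be sketched on its own. This is only an illustration under assumptions: append_category is a hypothetical helper, and the rule of skipping pages that already carry the tag is my own choice, not something the wiki requires.

```python
def append_category(page_text, category="users"):
    """Return page_text with [[Category:<category>]] appended, unless already present."""
    tag = "[[Category:%s]]" % category
    if tag in page_text:
        # Nothing to do; this avoids duplicate tags when the script is run twice.
        return page_text
    # Put the tag on its own line at the end of the page.
    return page_text.rstrip("\n") + "\n" + tag + "\n"


print(append_category("Some user page text."))
```

The returned string would then be what goes into the wpTextbox1 field for each page in the list.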

Any and all pointers are welcome, even if you think using Python is the wrong way to go about doing this.

Thanks.
Old Jan 2nd, 2006, 2:18 PM   #2
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
For a self-proclaimed newbie to Python, you've gone about trying to solve the problem in a very intelligent and sensible way. For the most part, you appear to have done everything correctly.

The only thing I can spot that might be wrong with it is the URL you send the POST data to. I took a look at the source for the Sandbox edit page, and noticed this line:
<form id="editform" name="editform" method="post" action="/w/index.php?title=Wikipedia:Sandbox&amp;action=submit" enctype="multipart/form-data">
The edit form appears to submit with "action=submit", whilst the URL you pass to urllib has "action=edit". At a guess, I'd say that the 'edit' action displays the edit page, whilst the 'submit' action is designed to handle data submitted from forms.

Try changing your urllib call to:
f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=submit", params)
And see if that makes any difference. Also, the default edit page for the sandbox contains the line:
{{Please leave this line alone (sandbox heading)}}
So it might be an idea to prepend this line to your wpTextbox1 value:
urllib.urlencode({
    'wpTextbox1': "{{Please leave this line alone (sandbox heading)}}\ntest1", 
    'wpCommment': 'This is the first test', 'wpSave':1})
Just so that you don't mess anything up for Wikipedia.
Old Jan 2nd, 2006, 8:35 PM   #3
Steveire
Newbie
 
Join Date: Jan 2006
Posts: 13
Rep Power: 0 Steveire is on a distinguished road
Quote:
Originally Posted by Arevos
For a self-proclaimed newbie to Python, you've gone about trying to solve the problem in a very intelligent and sensible way.
Cheers. Having a problem to solve is good motivation for learning a bit about how the internet works.

Quote:
Originally Posted by Arevos
The edit form appears to submit with "action=submit", whilst the URL you pass to urllib has "action=edit".
Yep, I noticed that too, but forgot to change it when I made my post, because when I tried it before, I got an identical message. I did it again to make sure, and it still says "Access Denied".

I tested the script on another form here. And it worked:
Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> params = urllib.urlencode({
...     'name' : 'My name is Sir Launcelot of Camelot.',
...     'quest' : 'To seek the Holy Grail.',
...     'color' : 'Blue',
...     'swallow' : 'african',
...     'text' : 'Oh, thank you. Thank you very much.',
...     'here' : 1})
>>> f = urllib.urlopen("http://cgi-lib.berkeley.edu/ex/perl5/simple-form.cgi", params)
>>> f.read()
'<html>\n<head>\n<title>cgi-lib.pl demo form output</title>\n</head>\n<body>\n<h
1>cgi-lib.pl demo form output</h1>\n\nYou, My name is Sir Launcelot of Camelot.,
 whose favorite color is Blue are on a\nquest which is To seek the Holy Grail.,
and are looking for the weight of an\nafrican swallow.  And this is what you hav
e to say for\nyourself:<P> Oh, thank you. Thank you very much.<P>\n\n<HR>And her
e is a list of the variables you entered...<P>\n<dl compact>\n<dt><b>color</b>\n
 <dd>:<i>Blue</i>:<br>\n<dt><b>here</b>\n <dd>:<i>1</i>:<br>\n<dt><b>name</b>\n
<dd>:<i>My name is Sir Launcelot of Camelot.</i>:<br>\n<dt><b>quest</b>\n <dd>:<
i>To seek the Holy Grail.</i>:<br>\n<dt><b>swallow</b>\n <dd>:<i>african</i>:<br
>\n<dt><b>text</b>\n <dd>:<i>Oh, thank you. Thank you very much.</i>:<br>\n</dl>
\n</body>\n</html>\n'
>>>

So I knew I was doing everything right. I went back to try again, but this time simply opening the URL, and not trying to submit any form:
>>> f = urllib.urlopen("http://en.wikipedia.org")
>>> f.read()
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.o
rg/TR/html4/loose.dtd">\n<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="te
xt/html; charset=iso-8859-1">\n<TITLE>ERROR: The requested URL could not be retr
ieved</TITLE>\n<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-fam
ily:verdana,sans-serif}PRE{font-family:sans-serif}--></STYLE>\n</HEAD><BODY>\n<H
1>ERROR</H1>\n<H2>The requested URL could not be retrieved</H2>\n<HR noshade siz
e="1px">\n<P>\nWhile trying to retrieve the URL:\n<A HREF="http://en.wikipedia.o
rg/">http://en.wikipedia.org/</A>\n<P>\nThe following error was encountered:\n<U
L>\n<LI>\n<STRONG>\nAccess Denied.\n</STRONG>\n<P>\nAccess control configuration
 prevents your request from\nbeing allowed at this time.  Please contact your se
rvice provider if\nyou feel this is incorrect.\n</UL>\n<P>Your cache administrat
or is <A HREF="mailto:wikidown@bomis.com">wikidown@bomis.com</A>. \n\n\n<BR clea
r="all">\n<HR noshade size="1px">\n<ADDRESS>\nGenerated Tue, 03 Jan 2006 00:43:4
0 GMT by hawthorn.knams.wikimedia.org (squid/2.5.STABLE12)\n</ADDRESS>\n</BODY><
/HTML>\n'
>>>

...followed by:

import urllib2
>>> f = urllib2.urlopen("http://en.wikipedia.org")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python22\lib\urllib2.py", line 138, in urlopen
    return _opener.open(url, data)
  File "C:\Python22\lib\urllib2.py", line 328, in open
    '_open', req)
  File "C:\Python22\lib\urllib2.py", line 307, in _call_chain
    result = func(*args)
  File "C:\Python22\lib\urllib2.py", line 824, in http_open
    return self.do_open(httplib.HTTP, req)
  File "C:\Python22\lib\urllib2.py", line 818, in do_open
    return self.parent.error('http', req, fp, code, msg, hdrs)
  File "C:\Python22\lib\urllib2.py", line 354, in error
    return self._call_chain(*args)
  File "C:\Python22\lib\urllib2.py", line 307, in _call_chain
    result = func(*args)
  File "C:\Python22\lib\urllib2.py", line 406, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>> f.read()
''

So I'm getting a 403, but I can't tell why.
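
One visible difference between these calls and the putPage code quoted in the first post is that putPage sends its own User-agent header ("PythonWikipediaBot/1.0"), while urllib and urllib2 send a default one. Whether that is what triggers the 403 is only a guess, but it is cheap to test. Below is a minimal sketch using Python 3's urllib.request (an assumption; the sessions above are Python 2.2). make_edit_request and the "MyTestScript/0.1" string are hypothetical names chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def make_edit_request(url, fields, user_agent="MyTestScript/0.1"):
    """Build (but do not send) a POST request with an explicit User-agent header."""
    data = urlencode(fields).encode("ascii")
    headers = {
        "User-Agent": user_agent,  # hypothetical value; pick something descriptive
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # Supplying data makes urllib treat the request as a POST.
    return Request(url, data=data, headers=headers)


req = make_edit_request(
    "http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=submit",
    [("wpTextbox1", "test1"), ("wpSummary", "This is the first test"), ("wpSave", "1")],
)
print(req.get_method(), req.header_items())
```

Nothing is sent until the request is passed to urllib.request.urlopen(req), so the headers can be inspected first.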
Next I tried the code on this page:
>>> params = urllib.urlencode({
...     'wpTextbox1': 'This test is the best',
...     'wpSummary': 'Comment for Clarity', 'wpSave':1})
>>> headers = {"Content-type": "application/x-www-form-urlencoded",
...     "Accept": "text/plain"}
>>> conn = httplib.HTTPConnection("en.wikipedia.org:80")
>>> conn.request("POST", "/w/index.php?title=Wikipedia:Sandbox&action=submit", params, headers)
>>> response = conn.getresponse()
>>> print response.status, response.reason
200 OK
>>> data = response.read()
>>> conn.close()
>>> response
<httplib.HTTPResponse instance at 0x00B870B0>
>>> data
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.o
rg/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/ ...(etc. The entire and long source code of the page)
The contents of 'data' is the source code of the page, and it shows the result as if I had pressed the preview button, not the save button. I repeated the above, but this time without specifying anything for 'wpSave', i.e.,
>>> params = urllib.urlencode({
...     'wpTextbox1': 'This test is the best',
...     'wpSummary': 'Comment for Clarity'})
and the same preview page resulted. There is clearly something wrong with how I am attempting to submit this form. Everything else seems ok. I think when I wrote my own HTML form to submit the information, I got only the preview in that case also. I can't remember right now how to do that though.

I have no idea what to do next. I'll have another look at the code in the pyWikipedia files later.
Old Jan 3rd, 2006, 6:11 AM   #4
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
I'll take a look at this tonight, if I have time. If it helps, here's the edit form with all non-form tags removed. It might show you something you're missing.
<form id="editform" name="editform" method="post" action="/w/index.php?title=Wikipedia:Sandbox&amp;action=submit" enctype="multipart/form-data">
<input type='hidden' value="" name="wpSection" />
<input type='hidden' value="20060103120224" name="wpStarttime" />
<input type='hidden' value="20060103120149" name="wpEdittime" />
<input type='hidden' value="" name="wpScrolltop" id="wpScrolltop" />
<textarea tabindex='1' accesskey="," name="wpTextbox1" id="wpTextbox1" rows='25' cols='80' >[page text]</textarea>
<input tabindex='2' type='text' value="" name='wpSummary' id='wpSummary' maxlength='200' size='60' />
<input tabindex='5' id='wpSave' type='submit' value="Save page" name="wpSave" accesskey="s" title="Save your changes [alt-s]"/>
<input tabindex='6' id='wpPreview' type='submit'  value="Show preview" name="wpPreview" accesskey="p" title="Preview your changes, please use this before saving! [alt-p]"/>
<input tabindex='7' id='wpDiff' type='submit' value="Show changes" name="wpDiff" accesskey="v" title="Show which changes you made to the text. [alt-d]"/>
</form>
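
A listing like the one above can also be pulled out of the page programmatically, which avoids retyping the hidden timestamps by hand. The following is a rough sketch using Python 3's html.parser (an assumption; the thread itself uses Python 2.2). FormFieldParser and SAMPLE are hypothetical names, and SAMPLE is a shortened copy of the form above, not a live page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormFieldParser(HTMLParser):
    """Collect the form's action URL and the names/values of its fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []  # list of (name, value, type)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute values arrive with &amp; already unescaped
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            self.fields.append(
                (attrs["name"], attrs.get("value") or "", attrs.get("type", "text")))
        elif tag == "textarea" and "name" in attrs:
            self.fields.append((attrs["name"], "", "textarea"))


SAMPLE = (
    '<form id="editform" method="post" '
    'action="/w/index.php?title=Wikipedia:Sandbox&amp;action=submit">'
    '<input type="hidden" value="20060103120224" name="wpStarttime" />'
    '<input type="hidden" value="20060103120149" name="wpEdittime" />'
    '<textarea name="wpTextbox1">[page text]</textarea>'
    '<input type="submit" value="Save page" name="wpSave" />'
    '</form>'
)

parser = FormFieldParser()
parser.feed(SAMPLE)
page_url = "http://en.wikipedia.org/w/index.php?title=Wikipedia:Sandbox&action=edit"
submit_url = urljoin(page_url, parser.action)  # resolve the relative action
print(submit_url)
print(parser.fields)
```

Resolving the action against the page URL gives the address the POST should go to, and the collected fields show which names the server expects.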
Old Jan 3rd, 2006, 10:44 AM   #5
Steveire
Newbie
 
Join Date: Jan 2006
Posts: 13
Rep Power: 0 Steveire is on a distinguished road
30 minutes isn't a long time to be able to edit posts for, but when I wrote this:
Quote:
Originally Posted by Steveire
So I'm getting a 403, but I can't tell why.
Next I tried the code on this page:
I made a link to the wrong location. The link should point to http://docs.python.org/lib/httplib-examples.html.

Thanks for that form info. I tried putting wpStarttime and wpEdittime in the params as well, but still with no luck.
>>> params = urllib.urlencode({
...     'wpTextbox1': 'the test',
...     'wpComment': 'the comment',
...     'wpSave': 1,
...     'wpStarttime': '20060103161918',
...     'wpEdittime': '20060103161523'})
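
Comparing these parameters with the putPage code quoted in the first post shows two differences: putPage calls the summary field 'wpSummary' (as the form listing does), and it adds 'wpEditToken' whenever it has a token. Whether either of these is what turns a save into a preview is an assumption, not something confirmed in this thread. A minimal sketch that assembles the same fields in the same order as putPage, in Python 3, with build_edit_fields as a hypothetical helper name:

```python
from urllib.parse import urlencode


def build_edit_fields(text, summary, edit_time, start_time,
                      token=None, minor=False, watch=False):
    """Assemble the field list in the same order as the quoted putPage code."""
    fields = [
        ("wpSave", "1"),
        ("wpSummary", summary),
        ("wpTextbox1", text),
        ("wpEdittime", edit_time),    # copied from the form's hidden wpEdittime field
        ("wpStarttime", start_time),  # copied from the form's hidden wpStarttime field
    ]
    if minor:
        fields.append(("wpMinoredit", "1"))
    if watch:
        fields.append(("wpWatchthis", "1"))
    if token:
        # putPage only sends the token when one was supplied.
        fields.append(("wpEditToken", token))
    return fields


fields = build_edit_fields("the test", "the comment",
                           "20060103161523", "20060103161918")
print(urlencode(fields))
```

The timestamps here are the ones from the post above; a real run would take fresh values from the edit form each time.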
Old Jan 3rd, 2006, 11:17 AM   #6
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
Maybe try changing wpSave from 1 to "Save page"?
Old Jan 3rd, 2006, 11:59 AM   #7
Cerulean
Professional Programmer
 
Join Date: Apr 2005
Location: London, England
Posts: 459
Rep Power: 4 Cerulean is on a distinguished road
A good tip if you use Firefox is to use the live HTTP headers extension so you can see exactly what POST data and headers are being sent to each page, so you can copy the full string and manipulate the bits as need be.
Old Jan 3rd, 2006, 8:30 PM   #8
Steveire
Newbie
 
Join Date: Jan 2006
Posts: 13
Rep Power: 0 Steveire is on a distinguished road
Quote:
Originally Posted by Arevos
Maybe try changing wpSave from 1 to "Save page"?
I think I tried that and some other variations. I can't remember.

Quote:
Originally Posted by Cerulean
A good tip if you use Firefox is to use the live HTTP headers extension so you can see exactly what POST data and headers are being sent to each page, so you can copy the full string and manipulate the bits as need be.
That's a nice extension, thanks, but I don't think it shows anything that isn't in the page source. The headers tab in the View Page Info box doesn't work, and there's a reference to a nonexistent file called nsHeadersInfo.js that it wants me to find. Maybe it's something to look into another time. I'm new to this complex stuff (which I think should be easy: just a simple form submit), so I might be missing something.

I might have to go back to square one on this. I have some links that I'll look at when I get the chance:

http://effbot.org/librarybook/httplib.htm
There might be a new way to approach this in there, for example by sending different headers.

http://comments.gmane.org/gmane.scie...echnical/21150

Quote:
If you're referring to offsite form submissions automated with JavaScript, we already have protection in place to prevent this for registered users.
This seems irrelevant to me, as JavaScript is not involved AFAIK. However, I won't rule out the possibility that I'm getting 403s because they don't want automated edits.

Incidentally, I tried submitting the login form with Python, but with no luck. The following code, saved in an HTML file and opened in Firefox, allows me to log in. However, similar code to submit the edit page form (which I can't seem to recreate) didn't work.
<form name="userlogin" method="post" action="http://en.wikipedia.org/w/index.php?title=Special:Userlogin&amp;action=submitlogin&amp;type=login">
	
	<table>
		<tr>
			<td align='right'><label for='wpName1'>Username:</label></td>
			<td align='left'>
				<input type='text' class='loginText' name="wpName" id="wpName1"
					value="Steveire" size='20' />
			</td>
		</tr>

		<tr>
			<td align='right'><label for='wpPassword1'>Password:</label></td>
			<td align='left'>
				<input type='password' class='loginPassword' name="wpPassword" id="wpPassword1"
					value="" size='20' />
			</td>
		</tr>
		<tr>
			<td></td>

			<td align='left' style="white-space:nowrap">
				<input type='submit' name="wpLoginattempt" id="wpLoginattempt" value="Log in" />&nbsp;<input type='submit' name="wpMailmypassword" id="wpMailmypassword"
									value="E-mail new password" />

							</td>
		</tr>
	</table>
</form>
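
For reference, here is a rough Python equivalent of submitting that login form. A browser keeps the cookies it gets back from a login, so a script has to do the same if later edit requests are to count as logged in. This sketch uses Python 3's http.cookiejar and urllib.request (an assumption; the thread uses Python 2.2). The field names and URL come from the form above; make_session, make_login_request, and the example credentials are hypothetical, and nothing here is sent over the network.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

LOGIN_URL = ("http://en.wikipedia.org/w/index.php"
             "?title=Special:Userlogin&action=submitlogin&type=login")


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def make_login_request(username, password):
    """Build (but do not send) the login POST using the form's field names."""
    data = urlencode([
        ("wpName", username),
        ("wpPassword", password),
        ("wpLoginattempt", "Log in"),  # the value of the submit button
    ]).encode("utf-8")
    return Request(LOGIN_URL, data=data)


opener, jar = make_session()
login = make_login_request("ExampleUser", "not-a-real-password")
# opener.open(login) would send it; later opener.open(...) calls reuse the cookies.
print(login.get_method(), login.full_url)
```

Using one opener for both the login and the later edit request is the design point: the cookie jar is what carries the session from one call to the next.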

Once again, I'm stumped.

Have you ever tried to do this in Python or any other language? I never imagined a few batch operations would prove so difficult once the internet is involved... :/
Old Jan 5th, 2006, 9:32 AM   #9
Cerulean
Professional Programmer
 
Join Date: Apr 2005
Location: London, England
Posts: 459
Rep Power: 4 Cerulean is on a distinguished road
I regularly use Python to automate form submissions. I'm sorry to say I've never really run into major problems like this. Just compare what you're sending, make sure the server isn't doing anything funky because of your user agent (can't imagine Wikipedia browser sniffing though, to be honest), and it normally Just Works.
Live HTTP Headers doesn't show you anything you couldn't figure out from the page source, provided you can be bothered to trace it through and understand what headers the browser sends to the server. But that is much more time-consuming than just submitting the page and looking at which headers and GET or POST data were sent.
Old Jan 5th, 2006, 9:44 AM   #10
Arevos
Programming Guru
 
Join Date: Aug 2005
Location: England
Posts: 1,499
Rep Power: 4 Arevos is on a distinguished road
You could also create a quick Python program that echoes all the TCP data it receives to STDOUT. Point your form submission program at localhost and capture its output. Then open your browser to the Wikipedia edit page, open your hosts file and alias en.wikipedia.org to 127.0.0.1, and try pressing the save button. The browser's request should be caught by your TCP logger. You can then compare the HTTP request from the browser with the HTTP request you're sending from Python.
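
A minimal sketch of such a logger, written for Python 3's socket module (an assumption; the thread uses Python 2.2). open_listener and capture_once are hypothetical names. It accepts a single connection and returns whatever bytes arrive, without sending any reply, so the browser or script will simply see the request hang or fail after its data has been captured.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(1)
    return sock


def capture_once(listener, idle_timeout=1.0):
    """Accept one connection and return every byte it sends.

    Reading stops when the client closes the connection or stays silent
    for idle_timeout seconds (browsers often keep the connection open).
    """
    conn, _addr = listener.accept()
    chunks = []
    with conn:
        conn.settimeout(idle_timeout)
        try:
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break  # client closed the connection
                chunks.append(chunk)
        except socket.timeout:
            pass  # client went quiet; assume the request is complete
    return b"".join(chunks)
```

To log a browser request as described, something like print(capture_once(open_listener(port=80)).decode("latin-1")) should do. Binding to port 80 usually needs administrator rights, and the hosts file entry should be removed again afterwards.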
DaniWeb IT Discussion Community
All times are GMT -5. The time now is 10:30 PM.
