Programming Forums


wingz198 Oct 24th, 2005 10:07 AM

Problem with regular expression?
 
I'm trying to write a program that processes a web log file. Here's an example entry:

3236 "GET /robert/./index.php?page=links HTTP/1.1" "30/Sep/2005:11:11:38 -0400" "Java/1.4.1_05" "-" - - 200 69.177.179.241

I made a regex to get the important parts, and I'm trying to print the first capture, the 'bytes' field:

$entry =~ /\
                  (\d+)\s+                        #bytes
                  (".*")\s+              #method.url.hvers
                  (".*")\s+              #date & time
                  (".*")\s+              #useragent
                  (".*")\s+              #referer
                  .*\s+.*\s+
                  (\d+)\s+                #statuscode
                  (.*)                    #ipaddy
                  /x;
            $bytes = $1;
            print $bytes;


I get an error:

Use of uninitialized value in print at ./prog11.pl line 54 (#1)
    (W uninitialized) An undefined value was used as if it were already
    defined.  It was interpreted as a "" or a 0, but maybe it was a mistake.
    To suppress this warning assign a defined value to your variables.

when I try to print it out. Is there something wrong with the expression? The way I see it, it should work.

Polyphemus_ Oct 24th, 2005 10:18 AM

What line is line 54?

wingz198 Oct 24th, 2005 10:28 AM

Sorry, forgot to post that. It's the 'print $bytes'. It works until I put the print statement in there.

mackenga Jan 24th, 2006 5:49 PM

That regexp looks a little strange to me. Here's the gist of it, squished onto one line the way I'm more used to reading them:


(\d+)\s+(".*")\s+(".*")\s+(".*")\s+(".*")\s+.*\s+.*\s+(\d+)\s+(.*)

This regexp seems to have several faults. The first one is using (".*") to match a quoted string. What this actually matches is a quote, then zero or more of anything (including more quotes), then a quote; because .* is greedy, it can run all the way to the last quote on the line. You could use a nongreedy quantifier here (a ? after the *), but the more efficient method is to change that . to a character class that, if you want to be generous, just excludes quotes:


^(\d+)\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")\s+\S+\s+\S+\s+(\d{3})\s+(.*)$

I haven't actually tested the above expression. I've replaced (".*") with the character-class version each time it occurs, and replaced the two later .*'s (the ones you weren't capturing) with \S+ (one or more non-whitespace characters), because otherwise the .* consumes the whitespace too. I changed the \d+ for the status code to \d{3} (the {3} is a quantifier meaning exactly three of the preceding atom, which is OK here since all HTTP status codes are 3 digits long). Other than that, I've added start and end anchors (^ and $) to the expression to make sure it matches the whole line and fails on lines where it can't do that.
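For anyone who wants to try the expression, here is a minimal sketch of how it might be run against the sample entry from the first post. The variable names ($log_re, @fields, $greedy, $bounded) are made up for illustration, and the pattern is the one above, only spread out with /x. It also shows the difference between the greedy ".*" and the "[^"]+" character class on the same line.

```perl
use strict;
use warnings;

# Sample entry copied from the first post.
my $entry = '3236 "GET /robert/./index.php?page=links HTTP/1.1" '
          . '"30/Sep/2005:11:11:38 -0400" "Java/1.4.1_05" "-" - - 200 69.177.179.241';

# Greedy ".*" runs from the first quote to the last quote on the line.
my ($greedy) = $entry =~ /(".*")/;

# "[^"]+" cannot cross a quote, so it stops at the end of the first field.
my ($bounded) = $entry =~ /("[^"]+")/;

# The suggested pattern, spread over several lines with /x.
my $log_re = qr/
    ^ (\d+)      \s+   # bytes
    ("[^"]+")    \s+   # method, URL, HTTP version
    ("[^"]+")    \s+   # date and time
    ("[^"]+")    \s+   # user agent
    ("[^"]+")    \s+   # referer
    \S+ \s+ \S+  \s+   # two fields that are not captured
    (\d{3})      \s+   # status code
    (.*) $             # IP address
/x;

# In list context a match returns its captures, or an empty list on failure.
my @fields = $entry =~ $log_re;

if (@fields) {
    my ($bytes, $request, $when, $agent, $referer, $status, $ip) = @fields;
    print "bytes=$bytes status=$status ip=$ip\n";
} else {
    print "no match\n";
}
```

On the sample entry this should print bytes=3236 status=200 ip=69.177.179.241. Taking the captures from the list returned by the match, rather than reading $1 afterwards, means a failed match shows up as an empty list instead of a stale or undefined $1.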

Like I say, I haven't actually tested this, but I hope it helps. This may not appear to have much to do with the error about $bytes, but if the expression as a whole fails to match for any reason, none of the submatch variables will be set, and a greedy .* that eats more text than you intended makes that kind of failure much harder to track down.
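A small sketch of that last point, with the caveat that bytes_from_entry is a hypothetical helper and the shortened pattern inside it is only for illustration: test whether the match succeeded before touching $1. The second half is an observation about the posted code rather than anything confirmed in this thread. The original pattern opens with /\ at the end of a line, and under /x a backslashed newline is not ignored, it matches a literal newline. If that is what is happening, the pattern can only match where a newline comes directly before the digits.

```perl
use strict;
use warnings;

# Sample entry copied from the first post.
my $entry = '3236 "GET /robert/./index.php?page=links HTTP/1.1" '
          . '"30/Sep/2005:11:11:38 -0400" "Java/1.4.1_05" "-" - - 200 69.177.179.241';

# Hypothetical helper: returns the bytes field, or undef if the line
# does not look like a log entry.
sub bytes_from_entry {
    my ($line) = @_;
    if ($line =~ /^(\d+)\s+"[^"]+"/) {
        return $1;    # safe to use: the match succeeded
    }
    return;           # no match, so $1 holds nothing useful
}

my $bytes = bytes_from_entry($entry);
print defined($bytes) ? "bytes: $bytes\n" : "entry did not match\n";

# Same opening as the posted pattern: a backslash, then a line break.
# Under /x the escaped newline is significant and must appear in the input.
my $escaped_newline = qr/\
    (\d+)/x;

# 0: the entry has no newline in front of the digits.
my $matches_plain = ($entry =~ $escaped_newline) ? 1 : 0;

# 1: with a newline placed directly before the digits the pattern matches.
my $matches_after_nl = ("\n$entry" =~ $escaped_newline) ? 1 : 0;

print "plain: $matches_plain, after newline: $matches_after_nl\n";
```

If the stray backslash is the culprit, dropping it (so the pattern starts with a plain /) would be the smallest change to try, on top of the character-class fixes above.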

Hope this helps!

