Jan 24th, 2006, 5:49 PM   #4
mackenga
Professional Programmer
 
Join Date: Mar 2005
Location: Glasgow, Scotland
Posts: 317
That regexp looks a little strange to me. Here is the gist of it, squished up the way I'm more used to reading it:

(\d+)\s+(".*")\s+(".*")\s+(".*")\s+(".*")\s+.*\s+.*\s+(\d+)\s+(.*)

This regexp seems to have several faults. The first is using (".*") to match a quoted string: what this actually matches is a quote, then zero or more of anything (including more quotes), then a quote, and because .* is greedy it stretches to the last quote it can reach on the line. You could use a nongreedy quantifier here (? after the *), but the more efficient method is to change that . to a character class that, if you wanted to be generous, just excludes quotes:

^(\d+)\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")\s+\S+\s+\S+\s+(\d{3})\s+(.*)$

I haven't actually tested the above expression. I've replaced (".*") with the character-class version each time it occurs, and replaced the two later .*'s (the ones you weren't capturing) with \S+ (one or more nonwhitespace characters), because otherwise the .* consumes the whitespace too. I changed the \d+ for the status code to \d{3} (the {3} is a quantifier meaning exactly three of the preceding atom, which is OK here since all HTTP status codes are 3 digits long). Other than that, I've added start and end anchors (^ and $) to make sure the expression matches the whole line and fails on lines where it can't. Note that [^"]+ needs at least one character between the quotes, so an empty "" field would not match; use [^"]* if that can occur.
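For anyone who wants to try this out, here is a rough sketch in Python (whose regex syntax agrees with Perl's for everything used here). The sample line is made up for illustration and is only a guess at the shape of the input; it is not taken from the original data. It compares the greedy, nongreedy and character-class versions of the quoted-string match, then runs the full rewritten expression.

```python
import re

# Hypothetical input line, invented for illustration: a number, four quoted
# fields, two unquoted fields, a 3-digit status code, then the rest.
line = '42 "GET /index.html" "example.com" "Mozilla/5.0" "-" alice bob 200 1534'

# Three ways of matching "a quoted string".
greedy = re.compile(r'(".*")')       # .* runs to the LAST quote it can reach
lazy = re.compile(r'(".*?")')        # nongreedy: stops at the next quote
negated = re.compile(r'("[^"]+")')   # character class: cannot cross a quote

greedy_first = greedy.search(line).group(1)
lazy_first = lazy.search(line).group(1)
negated_first = negated.search(line).group(1)

print(greedy_first)    # spans all four quoted fields
print(lazy_first)      # just the first quoted field
print(negated_first)   # just the first quoted field

# The rewritten expression from the post, unchanged.
pattern = re.compile(
    r'^(\d+)\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")'
    r'\s+\S+\s+\S+\s+(\d{3})\s+(.*)$'
)
fields = pattern.match(line).groups()
print(fields)
```

On this made-up line the greedy version swallows everything from the first quote to the last one, while the other two stop at the end of the first field, which is the behaviour you want for each capture group.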

As I said, I haven't actually tested this, but I hope it helps. This may not appear to have much to do with the error about $bytes - but if the .* constructs eat the wrong text, the expression can fail to match (or capture the wrong fields), and if the whole expression doesn't match, none of the submatch variables will be set by it.
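Whatever the expression ends up being, it is worth checking that the match succeeded before touching the captured values. Here is a minimal sketch of that check in Python; parse_line and both sample lines are hypothetical names and data made up for illustration. In Perl the same idea is to put the match inside an if and only read the submatch variables in that branch.

```python
import re

# The anchored expression from above.
PATTERN = re.compile(
    r'^(\d+)\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")\s+("[^"]+")'
    r'\s+\S+\s+\S+\s+(\d{3})\s+(.*)$'
)

def parse_line(line):
    """Return the captured fields as a tuple, or None if the line doesn't match."""
    m = PATTERN.match(line)
    if m is None:
        # No match means no captures: report it instead of using stale values.
        return None
    return m.groups()

# Hypothetical sample lines, invented for illustration.
good = '42 "GET /index.html" "example.com" "Mozilla/5.0" "-" alice bob 200 1534'
bad = '42 "only one quoted field" 200 1534'

good_result = parse_line(good)
bad_result = parse_line(bad)

print(good_result)
print(bad_result)   # None: the line lacks the other three quoted fields
```

Logging the lines that come back as None is a quick way to find out which inputs the expression can't handle.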

Hope this helps!