Old Jul 3rd, 2005, 9:20 AM   #4
mackenga
 
Rooting through HTML is something I've done a painfully large amount of. There are two pitfalls to look out for (I notice the code examples above have jumped right into them, but it's easily done). The first is HTML's case insensitivity: use the i option on regexps, because some people write their tags in caps, like <TD> rather than <td>. The second is the dodgy nature of much production HTML. For example, look out for spaces and other cruft in the tags; e.g.

<   Td D colsp6o4y2gh >

OK, it's not often THIS bad, but most browsers would accept the above as:

<TD>

so any crawler code should be equally generous. Check the HTML specification if in doubt about what to accept and where.
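
To make that concrete, here's a rough sketch of a forgiving tag match. I've written it in Python purely for illustration; the helper name open_tag_pattern and the sample strings are my own assumptions, not anything from the code earlier in the thread. It ignores case, allows stray whitespace after the opening bracket, and skips over whatever junk sits between the tag name and the closing bracket.

```python
import re

# Hypothetical helper (not from the thread): build a regex that matches
# an opening tag by name while tolerating mixed case, whitespace after
# the "<", and arbitrary cruft before the ">".
def open_tag_pattern(name):
    return re.compile(
        r"<\s*" + re.escape(name) + r"\b[^>]*>",
        re.IGNORECASE,  # the "i option": <TD>, <td> and <Td> all match
    )

TD_OPEN = open_tag_pattern("td")

# Illustrative inputs only; the third is the messy tag from the post.
samples = ["<td>", "<TD>", "<   Td D colsp6o4y2gh >", "<tdx>", "<table>"]
for s in samples:
    print(repr(s), bool(TD_OPEN.search(s)))
```

The \b after the tag name stops the pattern from treating <tdx> as a <td>. One known weakness: [^>]* gives up at the first ">", so a ">" inside a quoted attribute value will cut the match short. For anything beyond quick scraping, a real HTML parser is the safer bet.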