Rooting through HTML is something I've done a painfully large amount of. There are two pitfalls to look out for (I notice the code examples above have jumped right into them, but it's easily done). The first is HTML's case insensitivity: use the i option on regexps, since some people write their tags in caps, like <TD> rather than <td>. The second is the dodgy nature of much production HTML code. For example, look out for spaces and other cruft in the tags; e.g.
OK, it's not often THIS bad, but most browsers would accept the above as:
so any code crawlers should be equally generous. Check the HTML specification if in doubt about what to accept and where.
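
To make the two pitfalls concrete, here is a rough sketch in Python (not the language of the examples above; re.IGNORECASE plays the role of the i option). The pattern, the function name count_cells and the sample string are all made up for illustration, and it only tolerates mixed case plus whitespace or attributes after the tag name. It will still trip over an attribute value that contains a ">", so treat it as a starting point rather than a real parser.

```python
import re

# Hypothetical pattern: an opening td tag in any letter case, allowing
# whitespace, newlines and attributes between the tag name and the ">".
# The \b stops it from matching longer names that merely start with "td".
TD_OPEN = re.compile(r"<td\b[^>]*>", re.IGNORECASE)


def count_cells(html: str) -> int:
    """Count opening td tags, however sloppily they are written."""
    return len(TD_OPEN.findall(html))


# Made-up sample mixing caps, a trailing space and a line break in the tags.
sample = "<TD>a</TD><td >b</td><Td\n align='left' >c</tD>"
print(count_cells(sample))  # prints 3
```

A case-sensitive pattern with no allowance for whitespace would only find one of those three cells.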