#1
Newbie
Join Date: Jun 2005
Posts: 1
Rep Power: 0
Getting Info From an HTML File
I'm looking for a simple way to do this and just need to be pointed in the right direction. I'm trying to get the information that is inside tables in an HTML file. What I want is to turn the data inside the <TD> cells into an array, so that everything between <TD> and </TD> gets added to the array. Does anyone know a simple way of doing this? The way I have started going about it is getting very complex. Any help is much appreciated.
-Jonny
#2
Professional Programmer
read the file, and for each line do something like
print $1 if /<td>(.*)<\/td>/;
Dizz
#3
Professional Programmer
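here it is in a little more detail
# open the HTML file for reading, and stop if that fails
open(my $fh, '<', 'file.html') or die "Cannot open file.html: $!";
my @tables;
while (<$fh>)
{
    chomp;
    # keep whatever sits between <td> and </td> on this line
    if (/<td>(.*)<\/td>/) {
        push @tables, $1;
    }
}
close $fh;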
#4
Professional Programmer
Join Date: Mar 2005
Location: Glasgow, Scotland
Posts: 335
Rep Power: 4
Rooting through HTML is something I've done a painfully large amount of. Pitfalls to look out for (I notice the code examples above jump right into them, but it's easily done) are HTML's case insensitivity (use the i option on regexps; some people use caps in their HTML tags, like <TD> rather than <td>) and the dodgy nature of much production HTML code. For example, look out for spaces and other cruft in the tags; e.g.
< Td D colsp6o4y2gh >
OK, it's not often THIS bad, but most browsers would accept the above as <TD>, so any code crawlers should be equally generous. Check the HTML specification if in doubt about what to accept and where.
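To make those pitfalls concrete, here is a minimal sketch of a more forgiving match. It assumes the whole file has already been read into one string, and the helper name extract_cells and the sample markup are made up for illustration. It is still only a regex, so nested tables or a literal > inside an attribute value will trip it up; a real HTML parser is the safer route for anything serious.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the contents of
# every <td>...</td> pair found in $html, in document order.
sub extract_cells {
    my ($html) = @_;
    my @cells;
    # /i  : match <TD>, <td> and <Td> alike
    # /s  : let . match newlines, so a cell may span several lines
    # /g  : keep going, so several cells on one line are all found
    # \s* and [^>]* tolerate stray spaces and attributes inside the tag
    while ($html =~ m{<\s*td\b[^>]*>(.*?)<\s*/\s*td\s*>}gis) {
        my $cell = $1;
        $cell =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @cells, $cell;
    }
    return @cells;
}

# Made-up sample input covering the cases mentioned above.
my $sample = <<'HTML';
<table>
<tr><TD>one</TD><td colspan="2"> two </td></tr>
<tr>< Td D colsp6o4y2gh >three
</td></tr>
</table>
HTML

my @cells = extract_cells($sample);
print "$_\n" for @cells;
```

The non-greedy (.*?) matters here: with a plain (.*), two cells on the same line would be swallowed as one match running from the first <td> to the last </td>.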