Programming Forums
Old Jun 10th, 2008, 5:34 PM   #1
abhisheksainiabhishek
Newbie
 
Join Date: Jun 2008
Posts: 12
Rep Power: 0 abhisheksainiabhishek is on a distinguished road
string search

I am writing a Perl program that should do the following.

For example, if I have an HTML file like this:

<b>this is bold.</b>This is
bold too</b>

I have to write a program (without using any HTML parser functions) that prints it like this:

<b>this is bold.This is bold too</b>

Basically, it should remove unnecessary tags.

I can only use regular expressions for it.

My instructor advised me not to read the HTML file line by line, since that would not handle a tag that opens on one line and closes on a later line (as in the file above). It was suggested that I put the whole HTML file into one scalar variable.
Now my program reads the whole HTML file into one scalar variable. My question is: how do I search for several instances of <b> and </b> tags in that scalar? Should I read it character by character? I am very confused about this part. Please advise. Thanks!
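
One possible way to find every <b> and </b> in the scalar without walking it character by character is a global match (m//g) in a while loop. The sketch below is only an illustration: the sample string stands in for the slurped file, and the names $html and @found are assumptions, not taken from the original program.

```perl
use strict;
use warnings;

# Sample input held in one scalar, standing in for the slurped file.
my $html = "<b>this is bold.</b><b>This is\nbold too</b>";

# In a while loop, m//g resumes just past the previous match each time,
# so every tag is visited once and no character loop is needed.
my @found;
while ($html =~ m{(</?b>)}gi) {
    # pos() is the offset just past the match; subtracting the length
    # of the matched tag gives the offset where the tag starts.
    push @found, [ $1, pos($html) - length $1 ];
}

# Report each tag and where it begins in the string.
printf "%-4s at offset %d\n", @$_ for @found;
```

Because the newline is just another character in the scalar, a tag pair that spans two lines is found the same way as one on a single line.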
Old Jun 10th, 2008, 7:43 PM   #2
abhisheksainiabhishek
Re: string search

Hi,

So far I am able to remove the redundant bold tags, turning

<b>abcd</b><b>efgh</b><b>ijkl</b>

into

<b>abcdefghijkl</b>

by using:
$allHtmlDocument =~ s/$endBoldTag(\s*)$startBoldTag/$1/gi;

Now the problem is this. If I have

<b>abcd</b><i><b>efgh</i></b>

and I want to turn it into

<b>abcd<i>efgh</i></b>

then I still need to remove the bold tags (as there are only tags between them), but I also need to keep the tags between them. How would I capture those tags? I can't figure out a way, since I am not reading the document line by line.

Thanks!
Old Jun 10th, 2008, 8:14 PM   #3
Sane
Programming Guru
 
Join Date: Apr 2005
Location: Waterloo, Ontario
Posts: 1,888
Rep Power: 5 Sane will become famous soon enough
Re: string search

For clarification, did you make a mistake in your first post?

<b>this is bold.</b>This is 
bold too</b>

Was that supposed to be:

<b>this is bold.</b><b>This is 
bold too</b>

Or are you saying you want to remove both of these "types" of unnecessary tags?
  1. The tags that make the HTML malformed
  2. The tags that close and reopen needlessly
Looking at your first post, you seem to want the first type.
Looking at your second post, you seem to want the second type.

So could you clarify?
Old Jun 10th, 2008, 8:26 PM   #4
abhisheksainiabhishek
Re: string search

Thanks for your quick reply.

Sorry about the mistake in the first post and the lack of clarification.

You are right: I need to get rid of the tags that "close and open" needlessly.

Please advise me on how to deal with them when there are other tags between them (but no text).

Will the special variables ($1 and so on) play any role?

I tried using them, but what if there is more than one tag between the bold tags?

Thanks!
Old Jun 10th, 2008, 8:35 PM   #5
Sane
Re: string search

I'd find each pair of "</b>[random junk]<b>", and then call some function that checks the "sanity" of the [random junk]. If the random junk is insane (meaning that it contains nothing but other tags), then delete the "</b>" and "<b>". If it is sane, then proceed to the next pair.

The way I would check the sanity is by seeing whether any non-space characters lie outside a pair of <> HTML tags. You might be able to do that with regex. I'm not experienced enough in regex to say.
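
A minimal sketch of the check described above, assuming the junk between the tags may contain only whitespace and tags other than <b> itself. The helper name merge_bold is hypothetical, and the pattern is just one way to express the test in a single regex.

```perl
use strict;
use warnings;

# merge_bold is a hypothetical helper, not something from the thread.
# It drops a "</b> ... <b>" pair when everything between the two tags
# is whitespace or other tags, and keeps whatever sat in between.
sub merge_bold {
    my ($html) = @_;
    $html =~ s{
        </b>                    # closing bold tag
        (                       # $1: the "junk" to keep
          (?: \s                #   whitespace, or
            | <(?!/?b\b)[^<>]*> #   any tag that is not <b> or </b>
          )*
        )
        <b>                     # reopening bold tag
    }{$1}gix;
    return $html;
}

print merge_bold('<b>abcd</b><i><b>efgh</i></b>'), "\n";
print merge_bold('<b>abcd</b> plain text <b>efgh</b>'), "\n";
```

The capture group answers the $1 question: whatever tags sat between the two bold tags are put back by the replacement, however many of them there are. When text appears between the tags the pattern does not match, so that pair is left alone.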
Old Jun 10th, 2008, 8:41 PM   #6
abhisheksainiabhishek
Re: string search

Quote:
Originally Posted by Sane View Post
I'd find each pair of "</b>[random junk]<b>", and then call some function that checks the "sanity" of ...........there exists any non-space characters that lie outside a pair of <> html tags. You might be able to do that with regex. I'm not experienced enough in regex to say.
thank you very much!
Old Jun 10th, 2008, 10:13 PM   #7
Sane
Re: string search

It might also be important to note that even if the tags are "redundant" in a sense, removing them might make the HTML non-compliant with certain standards.

For example,

<div>
    <b>This is the first body of text.</b>
</div>
<div>
    <b>This is the second body of text.</b>
</div>

Would be processed to:

<div>
    <b>This is the first body of text.
</div>
<div>
    This is the second body of text.</b>
</div>

And even though that may work in all browsers (can anyone confirm this?), it is still non-compliant with HTML standards (something you should not do on the job).

Therefore, it's best to make sure that the [random junk] only consists of a predictable set of tags (probably <i>, <u>, <em>, etc.). If you can't predict what tags might be in the [random junk], then there's a lot more work ahead of you.
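
One way to act on that caution is to merge only across an explicit list of inline tags. The sketch below assumes <i>, <u> and <em> are the only tags allowed between the bold tags; that list and the helper name merge_bold_inline are illustrative assumptions, not something fixed by the thread.

```perl
use strict;
use warnings;

# Opening or closing forms of the inline tags assumed safe to leave
# between two bold runs. The list is an assumption; adjust as needed.
my $inline = qr{</?(?:i|u|em)\s*>}i;

# merge_bold_inline is a hypothetical helper name.
# It removes "</b> ... <b>" only when the part in between consists of
# whitespace and listed inline tags, and puts that part back via $1.
sub merge_bold_inline {
    my ($html) = @_;
    $html =~ s{</b>((?:\s|$inline)*)<b>}{$1}gi;
    return $html;
}

my $divs = "<div>\n    <b>first</b>\n</div>\n<div>\n    <b>second</b>\n</div>\n";

# </div> and <div> are not in the list, so the two bold runs stay separate.
print merge_bold_inline($divs);

# <i> is in the list, so the bold runs are joined around it.
print merge_bold_inline('<b>abcd</b><i><b>efgh</i></b>'), "\n";
```

Any tag outside the list blocks the merge, so block-level markup such as the <div> example is left untouched.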
Old Jun 11th, 2008, 4:54 PM   #8
abhisheksainiabhishek
Re: string search

Thanks.

But it's OK for the project I needed to do.
Thanks a lot!
All times are GMT -5.