Programming Forums
Feb 27th, 2006, 5:32 PM   #1
hoffmandirt
Hobbyist Programmer
Join Date: Jul 2005
Location: PA
Posts: 125
Rep Power: 4 hoffmandirt is on a distinguished road
Word Frequency Regular Expression

I have been working on a word frequency application that works as follows:

1. Retrieves a line from the text file.
2. Splits the line on spaces.
3. Iterates through each word, storing it in a hash table with a count of 1 if it is not already there. If it is, increments the corresponding count by 1.
4. Repeats with the next line.

My problem is that I don't have much experience with text processing or regular expressions, and I am getting words such as "testing," with the punctuation still attached. I'm having trouble coming up with a regular expression that verifies whether the current token is a word. I guess what I'm getting at is that I need a regular expression that allows punctuation, but not periods, commas, exclamation points, and so on. Any input on text processing and regular expressions is also appreciated. Thanks.
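
For reference, a minimal sketch of the four steps described in this post. The thread does not name a language, so Python is an assumption here, as are the function name `count_words` and the sample text. Note that `str.split()` with no argument splits on any whitespace, not only spaces.

```python
from collections import Counter
import io

def count_words(lines):
    """Count whitespace-separated tokens, one line at a time."""
    counts = Counter()                 # plays the role of the hash table
    for line in lines:                 # steps 1 and 4: take each line in turn
        for word in line.split():      # step 2: split the line into tokens
            counts[word] += 1          # step 3: insert or increment
    return counts

# io.StringIO stands in for an open text file (hypothetical sample text).
sample = io.StringIO("testing one two\ntesting, one two three\n")
counts = count_words(sample)
print(counts["testing"], counts["testing,"])
```

With this naive split, "testing" and "testing," are counted as two different words, which is the problem described above.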
Feb 27th, 2006, 7:01 PM   #2
titaniumdecoy
Expert Programmer
Join Date: Nov 2005
Posts: 855
Rep Power: 3 titaniumdecoy is on a distinguished road
Why not simply remove all punctuation from each string?
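
A minimal sketch of that suggestion, assuming Python and ASCII punctuation only (the helper name `strip_punctuation` is hypothetical). It uses the standard `str.translate` method with a table that deletes every character in `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_DROP_PUNCT)

print(strip_punctuation("testing, one; two!"))  # testing one two
print(strip_punctuation("It's"))                # Its
```

One side effect worth knowing about: apostrophes go too, so "It's" and "Its" end up as the same key.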
Feb 27th, 2006, 7:15 PM   #3
DaWei
Resident Grouch
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
It isn't a trivial process. "Its" and "It's" aren't the same thing. "It's" represents two words. "Test" and "testing" are, truly, variations of the same word. Generally, depending upon the purpose of the task, people back off their requirements an appropriate amount. Running the GNU license through your processor will highlight the sort of thing I mean. I would venture to say that the use of regular expressions is overkill, regardless of goal, and will slow the process down considerably.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code.
Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers
Feb 27th, 2006, 7:36 PM   #4
hoffmandirt
Hobbyist Programmer
Join Date: Jul 2005
Location: PA
Posts: 125
Rep Power: 4 hoffmandirt is on a distinguished road
I don't have a requirement that says I need to remove punctuation from the words, but I was looking for a way to verify that the text was a valid word. You make a good point about the RE, though. It doesn't really fit the application here, where I am just looking to do away with periods and such. Do you agree that it would be sufficient to replace periods with spaces and go from there?
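
A quick sketch of what "replace periods with spaces" does in practice, assuming Python (the function name `periods_to_spaces` and the sample line are made up for illustration):

```python
def periods_to_spaces(line):
    """Replace every period with a space, then split on whitespace."""
    return line.replace(".", " ").split()

print(periods_to_spaces("Stop.Go, pi is 3.14"))
# ['Stop', 'Go,', 'pi', 'is', '3', '14']
```

Periods are gone, but the comma is still attached to "Go," and the number 3.14 has been split into two tokens, so this alone probably isn't enough.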
Feb 27th, 2006, 8:21 PM   #5
DaWei
Resident Grouch
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10 DaWei is on a distinguished road
Personally, I'd strip just about everything but alphanumerics and single quotes. Shoot, I'd even strip the single quotes if they weren't internal to a 'word'. As I say, you're not going to be entirely pleased, whatever you do. Copyright notices, dates, all sorts of things garfle up the process.
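
One way to read that rule, sketched in Python without regular expressions. The names `clean_word` and `words` are hypothetical, and `str.isalnum()` is Unicode-aware, so accented letters count as alphanumeric here. This is an approximation: any single quote that ends up inside the token after other punctuation is dropped is treated as internal.

```python
def clean_word(token):
    """Keep alphanumerics and single quotes, then trim quotes from both ends."""
    # Drop everything except letters, digits, and single quotes.
    kept = "".join(ch for ch in token if ch.isalnum() or ch == "'")
    # Quotes left at either end are not internal to the word, so trim them.
    return kept.strip("'")

def words(line):
    """Split a line on whitespace, clean each token, and skip empty results."""
    return [w for w in (clean_word(t) for t in line.split()) if w]

print(words("'It's' testing, -- (c) 2006."))
# ["It's", 'testing', 'c', '2006']
```

The sample line shows the caveat from the post: the copyright marker and the date still come through as "words", so some noise survives whatever rule is chosen.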
DaniWeb IT Discussion Community