#1 |
|
Hobbyist Programmer
|
Word Frequency Regular Expression
I have been working on a word frequency application that works as follows:
1. Retrieves a line from a text file.
2. Splits the line on spaces.
3. Iterates through the words, storing each one in a hash table if it is not already there. If it is, updates the corresponding value by adding 1.
4. Repeats with the next line.

My problem is that I don't have much experience with text processing or regular expressions, and I am getting words such as "testing," with the comma still attached. I'm having trouble coming up with a regular expression that verifies whether the current token is a word. I guess what I'm getting at is that I need a regular expression that allows some punctuation, but not periods, commas, exclamation points, and so on. Any input on text processing and regular expressions is also appreciated. Thanks.
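For anyone following along, the four steps above could be sketched roughly like this. The thread never names a language, so Python is an assumption here, and `count_tokens` and `sample` are made-up names for illustration. `Counter` stands in for the hash table, and `split()` breaks on any whitespace rather than only spaces.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens, one line at a time."""
    counts = Counter()  # plays the role of the hash table
    for line in lines:
        for token in line.split():
            # A missing key starts at 0, so this covers both the
            # "not stored yet" case and the "add 1" case.
            counts[token] += 1
    return counts


# Hypothetical input standing in for lines read from a text file.
sample = ["testing, testing one two", "one more testing."]
counts = count_tokens(sample)
print(counts)
```

With this input, "testing,", "testing" and "testing." each end up under their own key, which is the problem described above.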
|
|
|
|
|
#2 |
|
Expert Programmer
|
Why not simply remove all punctuation from each string?
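A minimal sketch of that idea, assuming Python and ASCII-only punctuation (`string.punctuation` does not cover typographic quotes or dashes). The name `remove_punctuation` is just illustrative.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_DELETE_PUNCTUATION)


print(remove_punctuation("testing, one. It's done!"))
```

Note that the apostrophe goes too, so "It's" comes out as "Its".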
|
|
|
|
|
|
#3 |
|
Resident Grouch
Join Date: Jun 2005
Posts: 6,453
Rep Power: 10
It isn't a trivial process. "Its" and "It's" aren't the same thing. "It's" represents two words. "Test" and "testing" are, truly, variations of the same word. Generally, depending upon the purpose of the task, people back off their requirements an appropriate amount. Running the GNU license through your processor will highlight the sort of thing I mean. I would venture to say that the use of regular expressions is overkill, regardless of goal, and will slow the process down considerably.
__________________
Abstraction doesn't make it impossible to write bad code; it makes it possible to write superior code. Contributor's Corner: Grumpy on C++ Exceptions DaWei on Pointers |
|
|
|
|
|
#4 |
|
Hobbyist Programmer
|
I don't have a requirement that says I need to remove punctuation from the words, but I was looking for a way to verify that the text was a valid word. You make a good point about the regular expression, though. It doesn't really fit the application here, where I am just looking to do away with periods and such. Do you agree that it would be sufficient to replace periods with spaces and go from there?
|
|
|
|
|
|
#5 |
|
Resident Grouch
Personally, I'd strip just about everything but alphanumerics and single quotes. Shoot, I'd even strip the single quotes if they weren't internal to a 'word'. As I say, you're not going to be entirely pleased, whatever you do. Copyright notices, dates, all sorts of things garfle up the process.
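One way that rule might look without regular expressions, as a rough sketch only. Python is assumed, and `clean_token` and `words` are hypothetical names. `str.isalnum` also accepts non-ASCII letters and digits, which may or may not be what you want.

```python
def clean_token(token):
    """Keep alphanumerics and single quotes that are internal to the token."""
    # Drop everything except letters, digits, and single quotes.
    kept = "".join(ch for ch in token if ch.isalnum() or ch == "'")
    # Quotes left at either end are not internal to a word, so trim them.
    return kept.strip("'")


def words(line):
    """Split a line on whitespace and return the non-empty cleaned tokens."""
    cleaned = (clean_token(token) for token in line.split())
    return [word for word in cleaned if word]


print(words("'It's' a test, isn't it? (c) 2005!"))
```

The copyright marker and the year survive as "c" and "2005", so notices and dates still end up in the counts.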