Word Frequency Regular Expression
I have been working on a word frequency application that works as follows:
1. Retrieves a line from a text file.
2. Splits the line on spaces.
3. Iterates through the words, storing each one in a hash table with a count of 1 if it is not already there. If it is, the corresponding count is incremented by 1.
4. Repeats with the next line.
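For reference, the steps above can be sketched roughly as follows in Python. This is only an illustration of the approach, not my actual code: the function name `count_words_naive` and the sample lines are made up, a plain `dict` stands in for the hash table, and `str.split()` splits on any whitespace rather than only spaces.

```python
def count_words_naive(lines):
    """Count tokens by splitting each line on whitespace (hypothetical helper)."""
    counts = {}  # dict used as the hash table: token -> count
    for line in lines:                # steps 1 and 4: take each line in turn
        for word in line.split():     # step 2: split on whitespace
            # step 3: store with count 1 if new, otherwise add 1
            counts[word] = counts.get(word, 0) + 1
    return counts


# Made-up sample input; a real run would iterate over an open text file instead.
naive_sample = ["testing, testing one two", "one more testing"]
naive_counts = count_words_naive(naive_sample)

# "testing," and "testing" end up as two separate keys.
print(naive_counts)
```

With plain splitting, the trailing comma stays attached, so "testing," is counted separately from "testing".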
My problem is that I don't have much experience with text processing or regular expressions, and I am getting words such as "testing," with the punctuation still attached. I'm having trouble coming up with a regular expression that verifies whether the current token is a word. I guess what I'm getting at is that I need a regular expression that allows some punctuation in a word, but not periods, commas, exclamation points, and so on. Any input on text processing and regular expressions is also appreciated. Thanks.
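To make the question concrete, here is a minimal sketch of the kind of thing I think I am after, in Python. Everything in it is an assumption rather than a known-good answer: the pattern treats a word as ASCII letters with optional internal apostrophes or hyphens, lowercasing is my own choice, and the names `WORD_RE` and `count_words` are hypothetical.

```python
import re

# Assumed definition of a "word": one or more ASCII letters, optionally followed
# by groups of an apostrophe or hyphen plus more letters (e.g. "don't", "well-known").
# Periods, commas, exclamation points, and similar characters are never matched.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def count_words(lines):
    """Count words by extracting regex matches instead of splitting on spaces."""
    counts = {}
    for line in lines:
        # Lowercasing is an assumption so that "Testing" and "testing" are merged.
        for word in WORD_RE.findall(line.lower()):
            counts[word] = counts.get(word, 0) + 1
    return counts


# Made-up sample input for illustration.
sample = ["Testing, testing! Don't stop.", "A well-known test; testing..."]
result = count_words(sample)
print(result)
```

Pulling the words out with `findall` avoids having to validate each token after splitting, but this pattern would miss accented letters and digits, so it may need adjusting for real text.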