Programming Forums


hoffmandirt Jul 24th, 2007 12:04 PM

Parsing Microsoft Word Documents
 
I am looking for a way to parse Word documents so that I can avoid manually entering archived technology profiles. Each document consists of a title, a date, and a table containing the information I need. I am currently using the Java POI API for this, but it has only very basic support for MS Word. The only way I can parse the table is by relying on a special character that appears at the end of each cell and each row. However, that character also appears elsewhere inside a cell, for example when hyperlinks are used, so this approach isn't dependable. Is there a better API out there?
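To make the problem concrete, here is a minimal sketch of the marker-based splitting described above. It does not use POI at all; it works on a plain string. The marker value (U+0007), the class name `CellMarkSplitter`, and the rule "a marker right after another marker ends the row" are all assumptions for illustration, not something confirmed by the POI documentation.

```java
import java.util.ArrayList;
import java.util.List;

class CellMarkSplitter {
    // Assumed end-of-cell / end-of-row marker; the real character may differ.
    static final char MARK = '\u0007';

    // Splits raw table text into rows of cells.
    // Assumption: one marker closes a cell, and a marker that directly
    // follows another marker closes the current row.
    static List<List<String>> split(String raw) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean lastWasMark = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != MARK) {
                // Ordinary text belongs to the current cell.
                cell.append(c);
                lastWasMark = false;
            } else if (lastWasMark) {
                // Second marker in a row: treat it as the end of the row.
                if (!row.isEmpty()) {
                    rows.add(row);
                    row = new ArrayList<>();
                }
                lastWasMark = false;
            } else {
                // First marker after text: close the current cell.
                row.add(cell.toString().trim());
                cell.setLength(0);
                lastWasMark = true;
            }
        }
        if (!row.isEmpty()) {
            rows.add(row); // keep a trailing row that had no row marker
        }
        return rows;
    }

    public static void main(String[] args) {
        String clean = "Name\u0007Value\u0007\u0007Java\u00071.6\u0007\u0007";
        String stray = "Na\u0007me\u0007Value\u0007\u0007";
        System.out.println(split(clean));
        System.out.println(split(stray));
    }
}
```

With the clean input the two rows come back with two cells each. With the second input, where one extra marker sits inside what should be a single cell, the row comes back with three cells instead of two, which is exactly why this approach isn't dependable.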

The other option I was thinking about is converting the document to clean HTML and scraping the output, but I'm not aware of any components that do this conversion.
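Assuming some converter did produce clean, well-formed HTML, the scraping half could look roughly like the sketch below. It uses only `java.util.regex` from the standard library. The class name `HtmlTableScraper` is hypothetical, and the sketch assumes no nested tables and does not decode entities such as `&amp;`; a real HTML parser would be more robust on messy output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTableScraper {
    // Matches one table row and captures everything between <tr> and </tr>.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches one header or data cell and captures its inner markup.
    private static final Pattern CELL = Pattern.compile(
            "<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches any remaining tag inside a cell (links, bold, and so on).
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the table as rows of plain-text cells.
    static List<List<String>> scrape(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                // Drop inline tags, then collapse runs of whitespace.
                String text = TAG.matcher(cellMatcher.group(1)).replaceAll("");
                cells.add(text.replaceAll("\\s+", " ").trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th>Name</th><th>Link</th></tr>"
                + "<tr><td>POI</td><td><a href=\"http://example.org\">site</a></td></tr></table>";
        System.out.println(scrape(html));
    }
}
```

The point of going through HTML is that cell boundaries become explicit tags, so a hyperlink inside a cell is just another tag to strip rather than something that can be mistaken for the end of the cell.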

What are your thoughts on this?


