Programming Forums
User Name Password Register
 

RSS Feed
FORUM INDEX | TODAY'S POSTS | UNANSWERED THREADS | ADVANCED SEARCH

Reply
 
Thread Tools Display Modes
Old Aug 27th, 2007, 2:19 AM   #1
rwm
Professional Programmer
 
Join Date: Jan 2007
Location: Cape Town
Posts: 291
Rep Power: 2 rwm is on a distinguished road
Parsing questions.

Hey!

I'm trying to write a parser, and I was hoping anyone could suggest/advise?

What i'd like to do is write a generic parser class, without having to derive from this to parse certain files etc

i.e.

...
Parser *parser = new Parser;
parser->Parse("anyformat.txt");
...
parser->Parse("anotherformat.dat");
...

instead of

...
Parser *parser = new AnyFormatParser;
parser->Parse("anyformat.txt");
...
delete parser;
parser = new AnotherFormatParser;
parser->Parse("anotherformat.dat");
...

I don't know what you think about this? Maybe a little irrational? I think it would be nicer to have a "generic" parser class, and just define a "grammar/syntax" for parsing.

So my idea was to create an abstract class, ParseTree, from which I derive certain other parse trees, such as AnyFormatParseTree, AnotherFormatParseTree etc

So the generic parser would use it like:

...
Parser *parser = new Parser;
AnyFormatParseTree afpt;
parser->SetParseTree(afpt);
parser->Parse("anyformat.txt");
...
AnotherFormatParseTree anfpt;
parser->SetParseTree(anfpt);
parser->Parse("anotherformat.dat");
...

The idea is that the user doesnt have to do anything to the ParseTree subclass (unless deriving their own version) in that the ParseTree subclass already contains a "pre-built" parse tree, so all that is required is to pass an instace of ParseTree* to Parser:etParseTree(...).

What do you think?

Incidentally i'd like to ask a couple of questions about parse trees and parsing in general.

What is the best method to write an extensible parser framework that can be reused as much as possible?

With a parse tree, what we do is construct a tree of characters, so if i had the keywords in a certain file: akeyword, dosomething, blah

The parse tree would contain branches that took paths to the following keywords, or to NULL right?

Would it be a good idea to implement a certain "callback" via function pointer at completion of parsing a keyword?

Or not?

Hope someone can help!

Thanks!

rwm is offline   Reply With Quote
Old Aug 27th, 2007, 7:14 AM   #2
lectricpharaoh
Caffeinated Neural Net
 
lectricpharaoh's Avatar
 
Join Date: Jun 2005
Location: Dry west coast of Canada
Posts: 925
Rep Power: 4 lectricpharaoh will become famous soon enough
I don't think your expectations are reasonable. You can't really parse without knowing certain things about the data you're parsing. Basically, about all you're going to be able to do in a generic sense is break the data stream into tokens using some defined delimiter(s) (for example, whitespace characters). Remember that there will be gotchas. For example, a C++ parser would need to treat whitespace differently depending on the situation; after a #define directive, there must be a newline, but in normal code, the whitespace character(s) don't matter (of course, if the preprocessor is a separate unit with its own parser, this can explain things). Regular spaces in quoted strings must be preserved. Another gotcha would be in multi-token constructs; for example, long and long int are semantically the same, but the second is two tokens.

Even given all that, a tree is not necessarily the ideal structure to represent the parsed data. You might want a stack-based implementation for processing mathematical expressions. It really depends on the needs.

All in all, expecting to create a 'one solution fits all' approach for this sort of thing just isn't feasible. Even though it's possible to do without subclassing, it would mean the user of your class would need to pass in all kinds of data to tell it how to accomplish the parsing. This seems a throwback to the C way of doing things, rather than an object-oriented approach.
__________________
A man's knowledge is like an expanding sphere, the surface corresponding to the boundary between the known and the unknown. As the sphere grows, so does its surface; the more a man learns, the more he realizes how much he does not know. Hence, the most ignorant man thinks he knows it all. - L. Sprague de Camp
lectricpharaoh is offline   Reply With Quote
Old Aug 27th, 2007, 7:29 AM   #3
rwm
Professional Programmer
 
Join Date: Jan 2007
Location: Cape Town
Posts: 291
Rep Power: 2 rwm is on a distinguished road
Hey!

Thanks for the reply!

Quote:
Originally Posted by lectricpharaoh View Post
All in all, expecting to create a 'one solution fits all' approach for this sort of thing just isn't feasible. Even though it's possible to do without subclassing, it would mean the user of your class would need to pass in all kinds of data to tell it how to accomplish the parsing. This seems a throwback to the C way of doing things, rather than an object-oriented approach.
Mmm, well im trying to think of it at a different point of view... Look at it this way, we only have a single Parser instance that takes on any subclass of ParseTree (or any other structure for that matter, if one were so inclined?)...

It seems to me that the standard method of parsing is to use a parse tree... Maybe im getting the whole concept wrong?

I know, its probably a good idea to subclass an abstract Parser interface, that would seem the standard way to do things, but what about using a different class that represents a data structure as a parameter to the Parser interface?

Or not?

Maybe im missing the concept entirely? I wrote a plugin toolkit for Maya a while ago, I wrote the whole thing, just so I could learn more about programming, and when it came to parsing files, what I did was write the following:

Tokenizer - read a file into a buffer and converted the file to a list of tokens (with the options to provide custom delimitiers, the standard delimiters were ' ','\t','\r' and '\n').

Then I wrote a parser for each kind of file, derived from an abstract Parser class, which took on a Tokenizer as a data member.

Then what i did was simply read each token, evaluating them one at a time, if a token was a keyword then I would enter a token evaluator method, which stepped through the token list till it hit the end of the keyword block, while building a data structure to store the parsed data...

It worked well, it was fast... But I felt it wasn't a very *elegant* solution...

Maybe you could help me by outlining the standard approaches to parsing, unless i'm doing it the standard way?

Thanks! Looking forward to a reply!



EDIT: Maybe I wasn't clear enough, by using ParseTree* I could be thinking along the lines of using a custom callback function at the end of parsing a certain keyword???

Actually it sounds stupid!
rwm is offline   Reply With Quote
Old Aug 27th, 2007, 7:39 AM   #4
rwm
Professional Programmer
 
Join Date: Jan 2007
Location: Cape Town
Posts: 291
Rep Power: 2 rwm is on a distinguished road
for example, whats the best way to parse this?

# this is a comment
begin BLOCK
	id 0
	name "myname"
	file "/usr/local/projects/file.dat"
end BLOCK

What I was doing is:

The tokenizer works like this:

Tokenizer tokenizer('#'); //'#' passed as a delimiter, i.e. skip comments

and it stores a list of tokens. It is used in the Parser* class.

FileParser parser;
parser.Parse("myfile.txt"); //uses Tokenizer to convert file to tokens.
...

Then something like this in Parser*:arse:

void Parser::Parse(const char *f)
{
	...
	Token tok = tokenizer.Begin();
	while(tok != Tokenizer.End())
	{
		...
		switch(tok.TYPE)
		{
			case BLOCK:
				EvalBlock(); //evaluate block...
				break;
			...
		}
		...
		tok = tokenizer.Next();
	}
}

and to evaluate the block:

void Parser::EvalBlock()
{
	...
	while(tok != "end" && tokenizer.PeekNext() != "BLOCK")
	{
		...
		//read each token like this
		if(tok == "id" && tokenizer.NumTokensOnLine() == 2)
		{
			//got id
			tok = tokenizer.NextToken();
			int id = atoi(tok.TOK_STR);
			...
		}
		tok = tokenizer.Next();
	}
}

Something like that...
rwm is offline   Reply With Quote
Old Aug 28th, 2007, 7:36 AM   #5
rwm
Professional Programmer
 
Join Date: Jan 2007
Location: Cape Town
Posts: 291
Rep Power: 2 rwm is on a distinguished road
so me guessing thats standard way to parse?
rwm is offline   Reply With Quote
Reply

Bookmarks

« Previous Thread in Forum | Next Thread in Forum »

Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Template Questions King C++ 3 Feb 3rd, 2007 7:30 PM
hardware questions programmingnoob Coder's Corner Lounge 28 Aug 8th, 2006 8:04 PM
Novice questions. snafu Python 3 Oct 5th, 2005 5:03 PM
ucf cs student has some questions raspberryh Coder's Corner Lounge 33 Sep 12th, 2005 2:49 PM
Emergency: Confusing C Programming assignment questions silvia C 3 Jul 13th, 2005 3:39 AM




DaniWeb IT Discussion Community
All times are GMT -5. The time now is 7:21 AM.

Powered by vBulletin® Version 3.7.0, Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Copyright ©2007 DaniWeb® LLC