chomsky - determine Chomsky type of a grammar

Disclaimer

The code is horrible.
Not all my code is that bad. At least not the code I usually publish.

Copyright

As always:

  Copyright (c) 2002 Steffen Mueller. All rights reserved. This program is
  free software; you can redistribute it and/or modify it under the same
  terms as Perl itself. Please see the Perl Artistic License.

Download

The script is available at:
Old version: steffen-mueller.net/chomsky/chomsky.txt
New version: steffen-mueller.net/chomsky/chomsky2.txt
The new version behaves differently for CH-0/CH-1 grammars. See Description.
On Windows, you need to install Perl (try www.activestate.com) and then rename the file to "chomsky.pl" before running it as "perl chomsky.pl".
On *nix, you have Perl. You know how to run Perl scripts, don't you? Just modify the shebang appropriately.
Please read the description.

Example grammars:
grammar 1
grammar 2
grammar 3
grammar 4

Description

With the nasty stuff out of the way, I can define the expected behaviour of the program. If its behaviour does not comply to these specs, it's a bug and provided that it is not already listed under Limitations and bugs, you may feel free to complain to the author.

All grammars comply to type CH-0. (read: Chomsky type 0)
All grammars that comply to CH-n also comply to CH-m with m < n.

To comply with CH-1, all productions of the grammar must be of the form:

  uAv -> urv
where u and v are single terminals or epsilon (epsilon meaning "nothing"), A is a single non terminal, and r is a string of terminals and non terminals but not epsilon.
This is not the actual definition of CH-1. I screwed up and now I am too lazy to fix it. In a real CH-1, u and v may be a string of terminals and non terminals or epsilon. Maybe, in some distant future, there will be another version of the program that fixes this.
Note: This issue *should* be resolved in the "new version".

To comply with CH-2, all productions of the grammar must be of the form:

  A -> r
where A is a single terminal and r is a string of terminals and non-terminals or epsilon.

To comply with CH-3, all productions of the grammar must be of any of the following forms:

  A -> Bx      (left-linear)
  A -> xB      (right-linear)
  A -> x       (terminating)
  A -> epsilon (epsilon production)
where A and B are single non terminals and x a single terminal.

Grammars should look similar to the following example, but whitespace should be mostly insignificant. (Except that in between tokens, there must be some whitespace to delimit the tokens.)



This is some random description where no
line starts with "N\s*=\s*".
Nonterminals are the tokens inside the curly parens
that are assigned to "N".
Terminals are in "E" (resembling large sigma)

N =
    {
	A
	S
	D
    }

# This is a comment.
# Comments are removed.
But you needn't comment here unless the comment starts with
# E\s*=\s*

E =
    {
	f
	g
    }

Now for the productions.
Productions start with "P:" or any substring of "Production"
followed by a colon.

P: A ->
		f D          # This is the first production
	|                    # OR
		S            # This is the second production...
	|
		f
;

# A semicolon ends a production.

P: D -> f D | B ;           # Much shorter!

P: S ->
		g S
	|
		g
;


Limitations and bugs

Lots.
One is described above. (see CH-1)
Another one is that if you have "A -> B" and "B -> b" (capitals being non terminals, lower case letters being terminals), the program will identify "A -> B" as not CH-3 compliant and make the whole grammar not CH-3 compliant. In fact, B can be substituted by "b" and the whole grammar is CH-3 compliant. So the algorithm used to determine whether a production is CH-3 compliant should check for compliance of all subrules (non terminals) in it. If "A -> B C" and "B -> epsilon", "C -> c X", the algorithm should detect CH-3 compliance. As far as I can tell anyway.
This leads me to the most important gotcha:

The implementation cannot be any better than my understanding of the matter.

It may be significantly worse, though.

(c) 2002 Steffen Mueller, mail / at / steffen-mueller / dot / net. All rights reserved.