[Date Prev][Date Next][Thread Prev][Thread Next]
[Search]
[Date Index]
[Thread Index]
Re: [FWP] Comparing Regular Expressions
On 7 Sep 99, at 17:59, D@i-works.com wrote:
> Is there a way to "compare" regular expressions? For instance,
> q/.*/ is less than q/a.*/ because whatever matches the latter
> also matches the former. Is there an easy way to do such a
> comparison?
Well, first, even though this is the FWP mailing list, if you limit your
inquiry to *REAL* regular expressions, as opposed to the hodgepodge of
matching rules that Perl implements, then I think the answer is "yes".
> That's perhaps intractable, but at least well defined. Now
> a more interesting question: if a string is known to match
> two REs, then is there a sense in which one of the matches
> is "better", i.e. more specialized? How do you go about
> defining it? ...
OK, it has been a long time since I've done coursework in languages and
grammars [probably before many of you were born!], and my textbooks are
long-buried, but:
First step is to make precise what we're talking about: we say that two
REs are "equivalent" [or "equal" if you want] if they recognize the same
"language" [that is, the set of all strings matched by the two REs is the
same]. We can define a partial ordering on REs by asserting that if the
language recognized by one RE is a proper subset of that for some other
RE, we can say that the first RE is "less than" or "smaller" than the
second. It is in this sense that we can make precise your inquiry about
'specialized': given a set of strings, we can say that if two REs match
the particular set *AND* one RE's language is a proper subset of the
others, then we could say that the first RE was more "specialized" than
the second.
It is important to understand that we're talking about *partial*
orderings. Given two REs, we can determine if RE1 is <= RE2, or RE1 is
>= RE2 [and RE1 = RE2 if and only if it is <= aND >=], *OR* it might be
that the two REs aren't comparabla at all.
For example, the RE aa* is strictly less than the RE a* [because the
latter matches the null string and the former does not, but they
otherwise match the same set]. Similarly the RE aa* is strictly less
than both. But the RE b* is not comparable with any of these. it
matches *NO* strings in common with aa* and aaa* [and so their langauges
are disjoint] and has only one string in common with a* [the null
string].
Given that, and the rather limited [but precise] notion in which we can
*sometimes* assert that one RE is "more specialized" than another, we can
go on to the question: can we actually *determine* these relationships.
I believe the answer is "yes", but I've forgotten too much of the theory
to know for sure. Let me outline how such a thing would go:
First step is that there is a "canonical form" for REs. No, I *don't*
remember the details of the canonical form [and , indeed, there are
several different schemas for such a form], but the canonical form has
the properties: a) it has a particular [restricted] syntax, and
b) for every RE there is a *unique* canonical-form RE
that recognizes the same language {i.e. "matches the
same set of strings"]
And so we can, without loss of generality, only deal with the question of
comparing REs in canonical or "normal" form. As it turns out, from what
I remember of this, the answer to the last question is "yes" -- given two
REs *in*normal*form* you _can_ determine whether they are <, =, > or
incommensurate.
But, alas, really doing all of that is neither fun nor Perl (although it
might be an interesting programming exercise [although not one I'd
consider 'fun'] to do the reduction/comparison: write a routine to
convert an arbitrary RE to canonical form, and then write a routine to
compare two REs in canonical form.)
/Bernie\
--
Bernie Cosell Fantasy Farm Fibers
mailto:bernie@fantasyfarm.com Pearisburg, VA
--> Too many people, too few sheep <--
==== Want to unsubscribe from Fun With Perl? Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
==== unsubscribe