[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

Re: [FWP] Comparing Regular Expressions



On 7 Sep 99, at 17:59, D@i-works.com wrote:

> Is there a way to "compare" regular expressions? For instance,
> q/.*/ is less than q/a.*/ because whatever matches the latter
> also matches the former. Is there an easy way to do such a
> comparison?

Well, first, even though this is the FWP mailing list, if you limit your 
inquiry to *REAL* regular expressions, as opposed to the hodgepodge of 
matching rules that Perl implements, then I think the answer is "yes".

> That's perhaps intractable, but at least well defined. Now
> a more interesting question: if a string is known to match
> two REs, then is there a sense in which one of the matches
> is "better", i.e. more specialized? How do you go about
> defining it?   ...

OK, it has been a long time since I've done coursework in languages and 
grammars [probably before many of you were born!], and my textbooks are 
long-buried, but:

First step is to make precise what we're talking about: we say that two 
REs are "equivalent" [or "equal" if you want] if they recognize the same 
"language" [that is, the set of all strings matched by the two REs is the 
same].  We can define a partial ordering on REs by asserting that if the 
language recognized by one RE is a proper subset of that for some other 
RE, we can say that the first RE is "less than" or "smaller" than the 
second.  It is in this sense that we can make precise your inquiry about 
'specialized': given a set of strings, we can say that if two REs match 
the particular set *AND* one RE's language is a proper subset of the 
others, then we could say that the first RE was more "specialized" than 
the second.

It is important to understand that we're talking about *partial* 
orderings.  Given two REs, we can determine if RE1 is <= RE2, or RE1 is 
>= RE2 [and RE1 = RE2 if and only if it is <= aND >=], *OR* it might be 
that the two REs aren't comparabla at all.

For example, the RE aa* is strictly less than the RE a* [because the 
latter matches the null string and the former does not, but they 
otherwise match the same set].  Similarly the RE aa* is strictly less 
than both.  But the RE b* is not comparable with any of these.  it 
matches *NO* strings in common with aa* and aaa* [and so their langauges 
are disjoint] and has only one string in common with a* [the null 
string].

Given that, and the rather limited [but precise] notion in which we can 
*sometimes* assert that one RE is "more specialized" than another, we can 
go on to the question: can we actually *determine* these relationships.  
I believe the answer is "yes", but I've forgotten too much of the theory 
to know for sure.  Let me outline how such a thing would go:

First step is that there is a "canonical form" for REs.  No, I *don't* 
remember the details of the canonical form [and , indeed, there are 
several different schemas for such a form], but the canonical form has 
the properties:  a) it has a particular [restricted] syntax, and
                 b) for every RE there is a *unique* canonical-form RE
                    that recognizes the same language {i.e. "matches the
                    same set of strings"]

And so we can, without loss of generality, only deal with the question of 
comparing REs in canonical or "normal" form.  As it turns out, from what 
I remember of this, the answer to the last question is "yes" -- given two 
REs *in*normal*form* you _can_ determine whether they are <, =, > or 
incommensurate.

But, alas, really doing all of that is neither fun nor Perl (although it 
might be an interesting programming exercise [although not one I'd 
consider 'fun'] to do the reduction/comparison: write a routine to 
convert an arbitrary RE to canonical form, and then write a routine to 
compare two REs in canonical form.)

  /Bernie\
-- 
Bernie Cosell                     Fantasy Farm Fibers
mailto:bernie@fantasyfarm.com     Pearisburg, VA
    -->  Too many people, too few sheep  <--          


==== Want to unsubscribe from Fun With Perl?  Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe