On 7 Sep 99, at 17:59, D@i-works.com wrote: > Is there a way to "compare" regular expressions? For instance, > q/.*/ is less than q/a.*/ because whatever matches the latter > also matches the former. Is there an easy way to do such a > comparison? Well, first, even though this is the FWP mailing list, if you limit your inquiry to *REAL* regular expressions, as opposed to the hodgepodge of matching rules that Perl implements, then I think the answer is "yes". > That's perhaps intractable, but at least well defined. Now > a more interesting question: if a string is known to match > two REs, then is there a sense in which one of the matches > is "better", i.e. more specialized? How do you go about > defining it? ... OK, it has been a long time since I've done coursework in languages and grammars [probably before many of you were born!], and my textbooks are long-buried, but: First step is to make precise what we're talking about: we say that two REs are "equivalent" [or "equal" if you want] if they recognize the same "language" [that is, the set of all strings matched by the two REs is the same]. We can define a partial ordering on REs by asserting that if the language recognized by one RE is a proper subset of that for some other RE, we can say that the first RE is "less than" or "smaller" than the second. It is in this sense that we can make precise your inquiry about 'specialized': given a set of strings, we can say that if two REs match the particular set *AND* one RE's language is a proper subset of the others, then we could say that the first RE was more "specialized" than the second. It is important to understand that we're talking about *partial* orderings. Given two REs, we can determine if RE1 is <= RE2, or RE1 is >= RE2 [and RE1 = RE2 if and only if it is <= aND >=], *OR* it might be that the two REs aren't comparabla at all. For example, the RE aa* is strictly less than the RE a* [because the latter matches the null string and the former does not, but they otherwise match the same set]. Similarly the RE aa* is strictly less than both. But the RE b* is not comparable with any of these. it matches *NO* strings in common with aa* and aaa* [and so their langauges are disjoint] and has only one string in common with a* [the null string]. Given that, and the rather limited [but precise] notion in which we can *sometimes* assert that one RE is "more specialized" than another, we can go on to the question: can we actually *determine* these relationships. I believe the answer is "yes", but I've forgotten too much of the theory to know for sure. Let me outline how such a thing would go: First step is that there is a "canonical form" for REs. No, I *don't* remember the details of the canonical form [and , indeed, there are several different schemas for such a form], but the canonical form has the properties: a) it has a particular [restricted] syntax, and b) for every RE there is a *unique* canonical-form RE that recognizes the same language {i.e. "matches the same set of strings"] And so we can, without loss of generality, only deal with the question of comparing REs in canonical or "normal" form. As it turns out, from what I remember of this, the answer to the last question is "yes" -- given two REs *in*normal*form* you _can_ determine whether they are <, =, > or incommensurate. But, alas, really doing all of that is neither fun nor Perl (although it might be an interesting programming exercise [although not one I'd consider 'fun'] to do the reduction/comparison: write a routine to convert an arbitrary RE to canonical form, and then write a routine to compare two REs in canonical form.) /Bernie\ -- Bernie Cosell Fantasy Farm Fibers mailto:bernie@fantasyfarm.com Pearisburg, VA --> Too many people, too few sheep <-- ==== Want to unsubscribe from Fun With Perl? Well, if you insist... ==== Send email to <fwp-request@technofile.org> with message _body_ ==== unsubscribe