[Date Prev][Date Next][Thread Prev][Thread Next] [Search] [Date Index] [Thread Index]

Re: [FWP] longest substring with unique characters



I created an algorithm which I believe is fast, and kind of mimicks the
Boyer-Moore (or some other efficient string matching algorithm) mechanism
of jumping forward.

The program is beneath my signature file.

Here is an explanation of its operation:

Assume a string like "the aftermath".  The loop will start looking at the
first character, 't'.  It then scans the rest of the string for the next
occurrence of 't' -- if there is one, that is the upper bound for the
unique-character substring (hereafter referred to as UCS).  This upper
bound is at character 6.  The loop also caches this occurrence in the
@ahead array ($ahead[0] = 6).

Next, we move to the 'h', at offset 1.  We set $ahead[1] to 12, but we do
not adjust the bound of the substring, since 12 > 6.  Next the 'e' (and
$ahead[2] = 7), and the space ($ahead[3] is -1).

Once we've made our way to the "stop character", 't', we have our
substring, "the af".  This is pushed to the list of potential matches.

The "stop character" is the character that would have created the
duplication.  We then move to the character AFTER the occurrence of the
stop character in the substring (so we start at the 'h' in 'the').

We can make this jump because any string containing the stop character
will have to stop at this second occurrence of it.  If, for example, we
had "the new autumn", then the stop character for the first UCS is 'e'.
This means that we now skip to the ' ' after the word 'the' for our next
possible UCS.

The other short circuit is done when we have too little content left in
our string, we can quit searching.

The final @ahead array for the string "the aftermath" is:

  char:  t   h   e   _   a   f   t   e   r   m   a   t   h  
   pos:  0   1   2   3   4   5   6   7   8   9  10  11  12
 ahead:  6  12   7  -1  10  -1  11  -1  -1  -1  -1  -1  -1

Keep in mind that the full @ahead array needn't be made (and in this case,
it isn't -- the 't' and 'h' at the end of the string are never reached).

I hope this can be optimized more, though.

You can run the program through this email, via

  japhy% perl -x name_of_email

-- 
Jeff "japhy" Pinyan     japhy@pobox.com    http://www.pobox.com/~japhy/
CPAN - #1 Perl Resource  (my id:  PINYAN)       http://search.cpan.org/
PerlMonks - An Online Perl Community          http://www.perlmonks.com/
The Perl Archive - Articles, Forums, etc.   http://www.perlarchive.com/


#!/usr/bin/perl -w

use strict;

while (<>) {
  chomp;
  my @longest = pinyan_UCS($_);
  my $str = $_;

  print map substr($str, $_->[0], $_->[1] - $_->[0]) . "\n", @longest;
  print "\n";
}


sub pinyan_UCS {
  my $str = shift;
  my $len = length $str;
  my @c = split //, $str;
  my ($diff,$biggest) = (0,0);
  my (@ahead,@matches);

  for (my $i = 0; $i < $len; ) {
    my $c = $c[$i];
    my $match = [ $i, $len, $i+1 ];

    if ($len - $i >= $biggest) {
      for (my $k = $i; $k < $match->[1]; $k++) {
        $ahead[$k] ||=  index($str, $c[$k], $k+1);
        if ($ahead[$k] != -1 and $match->[1] > $ahead[$k]) {
          $match->[1] = $ahead[$k];
          $match->[2] = $k+1;
        }
      }

      $diff = $match->[1] - $match->[0];

      if ($diff > $biggest) { ($biggest,@matches) = ($diff,$match) }
      elsif ($diff == $biggest) { push @matches, $match; }
    }
    else { last }

    $i = $match->[2];
  }

  return @matches;
}


==== Want to unsubscribe from Fun With Perl?  Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe