
Re: [FWP] Weighted sampling



On Tue, Mar 20, 2001 at 12:15:05AM +0100, abigail@foad.org wrote:
> On Mon, Mar 19, 2001 at 04:44:51PM -0500, yanick1@sympatico.ca wrote:
> >      sub sample {
> >          my ($count, $sample) = ( $weights[0], 0 );
> >          for (my $i  = 1; $i < @weights; $i ++) {
> >              $count += $weights [$i];
> >              $sample = $i if rand $count < $weights [$i];
> >          }
> >          $sample;
> >      }
> > 
> > would save one useless call of rand.
> 
> Yeah, but if all weights are 0, you return the first element of the set,
> and I return undef. 

	True.

> Plus, why special case 0, if there's no need to?

	Because a random number is a terrible thing to waste? :)

> > $sample_sum  += $sample{ $_ } foreach keys %sample;
> 
> I guess you want values here, not keys.

	*blush*

> > @sample_index = sort { $sample{$b} <=> $sample{$a} } keys %sample;
> > 
> > sub draw
> > {
> > 	my $count = rand $sample_sum;
> > 	for( @sample_index )
> > 	{
> > 			$count -= $sample{ $_ };
> > 			return $_ if $count <= 0;
> > 	}
> > }
> > 
> > be much faster in average?
> 
> Not by a long shot! You are guaranteed to take Omega (n log n) time,
> due to your sort, while a single loop takes O (n). 

	Aye. But the sort only has to be performed once (assuming, of course,
that the population does not change from one call of sample() to the
next). If you draw enough samples, that one-time overhead is repaid by
the smaller constant k the sorted order gives the O(n) loop.
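For what it's worth, here is another way to amortize the setup (a
sketch of mine, not something either of us posted): precompute a
cumulative-weight table once, then binary-search it on every draw.
That's O(n) setup with no sort at all, and O(log n) per draw no matter
how the weights are distributed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @weights = ( (1) x 999, 99999 );

# one-time setup: @cumulative[$i] holds the sum of $weights[0..$i]
my @cumulative;
my $sum = 0;
push @cumulative, $sum += $_ for @weights;

sub draw_binary {
    my $r = rand $cumulative[-1];
    # find the smallest index whose cumulative weight exceeds $r
    my ( $lo, $hi ) = ( 0, $#cumulative );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $cumulative[$mid] <= $r ) { $lo = $mid + 1 }
        else                           { $hi = $mid }
    }
    return $lo;    # index into @weights
}
```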

> Not that the
> sort is necessary in any way.

	Not necessary, but helpful if the population has a few observations
with very large weights, and many, many with small ones. The sort merely
ensures we try our luck first with the elements that have the highest
probability of being chosen, and thus exit the loop sooner (your loop
goes through all the population elements each time; mine does so only
in the worst case).
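For the skewed weights used in the benchmark below, the effect is easy
to quantify: the expected number of loop iterations per draw is just
the probability-weighted position of the returned element. A quick
back-of-the-envelope sketch (separate from the benchmark code, but
using the same weights):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @unsorted = ( (1) x 999, 99999 );              # big weight last
my @sorted   = sort { $b <=> $a } @unsorted;      # big weight first

sub expected_iterations {
    my @w = @_;
    my $total = 0;
    $total += $_ for @w;
    # element $i is returned with probability $w[$i]/$total,
    # after the loop has scanned $i + 1 elements
    my $e = 0;
    $e += ( $_ + 1 ) * $w[$_] / $total for 0 .. $#w;
    return $e;
}

printf "sorted:   %.1f iterations/draw\n", expected_iterations(@sorted);
printf "unsorted: %.1f iterations/draw\n", expected_iterations(@unsorted);
```

With the big weight first a draw inspects roughly 6 elements on
average; with it last, roughly 995 -- which accounts for most of the
sorted/unsorted gap in the timings below.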

> Then there's the O (n) additional memory (keys %sample).

	Guilty as charged on that one. 

This being said, if I run some tests, I get:

Benchmark: running abi, yanick_sorted, yanick_unsorted,
           each for at least 10 CPU seconds...

        abi:              28.67/s (n=318)
        yanick_sorted:  5387.94/s (n=58513)
        yanick_unsorted:  70.09/s (n=710)

(code included below)

Granted, I have taken an ugly sample, and excluded the creation
of %sample from the test. The n calls to rand for each sample
seem to be quite costly, time-wise (not to mention that a single
random number per draw is handier when the time comes to play with
antithetic values and correlation analysis).


Joy,
Yanick

-- 
use Benchmark;

@weights = ( (1) x 999, 99999 );

%sample = map { $_ => $weights[$_]  } 0..$#weights;
$sample_sum  += $_ foreach values %sample;
@sample_index = sort { $sample{$b} <=> $sample{$a} } keys %sample;

$nbr_sample = 5000;

timethese( -10, {
    'abi'             => \&sample,
    'yanick_sorted'   => \&draw,
    'yanick_unsorted' => \&draw_unsorted,
});

sub sample {
    my ( $count, $sample );
    for ( my $i = 0; $i < @weights; $i++ ) {
        $count += $weights[$i];
        $sample = $i if rand( $count ) < $weights[$i];
    }
    $sample;
}
 
sub draw
{
 	my $count = rand $sample_sum;
	( $count -= $sample{ $_ } ) <= 0 and return $_ for @sample_index;
}

sub draw_unsorted
{
	my $count = rand $sample_sum;
	( $count -= $weights[ $_ ] ) <= 0 and return $_ for 0..$#weights;
}
 
 

==== Want to unsubscribe from Fun With Perl?  Well, if you insist...
==== Send email to <fwp-request@technofile.org> with message _body_
====   unsubscribe