bloom-filters: compact, probabilistic membership testing

Alec Berryman 2009-05-07 22:45:02 -04:00
parent 5cba4f694c
commit 99f7babcc0
4 changed files with 269 additions and 0 deletions


@@ -0,0 +1 @@
Alec Berryman


@@ -0,0 +1,36 @@
USING: help.markup help.syntax kernel math ;
IN: bloom-filters
HELP: <bloom-filter>
{ $values { "error-rate" "The desired false positive rate. A " { $link float } " between 0 and 1." }
{ "number-objects" "The expected number of objects in the set. An " { $link integer } "." }
{ "bloom-filter" bloom-filter } }
{ $description "Creates an empty Bloom filter." } ;
HELP: bloom-filter-insert
{ $values { "object" object }
{ "bloom-filter" bloom-filter } }
{ $description "Records the item as a member of the filter." }
{ $side-effects "bloom-filter" } ;
HELP: bloom-filter-member?
{ $values { "object" object }
{ "bloom-filter" bloom-filter }
{ "?" boolean } }
{ $description "Returns " { $link t } " if the object may be a member of the Bloom filter, " { $link f } " otherwise. The false positive rate is configurable; there are no false negatives." } ;
HELP: bloom-filter
{ $class-description "This is the class for Bloom filters. These provide constant-time insertion and probabilistic membership-testing operations, but do not actually store any elements." } ;
ARTICLE: "bloom-filters" "Bloom filters"
"This is a library for Bloom filters, sets that provide a constant-time insertion operation and probabilistic membership tests, but do not actually store any elements."
$nl
"The accuracy of the membership test is configurable; a Bloom filter will never incorrectly report that an item is not a member of the set, but may incorrectly report that an item is a member of the set."
$nl
"Bloom filters cannot be resized and do not support removal."
$nl
{ $subsection <bloom-filter> }
{ $subsection bloom-filter-insert }
{ $subsection bloom-filter-member? } ;
ABOUT: "bloom-filters"
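For readers unfamiliar with the data structure, the behavior documented above can be sketched in Python. This is illustrative only, not this vocabulary's API; the index derivation below uses plain double hashing with an arbitrary mixing constant rather than the enhanced variant the implementation uses.

```python
import math

# Minimal Bloom-filter sketch; illustrative only, not this vocabulary's API.
class BloomSketch:
    def __init__(self, error_rate, n_objects, n_hashes=7):
        # Same sizing formula as bits-to-satisfy-error-rate.
        m = math.ceil(-(n_objects * n_hashes)
                      / math.log(1 - error_rate ** (1 / n_hashes)))
        self.n_hashes = n_hashes
        self.bits = [False] * m

    def _indices(self, obj):
        # Two base hashes, mangled into n_hashes probe positions.
        h0 = hash(obj) & 0x7FFFFFFF
        h1 = h0 ^ 0x5BD1E995  # arbitrary mixing constant
        return [(h0 + i * h1) % len(self.bits) for i in range(self.n_hashes)]

    def insert(self, obj):
        for i in self._indices(obj):
            self.bits[i] = True

    def member(self, obj):
        # False positives possible; false negatives impossible.
        return all(self.bits[i] for i in self._indices(obj))

bf = BloomSketch(0.01, 5000)
bf.insert("factor")
assert bf.member("factor")  # anything inserted is always found
```

No elements are stored: membership is decided entirely by the bit array, which is why removal and resizing are unsupported.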


@@ -0,0 +1,71 @@
USING: accessors bit-arrays bloom-filters bloom-filters.private kernel layouts
math random sequences tools.test ;
IN: bloom-filters.tests
! The sizing information was generated using the subroutine
! calculate_shortest_filter_length from
! http://www.perl.com/pub/a/2004/04/08/bloom_filters.html.
! Test bloom-filter creation
[ 47965 ] [ 7 0.01 5000 bits-to-satisfy-error-rate ] unit-test
[ 7 47965 ] [ 0.01 5000 size-bloom-filter ] unit-test
[ 7 ] [ 0.01 5000 <bloom-filter> n-hashes>> ] unit-test
[ 47965 ] [ 0.01 5000 <bloom-filter> bits>> length ] unit-test
[ 5000 ] [ 0.01 5000 <bloom-filter> maximum-n-objects>> ] unit-test
[ 0 ] [ 0.01 5000 <bloom-filter> current-n-objects>> ] unit-test
! Should return the fewest hashes to satisfy the bits requested, not the most.
[ 32 ] [ 4 0.05 5 bits-to-satisfy-error-rate ] unit-test
[ 32 ] [ 5 0.05 5 bits-to-satisfy-error-rate ] unit-test
[ 4 32 ] [ 0.05 5 size-bloom-filter ] unit-test
! This is a lot of bits. On linux-x86-32, max-array-capacity is 134217727,
! which is about 16MB (assuming I can do math), which is sort of pithy. I'm
! not sure how to handle this case. Returning a smaller-than-requested
! array is not the least surprising behavior, but it is still surprising.
[ 383718189 ] [ 7 0.01 40000000 bits-to-satisfy-error-rate ] unit-test
! [ 7 383718189 ] [ 0.01 40000000 size-bloom-filter ] unit-test
! [ 383718189 ] [ 0.01 40000000 <bloom-filter> bits>> length ] unit-test
! Should not generate bignum hash codes. Enhanced double hashing may generate a
! lot of hash codes, and it's better to do this earlier than later.
[ t ] [ 10000 iota [ hashcodes-from-object [ fixnum? ] both? ] map [ t = ] all? ] unit-test
[ ?{ t f t f t f } ] [ { 0 2 4 } 6 <bit-array> [ set-indices ] keep ] unit-test
: empty-bloom-filter ( -- bloom-filter )
0.01 2000 <bloom-filter> ;
[ 1 ] [ empty-bloom-filter [ increment-n-objects ] keep current-n-objects>> ] unit-test
: basic-insert-test-setup ( -- bloom-filter )
1 empty-bloom-filter [ bloom-filter-insert ] keep ;
! Basic tests that insert does something
[ t ] [ basic-insert-test-setup bits>> [ t = ] any? ] unit-test
[ 1 ] [ basic-insert-test-setup current-n-objects>> ] unit-test
: non-empty-bloom-filter ( -- bloom-filter )
1000 iota
empty-bloom-filter
[ [ bloom-filter-insert ] curry each ] keep ;
: full-bloom-filter ( -- bloom-filter )
2000 iota
empty-bloom-filter
[ [ bloom-filter-insert ] curry each ] keep ;
! Should find what we put in there.
[ t ] [ 2000 iota
full-bloom-filter
[ bloom-filter-member? ] curry map
[ t = ] all? ] unit-test
! We shouldn't have more than 0.01 false-positive rate.
[ t ] [ 1000 iota [ drop most-positive-fixnum random 1000 + ] map
full-bloom-filter
[ bloom-filter-member? ] curry map
[ t = ] filter
! TODO: This should be 10, but the false positive rate is currently very
! high. It shouldn't be much more than this.
length 150 <= ] unit-test


@@ -0,0 +1,161 @@
! Copyright (C) 2009 Alec Berryman.
! See http://factorcode.org/license.txt for BSD license.
USING: accessors arrays assocs bit-arrays kernel layouts locals math
math.functions math.ranges multiline sequences ;
IN: bloom-filters
/*
TODO:
- How to signal an error when there are too many bits? It looks like a built-in for some
types of arrays, but bit-array just returns a zero-length array. What we do
now is completely broken: -1 hash codes? Really?
- The false positive rate is 10x what it should be, based on informal testing.
Better object hashes or a better method of generating extra hash codes would
help. Another way is to increase the number of bits used.
- Try something smarter than the bitwise complement for a second hash code.
- http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
makes a case for http://murmurhash.googlepages.com/ instead of enhanced
double-hashing.
- Be sure to adjust the test that asserts the number of false positives isn't
unreasonable.
- Should round bits up to next power of two, use wrap instead of mod.
- Should allow user to specify the hash codes, either as inputs to enhanced
double hashing or for direct use.
- Support for serialization.
- Wrappers for combining filters.
- Should we signal an error when inserting past the number of objects the filter
is sized for? The filter will continue to work, just not very well.
- The other TODOs sprinkled through the code.
*/
TUPLE: bloom-filter
{ n-hashes fixnum read-only }
{ bits bit-array read-only }
{ maximum-n-objects fixnum read-only }
{ current-n-objects fixnum } ;
<PRIVATE
! number-bits = -(n-objects * n-hashes) / ln(1 - error-rate ^ (1/n-hashes))
:: bits-to-satisfy-error-rate ( n-hashes error-rate n-objects -- size )
n-objects n-hashes * -1 *
1 error-rate 1 n-hashes / ^ - log
/
ceiling >integer ; ! should check that it's below max-array-capacity
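The formula in the comment is easy to sanity-check outside Factor. This Python sketch (the helper name `bits_for` is illustrative) reproduces the sizes asserted in the test file:

```python
import math

def bits_for(n_hashes, error_rate, n_objects):
    # number-bits = -(n-objects * n-hashes) / ln(1 - error-rate ^ (1/n-hashes))
    return math.ceil(-(n_objects * n_hashes)
                     / math.log(1 - error_rate ** (1 / n_hashes)))

print(bits_for(7, 0.01, 5000))  # 47965, as asserted in the tests
print(bits_for(4, 0.05, 5))     # 32
```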
! TODO: this should be a constant
!
! TODO: after very little experimentation, I never see this increase after about
! 20 or so. Maybe it should be smaller.
: n-hashes-range ( -- range )
100 [1,b] ;
! Ends up with a sequence of { n-bits n-hashes } pairs.
: find-bloom-filter-sizes ( error-rate number-objects -- seq )
[ bits-to-satisfy-error-rate ] 2curry
n-hashes-range swap
map
n-hashes-range zip ;
:: smallest-first ( seq1 seq2 -- seq )
seq1 first seq2 first <= [ seq1 ] [ seq2 ] if ;
! The consensus on the tradeoff between increasing the number of bits and
! increasing the number of hash functions seems to be "go for the smallest
! number of bits", probably because most implementations just generate one hash
! value and cheaply mangle it into the number of hashes they need. I have not
! seen any usage studies from the implementations that made this tradeoff to
! support it, and I haven't done my own, but we'll go with it anyway.
!
! TODO: check that error-rate is reasonable.
: size-bloom-filter ( error-rate number-objects -- number-hashes number-bits )
find-bloom-filter-sizes
max-array-capacity -1 2array
[ smallest-first ]
reduce
[ second ] [ first ] bi ;
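The search above amounts to minimizing the bit count over candidate hash counts, breaking ties toward fewer hashes. A Python equivalent (a sketch with illustrative names, not the library's API):

```python
import math

def size_bloom_filter(error_rate, n_objects, max_hashes=100):
    # Bits needed for a given hash count k (same formula as the comment above).
    def bits(k):
        return math.ceil(-(n_objects * k)
                         / math.log(1 - error_rate ** (1 / k)))
    # Prefer the smallest bit count; break ties toward fewer hashes.
    best_k = min(range(1, max_hashes + 1), key=lambda k: (bits(k), k))
    return best_k, bits(best_k)

print(size_bloom_filter(0.01, 5000))  # (7, 47965)
print(size_bloom_filter(0.05, 5))     # (4, 32)
```

The (0.05, 5) case exercises the tie-break: 4 and 5 hashes both need 32 bits, and the smaller hash count wins.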
PRIVATE>
: <bloom-filter> ( error-rate number-objects -- bloom-filter )
[ size-bloom-filter <bit-array> ] keep
0 ! initially empty
bloom-filter boa ;
<PRIVATE
! See "Bloom Filters in Probabilistic Verification" by Peter C. Dillinger and
! Panagiotis Manolios, section 5.2, "Enhanced Double Hashing":
! http://www.cc.gatech.edu/~manolios/research/bloom-filters-verification.html
!
! This is taken from the definition at the top of page 12:
!
! F(i) = (A(s) + (i * B(s)) + ((i^3 - i) / 6)) mod m
!
! Where i is the hash number, A and B are hash functions for object s, and m is
! the length of the array.
:: enhanced-double-hash ( index hash0 hash1 array-size -- hash )
hash0
index hash1 *
+
index 3 ^ index -
6 /
+
array-size mod ;
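The enhanced double hashing formula is easy to check numerically. In this Python sketch the base hashes 17 and 42 and the table size 100 are arbitrary illustrative values:

```python
def enhanced_double_hash(i, hash0, hash1, array_size):
    # F(i) = (A(s) + i*B(s) + (i^3 - i)/6) mod m; i^3 - i is a product of
    # three consecutive integers, so it is always divisible by 6 and the
    # integer division is exact.
    return (hash0 + i * hash1 + (i ** 3 - i) // 6) % array_size

# Probe sequence for arbitrary base hashes 17 and 42 in a 100-bit table:
print([enhanced_double_hash(i, 17, 42, 100) for i in range(5)])
# [17, 59, 2, 47, 95]
```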
: enhanced-double-hashes ( n hash0 hash1 array-size -- seq )
[ enhanced-double-hash ] 3curry
[ [0,b) ] dip
map ;
! Stupid, should pick something good.
: hashcodes-from-hashcode ( n -- n n )
dup
! we could be running this through a lot of double hashing, make sure it's a
! fixnum here
most-positive-fixnum >fixnum bitxor ;
! TODO: This code calls abs because all the double-hashing stuff outputs array
! indices, and negative indices are no good. Are we throwing away bits? -1000
! b. actually prints -1111101000, which confuses me.
: hashcodes-from-object ( obj -- n n )
hashcode abs hashcodes-from-hashcode ;
: set-indices ( indices bit-array -- )
[ [ drop t ] change-nth ] curry each ;
: increment-n-objects ( bloom-filter -- )
dup current-n-objects>> 1 + >>current-n-objects drop ;
! This would be better as an each-relevant-hash that didn't cons.
: relevant-indices ( value bloom-filter -- indices )
[ n-hashes>> ] [ bits>> length ] bi ! value n array-size
swapd [ hashcodes-from-object ] dip ! n value1 value2 array-size
enhanced-double-hashes ;
PRIVATE>
: bloom-filter-insert ( object bloom-filter -- )
[ relevant-indices ]
[ bits>> set-indices ]
[ increment-n-objects ]
tri ;
: bloom-filter-member? ( value bloom-filter -- ? )
[ relevant-indices ]
[ bits>> [ nth ] curry map [ t = ] all? ]
bi ;