diff --git a/extra/bloom-filters/authors.txt b/extra/bloom-filters/authors.txt
new file mode 100644
index 0000000000..528e5dfe6b
--- /dev/null
+++ b/extra/bloom-filters/authors.txt
@@ -0,0 +1 @@
+Alec Berryman
diff --git a/extra/bloom-filters/bloom-filters-docs.factor b/extra/bloom-filters/bloom-filters-docs.factor
new file mode 100644
index 0000000000..4af1a82af6
--- /dev/null
+++ b/extra/bloom-filters/bloom-filters-docs.factor
@@ -0,0 +1,36 @@
+USING: help.markup help.syntax kernel math ;
+IN: bloom-filters
+
+HELP: <bloom-filter>
+{ $values { "error-rate" "The desired false positive rate. A " { $link float } " between 0 and 1." }
+          { "number-objects" "The expected number of objects in the set. An " { $link integer } "." }
+          { "bloom-filter" bloom-filter } }
+{ $description "Creates an empty Bloom filter." } ;
+
+HELP: bloom-filter-insert
+{ $values { "object" object }
+          { "bloom-filter" bloom-filter } }
+{ $description "Records the item as a member of the filter." }
+{ $side-effects "bloom-filter" } ;
+
+HELP: bloom-filter-member?
+{ $values { "object" object }
+          { "bloom-filter" bloom-filter }
+          { "?" boolean } }
+{ $description "Returns " { $link t } " if the object may be a member of the Bloom filter, " { $link f } " otherwise. The false positive rate is configurable; there are no false negatives." } ;
+
+HELP: bloom-filter
+{ $class-description "This is the class for Bloom filters. These provide constant-time insertion and probabilistic membership-testing operations, but do not actually store any elements." } ;
+
+ARTICLE: "bloom-filters" "Bloom filters"
+"This is a library for Bloom filters, sets that provide a constant-time insertion operation and probabilistic membership tests, but do not actually store any elements."
+$nl
+"The accuracy of the membership test is configurable; a Bloom filter will never incorrectly report an item is not a member of the set, but may incorrectly report that an item is a member of the set."
+$nl
+"Bloom filters cannot be resized and do not support removal."
+$nl
+{ $subsection <bloom-filter> }
+{ $subsection bloom-filter-insert }
+{ $subsection bloom-filter-member? } ;
+
+ABOUT: "bloom-filters"
diff --git a/extra/bloom-filters/bloom-filters-tests.factor b/extra/bloom-filters/bloom-filters-tests.factor
new file mode 100644
index 0000000000..b7a5d7ebc2
--- /dev/null
+++ b/extra/bloom-filters/bloom-filters-tests.factor
@@ -0,0 +1,71 @@
+USING: accessors bit-arrays bloom-filters bloom-filters.private kernel layouts
+math random sequences tools.test ;
+IN: bloom-filters.tests
+
+! The sizing information was generated using the subroutine
+! calculate_shortest_filter_length from
+! http://www.perl.com/pub/a/2004/04/08/bloom_filters.html.
+
+! Test bloom-filter creation
+[ 47965 ] [ 7 0.01 5000 bits-to-satisfy-error-rate ] unit-test
+[ 7 47965 ] [ 0.01 5000 size-bloom-filter ] unit-test
+[ 7 ] [ 0.01 5000 <bloom-filter> n-hashes>> ] unit-test
+[ 47965 ] [ 0.01 5000 <bloom-filter> bits>> length ] unit-test
+[ 5000 ] [ 0.01 5000 <bloom-filter> maximum-n-objects>> ] unit-test
+[ 0 ] [ 0.01 5000 <bloom-filter> current-n-objects>> ] unit-test
+
+! Should return the fewest hashes to satisfy the bits requested, not the most.
+[ 32 ] [ 4 0.05 5 bits-to-satisfy-error-rate ] unit-test
+[ 32 ] [ 5 0.05 5 bits-to-satisfy-error-rate ] unit-test
+[ 4 32 ] [ 0.05 5 size-bloom-filter ] unit-test
+
+! This is a lot of bits. On linux-x86-32, max-array-capacity is 134217727,
+! which is about 16MB (assuming I can do math), which is sort of paltry. I'm
+! not sure how to handle this case. Returning a smaller-than-requested
+! array is not the least surprising behavior, but is still surprising.
+[ 383718189 ] [ 7 0.01 40000000 bits-to-satisfy-error-rate ] unit-test
+! [ 7 383718189 ] [ 0.01 40000000 size-bloom-filter ] unit-test
+! [ 383718189 ] [ 0.01 40000000 <bloom-filter> bits>> length ] unit-test
+
+! Should not generate bignum hash codes. Enhanced double hashing may generate a
+! lot of hash codes, and it's better to do this earlier than later.
+[ t ] [ 10000 iota [ hashcodes-from-object [ fixnum? ] both? ] map [ t = ] all? ] unit-test
+
+[ ?{ t f t f t f } ] [ { 0 2 4 } 6 <bit-array> [ set-indices ] keep ] unit-test
+
+: empty-bloom-filter ( -- bloom-filter )
+    0.01 2000 <bloom-filter> ;
+
+[ 1 ] [ empty-bloom-filter [ increment-n-objects ] keep current-n-objects>> ] unit-test
+
+: basic-insert-test-setup ( -- bloom-filter )
+    1 empty-bloom-filter [ bloom-filter-insert ] keep ;
+
+! Basic tests that insert does something
+[ t ] [ basic-insert-test-setup bits>> [ t = ] any? ] unit-test
+[ 1 ] [ basic-insert-test-setup current-n-objects>> ] unit-test
+
+: non-empty-bloom-filter ( -- bloom-filter )
+    1000 iota
+    empty-bloom-filter
+    [ [ bloom-filter-insert ] curry each ] keep ;
+
+: full-bloom-filter ( -- bloom-filter )
+    2000 iota
+    empty-bloom-filter
+    [ [ bloom-filter-insert ] curry each ] keep ;
+
+! Should find what we put in there.
+[ t ] [ 2000 iota
+        full-bloom-filter
+        [ bloom-filter-member? ] curry map
+        [ t = ] all? ] unit-test
+
+! We shouldn't have more than 0.01 false-positive rate.
+[ t ] [ 1000 iota [ drop most-positive-fixnum random 1000 + ] map
+        full-bloom-filter
+        [ bloom-filter-member? ] curry map
+        [ t = ] filter
+        ! TODO: This should be 10, but the false positive rate is currently very
+        ! high. It shouldn't be much more than this.
+        length 150 <= ] unit-test
diff --git a/extra/bloom-filters/bloom-filters.factor b/extra/bloom-filters/bloom-filters.factor
new file mode 100644
index 0000000000..94d0dd070f
--- /dev/null
+++ b/extra/bloom-filters/bloom-filters.factor
@@ -0,0 +1,161 @@
+! Copyright (C) 2009 Alec Berryman.
+! See http://factorcode.org/license.txt for BSD license.
+USING: accessors arrays assocs bit-arrays kernel layouts locals math
+math.functions math.ranges multiline sequences ;
+IN: bloom-filters
+
+/*
+
+TODO:
+
+- How to signal an error when too many bits? It looks like a built-in for some
+  types of arrays, but bit-array just returns a zero-length array. What we do
+  now is completely broken: -1 hash codes? Really?
+
+- The false positive rate is 10x what it should be, based on informal testing.
+  Better object hashes or a better method of generating extra hash codes would
+  help. Another way is to increase the number of bits used.
+
+  - Try something smarter than the bitwise complement for a second hash code.
+
+  - http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
+    makes a case for http://murmurhash.googlepages.com/ instead of enhanced
+    double-hashing.
+
+  - Be sure to adjust the test that asserts the number of false positives isn't
+    unreasonable.
+
+- Should round bits up to next power of two, use wrap instead of mod.
+
+- Should allow user to specify the hash codes, either as inputs to enhanced
+  double hashing or for direct use.
+
+- Support for serialization.
+
+- Wrappers for combining filters.
+
+- Should we signal an error when inserting past the number of objects the filter
+  is sized for? The filter will continue to work, just not very well.
+
+- The other TODOs sprinkled through the code.
+
+*/
+
+TUPLE: bloom-filter
+{ n-hashes fixnum read-only }
+{ bits bit-array read-only }
+{ maximum-n-objects fixnum read-only }
+{ current-n-objects fixnum } ;
+
+integer ; ! should check that it's below max-array-capacity
+
+! TODO: this should be a constant
+!
+! TODO: after very little experimentation, I never see this increase after about
+! 20 or so. Maybe it should be smaller.
+: n-hashes-range ( -- range )
+    100 [1,b] ;
+
+! Ends up with a list of arrays - { n-bits position }
+: find-bloom-filter-sizes ( error-rate number-objects -- seq )
+    [ bits-to-satisfy-error-rate ] 2curry
+    n-hashes-range swap
+    map
+    n-hashes-range zip ;
+
+:: smallest-first ( seq1 seq2 -- seq )
+    seq1 first seq2 first <= [ seq1 ] [ seq2 ] if ;
+
+! The consensus on the tradeoff between increasing the number of bits and
+! increasing the number of hash functions seems to be "go for the smallest
+! number of bits", probably because most implementations just generate one hash
+! value and cheaply mangle it into the number of hashes they need. I have not
+! seen any usage studies from the implementations that made this tradeoff to
+! support it, and I haven't done my own, but we'll go with it anyway.
+!
+! TODO: check that error-rate is reasonable.
+: size-bloom-filter ( error-rate number-objects -- number-hashes number-bits )
+    find-bloom-filter-sizes
+    max-array-capacity -1 2array
+    [ smallest-first ]
+    reduce
+    [ second ] [ first ] bi ;
+
+PRIVATE>
+
+: <bloom-filter> ( error-rate number-objects -- bloom-filter )
+    [ size-bloom-filter <bit-array> ] keep
+    0 ! initially empty
+    bloom-filter boa ;
+
+fixnum bitxor ;
+
+! TODO: This code calls abs because all the double-hashing stuff outputs array
+! indices and those can't be negative. Are we throwing away bits? -1000
+! b. actually prints -1111101000, which confuses me.
+: hashcodes-from-object ( obj -- n n )
+    hashcode abs hashcodes-from-hashcode ;
+
+: set-indices ( indices bit-array -- )
+    [ [ drop t ] change-nth ] curry each ;
+
+: increment-n-objects ( bloom-filter -- )
+    dup current-n-objects>> 1 + >>current-n-objects drop ;
+
+! This would be better as an each-relevant-hash that didn't cons.
+: relevant-indices ( value bloom-filter -- indices )
+    [ n-hashes>> ] [ bits>> length ] bi ! value n array-size
+    swapd [ hashcodes-from-object ] dip ! n value1 value2 array-size
+    enhanced-double-hashes ;
+
+PRIVATE>
+
+: bloom-filter-insert ( object bloom-filter -- )
+    [ relevant-indices ]
+    [ bits>> set-indices ]
+    [ increment-n-objects ]
+    tri ;
+
+: bloom-filter-member? ( value bloom-filter -- ? )
+    [ relevant-indices ]
+    [ bits>> [ nth ] curry map [ t = ] all? ]
+    bi ;