statistics - Are overlapping sub-arrays of a byte array independent enough to use as hash function(s) for Bloom Filter? -
statistics - Are overlapping sub-arrays of a byte array independent enough to use as hash function(s) for Bloom Filter? -
i have next question in context of bloomfilter. bloomfilters need have k
independent hash functions. let's phone call these function h1, h2, ... hk
. independent in context means value have little correlation (hopefully zero) when applied same set. see algorithm description @ http://en.wikipedia.org/wiki/bloom_filter (but of course, know page within out :).
now, assume want define hash functions using n
bits (coming crypto function if must know, it's not relevant question), independent each other themselves. if want more context can read http://bitworking.org/news/380/bloom-filter-resources doing similar.
for example, assume want define each h
(pardon pseudo-code):
bytes = md5(value) h1 = bytes[0-3] integer h2 = bytes[4-7] integer h3 = bytes[8-11] integer ...
of course of study run out of hash functions quickly. 4 in md5 example.
one possibility allow hash functions overlap each other , not have requirement 4 bytes sequential. way has many hash functions permutations byte array allows. maintain simple, if defined hash functions in next way:
bytes = md5(value) h1 = bytes[0-3] integer h2 = bytes[1-4] integer h3 = bytes[2-5] integer ...
it easy see in md5 case have 12 hashing functions instead of four.
finally, the question. these hashing functions independent? thanks!
update: decided seek reply question practical point of view created little programme test hypothesis. see below.
as case clever questions, reply yes, , no.
yes, in sense there 16 bits not shared between h1 , h2. no, in senses of import (unless using 8 bits of hash function, presume not).
the issue here less dependence between 2 functions applied same item beingness inserted , more (in case, in opinion) functions beingness applied multiple items.
think of way. assume first illustration uses g1-g4, , sec uses h1-h4. 2 items md5sum (or other hashing function) overlaps in 5 consecutive bytes (unlikely, statistically do-able, if you're trying) stand chance of colliding if using h1 , h2, h2 , h3, or h3 , h4. meanwhile g1-g4 robust possibility.
now collisions bloom filters aren't big deal other applications of hash functions, should maintain in mind overlapping bytes detract utility of hash functions. i'm little surprised need more 4 indepdendent hash functions, honest.
also, if you're using lastly 8 bits of each number (256 bit bloom filter) or lastly 16 bits (2^16 bit bloom filter), or whatever, can 'overlap' bits aren't using reckless abandon , without risk.
disclaimer: know cryptography pretty , bloom filters because fricking awesome, practical knowledge of bloom filters limited; describe may work quite utilize case.
statistics correlation bloom-filter
Comments
Post a Comment