statistics - Are overlapping sub-arrays of a byte array independent enough to use as hash function(s) for Bloom Filter? -



statistics - Are overlapping sub-arrays of a byte array independent enough to use as hash function(s) for Bloom Filter? -

i have next question in context of bloomfilter. bloomfilters need have k independent hash functions. let's phone call these function h1, h2, ... hk. independent in context means value have little correlation (hopefully zero) when applied same set. see algorithm description @ http://en.wikipedia.org/wiki/bloom_filter (but of course, know page within out :).

now, assume want define hash functions using n bits (coming crypto function if must know, it's not relevant question), independent each other themselves. if want more context can read http://bitworking.org/news/380/bloom-filter-resources doing similar.

for example, assume want define each h (pardon pseudo-code):

bytes = md5(value) h1 = bytes[0-3] integer h2 = bytes[4-7] integer h3 = bytes[8-11] integer ...

of course of study run out of hash functions quickly. 4 in md5 example.

one possibility allow hash functions overlap each other , not have requirement 4 bytes sequential. way has many hash functions permutations byte array allows. maintain simple, if defined hash functions in next way:

bytes = md5(value) h1 = bytes[0-3] integer h2 = bytes[1-4] integer h3 = bytes[2-5] integer ...

it easy see in md5 case have 12 hashing functions instead of four.

finally, the question. these hashing functions independent? thanks!

update: decided seek reply question practical point of view created little programme test hypothesis. see below.

as case clever questions, reply yes, , no.

yes, in sense there 16 bits not shared between h1 , h2. no, in senses of import (unless using 8 bits of hash function, presume not).

the issue here less dependence between 2 functions applied same item beingness inserted , more (in case, in opinion) functions beingness applied multiple items.

think of way. assume first illustration uses g1-g4, , sec uses h1-h4. 2 items md5sum (or other hashing function) overlaps in 5 consecutive bytes (unlikely, statistically do-able, if you're trying) stand chance of colliding if using h1 , h2, h2 , h3, or h3 , h4. meanwhile g1-g4 robust possibility.

now collisions bloom filters aren't big deal other applications of hash functions, should maintain in mind overlapping bytes detract utility of hash functions. i'm little surprised need more 4 indepdendent hash functions, honest.

also, if you're using lastly 8 bits of each number (256 bit bloom filter) or lastly 16 bits (2^16 bit bloom filter), or whatever, can 'overlap' bits aren't using reckless abandon , without risk.

disclaimer: know cryptography pretty , bloom filters because fricking awesome, practical knowledge of bloom filters limited; describe may work quite utilize case.

statistics correlation bloom-filter

Comments

Popular posts from this blog

iphone - Dismissing a UIAlertView -

intellij idea - Update external libraries with intelij and java -

javascript - send data from a new window to previous window in php -