Bloom Filter

Probabilistic data structure

Probabilistic data structures are data structures that provide a reasonably approximate answer without giving a definitive, accurate result. They also offer a mechanism to approximate this estimation. The probabilistic data structure is used in big data and streaming applications, where approximate answers are sufficient to proceed with the workflow.

Some well-known probabilistic data structures include Bloom filter, Count Min Sketch, and HyperLogLog. Probabilistic data structures include these characteristics:

  • They are space-efficient. The data structures fit into memory, providing a lower memory footprint.

  • Probabilistic data structures are parallelizable.

  • The access patterns on the data structure are predictable and in constant time.

Probabilistic data structures can be used to:

  • Determine whether an element belongs to a set.

  • Determine the total distinct elements in a given array.

  • Count the approximate frequency of a particular element from a large data set.

Introduction

A Bloom filter is a space-efficient probabilistic data structure devised by Burton Howard Bloom in 1970. A Bloom filter is used to approximate the set membership problem statement and determine whether an element belongs to a set or not.

The data structure can produce a false positive, but it doesn't produce a false negative. Thus, the results of the Bloom filter are not deterministic under all conditions:

  • A false positive means that the assumption of a result is true but not valid. For example, if the Bloom filter determines the element is present and the corresponding data is missing in the actual database, it is a false positive.

  • A false negative match means that the assumption of a result is false but not valid. For example, if the Bloom filter determines the element is missing and the corresponding data is present in the actual database, it is a false negative.

Bloom filters can produce false positives. In the context of the Bloom filter, the data structure determines that the element belongs to a set, but the data may or may not be present in the underlying storage system. Thus, it is not completely accurate in predicting whether the element belongs to a set.

Bloom filters cannot produce false negatives. In the context of the Bloom filter, the data structure determines that the element doesn’t belong to a set, and is 100 percent accurate in the underlying storage system. This property of Bloom filter makes it suitable as a cheap, first-level filter to skip scanning the underlying data store when the data does not belong to the set.

The number of elements in the set and the number of hashing functions determine the probability of false positives.

Insertion

A Bloom filter has two significant components:

  • BitVector

  • Hashing module

The BitVector is the core data structure for the Bloom filter. The BitVector is an array that compactly stores bits with 0s and 1s. Bloom filter initializes the BitVector with all 0s.

Bloom filters initialize multiple hashing functions. Each hash function takes a particular input and generates a hash value mapping to N buckets. Subsequently, the Bloom filter maps the bucket to one of the bits in the vector. Finally, the algorithm maps the hash function's output to a bit in the BitVector, setting it to 1. Bloom filter repeats the algorithm for every initialized hash function, setting appropriate bits in the BitVector.

Get hands-on with 1200+ tech skills courses.