Sets and Performance Complexity

Explore the HashSet<T> and SortedSet<T> collections to understand high-performance data uniqueness and compare algorithmic complexity between linear and constant-time lookups.

While List<T> is ideal for maintaining a specific sequence of items, it allows duplicate values, and requires a sequential search to find an element. A set is a collection that guarantees uniqueness; it contains no duplicate elements.

The HashSet collection

In .NET, the primary implementation is HashSet<T>. It is specifically optimized for high-speed lookups and mathematical set operations. You would typically use a HashSet<T> in scenarios such as tracking unique website visitor IDs, managing distinct product tags, or filtering duplicates from a large dataset.

Key characteristics of sets

  • Guaranteed uniqueness: A set automatically prevents the insertion of duplicate items.

  • No indexing: Unlike a list, you cannot access a set element by a numeric index like mySet[0]. The items are not stored in a guaranteed sequence.

  • High performance: It uses hashing algorithms to achieve near-instantaneous search speeds regardless of the collection size.

Algorithmic complexity (Big O basics)

We use Big O notation to evaluate and compare algorithmic efficiency. This notation measures how the execution time of an algorithm grows as the size of the dataset increases.

In a standard List<T>, using the Contains() method triggers a sequential search. The collection must inspect the first item, then the second, and so on until it finds a match. If you have 1,000,000 items, the program might have to perform 1,000,000 checks. Because the maximum time taken grows linearly with the number of items (n), this is called O(n)O(n) complexity.

A HashSet<T> operates entirely differently. It uses a mathematical formula (a hash function) on the item itself to calculate its exact memory address. When you call Contains(), the set hashes the item and jumps directly to that memory location. Whether the set contains ten items or ten million items, the lookup requires a constant amount of execution time. This flat performance curve is called O(1)O(1), or constant time.