# Set data structures & Implementation

Sets are an important mathematical concept. A set can be defined as a collection of distinct elements. In mathematical literature, curly braces are used to denote a set. So, a set S containing the elements 1, 2 and 3 is denoted by S = {1, 2, 3}. A set supports the following basic operations,

1. Union - The union of two sets is a set that contains elements from both sets. For instance, given,
• S1 = {1, 2, 3} and
• S2 = {3, 4, 5}
• S1 ∪ 2 = {1, 2, 3, 4, 5}
2. Intersection - The intersection of two sets is a set that contains elements common to both sets. For instance, given,
• S1 = {1, 2, 3} and
• S2 = {3, 4, 5}
• S1 ∩ 2 = {3}
3. Difference - The difference of two sets is a set that contains elements from the first set with the elements common to both sets removed. Given,
• S1 = {1, 2, 3} and
• S2 = {3, 4, 5}
• S1 − 2 = {1, 2}

We also have the following additional operations,

1. IsMember - Tells you if an element is a member of the set. It is denoted by the ∈ symbol. So,
• 1 ∈ {1, 2, 3} but
• 4 ∉ {1, 2, 3}

## Implementing sets

Let’s look at the different ways we can implement the Set data structure and analyse each implementation.

### Bit-vectors

Bit vectors are an array of bits (x1, x2, … xn) such that xi = 1 if i is a member of the set. For example,

S1 = {1, 2, 3} will be represented as `11100` while
S2 = {3, 4, 5} will be represented as `00111`.

1. `IsMember()`, `Add()` and `Delete()` are `O(1)` operations.
2. `Union()`, `Intersection()` and `Difference()` are at most `O(N)` where `N` is the size of the universe.
3. For non-sparse sets, bit-vectors can represent the set quite tightly.
4. Bit operations are typically faster.

1. Takes too much memory for large universes.
2. Needs a mapping function for non-integer members.

A linked list implementation keeps a sorted list of elements for every set. This means,

S1 = {1, 2, 3} will be stored as `1 -> 2 -> 3` while
S2 = {3, 4, 5} will be represented as `3 -> 4 -> 5`.

#### Properties

1. `IsMember()` is at most `O(n)` operation where `n` is the size of the set.
2. `Union()`, `Intersection()`, `Difference()`, `Add()` and `Delete()` are take at most `O(mn)` where `m` is the size of the first set and `n` is the size of the second set.

• For very large universes, this outperforms a Bit-vector implementation.

• Typically the slowest implementation compared to the rest.

### Hashtables

A set can also be implemented as a hashmap. The time complexity of the various functions is dependent on the particular implementation of a hashtable.

#### Properties

1. `IsMember()` is typically a `O(1)` operation.
2. `Add()` and `Difference()` are typically `O(n)` where `n` is the size of the second set.
3. `Union()` and `Intersection()` are typically `O(mn)` where `m` is the size of the first set and `n` is the size of the second one.

### Ordered trees

Sets can also be implemented as ordered trees. Priority Queues are an example of an ordered tree. A Binary search tree implementation of a set would give you the following properties,

#### Properties

1. `IsMember()` is an `O(log n)` operation.
2. `Add()`, `Difference()`, `Union()` and `Intersection()` are `O(log n)`.

## Applications

1. Information retrieval - Inverted index.
2. AND & OR queries.