Set data structures & Implementation
Sets are an important mathematical concept. A set can be defined as a collection of distinct elements. In mathematical literature, curly braces are used to denote a set. So, a set S containing the elements 1, 2 and 3 is denoted by S = {1, 2, 3}.
A set supports the following basic operations,
- Union - The union of two sets is a set that contains elements from both sets. For instance, given,
- S1 = {1, 2, 3} and
- S2 = {3, 4, 5}
- S1 ∪ 2 = {1, 2, 3, 4, 5}
- Intersection - The intersection of two sets is a set that contains elements common to both sets. For instance, given,
- S1 = {1, 2, 3} and
- S2 = {3, 4, 5}
- S1 ∩ 2 = {3}
- Difference - The difference of two sets is a set that contains elements from the first set with the elements common to both sets removed. Given,
- S1 = {1, 2, 3} and
- S2 = {3, 4, 5}
- S1 − 2 = {1, 2}
We also have the following additional operations,
- IsMember - Tells you if an element is a member of the set. It is denoted by the ∈ symbol. So,
- 1 ∈ {1, 2, 3} but
- 4 ∉ {1, 2, 3}
Implementing sets
Let’s look at the different ways we can implement the Set data structure and analyse each implementation.
Bit-vectors
Bit vectors are an array of bits (x1, x2, … xn) such that xi = 1 if i is a member of the set. For example,
S1 = {1, 2, 3} will be represented as 11100
while
S2 = {3, 4, 5} will be represented as 00111
.
Advantages
IsMember()
,Add()
andDelete()
areO(1)
operations.Union()
,Intersection()
andDifference()
are at mostO(N)
whereN
is the size of the universe.- For non-sparse sets, bit-vectors can represent the set quite tightly.
- Bit operations are typically faster.
Disadvantages
- Takes too much memory for large universes.
- Needs a mapping function for non-integer members.
Linked-list
A linked list implementation keeps a sorted list of elements for every set. This means,
S1 = {1, 2, 3} will be stored as 1 -> 2 -> 3
while
S2 = {3, 4, 5} will be represented as 3 -> 4 -> 5
.
Properties
IsMember()
is at mostO(n)
operation wheren
is the size of the set.Union()
,Intersection()
,Difference()
,Add()
andDelete()
are take at mostO(mn)
wherem
is the size of the first set andn
is the size of the second set.
Advantages
- For very large universes, this outperforms a Bit-vector implementation.
Disadvantages
- Typically the slowest implementation compared to the rest.
Hashtables
A set can also be implemented as a hashmap. The time complexity of the various functions is dependent on the particular implementation of a hashtable.
Properties
IsMember()
is typically aO(1)
operation.Add()
andDifference()
are typicallyO(n)
wheren
is the size of the second set.Union()
andIntersection()
are typicallyO(mn)
wherem
is the size of the first set andn
is the size of the second one.
Ordered trees
Sets can also be implemented as ordered trees. Priority Queues are an example of an ordered tree. A Binary search tree implementation of a set would give you the following properties,
Properties
IsMember()
is anO(log n)
operation.Add()
,Difference()
,Union()
andIntersection()
areO(log n)
.
Applications
- Information retrieval - Inverted index.
- AND & OR queries.