Suffix Array

Suffix Array

By Jude Gao
Reference: CS240E Course Notes at University of Waterloo

In pattern matching problems, a suffix array is an array of all the sorted suffixes of the text $T$.

Introduction

Let $P=banananaban​$.

Write down all the suffixes of $T$.

$bananaban$

$ananaban$

$nanaban$

$anaban$

$naban$

$aban$

$ban$

$an$

$n$

Sort them lexicographically

$aban$

$an$

$anaban$

$ananaban$

$ban$

$bananaban$

$n$

$naban$

$nanaban​$

For each element $s$ of the sorted suffixes, indicate the index $i$ such that $T[i..|T|-1]=s​$.

$5\quad aban$

$7\quad an$

$3\quad anaban$

$1\quad ananaban$

$6\quad ban$

$0\quad bananaban$

$8\quad n$

$4\quad naban$

$2\quad nanaban​$

In fact, the array [5, 7, 3, 1, 6, 0, 8, 4, 2] is the suffix array of $T$.

What’s it good for?

Just like suffix trees, we want to be able to answer that whether my pattern is a prefix of any suffix of $T$, or equivalently, whether my pattern $P$ is in $T$.

Let $A^S$ denote the suffix array of $T$. Let $P$ be the pattern that is searched in $T$.

The key idea is to binary search $P$ in $A^S​$.

Given the suffix array $A^S=$

$5\quad aban​$

$7\quad an​$

$3\quad anaban​$

$1\quad ananaban​$

$6\quad ban​$

$0\quad bananaban​$

$8\quad n​$

$4\quad naban​$

$2\quad nanaban​$

and $P=naban$.

  1. Compare $naban​$ with the middle element $ban​$. $naban​$ is greater than $ban​$, so we check the lower half.

$0\quad bananaban$

$8\quad n$

$4\quad naban$

$2\quad nanaban$

  1. Compare $naban$ with the middle element $naban$. They are equal, so $P$ is found in $T$.

Efficiency

The first thing to bear in mind is that the algorithm has no luxury of listing out all the prefixes. What it will do is to index them.

Let $T$ be the text with size $n$. To build a suffix array, we need to sort $n$ suffixes. Unlike sorting integers, everything can be done efficiently.

An important step in traditional comparison-based sorting algorithms is comparison. Comparing numbers is cheap but comparing a string requires character-by-character comparisons, so every comparison takes $O(n)$ time. An optimal comparison-sorting algorithm requires $O(n\log n)$ comparisons. Hence, the suffix sorting takes $O(n^2 \log n)$ time.

There is a relatively smarter idea is to use MSD or LSD radix sort by viewing words as base-ASCII. We know that the formula for their runtime is $O(d(n+R))$ where $d$ is the number of characters and $R$ is the number of alphabets. Since the number of characters is $O(n)$, using MSD or LSD radix sort takes $O(n^2)$.

Can we do better?

Yes, now we will talk about a way of building suffix arrays in $O(n\log n)$ time, and we will examplify this method.

We will start with all the suffixes and their starting index $i$ such that $T[i..n-1]$ stores the suffix, shown below.

$0\quad bananaban$

$1\quad ananaban$

$2\quad nanaban$

$3\quad anaban$

$4\quad naban$

$5\quad aban$

$6\quad ban$

$7\quad an$

$8\quad n$

We will use count-sort to sort by the first character, and indicate the rank in the very beginning. Moreover, counting sort will provide a grouping based on the sorted part for free.

$0\quad 1\quad ananaban$

$1\quad 3\quad anaban$

$2\quad 5\quad aban$

$4\quad 7\quad an$

$4\quad 0\quad bananaban$

$5\quad 6\quad ban$

$6\quad 2\quad nanaban​$

$7\quad 4\quad naban​$

$8\quad 8\quad n​$

Until now, all the suffixes have been sorted by first character. It turns out that we can use the information of the first-character rank to sort by first two characters.

Ignore the parts that have been sorted. The parts that have not been are boxed. For each box, we write down their first-character ranks.

$0\quad 1\quad a\boxed{nanaban}_6$

$1\quad 3\quad a\boxed{naban}_7$

$2\quad 5\quad a\boxed{ban}_5$

$4\quad 7\quad a\boxed{n}_8$

$4\quad 0\quad b\boxed{ananaban}_0$

$5\quad 6\quad b\boxed{an}_4$

$6\quad 2\quad n\boxed{anaban}_1$

$7\quad 4\quad n\boxed{aban}_2$

$8\quad 8\quad n\boxed{\epsilon}_{-1}$

Then we sort each group with respect to the order indicated on the box.

$2\quad 5\quad a\boxed{ban}_5$

$0\quad 1\quad a\boxed{nanaban}_6$

$1\quad 3\quad a\boxed{naban}_7$

$4\quad 7\quad a\boxed{n}_8$

$4\quad 0\quad b\boxed{ananaban}_0$

$5\quad 6\quad b\boxed{an}_4$

$8\quad 8\quad n\boxed{\epsilon}_{-1}$

$6\quad 2\quad n\boxed{anaban}_1$

$7\quad 4\quad n\boxed{aban}_2$

Surprsingly, we have already sorted all suffixes by first two characters. We should update our previous one-character ranks to two-character ranks and box the new unsorted parts as well as indicate their two-character ranks. Remember also to split the grouping.

$0\quad 5\quad ab\boxed{an}_3$

$1\quad 1\quad an\boxed{anaban}_2$

$2\quad 3\quad an\boxed{aban}_0$

$3\quad 7\quad an\boxed{\epsilon}_{-1}$

$4\quad 0\quad ba\boxed{nanaban}_8$

$5\quad 6\quad ba\boxed{n}_7​$

$7\quad 8\quad n\epsilon\boxed{\epsilon}_{-1}$

$8\quad 2\quad na\boxed{naban}_9$

$9\quad 4\quad na\boxed{ban}_5$

We will sort them in box order in the same groups specified in the first iteration.

$0\quad 5\quad ab\boxed{an}_3​$

$3\quad 7\quad an\boxed{\epsilon}_{-1}$

$2\quad 3\quad an\boxed{aban}_0$

$1\quad 1\quad an\boxed{anaban}_2$

$5\quad 6\quad ba\boxed{n}_7$

$4\quad 0\quad ba\boxed{nanaban}_8$

$7\quad 8\quad n\epsilon\boxed{\epsilon}_{-1}$

$9\quad 4\quad na\boxed{ban}_5$

$8\quad 2\quad na\boxed{naban}_9$

In fact, they all have already been in order. However, the algorithm would continue all the boxes have $-1$ on them.

In each iteration, there are $O(n)$ rank quries. Due to the special property of suffixes, each iteration can double the number of sorted parts, hence $O(\log n)$ iterations. Therefore, the algorithm achieves $O(n\log n)$ time to sort $n$ suffixes, thus building a suffix array.