Suffix Array
Authors: Benjamin Qi, Kevin Sheng
Quickly sorting suffixes of a string and its applications
Prerequisites
Resources | ||||
---|---|---|---|---|
CF | Videos & Practice Problems | |||
cp-algo | ||||
CPC | ||||
CP2 |
The suffix array of a string is the sorted array of all possible suffixes of the string. Each suffix is usually represented by its starting index.
For example, if our string were abcbcba
, the suffix array would be as follows:
Suffix | String |
---|---|
6 | a |
0 | abcbcba |
5 | ba |
3 | bcba |
1 | bcbcba |
4 | cba |
2 | cbcba |
Construction
Focus Problem – try your best to solve this problem before continuing!
The general philosophy behind the construction of a suffix array is to sort incrementally. We first start by comparing the first character of each suffix, and then double the amount we compare until we're comparing the entire length of the string. If this sounds really abstract, don't worry! All the implementation details will be hammered out below.
Initial Transformation
For convenience, let's start by appending a "null character" of sorts to the end of the string.
This acts as a tiebreaker and will ensure we never hit the end of a suffix while comparing two suffixes.
$
or the space character are possible options, as the ASCII code of either are less than a
.
To avoid out of bounds issues, we'll also append the string to itself. Notice that since any comparisons would have stopped at the null character, this has no effect on the order of the suffix array.
After implementing these two transformations, the strings we compare would look like this (still using abcbcba
as an example):
Suffix | String |
---|---|
7 | $abcbcba |
6 | a$abcbcba |
0 | abcbcba$abcbcba |
5 | ba$abcbcba |
3 | bcba$abcbcba |
1 | bcbcba$abcbcba |
4 | cba$abcbcba |
2 | cbcba$abcbcba |
The last suffix that starts with $
will always come first; it doesn't particularly matter.
Sorting
As stated earlier, let's first just consider the first character in each substring and sort it:
Suffix | String |
---|---|
7 | $abcbcba |
0 | abcbcba$abcbcba |
6 | a$abcbcba |
1 | bcbcba$abcbcba |
3 | bcba$abcbcba |
5 | ba$abcbcba |
2 | cbcba$abcbcba |
4 | cba$abcbcba |
From here we start doubling the length of string we compare. To get to the point where we efficiently compare suffixes, let's first do some analysis.
Say we just sorted all the suffixes by the first characters, and we're now comparing the first four characters of suffixes and :
Suffix | String |
---|---|
2 | cbcb |
4 | cba$ |
If the first characters of suffix were already less than or greater than those of suffix , nothing changes. However, in this case they aren't; we're going to have to compare the right half.
How might we do this? Well, the cool thing is that since we just sorted all suffixes by the first , we actually have information about the right halves already! The right half of the first four characters of suffix is just the first two characters of suffix . Similarly, the right half of suffix is the first two characters of suffix .
To compare these half-suffixes, after each iteration of sorting we can assign each half-baked suffix a number representing its position in the array. If two half-baked suffixes are equal, then they would get the same number. These numbers are usually called equivalence classes.
Applying this augmentation to our array sorted by the first character, we would now have:
Eq. Class | Suffix | String |
---|---|---|
0 | 7 | $abcbcba |
1 | 0 | abcbcba$abcbcba |
1 | 6 | a$abcbcba |
2 | 1 | bcbcba$abcbcba |
2 | 3 | bcba$abcbcba |
2 | 5 | ba$abcbcba |
3 | 2 | cbcba$abcbcba |
3 | 4 | cba$abcbcba |
Optimization
Since we're doing sorts times, our current algorithm as it stands is .
Although this is adequate, it can sometimes be a bit too slow especially when creating a suffix array is often just the first step in many problems. Fortunately, we can fix this! Since everything we're comparing is in the range , we can use radix sort to remove a log factor from our complexity.
Implementation
Time Complexity:
C++
#include <algorithm>#include <iostream>#include <string>#include <vector>using std::cout;using std::endl;using std::pair;using std::vector;
A more compact implementation is also below.
Resources | ||||
---|---|---|---|---|
Benq | O(N log N) |
Java
Warning!
The below implementation will TLE on some test cases due to Java's constant factor.
import java.io.*;import java.util.*;import java.util.function.Function;public class SuffixArray {static class Suffix implements Comparable<Suffix> {public int start;public int leftEq;public int rightEq;
It's recommended that you also test your implementation against a brute force solution for many small strings.
Here's a small script that outputs a test case and its answer:
C++
#include <algorithm>#include <iostream>#include <vector>using std::cout;using std::endl;using std::vector;int main() {int n = 50; // adjust n as you please
Java
import java.io.*;import java.util.*;public class Generator {public static void main(String[] args) throws IOException {Random rng = new Random();int n = 50; // adjust n as you pleaseStringBuilder sb = new StringBuilder();for (int i = 0; i < n; i++) { sb.append((char)('a' + rng.nextInt(26))); }
Python
from random import choicefrom string import ascii_lowercasen = 50 # adjust n as you pleasestr_ = [choice(ascii_lowercase) for _ in range(n)]print("".join(str_))suffs = list(range(n))suffs.sort(key=lambda s: str_[s:])print(" ".join(str(i) for i in suffs))
LCP Array
The LCP array is a common auxiliary array based on the suffix array that stores the longest common prefix (LCP) between adjacent elements of the suffix array.
Taking our example string from above, such an array would look like this:
LCP | Suffix | String |
---|---|---|
NA | 7 | $abcbcba |
0 | 0 | abcbcba$abcbcba |
1 | 6 | a$abcbcba |
0 | 1 | bcbcba$abcbcba |
3 | 3 | bcba$abcbcba |
1 | 5 | ba$abcbcba |
0 | 2 | cbcba$abcbcba |
2 | 4 | cba$abcbcba |
Each row contains the LCP between the row's string and that of the previous row.
Construction
An algorithm often used to construct this LCP array starts by calculating the LCP for the suffix starting at and its previous suffix. It then calculates the LCP for the suffix starting at and its previous suffix, and so on and so forth. The reason for this calculation order is so that we can apply a clever optimization that is best described through our example.
Say we just finished calculating the LCP for the row of suffix . Our next step, then, would be to calculate the LCP for suffix .
One might think that we would have to start comparing the prefixes all over again for this suffix, but we actually don't have to! Remember that we just calculated the LCP of suffixes and to be characters, which means that the suffixes and , and all the suffixes between them in the array (despite there not being any in our example) must have at least characters in common.
What this means is that we can just start comparing from the third character to compute the LCP for the row of suffix instead of starting all over.
In general, we keep a variable which keeps the number of common characters we know the current suffix has with its predecessor in the suffix array. This variable tells us how many characters we can skip, as we know they're equal. After each computation of an LCP, we set this variable equal to its value minus one, except when the LCP is already .
You'll see an implementation of this algorithm in the next section.
Application
Now that we've constructed this LCP array, let's see an application of it.
Focus Problem – try your best to solve this problem before continuing!
A crucial observation to be made for this problem is that every substring can be represented as a prefix of some suffix. We know that suffix has (recall that is the length of the string) possible prefixes.
Notice that the number of prefixes of suffix that are duplicates is just the longest common prefix between and its predecessor. With this, we can iterate through the LCP array and calculate our answer in one fell swoop.
Implementation
Time Complexity:
C++
#include <algorithm>#include <iostream>#include <string>#include <vector>using std::cout;using std::endl;using std::pair;using std::string;using std::vector;
Java
Warning!
This solution, like the previous one, will TLE for the same reason stated above.
import java.io.*;import java.util.*;import java.util.function.Function;public class SuffixArray {Code Snippet: Suffix Class (Click to expand)public static void main(String[] args) throws IOException {BufferedReader read = new BufferedReader(new InputStreamReader(System.in));String str = read.readLine() + '$';
Burrows-Wheeler Transform
Resources | ||||
---|---|---|---|---|
GFG | could be simpler? |
Focus Problem – try your best to solve this problem before continuing!
View Internal SolutionRun Enumerate
Resources | ||||
---|---|---|---|---|
cp-algo | could be simpler? |
Focus Problem – try your best to solve this problem before continuing!
This section is not complete.
you can do this easily with a suffix array
Problems
Status | Source | Problem Name | Difficulty | Tags | |
---|---|---|---|---|---|
Platinum | Hard | Show TagsSuffix Array | |||
CF | Hard | Show TagsSuffix Array |
Module Progress:
Join the USACO Forum!
Stuck on a problem, or don't understand a module? Join the USACO Forum and get help from other competitive programmers!