String Sorts

1.  Java char data type: A 16-bit unsigned integer.

    --  Supports original 16-bit Unicode.

    --  Supports 21-bit Unicode 3.0 (awkwardly).

 

2.  String data type in Java: Sequence of characters (immutable).

    --  Length: Number of characters.

    --  Indexing: Get the ith character.

    --  Substring extraction: Get a contiguous subsequence of characters.

    --  String concatenation: Append one character to end of another string.


 

    --  Implementation:

 

public final class String implements Comparable<String>
{
    private char[] value; // characters
    private int offset; // index of first char in array
    private int length; // length of string
    private int hash; // cache of hashCode()
    public int length()
    { return length; }
    public char charAt(int i)
    { return value[i + offset]; }

    private String(int offset, int length, char[] value)
    {
        this.offset = offset;
        this.length = length;
        this.value = value;
    }
    public String substring(int from, int to)
    { return new String(offset + from, to - from, value); }

...
}

     --  Memory: 40 + 2N bytes for a virgin String of length N.

 

     --  Performance:


 

3.  The StringBuilder data type

    --  Sequence of characters (mutable).

    --  Underlying implementation: Resizing char[] array and length.

    --  StringBuffer data type is similar, but thread safe (and slower).

 

4.  How to efficiently reverse a string?

 

public static String reverse(String s)
{
    StringBuilder rev = new StringBuilder();
    for (int i = s.length() - 1; i >= 0; i--)
        rev.append(s.charAt(i));
    return rev.toString();
}

 

 

5.  How to efficiently form array of suffixes?


 

public static String[] suffixes(String s)
{
    int N = s.length();
    String[] suffixes = new String[N];
    for (int i = 0; i < N; i++)
        suffixes[i] = s.substring(i, N);
    return suffixes;
}

 

 

6.  How long to compute length of longest common prefix?

 

public static int lcp(String s, String t)
{
    int N = Math.min(s.length(), t.length());
    for (int i = 0; i < N; i++)
        if (s.charAt(i) != t.charAt(i))
            return i;
    return N;
}

 

 

7.  Key-indexed counting

    --  Assumption: Keys are integers between 0 and R - 1.

    --  Applications:

        -  Sort string by first letter.

        -  Sort class roster by section.

        -  Sort phone numbers by area code.

        -  Subroutine in a sorting algorithm.

    --  Algorithm

        -  Count frequencies of each letter using key as index.

        -  Compute frequency cumulates which specify destinations.

        -  Access cumulates using key as index to move items.

        -  Copy back into original array.

    --  Implementation:

 

int N = a.length;
int[] count = new int[R+1];
for (int i = 0; i < N; i++)
    count[a[i]+1]++;
for (int r = 0; r < R; r++)
    count[r+1] += count[r];
for (int i = 0; i < N; i++)
    aux[count[a[i]]++] = a[i];
for (int i = 0; i < N; i++)
    a[i] = aux[i];

 

     --  Stable

 


8.  LSD string (radix) sort

    --  Consider characters from right to left.

    --  Stably sort using dth character as the key (using key-indexed counting).

    --  Pf. [by induction on i] After pass i, strings are sorted by last i characters.

        --  If two strings differ on sort key, key-indexed sort puts them in proper relative order.

        --  If two strings agree on sort key, stability keeps them in proper relative order.

    --  Implementation:

 

public class LSD
{
    public static void sort(String[] a, int W)
    {
        int R = 256; // extended ASCII Alphabets
        int N = a.length;
        String[] aux = new String[N];
        for (int d = W-1; d >= 0; d--)
        {
            int[] count = new int[R+1];
            for (int i = 0; i < N; i++)
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)
                count[r+1] += count[r];
            for (int i = 0; i < N; i++)
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++)
                a[i] = aux[i];
        }
    }
}

 

 

 

9.  MSD string (radix) sort.

    --  Partition array into R pieces according to first character (use key-indexed counting).

    --  Recursively sort all strings that start with each character (key-indexed counts delineate subarrays to sort).

    --  Variable-length strings : Treat strings as if they had an extra char at end (smaller than any char).

    --  Implementation:

 

public class MSD {
    public static void sort(String[] a)
    {
        aux = new String[a.length];
        sort(a, aux, 0, a.length - 1, 0);
    }

    private static void sort(String[] a, String[] aux, int lo, int hi, int d)
    {
        int R = 256 // extended ASCII Alphabets
        if (hi <= lo) return;
        int[] count = new int[R+2]; // R+1 values, including the special small value for padding
        for (int i = lo; i <= hi; i++)
            count[charAt(a[i], d) + 2]++; // alphabet j is counted at count[j + 2]
        for (int r = 0; r < R+1; r++)
            count[r+1] += count[r];
        for (int i = lo; i <= hi; i++)
            aux[count[charAt(a[i], d) + 1]++] = a[i]; // after accumulating number of alphabets less than j is count[j + 2 - 1]
        for (int i = lo; i <= hi; i++)
            a[i] = aux[i - lo];
        for (int r = 0; r < R; r++)
            sort(a, aux, lo + count[r], lo + count[r+1] - 1, d+1); // after moving data from a[] to aux[], count[r] records the number of alphabits less than r
    }

    private static int charAt(String s, int d)
    {
        if (d < s.length()) return s.charAt(d);
        else return -1;
    }
}

 

 

     --  Performance issues

        --  Much too slow for small subarrays.

            -  Each function call needs its own count[] array.

            -  ASCII (256 counts): 100x slower than copy pass for N = 2.

            -  Unicode (65,536 counts): 32,000x slower for N = 2.

        --  Huge number of small subarrays because of recursion.

    --  Solution: Cut off to insertion sort for small subarrays.

        --  Insertion sort, but start at dth character.

        --  Implement less() so that it compares starting at dth character.

public static void sort(String[] a, int lo, int hi, int d)
{
    for (int i = lo; i <= hi; i++)
        for (int j = i; j > lo && less(a[j], a[j-1], d); j--)
            exch(a, j, j-1);
}

private static boolean less(String v, String w, int d)
{ return v.substring(d).compareTo(w.substring(d)) < 0; }

     --  performance

        --  MSD examines just enough characters to sort the keys.

        --  Number of characters examined depends on keys.


 

10.  MSD string sort vs. quicksort for strings

    --  Disadvantages of MSD string sort.

        -  Extra space for aux[]. (caused by counting sort)

        -  Extra space for count[]. (caused by counting sort)

        -  Inner loop has a lot of instructions. (caused by counting sort)

        -  Accesses memory "randomly" (cache inefficient).

    --  Disadvantage of quicksort.

        -  Linearithmic number of string compares (not linear).

        -  Has to rescan many characters in keys with long prefix matches. ( for comparison)

 

11.  3-way string quicksort

    --  Do 3-way partitioning on the dth character.

    --  Less overhead than R-way partitioning in MSD string sort.

    --  Does not re-examine characters equal to the partitioning char

       (but does re-examine characters not equal to the partitioning char).

    --  Implementation

private static void sort(String[] a)
{ sort(a, 0, a.length - 1, 0); }

private static void sort(String[] a, int lo, int hi, int d)
{
    if (hi <= lo) return;
    int lt = lo, gt = hi;
    int v = charAt(a[lo], d);
    int i = lo + 1;
    while (i <= gt)
    {
        int t = charAt(a[i], d);
        if (t < v) exch(a, lt++, i++);
        else if (t > v) exch(a, i, gt--);
        else i++;
    }
    // sort 3 subarrays recursively
    sort(a, lo, lt-1, d);
    if (v >= 0) sort(a, lt, gt, d+1);
        sort(a, gt+1, hi, d);
    sort(a, gt+1, hi, d);
}



 

     --  Performance

        --  Standard quicksort uses ~ 2 N ln N string compares on average. Costly for keys with long common prefixes (and this is a common case!)

        --  3-way string (radix) quicksort uses ~ 2 N ln N character compares on average for random strings. Avoids re-comparing long common prefixes.



 

12.  Keyword-in-context search:

    --  Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context).

    --  Solution : suffix sort the text and binary search for query; scan until mismatch.




 

13.  Longest repeated substring

    --  Given a string of N characters, find the longest repeated substring.

    --  Implementation :

public String lrs(String s)
{
    int N = s.length();
    String[] suffixes = new String[N];
    for (int i = 0; i < N; i++)
        suffixes[i] = s.substring(i, N);
    Arrays.sort(suffixes);
    String lrs = "";
    for (int i = 0; i < N-1; i++)
    {
        int len = lcp(suffixes[i], suffixes[i+1]);
        if (len > lrs.length())
            lrs = suffixes[i].substring(0, len);
    }
    return lrs;
}

 

     --  Bad Input : longest repeated substring very long.

        --  LRS needs at least 1 + 2 + 3 + ... + D character compares, where D = length of longest match.

 

14.  Manber-Myers MSD algorithm

    --  Suffix sorting in linearithmic time.

    --  Algorithm

        -  Phase 0: sort on first character using key-indexed counting sort.

        -  Phase i: given array of suffixes sorted on first 2^(i-1) characters, create array of suffixes sorted on first 2^i characters.



 

    --  Worst-case running time: N lg N.

        --  Finishes after lg N phases.

        --  Can perform a phase in linear time.

 

Posted by Adeline at September 17, 2014 - 7:55 AM