Wednesday, November 12, 2014

STACK: Similarity String Comparison in Java

Calculating String Similarity:

The common way of calculating the similarity between two strings in a 0%-100% fashion, as used in many libraries, is measuring how much (in %) you'd have to change the biggest string to turn it into the other:
public static double similarity(String s1, String s2) {
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1; s1 = s2; s2 = swap;
    }
    int bigLen = s1.length();
    if (bigLen == 0) { return 1.0; /* both strings are zero length */ }
    return (bigLen - computeEditDistance(s1, s2)) / (double) bigLen;
}
// full copy-paste working code is below



As you can see, the similarity() method above uses a computeEditDistance() method to calculate the edit distance step. There are several implementations to this step, each may suit a specific scenario better. The most common is the Levenshtein distance algorithm and we'll use it in our example below (for very large strings, other algorithms are likely to perform better).

Calculating edit distance with Levenshtein algorithm in Java:

LevenshteinDistance class is a working example of an implementation of the Levenshtein distance metric algorithm in Java (the similarity method used above is included):
public class LevenshteinDistance {

    public static double similarity(String s1, String s2) {
        if (s1.length() < s2.length()) { // s1 should always be bigger
            String swap = s1; s1 = s2; s2 = swap;
        }
        int bigLen = s1.length();
        if (bigLen == 0) { return 1.0; /* both strings are zero length */ }
        return (bigLen - computeEditDistance(s1, s2)) / (double) bigLen;
    }

    public static int computeEditDistance(String s1, String s2) {
        s1 = s1.toLowerCase();
        s2 = s2.toLowerCase();

        int[] costs = new int[s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            int lastValue = i;
            for (int j = 0; j <= s2.length(); j++) {
                if (i == 0)
                    costs[j] = j;
                else {
                    if (j > 0) {
                        int newValue = costs[j - 1];
                        if (s1.charAt(i - 1) != s2.charAt(j - 1))
                            newValue = Math.min(Math.min(newValue, lastValue),
                                    costs[j]) + 1;
                        costs[j - 1] = lastValue;
                        lastValue = newValue;
                    }
                }
            }
            if (i > 0)
                costs[s2.length()] = lastValue;
        }
        return costs[s2.length()];
    }

    public static void printDistance(String s1, String s2) {
        System.out.println(s1 + "-->" + s2 + ": " +
                    computeEditDistance(s1, s2) + " ("+similarity(s1, s2)+")");
    }

    public static void main(String[] args) {
        printDistance("", "");
        printDistance("1234567890", "1");
        printDistance("1234567890", "12");
        printDistance("1234567890", "123");
        printDistance("1234567890", "1234");
        printDistance("1234567890", "12345");
        printDistance("1234567890", "123456");
        printDistance("1234567890", "1234567");
        printDistance("1234567890", "12345678");
        printDistance("1234567890", "123456789");
        printDistance("1234567890", "1234567890");
        printDistance("1234567890", "1234567980");

        printDistance("47/2010", "472010");
        printDistance("47/2010", "472011");

        printDistance("47/2010", "AB.CDEF");
        printDistance("47/2010", "4B.CDEFG");
        printDistance("47/2010", "AB.CDEFG");

        printDistance("The quick fox jumped", "The fox jumped");
        printDistance("The quick fox jumped", "The fox");
        printDistance("The quick fox jumped",
                            "The quick fox jumped off the balcany");
        printDistance("kitten", "sitting");
        printDistance("rosettacode", "raisethysword");
        printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
        for (int i = 1; i < args.length; i += 2)
            printDistance(args[i - 1], args[i]);
    }
}
Output:
-->: 0 (1.0)
1234567890-->1: 9 (0.1)
1234567890-->12: 8 (0.2)
1234567890-->123: 7 (0.3)
1234567890-->1234: 6 (0.4)
1234567890-->12345: 5 (0.5)
1234567890-->123456: 4 (0.6)
1234567890-->1234567: 3 (0.7)
1234567890-->12345678: 2 (0.8)
1234567890-->123456789: 1 (0.9)
1234567890-->1234567890: 0 (1.0)
1234567890-->1234567980: 2 (0.8)
47/2010-->472010: 1 (0.8571428571428571)
47/2010-->472011: 2 (0.7142857142857143)
47/2010-->AB.CDEF: 7 (0.0)
47/2010-->4B.CDEFG: 7 (0.125)
47/2010-->AB.CDEFG: 8 (0.0)
The quick fox jumped-->The fox jumped: 6 (0.7)
The quick fox jumped-->The fox: 13 (0.35)
The quick fox jumped-->The quick fox jumped off the balcany: 16 (0.5555555555556)
kitten-->sitting: 3 (0.5714285714285714)
rosettacode-->raisethysword: 8 (0.38461538461538464)
edocattesor-->drowsyhtesiar: 8 (0.38461538461538464)



Some algorithms:

Cosine similarity
Jaccard similarity
Dice's coefficient
Matching similarity
Overlap similarity
...


Source: http://stackoverflow.com/questions/955110/similarity-string-comparison-in-java