Dataless Classification with Descartes

1 What is Descartes?

Descartes is a Java library that implements Explicit Semantic Analysis (ESA) and can be used to perform Dataless Classification.

2 Setup

Here are the steps to get started:

  1. Get Wikipedia

    Download the latest English Wikipedia dump from http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This is a very large file (more than 6GB), so the download will take time.

  2. Index Wikipedia

    Before the library can be used for classification, the downloaded Wikipedia documents need to be indexed for efficient access. These instructions assume a Unix-based machine; for Windows, adjust the commands as described in step 5 (Other platforms). On the command line, run the following to perform indexing:

    $ bash ./descartes index data/enwiki-latest-pages-articles.xml.bz2 data/wiki-index ./config.properties
    

    Note that this assumes that the Wikipedia dump has been downloaded to data/enwiki-latest-pages-articles.xml.bz2. This command (which can take several hours) creates an index of Wikipedia at data/wiki-index using the configuration specified in the file config.properties.

  3. Test ESA on the command line

    After indexing, check that you can generate the Explicit Semantic Analysis (ESA) representation of a piece of text. To do this, run

    $ bash ./descartes esa data/wiki-index 10
    

    This launches an interactive shell, where you can enter text and see its ESA representation. Here is some sample output:

    $ bash ./descartes esa data/wiki-index 10
    Input text (_ to quit): Andre Agassi won the career grand slam.
    Andre Agassi won the career grand slam
    
      1. Grand Slam (tennis) [13.667234420776367]
      2. Andre Agassi [13.004325866699219]
      3. Rafael Nadal [9.345149040222168]
      4. Serena Williams [9.130873680114746]
      5. Grand Slam (golf) [8.913475036621094]
      6. Federer–Nadal rivalry [8.709552764892578]
      7. 2009 French Open [8.332965850830078]
      8. Roger Federer [8.134634017944336]
      9. 2008 Wimbledon Championships [8.078024864196777]
      10. 2008 French Open [7.422852993011475]
    
    Input text (_ to quit): _
    This experiment took 26.089 secs
    
  4. Check whether dataless classification works

    Finally, check that the dataless classification code works with this Wikipedia index. Run

    $ bash ./descartes dataless data/wiki-index data/20NGTest
    

    This performs a dataless classification experiment. Using the data in data/20NGTest, it measures the accuracy of sci.electronics vs. sci.crypt classification and prints the accuracy. Since this code does not use any cache for the features, it can take a fair bit of time. At the end, it should report an accuracy of about 90 percent.

  5. Other platforms

    If you are running this on Windows, then in each setup step above, instead of bash ./descartes, you have to call

    java edu.illinois.cs.cogcomp.descartes.DescartesMain

    with the same arguments. Note that for this to work, all the jars in the bin directory should be in the Java classpath.

3 Setting up an XML-RPC server

Descartes comes with a built-in XML-RPC server that provides methods to get the ESA representation of some text and the dataless similarity between two documents. To set up an XML-RPC server, first create a Wikipedia index as described above.

Use the command startServer to start the server. For example, if the Wikipedia index is located at data/wiki-index and we want the server to listen on port 8142, use the following command:

$ ./descartes startServer data/wiki-index 8142

To connect to the server, create a client connection to http://server-address:8142. The server provides two functions:

  1. DescartesServer.esa: This function takes two parameters: a string, representing the text, and an integer, representing the number of concepts required. It returns the ESA representation of the text as a list of strings.
  2. DescartesServer.similarity: This function takes three parameters: two strings, representing two documents, and an integer, representing the number of concepts. It returns a double that represents the dataless similarity of the two documents.

4 Usage

Here are some code samples of how to use Descartes after indexing is complete.

4.1 Generating Explicit Semantic Analysis

The following example shows how to generate the Explicit Semantic Analysis representation for a given file. To compile this, all the jars in the bin directory need to be in the classpath.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;

import edu.illinois.cs.cogcomp.descartes.retrieval.IResult;
import edu.illinois.cs.cogcomp.descartes.retrieval.ISearcher;
import edu.illinois.cs.cogcomp.descartes.retrieval.SearcherFactory;

/**
 * Generates Explicit Semantic Analysis. Uses the Wikipedia index from index-dir
 * and generates the specified number of concepts for the input file.
 * 
 * Usage: java ESA index-dir num-concepts file
 * 
 * This assumes that descartes-0.1.jar and the other dependencies are in the
 * classpath.
 * 
 */
public class ESA {

    public static void main(String[] args) throws Exception {
        // Three arguments are required: index-dir, num-concepts and file
        if (args.length != 3) {
            System.err.println("Usage: java ESA index-dir num-concepts file");
            System.exit(-1);
        }

        String indexDir = args[0];
        int numConcepts = Integer.parseInt(args[1]);
        String file = args[2];

        // Create a new searcher to search the index
        ISearcher searcher = SearcherFactory.getStandardSearcher(indexDir);

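        // Grab the input text, joining the lines with single spaces
        BufferedReader reader = new BufferedReader(new FileReader(file));
        StringBuilder sb = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(' ');
            }
        } finally {
            // Release the file handle even if reading fails
            reader.close();
        }
        String text = sb.toString();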

        // Clean up the input to make sure that nothing unexpected happens.
        text = text.replaceAll("[^a-zA-Z0-9 ]", "");

        // Get the concepts.
        ArrayList<IResult> results = searcher.search(text, numConcepts);

        // Print them
        for (IResult result : results) {
            // IResult has getters for id, title, document and score.
            System.out.println(result.getTitle());
        }
    }
}
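
The clean-up step in the example above deletes every character outside [a-zA-Z0-9 ]. One side effect is that punctuation between words is removed rather than replaced, so hyphenated names fuse into a single token. The following is a minimal sketch that contrasts the original pattern with a hypothetical alternative that substitutes a space instead. The class and method names are illustrative and not part of Descartes, and whether the difference matters depends on how the index tokenizes text, which is not specified here.

```java
final class InputCleaning {

    // The pattern used in the ESA example: delete anything that is not an
    // ASCII letter, digit, or space.
    static String cleanByDeleting(String text) {
        return text.replaceAll("[^a-zA-Z0-9 ]", "");
    }

    // Hypothetical alternative (not from Descartes): replace each run of
    // disallowed characters with one space so neighbouring words stay
    // separate, then trim the ends.
    static String cleanBySpacing(String text) {
        return text.replaceAll("[^a-zA-Z0-9]+", " ").trim();
    }

    public static void main(String[] args) {
        String text = "Federer-Nadal rivalry, 2008.";
        System.out.println(cleanByDeleting(text)); // FedererNadal rivalry 2008
        System.out.println(cleanBySpacing(text));  // Federer Nadal rivalry 2008
    }
}
```

Both variants also drop non-ASCII letters, so accented names lose characters under either pattern.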

4.2 Dataless Classification

For Dataless classification, we measure the similarity between the text and class labels in the concept space and predict the label that is most similar to the text. Here is example code that shows how to perform dataless classification:

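// Assumes this method lives in a class with a String field indexDirectory
// (the path to the Wikipedia index) and that java.util.Arrays, java.util.List
// and the Descartes classes used below are imported.
public String dataless() {
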
    // The two classes that we want to classify into
    String prototype1 = "science electronics";
    String prototype2 = "science crypt";
    List<String> classPrototypes = Arrays.asList(prototype1, prototype2);

    // Create a searcher. This is similar to how it is done in the ESA
    // example.
    ISearcher searcher = SearcherFactory
            .getStandardSearcher(indexDirectory);

    // Specify the number of concepts to be used. This is a parameter that
    // can be tuned.
    int numConcepts = 1000;

    // Create a classifier
    DatalessClassifier classifier = new DatalessClassifier(searcher,
            numConcepts, classPrototypes);

    // The input text that needs to be classified
    String inputText = "Decryption is the reverse, in other words, "
            + "moving from the unintelligible ciphertext "
            + "back to plaintext.";

    // Perform the classification. This should return one of the prototypes.
    String label = classifier.getLabel(inputText);

    return label;
}
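
The idea described at the start of this section, comparing the text with each class label in the concept space and picking the most similar label, can be sketched without the library. The sketch below assumes sparse concept vectors (concept title mapped to weight) and cosine similarity. Both are assumptions made for illustration: the similarity measure that DatalessClassifier actually uses is not stated here, the class ConceptSpaceSketch is hypothetical, and the concept weights are made up rather than produced by a real index.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class ConceptSpaceSketch {

    // Cosine similarity between two sparse concept vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Return the label whose concept vector is most similar to the text.
    static String predict(Map<String, Double> text,
            Map<String, Map<String, Double>> labels) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : labels.entrySet()) {
            double score = cosine(text, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Toy example with made-up weights (not output of a real index).
    static String demo() {
        Map<String, Double> text = new HashMap<String, Double>();
        text.put("Cryptography", 3.0);
        text.put("Cipher", 2.0);
        text.put("Transistor", 0.5);

        Map<String, Double> crypt = new HashMap<String, Double>();
        crypt.put("Cryptography", 4.0);
        crypt.put("Cipher", 1.0);

        Map<String, Double> electronics = new HashMap<String, Double>();
        electronics.put("Transistor", 3.0);
        electronics.put("Electronics", 2.0);

        Map<String, Map<String, Double>> labels =
                new LinkedHashMap<String, Map<String, Double>>();
        labels.put("science electronics", electronics);
        labels.put("science crypt", crypt);
        return predict(text, labels);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // science crypt
    }
}
```

In this toy data the text shares its two heaviest concepts with the "science crypt" vector, so that label wins. With a real index, the vectors would come from the ESA representations of the text and of each prototype string.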

Notes

  1. Creating a searcher can be expensive, so it is a good idea to create one searcher and reuse it for the whole dataset.
  2. The ISearcher is thread safe, so it can be used in a multi-threaded context.
  3. Once a searcher is created and the class prototypes are fixed, it is a good idea to create one instance of the classifier (again, for efficiency).
  4. Like the ISearcher, the DatalessClassifier is also thread safe.
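
The notes above suggest a usage pattern: build one instance and share it across worker threads. The sketch below illustrates that pattern with a fixed thread pool. To stay self-contained it uses a hypothetical Labeler interface as a stand-in for the getLabel method of DatalessClassifier, and a trivial keyword rule in place of a real classifier; neither is part of Descartes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedClassifierSketch {

    // Hypothetical stand-in for DatalessClassifier.getLabel(String).
    interface Labeler {
        String getLabel(String text);
    }

    // Labels every text using one shared Labeler and a fixed pool of threads.
    // Results are returned in the same order as the inputs.
    static List<String> labelAll(final Labeler shared, List<String> texts,
            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String text : texts) {
                Callable<String> task = new Callable<String>() {
                    public String call() {
                        return shared.getLabel(text);
                    }
                };
                futures.add(pool.submit(task));
            }
            List<String> labels = new ArrayList<String>();
            for (Future<String> future : futures) {
                labels.add(future.get());
            }
            return labels;
        } catch (Exception e) {
            throw new IllegalStateException("Labeling failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // A keyword rule stands in for the real classifier in this sketch.
        Labeler standIn = new Labeler() {
            public String getLabel(String text) {
                return text.contains("cipher")
                        ? "science crypt" : "science electronics";
            }
        };
        List<String> texts = Arrays.asList("a block cipher",
                "a voltage regulator");
        System.out.println(labelAll(standIn, texts, 2));
        // [science crypt, science electronics]
    }
}
```

With the real library, the shared object would be the single DatalessClassifier instance created once from the searcher and the class prototypes.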

4.3 Calling an XML-RPC client

Suppose we have an XML-RPC server running at http://server-address:8142, as described above. Here is some sample Java code that shows how to access its services. This code uses the Apache XML-RPC client.

import java.net.MalformedURLException;
import java.net.URL;

import org.apache.xmlrpc.XmlRpcException;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class DescartesClient {
        public static void main(String[] args) throws MalformedURLException,
                        XmlRpcException {
                XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
                // Replace server-address with the host running the Descartes server
                config.setServerURL(new URL("http://server-address:8142"));
                XmlRpcClient client = new XmlRpcClient();
                client.setConfig(config);

                String s1 = "Bill Clinton";
                String s2 = "Barack Obama";

                // Getting the dataless similarity between the two
                // strings using 1000 concepts.

                Object[] params = new Object[] { s1, s2, Integer.valueOf(1000) };
                Double result = (Double) client.execute("DescartesServer.similarity",
                                params);
                System.out.println(result);


                // Printing the ESA representation of the two strings
                params = new Object[] { s1, Integer.valueOf(10) };

                Object[] esa = (Object[]) client.execute("DescartesServer.esa", params);

                for (int i = 0; i < esa.length; i++) {
                        System.out.println((String)esa[i]);
                }

                System.out.println();
                params = new Object[] { s2, Integer.valueOf(10) };

                esa = (Object[]) client.execute("DescartesServer.esa", params);

                for (int i = 0; i < esa.length; i++) {
                        System.out.println((String)esa[i]);
                }

        }
}


5 Citing this work

TO BE ANNOUNCED

6 Contact

For further information about this package, contact Vivek Srikumar (vsrikum2 at illinois dot edu).

7 References

  • E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
  • M. Chang, L. Ratinov, D. Roth and V. Srikumar, Importance of Semantic Representation: Dataless Classification, Proceedings of the National Conference on Artificial Intelligence (AAAI), 2008.

Author: Vivek Srikumar

Date: 2011-06-28 11:05:21 CDT
