Dataless Classification with Descartes
1 What is Descartes?
Descartes is a Java library that implements Explicit Semantic
Analysis and can be used to perform Dataless Classification.
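To give a feel for what an Explicit Semantic Analysis (ESA) representation is, here is a minimal, self-contained sketch. ESA describes a piece of text as a weighted list of Wikipedia concepts, which is the kind of ranked output the library prints. The sketch is a toy illustration and not Descartes's implementation: the three-article stand-in for Wikipedia, the class name ToyEsa, and the plain tf * idf weighting are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of ESA over a hypothetical three-article "Wikipedia". */
public class ToyEsa {
    // Hypothetical miniature corpus: concept title -> article text.
    static final Map<String, String> ARTICLES = new LinkedHashMap<>();
    static {
        ARTICLES.put("Tennis", "tennis racket grand slam wimbledon court serve");
        ARTICLES.put("Cryptography", "cipher key encryption plaintext secret code");
        ARTICLES.put("Electronics", "circuit voltage resistor transistor current");
    }

    /** Splits text into lower-case alphanumeric tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Maps text to concept weights: for each word, add tf * idf per article. */
    static Map<String, Double> esa(String text) {
        // Count terms per article and the number of articles containing each word.
        Map<String, Map<String, Integer>> termCounts = new LinkedHashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map.Entry<String, String> article : ARTICLES.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : tokenize(article.getValue())) {
                counts.merge(word, 1, Integer::sum);
            }
            for (String word : counts.keySet()) {
                docFreq.merge(word, 1, Integer::sum);
            }
            termCounts.put(article.getKey(), counts);
        }
        // Accumulate a weight for every concept whose article shares a word with the text.
        Map<String, Double> vector = new LinkedHashMap<>();
        for (String word : tokenize(text)) {
            Integer df = docFreq.get(word);
            if (df == null) {
                continue; // word does not occur in any article
            }
            double idf = Math.log((double) ARTICLES.size() / df);
            for (Map.Entry<String, Map<String, Integer>> entry : termCounts.entrySet()) {
                Integer tf = entry.getValue().get(word);
                if (tf != null && idf > 0) {
                    vector.merge(entry.getKey(), tf * idf, Double::sum);
                }
            }
        }
        return vector;
    }

    /** Returns the highest-weighted concept, or null if no concept matches. */
    static String topConcept(String text) {
        String best = null;
        double bestWeight = 0.0;
        for (Map.Entry<String, Double> entry : esa(text).entrySet()) {
            if (entry.getValue() > bestWeight) {
                best = entry.getKey();
                bestWeight = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(esa("Andre Agassi won the career grand slam"));
    }
}
```

With a real Wikipedia index the concept space has millions of articles rather than three, which is why the index has to be built before the library can be used.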
2 Setup
Here are the steps to get started:
- Get Wikipedia
  Download the latest English Wikipedia dump from http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This is a very large file (>6GB), so the download will take time.
- Index Wikipedia
  Before the library can be used for classification, the downloaded Wikipedia documents need to be indexed for efficient access. These instructions assume a Unix-based machine; for Windows, the commands need to be changed as described in the "Other platforms" note at the end of this section. On the command line, run the following to perform indexing:
  $ bash ./descartes index data/enwiki-latest-pages-articles.xml.bz2 data/wiki-index ./config.properties
  This assumes that the Wikipedia dump has been downloaded to data/enwiki-latest-pages-articles.xml.bz2. The command, which can take several hours, creates an index of Wikipedia at data/wiki-index using the configuration specified in the file config.properties.
- Test ESA on the command line
  After indexing, check that you can generate the Explicit Semantic Analysis of a piece of text. To do this, run
  $ bash ./descartes esa data/wiki-index 10
  This launches an interactive shell where you can enter text and see its ESA representation. Here is some sample output:
  $ bash ./descartes esa data/wiki-index 10
  Input text (_ to quit): Andre Agassi won the career grand slam.
  Andre Agassi won the career grand slam
  1. Grand Slam (tennis) [13.667234420776367]
  2. Andre Agassi [13.004325866699219]
  3. Rafael Nadal [9.345149040222168]
  4. Serena Williams [9.130873680114746]
  5. Grand Slam (golf) [8.913475036621094]
  6. Federer–Nadal rivalry [8.709552764892578]
  7. 2009 French Open [8.332965850830078]
  8. Roger Federer [8.134634017944336]
  9. 2008 Wimbledon Championships [8.078024864196777]
  10. 2008 French Open [7.422852993011475]
  Input text (_ to quit): _
  This experiment took 26.089 secs
- Check whether dataless classification works
  Finally, check that the dataless classification code works with this Wikipedia index. Run
  $ bash ./descartes dataless data/wiki-index data/20NGTest
  This performs a dataless classification experiment. Using the data in data/20NGTest, it measures the accuracy of sci.electronics vs. sci.crypt classification and prints the result. Because this code does not cache the features, it can take a fair bit of time. At the end, it should report an accuracy of about 90 percent.
- Other platforms
  If you are running this on Windows, then instead of bash descartes you have to call java edu.illinois.cs.cogcomp.descartes.DescartesMain with the same arguments. For this to work, all the jars in the bin directory must be on the Java classpath.
3 Setting up an XML-RPC server
Descartes comes with a built-in XML-RPC server that provides methods to get the ESA of some text and the dataless similarity between two documents. To set up an XML-RPC server, first create a Wikipedia index as described above.
Use the command startServer to start the server. For example, if
the Wikipedia index is located at data/wiki-index and we want the
server to listen on port 8142, use the following command:
$ ./descartes startServer data/wiki-index 8142
To connect to the server, create a client connection to http://server-address:8142. The server provides two functions:
- DescartesServer.esa: Takes two parameters: a string representing the text, and an integer representing the number of concepts required. It returns the ESA representation of the text as a list of strings.
- DescartesServer.similarity: Takes three parameters: two strings representing two documents, and an integer representing the number of concepts. It returns a double that represents the dataless similarity of the two documents.
4 Usage
Here are some code samples showing how to use Descartes after indexing is complete.
4.1 Generating Explicit Semantic Analysis
The following example shows how to generate the Explicit Semantic
Analysis representation for a given file. To compile this, all the
jars in the bin directory need to be in the classpath.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;

import edu.illinois.cs.cogcomp.descartes.retrieval.IResult;
import edu.illinois.cs.cogcomp.descartes.retrieval.ISearcher;
import edu.illinois.cs.cogcomp.descartes.retrieval.SearcherFactory;

/**
 * Generates Explicit Semantic Analysis. Uses the Wikipedia index from index-dir
 * and generates the specified number of concepts for the input file.
 *
 * Usage: java ESA index-dir num-concepts file
 *
 * This assumes that descartes-0.1.jar and the other dependencies are in the
 * classpath.
 */
public class ESA {
    public static void main(String[] args) throws Exception {
        // Three arguments are required: index-dir, num-concepts and file.
        if (args.length != 3) {
            System.err.println("Usage: java ESA index-dir num-concepts file");
            System.exit(-1);
        }

        String indexDir = args[0];
        int numConcepts = Integer.parseInt(args[1]);
        String file = args[2];

        // Create a new searcher to search the index
        ISearcher searcher = SearcherFactory.getStandardSearcher(indexDir);

        // Grab the input text, closing the file when done
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append(" ");
            }
        }
        String text = sb.toString();

        // Clean up the input to make sure that nothing unexpected happens.
        text = text.replaceAll("[^a-zA-Z0-9 ]", "");

        // Get the concepts.
        ArrayList<IResult> results = searcher.search(text, numConcepts);

        // Print them
        for (IResult result : results) {
            // IResult has getters for id, title, document and score.
            System.out.println(result.getTitle());
        }
    }
}
4.2 Dataless Classification
For Dataless classification, we measure the similarity between the text and class labels in the concept space and predict the label that is most similar to the text. Here is example code that shows how to perform dataless classification:
// This method uses java.util.Arrays and java.util.List, and expects
// indexDirectory to hold the path to the Wikipedia index.
public String dataless() {
    // The two classes that we want to classify into
    String prototype1 = "science electronics";
    String prototype2 = "science crypt";

    List<String> classPrototypes = Arrays.asList(prototype1, prototype2);

    // Create a searcher. This is similar to how it is done in the ESA
    // example.
    ISearcher searcher = SearcherFactory
            .getStandardSearcher(indexDirectory);

    // Specify the number of concepts to be used. This is a parameter that
    // can be tuned.
    int numConcepts = 1000;

    // Create a classifier
    DatalessClassifier classifier = new DatalessClassifier(searcher,
            numConcepts, classPrototypes);

    // The input text that needs to be classified
    String inputText = "Decryption is the reverse, in other words, "
            + "moving from the unintelligible ciphertext "
            + "back to plaintext.";

    // Perform the classification. This should return one of the prototypes.
    String label = classifier.getLabel(inputText);

    return label;
}
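The idea of comparing a text with class labels in concept space can be illustrated with a small, self-contained sketch. It is not the library's implementation: the class ToyDataless is hypothetical, the concept vectors are hand-written rather than produced from a Wikipedia index, and cosine similarity is an assumption, since this document does not say which similarity measure DatalessClassifier uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration of picking the label closest to a text in concept space. */
public class ToyDataless {

    /** Cosine similarity between two sparse concept vectors (assumed measure). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            normA += entry.getValue() * entry.getValue();
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        for (double value : b.values()) {
            normB += value * value;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the label whose concept vector is most similar to the text's. */
    static String classify(Map<String, Double> text,
                           Map<String, Map<String, Double>> labels) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> label : labels.entrySet()) {
            double score = cosine(text, label.getValue());
            if (score > bestScore) {
                best = label.getKey();
                bestScore = score;
            }
        }
        return best;
    }

    /** Runs the classifier on hand-written, hypothetical concept vectors. */
    static String demo() {
        Map<String, Double> crypt = new LinkedHashMap<>();
        crypt.put("Cipher", 1.0);
        crypt.put("Encryption", 1.0);
        crypt.put("Cryptography", 1.0);

        Map<String, Double> electronics = new LinkedHashMap<>();
        electronics.put("Transistor", 1.0);
        electronics.put("Electronic circuit", 1.0);

        Map<String, Map<String, Double>> labels = new LinkedHashMap<>();
        labels.put("science crypt", crypt);
        labels.put("science electronics", electronics);

        // Stand-in for the concept vector of a text about decryption.
        Map<String, Double> text = new LinkedHashMap<>();
        text.put("Cipher", 2.0);
        text.put("Encryption", 1.0);
        text.put("Transistor", 0.1);

        return classify(text, labels);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

No labeled training documents appear anywhere in the sketch; the only supervision is the label names themselves, which is what makes the approach "dataless".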
Notes
- Creating a searcher can be an expensive operation, so it is a good idea to create one searcher per index and reuse it.
- The ISearcher is thread-safe, so it can be used in a multi-threaded context.
- Once a searcher is created and the class prototypes are fixed, it is a good idea to create one instance of the classifier and reuse it (again, for efficiency).
- Like the ISearcher, the DatalessClassifier is also thread-safe.
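The notes above suggest a usage pattern: build one instance and share it across worker threads. The following sketch shows that pattern with the standard java.util.concurrent classes. To keep it runnable without a Wikipedia index, a java.util.function.Function stands in for the classifier; the class SharedClassifierDemo and the stand-in function are hypothetical. With Descartes, the function body would presumably be a call such as classifier.getLabel(text) on a single shared DatalessClassifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Shares one (stand-in) classifier instance across a pool of worker threads. */
public class SharedClassifierDemo {

    /** Classifies every text in parallel; results keep the order of the input. */
    static List<String> classifyAll(Function<String, String> classifier,
                                    List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Every task uses the same classifier instance.
            List<Future<String>> futures = new ArrayList<>();
            for (String text : texts) {
                Callable<String> task = () -> classifier.apply(text);
                futures.add(pool.submit(task));
            }
            // Collect the labels in submission order.
            List<String> labels = new ArrayList<>();
            for (Future<String> future : futures) {
                labels.add(future.get());
            }
            return labels;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Classification failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Runs the pattern with a trivial stand-in for a thread-safe classifier. */
    static List<String> demo() {
        Function<String, String> standIn = text ->
                text.contains("cipher") ? "science crypt" : "science electronics";
        List<String> texts = Arrays.asList(
                "the cipher was broken", "the resistor overheated");
        return classifyAll(standIn, texts, 4);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Sharing is only safe because the instance is thread-safe; an object without that guarantee would need one instance per thread or external locking.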
4.3 Calling an XML-RPC client
Suppose we have an XML-RPC server running at http://server-address:8142,
as described above. Here is some sample Java code that shows how to
access the services. This code uses the Apache XML-RPC client.
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.xmlrpc.XmlRpcException;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class DescartesClient {
    public static void main(String[] args) throws MalformedURLException,
            XmlRpcException {
        // Replace server-address with the host that runs the Descartes server.
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://server-address:8142"));

        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        String s1 = "Bill Clinton";
        String s2 = "Barack Obama";

        // Getting the dataless similarity between the two
        // strings using 1000 concepts.
        Object[] params = new Object[] { s1, s2, Integer.valueOf(1000) };
        Double result = (Double) client.execute("DescartesServer.similarity",
                params);
        System.out.println(result);

        // Printing the ESA representation of the two strings,
        // using 10 concepts each.
        params = new Object[] { s1, Integer.valueOf(10) };
        Object[] esa = (Object[]) client.execute("DescartesServer.esa", params);
        for (int i = 0; i < esa.length; i++) {
            System.out.println((String) esa[i]);
        }

        System.out.println();

        params = new Object[] { s2, Integer.valueOf(10) };
        esa = (Object[]) client.execute("DescartesServer.esa", params);
        for (int i = 0; i < esa.length; i++) {
            System.out.println((String) esa[i]);
        }
    }
}
5 Citing this work
TO BE ANNOUNCED
6 Contact
For further information about this package, contact Vivek Srikumar (vsrikum2 at illinois dot edu).
7 References
- E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
- M. Chang, L. Ratinov, D. Roth and V. Srikumar, Importance of Semantic Representation: Dataless Classification, Proceedings of the National Conference on Artificial Intelligence (AAAI), 2008.
Date: 2011-06-28 11:05:21 CDT