
For my first English blog entry, I could not have found a more appropriate topic. I'm still looking for a way to provide both Portuguese and English content on this blog.

Meanwhile, English readers can read my blog entries through Google Translate. Google's Language API is the reason for this post.

Lucene is probably one of my favorite frameworks of all time, and I love everything related to it: Hadoop, Nutch, Solr and Hibernate Search.

I use Lucene whenever I can :) One of the things we built with it was a federated search for JBoss Portal. We indexed every kind of document uploaded to the CMS portal using interceptors. One of the problems we faced was automatic language detection: because Lucene needs an analyzer to properly index a document, we needed a specific analyzer for each language. Well, at the time we failed miserably at that. It was a restriction we did not give much attention to, since we were only indexing Portuguese documents.

This week started with that restriction on my mind. At first I thought I could find an open source API for this, but I only found a few desktop apps, all closed source.

What if I used some kind of classifier, for instance a Naive-Bayes classifier, to classify my documents? I could download a few hundred documents from Wikipedia, in different languages, train it, and then use it. Wow! That seemed cool, but it would require some effort (and I'm feeling lazy this week).
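
For the curious, here is roughly what I had in mind: a minimal sketch of a Naive-Bayes classifier over character trigrams with add-one smoothing. The class name, the trigram features and the smoothing are all my own assumptions, not an existing library.

import java.util.HashMap;
import java.util.Map;

public class NaiveBayesLanguageGuesser {

    // trigram counts per language, and total trigrams seen per language
    private final Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
    private final Map<String, Integer> totals = new HashMap<String, Integer>();

    public void train(String language, String text) {
        Map<String, Integer> langCounts = counts.get(language);
        if (langCounts == null) {
            langCounts = new HashMap<String, Integer>();
            counts.put(language, langCounts);
            totals.put(language, 0);
        }
        for (int i = 0; i + 3 <= text.length(); i++) {
            String trigram = text.substring(i, i + 3).toLowerCase();
            Integer c = langCounts.get(trigram);
            langCounts.put(trigram, c == null ? 1 : c + 1);
            totals.put(language, totals.get(language) + 1);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : counts.keySet()) {
            Map<String, Integer> langCounts = counts.get(language);
            int total = totals.get(language);
            double score = 0.0;
            for (int i = 0; i + 3 <= text.length(); i++) {
                String trigram = text.substring(i, i + 3).toLowerCase();
                Integer c = langCounts.get(trigram);
                // add-one smoothing so unseen trigrams do not zero the score
                score += Math.log((c == null ? 1.0 : c + 1.0) / (total + langCounts.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best;
    }
}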

So I was checking GWT extensions (because GWT is the coolest thing that ever happened to the presentation layer), and I found the translation API, which, BTW, has a method to detect the language. Now my problem is really solved. The API relies on REST and JSON, which makes it really simple to use. I started by extracting random pieces of text from the documents and asking Google to classify them. I used this approach to avoid hitting a quote or an abstract in a paper, which could lead to a wrong language detection. Once we have the correct language, we can instantiate the appropriate Analyzer.
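
To illustrate the sampling idea, here is a minimal sketch of a helper that pulls a few random slices out of the extracted document text before sending it off for detection. The class name, slice count and slice length are my own choices, not anything the API dictates.

import java.util.Random;

public class SnippetSampler {

    private static final Random RANDOM = new Random();

    // Picks a few random slices of the document text so that a quote or an
    // abstract in a foreign language does not dominate the sample sent to
    // the detection service. Hypothetical helper, not part of the API.
    public static String randomSample(String documentText, int pieces, int pieceLength) {
        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < pieces; i++) {
            int start = RANDOM.nextInt(Math.max(1, documentText.length() - pieceLength));
            sample.append(documentText, start, Math.min(documentText.length(), start + pieceLength));
            sample.append(' ');
        }
        return sample.toString().trim();
    }
}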

The code below uses JSON.simple to parse the JSON response from Google.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

import org.json.simple.JSONObject;
import org.json.simple.JSONValue;

public class DetectLanguage {

    public static void main(String[] args) {
        try {
            // Sample text (Portuguese) sent to the detection service
            String s = URLEncoder.encode("Há tantos burros mandando em homens de inteligência, que, às vezes, fico pensando que a burrice é uma Ciência", "UTF-8");
            URL url = new URL("http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=" + s);

            // Read the whole JSON response into a buffer
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
            StringBuilder buffer = new StringBuilder();
            String str;
            while ((str = in.readLine()) != null) {
                buffer.append(str);
            }
            in.close();

            // The detection result lives under "responseData"
            JSONObject obj = (JSONObject) ((JSONObject) JSONValue.parse(buffer.toString())).get("responseData");
            System.out.println(obj.get("language"));
            System.out.println(obj.get("confidence"));
        } catch (IOException e) {
            // UnsupportedEncodingException and MalformedURLException are both IOExceptions
            e.printStackTrace();
        }
    }
}

The API provides not only the detected language but also a confidence value.
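
And to close the loop with Lucene, here is a minimal sketch of how the detected language code could be mapped to an analyzer. The particular mapping and the fallback to StandardAnalyzer are my own choices, and depending on your Lucene version the analyzer constructors may take a Version argument.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.br.BrazilianAnalyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerFactory {

    // Maps the language code returned by the detection service to a Lucene
    // analyzer from the contrib-analyzers package. Only a handful of languages
    // are covered here; anything else falls back to the StandardAnalyzer.
    public static Analyzer forLanguage(String languageCode) {
        if ("pt".equals(languageCode)) {
            return new BrazilianAnalyzer();
        } else if ("de".equals(languageCode)) {
            return new GermanAnalyzer();
        } else if ("fr".equals(languageCode)) {
            return new FrenchAnalyzer();
        }
        return new StandardAnalyzer();
    }
}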

Happy coding, and I hope you enjoy this API as much as I did :)
