For instance, if you’re using Power Track to find all Tweets matching “Coca Cola,” you can now identify the language each of those Tweets is written in. We’re starting with support for eight languages: English, Spanish, German, French, Italian, Dutch, Portuguese, and Swedish, each made available according to our confidence level for that language. You can expect more language support from Gnip in the coming weeks.
Starting from the open-source JTCL, we’re using n-gram frequencies to categorize each Tweet into a language. We’re thoroughly impressed with the accuracy levels thus far.
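To give a feel for how n-gram categorization works in general, here is a minimal Python sketch. It is not Gnip's implementation and it does not use JTCL: it assumes the classic rank-order approach, where each language gets a profile of its most frequent character n-grams and a Tweet is assigned to the language whose profile its own n-grams deviate from least. The function names, the tiny training samples, and the parameters (`n_max`, `top_k`, `max_penalty`) are all hypothetical and chosen for illustration; a real system would train on far larger text collections.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the text's most frequent character n-grams, most frequent first."""
    # Normalize case and whitespace; pad so word boundaries become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from the language cost max_penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the language code whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place_distance(doc, profiles[lang]))


# Toy training samples (assumption: real profiles need much more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs with the "
          "other animals through the forest this is what we think about that",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego corre con "
          "los otros animales por el bosque esto es lo que pensamos de eso",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann "
          "mit den anderen tieren durch den wald das ist was wir darüber denken",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    for tweet in ("the dog runs through the forest",
                  "el perro corre por el bosque",
                  "der hund läuft durch den wald"):
        print(tweet, "->", classify(tweet, PROFILES))
```

The distance score also hints at how a confidence level could be derived: when the best and second-best languages score nearly the same, the result is ambiguous and could be left untagged. That thresholding step is an assumption on our part and is not shown here.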
We’re excited about the use cases this enables across the industry. We know many of our friends are rapidly adopting Twitter (hello Japan!) and we’re glad to start providing better support for these global conversations.
Language enrichment is the first step toward a powerful language filtering capability for Twitter and Gnip’s 30+ other sources. If you’d like to try Twitter firehose filtering and language enrichment or to request support for your particular language, send us a note and say… ciao.