Today we’re excited to announce language enrichment for the Decahose, PowerTrack, and other commercial Twitter feeds. Over the past few years, many of our customers have asked how they can identify the language each Tweet is written in. Starting today, you can use Gnip’s enrichments to do exactly that.
For instance, if you’re using PowerTrack to find all Tweets matching “Coca Cola,” you can now tell which language each of those Tweets is written in. We’re starting with support for eight languages: English, Spanish, German, French, Italian, Dutch, Portuguese, and Swedish. A language is reported for a Tweet when our confidence level for that language allows it. You can expect more language support from Gnip in the coming weeks.
Starting from the open-source JTCL library, we’re using n-gram frequencies to classify each Tweet into a language. We’re thoroughly impressed with the accuracy so far.
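To give a feel for how n-gram frequency classification works in general, here is a minimal Python sketch of a classic rank-order ("out-of-place") approach. It is not Gnip's production code and not JTCL's API. The function names, the toy training sentences, and the parameters `max_n` and `top_k` are all assumptions made for illustration, and a real system would train on far more text per language.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k character n-grams (lengths 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries so prefixes and suffixes count
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )

def detect_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))

# Hypothetical, deliberately tiny training samples (one sentence per language).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(detect_language("the dog sleeps in the house", PROFILES))
```

Short texts like Tweets contain few n-grams, so a classifier of this kind can be unsure. One natural design choice is to compare the best distance against a threshold and leave the language unset when the match is weak.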
We’re excited about the use cases this enables across the industry. We know many of our friends are rapidly adopting Twitter (hello Japan!) and we’re glad to start providing better support for these global conversations.
Language enrichment is the first step toward a powerful language filtering capability for Twitter and Gnip’s 30+ other sources. If you’d like to try Twitter firehose filtering and language enrichment, or to request support for your particular language, send us a note and say… ciao.