We’ve spent the past few weeks at Gnip working on an infrastructure that allows us to expand shortened URLs as they come through the Twitter Firehose. Here’s an overview of our architecture:
Note that I’ve removed redundant machines used for failover to simplify things a bit. Each rectangular node represents a Java process running on a separate EC2 instance.
Connector: A lightweight client that consumes the full Twitter Firehose via Streaming HTTP. On the way in, each Tweet is wrapped in an internal (JSON) messaging format that allows us to attach arbitrary metadata to it. The messages are then fanned out to n Deliverators and Enricherators.
Enricherator: Ingests the full Twitter Firehose from Connector via Streaming HTTP. Inspects each Tweet and plucks out any URLs that exist in the ‘entities’ payload. Compares these URLs to a list of known URL shorteners (t.co, bit.ly, etc.) and, if there are any matches, ships those URLs over to Bosserator via Google Protocol Buffers (GPB) RPC to be expanded.
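To make the matching step concrete, here is a minimal sketch in Python (our real processes are Java, so treat this purely as illustration). It assumes each entry under ‘entities’ carries a `url` field, and the shortener list is a hypothetical two-entry stand-in; the function name `shortened_urls` is made up for this example.

```python
from urllib.parse import urlparse

# Hypothetical shortener list; the real list is longer than these two hosts.
KNOWN_SHORTENERS = {"t.co", "bit.ly"}


def shortened_urls(tweet):
    """Return the URLs in the Tweet's 'entities' payload whose host is a known shortener."""
    entries = tweet.get("entities", {}).get("urls", [])
    matches = []
    for entry in entries:
        url = entry.get("url", "")
        # urlparse(...).hostname is already lowercased; it is None for malformed URLs.
        host = urlparse(url).hostname or ""
        if host in KNOWN_SHORTENERS:
            matches.append(url)
    return matches
```

Only the URLs returned here would need to be sent on for expansion; everything else passes through untouched.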
Also, as Enricherator receives each Tweet, the Tweet is inserted into a queue. The queue contains Tweets that are waiting to be decorated with an expanded URL. If a response comes back from Bosserator for a given Tweet, Enricherator adds the expanded URL metadata to that Tweet. After a predetermined number of seconds, the Tweet is removed from the queue and shipped off to Deliverator, regardless of whether or not it has been decorated.
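The hold-then-release behavior described above could be sketched roughly like this. The class name `HoldingQueue`, the `expanded_urls` metadata key, and the injectable clock are all assumptions made for the example, and it assumes Tweet IDs are unique.

```python
import time
from collections import OrderedDict


class HoldingQueue:
    """Holds messages for a fixed time so late-arriving expansions can be attached."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock
        self._pending = OrderedDict()  # tweet_id -> (arrival_time, message), oldest first

    def add(self, tweet_id, message):
        self._pending[tweet_id] = (self.clock(), message)

    def decorate(self, tweet_id, short_url, expanded_url):
        """Attach an expansion if the Tweet is still held; return False if it already left."""
        entry = self._pending.get(tweet_id)
        if entry is None:
            return False
        entry[1].setdefault("expanded_urls", {})[short_url] = expanded_url
        return True

    def release_due(self):
        """Remove and return every message held for at least hold_seconds, decorated or not."""
        now = self.clock()
        released = []
        while self._pending:
            tweet_id, (arrived, message) = next(iter(self._pending.items()))
            if now - arrived < self.hold_seconds:
                break  # everything behind this entry arrived later
            del self._pending[tweet_id]
            released.append(message)
        return released
```

The trade-off is that a longer hold time means more Tweets get decorated, at the cost of more delay on the enriched stream.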
Bosserator: Holds onto and actively manages a queue of expand-URL tasks from Enricherator. Exposes this task queue via GPB RPC to Fetchors. Listens for decorated responses from Fetchors and ships them over to Enricherator as they are received.
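As a rough in-memory illustration of that task queue (leaving out the GPB RPC layer entirely), the sketch below tracks waiting and in-flight tasks. The names `ExpandTaskQueue`, `submit`, `take`, and `complete`, and the task fields, are hypothetical and not our actual interface.

```python
from collections import deque
from itertools import count


class ExpandTaskQueue:
    """Queue of expand-URL tasks: producers submit, workers take batches and report results."""

    def __init__(self):
        self._ids = count(1)
        self._waiting = deque()  # tasks not yet handed to a worker
        self._in_flight = {}     # task_id -> task handed out, awaiting a result

    def submit(self, tweet_id, short_url):
        task = {"task_id": next(self._ids), "tweet_id": tweet_id,
                "short_url": short_url, "expanded_url": None}
        self._waiting.append(task)
        return task["task_id"]

    def take(self, n):
        """Hand out up to n waiting tasks, oldest first."""
        batch = []
        while self._waiting and len(batch) < n:
            task = self._waiting.popleft()
            self._in_flight[task["task_id"]] = task
            batch.append(task)
        return batch

    def complete(self, task_id, expanded_url):
        """Record a result and return the finished task, or None if the task is unknown."""
        task = self._in_flight.pop(task_id, None)
        if task is None:
            return None
        task["expanded_url"] = expanded_url
        return task
```

Because workers pull batches when they have capacity, a slow worker simply asks for less work rather than falling behind on work pushed to it.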
Fetchor: When Fetchor has available resources, it asks Bosserator for n tasks via GPB RPC. For each task, Fetchor issues an HTTP HEAD request against the shortened URL. Fetchor’s HTTP client will follow up to 10 redirects before returning a response. When a response is received, the expanded URL is inserted into the relevant task and the task is shipped back over to Bosserator, which then immediately ships it to Enricherator.
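Here is a minimal sketch of expanding a URL with HEAD requests and a redirect cap, using only the Python standard library. The helper names `http_head` and `expand_url` are made up, the set of redirect status codes is an assumption, and returning the last URL reached when the cap is hit is one possible choice rather than a description of what our client does.

```python
import http.client
from urllib.parse import urljoin, urlsplit

MAX_REDIRECTS = 10
REDIRECT_CODES = {301, 302, 303, 307, 308}  # assumed set of statuses to follow


def http_head(url, timeout=5):
    """Issue one HEAD request; return (status_code, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def expand_url(short_url, head=http_head, max_redirects=MAX_REDIRECTS):
    """Follow at most max_redirects redirects and return the last URL reached."""
    url = short_url
    for _ in range(max_redirects):
        status, location = head(url)
        if status not in REDIRECT_CODES or not location:
            return url
        url = urljoin(url, location)  # Location may be relative to the current URL
    return url
```

HEAD keeps each hop cheap because no response body is transferred, and the cap keeps a redirect loop from tying up a worker indefinitely.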
Deliverator: Ingests the full firehose (in realtime) from Connector and an enriched firehose (delayed) from Enricherator via Streaming HTTP. Deliverator then determines what kind of stream the client is asking for (User Mention Stream, Link Stream, etc.), what format they want it delivered in (original or JSON Activity Streams), and inserts any enrichments that have been enabled for that stream. The stream is then made available to your Gnip Data Collector via Streaming HTTP.
If you would like to beta test the expand URLs feature, please contact us at email@example.com. Be sure to sign up for our newsletter at http://gnip.com/newsletter to hear about new features that we have coming in the pipeline. Feel free to leave comments and suggestions about how we can make URL expanding more useful to the problem you’re trying to solve.