My favorite requirement from customers is the “I want all the data, from all the sources, for all of history, and for all of the future” one. You’re never going to get it, from anyone, so reset your expectations. A few constructs fall out of this request.
Two Types of ‘feeds’
Firehoses are aggregate sources of data for a given publisher. They may, or may not, be a complete representation of that publisher’s data set. Everyone wants firehoses, but truth be told, there are very few of them in the wild, and those that do exist carry less “valuable” data. Think of firehose access as a statistically relevant sample, rather than a truly “complete” set of data.
Seeded feeds, the second type, encompass the majority of data sources, and they require that you know what you’re looking for, be it a keyword, a tag, a user name, a user id, or a geo-location.
In either case you need to know what it is you’re after. Blind, unfettered access to a given publisher’s feed is a rarity and actually isn’t all that interesting in the end; you just think it is because someone else had the product idea first (namely the publisher you want all the data from, e.g. Twitter).
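To make the seeded kind of feed concrete, here is a minimal sketch of what "knowing what you're looking for" can look like in code. Everything in it is hypothetical: the `Seed` class, the `matches` function, and the item fields (`text`, `user_id`, `geo`) are illustrative names, not any publisher's actual API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Seed:
    """What you tell a feed you are looking for (hypothetical shape)."""
    keywords: FrozenSet[str] = frozenset()   # assumed to be lowercase
    user_ids: FrozenSet[int] = frozenset()
    # Bounding box as (min_lat, min_lon, max_lat, max_lon).
    bbox: Optional[Tuple[float, float, float, float]] = None


def matches(seed: Seed, item: dict) -> bool:
    """Return True if the item satisfies any one of the seed's criteria."""
    text = item.get("text", "").lower()
    if any(keyword in text for keyword in seed.keywords):
        return True
    if item.get("user_id") in seed.user_ids:
        return True
    geo = item.get("geo")  # assumed to be a (lat, lon) pair when present
    if seed.bbox is not None and geo is not None:
        lat, lon = geo
        min_lat, min_lon, max_lat, max_lon = seed.bbox
        return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    return False
```

An empty `Seed()` matches nothing, which is the point: without a keyword, user, or location to look for, a feed like this has nothing to give you.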
Storing and indexing lots of data is conceptually simple, yet hard to implement at scale; just ask any of the big-three search engines. You can stuff as much data as possible into a database, and “search” it offline, in order to meet most historical data access requirements, but weaving that into a consumer application with variable access patterns isn’t always easy. While storage costs are generally negligible for today’s highly compressible data, the operational management costs of your locally stored data aren’t.
Processing data in a mode other than the one in which it originated causes an impedance mismatch. Stream-to-offline processing implies that you’ll have gaps in data due to queuing problems. Offline-to-stream suggests the same. Offline-to-offline and stream-to-stream are generally easy to get your head and code around, but be wary of overloading stream processing with too much work, as it then starts to feel like stream-to-offline. Once you enter that world, you need to solve parallel processing problems in real time.
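The queuing gaps mentioned above are easy to see in a toy simulation. The sketch below is an illustration under stated assumptions, not a model of any real pipeline: the `simulate` function and its parameters are hypothetical, the producer emits one item per tick, the consumer finishes one item every `consume_every` ticks, and a full queue simply drops the incoming item.

```python
from collections import deque


def simulate(n_items: int, queue_size: int, consume_every: int):
    """Run a stream through a bounded queue and report what survived.

    Assumes queue_size >= 1 and consume_every >= 1.
    Returns (processed, dropped) as lists of item ids.
    """
    queue = deque()
    processed, dropped = [], []
    for tick in range(n_items):
        # The stream does not wait: if the queue is full, the item is lost.
        if len(queue) < queue_size:
            queue.append(tick)
        else:
            dropped.append(tick)
        # The consumer only completes work on every `consume_every`-th tick.
        if tick % consume_every == consume_every - 1:
            processed.append(queue.popleft())
    # Once the stream ends, the consumer drains whatever is still queued.
    processed.extend(queue)
    return processed, dropped


if __name__ == "__main__":
    kept, lost = simulate(n_items=10, queue_size=2, consume_every=3)
    print("processed:", kept)
    print("dropped:  ", lost)
```

When the consumer keeps pace (`consume_every=1`) nothing is dropped; as soon as each item takes longer to process than it takes to arrive, the dropped list grows for as long as the stream runs. A bigger queue only delays the first gap.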
Regardless of access pattern, you can only introspect and access the data you initially seeded your sources with. If your seed was wrong, for example you used the wrong set of users or keywords, processing the data doesn’t matter. Full circle to garbage in, garbage out.
If you find yourself making the introductory request of your team and/or a vendor, I suggest you don’t actually have the focus on your product or idea that you’ll ultimately need in order to be successful. Batten down the hatches, and get crisp about precisely what it is you want to build, and precisely what data you need to do so. If you can do that, you will have a shot at success.