
There are good open source options for each step here - is the solution you are looking for just a UI and easy install process? Or would your ideal solution make all of the decisions for you - data structure and format, which data is/isn't valid, what output options are possible, managing server resources, etc.?


I'm new to this sort of thing - can you elaborate on some of the open-source options for those steps?


I'm not very familiar with the open source options, since after many years of coding this by hand I work with what I know. I'm a developer who works with data, not a Data Scientist, so I don't really know the lingo and whatever hipstery terms people are using these days. I'll answer to the best of my ability, though (mostly for my own sake - who knows if this will be useful):

Cleaning:

OpenRefine seems to be the best product in this category. I haven't used anything but my own tools for this, so I can't really offer any advice.
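
For context, a hand-rolled cleaning pass usually looks something like this - a minimal sketch using pandas, where the file and column names are made up for illustration:

  import pandas as pd

  # Hypothetical input file and columns - substitute your own.
  df = pd.read_csv("raw_events.csv")

  # Drop exact duplicates and rows missing the fields we need.
  df = df.drop_duplicates()
  df = df.dropna(subset=["timestamp", "value"])

  # Normalize types: parse timestamps, coerce bad numbers to NaN, drop them.
  df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
  df["value"] = pd.to_numeric(df["value"], errors="coerce")
  df = df.dropna(subset=["timestamp", "value"])

  df.to_csv("clean_events.csv", index=False)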

Warehousing:

My understanding is that this is just a fancy way of talking about a database with a schema designed for analytics. There are many open source databases which do this very well; the one I use is Cassandra (and/or KairosDB), though it is also probably the hardest to use. For a beginner, you might want to refer to this SO answer: http://stackoverflow.com/questions/8816429/is-there-a-powerf...
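
To make "schema designed for analytics" concrete, here's a rough sketch of a time-series table in Cassandra via the Python driver - the keyspace, table, and column names are all invented for illustration, and I'm assuming the "analytics" keyspace already exists:

  from datetime import date, datetime
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect("analytics")

  # Partition by (metric, day), cluster by time: a query like "all points
  # for metric X on day Y" becomes a single sequential read.
  session.execute("""
      CREATE TABLE IF NOT EXISTS metrics_by_day (
          metric  text,
          day     date,
          ts      timestamp,
          value   double,
          PRIMARY KEY ((metric, day), ts)
      )
  """)

  session.execute(
      "INSERT INTO metrics_by_day (metric, day, ts, value) VALUES (%s, %s, %s, %s)",
      ("cpu_load", date(2016, 1, 15), datetime(2016, 1, 15, 12, 0), 0.42),
  )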

Data processing/collection:

This is incredibly dependent on the data sources, so I probably can't tell you anything that will help. Most of the data sources I've worked with have been internally sourced log files, messages from ZMQ, or CSV data - you might be working with something far different, since there are lots of common public data sets and such. Ideally this would be integrated into the tools you are using to clean the data, but I don't know if that exists.

Handling input from many different sources at different rates is not a very hard problem to solve if your system is built correctly - you could, for example, run a daemon for each data source which populates the database when new data is available, then sends a message off to the processing engine, which integrates the data into whatever reports you are running.
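
A minimal sketch of one such daemon, assuming ZMQ for the notification channel; poll_source() and insert_rows() are placeholders for your own source polling and database layer:

  import time
  import zmq

  def poll_source():
      """Placeholder: fetch whatever new rows the source has (ZMQ, CSV, logs...)."""
      return []

  def insert_rows(rows):
      """Placeholder: write the rows into the warehouse."""
      pass

  context = zmq.Context()
  publisher = context.socket(zmq.PUB)
  publisher.bind("tcp://*:5556")  # the processing engine subscribes here

  while True:
      rows = poll_source()
      if rows:
          insert_rows(rows)
          # Tell the processing engine that new data has landed.
          publisher.send_json({"source": "my_source", "count": len(rows)})
      time.sleep(5)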

Specifically for the hedge fund use case, reports could be triggered by a message sent when new data is available, and processing could be done in parallel on Lambda or similar, depending on need, to get a nearly instant return - enabling near-real-time reporting.
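
As a sketch of the trigger side, assuming AWS and boto3 - the function name and payload shape are made up:

  import json
  import boto3

  lambda_client = boto3.client("lambda")

  def on_new_data(source, batch_id):
      # Fire-and-forget ("Event") invocation, so many report jobs can
      # run in parallel as data arrives.
      lambda_client.invoke(
          FunctionName="run-report",  # hypothetical Lambda function
          InvocationType="Event",
          Payload=json.dumps({"source": source, "batch": batch_id}),
      )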



