The article mentions having a long history of data, but I want to stress a corollary of this: if you think you might eventually want to sell data, either put your data in an append-only data format now so that history is preserved, or at least take regular snapshots.
Otherwise, any time you update the data in place, you make it impossible to reconstruct what the data "looked like at the time" for a backtest.
Yep. You typically need to incubate your data for several quarters to demonstrate a real correlation. And you have to balance how you prove this against the fact that some firms will try to reproduce the data collection internally once they learn how you get it.
What do you mean by this? I'm imagining, say, rainfall per day, with a column for date and a column for rainfall. What's useful about getting the data as it was vs. just having a total dataset with accurate dates?
For something as simple as a sensor reading, you're right: the most natural way to store it is basically an append-only time-series database, so it doesn't require much special care.
But say the Rainfall table had a foreign key to a Sensor table that held the coordinates of the sensor. If the physical sensor were moved, the most straightforward schema would call for the coordinates in the Sensor table to be updated, but then the historic data would have the wrong coordinates if you did a join.
The ideal solution is to design a bitemporal schema so that rows are only ever inserted, not updated, but failing that, regular automatic database snapshots are a good start.
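A minimal sketch of the idea (table and column names are illustrative; a fully bitemporal design would also record "when we learned it" and insert correction rows rather than ever touching valid_to):

```python
import sqlite3

# Sketch: interval-versioned sensor locations instead of update-in-place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_location (
    sensor_id  INTEGER NOT NULL,
    lat        REAL    NOT NULL,
    lon        REAL    NOT NULL,
    valid_from TEXT    NOT NULL,  -- when the sensor was actually at (lat, lon)
    valid_to   TEXT               -- NULL = still there
);
CREATE TABLE rainfall (
    sensor_id INTEGER NOT NULL,
    obs_date  TEXT    NOT NULL,
    mm        REAL    NOT NULL
);
""")

# The sensor moved on 2024-06-01: close the old interval, insert a new row.
conn.execute("INSERT INTO sensor_location VALUES (1, 51.50, -0.12, '2023-01-01', '2024-06-01')")
conn.execute("INSERT INTO sensor_location VALUES (1, 52.20,  0.12, '2024-06-01', NULL)")
conn.execute("INSERT INTO rainfall VALUES (1, '2023-07-15', 4.2)")

# The join picks the coordinates that were valid on the observation date,
# so historic readings keep their historic location.
print(conn.execute("""
    SELECT r.obs_date, r.mm, s.lat, s.lon
    FROM rainfall r
    JOIN sensor_location s
      ON s.sensor_id = r.sensor_id
     AND r.obs_date >= s.valid_from
     AND (s.valid_to IS NULL OR r.obs_date < s.valid_to)
""").fetchall())  # [('2023-07-15', 4.2, 51.5, -0.12)]
```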
I run a business in this space. Realistically, your chances of having data that is useful to a HF (of any kind) are pretty low, so I wouldn't bank on it as a revenue source unless you have a strong reason to believe (1) your data is predictive of something an investor cares about and (2) it isn't already covered by other data.
There are also ways to reliably find and curate this data for trading firms and asset managers. But that usually requires a somewhat uncommon blend of skills: statistics, web scraping, reverse engineering, or some domain expertise that gives you an edge in (legally) finding and using nonpublic data. Scrappiness also helps a lot.
Out of curiosity, what resolution and time scale are useful? Is it fair to assume that most hedge funds are relatively good at tracking recent information, and that the value is in older archives that are hard to collect?
Also, are event streams and large connected event graphs like what Forge.AI sells actually useful?
Curious because my PhD research potentially has applications for information extraction and event linking, but I'm not entirely sure whether those applications are actually valuable.
> Is it fair to assume that most hedge funds are relatively good at tracking recent information, and that the value is in older archives that are hard to collect?
No, like he said, the most valuable data has a signal that is independent of other data sets, which generally means something proprietary, and almost always without a long history. Having a cleaned archive of standard data with a long history is valuable but already pretty well served. It would be hard to compete with CRSP.
If you shoot me an email I can tell you very quickly if it's viable, and direct you to specific people at firms who would buy it. I probably don't need a sample if you give an honest description of it.
Because scraping data sucks, occasionally has compliance concerns, and is a different core competency from trading. They would rather offload all of the bullshit involved in maintaining a robust scraping operation than pay their research team to do it.
Time spent on maintaining a scraping operation is time taken away from optimizing your ETL process and producing actionable research for your trading team. You know how people pay to have their pipes unclogged even when they know how it's done? Same idea.
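To give a flavor of that plumbing, even a minimal "robust" scraper accretes retry and archival logic like the sketch below (the URL and snapshot directory are placeholders):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch only: retries with backoff, plus raw snapshots on disk so the
# "as of" history is preserved for later backtests.
session = requests.Session()
retry = Retry(total=5, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retry))

def snapshot(url: str, out_dir: Path = Path("snapshots")) -> Path:
    """Fetch a page and archive the raw bytes, keyed by URL hash and UTC timestamp."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir.mkdir(exist_ok=True)
    path = out_dir / f"{key}_{stamp}.html"
    path.write_bytes(resp.content)
    return path

snapshot("https://example.com/some-kpi-page")  # placeholder URL
```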
If all you have is some data scraped off a few web sites that an intern could collect in a week or three, then it's unlikely to be valuable enough for them to pay you a substantial sum of money.
The most valuable data is data that is difficult to gather. Think things like proprietary (i.e. unpublished) industry data. The canonical "sexy" alternative data set sold to hedge funds is counts of cars in retail parking lots from satellite photos.
> If all you have is some data scraped off a few web sites that an intern could collect in a week or three, then it's unlikely to be valuable enough for them to pay you a substantial sum of money.
If the data is compelling and clearly correlates to earnings KPIs, I can tell you from experience that "some data scraped off a few websites" can be salable to the tune of $50,000/quarter. Hedge funds will frequently choose to pay that instead of setting up their own scraping operation, because scraping sucks.
Not all hedge funds, of course. Some do actively try to reverse engineer your dataset. But you probably don't want to work with those anyway.
> The canonical "sexy" alternative data set sold to hedge funds is counts of cars in retail parking lots from satellite photos.
I've personally developed forecasts from (ostensibly) public, scraped sources which beat drone and satellite footage of manufacturing facilities. That one Bloomberg article is not representative; satellite footage sounds sexy but it's not what most alternative data looks like.
I disseminate real-time transaction information from blockchain mempools (BTC, ETH) and flag any that create large state updates. Is this useful information for any hedge funds in the crypto space?
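Concretely, the flagging side of that is roughly the sketch below (using web3.py's v6 API; the node endpoint and threshold are placeholders, and raw ETH value is only a crude proxy for how large a state update a transaction will cause):

```python
import time
from web3 import Web3

# Sketch: watch pending transactions and flag large-value ones.
w3 = Web3(Web3.HTTPProvider("https://your-eth-node.example"))  # placeholder
pending = w3.eth.filter("pending")
THRESHOLD_ETH = 500  # arbitrary cutoff

while True:
    for tx_hash in pending.get_new_entries():
        try:
            tx = w3.eth.get_transaction(tx_hash)
        except Exception:
            continue  # tx may have been mined or dropped before we fetched it
        value_eth = Web3.from_wei(tx["value"], "ether")
        if value_eth >= THRESHOLD_ETH:
            print(f"flag {tx_hash.hex()}: {value_eth} ETH -> {tx['to']}")
    time.sleep(1)
```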
Trying to get my credit union to allow users to opt in to selling bulk anonymized checking account transactions to a hedge fund. Having that much point-of-sale data would be huge.
There is software that can collect such data, and it can definitely be useful to hedge funds. But note that data available to the general public is most probably something these funds already have, since they have spent years developing systems to gather and analyze data to stay ahead of the curve.
[1] Quantale (https://quantale.io/) is a data collection and analytics platform my company is developing.
Does anyone else find it weird when a business like this uses the .org domain? Not to distract from the main discussion, just curious what people think on here.
It's not a business. It's a stale content marketing effort run by one of the incumbents in the space, who leveraged it as a means to promote themselves in front of their competitors.
Indeed, you are correct (at least about it being run by the incumbents):
> AlternativeData.org is supported and maintained by YipitData and sources its content from hundreds of contributing investors, data providers, and industry professionals.
It would make sense for this to be aimed at data consumers rather than data producers, considering that the advice is, honestly, unrealistic.
1. Social Media Data - Twitter mentions etc
2. Website scraping (trawling LinkedIn, Alexa, GitHub) for evidence of activity
3. Credit Card Data - aggregate and POS data
4. Satellite Data - How many cars in the Walmart parking lot.
Those are a few of the shinier alternative data sets in terms of interest. A lot of quants also pay big bucks for more micro or HFT data.
Honestly, if you have a predictive data set, it is most likely more profitable to use it to invest yourself than to sell it. It's like a successful startup raising VC funding: the good ones don't need the investors, the bad ones do.
Source is I used to work in this space. Clearly there are operational difficulties to investing, and serious domain knowledge is required, but if you have data with alpha, it's worth insane amounts of money (seven figures a month).
I disagree. Sell-side research using so-called alternative data is a different skillset from trading and buy-side research. Some datasets carry sufficient alpha in and of themselves to trade on; some have edge (i.e. are not yet priced in) but require more sophisticated analysis. Good data is necessary but insufficient for developing viable trading strategies.
Source: Also used to work in this space. Still do, but not on the sell-side anymore. I wouldn't give trading capital to someone if their entire pitch was just possession of exclusive, useful data.
I used to work in buy-side research. Most of the good datasets have already been stripped of alpha, but they are bought and sold anyway, because funds have to know what the data is saying to keep up with the other funds. A truly new dataset with independent alpha is a goldmine. There are ones that add to the edge and require quants, but in my experience all the valuable alternative data can be normalized and then backtested against the stock price. If you can't do that with it, it's not that valuable outside a few hedge funds.
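To make "normalized and then backtested against the stock price" concrete, a toy version of that sanity check might look like this (the file and column names are hypothetical; a real test would align to earnings periods and control for what's already priced in):

```python
import pandas as pd

# Toy check: does the normalized alt-data signal correlate with forward returns?
df = pd.read_csv("kpi.csv", parse_dates=["date"]).set_index("date")
# expected columns: 'kpi' (e.g. weekly scraped subscriber count), 'close'

# Normalize the raw signal: rolling z-score over ~1 year of weekly data.
kpi_z = (df["kpi"] - df["kpi"].rolling(52).mean()) / df["kpi"].rolling(52).std()

# 4-week forward return of the stock.
fwd_ret = df["close"].pct_change(4).shift(-4)

# If this is near zero, the dataset probably isn't worth much.
print(kpi_z.corr(fwd_ret))
```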
Surely a hedge fund has three advantages compared to investing yourself:
* Higher amounts of money to invest immediately, so they can better capitalise on the data quickly.
* Balance investment risk over many different assets - they can take a higher degree of risk/reward as any single failure in investment is unlikely to topple them
* Combine multiple data sets with other trading strategies to maximise returns.
I'd ask: "If this data is truly useful in a financial market context, why are you selling it instead of investing with it yourself? What data are you collecting that you're *not* selling?"
Because sometimes the data is useful when combined with other data you might not have yourself. Example: you have data on number of Netflix subs. That’s great. It would be even more useful if you combine that with Netflix churn data.
You can think of company analysis as a jigsaw puzzle. You may have a few pieces. That by itself isn't useful unless you have the other pieces.
This works in theory, but in practice, when you start making estimates of engagement numbers that are twice removed from the underlying stock price, you get into trouble. The further you get from a direct relationship to the stock price, the harder it is to predict anything in practice.
What you're saying is correct but also puts a non-sophisticated seller in a poor negotiating position. They do not know what existing data the buyer uses or how their data can be complementary to anything.
I mean, I guess that's something that you can do but it's likely to get you laughed out of the room and greatly encourage them to reverse engineer whatever you were trying to sell.
If the data only made you a few dollars per thousand dollars traded, it would take you a long time to make meaningful income from it, as opposed to a market maker, who pushes enough volume for that margin to add up.
This is a good heuristic for the sale of trading strategies. If someone has a viable trading strategy, they should be raising capital or joining a prop shop instead of selling it.
But it's not maximally rational to trade on data instead of selling it if you don't have any experience trading.
And even if you're somehow able to figure out a trading strategy on this with no expertise, there are extremely few data sources that are actually proprietary so alpha decay is a very real thing.