The article mentions having a long history of data, but I want to stress a corollary of this: if you think you might eventually want to sell data, either put your data in an append-only data format now so that history is preserved, or at least take regular snapshots.
Otherwise, any time you update the data in place, you make it impossible to reconstruct what the data "looked like at the time" for a backtest.
Yep. You typically need to incubate your data for several quarters to demonstrate a real correlation. And you have to balance how you prove this against the fact that some firms will try to reproduce the data collection internally once they learn how you get it.
What do you mean by this? I'm imagining, say, rainfall per day, with a column for date and a column for rainfall. What's useful about getting the data as it was vs. just having a total dataset with accurate dates?
For something as simple as a sensor reading, you're right: the most natural way to store it is basically an append-only time-series database, so it doesn't require much special care.
But say the Rainfall table had a foreign key to a Sensor table that held the coordinates of the sensor. If the physical sensor were moved, the most straightforward schema would call for the coordinates in the Sensor table to be updated, but then the historic data would have the wrong coordinates if you did a join.
The ideal solution is to design a bitemporal schema so that rows are only ever inserted, not updated, but failing that, regular automatic database snapshots are a good start.
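A minimal sketch of the idea (table and column names are illustrative; a fully bitemporal design would also record "when we learned it" and insert correction rows rather than ever touching valid_to):

```python
import sqlite3

# Sketch: interval-versioned sensor locations instead of update-in-place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensor_location (
    sensor_id  INTEGER NOT NULL,
    lat        REAL    NOT NULL,
    lon        REAL    NOT NULL,
    valid_from TEXT    NOT NULL,  -- when the sensor was actually at (lat, lon)
    valid_to   TEXT               -- NULL = still there
);
CREATE TABLE rainfall (
    sensor_id INTEGER NOT NULL,
    obs_date  TEXT    NOT NULL,
    mm        REAL    NOT NULL
);
""")

# The sensor moved on 2024-06-01: close the old interval, insert a new row.
conn.execute("INSERT INTO sensor_location VALUES (1, 51.50, -0.12, '2023-01-01', '2024-06-01')")
conn.execute("INSERT INTO sensor_location VALUES (1, 52.20,  0.12, '2024-06-01', NULL)")
conn.execute("INSERT INTO rainfall VALUES (1, '2023-07-15', 4.2)")

# The join picks the coordinates that were valid on the observation date,
# so historic readings keep their historic location.
print(conn.execute("""
    SELECT r.obs_date, r.mm, s.lat, s.lon
    FROM rainfall r
    JOIN sensor_location s
      ON s.sensor_id = r.sensor_id
     AND r.obs_date >= s.valid_from
     AND (s.valid_to IS NULL OR r.obs_date < s.valid_to)
""").fetchall())  # [('2023-07-15', 4.2, 51.5, -0.12)]
```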
I run a business in this space. Realistically, your chances of having data that is useful to a HF (of any kind) are pretty low, so I wouldn't bank on it as a revenue source unless you have a strong reason to believe (1) your data is predictive of something an investor cares about and (2) it isn't already covered by other data.
There are also ways to reliably find and curate this data for trading firms and asset managers. But that usually requires a somewhat uncommon blend of skills: statistics, web scraping, reverse engineering, or some domain expertise that gives you an edge in (legally) finding and using nonpublic data. Scrappiness also helps a lot.
Out of curiosity, what resolution and time scale are useful? Is it fair to assume that most hedge funds are relatively good at tracking recent information, and that the value is in older archives that are hard to collect?
Also, are event streams and large connected event graphs like what Forge.AI sells actually useful?
Curious because my PhD research potentially has applications for information extraction and event linking, but I'm not entirely sure whether those applications are actually valuable.
> Is it fair to assume that most hedge funds are relatively good at tracking recent information, and that the value is in older archives that are hard to collect?
No, like he said, the most valuable data has a signal that is independent of other data sets, which generally means something proprietary, and almost always without a long history. Having a cleaned archive of standard data with a long history is valuable but already pretty well served. It would be hard to compete with CRSP.
If you shoot me an email I can tell you very quickly if it's viable, and direct you to specific people at firms who would buy it. I probably don't need a sample if you give an honest description of it.
Because scraping data sucks, occasionally has compliance concerns, and is a different core competency from trading. They would rather offload all of the bullshit involved in maintaining a robust scraping operation than pay their research team to do it.
Time spent on maintaining a scraping operation is time taken away from optimizing your ETL process and producing actionable research for your trading team. You know how people pay to have their pipes unclogged even when they know how it's done? Same idea.
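To give a flavor of that plumbing, even a minimal "robust" scraper accretes retry and archival logic like the sketch below (the URL and snapshot directory are placeholders):

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch only: retries with backoff, plus raw snapshots on disk so the
# "as of" history is preserved for later backtests.
session = requests.Session()
retry = Retry(total=5, backoff_factor=1.0, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retry))

def snapshot(url: str, out_dir: Path = Path("snapshots")) -> Path:
    """Fetch a page and archive the raw bytes, keyed by URL hash and UTC timestamp."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir.mkdir(exist_ok=True)
    path = out_dir / f"{key}_{stamp}.html"
    path.write_bytes(resp.content)
    return path

snapshot("https://example.com/some-kpi-page")  # placeholder URL
```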
If all you have is some data scraped off a few web sites that an intern could collect in a week or three, then it's unlikely to be valuable enough for them to pay you a substantial sum of money.
The most valuable data is data that is difficult to gather. Think things like proprietary (i.e. unpublished) industry data. The canonical "sexy" alternative data set sold to hedge funds is counts of cars in retail parking lots from satellite photos.
> If all you have is some data scraped off a few web sites that an intern could collect in a week or three, then it's unlikely to be valuable enough for them to pay you a substantial sum of money.
If the data is compelling and clearly correlates to earnings KPIs, I can tell you from experience that "some data scraped off a few websites" can be salable to the tune of $50,000/quarter. Hedge funds will frequently choose to pay that instead of setting up their own scraping operation, because scraping sucks.
Not all hedge funds, of course. Some do actively try to reverse engineer your dataset. But you probably don't want to work with those anyway.
> The canonical "sexy" alternative data set sold to hedge funds is counts of cars in retail parking lots from satellite photos.
I've personally developed forecasts from (ostensibly) public, scraped sources which beat drone and satellite footage of manufacturing facilities. That one Bloomberg article is not representative; satellite footage sounds sexy but it's not what most alternative data looks like.
I disseminate real-time transaction information from blockchain mempools (BTC, ETH) and flag any that create large state updates. Is this useful information for any hedge funds in the crypto space?
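Concretely, the flagging side of that is roughly the sketch below (using web3.py's v6 API; the node endpoint and threshold are placeholders, and raw ETH value is only a crude proxy for how large a state update a transaction will cause):

```python
import time
from web3 import Web3

# Sketch: watch pending transactions and flag large-value ones.
w3 = Web3(Web3.HTTPProvider("https://your-eth-node.example"))  # placeholder
pending = w3.eth.filter("pending")
THRESHOLD_ETH = 500  # arbitrary cutoff

while True:
    for tx_hash in pending.get_new_entries():
        try:
            tx = w3.eth.get_transaction(tx_hash)
        except Exception:
            continue  # tx may have been mined or dropped before we fetched it
        value_eth = Web3.from_wei(tx["value"], "ether")
        if value_eth >= THRESHOLD_ETH:
            print(f"flag {tx_hash.hex()}: {value_eth} ETH -> {tx['to']}")
    time.sleep(1)
```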
Trying to get my credit union to allow users to opt in to selling bulk anonymized checking account transactions to a hedge fund. Having that much point-of-sale data would be huge.
There is software that can collect such data, and it can definitely be useful to hedge funds. But note that data available to the general public is most probably something these funds already have, since they have spent years developing systems to gather and analyze data to stay ahead of the curve.
[1] Quantale (https://quantale.io/) is a data collection and analytics platform my company is developing.
Does anyone else find it weird when a business like this uses the .org domain? Not to distract from the main discussion, just curious what people think on here.
It's not a business. It's a stale content marketing effort run by one of the incumbents in the space, who leveraged it as a means to promote themselves in front of their competitors.
Indeed, you are correct (at least about it being run by the incumbents):
> AlternativeData.org is supported and maintained by YipitData and sources its content from hundreds of contributing investors, data providers, and industry professionals.
It would make sense for this to be aimed at data consumers rather than data producers, considering that the advice is, honestly, unrealistic.
1. Social Media Data - Twitter mentions etc
2. Website scraping (trawling LinkedIn, Alexa, GitHub) for evidence of activity
3. Credit Card Data - aggregate and POS data
4. Satellite Data - How many cars in the Walmart parking lot.
Those are a few of the shinier alternative data sets in terms of interest. A lot of quants also pay big bucks for more micro or HFT data.
Honestly, if you have a predictive data set, it is most likely more profitable to use it to invest yourself than to sell it. It's like a successful startup raising VC funding: the good ones don't need the investors, the bad ones do.
Source is I used to work in this space. Clearly there are operational difficulties to investing, and serious domain knowledge is required, but if you have data with alpha, it's worth insane amounts of money (seven figures a month).
I disagree. Sell-side research using so-called alternative data is a different skillset from trading and buy-side research. Some datasets carry sufficient alpha in and of themselves to trade on; some have edge (i.e. are not yet priced in) but require more sophisticated analysis. Good data is necessary but insufficient for developing viable trading strategies.
Source: Also used to work in this space. Still do, but not on the sell-side anymore. I wouldn't give trading capital to someone if their entire pitch was just possession of exclusive, useful data.
I used to work in buy-side research. Most of the good datasets have already been stripped of alpha, but they are bought and sold anyway, because funds have to know what the data is saying to keep up with the other funds. A truly new dataset with independent alpha is a goldmine. There are ones that add to the edge and require quants, but in my experience all the valuable alternative data can be normalized and then backtested against the stock price. If you can't do that with it, it's not that valuable outside a few hedge funds.
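To make "normalized and then backtested against the stock price" concrete, a toy version of that sanity check might look like this (the file and column names are hypothetical; a real test would align to earnings periods and control for what's already priced in):

```python
import pandas as pd

# Toy check: does the normalized alt-data signal correlate with forward returns?
df = pd.read_csv("kpi.csv", parse_dates=["date"]).set_index("date")
# expected columns: 'kpi' (e.g. weekly scraped subscriber count), 'close'

# Normalize the raw signal: rolling z-score over ~1 year of weekly data.
kpi_z = (df["kpi"] - df["kpi"].rolling(52).mean()) / df["kpi"].rolling(52).std()

# 4-week forward return of the stock.
fwd_ret = df["close"].pct_change(4).shift(-4)

# If this is near zero, the dataset probably isn't worth much.
print(kpi_z.corr(fwd_ret))
```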
Surely a hedge fund has three advantages compared to investing yourself:
* Higher amounts of money to invest immediately, so they can better capitalise on the data quickly.
* Balance investment risk over many different assets - they can take a higher degree of risk/reward as any single failure in investment is unlikely to topple them
* Combine multiple data sets with other trading strategies to maximise returns.
I'd ask: "If this data is truly useful in a financial market context, why are you selling it instead of investing with it yourself? What data are you collecting that you're *not* selling?"
Because sometimes the data is useful when combined with other data you might not have yourself. Example: you have data on number of Netflix subs. That’s great. It would be even more useful if you combine that with Netflix churn data.
You can think of company analysis as a jigsaw puzzle. You may have a few pieces. That by itself isn't useful unless you have the other pieces.
This works in theory, but in practice, when you start making estimates of engagement numbers that are twice removed from the underlying stock price, you get into trouble. The further you get from a direct relationship to the stock price, the harder it is to predict anything in practice.
What you're saying is correct but also puts a non-sophisticated seller in a poor negotiating position. They do not know what existing data the buyer uses or how their data can be complementary to anything.
I mean, I guess that's something that you can do but it's likely to get you laughed out of the room and greatly encourage them to reverse engineer whatever you were trying to sell.
If the data only made you a few dollars per thousand dollars traded, it would take you a long time to make meaningful income from it, as opposed to a market maker, who pushes enough volume for that margin to add up.
This is a good heuristic for the sale of trading strategies. If someone has a viable trading strategy, they should be raising capital or joining a prop shop instead of selling it.
But it's not maximally rational to trade on data instead of selling it if you don't have any experience trading.
And even if you're somehow able to figure out a trading strategy on this with no expertise, there are extremely few data sources that are actually proprietary so alpha decay is a very real thing.