I work in one of the data platform teams at a social media company. Between our 3 HDFS clusters, we're storing more than an exabyte of data. At our scale, we have to tune our workloads carefully to make sure that problems of scale are not noticeable to internal customers (data scientists, analysts, etc.).
We basically have an entire org of highly paid engineers focused on making sure people can use that data efficiently. So we have a team of people working on storage, on Spark, on Presto/Trino, on data ingestion, and so on.
So my understanding is that we're investing in engineers to improve data science productivity, so that they can do analysis without having to understand the internals of all our systems, so that executives can make informed decisions backed by data to continue printing money. Or something like that...
This type of infrastructure is worth building into a product. I know a few places are doing it, and it's hard not to end up with software engineers behind the scenes supporting clients anyway.
Maybe you make it a SaaS: if you can optimize the ETL process, companies only need to hire data scientists.