I think it is because people are really bad at logic. Seriously. Most of the arguments here rest on a very bad set of assumptions. The major assumption is, like the OP said, some sci-fi sensor data. Even as I tried to explain this, people are still thinking in terms of overly exact data ("Shih Tzu" vs "small dog"). People are right to counter those claims, but when they do, they usually go overboard: they say the data is useless, or "they already have it, so it is effectively useless." Even just verifying data you already have is a powerful tool.
But I think a large amount really comes down to the fact that people don't think from a statistical or mathematical viewpoint (despite claims otherwise). I've probably read more on stats and probability than the vast majority of people here, yet I would never remotely claim I'm good at stats. It is fucking hard, yet so many think it is easy (be wary of anyone who suggests it is). So here are some things I notice:
- People think we need crazy specific details to find trends, but with a lot of data (especially diversified data) and very careful analysis you can extract signal from noise. The counter to this isn't to dismiss outright but to ask whether simpler data can be useful[0]. We feel the need to take a drastically opposing position rather than a nuanced one.
- Thinking that there are singular causal factors (see the "it is about robots, not data" arguments, when it is clearly both). We talk about PCA all the time, yet a lot of our arguments build frameworks where we only discuss the dominant factor and pretend the others don't matter[1].
- Being nowhere near familiar with emergence and how it plays a role in data and our lives (I suspect lack of emergence knowledge is why so many believe in "deep state" or conspiracies).
- Not framing things probabilistically. I see this in security arguments (usually among the tech-literate but non-expert: see the Signal community forums, or here). Most security is probabilistic in nature: it's about putting bounds, not giving guarantees. The classic example is remotely wiping a phone. If you wipe, there's some probability the adversary hadn't gotten the data first. If you don't wipe, the adversary has all the time in the world. Everything, and I mean literally everything, is a probability. What we call truth just has tight bounds.
- The zero-sum fallacy. Many people think the vast majority of games are zero sum, when very few really are. We see this a lot in economics (value can in fact be added to the system; it is only zero sum at an infinitesimal point in time). A rising tide lifts all ships; it does not raise some and sink others so that the system stays balanced.
- Oversimplification, and thinking higher-order terms aren't necessary for "good enough."[1] People assume most probability distributions in the wild are Normal(ish) when they aren't (most have heavy tails). This all comes from building "spherical cow in a vacuum" frameworks. An oversimplification of a problem isn't necessarily a good approximation, and it can often lead you in the wrong direction! For the major challenges we face today specifically, we need higher-order terms to get even a reasonable approximation. For the math-inclined, think of the Taylor series for e^x: 1 + x + x^2/2! + ... + x^n/n!. The zeroth-order approximation y = 1 is only good in a very narrow region around x = 0. The first-order y = 1 + x can be even worse, depending on your region of interest! Even a 4th-order approximation is only useful on roughly [-1, 1] (8th order gets you to [-2, 2], maybe [-3, 3]) and diverges quickly beyond that. And adding terms doesn't uniformly help: if you care about x far below zero, y = 1 beats the 5th-order polynomial (y = 1 misses e^-100 by about 1, while the 5th-order sum at x = -100 blows up to around -8×10^7). If you care about large x > 0, though, the higher orders really are better.
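The e^x claims above are easy to check numerically. A minimal sketch in plain Python (nothing assumed beyond the series itself; the numbers in the comments are just what the partial sums give):

```python
import math

def taylor_exp(x, order):
    """Partial sum of the Taylor series of e^x about 0: sum of x^k / k! for k = 0..order."""
    return sum(x**k / math.factorial(k) for k in range(order + 1))

# 4th order is decent on [-1, 1]...
err_near = abs(taylor_exp(1, 4) - math.exp(1))   # ~0.0099

# ...but diverges fast outside it.
err_far = abs(taylor_exp(5, 4) - math.exp(5))    # ~83

# For x far below 0, the crude y = 1 beats the 5th-order polynomial:
# e^-100 is ~0, y = 1 misses by ~1, the 5th-order sum is ~-7.9e7.
assert abs(taylor_exp(-100, 0) - math.exp(-100)) < abs(taylor_exp(-100, 5) - math.exp(-100))
```

The point being: "more terms" only guarantees improvement inside the region where the partial sums have settled down, which is exactly the "region of interest" caveat.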
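The remote-wipe bullet above can also be made concrete. Here's a toy race model, purely an illustration (exponential waiting times are my assumption, not a real threat model): the data stays safe only if the wipe fires before the extraction finishes.

```python
import random

def p_data_safe(wipe_rate, extract_rate, trials=100_000, seed=0):
    """Monte Carlo estimate of P(wipe completes before extraction), with both
    waiting times drawn as exponentials. For this race the analytic answer is
    wipe_rate / (wipe_rate + extract_rate) -- a bound, not a guarantee."""
    rng = random.Random(seed)
    wins = sum(
        rng.expovariate(wipe_rate) < rng.expovariate(extract_rate)
        for _ in range(trials)
    )
    return wins / trials

# A wipe that triggers ~3x faster than the adversary extracts: ~75% safe.
print(p_data_safe(3.0, 1.0))
```

Note there is no setting of the parameters that makes the answer 1.0: you can push the probability around, never pin it.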
The real Trojan horse is the belief we've built up that these things don't matter; that simple answers are good ("you don't know something unless you can explain it to a child" is bullshit). I suspect this is evolutionary, since the simple framework solved most of the problems related to survival. The problem is that modern society is much more complicated than that, and we have effectively solved those survival hardships. The problems we face now are so complicated that our simple frameworks are no match for them. Most people struggle with the points above, yet any single one can quickly ruin our frameworks. I'd argue we see all of them showing up in the arguments on this post (this rant is still rooted in the topic, just meta; I'm writing it so we can have better arguments).
[0] To help, let me give a very clear example of something almost trivial to determine but highly useful. Suppose your Roomba constantly bumps into things near a door, and those things move every single day. Those are very likely shoes. We now know to advertise a shoe rack so the person can organize their shoes at the door. Yes, there are more complicated examples where we could infer more intimate details of someone's life and sell more products, but the point here is the simplicity: there is noise we can extract signal from, and in a way that purchasing or online behavior wouldn't capture.
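For what the footnote's inference could look like in code, here's a minimal sketch; every name, number, and threshold is hypothetical, it's just the statistical shape of the problem: something hit on most days, but never in quite the same spot, behaves like movable clutter (shoes), not furniture.

```python
import math

def clutter_signature(daily_bumps, zone, radius=1.0):
    """daily_bumps: one (x, y) bump per day near the zone, or None on bump-free days.
    zone: (x, y) of the region of interest, e.g. just inside the front door.
    Returns (hit_rate, scatter): fraction of days with a bump in the zone, and
    average day-to-day drift of the bump position. High rate + nonzero scatter
    means something is there almost daily but keeps moving."""
    hits = [p for p in daily_bumps
            if p is not None and math.dist(p, zone) <= radius]
    hit_rate = len(hits) / len(daily_bumps)
    scatter = (sum(math.dist(a, b) for a, b in zip(hits, hits[1:]))
               / max(len(hits) - 1, 1))
    return hit_rate, scatter

# A week of bumps near the door, position shifting each day.
week = [(0.1, 0.2), (0.4, 0.1), None, (0.2, 0.5),
        (0.3, 0.0), (0.0, 0.3), (0.5, 0.4)]
rate, scatter = clutter_signature(week, zone=(0.25, 0.25))
print(rate, scatter)  # high rate, nonzero scatter -> likely shoes
```

No cameras, no object recognition: bump coordinates plus a calendar already separate "furniture" (fixed position) from "shoes" (moving daily).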
OK, I re-read this and it does read as sort of confrontational. I actually probably agree with 90% of what you're saying. It's probably not worth quibbling over made-up examples like this. I just want to emphasize that robotics in uncontrolled environments is very, very hard.
What I'm getting at is: why do the hard robotics work to get a new stream of data when there are many streams that are already computer-legible, probably give very similar insights*, and are more compact? The data is not useless, certainly not, but it's actually quite hard to obtain, even with the sensor kit this thing has.
If you were to focus your efforts on starving Amazon's insatiable consumer data appetite, the robot stuff they're talking about here should be very low on your priority list.
Your broader point definitely stands.
(*not precisely the same insights; I certainly see where you are going there!)
-----------------------------
I'm being overly specific to illustrate a point.
Try doing the shoe rack thing, I mean really try it. Is it really that easy? These shoes don't have barcodes the way things at the warehouse do, and the robots already screw that stuff up in the warehouse all the time. How much money will you make by improving shoe-rack recommendations by 2.5%? How would you come up with that number, which might justify such a project to the Amazon executives deciding what you work on? Why spend so much effort on the shoe thing when your robot is already running into the shoes!?
The big thing that I probably should have made clear is that robotics is really hard. Every single separate physical thing your robot runs into is difficult, and they're all different. It would take a lot of $$$ to nail the problem of determining that the thing you ran into is a pair of shoes, or that this area is the front door, or that you even ran into anything at all. Seriously!
Now, imagine putting the same number of statistics people, such as yourself, on a team in Amazon Healthcare. Or even just on a team that gets 200 more people to sign up for Amazon Healthcare. I would be willing to bet that the data coming out of that is simultaneously far easier to work with (no pesky physical reality!) and an order of magnitude more lucrative than any of these robot projects.
Amazon has five hundred open positions for data scientists. I assure you, none of them will be working on something like this. Not for a long time.
> I just want to emphasize that robotics in uncontrolled environments is very very hard.
I'm not sure what convinced you I was making that argument. My example was illustrative, a counter to *your example* of determining a specific dog breed; I was trying to show an easier problem. But "easier" is comparative and doesn't mean the problem is "easy." I'm guessing this is the confusion? Let's be real: figuring out a dog's breed by distinguishing dog hair from other debris is substantially harder, and I'm pretty sure you'd need a DNA scan. It would be near impossible (probably entirely) to differentiate hair of any kind with current sensors. I don't think the shoe example is extremely hard, considering these robots have lidar on them. Lidar can in fact tell you that something is roughly shoe-shaped, so I'm not sure what your gripe is. No DNA scanning, no cameras, no microphones. The problem is likely solvable with *existing instruments on commercially sold products.* That, again, doesn't mean the problem is "easy," just that it can be solved.
> How much money will you make improving shoe rack recommendations by 2.5%?
Okay, but I'm assuming there's more than one single problem the mapping can solve, so I'm not sure what your point is.
Let's try a trivial example, one that *is easy* and shows we can sell things beyond shoe racks with this data. Knowing the (approximate) square footage of your house is useful. If you live in a 500 sqft place, that alone says a lot about income: it's likely a small apartment, you probably can't afford a lot of stuff, and Amazon probably shouldn't advertise luxury goods to you, especially large ones.
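As a sketch of how cheap that inference is (function names, cell size, and thresholds are all hypothetical, purely for illustration): a coverage map already gives you traversable area for free, and crude buckets are plenty for ad targeting.

```python
def mapped_sqft(grid, cell_m=0.05):
    """grid: 2D occupancy map where 1 marks a cell the robot has traversed.
    Traversed cells * per-cell area, converted from m^2 to square feet."""
    cells = sum(row.count(1) for row in grid)
    return cells * cell_m**2 * 10.7639

def ad_tier(sqft):
    # Crude, illustrative buckets: don't push large luxury goods at a studio.
    if sqft < 600:
        return "small-apartment"
    if sqft < 1500:
        return "mid-size"
    return "large-home"

# 100x100 grid of 5 cm cells, fully traversed: 25 m^2, roughly 269 sqft.
grid = [[1] * 100 for _ in range(100)]
print(ad_tier(mapped_sqft(grid)))  # small-apartment
```

This is the whole point of the "trivial" example: no new sensors, no new math, just a byproduct of the map the robot builds anyway.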
> Now, imagine putting the same number of statistics people, such as yourself, on a team in Amazon Healthcare.
I'm not sure why you're pigeonholing me and making this a zero-sum game. I'm actively arguing an "and" position, not an "or" position. Yeah, healthcare is lucrative. But Amazon has an insane amount of wealth, more than they actually know what to do with. They are perfectly capable of hiring for more than the 200 positions their robotics site currently advertises, and of increasing that number without taking any funding away from the healthcare side (not a zero-sum game). I'd also assume there's significant domain overlap, considering Roombas already map homes. Mapping warehouses is a pretty similar problem to mapping houses, so I'm not sure what the problem is.