
> What I found was "How to throw away data that doesn't support your desired conclusions," for the most part.

What exactly are you referring to here? This seems like a wildly misguided characterization of statistics, which I am sure cannot be based on expertise or practical, applied experience.

> We're building "AI" right now but think about the inputs those see: The very first step is to throw away the statistically too common "stop words"

This is a fundamental misunderstanding of what a "stopword" is and how it's used.

Words like "the" are hard to make use of within a bag-of-words model specifically. Removing them is not something people do/did because they are clueless monkeys; the goal is to improve the signal-to-noise ratio.
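
To make the signal-to-noise point concrete, here is a minimal sketch (my own illustration, not anything from the thread; the stop-word list and example text are made up) of how much function words would otherwise dominate a bag-of-words count:

    # Minimal illustration: bag-of-words counts with and without a small,
    # hand-picked stop-word list. Everything here is invented for the example.
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

    def bag_of_words(text, drop_stop_words=True):
        tokens = text.lower().split()
        if drop_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        return Counter(tokens)

    doc = "the offer is in the mail and the prize is a free cruise to the islands"
    print(bag_of_words(doc, drop_stop_words=False).most_common(3))
    # [('the', 4), ('is', 2), ('offer', 1)] -- function words swamp the counts
    print(bag_of_words(doc, drop_stop_words=True).most_common(3))
    # [('offer', 1), ('mail', 1), ('prize', 1)] -- content words remain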

For example, spam filtering traditionally uses a very crude variety of bag-of-words model called "Naive Bayes", in which we assume (wrongly, of course) that word choice is completely random, and that the only difference between spam and non-spam is that random distribution of words. Are you really going to argue that the word "the" is critical to that process? If you can build a better NB spam filter by including stop words, by all means go ahead and do it. But both linguistics and decades of success in the field are against you.
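
If you want to see how little machinery that involves, here is a rough, from-scratch sketch of such a filter in Python with add-one smoothing; the tiny corpus, the stop-word list, and all the names are illustrative only, not anyone's production code:

    import math
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

    def tokenize(text):
        # Lowercase, split on whitespace, drop stop words.
        return [t for t in text.lower().split() if t not in STOP_WORDS]

    def train(docs, labels):
        # Per-class word counts and per-class document counts.
        word_counts = {"spam": Counter(), "ham": Counter()}
        doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            word_counts[label].update(tokenize(doc))
        vocab = set(word_counts["spam"]) | set(word_counts["ham"])
        return word_counts, doc_counts, vocab

    def classify(text, word_counts, doc_counts, vocab):
        total_docs = sum(doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Log prior plus per-word log likelihoods (add-one smoothing).
            score = math.log(doc_counts[label] / total_docs)
            denom = sum(word_counts[label].values()) + len(vocab)
            for tok in tokenize(text):
                score += math.log((word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

    docs = ["win a free prize now", "claim your free cruise",
            "meeting notes attached", "lunch at noon tomorrow"]
    labels = ["spam", "spam", "ham", "ham"]
    model = train(docs, labels)
    print(classify("free prize cruise", *model))        # spam
    print(classify("notes from the meeting", *model))   # ham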

On the other hand, words with grammatical function like "the" are absolutely important and relevant to the overall structure and meaning of a document. That is why training pipelines for modern deep-learning-based LLMs like GPT don't remove stop words (as far as I know, at least): the whole idea of a stop word doesn't make sense in a model like that.
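
As a quick sanity check of that claim (assuming the Hugging Face transformers package and the standard GPT-2 vocabulary, which is my assumption, not something stated in the thread), a GPT-style subword tokenizer keeps "the" as an ordinary token rather than discarding it:

    # Requires the "transformers" package; exact token strings depend on the
    # GPT-2 vocabulary that gets downloaded.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    print(tokenizer.tokenize("The cat sat on the mat."))
    # Something like: ['The', 'Ġcat', 'Ġsat', 'Ġon', 'Ġthe', 'Ġmat', '.']
    # "the" survives as a token of its own; the model learns what to do with it.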

I want to be respectful here, but it sounds like you took a cursory look through three vast literatures, without the perspective of having actually used any of this stuff in real life, and drew some invalid conclusions.




> I want to be respectful here,

Thanks!

Many people in these fields agree my conclusions are invalid. I say the same about theirs.


You're entitled to your own opinion of course, but your conclusions appear to be based on beginner-level misunderstandings. That doesn't seem like a constructive or productive way to conduct oneself through life.


You should see my rants about why normalizing weights is a bad idea and how a limited context window is effectively random interpolation.



