
The real cost is that society at large is no longer contributing to StackOverflow, so problems and solutions are now stored in proprietary databases (which, granted, SO also was), except now the database is invisible as well.


> proprietary databases (which granted SO also was)

StackOverflow's database was public and shared at the Internet Archive until recently: https://archive.org/details/stackexchange

They've now moved it onto their own infra, ostensibly so people have to agree not to use it for LLM training: https://meta.stackexchange.com/questions/401324/announcing-a...
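
For what it's worth, pulling the old mirror down was scriptable with the internetarchive Python package (the item name comes from the URL above; the glob pattern here is just an illustration):

    # Sketch: fetch part of the old Stack Exchange dump from archive.org.
    # "stackexchange" is the item from the link above; the glob pattern
    # (which files to grab) is illustrative.
    from internetarchive import download

    download(
        "stackexchange",
        glob_pattern="stackoverflow.com-Posts*",
        destdir="se-dump",
    )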


Wait, I thought they backed off from this nonsense.

If they really did close off DB downloads then I'll never answer a question there ever again. I bet many others won't either. Maybe that's also part of why SE will fail.

The only deal I personally am willing to tolerate is me spending time providing quality questions and answers in my domain, in return for being able to download all questions and answers under the CC BY-SA license and to view them offline without an account (via Kiwix). No other arrangement is acceptable.

I'll look further into this and update this comment with my findings.

Edit: Yep. No more answers from me. Talk about throwing the baby out with the bath water.

Maybe small community forums are really the only way to share knowledge these days :(


The situation at large is even worse if you consider how Discord moved a significant portion of discussion away from the indexable Web.

I thought of mailing lists as archaic, but maybe we just regressed from there.


IMO public forums are the sweet spot.


The StackOverflow database was open source for a while, but I haven't checked for a data dump since the new management came in.


True, it may not be open source, but it is still viewable.

Best for Society: Open Source <-- Where SO was

Medium for Society: Proprietary, but openly browsable <-- Where SO is

Worst for Society: Proprietary, not browsable <-- LLM-based code assist tools


This is just like the majority of communities moving to Discord over forums. The barrier to entry might be a lot lower than getting a VPS and hosting phpBB or whatever, but the discoverability and searchability have gone toward zero. Everything is just moving into opaque black boxes.


Your comment sparked a thought -- maybe this is the "Dark Forest" of the Web:

Either your site's content stays hidden behind Discord, or an LLM's bot/minion scrapes all your content and makes visiting your site superfluous, thereby effectively killing your site.


This is an interesting thought because there is probably a real dynamic like this with some kinds of content, but as an HN comment it is also somewhat self-refuting.


I never understood the move to Discord. Maybe I should host a phpBB and bring some sanity back.


> phpBB

Whoa there Nelly!

If we're going to resurrect things, can we do it whilst leaving PHP in the past?


PHP hasn't failed me yet. Been using it for about 20 years. Not heavily, but it gets the job done and it's still improving. It's really quite sane if you start comparing it to the alternatives. It continues to 'just work'.


Less competition for me


...where would you classify Llama in here? That's not really "open source" despite what Facebook calls it, but I wouldn't call it proprietary, anyone can download and use the whole thing locally.


"public weights"?


Can those weights be interpreted by anyone viewing them? If not, it seems like publicly available, obfuscated code at best.


"model available" would be my preferred term.

Is Photoshop.exe "interpretable" by anybody with a copy (of Windows)? How about a binary that's been heavily decompiled, like a Mario game?


Photoshop doesn't claim to be open source like Llama does, though; I'm not sure of the connection you're making.

Don't get me wrong, Llama is at least more open than OpenAI, and that may be meaningful.


The license aside, the question is what can be done with a carefully arranged blob of binary? Without additional software (Windows) I can't really do anything with Photoshop.exe. Similarly, Llama.gguf is useful with Ollama.app and useless standing alone. So (looking past the difference in license), would you consider Photoshop.exe similar in that it's a binary blob that's useless by itself, or is it a useful collection of bytes? And why is/isn't an ML model on Hugging Face the same?
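
To make the question concrete, here's a sketch (the file path is made up, and the raw read is only meant to show the opacity):

    # A model file alone is just bytes; reinterpreting them as floats tells
    # you nothing about its behavior. (Path is hypothetical.)
    import numpy as np
    print(np.fromfile("llama.gguf", dtype=np.float32, count=8))

    # It only becomes useful through a runtime that knows the format,
    # e.g. llama-cpp-python:
    from llama_cpp import Llama
    llm = Llama(model_path="llama.gguf")
    print(llm("Q: What is a mutex? A:", max_tokens=32)["choices"][0]["text"])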


The license used isn't important, in my opinion; when talking about open source, the question is whether the source code is available to be modified and reviewed/interpreted.

Photoshop, or any compiled binary, isn't meant to be open source and the code isn't meant to be reviewable. Llama is called open source, though the most important part isn't publicly available for review. If Llama didn't claim to be open source, I don't think it would matter that the model itself and all the weights aren't available.

If your argument is just that most software is shipped as compiled and/or obfuscated code, sure, that's how it's usually done. That isn't considered open source though, and the line with LLMs seems to be very gray - it can be "open source" if the source code for the training logic is available, even though the actual code/model being run can't be reviewed or interpreted.


The source data for the training needs to be public and freely licensed too, otherwise it's IMO not an open source model.


Is that really necessary if the resulting model was actually available and comprehensible?

Personally I can't say I care as much about what the training set is, I want to know what's actually in the model and used at runtime/interpretation.


Yes: you can't know what kind of poisoning was done to the initial training data set, you can't review the data or any human inputs, and you can't retrain from scratch. All of those are things the model author can do, and downstream folks/companies/governments should be able to do them too. Otherwise it isn't open source.


I think this discussion is silly in the context of a modern LLM. Nobody really understands how an LLM works, and you absolutely do not actually want to retrain Llama from scratch.

When I said "it's not really open source", I was referring to the fact that there are restrictions on who can use Llama.


Well that's a much deeper rabbit hole - we shouldn't be using such massive systems or throwing so many resources at them when no one even knows how they work.



I think you'd find society-at-small was contributing: a perhaps 10x larger (yet still quite small) number posting but watering down the useful contributions, 100x that lurking, and 1000x that just drive-by copy-pasting from SO to their IDE.


"Useful contributions" is subjective. Not everyone is born a senior developer. Juniors, and even children who aren't even juniors yet ask questions on these channels.

Source: I bothered a lot of people on the Internet about C++ when I was a child.


I contributed well-researched answers already when I was ostensibly a very junior dev, and even before that during my CS studies. Stuff you can look up or simply try out is doable; of course, questions where you need experience aren't a good fit.

When I was in high school I read the docs, and learned C++ from books and MSDN. Granted, my access to the internet was rather limited back then, but it also never crossed my mind to bother people for things I could easily look up myself.

Growing up in an RTFM, "search the forum first before asking" environment is seen as toxic today, but it really helps keep certain behavior in check that's a drag on society as a whole.

One of the best mentors/bosses I ever had never answered coding questions directly, but always in the form of a question so I could look it up and learn for myself.

I try to do the same with my junior devs today: unless there are time constraints or they're under stress, I try to let them figure out the final answer themselves.


I find there's a time and a place for RTFM, and it's usually not at the start of the project when you know the least. When you're just starting, you just want to get something working. Being spoon-fed some answers to get past a few hurdles is rather nice. But then there comes a point where you have to stop and be like "Okay, what the heck is this actually doing? How does it work?".

I just hit that point with some TensorFlow stuff because I started hitting the limits of what ChatGPT could answer successfully, and I think that's fine. But maybe good that I couldn't get everything out of it or it may have delayed my learning further yet. Which I guess reinforces your point.


Are you saying these children's questions are useful contributions?


To the child that asked it? Absolutely. :-)

Information is only useful if it's accessible. If they are asking questions it's because the information they want is in practice not accessible to them.


I'm not disagreeing, just clarifying. I myself have asked lots of dumb/easy questions, but prior to ChatGPT it was hard to get simple, straightforward answers.


There's a rule of tens online: for every comment, there are ten interactions (like an upvote/downvote), and for every interaction, there are ten views.

So, every comment on a post roughly equals 100 views.
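
Spelled out as a toy calculation:

    comments = 1
    interactions = comments * 10   # upvotes/downvotes per comment
    views = interactions * 10      # views per interaction
    print(views)                   # => 100 views per comment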


I learned this as a real estate salesperson some 20 years ago: it takes 10 calls to get a meeting, 10 meetings to get a contract, and 10 contracts to complete a sale. So a thousand calls, a hundred meetings, and ten contracts for one sale. I didn't last long; the only other people there who were successfully selling land were also heavy cocaine users. That's a mark of how grind-y the job was.


Sure, but at least the knowledge was publicly viewable, even if the contributions weren't equal. Upvoting, browsing, and asking are also contributions.


I think this only removes the low-quality questions like “how do I make X in React”.

There will always be new libraries and software updates, and those will always have corners and edge cases that will foster questions. The LLMs won’t have the answers out of the box until they have something to train from, so there’s still room for StackOverflow.


So the real solution is tracking all of your data.

There should be a product, something that you install to capture your own data, and then pseudo-anonymize it and sell it back to databurgers (I meant to write data brokers but I kept this misspelling for lolz).

Is there?

An assistant LLM like Limitless.AI, merged with their older desktop scraper app, could be repackaged to do that...

New industry? Or old?


I think there will always be open platforms for us to exchange ideas where Copilot can't help. It's just that now simple problems can be solved with Copilot directly, so we can finally focus on implementing ideas and optimizing them.


I don't know. Sometimes there are many ways to solve a simple problem, but some are better than others. On Stack Overflow you get that variety and that discussion. Copilot just gives one. It might be suboptimal or it might have subtle bugs.


Yeah, I agree that StackOverflow is but a shadow of what it used to be, and that this is a shame. But I don't think that AI is to blame for this - all the interesting discussions had already migrated to GitHub issues. AI is just the nail in the coffin for SO.


I wonder if a purge and a fresh start for StackOverflow would renew interest.

I used to like reading StackExchange sites as social media--lots of interesting questions and clever answers. Today, votes have slowed down, the best answers are from 2017, and only niche questions can avoid being closed.


Stack Overflow is now a proprietary database too. Given the choice between that and a proprietary robot offering 10x the clarity and quality, I'd choose the robot. Not all LLMs are proprietary in the same way as Claude; many have their weights publicly available, like Gemma. I can understand if you feel like a file of floating point numbers is a de facto proprietary tool. But if you're smart, you'd look at this instead as an opportunity to invent the tools that will make this knowledge accessible.

I've been working with Mozilla to build a "fantasy mode" feature for Firefox, which works similarly to incognito mode, where a local LLM generates a synthetic version of the world wide web on the fly. This gives you the ability to explore the knowledge contained in LLM weights using an intuitive, familiar browser-based interface. So far it's about as fast as 56k dialup was in the 1990s, but as microprocessors become faster, I believe we'll be able to generate artificial realities of useful information we can't live without, superior to Stack Overflow today.
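
Here's a toy sketch of the idea (the model file and prompt are made up; this isn't the actual Firefox integration):

    # Toy sketch: hallucinate a page for a URL with a local model,
    # via llama-cpp-python. Model path and prompt are illustrative.
    from llama_cpp import Llama

    llm = Llama(model_path="local-model.gguf", n_ctx=2048)

    def fantasy_fetch(url: str) -> str:
        prompt = f"Write the full HTML of a plausible web page at {url}\n<html>"
        out = llm(prompt, max_tokens=512, stop=["</html>"])
        return "<html>" + out["choices"][0]["text"] + "</html>"

    print(fantasy_fetch("https://stackoverflow.example/questions/how-do-i-sort-a-list"))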


Forgive me if I have missed something, but how is a synthetic version of the web (which sounds interesting and impressive in its own right) in any way comparable to a vast, indexed repository of curated and organized technical knowledge shared by experts with nuanced experiences and insights?

> but as microprocessors become faster, I believe we'll be able to generate artificial realities of useful information we can't live without which are superior to Stack Overflow today.

Was this written by an LLM bot? It seems…off.


This sounds great! There isn't enough slop on the web, so this sounds like a good way to experience browsing without any non-AI-generated nonsense getting in the way!



