The semantic web is now widely adopted (csvbase.com)
455 points by todsacerdoti on Aug 21, 2024 | 259 comments


The semantic web standards are sorely lacking (for decades now) a killer application. Not in a theoretical universe of decentralized philosopher-computer-scientists but in the dumbed down, swipe-the-next-30sec-video, adtech oligopolized digital landscape of walled gardens. Providing better search metadata is hardly that killer app. Not in 2024.

The lack of adoption has, imho, two components.

1. bad luck: the Web got worse, a lot worse. There hasn't been a Wikipedia-like event for many decades. This was not pre-ordained. Bad stuff happens to societies when they don't pay attention. In a parallel universe where the good Web won, the semantic path would have been much more traveled and developed.

2. incompleteness of vision: if you dig to their nuclear core, semantic apps offer things like SPARQL queries and reasoners. Great, these functionalities are both unique and have definite utility but there is a reason (pun) that the excellent Protege project [1] is not the new spreadsheet. The calculus of cognitive cost versus tangible benefit to the average user is not favorable. One thing that is missing is a set of abstractions that would help bridge that divide.

Still, if we aspire to a better Web, the semantic web direction (if not current state) is our friend. The original visionaries of the semantic web were not out of their minds; they just did not account for the complex socio-economics of digital technology adoption.

[1] https://protege.stanford.edu/


The semantic web has been, in my opinion, a category error. Semantics means meaning and computers/automated systems don't really do meaning very well and certainly don't do intention very well.

Mapping the incredible success of The Web onto automated systems hasn't worked because the defining and unique characteristic of The Web is REST and, in particular, the uniform interface of REST. This uniform interface is wasted on non-intentional beings like software (that I'm aware of):

https://intercoolerjs.org/2016/05/08/hatoeas-is-for-humans.h...

Maybe this all changes when AI takes over, but AI seems to do fine without us defining ontologies, etc.

It just hasn't worked out the way that people expected, and that's OK.


> Maybe this all changes when AI takes over, but AI seems to do fine without us defining ontologies, etc.

If you say "AI" in 2024, you are probably talking about an LLM. An LLM is a program that pretends to solve semantics by actually entirely avoiding semantics. You feed an LLM a semantically meaningful input, and it will generate a statistically meaningful output that just so happens to look like a semantically meaningful transformation. Just to really sell this facade, we go around calling this program a "transformer" and a "language model", even though it truthfully does nothing of the sort.

The entire goal of the semantic web was to dodge the exact same problem: ambiguous semantics. By asking everyone to rewrite their content as an ontology, you compel the writer to transform the semantics of their content into explicit unambiguous logic.
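
To make "explicit unambiguous logic" concrete, here's a minimal sketch of the kind of triples a writer is asked to produce, using Python's rdflib. The vocabulary and URIs are made up purely for illustration:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    # A hypothetical vocabulary; nothing here is an official ontology.
    EX = Namespace("https://example.org/vocab#")

    g = Graph()
    article = URIRef("https://example.org/articles/42")

    # The writer has to commit to hard statements about their own content:
    g.add((article, RDF.type, EX.Essay))
    g.add((article, EX.about, EX.ClimatePolicy))
    g.add((article, EX.claims, Literal("Carbon taxes reduce emissions")))

    print(g.serialize(format="turtle"))

The serialization is trivially machine-readable; whether EX.Essay or EX.ClimatePolicy is grounded in anything unambiguous is still entirely up to the humans involved, which is exactly the problem.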

That's where the category error comes in: the writer can't do it. Interesting content can't just be trivially rewritten as a simple universally-compatible ontology that is actually rooted in meaningfully unambiguous axioms. That's precisely the hard problem we were trying to dodge in the first place!

So the writer does the next best thing: they write an ontology that isn't rooted. There are no really useful axioms at the root of this tree, but it's a tree, and that's good enough. Right?

What use is an ontology when it isn't rooted in useful axioms? Instead of dodging the problem of ambiguous semantics, the "semantic web" moves that problem right in front of the user. That's probably useful for something, just not what the user is expecting it to be useful for.

---

I have this big abstract idea I've been working on that might actually solve the problem of ambiguous semantics. The trouble is, I've been having a really hard time tying the idea itself down to reality. It's a deceptively challenging problem space.


I would happily bite on that; I mostly deal with archives, libraries, museums and how they deal with people and communities. Because of that there is a ton of nuance when it comes to identities (there is a lot of gradation in meaning between "African American" and "Black" or gay and homosexual for example). Things that seem simple are often very complicated and I've spent a good deal of my PhD work working on that (the Homosaurus is the most popular example of this work); and lately I've been working on how to represent identities in a linked and changing way that can still be used by cultural heritage institutions (i.e. simple enough to be linked to SKOS and Wikidata). I feel pretty close on how to represent some aspects of this semantically, thanks to help from a brilliant ontologist friend.


I don't think I'll live up to "brilliant ontologist", but here goes:

First of all, what's the problem? Computing human-written text.

What's the problem domain? Story. In other words: intentionally written text. By that, I mean text that was written to express some arbitrary meaning. This is smaller than the set of all possible written text, because no one intentionally writes anything that is exclusively nonsensical.

So what's my solution? I call it the Story Empathizer.

---

Every time someone writes text, they encode meaning into it. This even happens on accident: try to write something completely random, and there will always be a reason guiding your result. I call this the original Backstory. This original Backstory contains all of the information that is not written down. It's gone forever, lost to history. What if we could dig it up?

Backstory is a powerful tool. To see why, let's consider one of the most frustratingly powerful features of Story: ambiguity. In order to express a unique idea in Story, you don't need an equivalently unique expression! You can write a Story that literally already means some other specific thing, yet somehow your unique meaning still fits! Doesn't that break some mathematical law of compression? We do this all day every day, so there must be something that makes it possible. That thing is Backstory. We are full of them. In a sense, we are even made of them.

We can never get the original Backstory back, but we can do the next best thing: make a new one. How? By reading Story. When we successfully read a Story, we transform it into a new Backstory. That goes somewhere in the brain. We call it knowledge. We call it memory. We call it worldview. I call this process Empathy.

Empathy is a two way street. We can use it to read, and we can use it to write. When two people communicate, they each create their own contextual Backstory. The goal is to make the two Backstories match.

---

So how do we do it with a computer? This is the tricky part. First, we need some fundamental Backstories to read with, and a program that uses Backstory to read. Then we should be able to put them to work, and recursively build something useful.

I envision a diverse library of Backstories. Once we have that, the hardest part will be choosing which Backstory to use, and why. Backstories provide utility, but they come with assumptions. Enough meta-reading, and we should be able to organize this library well enough. The simple ability to choose what assumptions we are computing with will be incredibly useful.

---

So that's all I've got so far. Every time I try to write a real program, my surroundings take over. Software engineering is fraught with assumptions. It's very difficult to set aside the canonical ways that software is made, and those are precisely what I'm trying to reinvent. I'm getting tripped up by the very problem I intend to solve, and the irony is not lost on me.

Any help or insight would be greatly appreciated. I know this idea is pretty out there, but if it works, it will solve NLP, and factor out all software incompatibility.


> The semantic web has been, in my opinion, a category error.

Hard agree.

> Maybe this all changes when AI takes over, but AI seems to do fine without us defining ontologies, etc.

I think about it as:

- Hypermedia controls have been deemphasized, leading to a ton of workarounds to REST

- REST is a perfectly suitable interface for AI Agents, especially to audit for governance

- AI is well suited to the task of mapping the web as it exists today to REST

- AI is well suited to mapping this layout ontologically

The semantic web is less interesting than what is traversable and actionable via REST, which may expose some higher level, reusable structures.

The first thing I can think of is `User` as a PKI type structure that allows us to build things that are more actionable for agents while still allowing humans to grok what they're authorized to do.


I take the other side of this trade, and have since c. 1980. I say that semantics is a delusion our brains create. Doesn't really exist. Or conversely is not the magical thing we think it is.


How are you oblivious of the performative contradiction that is that statement?

Please tell me you're not an eliminativist. There is nothing respectable about eliminativism. Self-refuting, and Procrustean in its methodology, denying observation it cannot explain or reconcile. Eliminativism is what you get when a materialist refuses or is unable to revise his worldview despite the crushing weight of contradiction and incoherence. It is obstinate ideology.


TIL:

https://en.wikipedia.org/wiki/Eliminative_materialism

> Eliminative materialism (also called eliminativism) is a materialist position in the philosophy of mind. It is the idea that the majority of mental states in folk psychology do not exist. Some supporters of eliminativism argue that no coherent neural basis will be found for many everyday psychological concepts such as belief or desire, since they are poorly defined. The argument is that psychological concepts of behavior and experience should be judged by how well they reduce to the biological level. Other versions entail the nonexistence of conscious mental states such as pain and visual perceptions.


> Eliminativism is what you get when a materialist refuses or is unable to revise his worldview despite the crushing weight of contradiction and incoherence.

Funny, because eliminativism to me is the inevitable conclusion that follows from the requirement of logical consistency + the crushing weight of objective evidence when pitted against my personal perceptions.


man


At TU Delft, I was supposed to do my PhD in the semantic web, specifically in shipping logistics. It was funded by the Port of Rotterdam 10 years ago. The idea was to theorize and build various concepts around discrete data sharing, data discovery, classification, building ontologies, query optimizations, automation and similar use cases. I decided not to pursue the PhD a month into it.

I believe in the semantic web. The biggest problem is that, due to lack of tooling and ease of use, it takes a lot of effort and time to see value in building something like that across various parties etc. You don't see the value right away.


Funny you bring up logistics and (data) ontologies. I'm a PM at a logistics software company and I'd say the lack of proper ontologies and standardized data exchange formats is the biggest effort driver for integrating 3rd party carrier/delivery services such as DHL, Fedex etc.

It starts with the lack of a common terminology. For tool A a "booking" might be a reservation e.g. of a dock at a warehouse. For tool B the same word means a movement of goods between two accounts.
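
For what it's worth, this is exactly the kind of mismatch a shared vocabulary is supposed to absorb. A minimal sketch in Python with rdflib; every namespace and term here is made up for illustration:

    from rdflib import Graph, Namespace
    from rdflib.namespace import SKOS

    # Hypothetical vocabularies for two tools plus one shared logistics ontology.
    TOOL_A = Namespace("https://tool-a.example/terms#")
    TOOL_B = Namespace("https://tool-b.example/terms#")
    SHARED = Namespace("https://logistics.example/ontology#")

    g = Graph()
    # Tool A's "booking" is a dock reservation at a warehouse.
    g.add((TOOL_A.Booking, SKOS.exactMatch, SHARED.DockReservation))
    # Tool B's "booking" is a movement of goods between two accounts.
    g.add((TOOL_B.Booking, SKOS.exactMatch, SHARED.GoodsMovement))

    # An integration layer can translate each partner's term into the shared
    # concept instead of guessing from the word "booking" alone.
    print(g.serialize(format="turtle"))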

In terms of data integration things have gotten A LOT worse since EDIFACT became de facto deprecated. Every carrier in the parcel business is cooking up their own API, often with insufficient means. I've come across things like Polish-language endpoint names and error messages, or country organisations of big parcel couriers using different APIs.

IMHO the EU has to step in here because integration costs are skyrocketing. They forced cellphone manufacturers to use USB-C for charging; why can't they force carriers to use a common API?


The EU is doing its part in some domains. There is, e.g., the eProcurement ontology [1] that aims to harmonize public procurement data flows. But I suppose it helped a lot that (by EU law) everybody is obliged to submit to a central repository.

[1] https://docs.ted.europa.eu/epo-home/index.html


Good choice. The semantic web really brought me to the brink.

The community has its head in the sands about... just about everything.

Document databases and SQL are popular because of all the affordances around "records". That is, instead of deleting, inserting, and updating facts you get primitives that let you update records in a transaction even if you don't explicitly use transactions.

It's very possible to define rules that will cut out a small piece of a graph that defines an individual "record" pertaining to some "subject" in the world even when blank nodes are in use. I've done it. You would go 3-4 years into your PhD and probably not find it in the literature, not get told about it by your prof, or your other grad students. (boy I went through the phase where I discovered most semantic web academics couldn't write hard SPARQL queries or do anything interesting with OWL)
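
The rough shape of one way to do it, sketched here with rdflib in Python rather than the actual rules (an illustration of a concise-bounded-description style walk, not the exact approach above): take every triple about the subject and recurse through blank nodes.

    from rdflib import BNode, Graph, URIRef

    def record(g: Graph, subject: URIRef) -> Graph:
        """Cut out the 'record' for one subject: all of its triples, plus the
        triples hanging off any blank nodes it points to, recursively."""
        out = Graph()
        todo, seen = [subject], set()
        while todo:
            node = todo.pop()
            if node in seen:
                continue
            seen.add(node)
            for s, p, o in g.triples((node, None, None)):
                out.add((s, p, o))
                if isinstance(o, BNode):
                    todo.append(o)  # follow blank nodes into the record
        return out

    # Usage sketch; the file name and subject URI are placeholders:
    # g = Graph().parse("data.ttl")
    # print(record(g, URIRef("https://example.org/person/1")).serialize(format="turtle"))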

Meanwhile people who take a bootcamp can be productive with SQL in just a few days because SQL was developed long ago to give the run-of-the-mill developer superpowers. (imagine how lost people were trying to develop airline reservation systems in the 1960s!)


There's another element, trusting the data.

Often that may require some web scale data, like Pagerank but also any other authority/trust metric where you can say "this data is probably quality data".

A rather basic example, published/last modified dates. It's well known in SEO circles at least in the recent past that changing them is useful to rank in Google, because Google prefers fresh content. Unless you're Google or have a less than trivial way of measuring page changes, the data may be less than trustworthy.


Not even Google seems to be making use of that capability, if they even have it in the first place. I'm regularly annoyed by results claiming to be from this year, only to find that it's a years-old article with fake metadata.


They are quite good at near content duplicate detection so I imagine it's within their capabilities. Whether they care about recency, maybe not as long as the user metrics say the page is useful. Maybe a fallacy about content recency.

You don't see many geocities style sites nowadays, even though there's many older sites with quality (and original) content. Maybe mobile friendliness plays into that though.


Yeah, dates in Google results have become all but useless. It's just another meaningless knob for SEOtards to abuse.


Say what you want, but Macromedia Dreamweaver came pretty close to being "that killer app". Microsoft attempted the same with Frontpage, but abandoned it pretty quickly as they always do.

I think that Web Browsers need to change what they are. They need to be able to understand content, correlate it, and distribute it. If a Browser sees itself not as a consuming app, but as a _contributing_ and _seeding_ app, it could influence the semantic web pretty quickly, and make it much more awesome.

Beaker Browser came pretty close to that idea (but it was abandoned, too).

Humans won't give a damn about hand-written semantic code, so you need to make the tools better that produce that code.


> There hasn't been a Wikipedia-like event for many decades.

Off the top of my head...

OpenStreetMap was in 2004. Mastodon and the associated spec-thingy was around 2016. One/two decades is not the same as many decades.

Oh, and what about asm.js? Sure, archive.org is many decades old. But suddenly I'm using it to play every retro game under the sun on my browser. And we can try out a lot of FOSS software in the browser without installing things. Didn't someone post a blog to explain X11 where the examples were running a javascript implementation of the X window system?

Seems to me the entire web-o-sphere leveled up over the past decade. I mean, it's so good in fact that I can run an LLM clientside in the browser. (Granted, it's probably trained in part on your public musing that the web is worse.)

And all this while still rendering the Berkshire Hathaway website correctly for many decades. How many times would the Gnome devs have broken it by now? How many times would Apple have forced an "iWeb" upgrade in that time?

Edit: typo


The web browser (or an app with a vague likeness to a browser) would indeed be in the epicenter of a "semantic" leap if that happens.

The technical capability of the browser to be an OS within an OS is more than proven by now, but not sure I am impressed with the utility thus far.

At the same time even basic features in the "right direction", empowering the user's information-processing ability (bookmarks, RSS, etc.), have stagnated or regressed.


Over on lobste.rs, someone cited another article retracing the history of the Semantic Web: https://twobithistory.org/2018/05/27/semantic-web.html

An interesting read in itself, and also points to Cory Doctorow giving seven reasons why the Semantic Web will never work: https://people.well.com/user/doctorow/metacrap.htm. They are all good reasons and are unfortunately still valid (although one of his observations towards the end of the text has turned out to be comically wrong, I'll let you read what it is)

Your comment and the two above links point to the same conclusion: again and again, Worse is Better (https://en.wikipedia.org/wiki/Worse_is_better)


> An interesting read in itself...

Indeed a good read, thanks for the link!

> [Cory Doctorow's] seven insurmountable obstacles

I think his context is the narrower "Web of individuals" where many of his seven challenges are real (and ongoing).

The elephant in the digital room is the "Web of organizations", whether that is companies, the public sector, civil society etc. If you revisit his objections in that light they are less true or even relevant. E.g.,

> People lie

Yes. But public companies are increasingly reporting their audited financials online via standards like iXBRL and prescribed taxonomies. Increasingly they need to report environmental impact etc. I mentioned in another comment common EU public procurement ontologies. Think also of the millions of educational and medical institutions and their online content. In an institutional context lies do happen, but at a slightly deeper level :-)

> People are lazy

This only raises the stakes. As somebody mentioned already, the cost of navigating random APIs is high. The reason we still talk about the semantic web despite decades of no-show is precisely the persistent need to overcome this friction.

> People are stupid

We are who we are individually, but again this ignores the collective intelligence of groups. Besides the hordes of helpless individuals and a handful of "big techs"(=the random entities that figured out digital technology ahead of others) there is a vast universe of interests. They are not stupid but there is a learning curve. For the vast part of society the so-called digital transformation is only at its beginning.


You have a very charitable view of this whole thing and I want to believe like you. Perhaps there is a virtuous cycle to be built where infrastructure that relies on people being more honest helps change the culture to actually be more honest, which makes the infrastructure better. You don't wait for people to be nice before you create the GPL; the GPL changes mindsets towards opening up, which fosters a better culture for creating more.

It's also very important to think in macro systems and societies, as you point out, rather than at the individual level


One major problem RDF has is that people hate anything with namespaces. It's a "freedom is slavery" kind of thing. People will accept it grudgingly if Google says it will help their search rankings or if you absolutely have to deal with them to code Java but 80% of people will automatically avoid anything if it has namespaces. (See namespaces in XML)

Another problem is that it's always ignored the basic requirements of most applications like:

1. Getting the list of authors in a publication as references to authority records in the right order (Dublin Core makes the 1970 MARC standard look like something from the Starship Enterprise; see the sketch further down)

2. Updating a data record reliably and transactionally

3. Efficiently unioning graphs for inference so you can combine a domain database with a few database records relevant to a problem + a schema easily

4. Inference involving arithmetic (Gödel warned you about first-order logic plus arithmetic but for boring fields like finance, business, logistics that is the lingua franca; OWL comes across as too heavyweight but completely deficient at the same time and nobody wants to talk about it)

things like that. Try to build an application and you have to invent a lot of that stuff. You have the tools to do it and it's not that hard if you understand the math inside and out but if you don't oh boy.
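
As a taste of point 1 above, here is the sort of thing you end up inventing yourself: an ordered author list as an RDF collection, since a bag of plain dcterms:creator triples carries no order. A sketch with Python's rdflib; the book and author URIs are placeholders, and this is one workaround among several, not a standard:

    from rdflib import BNode, Graph, Namespace, URIRef
    from rdflib.collection import Collection
    from rdflib.namespace import DCTERMS

    EX = Namespace("https://example.org/")

    g = Graph()
    book = URIRef("https://example.org/book/1")
    authors = [EX.alice, EX.bob, EX.carol]  # authority records, in order

    # An RDF list (rdf:first / rdf:rest) preserves the order.
    head = BNode()
    Collection(g, head, authors)
    g.add((book, DCTERMS.creator, head))

    print(g.serialize(format="turtle"))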

If RDF got a few more features it would catch up with where JSON-based tools like

https://www.couchbase.com/products/n1ql/

were 10 years ago.


Every time I read a post like this I'm inclined to post Doctorow's Metacrap piece in response. You got there ahead of me. His reasoning is still valid and continues to make sense to me. Where do you think he's "comically wrong"?


Link counting being reliable for search. After going through people's not-so-noble qualities and how they make the semantic web impossible, he declares counting links as an exception. It was to a comical degree not an exception.


Yes. There is that. Ignobility wins out again.


The implicit metrics of quality and pedigree he believed were superior to human judgement have since been gamified into obsolescence by bots.


I think that the jury is still out on that one. Human judgement is too often colored by human incentives. I still think there's an opportunity for mechanical assessments of quality and pedigree to excel, and exceed what humans can do; at least, at scale. But, it'll always be an arms race and I'm not convinced that bots are in it except in the sense of lying through metadata, which brings us back to the assessment of quality and pedigree - right/wrong, good/bad, relevant/garbage.


item 2.6 kneecapped item 3


Thanks for sharing that Doctorow post, I had not seen that before. While the specific examples are of course dated (hello altavista and Napster), it still rings mostly true.


A killer app is still not enough.

People can’t get HTML right for basic accessibility, so something like the semantic web would be super science that people will go out of their way to intentionally ignore, forgoing any profit from it, so long as they can stay lazy, even at the cost of class-action lawsuit liability.


>> People can’t get HTML right for basic accessibility.

Not only has this gotten much worse; even when you put in the stop gaps for developers such as linters or other plugins, they willfully ignore them and will actually implement code they know is detrimental to accessibility.


I see RDF as a basis to build on. If I think RDF is pretty good but needs a way to keep track of provenance or temporality or something I can probably build something augmented that does that.

If it really works for my company and it is a competitive advantage I would keep quiet about it and I know of more than one company that's done exactly that. The standards process is so exhausting and you have to fight with so many systems programmers who never wrote an application that it's just suicide to go down that road.

BTW, RSS is an RDF application that nobody knows about

https://web.resource.org/rss/1.0/spec

you can totally parse RSS feeds with an RDF/XML parser and do SPARQL and other things with them.
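
A quick sketch of exactly that, using Python's rdflib with a tiny inline RSS 1.0 document (the feed contents are made up):

    from rdflib import Graph

    rss = """<?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns="http://purl.org/rss/1.0/">
      <channel rdf:about="https://example.org/feed">
        <title>Example feed</title>
        <link>https://example.org/</link>
        <description>A made-up RSS 1.0 feed</description>
        <items>
          <rdf:Seq><rdf:li rdf:resource="https://example.org/post-1"/></rdf:Seq>
        </items>
      </channel>
      <item rdf:about="https://example.org/post-1">
        <title>First post</title>
        <link>https://example.org/post-1</link>
      </item>
    </rdf:RDF>"""

    g = Graph()
    g.parse(data=rss, format="xml")  # RSS 1.0 is plain RDF/XML

    # SPARQL over the feed, no RSS-specific parser involved.
    q = """
        PREFIX rss: <http://purl.org/rss/1.0/>
        SELECT ?title ?link
        WHERE { ?item a rss:item ; rss:title ?title ; rss:link ?link . }
    """
    for title, link in g.query(q):
        print(title, link)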


99% of the time you'll get an RSS 2.0 feed which is an XML format. Of course you can convert, but RSS 1.0 seems, like you said, forgotten from the world.


> There hasn't been a Wikipedia-like event for many decades

I'll give you two examples: Internet Archive. Let's Encrypt.


Hardly a good reference, Internet Archive is older than Wikipedia.


Wikipedia itself is only a little over two decades old. I don't think anyone would parse "many decades" as "two decades".

There's also OpenStreetMap, exactly two decades old and thus about three years younger than Wikipedia.


> Wikipedia itself is only a little over two decades old

The world wide web (but not the internet) is only 3 decades old!


Let's Encrypt is very good but it's not exactly a web app, semantic-web or otherwise.


Not true: Wikidata, OpenAlex, Europeana, ... and many smaller projects making use of all that data, such as my project Conzept (https://conze.pt)


Killer applications solve real problems. What is the biggest real problem on the web today? The noise flood. Can semantic web standards help with that? Maybe! Something about trust, integrity, and lineage, perhaps.


Semantic Web doesn't help with the most basic thing: how do you get information? If I want to know when The Matrix was shot, where do I go? Today we have for-profit centralized points to get all information, because it's the only way this can be sustainable. Semantic Web might make it more feasible, by instead having lots of small interconnected agents that trust each other, much like... a Web of Trust. Except we know where the last experiment went (nowhere).


Search and ontologies weren't the only goals. Microformats enabled standardized data markup that lots of applications could consume and understand.

RSS and Atom were semantic web formats. They had a ton of applications built to publish and consume them, and people found the formats incredibly useful.

The idea was that if you ran into ingestible semantic content, your browser, a plugin, or another application could use that data in a specialized way. It worked because it was a standardized and portable data layer as opposed to a soup of meaningless HTML tags.

There were ideas for a distributed P2P social network built on the semantic web, standardized ways to write articles and blog posts, and much more.

If that had caught on, we might have saved ourselves a lot of trouble continually reinventing the wheel. And perhaps we would be in a world without walled gardens.



I think the problem with any sort of ontology type approach is the problem isn't solved when you have defined the one ontology to rule them all after many years of wrangling between experts.

As what you have done is spend many years generating a shared understanding of what that ontology means between the experts. Once that's done you have the much harder task of pushing that shared understanding to the rest of the world.

i.e. the problem isn't defining a tag for a cat - it's having a globally shared vision of what a cat is.

I mean we can't even agree on what is a man or a woman.


You point out a real problem but it does not feel like an unsurmountable and terminal one. By that argument we would never have a human language unless everybody spoke the same language. Turns out once you have well developed languages (and you do, because they are useful even when not universal) you can translate between them. Not perfectly, but generally good enough.

Developing such linking tools between ontologies would be worthwhile if there are multiple ontologies covering the same domain, provided they are actually used (i.e., there are large datasets for each). Alas, instead of a bottom-up, organic approach people try to solve this with top-down, formal (upper-level) ontologies [1] and Leibnizian dreams of an underlying universality [2], which only adds to the cognitive load.

[1] https://en.wikipedia.org/wiki/Formal_ontology

[2] https://en.wikipedia.org/wiki/Characteristica_universalis


> You point out a real problem but it does not feel like an unsurmountable and terminal one

In our spoken language the agents doing the parsing are human AI's (actual intelligences) able to deal with most of the finer nuances in semantics, and still making numerous errors in many contexts that lead to misunderstanding, i.e. parse errors.

There was this hand-waving promise in the semantic web movement of "if only we make everything machine-readable, then .." magic would happen. Undoubtedly unlocking numerous killer apps, if only we had these (increasingly complex) linked data standards and related tools to define and parse 'universal meaning'.

An overreach, imho. The semantic web was always overpromising yet underdelivering. There may be new use cases in combinations of the semantic web with ML/LLMs but I don't think they'll be a vNext of the web anytime soon.


I am not sure I understand the fixation on a "killer app" in the context of web standards. We are talking about things like, say, XML, or SVG or HTTP/2. They can have their rationale and their value simply by serving to enable organic growth of a web ecosystem. I think I agree most with your last sentence and should define success more in those terms, aspiring to a better web.


The idea (or hope) is that apps based on semantic standards would kick off a virtuous cycle where publishers of information keep investing in both generating metadata and evolving the standards themselves. As many have mentioned in the thread, that's not a trivial step.

People sort of try. A concrete example are the Activitypub/Fediverse standards which dared to use json-ld. To my knowledge so far the social media experience of mastodon and friends is not qualitatively different from the old web stuff.


I think you misunderstand me, it's not that I don't understand the idea in and of itself, I don't understand the logic of believing that such a thing is integral to the definition of success of protocols.


I think pointing just to Wikipedia ignores the growing adoption and massive impact of Wikidata. Perhaps I'm biased because of my field, but everything I see indicates the growing and not shrinking power of it. I would categorize it as different than Wikipedia's effects though.


Why do we need web standards for the semantic web anymore when we have LLMs?

Just make LLMs more ubiquitous and train them on the Web. Rather than crawling or something. The LLMs are a lot more resilient.


i think you're confused. the killer app is everyone following the same format, and such, capitalists can extract all that information and sell LLMs that no one wants in place of more deterministic search and data products.


I laughed at this bit:

"Googlers, if you're reading this, JSON-LD could have the same level of public awareness as RSS if only you could release, and then shut down, some kind of app or service in this area. Please, for the good of the web: consider it."


Well, the immediate initial test failed for me: I thought, "why not apply this on one of my own sites, where I have a sort of journal of poetry I've written?"...and there's no category for "Poem", and the request to add Poem as a type [1] is at least 9 years old, links to an even older issue in an unreadable issue tracker without any resolution (and seemingly without much effort to resolve it), and then dies off without having accomplished anything.

[1] https://github.com/schemaorg/suggestions-questions-brainstor...


Having worked in this field for a bit, this uncovers an even more fundamental flaw: The idea that we can have a single static ontology.


Domain driven design is well aware that is not feasible to have a single schema for everything, they use bounded contexts. Is there something similar for the semantic web?


In the Semantic Web, things like ontologies and namespaces play a role similar to bounded contexts in DDD. There’s no exact equivalent, but these tools help different schemas coexist and work together


Isn't that the point of RDF / Owl etc.?


The point of RDF is to raise the problem to the greatest common divisor level of complexity - a hypermedia graph of arbitrary predicates about arbitrary objects.

OWL is a modeling language to describe ontologies, e.g. some constraints people have agreed to follow about how to structure the information they publish in graphs. It can also be considered an advanced schema language.

The idea of a bounded context in DDD is that it is not a good use of time (or indeed may not be feasible at all) to get a single ontology for an entire domain, so different subdomains may be unified by some concepts but have differing or overlapping concepts that they use internally. Two contexts know they are talking about a product called "New Shimmer", even if one understands it as a floor wax and the other uses it as a dessert topping.

The two pillars of the semantic web are public data and machine understanding, which IMHO pushes strongly toward the (often unachievable) goal of a single kitchen-sink schema.
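
A rough sketch of how that plays out: two named graphs standing in for two bounded contexts, describing the same resource with their own (here entirely invented) vocabularies, with nothing forcing them into one kitchen-sink schema. Using Python's rdflib:

    from rdflib import Dataset, Namespace, URIRef
    from rdflib.namespace import RDF

    # Invented vocabularies for the two contexts.
    CLEANING = Namespace("https://cleaning.example/vocab#")
    DESSERT = Namespace("https://dessert.example/vocab#")
    product = URIRef("https://products.example/new-shimmer")

    ds = Dataset()

    # Context 1: the floor-care domain.
    floor = ds.graph(URIRef("https://cleaning.example/graph"))
    floor.add((product, RDF.type, CLEANING.FloorWax))

    # Context 2: the food domain.
    food = ds.graph(URIRef("https://dessert.example/graph"))
    food.add((product, RDF.type, DESSERT.Topping))

    # Both contexts agree on the identity of the product (same URI)
    # but keep their own model of what it is.
    for ctx in (floor, food):
        for _, _, cls in ctx.triples((product, RDF.type, None)):
            print(ctx.identifier, cls)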


Mostly, the problems of a semantic web are covered in the history of Cyc[1].

When I started to use LLMs I thought that was the missing link to convert content to semantic representations, even taking into account the errors/hallucinations within them.

[1] https://en.wikipedia.org/wiki/Cyc


There is also the problem that structure doesn't guarantee meaning.


What kind of work do you do?


Various. Notably, some years ago I had a project that considered automatic consolidation of ontologies based on meta-ontologies and heuristics.

The idea being that everyone would have their own ontology for the data they release and the system would make a consolidated ontology that could be used for automatic integration of data from different datasources.

Regardless, that project did not get traction, so now it sits.


Thanks! Is it possible for you to share more details about the domain and problem?


That's only schema.org! Linked data is so much bigger than that.

Many ontologies have a "poem" type (for example dbpedia (https://dbpedia.org/ontology/Poem) has one), as well as other publishing or book-oriented ontologies.


Every time I've read up on semantic web it's been treated as more or less synonymous with schema.org. Are these other ontologies used by anything?


My mental model of that question is: how would anyone know if an ontology was used by something? One cannot have a search facet in any engine that I'm aware of to search by namespace qualified nouns, and markup is only as good as the application which is able to understand it


They are _used_ very extensively, but not in web pages. Many industry domains rely heavily on linked data and ontologies.

It's true that they are less commonly embedded as semantic data in web pages. There's a real bootstrapping problem there: no reason to embed data if no tools will read it, no reason to build tools if there's no data to read.


The argument about LLMs is wrong, not because of the reasons stated but because semantic meaning shouldn't solely be defined by the publisher.

The real question is whether the average publisher is better than an LLM at accurately classifying their content. My guess is, when it comes to categorization and summarization, an LLM is going to handily win. An easy test is: are publishers experts on topics they talk about? The truth of the internet is no, they're not usually.

The entire world of SEO hacks, blogspam, etc exists because publishers were the only source of truth that the search engine used to determine meaning and quality, which has created all the sorts of misaligned incentives that we've lived with for the past 25 years. At best there are some things publishers can provide as guidance for an LLM, social card, etc, but it can't be the only truth of the content.

Perhaps we will only really reach the promise of 'the semantic web' when we've adequately overcome the principal-agent problem of who gets to define the meaning of things on the web. My sense is that requires classifiers that are controlled by users.


Yet LLMs fail to make these simple but sometimes meaningful differentiations. See for example this case in which a court reporter is described as being all the things he reported about by Copilot: a child molester, a psychiatric escapee, a widow cheat. Presumably because his name was in a lot of articles about said things and LLMs simply associate his name with the crimes without making the connection that he could in fact be simply the messenger and not the criminal. If LLMs had the semantic understanding that the name at the top/bottom of a news article is the author, it would not have made that mistake.

https://www.heise.de/en/news/Copilot-turns-a-court-reporter-...


Absolutely! Today's LLMs can sometimes(/often?) enormously suck and should not be relied upon for critical information. There's a long way to go to make them better, and I'm happy that a lot of people are working on that. Finding meaning in a sea of information is a highly imperfect enterprise regardless of the tech we use.

My point though was that the core problem we should be trying to solve is overcoming the fundamental misalignment of incentives between publisher and reader, not whether we can put a better schema together that we hope people adopt intelligently & non-adversarially, because we know that won't happen in practice. I liked what the author wrote but they also didn't really consider this perspective and as such I think they haven't hit upon a fundamental understanding of the problem.


Humans do something very similar, fwiw. It's called spontaneous trait association: https://www.sciencedirect.com/science/article/abs/pii/S00221...


> fwiw

What do you think this sort of observation is worth?


Really depends on what sort of person you are I guess.

Some people appreciate being shown fascinating aspects of human nature. Some people don't, and I wonder why they're on a forum dedicated to curiosity and discussion. And then, some people get weirdly aggressive if they're shown something that doesn't quite fit in their worldview. This topic in particular seems to draw those out, and it's fascinating to me.

Myself, I thought it was great to learn about spontaneous trait association, because it explains so much weird human behavior. The fact that LLMs do something so similar is, at the very least, an interesting parallel.


>My guess is, when it comes to categorization and summarization, an LLM is going to handily win. An easy test is: are publishers experts on topics they talk about? The truth of the internet is no, they're not usually.

LLMs are not experts either. Furthermore, from what I gather, LLMs are trained on:

>The entire world of SEO hacks, blogspam, etc


This is an excellent rebuttal. I think it is an issue that can be overcome but I appreciate the irony of what you point out :)


> because semantic meaning shouldn't solely be defined by the publisher

LLMs are not that great at understanding semantics though


If even the semantic web people are declaring victory based on a post title and a picture for better integration with Facebook, then it's clear that Semantic Web as it was envisioned is fully 100% dead and buried.

The concept of OWL and the other standards was to annotate the content of pages, that's where the real values lie. Each paragraph the author wrote should have had some metadata about its topic. At the very least, the article metadata was supposed to have included information about the categories of information included in the article.

Having a bit of info on the author, title (redundant, as HTML already has a tag for that), picture, and publication date is almost completely irrelevant for the kinds of things Web 3.0 was supposed to be.


The blog post does not address why the Semantic Web failed:

1. Trust: How should one know that any data marked up according to Semantic Web principles can be trusted? This is an even more pressing question when the data is free. Sir Tim Berners-Lee (AKA "TimBL") designed the Semantic Web in a way that makes "trust" a component, when in truth it is an emergent relation between a well-designed system and its users (my own definition).

2. Lack of Incentives: There is no way to get paid for uploading content that is financially very valuable. I know many financial companies that would like to offer their data in a "Semantic Web" form, but they cannot, because they would not get compensated, and their existence depends on selling that data; some even use Semantic Web standards for internal-only sharing.

3. A lot of SW stuff is either boilerplate or re-discovered formal logic from the 1970s. I read lots of papers that propose some "ontology" but no application that needs it.


> title (redundant, as HTML already has a tag for that)

Note that `title` isn't one of the properties that BlogPosting supports. It supports `headline`, which may well be different from the `<title/>`. It's probably analogous to the page's `<h1/>`, but more reliable.


I had pretty much the same reaction while reading the article. "BlogPosting" isn't particularly informative. The rest of the metadata looked like it could/should be put in <meta> tags, done.

A very bad example if the intention was to demonstrate how cool and useful semweb is :-)


The schema.org data is much richer than meta tags, though. Using the latter, an author is just a string of text containing who-knows-what. The former lets you specify a name, email address, and url. And that's just for the Person type—you can specify an Organization too.
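
For illustration, a small sketch of that difference: a schema.org BlogPosting with a structured author, parsed into triples with Python's rdflib (JSON-LD parsing is built into recent rdflib releases; the names and URLs below are placeholders):

    import json
    from rdflib import Graph

    doc = {
        # Inline context to avoid a network fetch; real pages usually
        # just use "https://schema.org" here.
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "BlogPosting",
        "headline": "An example post",
        "datePublished": "2024-08-21",
        "author": {
            "@type": "Person",
            "name": "Jane Doe",
            "email": "jane@example.org",
            "url": "https://example.org/jane",
        },
    }

    g = Graph()
    g.parse(data=json.dumps(doc), format="json-ld")

    # Unlike a flat <meta name="author" content="Jane Doe">, the author is a
    # node with its own typed properties, not an opaque string.
    for s, p, o in g:
        print(s, p, o)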


That's still just tangential Metadata. The point of a semantic web would be to annotate the semantic content of text. The vision was always that you can run a query like, say, "physics:particles: proton-mass", over the entire web, and it would retrieve parts of web pages that talk about the proton mass.


Which was already possible with RDF. It is hard to not see JSON-LD as anything other than "RDF but in JSON because we don't like XML".


Yeah, this is hiking the original Semantic Web goal post over the horizon, across the ocean, up a mountain, and cutting it down to a little stump downhill in front of the kicker compared to the original claims. "It's going to change the world! Everything will be contained in RDF files that anyone can trivially embed and anyone can run queries against the Knowledge Graph to determine anything they want!"

"We've achieved victory! After over 25 years, if you want to know who wrote a blog post, you can get it from a few sites this way!"

I'd call it damning with faint success, except it really isn't even success. Relative to the promises of "Semantic Web" it's simply a failure. And it's not like Semantic Web was overpromised a bit, but there were good ideas there and the reality is perhaps more prosaic but also useful. No, it's just useless. It failed, and LLMs will be the complete death of it.

The "Semantic Web" is not the idea that the web contains "semantics" and someday we'll have access to them. That the web has information on it is not the solution statement, it's the problem statement. The semantic web is the idea that all this information on the web will be organized, by the owners of the information, voluntarily, and correctly, into a big cross-site Knowledge Graph that can be queried by anybody. To the point that visiting Wikipedia behind the scenes would not be a big chunk of formatted text, but a download of "facts" embedded in tuples in RDF and the screen you read as a human a rendered result of that, where Wikipedia doesn't just use self-hosted data but could grab "the Knowledge Graph" and directly embed other RDF information from the US government or companies or universities. Compare this dream to reality and you can see it doesn't even resemble reality.

Nobody was sitting around twenty years ago going "oh, wow, if we really work at this for 20 years some people might annotate their web blogs with their author and people might be able to write bespoke code to query it, sometimes, if we achieve this it will have all been worth it". The idea is precisely that such an act would be so mundane as to not be something you would think of calling out, just as I don't wax poetic about the <b> tag in HTML being something that changes the world every day. That it would not be something "possible" but that it would be something your browser is automatically doing behind the scenes, along with the other vast amount of RDF-driven stuff it is constantly doing for you all the time. The very fact that someone thinks something so trivial is worth calling out is proof that the idea has utterly failed.


Beautifully said.

I'll also add that I wouldn't even call what he's showing "semantic web", even in this limited form. I would bet that most of the people who add that metadata to their pages view it instead as "implementing the nice sharing link API". The fact that Facebook, Twitter and others decided to converge on JSON-LD with a schema.org schema as the API is mostly an accident of history, rather than someone mining the Knowledge Graph for useful info.


I'm a bit surprised that the author doesn't mention key concepts such as linked data, RDF, federation and web querying. Or even the five stars of linked open data. [1] Sure, JSON-LD is part of it, but it's just a serialization format.

The really neat part is when you start considering universal ontologies and linking to resources published on other domains. This is where your data becomes interoperable and reusable. Even better, through linking you can contextualize and enrich your data. Since linked data is all about creating graphs, creating a link in your data, or publishing data under a specific domain, are acts that involve concepts like trust, authority, authenticity and so on. All those murky social concepts that define what we consider more or less objective truths.

LLMs won't replace the semantic web, nor vice versa. They are complementary to each other. Linked data technologies allow humans to cooperate and evolve domain models with a salience and flexibility which wasn't previously possible behind the walls and moats of discrete digital servers or physical buildings. LLMs work because they are based on large sets of ground truths, but those sets are always limited, which makes inferring new knowledge and asserting its truthiness independent from human intervention next to impossible. LLMs may help us to expand linked data graphs, and linked data graphs fashioned by humans may help improve LLMs.

Creating a juxtaposition between both? Well, that's basically comparing apples against pears. They are two different things.

[1] https://5stardata.info/en/


Metadata in PDFs is also typically based on semantic web standards.

https://www.meridiandiscovery.com/articles/pdf-forensic-anal...

Instead of using JSON-LD it uses RDF written as XML. Still uses the same concept of common vocabularies, but instead of schema.org it uses a collection of various vocabularies including Dublin Core.


The author gives two reasons why AI won't replace the need for metadata:

1: LLMs "routinely get stuff wrong"

2: "pricy GPU time"

1: I run a lot of tests on how well LLMs get categorization and data extraction right or wrong for my Product Chart (https://www.productchart.com) project. And they get pretty hard stuff right 99% of the time already. This will only improve.

2: Loading the frontpage of Reddit takes hundreds of http requests, parses megabytes of text, image and JavaScript code. In the past, this would have been seen as an impossible task to just show some links to articles. In the near future, nobody will see passing a text through an LLM as a noteworthy amount of compute anymore.


Let's hope you never write articles about court cases then: https://www.heise.de/en/news/Copilot-turns-a-court-reporter-...

The alleged low error rate of 1% can ruin your day/life/company, if it hits the wrong person, regards the wrong problem, etc. And that risk is not adequately addressed by hand-waving and pointing people to low error rates. In fact, if anything such claims would make me less confident in your product.

1% error is still a lot if they are the wrong kind of error in the wrong kind of situation. Especially if in that 1% of cases the system is not just slightly wrong, but catastrophically mind-bogglingly wrong.


This is the thing with errors and automation. A 1 % error rate in a human process is basically fine. A 1 % error rate in an automated process is hundreds of thousands of errors per day.

(See also why automated face recognition in public surveillance cameras might be a bad idea.)


Another part is that artificial systems can screw up in fundamentally different ways and modes compared to a human baseline, even if the raw count of errors is lower.

A human might fail to recognize another person in a photo, but at least they won't insist the person is definitely a cartoon character, or blindly follow "I am John Doe" written on someone's cheek in pen.


Exactly. If your system monitors a place like a halfway decent railway station, half a million people per day is a number you could expect. Even with an amazingly low error rate of 1% that would result in 5000 wrong signals a day. If we make the assumption that the people are uniformly spread out throughout a 24 hour cycle, that means a false alarm every 20 seconds.

In reality most of the people are there during the day (false alarm every 10 seconds) and the error percentages are nowhere near 1%.

If you do the math to figure out the staff needed to react to those false alarms in any meaningful way you have to come to the conclusion that just putting people there instead of cameras would be a safer way to reach the goal.
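
For anyone who wants to check the numbers, a quick back-of-the-envelope in Python:

    visitors_per_day = 500_000
    error_rate = 0.01

    false_alarms = visitors_per_day * error_rate  # 5,000 wrong signals per day
    print(24 * 3600 / false_alarms)  # ~17 s between false alarms over 24 hours
    print(12 * 3600 / false_alarms)  # ~9 s if most traffic falls in 12 daytime hours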


Human error rates are also not a constant.

If you're about to publish a career-ending allegation, you're going to spend some extra time fact-checking it.


Can you point to where that claim was made? I can't find it. The parent post assumes 1% for the sake of argument to underline that the impact of the 1% error depends on the number to which the 1% are applied — automation reduces the effort and increases the number.

Hypothetical example: Cops shoot the wrong person in x% of cases. If we equipped all surveillance cameras with guns that also shoot the wrong person in x% of cases the world would be a nightmare pandemonium, simply because there is more cameras and they are running 24/7.

Mind that the precise value of x, and whether it is constant or not, does not impact the argument at all.


I'm the one making the claim, and it does not attempt to refute the parent post, it agrees with/reinforces it.

I'm also making the point that a human with an error rate of x% is not directly comparable to a machine with x% error rate, just via a different line of reasoning.


Isn't this just saying "humans are slow" in a different way?


Yes. Slow is good in these Blackstone's ratio type situations when the cost of false positives are very high.


Is product search a high risk activity? LLMs could be the right tool for building a product search database while also being libelously terrible for news reporting.


> Reddit takes hundreds of http requests, parses megabytes of text, image and JavaScript code [...] to show some links to articles

Yes, and I hate it. I closed Reddit many times because the wait time wasn't worth it.



That definitely seems to be getting less reliable these days. A number of times I've found it refusing to work, or redirecting me to the primary UI arbitrarily, a few months ago there was a time when you couldn't login via that UI (though logging in on main and going back worked for me).

These instances seem to be temporary bugs, but they show that it isn't getting any love (why would it? they only maintain it at all under sufferance) so at some point it'll no doubt be cut off as a cost cutting exercise during a time when ad revenue is low.


Gets buggier every year.


> I make a lot of tests on how well LLMs get categorization and data extraction right or wrong for my Product Chart (https://www.productchart.com) project.

In fact, what you're doing there is building a local semantic database by automatically mining metadata using LLM. The searching part is entirely based on the metadata you gathered, so the GP's point 1 is still perfectly valid.

> In the near future, nobody will see passing a text through an LLM as a noteworthy amount of compute anymore.

Even with all that technological power, LLMs won't replace most simple-searching-over-index, as they are bad at adapting to ever-changing datasets. They can only make it easier.


Oh nice, Product Chart looks like a great fit for what LLMs can actually do. I'm generally pretty skeptical about LLMs getting used, but looking at the smart phone tool: this is the sort of product search missing from online stores.

Critically, if the LLM gets something wrong, a user can notice and flag it, then someone can manually fix it. That's 100x less work than manually curating the product info (assuming 1% error rate).


Only slightly tongue in cheek, but if your measure of success is Reddit, perhaps a better example may serve your argument?


The argument for "LLMs get it right 99% of the time" is also very generalized and doesn't take into account smaller websites


It’s baffling how defeatist and ignorant engineering culture has become when someone else’s non-deterministic, proprietary and non-debuggable code, running on someone else’s machine, that uses an enormous amount of currently VC-subsidized resources, is touted as a general solution to a data annotation problem.

Back in my day people used to bash on JavaScript. Today one can only dream of a world where JS is the worst of our engineering problems.


LLMs have no soul, so I like content and curation from real people


The main problem is that the incentive for well-intentioned people to add detailed and accurate metadata is much lower than the incentive for SEO dudes to abuse the system if the metadata is used for anything of consequence. There's a reason why search engines that trusted website metadata went extinct.

That's the whole benefit of using LLMs for categorization: they work for you, not for the SEO guy... well, prompt injection tricks aside.


There is value-add if you can prove whatever content you are producing is from an authentic human, because I dislike LLM produced garbage


The point is that metadata lies. Intentionally, instead of just being coincidentally wrong. For example everybody who wants to spew LLM produced garbage in your face will go out of their way to attach metadata claiming the opposite. The value proposition of LLM categorization would be that the LLM looks at the same content as the eventual human (if, in fact, it does - which is a related but different problem)


Huh, it's not often you hear a religious argument in a technical discussion. Interesting viewpoint!


I don't see it as anything religious. I see the comment about something having an intrinsic, instinctive quality, which we can categorise as having "soul".


> intrinsic, instinctive quality,

What are a few examples of things with an 'intrinsic, instinctive quality'?


How about a Peanuts comic strip by Charles M. Schulz. Oozes humanity in both the writing and the shaky, loose penmanship.


That's even more interesting! The only non-religious meaning of soul I've ever heard is a music genre, but then English is my second language. I tried googling it and found this meaning I wasn't aware of:

emotional or intellectual energy or intensity, especially as revealed in a work of art or an artistic performance. "their interpretation lacked soul"

Is this the definition used? I'm not sure how a JSON document is supposed to convey emotional or intellectual energy, especially since it's basically a collection of tags. Maybe I also lack soul?

Or is there yet another definition I didn't find?


It's early 20th century (and later) black American dialect to say things "have soul" or "don't have soul." In the West, Black Americans are associated with a mystical connection to the Earth, deeper understandings, and suffering.

So LLMs are not gritty and down and dirty, and don't get down. They're not the real stuff.


Mystical connection? Now you're back to religion.

If you wanna be down you gotta keep it real, and mysticism is categorically not that.


All the web metadata I consume is organic and responsively farmed.


GPU compute price is dropping fast and will continue to do so.


The cost of GPU time isn't just the cost that you see (buying them initially, paying for service if they are not yours, paying for electricity if they are) but the cost to the environment. Data centre power draws are increasing significantly and the recent explosion in LLM model creation is part of that.

Yes, things are getting better per unit (GPUs get more efficient, better yet AI-optimised chipsets are an order more efficient than using GPUs, etc.) but are they getting better per unit of compute faster than the number of compute units being used is increasing ATM?


But is it dropping faster than the needs of the next model that needs to be trained?


Short answer is yes.

Also, GPU pricing is hardly relevant. From now on we will see dedicated co-processors on the GPU to handle these things.

They will keep on keeping up with the demand until we meet actual physical limits.


How does Product Chart use LLMs?


We research all product data manually and then have AI cross-check the data and see how well it can replicate what the human has researched and whether it can find errors.

Actually, building the AI agent for data research takes up most of my time these days.


Have you seen https://superagent.sh/ ? It's an interesting one and not terrible in the test cases I tried. (Requires pretty specific descriptions for the fields though)


For my part, I stopped reading at the gratuitous bashing of blockchain•.

Reminded me of the angst and negativity of these original "Web3" people, already bashing everything that didn't suit their mood back then.

• The crypto ecosystem is shady, I know, but the tech is great


As someone who stopped getting involved in blockchain "tech" 12 years ago because of the prevalence of scams and bad actors and lack of interesting tech beyond the merkle tree, what's great about it?

FWIW I am genuinely asking. I don't know anything about the current tech. There's something about "zero knowledge proofs" but I don't understand how much of that is used in practice for real blockchain things vs just being research.

As far as I know, the throughput of blockchain transactions at scale is miserably slow and expensive and their usual solution is some kind of side channel that skips the full validation.

Distributed computation on the blockchain isn't really used for much other than converting between currencies and minting new ones, AFAIK.

What is the great tech that we got from the blockchain revolution?


Scams and bad actors haven't changed sadly.

But zk-based really decentralized consensus now does 400 tps and it's extraordinary when you think about it and all the safety and security properties it brings.

And that's with proof-of-stake of course with decentralized sequencers for L2.

But I get that people here prefer centralized databases, managed by admins and censorship-empowering platforms. Your bank stack looks like it's designed for fraud too. Manual operations and months-long audits with errors, but that is by design. Thanks everyone for all the downvotes.


> But I get that people here prefer

For many of us it isn't that we think the status quo is the RightWay™ - we just aren't convinced that crypto as it currently is presents a better answer. It fixes some problems, but adds a number of its own that many of us don't think are currently worth the compromise for our needs.

As you said yourself:

> The crypto ecosystem is shady, I know, but the tech is great

That but is not enough for me to want to take part. Yes the tech is useful, heck I use it for other things (blockchains existed as auditing mechanisms long before crypto-currencies), but I'm not going to encourage others to take part in an ecosystem that is as shady as crypto is.

> Thanks everyone for all the downvotes.

I don't think you are getting downvoted for supporting crypto, more likely because you basically said “you know that article you are all discussing? Well, I think you'll want to know that I didn't bother to read it”, and then without a hint of irony made assertions of “angst and negativity”.

And if I might make a mental health suggestion: caring about online downvotes is seldom part of a path to happiness :)


The main problem with blockchain is identical to the one with LLMs. When snake oil salesmen try to apply the same solution to every problem, you stop wasting your time with those salesmen.

Both can be useful now and then, but the legit uses are lost in the noise.

And for blockchain... it was launched with the promise of decentralized currency. But we had decentralized currency in the physical world until the past few hundred years, when we abandoned it in favor of centralized currency for some reason. I don't know, reliability perhaps?


> And for blockchain... it was launched with the promise of decentralized currency.

Cryptocurrencies were launched with that promise.

They are but one use¹ of block-chains / merkle-trees, which existed long before them².

----

[1] https://en.wikipedia.org/wiki/Merkle_tree#Uses

[2] 1982 for blockchains/trees as part of a distributed protocol, as people generally mean when they use the words now³; hash chains/trees themselves go back at least as far as 1979, when Ralph Merkle patented the idea

[3] https://en.wikipedia.org/wiki/Blockchain#History


But if you put it that way neural networks were defined in the 70s too :)


Very much so. Is there a problem with that? To what time period would you attribute their creation?

In fact it is only the 70s if you mean networks that learn via backprop & similar methods. Some theoretical work on artificial neurons was done in the 40s.


The point is whatever you said in defense of blockchain/crypto applies or does not apply to neural networks/LLMs in equal measure.

I for one fail to see the difference between these two kinds of snake oil.

> Some theoretical work on artificial neurons was done in the 40s.

"The perceptron was invented in 1943 by Warren McCulloch and Walter Pitts. The first hardware implementation was Mark I Perceptron machine built in 1957"


> whatever you said in defense of blockchain/crypto

You seem to be labouring under the impression that blockchain and cryptocurrencies are one and the same. The point you seem to be missing is that I'm saying they are not. Blockchains (usually actually trees, like merkle trees) are a thing that existed long before cryptocurrencies, which are one application of the technique.

> I for one fail to see the difference between these two kinds of snake oil.

The gaggle of quackish sales people with miracle cures based on LLMs is pretty much the same sort of quackish sales people that touted miracle cures based on crypto currencies, yes. But LLMs are one use of neural networks and crypto/proof-of-work is one use of blockchains.

This all started with me correcting “And for blockchain... it was launched with the promise of decentralized currency.” — which is the incorrect equivalency of blockchain/cryptocurrency writ large.

> > Some theoretical work on artificial neurons was done in the 40s.

> "The perceptron was invented in 1943 by Warren McCulloch and Walter Pitts.

Exactly. You've just repeated my sentence with a little more detail.

40s: theory

50s: attempts at practical implementation

early 70s: backprop methods (as we currently mean the term wrt neural networks; backpropagation as a more general concept existed before that) first published, starting that decade's big excitement over the potential for neural networks.


Gold is and has been a decentralized currency for a very long time. It’s mostly just very inconvenient to transport.

> Then we abandoned it in favor of centralized currency for some reason. I don't know, reliability perhaps?

The global economy practically requires a centralized currency, because the value of your currency versus other countries' currencies becomes extremely important for trading in a global economy (importers want a high-value currency, exporters want a low one).

It’s also a requirement to do financial meddling like what the US has been doing with interest rates to curb inflation. None of that is possible on the blockchain without a central authority.


> Gold is and has been a decentralized currency for a very long time. It’s mostly just very inconvenient to transport.

Even precious metal coins became endorsed by one authority or another (the cities/banks/little kingdoms stamping the coins). Because you as a normal person don't have the resources to validate every single piece of gold/silver you are paid with.

There was also a short period when every third bank had its own paper currency. That seems to be gone too. Perhaps because, as a normal person, maintaining a list of banks you could trust was too much.


I don’t think having a centralized validation authority makes a currency centralized. Centralized currency usually implies the means to control that currency. While the monarch may control the supply of gold minted into coins, they don’t control the supply of gold itself.

It would have been an inconvenient currency for small transactions, but it’s still a currency.

The bank currencies were weird. Iirc, some of that was wrapped up in the Civil War and the Confederate currency being “official” but also basically worthless towards the end of the war. I think the Great Depression killed them, when banks became insolvent and their currencies became worthless.


> The main problem with blockchain is identical to the one with LLMs. When snake oil salesmen try to apply the same solution to every problem, you stop wasting your time with those salesmen.

+100

Rule in blockchain: Whenever there is money beyond paying for services/infra like AWS, there is a problem.


> I don't think you are getting downvoted for supporting crypto

Still, every one of my posts that is more or less supportive of crypto gets downvoted. And I am the first to say the ecosystem is one of the worst in tech, so it's always mild support.

But yes, you're right it's probably sem web people overreacting to _my_ rant :)


> If Web 3.0 is already here, where is it, then? Mostly, it's hidden in the markup.

I feel like this is so obvious to point out that I must be missing something, but the whole article goes to heroic lengths to avoid... HTML. Is it because HTML is difficult and scary? Why invent a custom JSON format and a custom JSON-to-HTML compiler toolchain rather than just write HTML?

The semantics aren't hidden in the markup. The semantics are the markup.


I think that’s what we’re doing today, and it’s a phenomenal mess.

The typical HTML page these days is horrifically bloated, and whilst it’s machine parsable, it’s often complicated to actually understand what’s what. It’s random nested divs and unified everything. All the way down.

But I do wonder if adding context to existing HTML might be better than a whole other JSON blob that’ll get out of sync fast.


I'm just not convinced that swapping out "<ol></ol>" for "[]" actually addresses any of the problems.


I must have missed your point, but isn't the answer obviously that HTML is very, very limited and intended as a way to mark up text? Semantic data is a way to go further and make machine-readable what is actually inside that text: recipes, places, people, posts, animals, etc, etc, and all their various attributes and how they relate to each other.

Basically, what you are saying is already rdf/xml, except that devs don't like xml, so json-ld came along as a more human- and machine-friendly way to do rdf/xml.

There are also various microdata formats that allow you to annotate html in a way that machines can parse as rdf. But that can be limited in some cases if you want to convey more metadata.


Why should anybody do that though? It doesn't benefit individual users, it benefits web scrapers mostly. Search bots are pretty sophisticated at parsing HTML so it isn't an issue there.


Web 1.0 = read

Web 2.0 = read/write

Web 3.0 = read/write/own


You could make the case that we already are in Web 3.0, or that we have regressed into Web 1.0 territory.

Back in actual Web 2.0, the internet was not dominated by large platforms but was more spread out, with people hosting their own websites. Interaction was everywhere and the spirit revolved around "p2p exchange" (not in the technological sense).

Now, most traffic goes over large companies which own your data, tell you what to see and severely limit genuine exchange. Unless you count out the willingness of "content monkeys", that is.

What has changed? The internet has settled for a lowest common denominator and moved away from being a space of tech-savvy people (primarily via the arrival of smartphones). The WWW used to be the unowned land in the wild west, but it has now been colonized by an empire from another world.


> Web3 = read/write/own

This is "reminagined" and more commonly known as just "Web3". Entirely different from the older conceptual "Web 3.0" in context of semantic web.

So more like:

Web 3.0 = read/write/describe

https://en.wikipedia.org/wiki/Semantic_Web

The resolution to this conundrum is of course to do both simultaneously and refer to it as "Web 5". Ask my friend Jack.


Are there any tools that employ LLMs to fill out the Semantic Web data? I can see that being a high-impact use case: people don’t generally like manually filling out all the fields in a schema (it is indeed “a bother”), but an LLM could fill it out for you – and then you could tweak for correctness / editorializing. Voila, bother reduced!

This would also address the two reasons why the author thinks AI is not suited to this task:

1. human stays in the loop by (ideally) checking the JSON-LD before publishing; so fewer hallucination errors

2. LLM compute is limited to one time per published content and it’s done by the publisher. The bots can continue to be low-GPU crawlers just as they are now, since they can traverse the neat and tidy JSON-LD.

——————

The author makes a good case for The Semantic Web and I’ll be keeping it in mind for the next time I publish something, and in general this will add some nice color to how I think about the web.


Bringing an LLM into the picture is just silly. There's zero need.

The author (and much of HN?) seems to be unaware that it's not just thousands of websites using JSON-LD, it's millions.

For example: install WordPress, install an SEO plugin like Yoast, and boom, you're done. Basic JSON-LD will be generated expressing semantic information about all your blog posts, videos etc. It only takes a few lines of code to extend what shows up by default, and other CMSes support this too.
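
For a typical blog post the emitted block looks something like this (a minimal sketch; the exact fields vary by plugin and every value here is illustrative):

    <!-- illustrative only: real plugins emit more fields -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Example post title",
      "datePublished": "2024-08-21",
      "author": { "@type": "Person", "name": "Jane Doe" },
      "publisher": { "@type": "Organization", "name": "Example Blog" }
    }
    </script>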

SEOs know all about this topic because Google looks for JSON-LD in your document and it makes a significant difference to how your site is presented in search results as well as all those other fancy UI modules that show up on Google.

Anyone who wants to understand how this is working massively, at scale, across millions of websites today, implemented consciously by thousands of businesses, should start here:

https://developers.google.com/search/docs/appearance/structu...

https://search.google.com/test/rich-results

Is this the "Semantic Web" that was dreamed of in yesteryear? Well it hasn't gone as far and as fast as the academics hoped, but does anything?

The rudimentary semantic expression is already out there on the Web, deployed at scale today. Someone creative with market pull could easily expand on this e.g. maybe someday a competitor to Google or another Big Tech expands the set of semantic information a bit if it's relevant to their business scenarios.

It's all happening, it's just happening in the way that commercial markets make things happen.


I guess the question is where you go from basic info that can be machine-generated to rich information that's worth consuming for things other than link previews and specific Google Search integrations?


It depends: are we pontificating as technologists, or addressing real market needs? Given that the framework is already there and widely adopted, I think additional JSON-LD schemas will take off the moment there is a viable commercial scenario where a company sees a profit opportunity. I don't think Great Thinker technologists are likely to alter the behavior of the market at scale by themselves.


The Semantic Web has now been revived in its new marketing incarnation, called Knowledge Graphs. There's actually a lot of work on building KGs with LLMs, especially in the RAG space, e.g., Microsoft's GraphRAG and llama_index's KnowledgeGraphIndex.


I think the future holds a synthesis of LLM functions with semantic entities and logic from knowledge graphs (this is called "neuro-symbolic AI"), so each topic/object can have a clear context, upon which you can start prompting the AI for the preferred action/intention.

Already implemented in part on my Conzept Encyclopedia project (using OpenAI): https://conze.pt/explore/%22Neuro-symbolic%20AI%22?l=en&ds=r...

Something like this is much easier to do using the semantic web (a 3D interactive occurrence map for an organism): https://conze.pt/explore/Trogon?l=en&ds=reference&t=link&bat...

On Conzept, one or more bookmarks you create can be used in various LLM functions. One of the next steps is to integrate a local WebGPU-based frontend LLM and see what 'free' prompting can unlock.

JSON-LD is also created dynamically for each topic, based on Wikidata data, to set the page metadata.


> Googlers, if you're reading this, JSON-LD could have the same level of public awareness as RSS if only you could release, and then shut down, some kind of app or service in this area. Please, for the good of the web: consider it.

Google has been pushing JSON-LD to webmasters for better SEO for at least 5 years, if not more: https://developers.google.com/search/docs/appearance/structu...

There really isn't a need to do it as most of the relevant page metadata is already captured as part of the Open Graph protocol[0] that Twitter and Facebook popularized 10+ years ago as webmasters were attempting to set up rich link previews for URLs posted to those networks. Markup like this:

<meta property="og:type" content="video.movie" />

is common on most sites now, so what benefit is there for doing additional work to generate JSON-LD with the same data?
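
For comparison, roughly the same information in JSON-LD would be something like this sketch (using schema.org's Movie type; the values are made up):

    <!-- illustrative JSON-LD roughly equivalent to the og tag above -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Movie",
      "name": "Example Movie",
      "image": "https://example.com/poster.jpg"
    }
    </script>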

[0]https://ogp.me/


> Before JSON-LD there was a nest of other, more XMLy, standards emitted by the various web steering groups. These actually have very, very deep support in many places (for example in library and archival systems) but on the open web they are not a goer.

If archival systems and libraries are using XML, wouldn't it be preferable to follow their lead and whatever standards they are using? After all, they are most likely the ones who are going to use this stuff the most.

If nothing else, you can add a processing instruction to the document they use to convert it to HTML.


The format really isn't much of an issue. From an information point of view, the content of the different formats is identical, and translation among them is straightforward.

Promoting JSON-LD potentially makes it more palatable to the modern web creators, perhaps increasing adoption. The bots have already adapted.


You're aware of straightforward translations to and from E-ARK SIP and CSIP? Between what formats?

As far as I can tell archivists don't care about "modern web creators", and they likely shouldn't, since archiving is a long term project. I know I don't, and I'm only building software for digital archiving.


If by that the author means JSON-LD has replaced MarcXML, BibTex records, and other bibliographic information systems, then that's very much not the case.


They recognise that in the quoted paragraph. The JSON-LD thing was only about the open web:

> [MarcXML, BibTex etc] actually have very, very deep support in many places (for example in library and archival systems) but on the open web they are not a goer.


> If nothing else, you can add a processing instruction to the document they use to convert it to HTML.

Like XSLT?


Worth the bother for whom? As a blogger, these things are 99% for the companies making a profit by scraping my content; maybe 1% of the users will need them. Or am I wrong?


This has been my hang up as well. Providing metadata seems extremely useful and powerful, but coming into web development in the mid 10s rather than mid 00s made it more clear that the metadata would largely just help a handful of massive corporations.

I will still include JSON-LD when it makes financial sense for a site. In practice that usually just means business metadata for search results and product data for any ecommerce pages.


I started this thread on the w3c list almost 20 years ago - https://lists.w3.org/Archives/Public/semantic-web/2005Dec/00...

Unfortunately, it is unlikely we will ever get something like a Semantic Web. It seemed like a good idea at the beginning of the 2000s, but now there is honestly no need for it, as it is quite cheap and easy to attach meaning to text thanks to the progress in LLMs and NLP.


Exactly. Afaik, there are certain corners of the Web that benefit from some kind of markup. I think real estate is one, where you can generate searches of the MLS on sites like Redfin or Zillow (or any realtor's site, really) such that you can set parameters: between 1000 and 1500 square feet (or meters in Europe), with a garage and no basement. That's very helpful (although I don't know whether that searching is done over indexed web pages, or on the MLS itself). But most of the Web, afaict, has nothing like that---and doesn't need it, because NLP can distinguish different senses of 'bank' (financial vs. river), etc.


Ehm... The semantic web as an idea was/is a totally different thing: the idea is the old Library of Babel / Bibliotheca Universalis by Conrad Gessner (~1545) [1], or the ability to "narrow"|"select"|"find" just "the small bit of information I want". A book is excellent for developing and sharing a specific topic, and it has some indexes to help find specific information directly, but that's not enough: a library of books can't be traversed quickly enough to find a very specific bit of information, like when John Smith was born and where.

The semantic web's original idea was the interconnection of every bit of information in a format a machine can traverse on behalf of a human, so the human can find any specific bit ever written with little to no effort, without having to manually scan pages of moderately related stuff.

We never achieved that goal. Some have tried to be more on the machine side, like Wikidata; some have pushed the library-science SGML idea of universal classification to the extreme, all the way to JSON; but all are failures, because they are neither universal nor easy to use to "select and assemble specific bits of information" from human queries.

LLMs are a failed attempt to achieve the same result another way: their hallucinations and the slow formation of a model prove their substantial failure. They SEEM to succeed to a distracted eye that perceives just the wow effect, but in practice they fail.

Aside from that, the issue with ALL the attempts made so far on the metadata side of the spectrum is simple: in theory we can all be good citizens and carefully label everything, even classify every single page following Dublin Core et al.; in practice very few do so. All the rest do not care, either ignoring the classification entirely or implementing it badly, and the result is like an archive with missing documents: you'll always have holes in the information, breaking the credibility/practical usefulness of the tool.

Essentially that's why we keep using search engines every day, with classic keyword-based matching and some extras on top. Words are the common denominator for textual information, and the larger slice of our information is textual.

[1] https://en.wikipedia.org/wiki/Bibliotheca_universalis


The problem I find with semantic search is that first I have to read and understand somebody else's definitions before I can search within the confines of the ontology.

The problem I have with ML guided search is the ML takes web average view of what I mean, which sometimes I need to understand and then try and work around if that's wrong. It can become impossible to find stuff off the beaten track.

The nice thing about keyword and exact text searching with fast iteration is it's my mental model that is driving the results. However if it's an area I don't know much about there is a chicken and egg problem of knowing which words to use.


Personally I think the limitation of keyword search is not in the model per se but in human language: we have synonyms, which are relatively easy to handle, but we also have a gazillion different ways to express the very same concept that simply can't be squeezed into some "nearby keyword list".

Personally I take notes on news, importing articles into org-mode, so I have a "trail" of the news I think is relevant, in a timeline. Sometimes I remember I've noted something but I can't find it immediately in my own notes with local full-text search, on a very small corpus compared to the entire web, simply because one title expresses something with very different words than another, and at the moment of the search I don't think of that possible phrasing.

For casual searches we don't notice, but for some specific searches this emerges very clearly as a big limitation. So far LLMs do not solve it: they are even LESS able to extract the relevant information, and "semantic" classifications do not seem to be effective either, which is even easier to spot if you use Zotero with tags and really try to use tags to look for something: in the end you'll resort to mere keyword search for everything.

That's why IMVHO it's a so-far-unsolved problem.


For me the search problem isn't so much about making sure I get back all potentially relevant hits (more than I could ever read), it's how I get the specific ones I want...

So effective search is more about excluding than including.

Exact phrases or particular keywords are great tools here.

Note there is also a difference between finding an answer to a particular question and finding web pages around a particular topic. Perhaps LLMs are more useful for the former, where there is a need to both map the question to an embedding and summarize the answer, but for the latter I'm not interested in a summary/quick answer, I'm interested in the source material.

Sometimes you can combine the two: LLMs for a quick route into the common jargon, which can then be used as keywords.


> there is also a difference between finding an answer to a particular question and finding web pages around a particular topic.

That's exactly Gessner's problem, and the solution is so far unimplementable: we have information packed in various forms, and when we want specific bits it's hard to narrow down the available information packages (books, articles, posts etc.) enough.

For common stuff in the vast sea of the web we very frequently end up on exactly or nearly exactly what we want, but for less common stuff there is a big problem. Take the historical temperature of a small place where many have recorded temperatures, but no one has created a table and a graph of daily minima and maxima for, let's say, the ten-year period we want: it's hard to find an answer. For some kinds of data we have Wikidata, for some others we have public datasets from various sources, but still no easy way to locate and narrow them down.

LLMs are in theory an answer, but unfortunately not in practice: first because of the freshness problem (you have to train a model, and ingesting new information is still training, which takes far longer than a search engine crawler), and second because of training itself, which is like the opposite of the semantic web. Instead of having the author prepare the information for machines, we have third parties who do it en masse, and unfortunately they tend to be even less precise than semantic-web-minded authors, PLUS model hallucinations and essentially no ability to tell where their answer came from.

That's why we still miss semantic search...


I don't see how one can have any hope of a Semantic Web ever succeeding when we haven't even managed to get HTML tags for extremely common Internet things: pricetags, comments, units, avatars, usernames, advertisements and so on. Even things like pagination are generally just a bunch of links, not any kind of semantic thing holding multiple documents together (<link rel> exists, but I haven't seen browsers doing anything with it). Take your average website and look at all the <div>s and <span>s and there is a whole lot more low-hanging fruit one could turn semantic, but there seems to be little interest in even trying.


I don't think we necessarily need new tags: they narrow the space of possibilities down into an immutable set and require changing the structure of your already existing content. What exists instead are microformats (http://microformats.org/wiki/microformats2), a bunch of classes you sprinkle into your current HTML to "augment" it.
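
An h-card, for example, is just ordinary markup with a few extra classes (a minimal sketch; the name and URL are made up):

    <!-- illustrative microformats2 h-card -->
    <div class="h-card">
      <a class="p-name u-url" href="https://example.com">Jane Doe</a>,
      <span class="p-org">Example Corp</span>
    </div>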


I include microformats on blog sites, but at scale the challenge with microformats is that most existing tooling doesn't consider class names at all for semantics.

Browsers, for example, completely ignore classes when building the accessibility tree for a web page. Only the HTML structure and a handful of CSS properties have an impact on accessibility.

Class names were always meant as an ease of use feature for styling, overloading them with semantic meaning could break a number of sites built over the last few decades.


There are also RDFa and the even more obscure Microdata to augment HTML elements. Google’s schema.org vocabulary originally used these before switching to JSON-LD.

The trick, as always, is to get people to use it.


Everyone is optimizing for their own local use-case. Even open-source. Standards get adopted sometimes, but only if they solve a specific problem.

There is an additional cost to making or using ontologies, making them available and publishing open data on the semantic web. The cost is quite high, the returns aren't immediate, obvious or guaranteed at all.

The vision of the semantic web is still valid. The incentives to get there are just not in place.


There's a project [0] that parses Commoncrawl data for various schemas, it contains some interesting datasets.

[0] http://webdatacommons.org/


That’s a really useful link, thanks for sharing. We’re building a scraping service and currently only rely on parsing native HTML tags and Open Graph metadata; based on this link we should definitely take a step forward and parse JSON-LD as well.


I've been playing with RSS feeds recently, and suddenly it occurred to me: XML can be transformed into anything with XSL. For statically hosted personal blogs, I can save articles into the feeds directly, then serve a frontend single-page application with some static XSLT+JS. This is content-presentation separation at its best.

Is JSON-LD just a reinvention of this?


Back in the optimistic 2000s there was the brief idea of GRDDL – using XSLT stylesheets and XPath selectors for extracting stuff from HTML, e.g. microformats, HTML meta, FOAF, etc, and then transforming it into RDF or other things:

https://www.w3.org/TR/grddl/


But why? Isn't most of the information you can extract from those tags stuff that's pretty obvious, like title and author (the examples the linked page uses)? How do you extract really useful information using that methodology, supporting searches that answer queries like "110 volt socket accepting grounding plugs"? Of course search engines can (and do) get such info, but afaik it doesn't require or use XSLT beyond extracting the plain text.


That is exactly the thought behind SGML/XML and its derivatives. XSL is kind of clumsy but very powerful and the most direct way to transform documents.

JSON-LD to me looks more like trying to glue different documents together; it's not about the transformation itself.


> This is content-presentation separation at best.

The idea is the best, but arguably the implementation is lacking.

> Is JSON-LD just reinventation of this?

Yup. It's "RDF/XML but we don't like XML"


Is that really what Discord, Whatsapp & co are using to display the embed widgets they have or is it just <meta> tags like I would expect...?


There are several methods they may use:

- OpenGraph (by Facebook, probably used by Whatsapp) – https://ogp.me/

- Schema.org markup (the main point of this blog) – https://schema.org/

- oEmbed (used to embed media in another page, e.g. YouTube videos on a WordPress blog) – https://oembed.com/


"Googlers, if you're reading this, JSON-LD could have the same level of public awareness as RSS if only you could release, and then shut down, some kind of app or service in this area. Please, for the good of the web: consider it." - lol


The question is: does this bring any of the purported benefits of the Semantic Web? Does it suddenly allow "agents" to understand the meaning of your web pages, or are we just complying with a set of pre-defined schemas that predefined software (or more specifically, Google, in practice) understands and knows how to render. In other words, was all the SemWeb rigmarole actually necessary, or could the same results have been achieved using any of the mentioned simpler alternatives (microdata, OpenGraph tags, or even just JSON schemas)?


So much jumping to defend LLMs as the future. I'd like to point out that LLMs hallucinate, can be injected, and often lack the context that well-structured metadata can provide. At the very least, I don't want an LLM to hallucinate the author's picture and bio based on hints in the article, thank you very much.

I don't think that one is necessarily better than the other, but imagining that LLMs are a silver bullet, when another trending story on the front page is about prompt injection used against the Slack AI bots, sounds a bit over-optimistic.


Sure, but do hallucinations matter that much just for categorisation? Hardly the end of the world if they make up a published date occasionally.

And prompt injection is irrelevant because the alternative we're considering is letting publishers directly choose the metadata.


Prompt injection is highly relevant because you end up achieving the same thing as the publisher choosing the metadata, but at a much higher price for the user. A price which needs to be paid by each user separately, instead of reusing metadata that has already been generated once.

LLMs are much better when the user adapts the categories to their needs or crunches the text to pull only the info relevant to them. Communicating those categories and the cutoff criteria would be an issue in some contexts, but still better if communication is not the goal. Domain knowledge is also important, because nitch topics are not represented in the llm datasets and their abilities fail in such scenarios.

As I said above, one is not necessarily better than the other and it depends on the use cases.


> Prompt injection is highly relevant because you end up achieving the same as the publisher choosing the metadata, but on a much higher price for the user.

How does price affect the relevance of prompt injection? That doesn't make sense.

> nitch

Niche. Pronounced neesh.


My question is: how does price not matter? If you are given the choice to pay either a dollar or a million dollars for the same good from an untrustworthy merchant, why would you pay the million? And the difference between parsing a JSON document and sending a few megabytes of a webpage to ChatGPT is of that order, if not bigger. For a dishonest SEO engineer it does not matter whether they post boastful metadata or a prompt convincing ChatGPT of the same. The difference is for the user.

I don't mind the delusions of most people, but the idea that LLMs will deal with spam if you throw a million times more electricity at the problem is what keeps the planet burning.


Price matters, but you said prompt injection is relevant because of price. Maybe a typo...


As much as I like the ideas behind the semantic web, JSON-LD feels like the least friendly of all semantic markup options (compared to something like, say, microformats)


I think the main issue with microformats is most CMSes don't really have a good way of adding them. You need a very capable rich editor to add semantic data inline or edit the output HTML by hand. Simple markup like WikiText and Markdown don't support microformat annotation.

JSON-LD in a page's header is much easier for a CMS to present to the page author for editing. It can be a form in the editing UI. Wordpress et al have SEO plugins that make editing the JSON-LD data pretty straightforward.


That's a good point. I adopted microformats in a static site generator, with a handful of custom shortcodes. It would be much harder to adopt in a WYSIWYG context


Microformats feel like ugly retrofitted kludges. It would have been way more elegant if, among all the crazy helter-skelter competing development of HTML, someone had thought to invent a <person> tag, maybe an <organisation> tag. That would have solved a few problems that <blink> certainly didn't.


They certainly are retrofitted, but the existing semantic tags are largely abandoned for div soups that are beaten into shape and submission by lavish amounts of JS and a few sprinkles of CSS (and the latter often as CSS-in-JS). For microformats there is at least a little ecosystem already, and the vendor-driven committees don't need to be involved.


I mean, is anything actually stopping one from adding something like those tags today? Web components use custom tags all the time


Nothing at all. I believe you don't even need to use web components. You can just throw in <my-person>Joe Bloggs</my-person> and that's valid HTML-whatever.

But it's not a standard that is recognised, and so is no kind of metadata format.


Neither were microformats, up until a handful of aggregators started parsing them. We're still kind of stuck in the infancy of semantic markup it seems.


I found out about the Semantic Web about 5 years ago, and since then I've been pulling on the thread and realized it's a lot more than just "machine readable metadata". In particular, it's also about data ownership and control.

You can see what I mean learning about the Solid Protocol, I gave a talk about it a couple of years ago: https://m.youtube.com/watch?v=kPzhykRVDuI


No ... because the incentives to lie in metadata are too high


worth the bother. "preview" on the capitalocenic web without any mention of the Link Relation Types does not a semantic web adoption make. no mention of the economic analysis and impact of monopoly, no intersectional analysis with #a11y.

if the "preview" link relation type is worth mentioning it's worth substantiating the claims about adoption. when did the big players adopt? why? what of the rest of the types and their relation to would-be "a.i." claims?

how would we write html differently and what capabilities would we expose more readily to driving by links, like carousels only if written with a11y in mind? how would our world-wild web look different if we wrote html like we know it, rather than only giving big players a pass when we view source?


I don't like the use of a JSON "script" inside an HTML page. I understand the flexibility it grants, but markup tags are what HTML is based on, and the design would be more consistent if this extra metadata were also handled with HTML tags, as we have had them for decades.


JSON-LD isn't the only way one can embed these metadata (though I think most tooling prefers it now).

For example, Microdata[0] is one in-line way to do it, and RDFa[1] is another.
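
A Microdata version of a person, for instance, is just attributes sprinkled on the existing elements (a rough sketch using schema.org types; the names are made up):

    <!-- illustrative Microdata markup -->
    <div itemscope itemtype="https://schema.org/Person">
      <span itemprop="name">Jane Doe</span> works at
      <span itemprop="worksFor" itemscope
            itemtype="https://schema.org/Organization">
        <span itemprop="name">Example Corp</span>
      </span>
    </div>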

[0] https://en.wikipedia.org/wiki/Microdata_(HTML)

[1] https://en.wikipedia.org/wiki/RDFa


I wish there was a better alternative to JSON-LD. I want to avoid duplication by reusing the data that's already in the page by marking them up with appropriate tags and properties. Stuff like RDF exists but is extremely complex and verbose.


Originally you could use the schema.org vocabulary with RDFa or Microdata, which embed the structured data right at the element. But that can be brittle: markup structures change, get copy-and-pasted, and editing attributes is not really great in a CMS. I may not like it aesthetically, but embedded JSON-LD makes some sense.

See also this comment above: https://news.ycombinator.com/item?id=41309555


Embedding data as JSON as program text inside a <script> tag inside a tagged data format just seems like such a terrible hack. Among other things, it stutters: it repeats information already in the document. The microdata approach seems much less insane. I don’t know if it is recognised nearly as often.

TFA mentions it at the end: ‘There is also “microdata.” It’s very simple but I think quite hard to parse out.’ I disagree: it’s no harder to parse than HTML, and one already must parse HTML in order to correctly extract JSON-LD from a script tag (yes, one can incorrectly parse HTML, and it will work most of the time).


Pardon my naiveté, but what exactly is JSON-LD doing that the HTML meta tags don't do already? My blog doesn't implement JSON-LD, but if you link to my blog on popular social media sites, you still get a fancy link.


JSON-LD / RDFa and such can use the full type hierarchy of schema.org (and other languages) and can build a tree or even a graph of data. Meta elements are limited to property/value pairs.
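
For instance, this kind of nesting has no natural meta-tag equivalent (a sketch; all the values are invented):

    <!-- illustrative nested JSON-LD -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Example headline",
      "author": {
        "@type": "Person",
        "name": "Jane Doe",
        "worksFor": { "@type": "Organization", "name": "Example Corp" }
      }
    }
    </script>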


Interesting. How is this implemented by the server? Just periodically refresh a cached value for each page, or are you updating the tree in real-time for all pages, and having them traverse it on each request?

Either way it sounds awfully expensive for data that probably isn't used by the client most of the time. Do you have to explicitly ask for it? Is there some ad-hoc way to tell the server "hey I don't need the JSON-LD data?"


Did json-ld get a lot of traction for link previews? I haven't really encountered it much.

I actually implemented a simple link preview system a while ago. It uses OpenGraph and Twitter Cards metadata that is commonly added to web pages for SEO. That works pretty well.

Ironically, I did use ChatGPT to help me implement this stuff. It did a pretty good job too. It suggested some libraries I could use and then added some logic to extract titles, descriptions, icons, images, etc., with some fallbacks between the various fields people use for those things. It did not suggest adding logic for JSON-LD.


That statement is both kind of true and, well, revisionist. Originally there was a strong focus on logics, clean comprehensive modeling of the world through large complicated ontologies, and the adoption of super impractical representation languages, etc. It wasn't until rebellious sub-communities went rogue and pushed for pragmatic simplifications that things got any widespread impact at all. So here's to the crazy ones, I guess.


I did my bachelor's thesis 10 years ago on some semantic file conversions; we had a lot of projects at school. And it looks like there has not been much progress for the end user…


Companies use Open Graph because it gives them something in return (nice integration in other products when linking to your site). That’s nice and all, but outside of this niche use case there are no incentives for a semantic web from the point of view of publishers. You just make it simpler to crawl your website (something you cannot really monetize) instead of offering a strict API to access structured data, which you can monetize.


The Semantic Web suffers from organisational capture. If there's a big org, they get to define the standard at the expense of everyone else's use cases.


Monetization is the elephant in the room in my opinion.

IMDB could easily be a service entirely dedicated to hosting movie metadata as RDF or JSON-LD. They need to fund it though, and the go-to seems to be advertising and API access. Advertising means needing a human-readable UI, not metadata, and if they put the data behind an API it's a tough sell to use a standardized and potentially limiting format.


'It makes social sharing look a bit nicer' being the only benefit that can be scraped from the bottom of the barrel undermines the entire premise.

It's not widely adopted; it's used as an attempted growth hack in a few locations that may or may not be of use (with the value being relative to how US-centric your and your audience's Internet use is).


Semantic Web technology (RDF, RDFS, OWL, SHACL) is widely used in the European electricity industry to exchange grid models: https://www.entsoe.eu/data/cim/cim-for-grid-models-exchange/


I have experience using this back when I worked for a startup that did distribution grid optimization. The specs are unfortunately useless in practice because while the terminology is standardized the actual use of each object and how to relate them is not.

Thus, every tool makes CIM documents slightly differently and there are no guarantees that a document created in one tool will be usable in another


That's why ENTSO-E has just completed a software vendor interoperability workshop. :-) And import/export/validation worked just fine for all participants.


> The Semantic Web is the old Web 3.0. Before "Web 3.0" meant crypto-whatnot, it meant "machine-readable websites".

Using contemporary AI models, aren't all websites machine-readable? Or potentially even more readable than with the semantic web, unless an AI model actually does the semantic classification while reading it?


It is hilarious to see namespaces trying to creep into json.

I do wonder how any of this is better than using the meta tags of the html, though? Especially for such use cases as the preview. Seems the only thing that isn't really there for the preview is the image? (Well, title would come from the title tag, but still...)


I think that if you want your page to be well discoverable, well advertised, and positioned in search engines and social media, you have to support standards like the Open Graph protocol or JSON-LD.

Be nice to bots. This is advertisement, after all.

Support standards even if Google does not. Other bots might not be as sophisticated.

For me, yes, it is worth the bother.


There is a lot of value in Enterprise Knowledge Graphs, applying the semantic web standards to the "self-contained" world of enterprise data. There are many large enterprises doing it, and there is an interesting video from UBS on how they consider it a competitive advantage.


> Semantic Web information on websites is a bit of a "living document". You tend publish something, then have a look to see what people have parsed (or failed to parse) it and then you try to improve it a bit.

Hm.


I suspect that AI training data standards will make this much more prevalent.

Just today, I am working on an experimental training/consuming app pair. The training part will leverage JSON data from a backend I designed.


This has been invented a number of times. Facebook's version is called Open Graph.

https://ogp.me/


Back then Facebook said their Open Graph Protocol was only an application of RDFa – and syntax wise it seemed so.


If this counts as the "semantic web", then <meta name="description"... should too.

In which case we have all been on it since the mid 90s.


It's real RDF. You can process this with RDF tools. Certainly do SPARQL queries. Probably add a schema and have valid OWL DL and do OWL inference if the data is squeaky clean. Certainly use SPIN or Jena rules.

It leans too hard on text and doesn't have enough concepts defined as resources but what do you expect, Python didn't have a good package manager for decades because 2 + 2 = 3.9 with good vibes beats 2 + 2 = 4 with honest work and rigor for too many people.

The big trouble I have with RDF tooling is inadequate handling of ordered lists. Funny enough 90% of the time or so when you have a list you don't care about the order of the items and frequently people use a list for things that should have set semantics. On the other hand, you have to get the names of the authors of a paper in the right order or they'll get mad. There's a reasonable way to turn native JSON lists into RDF lists

https://www.w3.org/TR/json-ld11/#lists

although unfortunately this uses the slow LISP-style lists with O(N) item access and not the fast RDF Containers (like rdf:Seq) that have O(1) access. (What do you expect from M.I.T.?)
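
Concretely, that spec section lets you write something like the sketch below, which comes out the other side as an ordered rdf:List (schema.org terms used purely for illustration):

    <!-- illustrative use of a JSON-LD @list value -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "ScholarlyArticle",
      "name": "Example paper",
      "author": { "@list": [
        { "@type": "Person", "name": "First Author" },
        { "@type": "Person", "name": "Second Author" }
      ] }
    }
    </script>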

The trouble is that SPARQL doesn't support the list operations that are widespread in document-based query languages like

https://www.couchbase.com/products/n1ql/

https://docs.arangodb.com/3.11/aql/

or even Postgresql. There is a SPARQL 1.2 which has some nice additions like

https://www.w3.org/TR/sparql12-query/#func-triple

but the community badly needs a SPARQL 2 that catches up to today's query languages. Unfortunately the semantic web community has been so burned by pathological standards processes that anyone who can think rigorously or code their way out of a paper bag won't go near it.

A substantial advantage of RDF is that properties live in namespaces so if you want to add a new property you can do it and never stomp on anybody else's property. Tools that don't know about those properties can just ignore them, but SPARQL, RDFS and all that ought to "just work" though OWL takes some luck. That's got a downside too which is that adding namespaces to a system seems to reduce adoption by 80% in many cases because too many people think it's useless and too hard to understand.


My point is that even if technically it's RDF, if all anyone does is use a few specific properties from a closed pre-agreed schema, we might as well just be using meta tags.


But there's the question of who is responsible for it and who sets the standards. These days the consortium behind HTML 5 is fairly quick and responsive compared to the W3C's HTML activity in the day (e.g. fight with a standards process for a few months as opposed to "talk to the hand") but schema.org can evolve without any of that.

If there's anything that sucks today it is that people feel they have to add all kinds of markup for different vendors (such as Facebook's Open Graph). I remember the Semweb folks who didn't think it was a problem that my pages had about 20k of visible markup and 150k of repeated semantic markup. It's like the folks who don't mind that an article with 5k worth of text has 50M worth of Javascript, ads, trackers and other junk.

On the other hand I have no trouble turning

   <meta name="description" content="A brief description of your webpage content.">
into

   @prefix meta: <http://example.com/my/name/space#> .
   <http://example.com/some/web/page> meta:description "A brief description of your webpage content." .
where meta: is some namespace I made up, if I want to access it with RDF tools without making you do anything.


In all honesty, LLMs are probably going to make all this entirely redundant.

As such, the semantic web was not a natural successor to what we had before, and not Web 3.0.


The article addresses this point with the following:

> It would of course be possible to sic Chatty-Jeeps on the raw markup and have it extract all of this stuff automatically. But there are some good reasons why not.

> The first is that large language models (LLMs) routinely get stuff wrong. If you want bots to get it right, provide the metadata to ensure that they do.

> The second is that requiring an LLM to read the web is throughly disproportionate and exclusionary. Everyone parsing the web would need to be paying for pricy GPU time to parse out the meaning of the web. It would feel bizarre if "technological progress" meant that fat GPUs were required for computers to read web pages.


The first point is moot, because human annotation would also have some amount of error, either through mistakes (interns being paid nothing to add it) or maliciously (SEO). Plus, human annotation would be multi-lingual, which leads to a host of other problems that LLMs don't have to the same extent.

The second point is silly, because there is no reason for everyone to train their own LLMs on the raw web. You'd have a few companies or projects that handle the LLM training, and everyone else uses those LLMs.

I'm not a big fan of LLMs, and not even a big believer in their future, but I still think they have a much better chance of being useful for these types of tasks than the semantic web. Semantic web is a dead idea, people should really allow it to rest.


While both of these points are valid today, they are likely to be invalidated going forward - assume that whatever you can conceive as technically possible will become technically possible.

In 5 years the resource price will likely be negligible and the accuracy high enough that you just trust it.


It's HN, most people don't read the article and jump into whatever conclusion they have at the moment despite not being an expert in the field.


As I already pointed out, none of the arguments the author brings up are really relevant. Resources and accuracy will not be a concern in 5 years.

What makes you think that I am not an expert btw?

It indeed seems like you believe that whatever is written on the internet is true. So if someone writes that LLMs are not a contender to the semantic web, then it might be true.

Could it be that I merely challenge the author of the blog article and don't take his predictions for granted?


He had it summarized by chatgpt


Have you read the article? It addresses this point towards the end.


And it fails to address why SemWeb failed in its heyday: there's no business case for releasing open data of any kind "on the web" (unless you're Wikidata or otherwise financed via public money), the only consequences being that 1. you get fewer clicks and 2. you make it easier for your competitors (including Google) to aggregate your data. And that hasn't changed with LLMs, quite the opposite.

To think a turd such as JSON-LD can save the "SemWeb" (which doesn't really exist), and even add CSV as yet another RDF format to appease "JSON scientists", lol, seems beyond absurd. Also, Facebook's Open Graph annotations in HTML meta-links are/were probably the most widespread (trivial) implementation of SemWeb. SemWeb isn't terrible, but it is entirely driven by TBL's long-standing enthusiasm for edge-labelled graph-like databases (predating even his WWW efforts, e.g. [1]), plus academia's need for topics to produce papers on. It's a good thing to let it go in the last decade and re-focus on other/classic logic apps such as Prolog and SAT solvers.

[1]: https://en.wikipedia.org/wiki/ENQUIRE


yes


The article talks about JSON-LD, but there are also schema.org and Open Graph.

Which one should you use, and why?

Should you use several? How does that impact the site?


JSON-LD uses the schema.org vocabulary


But, very helpfully, Google supports... mostly schema.org, except when they don't, whenever they feel like it.


My fear around JSON-LD is too much of our content will end up on a SERP, and we'll attract less traffic.


Looks like a perfect use case for an LLM: generate that JSON-LD metadata from the HTML via an LLM, either on the website owner's side or on the crawler's side. If the crawlers do it, website owners don't need to do anything to enter the Semantic Web, and crawlers can specify whatever metadata format they want to extract. This promises an appealing future for Web 3.0, not one defined by crypto or by metadata, but by LLMs.


Not totally sure if it is needed, nice to have? RSS feeds are great but seen less and less.


Arguing against standard vocabularies (part of the Semantic Web) is like arguing against standard libraries. "Cool story bro."

But it is true, if you can't make sense of your data, then the Semantic Web probably isn't for you. (It's the least of your problems.)


> The first is that large language models (LLMs) routinely get stuff wrong. If you want bots to get it right, provide the metadata to ensure that they do.

Yet another reason NOT to use the semantic web. I don't want to help any LLMs.


Here I was, thinking the machines would make our lives easier. Now we have to make our websites Reader-Mode friendly, ARIA[1]-labelled, rendered server-side and now semantic web on top, just so that bots and non-visitors can crawl around?

[1] This is also something the screen assist software should do, not the publisher.


ARIA is something that really shouldn't have been necessary, but today it is absolutely crucial that content publishers make sure is right. Because the screen assist software can't do it.

Why? Because a significant percentage of people working on web development think a webpage is composed of as many <span>s and <div>s as you like, styled with CSS, with the content injected into it with JavaScript.

These people don't know what an <img> tag is, let alone alt-text, or semantic heading hierarchy. And yet, those are exactly the things that Screen Reader software understands.



