This has actually happened already, and it's part of why LLMs are so smart. I haven't tested this, but I'd venture a guess that without Wikipedia, Wikidata, Wikipedia clones, and stolen articles, LLMs would be quite a lot dumber. You can only get so far with Reddit threads and the basic background knowledge embedded in higher-level articles.
My guess is that when fine-tuning and adjusting weights, the lowest-hanging fruit is to overweight Wikipedia-derived sources and down-weight sources like Reddit.
Only a relatively small part of Wikipedia has semantic markup, though. If an article says "_Bob_ was born in _France_ in 1950", where the underlined words are Wikipedia links, you get some semantic info from the links (Bob is a person, France is a country), but you'd still miss the "born" relationship and the "1950" date, since those exist only as raw text.
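For contrast, here's a sketch of what fully semantic markup of that sentence could look like, using RDFa with schema.org vocabulary. This is purely illustrative (Wikipedia doesn't mark up body text this way); the `#bob` identifier is made up:

```html
<!-- Illustrative RDFa sketch, not actual Wikipedia markup: the "born"
     relationship and the date become machine-readable properties
     instead of raw text between two links. -->
<p vocab="https://schema.org/" typeof="Person" resource="#bob">
  <span property="name">Bob</span> was born in
  <span property="birthPlace" typeof="Country"><span property="name">France</span></span>
  in <span property="birthDate" content="1950">1950</span>.
</p>
```

Every sentence of every article would need this treatment for the relationships to be queryable, which is the scale problem in a nutshell.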
The same goes for the rest of the article, with much more complex relationships that would probably be daunting even for experts to mark up in an objective and unambiguous way.
I can see how the semantic web might work for products and services, like ordering food and booking flights, but not for more complex information like the above. Nor is it clear how semantic markup is going to get added to the books, research articles, news stories, etc. that are constantly coming out.
> The semantic information is first present not in markup but in natural language.
Accurate natural language processing is a very hard problem though, and today it's best handled by AI/LLMs. But doesn't that go against the article's point, which was that we shouldn't need AI if the semantic web had been done properly?
Complex NLP is also the opposite of what the semantic web was advocating. Imagine asking the computer to buy a certain product and it orders the wrong thing because the natural language it parsed was ambiguous.
> Additionally infoboxes also hold relationships, you might find when a person was born in an infobox, or where they studied.
That's not a lot of semantic information compared to the contents of a Wikipedia article that runs several pages. Imagine a version of Wikipedia that included only the infoboxes and the links within them.
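To make the contrast concrete, here's a minimal sketch of infobox-style data as subject/predicate/object triples. All the facts here are hypothetical examples, not pulled from any real infobox:

```python
# Hypothetical facts an infobox might yield, as (subject, predicate, object) triples.
infobox_triples = [
    ("Bob", "birthDate", "1950"),
    ("Bob", "birthPlace", "France"),
    ("Bob", "almaMater", "Some University"),
]

# A relationship the article body states in prose but no infobox field captures.
article_claim = "Bob's early work in France influenced a generation of researchers."

def facts_about(subject, triples):
    """Return all (predicate, object) pairs recorded for a subject."""
    return [(p, o) for s, p, o in triples if s == subject]

# The structured view knows only a few isolated facts about Bob...
print(facts_about("Bob", infobox_triples))
# ...while the "influenced" relationship exists only as unparsed text.
print("influenced" in article_claim)
```

The handful of triples is easy to query, but everything the article actually argues or narrates stays locked in prose.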