From Books to Knowledge Graphs

ivan888 · on March 14, 2023

Since we are now at the stage where it's difficult to distinguish well reasoned human discourse from that generated by AI, I think that deeply linked citation systems in the form of knowledge graphs will grow in importance. Otherwise it will be too easy for AI generated discourse to begin referencing itself, introducing biases and clinging to blatant falsehoods without an easy way to discover their origin and fix the problems

mecsred · on March 14, 2023

My god, that sounds horrible. I wish we could go back to pure human discourse in the sciences. That never becomes purely self referential, introduces biases, or clings to blatant falsehoods.

Only half joking. I do agree that AI content is going to change the landscape. However the things it does are all things it "learned" to do because... It was already present in the training data. The big change is just the scale those things can be done at.

bordercases · on March 14, 2023

Citation systems can and do discover circular reference. Not all of it! But it's a start.

gryn · on March 14, 2023

most of these knowledge graph papers seems to be about linking shallow content, likes citations etc,

with all these strides in AI, GPT, etc, does anyone know of papers about extracting into graph or other form of knowledge representations deeper semantics ?

for example instead extracting what the paper is about, how it relates to other papers, nodes and edges linking to field specific terms/concepts mentioned, translation of paper specific terminology to commonly agreed terminology, why it's citing said references (is it an example of use of a technique ? a proof for a claim ? something that's being refuted, etc ..), where in the text it's being used. and other kinds of "deeper meaning".

I've glanced at some papers that seemed to do something remotely close to this but it was restricted to doing NER for only medical/biology terms.

stonerri · on March 14, 2023

The second layer is hard. I tried something in this space in mid-2018. Full text extraction and sentence segmentation tech was adequate, but extracting the discourse tree and building the graph was a bit of a struggle (trying to repurpose a collection of academic/open tools to get something useful). Never published or released the code.

If interested, a few rabbit holes to explore (no affiliations):

https://scite.ai -> best option for citation mapping, but same issues you described above

https://www.semanticscholar.org and AI2 -> the best group working on tooling in this space

https://www.weave.bio -> early startup trying to build this out

The hardest challenge in my view is solving the intermediate representation issue. You have to establish a DSL/nomenclature that provides the range required to represent a complete scholastic discourse while also being computable.

21eleven · on March 14, 2023

> The hardest challenge in my view is solving the intermediate representation issue.

Right, you'd basically be writing an interpreter for English

nerpderp82 · on March 15, 2023

Have GPT translate down into a Controlled Natural Language. I tried having it translate to OWL, but it sucked.

https://en.wikipedia.org/wiki/Controlled_natural_language

uticus · on March 15, 2023

“OWL”?

xyzzy3000 · on March 15, 2023

I assumed OWL='Web Ontology Language', which seems to make sense given that the topic is semantics - but OWL isn't a controlled vocabulary of a natural language, so I may be wrong.

nerpderp82 · on March 15, 2023

You are correct, I just reached for the nearest machine readable knowledge format.

summarity · on March 14, 2023

I'm building this, for (mostly) non-scientific non-fiction works (books, articles, news, etc.). Launching soon, with about 7,500 books indexed.

Generally, what I found useful to build a graph between "topics" or entities was to use a HyDE[1] prompt to generate possible distinct definitions and then build a nearest-neighbor network from that. This successfully identifies related concepts in the truly abstract sense rather than literal entities.

[1] https://summarity.com/hyde

abhinavkulkarni · on March 15, 2023

Hey @minxomat,

HyDE sounds like an interesting approach. All dense retrieval approaches suffer from the problem you outlined in the blogpost. Have you looked at keyword-based or late-interaction models for retrieval such as ColBERTv2[1]? I find that late-interaction methods seem to offer best trade-off between semantic intelligence (precision) and retriving relevant documents (recall).

[1] https://github.com/stanford-futuredata/ColBERT

bwb · on March 14, 2023

I'd love to chat! I am doing a rough version of this at Shepherd.com and laying the ground work for more :) (ben@re-moveshepherd.com)

I haven't uses Wikidata info yet, but hoping to expand to that in 3 or 4 months.

summarity · on March 14, 2023

Sure. Sent a ping.

jeromechoo · on March 14, 2023

Not a paper, but NLP extraction into knowledge graph representations do exist and are in use today. For example, here's a general purpose NLP model that links organization and people entities (among others) to each other based on factual relationships described in the text — https://demo.nl.diffbot.com

This semantic extraction can be extrapolated to most any trainable context. A useful one I've worked with involved mapping supplier-partner relationships. A well built supply chain graph can identify every layer of risk in a single supply chain and provide the provenance to back it up.

IMO, the biggest blocker to more mainstream use of Knowledge Graphs (even in the commercial world) is an actually intuitive interface for knowledge exploration. The real market innovation behind GPT isn't its 175 billion parameters, its the prompt interface that makes ChatGPT so universally accessible.

dberenstein1957 · on March 14, 2023

Lovely, for me coreference resolution also is a huge issue. I created this package (https://github.com/Pandora-Intelligence/crosslingual-corefer...) which is currently the only viable solution to do coref resolution in Dutch and some other low-resource language.

Also, if you want to scale, LLMs are going to prove to expensive, so eventually data need to be logged somewhere to create a fine-tuned model that can do sub-tasks and ideally do them better. What do you think?

jeromechoo · on March 15, 2023

Nice! Yes, coreference resolution is surprisingly absent even in enterprise NLP.

Depends on what you mean by "better". With more accuracy?

moonchrome · on March 14, 2023

I think this is the next step in AGI - feed and reference general knowledge graphs from/to LLM. Then link to expert systems such as SAT/facial recognition/physics model/medical model.

I think LLMs will be capable of orchestrating between systems and maintaining a conscious narrative but knowledge graphs will solve the update and references problem.

zozbot234 · on March 14, 2023

You can mark reasons for paper citations using CiTO, the citation types ontology: https://sparontologies.github.io/cito/current/cito.html Not sure if there's anything standard along the same lines for more general annotation of e.g. Web-derived resources.

beckman466 · on March 14, 2023

not a paper but a project: http://libscie.org

pr337h4m · on March 14, 2023

https://github.com/varunshenoy/GraphGPT

"GraphGPT converts unstructured natural language into a knowledge graph. Pass in the synopsis of your favorite movie, a passage from a confusing Wikipedia page, or transcript from a video to generate a graph visualization of entities and their relationships.

Successive queries can update the existing state of the graph or create an entirely new structure. For example, updating the current state could involve injecting new information through nodes and edges or changing the color of certain nodes."

breck · on March 14, 2023

Shameless plug for our new open source, public domain knowledge graph software: TrueBase (https://truebase.pub/).

If anyone is interested in doing research properly and building a revolutionary public domain knowledge graph for your domain, I'm literally betting my house that this is the way to go.

TrueBase powers PLDB.com (140K lines of data), and our newer one CancerDB.com (13K LOD). A new one for physics and math is coming out this week!

slater · on March 14, 2023

btw your user profile looks messed up (firefox, macOS)

breck · on March 14, 2023

It's an easter egg. Link:

https://try.scroll.pub/#scroll%0A%20comment%20This%20is%20wr...

kfrzcode · on March 14, 2023

I've been wondering about this problem from a non-expert position for awhile. How can we capture, say the entire hard-information - or semantic latent space; of a given body of work. Pick, say, the Simpsons universe. Could we have an AI consume all of the screenplays, scripts, fandom wikis, etc, and create a pre-trained model of every considerable aspect of the fictional universe? That pre-trained model would just be a distillation of the informational content and built to interact in both plain-language, ChatGPT style interface but also more technically and output JSON for queries. Say I want to list every episode of the show that includes a theme of love always ending in tragedy, except of course for Marge and Homer. Or I want to define a histogram of which illustrators worked when which voice actors did a specific set of characters.

Obviously a more useful example would be domain-specific knowledge in industry or medicine, but a generalized approach to ontological encoding from a given dataset would probably require a lot of interesting techniques and math that's way beyond my head.

But that's probably something easily done with current technology, what I'm interested in is learning how to talk about/learn about this concept. Distilling "knowledge," or "concepts" into "parameters", like, defining a DNA-like code for a given corpus of data... sorry I'm rambling but hopefully someone can relate

reidjs · on March 14, 2023

Semi-related: One of the first things I tried with 'GPT' was training it on every Simpson's screenplay and then write its own.

Example:

https://github.com/reidjs/simpsons-gpt/blob/master/output/ou...

The results were... not great. Maybe on the new GPT iteration it will be more successful though.

doing_analytics · on March 15, 2023

Yes, Knowledge Graphs are one of the power house technologies coming soon.

I am in a team developing the next generation of knowledge graph analytics, currently faster than almost all tools out there and its open-source.

The main barrier to entry, and issue, is that not many people know about them. Not many know how to use it, they dont understand the algorithms and thus are unaware of the benefits.

fdgsdfogijq · on March 14, 2023

Personally feel that knowledge graphs will disappear. Currently work on a large knowledge graph at FAANG. Feels like working on an antiquated abstraction layer that needed to exist due to lack of ML sophistication. With LLMs, I think knowledge graphs are obsolete. The LLM builds the knowledge graph itself inside the network layers. Anytime you have an ML system where humans are designing how the information is organized (feature engineering, linking, graph building), it will work poorly. The algorithm needs full control over the information at the lowest level (characters, words). Only then can something that truly works be built.

mattlutze · on March 14, 2023

LLMs have a very clear issue of making false predictions, versus knowledge graphs that represent the information mapped within them.

We should be careful to not treat the current hot thing as a hammer.

> Anytime you have an ML system where humans are designing how the information is organized (feature engineering, linking, graph building), it will work poorly.

This is a very broad assertion that I think is not accurate.

> The algorithm needs full control over the information at the lowest level (characters, words). Only then can something that truly works be built.

But data preparation and management is fundamentally important which means there's a human shaping the information at the lowest levels. There's no single ring to rule them all. This is also just fundamentally erasing the utility of any supervised model or hybrid system which is a bit silly.

vimax · on March 14, 2023

I feel like you’re thinking about this wrong. Knowledge graphs are incredibly useful in combining language and structure, but often not worth the expense to build. Future language models will build knowledge graphs for us with little expense, and for most applications that are not directly at the human interface level, a vectorized knowledge graph will be more useful than a blob of text.

0xbadcafebee · on March 14, 2023

Curated knowledge graphs are the only way to accurately map knowledge. Essential for things like medical KGs. You don't want to kill patients because the magical numbers used to hone the algorithm weren't magic enough.

melagonster · on March 15, 2023

so in anyway expert had read the articles following reference link.