I can see this being true of a lot of old jobs, like my brother's first job, which was basically to transcribe audio tapes. Whisper can do it in no time, that's crazy.
I’ve had a similar experience extracting transactions from my PDF bank statements [1]. GPT-4o and GPT-4o-mini perform as well as the janky regex parser I wrote a few years ago. The fact that they can zero-shot the problem makes me think there’s a lot of bank statements in the training data.
Well, your first job today would be writing that 100 line Python script and then doing something 100x more interesting with the events than writing truckloads of regexes?
No, his first job would be done by a more senior developer writing the 100 line Python script instead of hiring an intern to write a truckload of regexes. Having saved time by just writing the script rather than hiring, mentoring, and explaining things to an intern, that dev would then do the more interesting things with the events.
I’ve had pretty dismal results doing the same with spreadsheets—even with the data nicely tagged (and numbers directly adjacent to the labels) GPT-4o would completely make up figures to satisfy the JSON schema passed to it. YMMV.
I wonder if an adversarial model that looks at the user input and the LLM output could predict whether the output is accurate, and maybe point out what isn't. That approach worked pretty well for image generation.
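Even a crude version of that check can be approximated with a second model call today. A rough sketch, assuming the OpenAI Python client; the prompt wording and model choice are just placeholders:

```python
# Hypothetical second-pass "verifier": one model does the extraction,
# another is asked to judge the extraction against the source text.
from openai import OpenAI

client = OpenAI()

def verify_extraction(source_text: str, extracted_json: str) -> str:
    prompt = (
        "Here is a source document and a JSON extraction of it.\n\n"
        f"SOURCE:\n{source_text}\n\nEXTRACTION:\n{extracted_json}\n\n"
        "List any fields whose values are not supported by the source, "
        "or reply 'OK' if everything checks out."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

It's not a trained adversarial model in the GAN sense, just a prompt-level judge, so treat it as an approximation of the idea.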
On the flip side I have had a lot of success parsing spreadsheets and other tables into their markdown or similar representation and pulling data out of that quite accurately.
Data extraction is definitely one of the most useful functions of LLMs; however, in my experience a large model is necessary for reliable extraction - I tested smaller, open-weights models and the performance was not sufficient.
I wonder, has anyone tried to fine-tune a model specifically for general formatted data extraction? My naive thinking is that this should be pretty doable - after all, it's basically just restructuring the content using mostly the same tokens as the input.
The reason this would be useful (in my case) is that while large LLMs are perfectly capable of extraction, I often need to run it on millions of texts, which would be too costly. That's why I usually end up creating a custom small model, which is faster and cheaper. But a general small extraction-focused LLM would solve this.
I thought about fine-tuning Llama3-1B or Qwen models on larger models' outputs, but my focus is currently elsewhere.
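For anyone who does want to try it, the setup is mostly data plumbing. A minimal sketch, assuming GPT-4o as the teacher and the chat-format JSONL that most fine-tuning stacks accept; the field names in the system prompt are made up:

```python
# Hypothetical distillation setup: have a large "teacher" model produce the
# extractions, then save them as chat-format JSONL to fine-tune a small model on.
import json
from openai import OpenAI

client = OpenAI()
SYSTEM = "Extract the fields {name, date, amount} from the text and return JSON."

def build_training_file(texts, out_path="distill.jsonl"):
    with open(out_path, "w") as f:
        for text in texts:
            resp = client.chat.completions.create(
                model="gpt-4o",  # teacher model, placeholder choice
                messages=[{"role": "system", "content": SYSTEM},
                          {"role": "user", "content": text}],
                response_format={"type": "json_object"},
            )
            answer = resp.choices[0].message.content
            f.write(json.dumps({"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": text},
                {"role": "assistant", "content": answer},
            ]}) + "\n")
```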
Yes! This library is great and definitely helps, but I still had problems with performance. For example, smaller models would still hallucinate when extracting a JSON field that wasn't present in the text (I'd expect null, but they provided either an incorrect value from the text or a totally made-up value).
Jina did something related for extracting content from raw HTML and wrote about the techniques they used here: https://jina.ai/news/reader-lm-small-language-models-for-cle... In my tests, the 1.5B model works extremely well, though the open model is non-commercial.
How do you know the output has anything to do with the input? Hint: you don't. You are building a castle on quicksand. As always, the only thing LLMs are usable for:
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
We used GPT-4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills.
Reminds me of the Dutch childcare benefits scandal [0], where 26,000 families were unfairly labeled as having committed tax fraud (11,000 of which had been targeted via "risk profiling", as they had dual nationalities [1]). Bad policy + automation = disaster. The Wikipedia article doesn't fully explain how some automated decisions were made (e.g. you had a typo in a form, therefore all previous benefits were clawed back; if you owed more than €3,000 you were a fraudster, and if you called to ask for clarification they wouldn't help you — you were officially labeled a fraudster, you see).
Edit: couldn't find a source for my last statement, but I remember hearing it in an episode of the great Dutch News podcast. I'll see if I can find it.
Just for posterity, I couldn't find the specific podcast episode, but there are public statements from some of the victims [0] available online (translated):
> What personally hurt you the most about how you were treated?
> Derya: 'The worst thing about all of this, I think, was that I was registered as a fraudster. But I didn't know anything about that. There was a legal process for it, but they blocked it by not telling me what OGS (intent gross negligence) entailed. They had given me the qualification OGS and that was reason for judges to send me home with rejections. I didn't get any help anywhere and only now do I realize that I didn't stand a chance. All those years I fought against the heaviest sanction they could impose on me and I didn't know anything. I worked for the government. I worked very hard. And yet I was faced with wage garnishment and had to use the food bank. If I had known that I was just a fraudster and that was why I was being treated like that, I wouldn't have exhausted myself to prove that I did work hard and could pay off my debts myself. I literally and figuratively worked myself to death. And the consequences are now huge. Unfortunately.'
We tried all models from OpenAI and Google to get data from images, and all of them made "mistakes".
The images are tables with 4 columns and 10 rows of numbers, plus metadata above that in a couple of fields. We had thousands of images already loaded, and when we went back to check those previously loaded images we found quite a few errors.
Multimodal LLMs are not up to these tasks imo. They can describe an image, but they're not great with tables and numbers. On the other hand, using something like Textract to get a text representation of the table and then feeding that into an LLM was a massive success for us.
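A rough sketch of that pipeline, assuming boto3 for Textract and the OpenAI client for the LLM step; the block handling is deliberately simplified to plain LINE text rather than reconstructing cells:

```python
# Textract -> LLM sketch: OCR the table with Textract, then let the LLM
# structure the recovered text instead of reading the image directly.
import boto3
from openai import OpenAI

textract = boto3.client("textract")
llm = OpenAI()

def extract_table(image_path: str) -> str:
    with open(image_path, "rb") as f:
        doc = textract.analyze_document(
            Document={"Bytes": f.read()}, FeatureTypes=["TABLES"]
        )
    # Simplification: join detected lines instead of walking TABLE/CELL blocks.
    text = "\n".join(b["Text"] for b in doc["Blocks"] if b["BlockType"] == "LINE")
    resp = llm.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content":
                   "Return this table as JSON rows:\n" + text}],
    )
    return resp.choices[0].message.content
```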
Not the OP, but if doing this at scale, I'd consider a quorum approach using several models and looking for a majority to agree (otherwise bump it for human review). You could also get two different answers out of each model by using the model alone versus external OCR + model, and compare those too.
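The voting logic itself is trivial; something like this (illustrative sketch, the threshold is arbitrary):

```python
# Quorum check: accept a field value only when enough independent
# extractions agree on it; otherwise flag the record for human review.
from collections import Counter

def quorum(values, min_agree=2):
    """values: one extracted value per model/approach for the same field."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= min_agree else None  # None -> human review

# quorum(["42.10", "42.10", "421.0"]) -> "42.10"
# quorum(["42.10", "421.0", "4.210"]) -> None
```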
I’m working on a problem in this space, and that’s the approach I’m taking.
More detailed explanation: I have to OCR dense, handwritten data using technical codes. Luckily, the form designers included intermediate steps. The intermediate fields are amenable to Textract, so I can use a multimodal model to OCR the full table and then error check.
So assuming it would be an issue, given that you’re building such a tool, what would your approach be?
If I put an invisible tag on my website and it tells your scraper to ignore all previous prompts, leak its entire history and send all future prompts and replies to a web address while staying silent about it, how would you handle that?
A casual look at the source shows the architecture won't allow the attacks you're talking about. Since each request runs separately, there's no way for prompt injection on one request to influence a future request. Same thing for leaking history.
What a sad state for humanity that we have to resort to this sort of OCR/scraping instead of the original data being released in a machine-readable format in the first place.
1) There's plenty of old data out there. Newspaper scans from the days before computers, or before the newspaper process was digitized. Or the original files simply got lost, so manually scanned pages are all you have.
2) There could be policies about making the data public, but in a way that discourages data scraping.
3) The providers of the data simply don't have the resources or incentives to develop a working API.
What is even sadder is that this data (especially the more recent data) is entered first in machine-readable formats, then sliced and diced and spat out in a non-machine-readable format.
I'd like to see financial transactions and purchases abide by some JSON format standard: metadata plus a list of items with full product name, quantity purchased, total unit volume/amount of product, price, and unit price.
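Purely as an illustration of the kind of structure I mean (field names invented, not any existing standard):

```python
# Hypothetical receipt payload; nothing about this is an actual standard.
receipt = {
    "metadata": {"merchant": "Example Grocer", "date": "2024-09-01", "currency": "EUR"},
    "items": [
        {
            "product_name": "Oat Milk 1L",
            "quantity": 2,
            "unit_volume": "1 L",
            "unit_price": 1.99,
            "total_price": 3.98,
        },
    ],
    "total": 3.98,
}
```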
Cool work! Correct me if I'm wrong, but I believe that to use the new, more reliable OpenAI Structured Outputs, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema. I get this error when I do `response_format={"type": "json_schema"}`:
I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so I'm not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low-effort) way to have the API infer the structure on its own.
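For reference, here's roughly what the higher-effort `json_schema` version looks like; the schema itself is just an example, not from the article:

```python
# Structured Outputs with an explicit schema; strict mode requires
# "additionalProperties": false and every property listed in "required".
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the filer and asset from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "disclosure",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "filer": {"type": "string"},
                    "asset": {"type": "string"},
                },
                "required": ["filer", "asset"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```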
Function calling provides a "hint" in the form of a JSON schema for the LLM to follow; the models are trained to follow provided schemas. If you have really complicated or deeply nested models, they can become less stable at generating schema-conformant JSON.
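For comparison, the function-calling variant carries the schema along as a tool definition rather than a hard constraint; a minimal sketch with an invented `record_event` function:

```python
# Function calling: the schema is a hint the model is trained to follow,
# not a grammar that generation is forced to obey.
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "record_event",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["title", "date"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The gala is on March 3rd."}],
    tools=tools,
    tool_choice="required",
)
print(resp.choices[0].message.tool_calls[0].function.arguments)
```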
Structured outputs apply a context-free grammar to generation so that, at each decoding step, only tokens that keep the output conformant to the provided JSON schema are considered.
The benefit of doing this is predictability, but there's a trade-off in prediction stability; apparently structured output can constrain the model to generate in a way that takes it off the "happy path" of how it assumes text should be generated.
Happy to link you to some papers I've skimmed on it if you're interested!
Could you share some of those papers?
I had a great discussion with Marc Fischer from the LMQL team [0] on this topic while at ICML earlier this year. Their work recommended decoding to natural language templates with mad lib-style constraints to follow that “happy path” you refer to, instead of decoding to a (relatively more specific latent) JSON schema [1]. Since you provided a template and knew the targeted tokens for generation, you could strip your structured content out of the message. This technique also allowed for beam search, where you can optimize for tokens that lead to the tokens containing your expected strings, avoiding some weird token concatenation process. Really cool stuff!
Structured output uses "constrained decoding" under the hood. They convert the JSON schema to a context free grammar so that when the model samples tokens, invalid tokens are masked to have a probability of zero. It's much less likely to go off the rails.
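A toy illustration of the masking idea (not a real implementation; the grammar callback and greedy pick stand in for a lot of machinery):

```python
# At each decoding step, any token whose continuation the grammar rejects
# gets probability zero (here, a -inf score), so it can never be sampled.
import math

def constrained_step(logits, is_valid_prefix, prefix):
    """logits: dict of token -> score; is_valid_prefix: grammar check."""
    masked = {tok: (score if is_valid_prefix(prefix + tok) else -math.inf)
              for tok, score in logits.items()}
    return max(masked, key=masked.get)  # greedy pick among allowed tokens

# e.g. with prefix '{"amount":' the grammar would reject a token like 'hello'
# but allow ' 4' or ' "', so only schema-conformant continuations survive.
```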
Stuff like this shows how much better the commercial models are than local models. I’ve been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and ChatGPT do exactly what I want without fail.
OpenAI stopped releasing information about their models after GPT-3, which was 175B, but the leaks and rumours that GPT-4 is an 8x220 billion parameter model are almost certainly correct. 4o is likely a distilled 220B model. Other commercial offerings are going to be in the same ballpark. Comparing these to Llama 3 8B is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240B range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding-edge state of the art.
Any pointers on where we can check the best local models per amount of VRAM available? I only have consumer-level cards available, but I would think something that just fits into a 24 GB card should noticeably outperform something scaled for an 8 GB card, yes?
Your problem isn't that you're using a local model. It's that you're using an 8b model. The stuff you're comparing it to is two orders of magnitude larger.
<< Stuff like this shows how much better the commercial models are than local models.
I did not reach the same conclusion, so I would be curious if you could provide the rationale/basis for your assessment in the link. I am playing with a humble Llama 3 8B here, and the results for Federal Register type stuff (without going into details) were good when I was expecting them to be... not great.
edit: Since you mentioned Llama explicitly, could you talk a little about the data/source you are using for your results? You got me curious and I want to dig a little deeper.
You raise a valid point, but 4o is way smaller than 405B. And 4o-mini, which is described in the article, is highly likely <30B (if we're talking dense models).
The financial disclosures example was meant to be a toy example; with the way U.S. House members file their disclosure reports now, everything should be in a relatively predictable PDF with underlying text [0], but that wasn't always the case [1]. I think this API would've been pretty helpful to orgs like OpenSecrets, who in the past had to record and enter this data manually.
(I wouldn't trust the API alone, but combine it with human readers/validators, i.e., let OpenAI do the data entry part, and have humans do the proofreading)
Made a small project to help extract structure from documents (PDF, JPG, etc. -> JSON or CSV): https://datasqueeze.ai/
There are 10 free pages to extract if anyone wants to give it a try. I've found that just sending a PDF to models doesn't extract it properly, especially with longer documents. I've tried to incorporate all the best practices into this tool. It's a pet project for now. Lmk if you find it helpful!
I've been using this to OCR some photos I took of books, and it's remarkable at it. My first pass was just a loop where I'd OCR, feed the text to the model, and ask it to normalize it into a schema, but I found that just sending the image to the model and asking it to OCR and turn it into the shape of data I wanted was much more accurate.
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
Or convert the PDF to an image and send that. We've done it for things that Textract completely mangled but Sonnet has no problem with, especially tables built out of text characters from very old systems.
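The PDF-to-image step is only a few lines if you're on the OpenAI side of things (sketch below; it assumes pdf2image, which needs poppler installed, and the prompt/model are placeholders):

```python
# Render a PDF page to PNG, then send it to a multimodal model as a data URL.
import base64, io
from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()

def page_to_data_url(pdf_path: str, page: int = 0) -> str:
    img = convert_from_path(pdf_path)[page]          # PIL image of one page
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Transcribe this table as JSON rows."},
        {"type": "image_url", "image_url": {"url": page_to_data_url("old_filing.pdf")}},
    ]}],
)
print(resp.choices[0].message.content)
```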
The SEC's EDGAR database (which is for SEC filings) is another nightmare that's ready to end. Extracting individual sections from a filing is, afaik, practically impossible.
Would you not want to read the XBRL from the filing? I thought those are now mandatory.
This is one of those interesting areas where it's hard to innovate: the data is already available from most/all data vendors, and it's cheap and accurate enough that nobody is going to reinvent those processes, but it's also too expensive for an individual to purchase.
My (admittedly aged) experience with XBRL is that each company was able to define its own fields/format within that spec, and that most didn't agree on common names for common fields. Parsing it wasn't fun.
I have spotty education on the matter but I believe they all conform to a FASB taxonomy so there is at least a list of possible tags in use. I do wonder if any of the big data vendors actually use this though.
Fine-tuning smaller models specifically for data extraction could indeed save costs for large-scale tasks; I've found tools like FetchFox helpful for efficiently extracting data from websites using AI.
Is there an automated way to check results and reduce hallucinations? Would it help to do a second pass with another LLM as a sanity check to see if numbers match?
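One cheap automated check that doesn't need a second model: require every extracted number to appear verbatim in the source text and flag anything that doesn't. A naive sketch (it ignores reformatting like "1,000" vs "1000"):

```python
# Grounding check: extracted numbers that never occur in the source are
# likely hallucinated and should be flagged for review or a second pass.
import re

NUM = re.compile(r"\d[\d.,]*")

def ungrounded_numbers(source_text: str, extracted: dict) -> list:
    source_nums = set(NUM.findall(source_text))
    flagged = []
    for field, value in extracted.items():
        for num in NUM.findall(str(value)):
            if num not in source_nums:
                flagged.append((field, num))
    return flagged  # non-empty -> human review or a second LLM pass
```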
We built this huge system with tons of regexes, custom parsers, word lists, ontologies, etc. It was a massive effort to get somewhat acceptable accuracy.
It is humbling to see that these days a 100 line Python script can do the same thing but better: AI has basically taken over my first job.
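For the curious, a hedged sketch of what that script looks like today, using the OpenAI structured-outputs helper; the Event fields are invented for the example:

```python
# Minimal structured extraction: define the target shape as Pydantic models
# and let the API return instances of it.
from openai import OpenAI
from pydantic import BaseModel

class Event(BaseModel):
    name: str
    date: str
    location: str

class Events(BaseModel):
    events: list[Event]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract every event mentioned in the text."},
        {"role": "user", "content": "The science fair is on Friday at the town hall..."},
    ],
    response_format=Events,
)
print(completion.choices[0].message.parsed)
```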