I can see this being true of a lot of old jobs, like my brother's first job, which was basically to transcribe audio tapes. Whisper can do it in no time, that's crazy.
I’ve had a similar experience extracting transactions from my PDF bank statements [1]. GPT-4o and GPT-4o-mini perform as well as the janky regex parser I wrote a few years ago. The fact that they can zero-shot the problem makes me think there’s a lot of bank statements in the training data.
Well, your first job today would be writing that 100 line Python script and then doing something 100x more interesting with the events than writing truckloads of regexes?
No, his first job would be done by a more senior developer writing the 100 line Python script instead of hiring an intern to write a truckload of regexes. Having saved time by just writing the script rather than hiring, mentoring, and explaining things to an intern, that dev would then do the more interesting things with the events.
I’ve had pretty dismal results doing the same with spreadsheets—even with the data nicely tagged (and numbers directly adjacent to the labels) GPT-4o would completely make up figures to satisfy the JSON schema passed to it. YMMV.
I wonder if an adversarial model that looks at the user input and the LLM output could predict whether the output is accurate, and maybe point out what isn't. That approach worked pretty well for image generation.
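Even a crude version of that check can be approximated with a second model call today. A rough sketch, assuming the OpenAI Python client; the prompt wording and model choice are just placeholders:

```python
# Hypothetical second-pass "verifier": one model does the extraction,
# another is asked to judge the extraction against the source text.
from openai import OpenAI

client = OpenAI()

def verify_extraction(source_text: str, extracted_json: str) -> str:
    prompt = (
        "Here is a source document and a JSON extraction of it.\n\n"
        f"SOURCE:\n{source_text}\n\nEXTRACTION:\n{extracted_json}\n\n"
        "List any fields whose values are not supported by the source, "
        "or reply 'OK' if everything checks out."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

It's not a trained adversarial model in the GAN sense, just a prompt-level judge, so treat it as an approximation of the idea.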
On the flip side I have had a lot of success parsing spreadsheets and other tables into their markdown or similar representation and pulling data out of that quite accurately.
Data extraction is definitely one of the most useful functions of LLMs; however, in my experience a large model is necessary for reliable extraction - I tested smaller, open-weights models and the performance was not sufficient.
I wonder, has anyone tried to fine-tune a model specifically for general formatted data extraction? My naive thinking is that this should be pretty doable - after all, it's basically just restructuring the content using mostly the same tokens as the input.
The reason this would be useful (in my case) is that while large LLMs are perfectly capable of extraction, I often need to run it on millions of texts, which would be too costly. That's why I usually end up creating a custom small model, which is faster and cheaper. But a general small extraction-focused LLM would solve this.
I thought about fine-tuning Llama3-1B or Qwen models on larger models' outputs, but my focus is currently elsewhere.
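For anyone who does want to try it, the setup is mostly data plumbing. A minimal sketch, assuming GPT-4o as the teacher and the chat-format JSONL that most fine-tuning stacks accept; the field names in the system prompt are made up:

```python
# Hypothetical distillation setup: have a large "teacher" model produce the
# extractions, then save them as chat-format JSONL to fine-tune a small model on.
import json
from openai import OpenAI

client = OpenAI()
SYSTEM = "Extract the fields {name, date, amount} from the text and return JSON."

def build_training_file(texts, out_path="distill.jsonl"):
    with open(out_path, "w") as f:
        for text in texts:
            resp = client.chat.completions.create(
                model="gpt-4o",  # teacher model, placeholder choice
                messages=[{"role": "system", "content": SYSTEM},
                          {"role": "user", "content": text}],
                response_format={"type": "json_object"},
            )
            answer = resp.choices[0].message.content
            f.write(json.dumps({"messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": text},
                {"role": "assistant", "content": answer},
            ]}) + "\n")
```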
Yes! This library is great and definitely helps, but I still had problems with performance. For example, smaller models would still hallucinate when extracting a JSON field that wasn't present in the text (I'd expect null, but they provided either an incorrect value from the text or a totally made-up value).
Jina did something related for extracting content from raw HTML and wrote about the techniques they used here: https://jina.ai/news/reader-lm-small-language-models-for-cle... In my tests, the 1.5B model works extremely well, though the open model is non-commercial.
How do you know the output has anything to do with the input? Hint: you don't. You are building a castle on quicksand. As always, the only thing LLMs are usable for:
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
We used GPT-4o for more or less the same stuff. Got a boatload of scanned bills we had to digitize, and GPT really nailed the task. Made a schema, and just fed the model all the bills.
Reminds me of the Dutch childcare benefits scandal [0], where 26,000 families were unfairly labeled as having committed tax fraud (11,000 of which had been targeted via "risk profiling", as they had dual nationalities [1]). Bad policy + automation = disaster. The Wikipedia article doesn't fully explain how some automated decisions were made (e.g. you had a typo in a form, therefore all previous benefits were clawed back; if you owed more than €3,000 you were a fraudster, and if you called to ask for clarification they wouldn't help you — you were officially labeled a fraudster, you see).
Edit: couldn't find a source for my last statement, but I remember hearing it in an episode of the great Dutch News podcast. I'll see if I can find it.
Just for posterity, I couldn't find the specific podcast episode, but there are public statements from some of the victims [0] available online (translated):
> What personally hurt you the most about how you were treated?
> Derya: 'The worst thing about all of this, I think, was that I was registered as a fraudster. But I didn't know anything about that. There was a legal process for it, but they blocked it by not telling me what OGS (intent gross negligence) entailed. They had given me the qualification OGS and that was reason for judges to send me home with rejections. I didn't get any help anywhere and only now do I realize that I didn't stand a chance. All those years I fought against the heaviest sanction they could impose on me and I didn't know anything. I worked for the government. I worked very hard. And yet I was faced with wage garnishment and had to use the food bank. If I had known that I was just a fraudster and that was why I was being treated like that, I wouldn't have exhausted myself to prove that I did work hard and could pay off my debts myself. I literally and figuratively worked myself to death. And the consequences are now huge. Unfortunately.'
We tried all models from OpenAI and Google to get data from images, and all of them made "mistakes".
The images are tables with 4 columns and 10 rows of numbers, plus metadata above that in a couple of fields. We had thousands of images already loaded, and when we went back to check those previously loaded images we found quite a few errors.
Multimodal LLMs are not up to these tasks imo. They can describe an image, but they're not great with tables and numbers. On the other hand, using something like Textract to get a text representation of the table and then feeding that into an LLM was a massive success for us.
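A rough sketch of that pipeline, assuming boto3 for Textract and the OpenAI client for the LLM step; the block handling is deliberately simplified to plain LINE text rather than reconstructing cells:

```python
# Textract -> LLM sketch: OCR the table with Textract, then let the LLM
# structure the recovered text instead of reading the image directly.
import boto3
from openai import OpenAI

textract = boto3.client("textract")
llm = OpenAI()

def extract_table(image_path: str) -> str:
    with open(image_path, "rb") as f:
        doc = textract.analyze_document(
            Document={"Bytes": f.read()}, FeatureTypes=["TABLES"]
        )
    # Simplification: join detected lines instead of walking TABLE/CELL blocks.
    text = "\n".join(b["Text"] for b in doc["Blocks"] if b["BlockType"] == "LINE")
    resp = llm.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content":
                   "Return this table as JSON rows:\n" + text}],
    )
    return resp.choices[0].message.content
```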
Not the OP, but if doing this at scale, I'd consider a quorum approach using several models and looking for a majority to agree (otherwise bump it for human review). You could also get two different answers out of each model by using the model alone versus external OCR + model, and compare those too.
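The voting logic itself is trivial; something like this (illustrative sketch, the threshold is arbitrary):

```python
# Quorum check: accept a field value only when enough independent
# extractions agree on it; otherwise flag the record for human review.
from collections import Counter

def quorum(values, min_agree=2):
    """values: one extracted value per model/approach for the same field."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= min_agree else None  # None -> human review

# quorum(["42.10", "42.10", "421.0"]) -> "42.10"
# quorum(["42.10", "421.0", "4.210"]) -> None
```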
I’m working on a problem in this space, and that’s the approach I’m taking.
More detailed explanation: I have to OCR dense, handwritten data using technical codes. Luckily, the form designers included intermediate steps. The intermediate fields are amenable to Textract, so I can use a multimodal model to OCR the full table and then error check.
So assuming it would be an issue, given that you’re building such a tool, what would your approach be?
If I put an invisible tag on my website and it tells your scraper to ignore all previous prompts, leak its entire history and send all future prompts and replies to a web address while staying silent about it, how would you handle that?
A casual look at the source shows the architecture won't allow the attacks you're talking about. Since each request runs separately, there's no way for prompt injection on one request to influence a future request. Same thing for leaking history.
What a sad state for humanity that we have to resort to this sort of OCR/scraping instead of the original data being released in a machine-readable format in the first place.
1) There's plenty of old data out there. Newspaper scans from the days before computers, or before the newspaper process was digitized. Or the original files simply got lost, so manually scanned pages are all you have.
2) There could be policies about making the data public, but in a way that discourages data scraping.
3) The providers of the data simply don't have the resources or incentives to develop a working API.
What is even sadder is that this data (especially the more recent data) is entered first in machine-readable formats, then sliced and diced and spat out in a non-machine-readable format.
I'd like to see financial transactions and purchases abide by some JSON format standard: metadata plus a list of items with full product name, quantity purchased, total unit volume/amount of product, price, and unit price.
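Purely as an illustration of the kind of structure I mean (field names invented, not any existing standard):

```python
# Hypothetical receipt payload; nothing about this is an actual standard.
receipt = {
    "metadata": {"merchant": "Example Grocer", "date": "2024-09-01", "currency": "EUR"},
    "items": [
        {
            "product_name": "Oat Milk 1L",
            "quantity": 2,
            "unit_volume": "1 L",
            "unit_price": 1.99,
            "total_price": 3.98,
        },
    ],
    "total": 3.98,
}
```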
Cool work! Correct me if I'm wrong, but I believe that to use the new, more reliable OpenAI Structured Outputs, the response_format should be "json_schema" instead of "json_object". It's been a lot more robust for me.
I may be reading the documentation wrong [0], but I think if you specify `json_schema`, you actually have to provide a schema. I get this error when I do `response_format={"type": "json_schema"}`:
I hadn't used OpenAI for data extraction before the announcement of Structured Outputs, so I'm not sure if `type: json_object` did something different before. But supplying only it as the response format seems to be the (low-effort) way to have the API infer the structure on its own.
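For reference, here's roughly what the higher-effort `json_schema` version looks like; the schema itself is just an example, not from the article:

```python
# Structured Outputs with an explicit schema; strict mode requires
# "additionalProperties": false and every property listed in "required".
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the filer and asset from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "disclosure",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "filer": {"type": "string"},
                    "asset": {"type": "string"},
                },
                "required": ["filer", "asset"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```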
Function calling provides a "hint" in the form of a JSON schema for the LLM to follow; the models are trained to follow provided schemas. If you have really complicated or deeply nested models, they can become less stable at generating schema-conformant JSON.
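For comparison, the function-calling variant carries the schema along as a tool definition rather than a hard constraint; a minimal sketch with an invented `record_event` function:

```python
# Function calling: the schema is a hint the model is trained to follow,
# not a grammar that generation is forced to obey.
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "record_event",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["title", "date"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The gala is on March 3rd."}],
    tools=tools,
    tool_choice="required",
)
print(resp.choices[0].message.tool_calls[0].function.arguments)
```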
Structured outputs apply a context-free grammar to generation so that, at each decoding step, only tokens that keep the output conformant to the provided JSON schema are considered.
The benefit of doing this is predictability, but there's a trade-off in prediction stability; apparently structured output can constrain the model to generate in a way that takes it off the "happy path" of how it assumes text should be generated.
Happy to link you to some papers I've skimmed on it if you're interested!
Could you share some of those papers?
I had a great discussion with Marc Fischer from the LMQL team [0] on this topic while at ICML earlier this year. Their work recommended decoding to natural language templates with mad lib-style constraints to follow that “happy path” you refer to, instead of decoding to a (relatively more specific latent) JSON schema [1]. Since you provided a template and knew the targeted tokens for generation, you could strip your structured content out of the message. This technique also allowed for beam search, where you can optimize for tokens that lead to the tokens containing your expected strings, avoiding some weird token concatenation process. Really cool stuff!
Structured output uses "constrained decoding" under the hood. They convert the JSON schema to a context free grammar so that when the model samples tokens, invalid tokens are masked to have a probability of zero. It's much less likely to go off the rails.
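A toy illustration of the masking idea (not a real implementation; the grammar callback and greedy pick stand in for a lot of machinery):

```python
# At each decoding step, any token whose continuation the grammar rejects
# gets probability zero (here, a -inf score), so it can never be sampled.
import math

def constrained_step(logits, is_valid_prefix, prefix):
    """logits: dict of token -> score; is_valid_prefix: grammar check."""
    masked = {tok: (score if is_valid_prefix(prefix + tok) else -math.inf)
              for tok, score in logits.items()}
    return max(masked, key=masked.get)  # greedy pick among allowed tokens

# e.g. with prefix '{"amount":' the grammar would reject a token like 'hello'
# but allow ' 4' or ' "', so only schema-conformant continuations survive.
```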
Stuff like this shows how much better the commercial models are than local models. I’ve been playing around with fairly simple structured information extraction from news articles and fail to get any kind of consistent behavior from llama3.1:8b. Claude and ChatGPT do exactly what I want without fail.
OpenAI stopped releasing information about their models after GPT-3, which was 175B, but the leaks and rumours that GPT-4 is an 8x220 billion parameter model are almost certainly correct. 4o is likely a distilled 220B model. Other commercial offerings are going to be in the same ballpark. Comparing these to Llama 3 8B is like comparing a bicycle or a car to a train or cruise ship when you need to transport a few dozen passengers at best. There are local models in the 70-240B range that are more than capable of competing with commercial offerings if you're willing to look at anything that isn't bleeding-edge state of the art.
Any pointers on where we can check the best local models per amount of VRAM available? I only have consumer-level cards available, but I would think something that just fits into a 24 GB card should noticeably outperform something scaled for an 8 GB card, yes?
Your problem isn't that you're using a local model. It's that you're using an 8b model. The stuff you're comparing it to is two orders of magnitude larger.
<< Stuff like this shows how much better the commercial models are than local models.
I did not reach the same conclusion, so I would be curious if you could provide the rationale/basis for your assessment in the link. I am playing with a humble Llama 3 8B here, and the results for Federal Register type stuff (without going into details) were good when I was expecting them to be... not great.
edit: Since you mentioned Llama explicitly, could you talk a little about the data/source you are using for your results? You got me curious and I want to dig a little deeper.
You raise a valid point, but 4o is way smaller than 405B. And 4o-mini, which is described in the article, is highly likely <30B (if we're talking dense models).
The financial disclosures example was meant to be a toy example; with the way U.S. House members file their disclosure reports now, everything should be in a relatively predictable PDF with underlying text [0], but that wasn't always the case [1]. I think this API would've been pretty helpful to orgs like OpenSecrets, who in the past had to record and enter this data manually.
(I wouldn't trust the API alone, but combine it with human readers/validators, i.e., let OpenAI do the data entry part, and have humans do the proofreading)
Made a small project to help extract structure from documents (PDF, JPG, etc. -> JSON or CSV): https://datasqueeze.ai/
There are 10 free pages to extract if anyone wants to give it a try. I've found that just sending a PDF to models doesn't extract it properly, especially with longer documents. I've tried to incorporate all the best practices into this tool. It's a pet project for now. Lmk if you find it helpful!
I've been using this to OCR some photos I took of books, and it's remarkable at it. My first pass was just a loop where I'd OCR, feed the text to the model, and ask it to normalize it into a schema, but I found that just sending the image to the model and asking it to OCR and turn it into the shape of data I wanted was much more accurate.
> Note that this example simply passes a PNG screenshot of the PDF to OpenAI's API — results may be different/more efficient if you send it the actual PDF.
Or convert the PDF to an image and send that. We've done it for things that Textract completely mangled but Sonnet has no problem with, especially tables built out of text characters from very old systems.
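The PDF-to-image step is only a few lines if you're on the OpenAI side of things (sketch below; it assumes pdf2image, which needs poppler installed, and the prompt/model are placeholders):

```python
# Render a PDF page to PNG, then send it to a multimodal model as a data URL.
import base64, io
from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()

def page_to_data_url(pdf_path: str, page: int = 0) -> str:
    img = convert_from_path(pdf_path)[page]          # PIL image of one page
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Transcribe this table as JSON rows."},
        {"type": "image_url", "image_url": {"url": page_to_data_url("old_filing.pdf")}},
    ]}],
)
print(resp.choices[0].message.content)
```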
The SEC's EDGAR database (which is for SEC filings) is another nightmare that's ready to end. Extracting individual sections from a filing is, afaik, practically impossible.
Would you not want to read the XBRL from the filing? I thought those are now mandatory.
This is one of those interesting areas where it's hard to innovate: the data is already available from most/all data vendors, and it's cheap and accurate enough that nobody is going to reinvent those processes, but it's also too expensive for an individual to purchase.
My (admittedly aged) experience with XBRL is that each company was able to define its own fields/format within that spec, and that most didn't agree on common names for common fields. Parsing it wasn't fun.
I have spotty education on the matter but I believe they all conform to a FASB taxonomy so there is at least a list of possible tags in use. I do wonder if any of the big data vendors actually use this though.
Fine-tuning smaller models specifically for data extraction could indeed save costs for large-scale tasks; I've found tools like FetchFox helpful for efficiently extracting data from websites using AI.
Is there an automated way to check results and reduce hallucinations? Would it help to do a second pass with another LLM as a sanity check to see if numbers match?
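One cheap automated check that doesn't need a second model: require every extracted number to appear verbatim in the source text and flag anything that doesn't. A naive sketch (it ignores reformatting like "1,000" vs "1000"):

```python
# Grounding check: extracted numbers that never occur in the source are
# likely hallucinated and should be flagged for review or a second pass.
import re

NUM = re.compile(r"\d[\d.,]*")

def ungrounded_numbers(source_text: str, extracted: dict) -> list:
    source_nums = set(NUM.findall(source_text))
    flagged = []
    for field, value in extracted.items():
        for num in NUM.findall(str(value)):
            if num not in source_nums:
                flagged.append((field, num))
    return flagged  # non-empty -> human review or a second LLM pass
```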
We built this huge system with tons of regexes, custom parsers, word lists, ontologies, etc. It was a massive effort to get somewhat acceptable accuracy.
It is humbling to see that these days a 100 line Python script can do the same thing but better: AI has basically taken over my first job.
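For the curious, a hedged sketch of what that script looks like today, using the OpenAI structured-outputs helper; the Event fields are invented for the example:

```python
# Minimal structured extraction: define the target shape as Pydantic models
# and let the API return instances of it.
from openai import OpenAI
from pydantic import BaseModel

class Event(BaseModel):
    name: str
    date: str
    location: str

class Events(BaseModel):
    events: list[Event]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract every event mentioned in the text."},
        {"role": "user", "content": "The science fair is on Friday at the town hall..."},
    ],
    response_format=Events,
)
print(completion.choices[0].message.parsed)
```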