Wow, that is a lot of money ($4400 on Amazon) to throw at this problem. I am cur...

bevekspldnw · on March 27, 2024

Large-scale document classification tasks in very ambiguous contexts. A lot of my work goes into using big models to generate training data for smaller models.

I have multiple millions of documents, so GPT is cost-prohibitive and too slow. My tools of choice tend to be a first pass with Mistral to check task performance, then Mixtral if Mistral falls short.

Often I find that, with a good prompt, Mistral works as well as Mixtral and is about 10x faster.
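
A rough sketch of that "small model first, bigger model only if needed" check might look like the following. Everything here is an assumption for illustration: `small_model` and `large_model` are keyword-matching stand-ins for whatever actually calls Mistral and Mixtral, and the labeled sample, the `pick_model` helper, and the 0.9 threshold are hypothetical, not the poster's actual tooling.

```python
from typing import Callable, Sequence, Tuple

# A classifier is any callable that maps document text to a label.
Classifier = Callable[[str], str]


def accuracy(classify: Classifier, labeled_sample: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (text, label) pairs the classifier gets right."""
    if not labeled_sample:
        raise ValueError("labeled_sample must not be empty")
    correct = sum(1 for text, label in labeled_sample if classify(text) == label)
    return correct / len(labeled_sample)


def pick_model(
    small: Classifier,
    large: Classifier,
    labeled_sample: Sequence[Tuple[str, str]],
    threshold: float = 0.9,  # hypothetical cutoff; tune per task
) -> Tuple[str, Classifier]:
    """Keep the small (fast) model if it clears the threshold, else use the large one."""
    if accuracy(small, labeled_sample) >= threshold:
        return "small", small
    return "large", large


# Stand-ins for real model calls (assumptions, not real Mistral/Mixtral clients).
def small_model(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "other"


def large_model(text: str) -> str:
    lowered = text.lower()
    return "invoice" if "invoice" in lowered or "amount due" in lowered else "other"


# Tiny hand-labeled sample used only to compare the two models.
sample = [
    ("Invoice #12", "invoice"),
    ("Amount due: $40", "invoice"),
    ("Meeting notes", "other"),
    ("Lunch menu", "other"),
]

name, model = pick_model(small_model, large_model, sample, threshold=0.9)
```

In this toy sample the small stand-in misses one document, so the check falls back to the larger one. With real models, the same comparison would run on a hand-labeled subset before committing millions of documents to either model.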

I’m on my “home” network, but it’s a “home office” for my startup.

Datagenerator · on March 28, 2024

Interesting, I have the same task. Can you share your tools? My goal is to detect whether documents contain GDPR-sensitive parts or are copies of official documents such as IDs and driving licenses. It would be great to reuse your work!

bevekspldnw · on March 28, 2024

Working in the same sector, we’ll license it out soon.