Launch HN: Datasaur (YC W20) – data labeling interface for NLP
174 points by flyx on March 6, 2020 | 62 comments
Hey HN community -

I’m Ivan from Datasaur (https://datasaur.ai/) - we build software that lets humans label data more efficiently for training natural language processing (NLP) models.

NLP algorithms are being trained in a wide variety of industries - from customer service to legal contracts, forum moderation to restaurant reviews. All these algorithms benefit from recent breakthroughs in academia and a generous open-source community. However, in order to be deployed to the real world, they require a custom set of training data to learn and understand the language unique to each industry. Therefore, people around the world are meticulously labeling data samples.

Example sentence: London is the capital and largest city of England and of the United Kingdom.

Labels: “London” —> “capital”, “United Kingdom”

Labels: “London” —> “largest city”, “England”

In the last few years I’ve worked at companies such as Apple and Yahoo and noticed that many organizations tend to reinvent the wheel when creating labeling interfaces for their labelers. Some companies still do this work in Excel. We saw an opportunity to create a "single interface to rule them all" - to handle all sorts of text labeling tasks.

We leverage existing NLP capabilities to intelligently validate the quality of labels in a document and complement human judgment. Furthermore, we already understand terms like “Starbucks” and “New York” - why spend time labeling these terms from scratch every time? We created an API so you can plug in existing models to apply a first pass on labeling the document. We also built many other extensions to help labelers optimize their time - a “find and label” extension for labeling repetitive terms, and a dictionary extension for quickly looking up unfamiliar terms. We spent the past year building out the labeling solution I wish I could have used.
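
For readers unfamiliar with the pattern, here is a minimal sketch of what such a first pass can look like using spaCy's pretrained pipeline; the output shape is illustrative only, not Datasaur's actual API:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline

    def prelabel(text):
        # Propose entity spans for a human to confirm or correct.
        doc = nlp(text)
        return [{"start": ent.start_char, "end": ent.end_char,
                 "text": ent.text, "label": ent.label_}
                for ent in doc.ents]

    # "Starbucks" (ORG) and "New York" (GPE) come back pre-labeled, so
    # labelers review suggestions instead of starting from scratch.
    print(prelabel("Starbucks opened a new store in New York."))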

We now handle named entity recognition, part-of-speech tagging, document labeling, coreference resolution (multiple words referring to the same object/person), and dependency parsing (drawing relationships between words). A case study with one of our clients showed a 70% improvement in labeling efficiency upon adopting the Datasaur platform, and we have much more room to improve.

We've also spoken with 100+ AI teams globally and identified the best practices in labeling. In addition to providing an enhanced interface, we can help track labeler performance and peer disagreement scores, and detect/remove labeler bias. By encoding these practices into our software, we can not only help improve labeling efficiency but also improve the quality of the data and therefore of the resulting AI model.

We believe that as AI becomes ever more prevalent and ubiquitous, labeling will become an increasingly important task. AI is a garbage-in, garbage-out technology, and the quantity and quality of data can often make a critical difference in the resulting AI model. We’re really excited to open Datasaur up to the world today and hear your feedback. Have you run into similar labeling issues? What tips and tricks have you employed to keep up with AI’s voracious appetite for data? We’d love to hear how you’ve tackled data labeling at your own companies. Thanks so much in advance!

Ivan




It resembles an open source annotation tool that has existed for years. https://brat.nlplab.org/

It doesn't include an ML assistant though.

We have built a semi-automated annotation tool for our internal use too. ML models help classify documents and extract named entities by making suggestions. Sometimes I'm thinking of spinning it off as a standalone product but not sure how big the market would be.


We're fans of brat and are hoping to take it to the next level. We've devoted many hours to building a highly performant web app on a modern tech stack. Additionally, we're expanding beyond a labeling interface to include semi-automated annotation, as well as team management capabilities.


I think the semi-automated annotation tool is quite similar to Prodigy.


Both can be described as semi-automated annotation, but we use different approaches! Datasaur allows you to plug in any pre-existing model to pre-label or validate your labels. One such model we integrate with is spaCy, so we're certainly fans of the Prodigy/Explosion team.


Yes, it's more similar to Figure Eight's Text Annotation tool, which allows you to use pre-existing models (including spaCy) or bring your own model:

https://www.figure-eight.com/platform/machine-learning-assis...


Looks great, Ivan. Congrats! If I understand correctly, you use spaCy and some pre-trained models to validate human labels. Now my question is: what is the point of collecting labels for ML training if we already have a valid model for the same task that can complement human labels?


Cool project. What would be cooler is if you had an API to retrieve the labels for a given word. Maybe that's in the works?


Done and shipped! :D One of our extensions allows you to plug in an API - either use your own model or an integration with an open-source project like spaCy to apply labels.
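
To make that concrete, here is a hypothetical sketch of such an endpoint; the /label route, payload, and response shape are invented for illustration and are not Datasaur's documented contract:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def my_model(text):
        # Stand-in for your own model: finds one hard-coded entity.
        i = text.find("Starbucks")
        return [(i, i + len("Starbucks"), "ORG")] if i >= 0 else []

    @app.route("/label", methods=["POST"])
    def label():
        text = request.get_json()["text"]
        return jsonify([{"start": s, "end": e, "label": l}
                        for s, e, l in my_model(text)])

    if __name__ == "__main__":
        app.run(port=8000)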


Could you please elaborate on what you mean by "intelligently validate the quality of labels in a document and complement human judgment", and discuss your methodology?

This seems to operate under the assumption that human labels are not actually the ground truth. I understand that they can be dirty, but most unsupervised approaches aren't producing a ground truth, either. So, are you saying it's better to have multiple pretty good sources of truth instead? Because depending on the application, that might make sense or it might be like trying to start a farm with a dead horse and a dead cow.


Certainly. Our philosophy is to complement human wisdom with computer precision. Humans may often be labeling for 8 hours a day and may get fatigued. So if Starbucks has been labeled as a cafe 35x in a document and as a person 2x, we can flag this and ask "hey, are you sure you wanted to label this as a person?". Or if we know for a fact Canada is a country, but it's labeled as an animal in a document, we can raise a flag as well. This won't work for everything, but we think it can help with quality assurance.
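
As a rough illustration of that majority-vote style of check (the 90% agreement threshold here is made up, not Datasaur's actual logic):

    from collections import Counter

    def flag_outliers(labels, min_agreement=0.9):
        # labels: list of (span_text, label) pairs from one document.
        by_text = {}
        for span_text, label in labels:
            by_text.setdefault(span_text, Counter())[label] += 1
        flags = []
        for span_text, counts in by_text.items():
            majority, majority_n = counts.most_common(1)[0]
            if majority_n / sum(counts.values()) >= min_agreement:
                flags += [(span_text, label, majority)
                          for label in counts if label != majority]
        return flags

    # The 2x "person" labels get flagged against the 35x "cafe" majority.
    doc = [("Starbucks", "cafe")] * 35 + [("Starbucks", "person")] * 2
    print(flag_outliers(doc))  # [('Starbucks', 'person', 'cafe')]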


Looks interesting, signed up to try this out and see if it might deal with some of our labelling.

A few notes -

Some help docs would be good, or better links to them. You specify a few types of projects but don't really explain what they are - I tried searching for "constituency" type projects but I still have no idea what they are.

You're sending error messages to the frontend. "Cannot read property 'startsWith' of undefined" is not something that should be reaching an end user, and this is happening unreliably when I upload files.

If I upload a CSV file I cannot seem to do NER from "new project" - choosing NER specifically supports TSV but not CSV. My TSV files just say "server error", though they load fine as plain txt files.

What's a question set? What's the format you need from me (it just says "csv").

Autolabelling seems to do nothing. Do you have example text where that should work?

I can navigate the text but the hotkeys for labelling don't do anything until I've already clicked once.

Search & label all is interesting but doesn't seem to give me any labelling options. Also, the regex search for "someword \w" just returns every pair of adjacent words, which seems wrong to me.
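
For comparison, here's what a standard regex engine does with that pattern (quick Python sketch):

    import re

    text = "someword x other words someword y"
    # \w matches a single word character, so only one character
    # after the space should be included in each match.
    print(re.findall(r"someword \w", text))  # ['someword x', 'someword y']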

Congrats on the launch!


Hi IanCal - Really appreciate you taking the time to test Datasaur out and provide this feedback! Responses below:

- We took the risk of launching before our tutorial was in place in order to get feedback, so I apologize for the lack of clarity for first-time users.

- Good point, we'll try to improve error messages for users.

- We've been taking import formats on a case-by-case basis, but we're looking to improve/expand our import/export format flexibility further.

- This is probably one of the highest-priority issues for us to fix - making sure the expected format and input is clear to the user.

- Happy to send you additional sample files and instructions on how to get autolabelling working. We can accept your own models/API endpoints as well!

- Hotkeys for labeling should work - this bug seems odd. What kind of project is this for?

- Search and label should work as soon as labels have been uploaded; sorry that it wasn't working for you.

Happy to discuss any of this in further detail. Thanks again for leaving such comprehensive thoughts!


This is awesome - really excited to see this need being addressed.


Congrats on the launch! I spend more than 50% of my time labeling data and this will make life much easier.


This looks awesome! Waiting for my email confirmation.

I was looking for information about where my data has to be hosted to use this service and could not find it. Will there be some more information about how this data is handled once I get past the login? Thanks!


We offer both a hosted service on AWS and an on-prem solution if needed. We can even host on the cloud provider of your choice - happy to work with you on this!


This is very cool! I especially love the logo. Congrats on the launch and best of luck.


Always happy to hear people complimenting the logo. I've been told it's not professional enough, but I really wanted our site to have some personality. Thanks!


In the spreadsheet view, do users need to upload labels as a text file to then assign them to items? I work with Quill.org, a nonprofit edtech tool that helps students improve their writing skills, and we do a lot of labeling work now where we may need to assign, say, one of 20 labels to 1,000 responses at a time. I uploaded some sample data, but didn't understand how I could quickly assign labels to my content. Please let me know if I'm missing something here.


Hi there - you may choose to upload labels as a text file or create your own. I'd be curious to hear more about your use case in batch-applying labels. I'll follow up offline (well, via email).


Congratulations on the launch!

To understand the scope of your work a little: if I already have Prodigy set up with custom labeling needs, do I still benefit from switching to Datasaur?


Apologies for the delay! There is some overlap with what Prodigy works on, and I'm a big fan of what they're doing. We cover some additional use cases (like coreference resolution) and additionally help with managing teams of labelers. We're complementary in many regards. Happy to discuss further, based on your labeling needs.


If anyone on the team is reading this post, please answer this question.


This looks wonderful, will definitely try it out. We ran into the labeling issue a couple of years ago when doing NER on a Reddit books dataset. If only this had existed then.


Thanks for the kind words! Yea we're building out what I wish we had at my last few companies. Looking forward to your feedback.


Wish you good luck - the website looks clean and the product idea is good :) However, you're requesting an image that is 3000+ pixels wide: https://s.datasaur.ai/static/media/homepage-hero.4917b8af.pn... . 1200px in width should be enough; I would resize the image, as it slows down the page.
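
For example, a quick resize with Pillow (filenames illustrative):

    from PIL import Image

    img = Image.open("homepage-hero.png")
    scale = 1200 / img.width
    resized = img.resize((1200, round(img.height * scale)))
    resized.save("homepage-hero-1200.png", optimize=True)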


Yikes - good point. We'll optimize.


Interesting product. Could have used this at previous companies. How is this different from FigureEight or Scale?


Yes, it looks directly comparable to Figure Eight. This product is clearly meant to be similar to their Text Annotation tool: https://www.figure-eight.com/platform/machine-learning-assis...

Both products offer the opportunity to use provided labelers or to bring your own.


Scale offers labeling as a service. Datasaur is an interface that companies can buy for their own labeling personnel, if I understand correctly.


That's right! Scale probably has some awesome internal tools that help them label faster. Datasaur wants to make those same optimizations available to anyone with their own labelers.


Any chance you could support HTML files? We've been using https://www.tagtog.net/ for some of our data labeling/annotation needs, but their tool for these file types is still "experimental".


Sure can! Would love to hear more - what do you want to extract from the HTML files?


Thanks! We actually just need to be able to upload HTML files and have them rendered as web pages (not just displayed as HTML code) so our team can label/annotate certain sentences throughout the document.


Yea, we can 100% handle this. If you sign up for a demo, happy to discuss further!


LinkedIn suggested a post from you a couple weeks ago and I remember thinking “what’s Ivan up to?” - and there was Datasaur. Congrats on YC! I know our time at Yahoo overlapped only briefly, but I remember the swirl of ML, Knowledge Graph, and labelling work our org was in the middle of 5 years ago.

Good luck with Datasaur!

- Julio Nobrega


Julio - great to hear from you, and thanks for the kind words :) In many ways, my journey to Datasaur began with that team/project 5 years ago.


The name is the best part of the project (and the project itself is already an awesome tool).


Datasaur looks awesome! Can't wait to try it out. Congrats on the launch!

Curious about data security and privacy: how do you guarantee privacy? Is there some cryptography or are secure enclaves used? Some sets of documents (and email) are super high trust.

Guessing the on-prem version is probably the safest route.


Thanks - good to see so many people concerned about privacy here. I consider privacy a top-level priority at Datasaur - all data is fully encrypted. Our employees will never be able to see or access any customer data. We already work with a bank and cleared their security bar :)


I've got a question: a lot of startup websites have a similar look to this one. It's a look I actually really like. What technologies are they all using? How would I build a site like this?

wappalyzer doesn't give me anything and I don't have a ton of webdev experience.


So here's the secret: our awesome designer actually put a lot of hard work into putting this together. One week later a friend told me about https://www.landen.co/ and I wish I had used it to save us some time (I don't know the team, just a fan of the product).


Very cool logo. Just signed up.


Following up. Your confirmation email gets flagged as leading to an untrusted site in Gmail. Might be worth figuring out.


Yikes, will look into it. Thanks for the heads up!


On the pricing page, the Growth box shows a checkmark for "Unlimited labels", but right below, in the "Choose the right plan for you" section, the Growth plan says the number of labels is 10,000,000.


Great catch! We'll correct it asap. Since you caught it, we'll give you unlimited labels :)


Great idea - and this is an odd comment: I think you're pricing it too low relative to the number of people in the market. Just my gut - could well be wrong, wrong, and wrong again.


Music to my ears! I think (and hope) you're probably right - so let's consider this an introductory launch price and assume it'll go up over time ;)


Hey Ivan, this looks great! What are the privacy implications for my data that I want to label with your tool? I’m assuming I upload it to your servers?


Great question! Data privacy is a top-level priority for us. We actually offer both a cloud-based and on-prem solution. One of our clients needed a fully on-prem, air-gapped (no connection to internet) option. Many are choosing to use us because they can't send their data to outsourced, external parties.


Congrats. Do you guys use AllenNLP, by any chance?


We've looked into it! So far we've chosen to integrate with spaCy. Can I ask what you like about AllenNLP?


Hi there, great product! I'm with a nonprofit edtech writing tool that uses both spaCy and AllenNLP, and we've found AllenNLP's models to be more accurate for tasks like co-reference resolution. AllenNLP's models are built on top of spaCy and tools like neural coref. It'd be great if we could harness things like AllenNLP's semantic role labeling.
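
For anyone curious, AllenNLP exposes its pretrained models through a Predictor interface; a minimal coreference sketch looks roughly like this (the model archive URL is the one AllenNLP published around this time and may since have moved):

    from allennlp.predictors.predictor import Predictor

    # Pretrained SpanBERT coreference model from AllenNLP's public bucket.
    predictor = Predictor.from_path(
        "https://storage.googleapis.com/allennlp-public-models/"
        "coref-spanbert-large-2020.02.27.tar.gz"
    )
    result = predictor.predict(
        document="London is the capital of the UK. It is also its largest city."
    )
    print(result["clusters"])  # token-index spans grouped by entity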


We actually allow you to plug in any existing model (including your own), so we should be able to support AllenNLP as well. Happy to chat further!


Awesome! Congrats, excited for this!


Any plans to support image annotation (something similar to what CVAT does)?


We currently support image classification. CVAT is a great tool and we'd love to support all forms and types of data in the future!


Looks awesome! I need to convince my team to use Datasaur.

Congratulations on the launch!


Roarrsome, congrats!


roar (:


Congratulations on the launch, Ivan! Best of luck!



