Launch HN: Dashblock (YC S19) – Turn Any Website into an API
165 points by HPouillot on Sept 18, 2019 | 100 comments
Hey HN,

We're Hugues and Max, co-founders of Dashblock (https://dashblock.com). Dashblock turns any website into an API. People use us to access product information, news content, sales-related data, or real-estate listings, for instance.

As a data scientist, Hugues realised how complicated it is to access web data programmatically when a website doesn't provide an API. You have to build a script to pull the HTML, render the page in some cases, find selectors for the information you are interested in, and distribute your tasks to scale; and if the structure of the page changes, you have to update your selectors to locate the information again.
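
To make that concrete, here is a minimal sketch of that manual workflow (the URL and CSS selectors below are hypothetical):

  # Fetch the HTML, then extract data with hand-written CSS selectors.
  import requests
  from bs4 import BeautifulSoup

  html = requests.get("https://example.com/products", timeout=10).text
  soup = BeautifulSoup(html, "html.parser")

  # These selectors break whenever the markup changes, which is exactly
  # the maintenance burden described above.
  for card in soup.select("div.product-card"):
      name = card.select_one("h2.title").get_text(strip=True)
      price = card.select_one("span.price").get_text(strip=True)
      print(name, price)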

We decided to build Dashblock to make it really simple to access web data through an API. Our software is basically a browser that allows you to access a website, right-click on the information you want to extract and preview your API on other pages.

In order to create long-lasting APIs, we developed a machine learning model that is resilient to website updates. For now, we mainly handle changes at the level of the HTML structure, but with enough training data we will also be resilient to UI updates.

Our model also detects similar content on the page to simplify the selection process. When you call your API, we launch a headless browser, render the page, classify the content of the page using structural, visual, and semantic features, and structure it by minimizing entropy to give you a list when needed.
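
As a toy illustration of the entropy idea (this is not our actual model): among candidate groupings of the classified elements, keep the one whose label distribution is most homogeneous, i.e. has the lowest Shannon entropy:

  import math
  from collections import Counter

  def label_entropy(labels):
      # Shannon entropy of the label distribution, in bits.
      total = len(labels)
      return -sum((c / total) * math.log2(c / total)
                  for c in Counter(labels).values())

  # Two candidate groupings of the same page content:
  flat = ["title", "price", "title", "price", "title", "price"]
  records = ["item", "item", "item"]  # each item bundles one title + price

  # The record-level grouping has lower entropy (0.0 vs 1.0 bits), so the
  # page is structured as a list of items rather than a flat bag of fields.
  print(min([flat, records], key=label_entropy))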

Our pricing is based on the number of API calls you make per month, and if you want to give it a try, we currently offer 10k free API calls when you sign up! You can download our software at dashblock.com.
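
Calling a generated API looks roughly like this (the endpoint, key, and response fields below are illustrative placeholders, not our documented interface):

  import requests

  resp = requests.post(
      "https://api.dashblock.example/v1/extract",  # placeholder endpoint
      json={"api_key": "YOUR_KEY", "url": "https://example.com/products"},
      timeout=60,  # rendering a page headlessly can take several seconds
  )
  for item in resp.json().get("items", []):
      print(item)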

If you have any questions, we would be happy to answer them, and if you have any related ideas, feedback, or experiences, feel free to share them :)

Thank you!




A productized scraping service - useful! Entire companies are built around scraping certain popular sites - this is a disruptive idea indeed. A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this.

However, the ML claim is highly suspect. There is no way that a machine could reliably understand the semantic content of a website - that would require Artificial General Intelligence. If anyone could do that, it would've been Google. But even Google relies on human-edited structured metadata to define the content of sites (i.e. Rich Snippets and the like).


It doesn't require Artificial General Intelligence. With enough training data (crowdsourced data and human-edited metadata like JSON-LD or RDF), we can automatically classify the attributes on the page (product name, movie title, creation date, author), structure them, and recognise the type of entity.

Feel free to contact us if you want to invest (hello@dashblock.com); we are currently raising funds ;)


But what's the value compared to using open-source products like Portia (https://portia.readthedocs.io/en/latest/getting-started.html)? Functionally it looks very similar.


I'm sure this comment will go down in history like the Dropbox comment


Fine, I'd rather lose my comment than my invested money


> A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this.

Check out Apify store (https://apify.com/store). It's built exactly for that purpose.

(Disclaimer: I'm a co-founder of Apify)


Duplex for web [1] would certainly benefit from this kind of understanding so I wouldn't be surprised if Google is working on this.

1. https://www.theverge.com/2019/5/7/18531195/google-duplex-web...


Congrats on the launch! Look forward to playing around with it.

I remember a similar startup making quite a splash on HN a few years ago — http://www.kimonolabs.com/. Do you know why they shut it down and what you guys are doing differently?


Thank you! Kimono Labs built an amazing product and was eventually acquired by Palantir. The main difference between us comes from our machine learning model, which keeps extraction stable over time and parses webpages in a generic way. We are also working on automating navigation, which is something they didn't do =)


How would you compare to something like UiPath? Congrats on the launch.


UiPath and other RPA vendors mainly let their users automate local processes on Windows. We are cross-platform and focused on the web, which makes us useful for other use cases like gathering data or automating navigation sessions!


Very nice :)


They were acquired by Palantir.


Seems like the team decided to stop working on it, and that it was an acquihire:

> we’ve realized that continuing to work in isolation on a general data collection tool simply won’t allow us to make the impact we want

Curious to hear the Dashblock vision in comparison :)


This has been tried many times, and it never seems to gain enough traction to become a relevant concept. Off the top of my head, I remember Kimono Labs, which looked quite promising. Then it was acquired by Palantir and shut down. I have also seen many similar solutions (basically most scraping companies, like Diffbot, which also claims to use machine learning for its extraction techniques).

What's the plan here to really become differentiated? Why is now the right time for this concept, when others tried it before? Also, how do you plan to address the concerns of companies that don't want their data to be accessed programmatically? That seems like a big challenge to overcome in order to become commercially successful.


Thanks for your feedback! We talked to the co-founders of Kimono Labs, and their approach was a bit different. Our goal is to automate processes on the Internet, and scraping is just the first step.

The timing is right because, to do that, you need a robust headless browser and a smart way to locally identify the elements on the page if you don't want to maintain your scripts. That's why we use Puppeteer and TensorFlow.js, which didn't exist 2-3 years ago.
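
For illustration, the headless-rendering step looks roughly like this (sketched here in Python with pyppeteer, an unofficial port of Puppeteer; the URL is a placeholder):

  import asyncio
  from pyppeteer import launch

  async def rendered_html(url):
      # Render the page, including its JavaScript, before extraction.
      browser = await launch(headless=True)
      page = await browser.newPage()
      await page.goto(url, waitUntil="networkidle0")  # let JS settle
      html = await page.content()
      await browser.close()
      return html

  print(asyncio.get_event_loop().run_until_complete(
      rendered_html("https://example.com")))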

But sure, there are website owners who don't want an API for their website. Our plan is not to fight against them but to start with owners who are already convinced they could benefit from automating the usage of their websites. The banking sector understood that, and that's why Yodlee and Plaid are so successful today.

And if you step back, there are tons of websites that don't have the resources to create an API (30% of websites were created using WordPress) and don't know the value they could generate from one.

So yes, we'll have to overcome a lot of challenges to build this technology and make it accessible to everyone, but we are convinced that the Internet will be used more and more programmatically in the future, and we are just paving the way for it ;)


Regarding your question about companies' concerns: if the data is made publicly available (i.e. the web page is not behind authentication), then why should it matter how it's accessed?


If you can access it programmatically, then you can access it at scale, which means you can quickly scrape content and replicate it somewhere else. Many businesses rely on a model where the data or information they generate is meant to be consumed by a human.

For example, Google temporarily bans your IP when you hit things like Google Play URLs multiple times in a few minutes. This is clearly an attempt to block anyone but a human from extracting information from the Play store.


I can imagine some companies wanting that data to be accessed in a specific delivery format (i.e. with branding experience attached).

They might also be concerned about inaccuracies from variable pricing models, for example. There are a few reasons why you may not want it accessible - hence one of the reasons why CORS is even a thing.


The API would bypass ads on the page?

I feel like this would have the same sort of friction that RSS had.

Which is to say, it could certainly still work.



Sorry to be a pain here, but I very much doubt your ML thing is working at all. Opening a website and finding a dom element is trivial, so the only thing I'd get when buying from you is the promise that this will be resilient to website updates.

But at the same time, for $500/month, you can definitely have people updating the selectors manually...


I've been using Dashblock in a production environment and it's super easy to create and use the APIs on sites. We'd previously written our own scripts to do this at scale, but it was difficult to keep them all up to date. You're right that fixing a single page's dom changes is trivial, but it's a real pain to scale that. Regarding the ML aspect, I've tested changing the dom for a page in Dashblock and it seems to work... it didn't break the scraping I had set up. The price might not make sense for everybody, but for me it's definitely worth it.


Save your money. Create robust regression scripts and hire a freelancer to fix them when the schema changes. Anyone seriously crawling/scraping data from websites won't leave it up to some automated ML magic to extract the right data for them.


The freelancer approach does not work if the data you extract is time-critical. But then again, why would anyone not try to find an API for such data rather than rely on DOM parsing? So OP's product is worth it if you have validated and trust their ML model to work correctly, believe they can guarantee a certain uptime, and the data you extract is time-sensitive or mission-critical and cannot be sourced from an API elsewhere. People with such needs may be willing to pay good money, but good luck finding early adopters as well as the data niche where there is demand for something like this.


Have you tried it?


Very cool, but super slow, especially for an API endpoint, which I would expect you could use directly from a front end.

Tested on a site I regularly visit

  dashblock (3 selectors, ~20 items):  16.911 seconds
  curl (no scraping):  60 ms
  chrome:  987 ms

edit: added chrome


Indeed, we render the whole page including the JavaScript, which is why it takes longer than a curl. For now, it's especially useful for dynamic pages, but we also plan on supporting pages that don't require rendering.


Maybe you already do it, but I think integrating adblocker functionality when loading JS sites would be desirable to reduce load time. And if ads are what the API user is interested in, perhaps add a flag for whether or not one wants ads to load.

Recommendation: https://github.com/cliqz-oss/adblocker Should be the fastest adblocker library (used by Ghostery, Cliqz and Brave)


Thanks for the advice, it makes a lot of sense!


Sounds good. It does make sense to check whether your selectors work with raw HTML at publish time, to verify whether JS rendering is required.


Yep, that's our plan :)


I see some potential in this for accessibility. There are some websites which are impossible or very hard to use with a screen reader and don't provide an official API, mostly for corporate lock-in reasons. Using this tool for API generation and then writing a really nice-to-use client would be awesome. My life could be so much better with solutions like those.


This is cool, looking forward to trying it out. Manual scraping is doable, but there have been plenty of times that I've just decided not to do something because I'd have to spend an hour getting/scraping the data. Hopefully this will take that time down to 10 minutes or less.


Definitely =) Let us know if we can be of any help!


I really want to try it because I think I need something like this.

However, how do I know it is a legitimate product and not some virus/scam software?

I know YC is a vote of confidence, if it is a YC company, and all the copy sounds legit, and you sound like a pair of honest, hard-working entrepreneurs.

But is there some way to check before I run the software?

Edit: Note that I would not have this concern if it were a web platform like Kimono used to be.


That's a good point. Our app is validated by Apple on macOS, and the Windows version will be soon. Also, we have thousands of users and you can google us: no spam complaints at all =)

Note: FYI, we worked on a SaaS version, but the user experience was not slick enough from our point of view (e.g. rendering websites in an iframe).


Hey Hugues and Max, congratulations.

Can I ask some questions about how this would apply to a project of mine?

I currently create a personal newspaper, printed daily in my office. It’s a reasonably large piece of software that pulls in my calendar, emails, news stories I care about, twitter feeds, weather, stock quotes, etc.

I use Python's newspaper library for parsing RSS feed links to news sites, but it is at times lacking, so Dashblock strikes me as interesting.

What I understand from the video is I could over time build out APIs with dashblock for major news sites; this would help with a few sites that are hard for newspaper.

How would I use Dashblock in production - unattended, via CLI on Linux or Mac? Also, it looks really slow in the video; are these typical speeds? Is this something that you require to be run on your cloud, or could I run it locally?

Thanks, Peter


Thanks!

You can create an API for any website (news websites included) from our Mac/Windows software, and you can access this API from anywhere. It runs on our servers and you can query it from any language you want. Let us know if you need more help: hello@dashblock.com


Have done a lot of scraping in my life, and I'm super excited about what Hugues and Max have built.

I tested a super early version and was surprised how well it worked.


Thanks Radu!


How do you avoid getting banned by the companies you scrape? Most ToSs have a clause like:

> We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means... [etc]


This may now be moot after the LinkedIn vs. hiQ Labs case a couple of days ago, which appears to have blanket-legalised web scraping.


hiQ v. LinkedIn means you probably aren't going to jail for scraping LinkedIn's website. It doesn't mean LinkedIn can't IP ban you.


Agreed, some websites are really resistant to scraping. But think about Google: they scrape the whole web regardless of websites' ToS, so it all boils down to one question: do you create value for the website owner? That's why we want to focus on use cases where we create value for both our users and the website owner. If you think about Yodlee/Plaid in the banking sector, they built partnerships with the banks but continued scraping them because most of them didn't provide an API.


Google respects robots.txt.
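
For reference, checking it before fetching is trivial with Python's standard library (the URLs and user agent below are placeholders):

  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser("https://example.com/robots.txt")
  rp.read()
  # True if the given user agent may fetch the URL under robots.txt rules.
  print(rp.can_fetch("MyBot", "https://example.com/some/page"))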



BYOP (bring your own proxies)


Are you supporting the use case where website providers consider scraping to be hostile? I.e., spinning up new cloud instances until one isn't blacklisted by the site, all behind the scenes, so the consumer of your API doesn't have to worry about such things?


We don't use sophisticated methods for now; we just use a serverless architecture, so the IP changes at each invocation. Feel free to contact us at hello@dashblock.com if it doesn't work for your use case :)


Nice work. Do you have an admin API for creating or managing the APIs you generate? Asking in the case of integrating into another app.

Also, how well does it handle JavaScript apps? Can you specify different engines to parse a site with or specify JS disabled/enabled etc?


We don't provide an API to manage other APIs yet, but this inception use case is interesting. Could you specify what your app would like to do? We render the JavaScript of the page, and for now we don't provide a way to specify whether you want to render the page or not, but we plan on doing so.


Has anyone tried this for careers pages? Would be interested in how this performs on a random sample of ~50 Crunchbase NYC startups' careers pages. I dunno how much time would have to be spent on training data...


We did :) It works on all kinds of pages. You just have to set it up on one page and it will work on all similar pages of the website. Were you thinking of training a model to recognise careers pages across websites?


Yeah, that would be really helpful. I want to monitor the careers pages of all local companies in the Crunchbase NYC geo in order to help candidates search for local companies by keywords (e.g. C#). We have an API already (syncs with Algolia) to receive the jobs, with a unique key on each job's URI; and we wouldn't want to scrape more than once per day.
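
Something roughly like this, for the daily sync (the endpoint and response fields are placeholders):

  import requests

  seen_uris = set()  # in production: keys already in the Algolia index

  def sync_careers_page(api_url, careers_url):
      # Run once per day; dedupe on each job's URI so reruns
      # don't create duplicate records.
      resp = requests.post(api_url, json={"url": careers_url}, timeout=60)
      for job in resp.json().get("items", []):
          if job["uri"] not in seen_uris:
              seen_uris.add(job["uri"])
              print("new job:", job.get("title"))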


Would love to use that if/when you get it working.


It's quite a daunting project, but if you want to join the @codeforcash on Keybase, would definitely welcome support.


Very cool! Is there a way to authenticate in to a site and then keep a session alive to scrape private content? Can it pass cookies or can you manually set headers?


Not yet, but we are working on it =)


Congratulations on launching. This seems like a cool idea. I have some reservations on how widespread the adoption could be. But I love the concept.


I've had some ideas that have relied on scraping data from sources that don't provide an open API (and server-render their sites), and the scraping part has been a bit of a barrier - gotta say I'm amazed how easy it was using your tool. The UX was pretty intuitive also, I like that you've basically embedded a web browser, cos everybody already knows how to use a web browser!


Thanks for your feedback!


This looks awesome, just tried it out on Poshmark (they don't have a feature to alert me when new items in my size are listed). I was a huge fan of Kimono Labs before they stopped operating, and this serves a similar purpose for me.

I might have missed it, but how can I see (or edit) the configuration of my configured API? It looks like all I can do is run the API or delete it.


I was a huge fan of Kimono as well. You can't edit an API for now but we will add this feature in the next release.


I like how simple it is—best of luck! (BTW I think your demo video can be shortened in the middle; after 6 selectors it's clear how that works.)

1. How hard would it be to do inputs? That is, there's a form that I have to fill out manually but I want to do so by API.

2. How well does this work for creating UX tests? The Selenium "no code" tools I've seen are terrible.


Thanks! 1. It changes the user experience, but the underlying model stays the same and will allow our users to record sessions with inputs and clicks in upcoming releases. 2. Indeed, if you can replay a session you can check that the data is what you expected. What solutions have you tried so far?


Love this, I submitted this to API List (https://apilist.fun/api/dashblock). I've been seeing more and more scraping APIs become available; it seems it is becoming a very competitive industry, and this is a unique solution (at least from what I've seen).


Thanks =)


Looks promising, but it's only available for OSX and Windows. Will we be seeing a Linux release soon?


Yes! We have been quite busy since the end of YC, but we plan on releasing it soon =) Please ping us at hello@dashblock.com and we will let you know when the version is live!


I tried a couple of web scraping tools in the past weeks and Dashblock was by far the best. Easy to start and getting the results with an API is exactly what I wanted. (In my case I connected it to Zapier + Airtable).


Thanks for your feedback!


If this works for amazon.com.au, with its 20 different page layouts and page navigation systems (sometimes AJAX, sometimes not) for different product types, I'll be impressed.


Indeed, Amazon has different layouts and can be tricky. For now, our model is resilient to minor changes, but we are working on improving it - amazon.com.au looks like a good test ;-)


It looks like the 10K API call offer is limited to people who sign up for the developer plan ($149/mo), but your post implies it's free. Did I misread the offer in your post?


No, you read it correctly: by creating an account today, you get 10k free API calls =)


Ah, guess I missed the 10k limit by only signing up today? It certainly wasn't clear that it was limited to the day of the post.


Good marketing, I'm creating an account to use later in the year.

What's the minimum macOS version? Why not web, if this is Electron?


Ahah, great! The minimum required version is 10.10 (Yosemite).

If you want to do that on the web, you have to render the page in an iframe to select the content, and most websites don't allow it. In short, the user experience is way better with a desktop app.


OK, downloading now. Can I still benefit from the offer, pretty please :)


I installed it, but it didn't get me the data I needed. I am still gonna use ParseHub, which allows me to easily go up and down the HTML tree to get data hidden under layers of divs.



Congrats on the launch! Seems a little similar to Diffbot.com, but they do not require a client download.


Correct! Also, Diffbot automatically extracts generic entities (e.g. product name and price, comments, etc.), while we let our users choose exactly the data they want on any webpage =)


Is the number of API calls per month? Is the answer the same for a free account?


You get 10k API calls when you sign up and 1k per month after that. Does that answer your question?


Did you say webSITE, not webPAGE?!

Oh wow, instagram.com is on your YouTube demo video thumbnail. Interested to know how it traverses the site; I don't think FB has made the usernames public.


We don't crawl websites yet. However, you can create an API on a given webpage and gather data from similar webpages on the same website by calling the API with the new URL.


How do you differentiate from Octoparse?


There are plenty of differences, among which: 1/ we don't rely on classic selectors (CSS, XPath, etc.), which allows us to be resilient to website updates; 2/ we offer a simple UI that automates data selection and structuring; and 3/ we are available on Windows and macOS =)


> we don't rely on classic selectors (CSS, xPath, etc)

I'm not buying this. Does the AI process HTML as text, lol? Surely it processes it as a tree, right?


We use machine learning to extract the content of the page, which means that when the webpage changes you don't have to update your selection as you would with Octoparse.


Congrats! Look forward to trying it out.


Thanks! Let us know if you have any questions or feedback =)


Really neat!


Thank you!


I wonder how long it'll take these sites to require Captcha for basic access.


Good question. However, that would require websites' users to validate a Captcha every time they navigate, which is not optimal in terms of user experience.


reCaptcha V3 operates behind the scenes though:

https://developers.google.com/recaptcha/docs/v3


Good point! That's why our plan is to focus on use-cases that create value for websites too, in order to partner up with them.



