Launch HN: Dashblock (YC S19) – Turn Any Website into an API
165 points by HPouillot on Sept 18, 2019 | 100 comments
Hey HN,

We're Hugues and Max, co-founders of Dashblock (https://dashblock.com). Dashblock turns any website into an API. People use us to access product information, news content, sales-related data, or real-estate listings, for instance.

As a data scientist, Hugues realised how complicated it is to access web data programmatically when a website doesn't provide an API. You have to build a script to pull the HTML, render the page in some cases, find selectors for the information you are interested in, and distribute your tasks to scale; and if the structure of the page changes, you have to update your selectors to locate the information again.
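
To make that concrete, here is a minimal sketch of that manual workflow (the URL and CSS selectors below are hypothetical):

  # Fetch the HTML, then extract data with hand-written CSS selectors.
  import requests
  from bs4 import BeautifulSoup

  html = requests.get("https://example.com/products", timeout=10).text
  soup = BeautifulSoup(html, "html.parser")

  # These selectors break whenever the markup changes, which is exactly
  # the maintenance burden described above.
  for card in soup.select("div.product-card"):
      name = card.select_one("h2.title").get_text(strip=True)
      price = card.select_one("span.price").get_text(strip=True)
      print(name, price)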

We decided to build Dashblock to make it really simple to access web data through an API. Our software is basically a browser that allows you to access a website, right-click on the information you want to extract and preview your API on other pages.

In order to create long-lasting APIs, we developed a machine learning model that is resilient to website updates. For now, we mainly handle changes at the level of the HTML structure, but with enough training data we will also be resilient to UI updates.

Our model also detects similar content on the page to simplify the selection process. When you call your API, we launch a headless browser, render the page, classify the content of the page using structural, visual, and semantic features, and structure it by minimizing entropy to give you a list when needed.
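
As a toy illustration of the entropy idea (this is not our actual model): among candidate groupings of the classified elements, keep the one whose label distribution is most homogeneous, i.e. has the lowest Shannon entropy:

  import math
  from collections import Counter

  def label_entropy(labels):
      # Shannon entropy of the label distribution, in bits.
      total = len(labels)
      return -sum((c / total) * math.log2(c / total)
                  for c in Counter(labels).values())

  # Two candidate groupings of the same page content:
  flat = ["title", "price", "title", "price", "title", "price"]
  records = ["item", "item", "item"]  # each item bundles one title + price

  # The record-level grouping has lower entropy (0.0 vs 1.0 bits), so the
  # page is structured as a list of items rather than a flat bag of fields.
  print(min([flat, records], key=label_entropy))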

Our pricing is based on the number of API calls you make per month, and if you want to give it a try, we currently offer 10k free API calls when you sign up! You can download our software at dashblock.com.
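
Calling a generated API looks roughly like this (the endpoint, key, and response fields below are illustrative placeholders, not our documented interface):

  import requests

  resp = requests.post(
      "https://api.dashblock.example/v1/extract",  # placeholder endpoint
      json={"api_key": "YOUR_KEY", "url": "https://example.com/products"},
      timeout=60,  # rendering a page headlessly can take several seconds
  )
  for item in resp.json().get("items", []):
      print(item)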

If you have any questions, we would be happy to answer them, and if you have any related ideas, feedback, or experiences, feel free to share them :)

Thank you!




A productized scraping service - useful! Entire companies are built around scraping certain popular sites - this is a disruptive idea indeed. A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this.

However, the ML claim is highly suspect. There is no way that a machine could reliably understand the semantic content of a website - that would require Artificial General Intelligence. If anyone could do that, it would've been Google. But even Google relies on human-edited structured metadata to define the content of sites (i.e. Rich Snippets and the like).


It doesn't require Artificial General Intelligence. With enough training data (crowdsourced data and human-edited metadata like JSON-LD or RDF), we can automatically classify the attributes on the page (product name, movie title, creation date, author), structure them, and recognise the type of entity.

Feel free to contact us if you want to invest (hello@dashblock.com); we are currently raising funds ;)


But what's the value compared to using open-source products like Portia (https://portia.readthedocs.io/en/latest/getting-started.html)? Functionally it looks very similar.


I'm sure this comment will go down in history like the Dropbox comment


Fine, I'd rather lose my comment than my invested money


> A growing catalog of up-to-date scrapers for popular websites would put a lot of freelancers out of work. I would invest in this.

Check out Apify store (https://apify.com/store). It's built exactly for that purpose.

(Disclaimer: I'm a co-founder of Apify)


Duplex for web [1] would certainly benefit from this kind of understanding so I wouldn't be surprised if Google is working on this.

1. https://www.theverge.com/2019/5/7/18531195/google-duplex-web...


Congrats on the launch! Look forward to playing around with it.

I remember a similar startup making quite a splash on HN a few years ago — http://www.kimonolabs.com/. Do you know why they shut it down and what you guys are doing differently?


Thank you! Kimono Labs built an amazing product and was eventually acquired by Palantir. The main difference between us comes from our machine learning model, which keeps extraction stable over time and parses webpages in a generic way. We are also working on automating navigation, which is something they didn't do =)


How would you compare to something like UiPath? Congrats on the launch.


UiPath and other RPA vendors mainly let their users automate local processes on Windows. We are cross-platform and focused on the web, which makes us useful for other use cases like gathering data or automating navigation sessions!


Very nice :)


They were acquired by Palantir.


Seems like the team decided to stop working on it, and that it was an acquihire:

> we’ve realized that continuing to work in isolation on a general data collection tool simply won’t allow us to make the impact we want

Curious to hear the Dashblock vision in comparison :)


This has been tried many times, and it never seems to gain enough traction to become a relevant concept. Off the top of my head, I remember Kimono Labs, which looked quite promising. Then it was acquired by Palantir and shut down. I have also seen many similar solutions (basically most scraping companies, like Diffbot, which also claims to use machine learning for its extraction techniques).

What's the plan here to really become differentiated? Why is now the right time for this concept, when others tried it before? Also, how do you plan to address the concerns of companies that don't want their data to be accessed programmatically? That seems like a big challenge to overcome in order to become commercially successful.


Thanks for your feedback! We talked to the co-founders of Kimono Labs, and their approach was a bit different. Our goal is to automate processes on the Internet, and scraping is just the first step.

The timing is right because, to do that, you need a robust headless browser and a smart way to locally identify the elements on the page if you don't want to maintain your scripts. That's why we use Puppeteer and TensorFlow.js, which didn't exist 2-3 years ago.
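
For illustration, the headless-rendering step looks roughly like this (sketched here in Python with pyppeteer, an unofficial port of Puppeteer; the URL is a placeholder):

  import asyncio
  from pyppeteer import launch

  async def rendered_html(url):
      # Render the page, including its JavaScript, before extraction.
      browser = await launch(headless=True)
      page = await browser.newPage()
      await page.goto(url, waitUntil="networkidle0")  # let JS settle
      html = await page.content()
      await browser.close()
      return html

  print(asyncio.get_event_loop().run_until_complete(
      rendered_html("https://example.com")))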

But sure, there are website owners who don't want an API for their website. Our plan is not to fight against them but to start with owners who are already convinced they could benefit from automating the usage of their websites. The banking sector understood that, and that's why Yodlee and Plaid are so successful today.

And if you step back, there are tons of websites that don't have the resources to create an API (30% of websites were created using WordPress) and don't know the value they could generate from one.

So yes, we'll have to overcome a lot of challenges to build this technology and make it accessible to everyone, but we are convinced that the Internet will be used more and more programmatically in the future, and we are just paving the way for it ;)


Regarding your question about companies' concerns: if the data is made publicly available (i.e. the web page is not behind authentication), then why should it matter how it's accessed?


If you can access it programmatically, then you can access it at scale, which means you can quickly scrape content and replicate it somewhere else. Many businesses rely on a model where the data or information they generate is meant to be consumed by a human.

For example, Google temporarily bans your IP when you hit things like Google Play URLs multiple times in a few minutes. This is clearly an attempt to block anyone but a human from extracting information from the Play store.


I can imagine some companies wanting that data to be accessed in a specific delivery format (i.e. with branding experience attached).

They might also be concerned about inaccuracies from variable pricing models, for example. There are a few reasons why you may not want it accessible - hence one of the reasons why CORS is even a thing.


The API would bypass ads on the page?

I feel like this would have the same sort of friction that RSS had.

Which is to say, it could certainly still work.



Sorry to be a pain here, but I very much doubt your ML thing is working at all. Opening a website and finding a dom element is trivial, so the only thing I'd get when buying from you is the promise that this will be resilient to website updates.

But at the same time, for $500/month, you can definitely have people updating the selectors manually...


I've been using Dashblock in a production environment and it's super easy to create and use the APIs on sites. We'd previously written our own scripts to do this at scale, but it was difficult to keep them all up to date. You're right that fixing a single page's dom changes is trivial, but it's a real pain to scale that. Regarding the ML aspect, I've tested changing the dom for a page in Dashblock and it seems to work... it didn't break the scraping I had set up. The price might not make sense for everybody, but for me it's definitely worth it.


Save your money. Create robust regression scripts and hire a freelancer to fix them when the schema changes. Anyone seriously crawling/scraping data from websites won't leave it up to some automated ML magic to extract the right data for them.


The freelancer approach does not work if the data you extract is time-critical. But then again, why would anyone not try to find an API for such data rather than rely on DOM parsing? So OP's product is worth it if you have validated and trust their ML model to work correctly, believe they can guarantee a certain uptime, and the data you extract is time-sensitive or mission-critical and cannot be sourced from an API elsewhere. People with such needs may be willing to pay good money, but good luck finding early adopters as well as the data niche where there is demand for something like this.


Have you tried it?


Very cool, but super slow, especially for an API endpoint, which I would expect you could use directly from a front end.

Tested on a site I regularly visit

  dashblock (3 selectors, ~20 items):  16.911 seconds
  curl (no scraping):  60 ms
  chrome:  987 ms

edit: added chrome


Indeed, we render the whole page including the JavaScript, which is why it takes longer than a curl. For now, it's especially useful for dynamic pages, but we also plan on supporting pages that don't require rendering.


Maybe you already do it, but I think integrating adblocker functionality when loading JS sites would be desirable to reduce load time. And if ads are what the API user is interested in, perhaps add a flag for whether or not one wants ads to load.

Recommendation: https://github.com/cliqz-oss/adblocker Should be the fastest adblocker library (used by Ghostery, Cliqz and Brave)


Thanks for the advice, it makes a lot of sense!


Sounds good. It does make sense to check whether your selectors work with raw HTML at publish time, to verify whether JS rendering is required.


Yep, that's our plan :)


I see some potential in this for accessibility. There are some websites which are impossible or very hard to use with a screen reader and don't provide an official API, mostly for corporate lock-in reasons. Using this tool for API generation and then writing a really nice-to-use client would be awesome. My life could be so much better with solutions like those.


This is cool, looking forward to trying it out. Manual scraping is doable, but there have been plenty of times that I've just decided not to do something because I'd have to spend an hour getting/scraping the data. Hopefully this will take that time down to 10 minutes or less.


Definitely =) Let us know if we can be of any help!


I really want to try it because I think I need something like this.

However, how do I know it is a legitimate product and not some virus/scam software?

I know YC is a vote of confidence, if it is a YC company, and all the copy sounds legit, and you sound like a pair of honest, hard-working entrepreneurs.

But is there some way to check before I run the software?

Edit: Note that I would not have this concern if it were a web platform like Kimono used to be.


That's a good point. Our app is validated by Apple on macOS, and the Windows version will be soon. Also, we have thousands of users and you can google us: no spam complaints at all =)

Note: FYI, we worked on a SaaS version, but the user experience was not slick enough from our point of view (e.g. rendering websites in an iframe).


Hey Hugues and Max, congratulations.

Can I ask some questions about how this would apply to a project of mine?

I currently create a personal newspaper, printed daily in my office. It’s a reasonably large piece of software that pulls in my calendar, emails, news stories I care about, twitter feeds, weather, stock quotes, etc.

I use Python's newspaper library for parsing RSS feed links to news sites, but it is at times lacking, so Dashblock strikes me as interesting.

What I understand from the video is I could over time build out APIs with dashblock for major news sites; this would help with a few sites that are hard for newspaper.

How would I use Dashblock in production - unattended, via CLI on Linux or Mac? Also, it looks really slow in the video; are these typical speeds? Is this something that you require to be run on your cloud, or could I run it locally?

Thanks, Peter


Thanks!

You can create an API for any website (news websites included) from our Mac/Windows software, and you can access this API from anywhere. It runs on our servers and you can query it from any language you want. Let us know if you need more help: hello@dashblock.com


Have done a lot of scraping in my life, and I'm super excited about what Hugues and Max have built.

I tested a super early version and was surprised how well it worked.


Thanks Radu!


How do you avoid getting banned by the companies you scrape? Most ToSs have a clause like:

> We prohibit crawling, scraping, caching or otherwise accessing any content on the Service via automated means... [etc]


This may now be moot after the LinkedIn vs. hiQ Labs case a couple of days ago, which appears to have blanket-legalised web scraping.


hiQ v. LinkedIn means you probably aren't going to jail for scraping LinkedIn's website. It doesn't mean LinkedIn can't IP ban you.


Agreed, some websites are really resistant to scraping. But think about Google: they scrape the whole web regardless of websites' ToS, so it all boils down to one question: do you create value for the website owner? That's why we want to focus on use cases where we create value for both our users and the website owner. If you think about Yodlee/Plaid in the banking sector, they built partnerships with the banks but continued scraping them because most of them didn't provide an API.


Google respects robots.txt.
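
For reference, checking it before fetching is trivial with Python's standard library (the URLs and user agent below are placeholders):

  from urllib.robotparser import RobotFileParser

  rp = RobotFileParser("https://example.com/robots.txt")
  rp.read()
  # True if the given user agent may fetch the URL under robots.txt rules.
  print(rp.can_fetch("MyBot", "https://example.com/some/page"))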



BYOP (bring your own proxies)


Are you supporting the use case where website providers consider scraping to be hostile? I.e., spinning up new cloud instances until one isn't blacklisted by the site, all behind the scenes, so the consumer of your API doesn't have to worry about such things?


We don't use sophisticated methods for now; we just use a serverless architecture, so the IP changes at each invocation. Feel free to contact us at hello@dashblock.com if it doesn't work for your use case :)


Nice work. Do you have an admin API for creating or managing the APIs you generate? Asking in the case of integrating into another app.

Also, how well does it handle JavaScript apps? Can you specify different engines to parse a site with or specify JS disabled/enabled etc?


We don't provide an API to manage other APIs yet, but this inception use case is interesting. Could you specify what your app would like to do? We render the JavaScript of the page, and for now we don't provide a way to specify whether you want to render the page or not, but we plan on doing so.


Has anyone tried this for careers pages? Would be interested in how this performs on a random sample of ~50 Crunchbase NYC startups' careers pages. I dunno how much time would have to be spent on training data...


We did :) It works on all kinds of pages. You just have to set it up on one page and it will work on all similar pages of the website. Were you thinking of training a model to recognise careers pages across websites?


Yeah, that would be really helpful. I want to monitor the careers pages of all local companies in the Crunchbase NYC geo in order to help candidates search for local companies by keywords (e.g. C#). We have an API already (syncs with Algolia) to receive the jobs, with a unique key on each job's URI; and we wouldn't want to scrape more than once per day.
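
Something roughly like this, for the daily sync (the endpoint and response fields are placeholders):

  import requests

  seen_uris = set()  # in production: keys already in the Algolia index

  def sync_careers_page(api_url, careers_url):
      # Run once per day; dedupe on each job's URI so reruns
      # don't create duplicate records.
      resp = requests.post(api_url, json={"url": careers_url}, timeout=60)
      for job in resp.json().get("items", []):
          if job["uri"] not in seen_uris:
              seen_uris.add(job["uri"])
              print("new job:", job.get("title"))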


Would love to use that if/when you get it working.


It's quite a daunting project, but if you want to join the @codeforcash on Keybase, would definitely welcome support.


Very cool! Is there a way to authenticate in to a site and then keep a session alive to scrape private content? Can it pass cookies or can you manually set headers?


Not yet, but we are working on it =)


Congratulations on launching. This seems like a cool idea. I have some reservations on how widespread the adoption could be. But I love the concept.


I've had some ideas that have relied on scraping data from sources that don't provide an open API (and server-render their sites), and the scraping part has been a bit of a barrier - gotta say I'm amazed how easy it was using your tool. The UX was pretty intuitive also, I like that you've basically embedded a web browser, cos everybody already knows how to use a web browser!


Thanks for your feedback!


This looks awesome, just tried it out on Poshmark (they don't have a feature to alert me when new items in my size are listed). I was a huge fan of Kimono Labs before they stopped operating, and this serves a similar purpose for me.

I might have missed it, but how can I see (or edit) the configuration of my configured API? It looks like all I can do is run the API or delete it.


I was a huge fan of Kimono as well. You can't edit an API for now but we will add this feature in the next release.


I like how simple it is—best of luck! (BTW I think your demo video can be shortened in the middle; after 6 selectors it's clear how that works.)

1. How hard would it be to do inputs? That is, there's a form that I have to fill out manually but I want to do so by API.

2. How well does this work for creating UX tests? The Selenium "no code" tools I've seen are terrible.


Thanks! 1. It changes the user experience, but the underlying model stays the same and will allow our users to record sessions with inputs and clicks in upcoming releases. 2. Indeed, if you can replay a session you can check that the data is what you expected. What solutions have you tried so far?


Love this, I submitted this to API List (https://apilist.fun/api/dashblock). I've been seeing more and more scraping APIs become available; it seems it is becoming a very competitive industry, and this is a unique solution (at least from what I've seen).


Thanks =)


Looks promising, but it's only available for OSX and Windows. Will we be seeing a Linux release soon?


Yes! We have been quite busy since the end of YC, but we plan on releasing it soon =) Please ping us at hello@dashblock.com and we will let you know when the version is live!


I tried a couple of web scraping tools in the past weeks and Dashblock was by far the best. Easy to start and getting the results with an API is exactly what I wanted. (In my case I connected it to Zapier + Airtable).


Thanks for your feedback!


If this works for amazon.com.au, with its 20 different page layouts and page navigation systems (sometimes AJAX, sometimes not) for different product types, I'll be impressed.


Indeed, Amazon has different layouts and can be tricky. For now, our model is resilient to minor changes, but we are working on improving it - amazon.com.au looks like a good test ;-)


It looks like the 10K API call offer is limited to people who sign up for the developer plan ($149/mo), but your post implies it's free. Did I misread the offer in your post?


No, you read it correctly: by creating an account today, you get 10k free API calls =)


Ah, guess I missed the 10k limit by only signing up today? It certainly wasn't clear that it was limited to the day of the post.


Good marketing, I'm creating an account to use later in the year.

What's the minimum macOS version? Why not web, if this is Electron?


Ahah, great! The minimum required version is 10.10 (Yosemite).

If you want to do that on the web, you have to render the page in an iframe to select the content, and most websites don't allow it. In short, the user experience is way better with a desktop app.


OK, downloading now. Can I still benefit from the offer, pretty please :)


I installed it, but it didn't get me the data I needed. I am still gonna use ParseHub, which allows me to easily go up and down the HTML tree to get data hidden under layers of divs.



Congrats on the launch! Seems a little similar to Diffbot.com, but they do not require a client download.


Correct! Also, Diffbot automatically extracts generic entities (e.g. product name and price, comments, etc.), while we let our users choose exactly the data they want on any webpage =)


Is the number of API calls per month? Is the answer the same for a free account?


You get 10k API calls when you sign up and 1k per month after that. Does that answer your question?


Did you say webSITE, not webPAGE?!

Oh wow, instagram.com is on your YouTube demo video thumbnail. Interested to know how it traverses the site; I don't think FB has made the usernames public.


We don't crawl websites yet. However, you can create an API on a given webpage and gather data from similar webpages on the same website by calling the API with the new URL.


How do you differentiate from Octoparse?


There are plenty of differences, among which: 1/ we don't rely on classic selectors (CSS, XPath, etc.), which allows us to be resilient to website updates; 2/ we offer a simple UI that automates data selection and structuring; and 3/ we are available on Windows and macOS =)


> we don't rely on classic selectors (CSS, xPath, etc)

I'm not buying this. Does the AI process HTML as text, lol? Surely it processes it as a tree, right?


We use machine learning to extract the content of the page, which means that when the webpage changes you don't have to update your selection as you would with Octoparse.


Congrats! Look forward to trying it out.


Thanks! Let us know if you have any questions or feedback =)


Really neat!


Thank you!


I wonder how long it'll take these sites to require Captcha for basic access.


Good question. However, that would require websites' users to validate a Captcha every time they navigate, which is not optimal in terms of user experience.


reCaptcha V3 operates behind the scenes though:

https://developers.google.com/recaptcha/docs/v3


Good point! That's why our plan is to focus on use-cases that create value for websites too, in order to partner up with them.



