Hacker News | asciimoo's comments

It's not only open source, it is free software. Take a look at https://github.com/asciimoo/omnom - suggestions/contributions are appreciated =)


That looks like a pretty heavyweight solution, with a lot of complexity, and I don't mean that as a criticism at all. I'm not a Go developer myself. I've always wanted a pure JS solution (as a browser extension, maximum of 200 lines of code) that can capture the content of a web page (doing a virtual scroll to the bottom, to capture the whole page). Since there's no perfect way to translate HTML to PDF, my idea has always been to capture the IMAGE of the page (aside from capturing keywords for DB indexing, which can be done separately just for 'search' support later on).

The fly in the ointment is of course the scrolling too, because some apps have "infinite" scrolling, so in many SPAs there's literally no such thing as "the whole page". Anyway, I haven't tried your app yet, for the not-JS and not-small reasons above, but I'm just sharing my perspective on this topic. Thanks for sharing your project!


I recently released a Chrome extension that converts webpages to PDF. It's free, but you need to register to get a key. Unfortunately, this solution isn't client-side JavaScript; I'm using an API underneath. To be honest, I mainly created it to promote the API, but if it's useful to people, I might develop it further. Perhaps it could be useful to you in some way. I don't know your requirements, but with this extension as a base it might not be difficult to add something that meets your expectations; let me know. However, if you want to export a PDF from Ahrefs, for example, I'm afraid that might not be possible; currently, only basic authentication is supported. Maybe I could add an option, like in my API, to pass JavaScript code, but I doubt that would work either, because Ahrefs probably has some bot protection.

edit: I forgot the link https://chromewebstore.google.com/detail/pdfbolt-web-to-pdf/...


Thanks for sharing that. Looks pretty nice!


You can do detailed search (including title/content) using the bookmarks endpoint. Snapshot search is currently only for finding multiple snapshots of a single URL/domain. That should probably be emphasized, or content-based search should be available there as well. Thanks for the feedback!


Scraping JS-only sites is also possible without a headless browser, but it requires a bit more debugging of the internal structure of these sites. Most JS-only websites have API endpoints with JSON responses, which can make scraping more reliable than parsing custom (and sometimes invalid) HTML. The drawback of headless-browser-based scraping is that it requires a significant amount of CPU time and memory compared to "static" scraping frameworks.


Interesting idea, how do you imagine a channel based API for this?


I would ignore the GP's advice. Channels are prone to big errors -- panics and blocking -- which aren't detectable at compile time. They make sense to use internally but shouldn't be exposed in a public API. As one example, notice how the standard library's net/http package doesn't require you to use channels, but it uses them internally.


Would this work?

  c := colly.NewCollector()

  // this function would create a goroutine and return a channel
  ch := c.HTML("a")
  e := <-ch
  link := e.Attr("href")
  // ...
I'm a bit rusty (ah!) with Go, so bear with me if the above contains errors.


How do you detect that the collector has finished? If the site doesn't contain "a" elements (e.g. because of a network error), this example would block forever.


The producer closes the channel. This is distinguishable from an open empty channel in a select.


Makes sense, thanks =)

In the above example this would require `nil` checking of the retrieved value every time. I'm not sure it would make the API cleaner.


This would work. No callback hell, a pleasure for the eyes!


Actually, I wrote this tool to make searx's engine development easier. It's great to see that so many people find it useful. =)


Thank you for both! :)

