Hacker News | asciimoo's comments

It's not only open source, it is free software. Take a look at https://github.com/asciimoo/omnom - suggestions/contributions are appreciated =)


That looks like a pretty heavyweight solution, with a lot of complexity, and I don't mean that as a criticism at all. I'm not a Go developer myself. I've always wanted a pure JS solution (as a browser extension, maximum of 200 lines of code) that can capture the content of a web page (doing a virtual scroll to the bottom, to capture the whole page). Since there's no perfect way to translate HTML to PDF, my idea has always been to capture the IMAGE of the page (aside from capturing keywords for DB indexing, which can be done separately just for 'search' support later on).

The fly in the ointment is of course the scrolling too, because some apps have "infinite" scrolling, so in many SPAs there's literally no such thing as "the whole page". Anyway, I haven't tried your app yet, for the not-JS and not-small reasons above, but I'm just sharing my perspective on this topic. Thanks for sharing your project!


I recently released a Chrome extension that converts webpages to PDF. It's free, but you need to register to get a key. Unfortunately, this solution isn't client-side JavaScript; I'm using an API underneath. To be honest, I mainly created it to promote the API, but if it's useful to people, I might develop it further. Perhaps it could be useful to you in some way. I don't know your requirements, but with this extension as a base it might not be difficult to add something that meets your expectations; let me know. However, if you want to export a PDF from Ahrefs, for example, I'm afraid that might not be possible; currently, only basic authentication is supported. Maybe I could add an option, like in my API, to pass JavaScript code, but I doubt that would work either, because Ahrefs probably has some bot protection.

edit: I forgot the link https://chromewebstore.google.com/detail/pdfbolt-web-to-pdf/...


Thanks for sharing that. Looks pretty nice!


You can do detailed search (including title/content) using the bookmarks endpoint. Snapshot search is currently only for finding multiple snapshots of a single URL/domain. That should probably be emphasized, or content-based search should be available there as well. Thanks for the feedback!


Scraping JS-only sites is also possible without a headless browser, but it requires a bit more debugging of the internal structure of these sites. Most JS-only websites have API endpoints with JSON responses, which can make scraping more reliable than parsing custom (and sometimes invalid) HTML. The drawback of headless-browser-based scraping is that it requires a significant amount of CPU time and memory compared to "static" scraping frameworks.


Interesting idea, how do you imagine a channel based API for this?


I would ignore the GP's advice. Channels are prone to big errors -- panics and blocking -- which aren't detectable at compile time. They make sense to use internally but shouldn't be exposed in a public API. As one example, notice how the standard library's net/http package doesn't require you to use channels, but it uses them internally.


Would this work?

  c := colly.NewCollector()

  // this function would create a goroutine and return a channel
  ch := c.HTML("a")
  e := <-ch
  link := e.Attr("href")
  // ...
I'm a bit rusty (ah!) with Go, so bear with me if the above contains errors.


How do you detect that the collector has finished? If the site doesn't contain "a" elements (e.g. because of a network error), this example would block forever.


The producer closes the channel. This is distinguishable from an open empty channel in a select.


Makes sense, thanks =)

In the above example this would require `nil` checking of the retrieved value every time. I'm not sure it would make the API cleaner.


This would work. No callback hell, a pleasure for the eyes!


Actually, I wrote this tool to make searx's engine development easier. It's great to see that so many people find it useful. =)


Thank you for both! :)

