
Yep, that's why I've built my own: the existing ones either don't give you a list of the links or are super slow. A co-worker made the first one in Python, but it was so slow that it took hours (sometimes 6+) to finish a site, and I thought "you can do that faster".

Some sites are a problem, though: they don't use <a href="asd.com"> tags, and those are what my crawler looks for.
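To make the limitation concrete, here is a minimal sketch of href-based link extraction in Python. It is not the crawler described above; `LinkCollector` and `extract_links` are hypothetical names, and only the standard library is used. Anything that navigates without an <a href="..."> tag (a click handler, a data attribute) is simply never seen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only <a> tags are inspected; every other element is ignored.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all <a href> targets found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

A page that links via something like <button data-url="/hidden"> would return nothing for that target, which is the failure mode mentioned above.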

C#, Elixir and Rust were the other options I thought about, and I want to build the same crawler in these languages (relatively easy to do at ~300 LOC) to compare them for network / server / CLI stuff, but that has to wait till next year.



The biggest headache with the C# implementation was the threading. A lot of the out-of-the-box threading structures (pools, etc.) have limitations you might not think about checking for; e.g. you can't set the number of threads lower than the CPU count on the machine with some of the official .NET ThreadPool helpers. You can try, but it will just silently ignore you.
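For comparison only, here is a rough Python sketch of the kind of control being asked for: a pool capped at an explicit worker count that does not depend on the number of CPUs. This is not the .NET API and says nothing about how the C# crawler was written; `crawl_with_fixed_workers` and the `fetch` callback are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def crawl_with_fixed_workers(urls, fetch, workers=2):
    """Run fetch(url) for each URL using at most `workers` threads.

    Returns the results in input order plus the number of distinct
    threads that actually ran tasks, so the cap can be checked.
    """
    seen_threads = set()
    lock = threading.Lock()

    def task(url):
        # Record which thread handled this URL.
        with lock:
            seen_threads.add(threading.get_ident())
        return fetch(url)

    # max_workers is an upper bound and may be lower than the CPU count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, urls))
    return results, len(seen_threads)
```

Returning the observed thread count is one way to catch a pool that quietly ignores the requested size instead of trusting the setting.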

There is some super useful stuff too, though, that made it easy to write a generic, extensible crawler. My implementation ended up supporting separately compiled plugins you could just dump in a 'plugins\' directory, which responded to events and had full ability to manipulate the output pipeline. Doable in lots of languages, but C# has some formalized helpers around it that make it super easy.



