MyHTML – HTML Parser on Pure C with POSIX Threads Support

leeoniya · on March 12, 2016

> By the way, SCRIPT tag tokenization is a hell of an effort. I had to draw a graph [...] Next in turn are the CSS parser and Renderer.

CSS parsing should be ok, but layout computation is hard, especially with all the latest specs. The graph presented in the article will be the size of a postage stamp on an aircraft carrier deck.

Take a look at the Cassowary constraint solver, btw: http://overconstrained.io/

> I'm writing them all by myself, still full of energy.

I wish the author the best of luck.

vjeux · on March 12, 2016

The great thing about writing the layout computation code is that specs are mostly additive. You can start by supporting only a few properties and then as you progress support more and more.

I've used this approach for css-layout[1] and in 2 weeks I got enough implemented to support most use cases we've needed to build mobile apps.

Also, Cassowary won't really help you there. It's going to take a much bigger effort to translate CSS into constraints than just reimplementing the steps themselves.

[1] https://github.com/facebook/css-layout
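To make the "specs are mostly additive" point concrete, here is a toy sketch in C of a layout pass that understands only four properties (width, height, uniform margin, uniform padding) and stacks children vertically. Everything in it is an assumption made up for illustration: the `node` struct, its field names and the `layout` function are not the css-layout API, and margins do not collapse here as they would in real CSS. Supporting a new property would mean adding a field and a branch, while nodes that leave it zeroed keep laying out exactly as before.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type: only the handful of properties supported so far. */
typedef struct node {
    float width, height;      /* explicit size; height 0 means "fit children" */
    float margin, padding;    /* uniform on all four sides */
    struct node *children;    /* array of child nodes, may be NULL */
    size_t child_count;
    float x, y, out_w, out_h; /* computed; x and y are relative to the parent */
} node;

/* Lay out n with its margin box starting at (x, y) inside its parent. */
void layout(node *n, float x, float y)
{
    assert(n != NULL);
    n->x = x + n->margin;
    n->y = y + n->margin;

    /* Children are stacked top to bottom inside the padding box. */
    float cursor = n->padding;
    for (size_t i = 0; i < n->child_count; i++) {
        node *c = &n->children[i];
        layout(c, n->padding, cursor);
        cursor += c->out_h + 2.0f * c->margin;
    }

    n->out_w = n->width;
    n->out_h = n->height > 0.0f ? n->height : cursor + n->padding;
}
```

Calling `layout(&root, 0, 0)` on a root whose `height` is left at zero fills in `out_h` from its children.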

leeoniya · on March 12, 2016

cool project!

flex-box is a mostly self-contained and powerful spec (if i understand correctly).

however, when you don't need to account for floats, relative layout, mixed box-sizing, negative margins, complex overflow conditions and interaction with a ton of older specs, you vastly simplify the problem space for yourself.

it makes perfect sense for a modern system but is quite far from a general impl that can compute layout from html+css unconditionally. the article starts with:

"Once I got an X idea, but its implementation required a calculated DOM with all its styles and goodies"

so the goal is not "the most useful subset". flexbox is currently the least-used (& least supported) layout, so for the author's purposes, which sound like scraping existing markup, it would not help very much.

scrollaway · on March 12, 2016

This is incredibly clean code. Large, long-term single-person hobby projects make for some kickass codebases. Well done.

lxe · on March 12, 2016

Amazing work. How does this compare (in terms of speed mostly) to Google's gumbo parser?

fabrice_d · on March 12, 2016

It looks much faster: http://lexborisov.github.io/myhtml/bm/time_0_100.png

kudosall · on March 12, 2016

could you elaborate please? it looks quite radical.

nly · on March 12, 2016

No idea, but it's ~40,000 LOC compared to Gumbo's 30,000. Hand-writing parsers in C in 2016 is nuts. Gumbo at least has the virtue of being gruelingly tested by a heavy hitter.

marssaxman · on March 12, 2016

It is a really fun kind of nuts. I built one myself a couple of weeks ago, because I wanted to play around with some ideas in the "nanopass framework" for compiler design, but I don't speak Scheme.

legulere · on March 12, 2016

https://github.com/servo/html5ever seems to also have a (not yet complete) C API

mablae · on March 12, 2016

Putting "my" in front of anything should be forbidden.

Just "my" two cents.

vardump · on March 12, 2016

MySQL. Although that "My" refers to the author's daughter's name. You can guess the rest of his kids' names from other products: MaxDB and MariaDB.

agumonkey · on March 12, 2016

Interesting to see. I just took on a handmade XML parser as a personal challenge, in Python though, and I've been hitting nasty performance issues compared to libxml2.

Mikhail_Edoshin · on March 12, 2016

Basic XML parsing should be very simple: it's deterministic with one-symbol lookahead. There are a number of small C parsers out there, and even one written in assembly. If you want to validate it, though, or parse DTDs, that's a different story.
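As a rough illustration of the one-symbol lookahead point above, here is a minimal C sketch of a scanner that decides what kind of markup it is looking at from the single character after `<`. The function name `xml_max_depth` is made up for this example. It only tracks nesting depth, does not check that start and end tag names match, and treats any non-comment `<!...>` construct naively, so DOCTYPE declarations with an internal subset are out of scope, in line with the caveat about DTDs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the maximum element nesting depth, or -1 if
 * start/end tags are unbalanced or the input ends inside markup. */
int xml_max_depth(const char *s)
{
    int depth = 0, max_depth = 0;
    assert(s != NULL);

    while (*s != '\0') {
        if (*s != '<') {              /* character data: nothing to decide */
            s++;
            continue;
        }
        /* s[0] is '<'; s[1] is the one-symbol lookahead. */
        switch (s[1]) {
        case '/': {                   /* end tag: </name> */
            const char *end = strchr(s, '>');
            if (end == NULL || depth == 0) return -1;
            depth--;
            s = end + 1;
            break;
        }
        case '?': {                   /* processing instruction: <? ... ?> */
            const char *end = strstr(s + 2, "?>");
            if (end == NULL) return -1;
            s = end + 2;
            break;
        }
        case '!': {                   /* comment, or something skipped naively */
            int is_comment = strncmp(s, "<!--", 4) == 0;
            const char *end = is_comment ? strstr(s + 4, "-->") : strchr(s, '>');
            if (end == NULL) return -1;
            s = end + (is_comment ? 3 : 1);
            break;
        }
        default: {                    /* start tag or empty-element tag */
            const char *p = s + 1;
            char quote = 0;           /* '>' may appear inside attribute values */
            while (*p != '\0' && (quote != 0 || *p != '>')) {
                if (quote != 0) {
                    if (*p == quote) quote = 0;
                } else if (*p == '"' || *p == '\'') {
                    quote = *p;
                }
                p++;
            }
            if (*p == '\0') return -1;
            if (p[-1] == '/') {       /* <name/> opens and closes at once */
                if (depth + 1 > max_depth) max_depth = depth + 1;
            } else {
                depth++;
                if (depth > max_depth) max_depth = depth;
            }
            s = p + 1;
            break;
        }
        }
    }
    return depth == 0 ? max_depth : -1;
}
```

A real parser would also keep a stack of open element names and decode entities, but the branching stays the same: one character after `<` picks the production.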

gsnedders · on March 12, 2016

DTDs really are the thing that should never have been in XML. Namespaces in XML make them pretty useless, given they only have a concept of QNames and not namespace URL & local name tuples, and they don't compose in any sensible way. They are such a large part of the complexity of XML that it's just sad.

agumonkey · on March 12, 2016

Indeed that's why I was surprised to be 200x slower than libxml2. A lesson in performance.

khedoros · on March 13, 2016

Any idea what the cause of the bottleneck is?

chris_wot · on March 12, 2016

I wonder how easy it would be to adapt the API to a set of C++ classes?