Lexbor – An open source HTML Renderer library

chearon · 2024-06-12T17:49:25 1718214565

The title made me think this could actually lay out and paint HTML, but I couldn't find anything remotely layout-related in the source tree. Then I found this comment saying even block sizing isn't done: https://github.com/lexbor/lexbor/issues/219#issuecomment-207.... It looks like solid groundwork, though. It's nice to see things like parsing and Unicode being part of the same source tree.

nicoburns · 2024-06-12T22:29:53 1718231393

We have a decent chunk of layout and paint implemented in an HTML renderer I'm working on (https://github.com/DioxusLabs/blitz), which is targeting the "electron" use case (but with a Rust scripting interface rather than a JS one).

The implementation is currently very immature, with a lot of bugs and missing features. I only got a first cut of inline layout working yesterday, although flexbox and grid are already implemented. Even so, we're already seeing pretty decent results on a bunch of real-world web pages and hope to be at the point where we can render most of the web (excl. JS) in the next 6 - 12 months.

There are some screenshots on the PR for the inline layout branch https://github.com/DioxusLabs/blitz/pull/63

yencabulator · 2024-06-14T20:12:54 1718395974

Sometimes it's really hard to tell the exact boundary between current-day software development and elaborate jokes:

> Blitz builds upon:

> Parley for text/inline-level layout

> Currently, Parley directly depends on four crates: Fontique, Swash, Skrifa, and Peniko.

> Peniko builds on top of kurbo

Kiro · 2024-06-13T06:38:04 1718260684

I interpreted your comment as saying this is unfinished, but then I heard that PHP has already switched from libxml2 to Lexbor, so I guess it's production-ready.

lmz · 2024-06-13T15:41:21 1718293281

I guess PHP isn't using it for Rendering (as in the title), just the parsing parts.

bratao · 2024-06-12T16:01:21 1718208081

We have been using https://github.com/rushter/selectolax as a faster alternative to BeautifulSoup with html5lib because many malformed webpages in the wild don't work with lxml.

nwellnhof · 2024-06-12T18:39:28 1718217568

The problem is that libxml2's 20-year-old HTML parser never supported HTML5 [1], leading to more and more problems with downstream consumers like lxml, PHP or Nokogiri. PHP recently switched to Lexbor [2] and Nokogiri to libgumbo [3]. That said, I'm hopeful of receiving enough funding to implement an HTML5 parser in libxml2.

[1] https://gitlab.gnome.org/GNOME/libxml2/-/issues/211

[2] https://wiki.php.net/rfc/domdocument_html5_parser

[3] https://github.com/sparklemotion/nokogiri/issues/2204

postepowanieadm · 2024-06-12T21:22:49 1718227369

libxml is an XML parser; HTML5 is not XML.

tedunangst · 2024-06-12T23:06:05 1718233565

It's a bit late to be saying that to people already using libxml because "It should be able to parse "real world" HTML." https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2...

sgc · 2024-06-13T13:18:48 1718284728

Speaking of which, I don't understand why not. It seems like it would have been trivial to keep HTML5 true XML. I do not understand what the actual technical reason for not doing that was. Naively, it just seems like breaking compatibility out of disdain rather than actually useful progress. Saving a couple of characters every once in a while does not justify the change, so I presume there must be a better reason?

dfox · 2024-06-13T16:27:43 1718296063

There was XHTML, and HTML5 is a direct result of finding out that it was not the right solution. The main issue being solved there was that browsers do not parse invalid plain HTML consistently, which XHTML addressed by requiring invalid XHTML to be rejected outright. This did not work. HTML5 solves the problem by defining the parsing rules such that, as far as the parser is concerned, there is no concept of a document being invalid: every sequence of bytes deterministically maps to one particular DOM tree. This feature essentially precluded basing HTML5 on either XML (simply impossible) or SGML (that might be possible, but it is in fact a redundant formalism, and describing the syntax in prose makes more sense, as everybody is going to hand-craft the parser anyway).
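
To make the contrast concrete, here is a minimal sketch using only Python's standard library. It is an illustration under assumptions, not a spec-compliant demo: `html.parser` is a tolerant tokenizer rather than an implementation of the HTML5 tree-construction algorithm, so it only stands in for the "never reject the input" behavior. The names `TAG_SOUP`, `EventCollector`, `xml_accepts` and `html_events` are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Mis-nested markup: <b> is never closed before </p>, and text trails the root.
TAG_SOUP = "<p>unclosed <b>bold</p> trailing text"


class EventCollector(HTMLParser):
    """Records every start tag, end tag, and text run the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def xml_accepts(markup):
    """Return True if the XML parser accepts the markup, False if it rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_events(markup):
    """Feed markup to the tolerant HTML parser and return the events it produced."""
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


if __name__ == "__main__":
    print("XML parser accepts it:", xml_accepts(TAG_SOUP))
    for event in html_events(TAG_SOUP):
        print(event)
```

The XML parser refuses the input outright, while the HTML parser reports a sequence of events for it and reports the same sequence every time it sees the same input. A real HTML5 parser such as Lexbor goes further and builds a fully specified DOM tree from those tokens.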

yencabulator · 2024-06-14T20:15:59 1718396159

They specified how tag soup gets forced into HTML5.

They could have just as well defined how tag soup gets forced into XHTML.

ygra · 2024-06-13T15:13:21 1718291601

I felt XHTML had fairly limited adoption on the web, and in many cases web page authors seem to have preferred the »render tag soup« approach, which in most cases did the intended thing, rather than having to deal with XML namespaces, proper nesting and escaping, etc. That is even though HTML nowadays mostly seems to be authored as if it were XML, with every element painstakingly closed and often even with elements that need no closing written as self-closing.

cess11 · 2024-06-13T20:27:37 1718310457

Probably because XML would need to be extended quite a bit to accommodate all of the multimedia stuff, attributes without values or quotes, special names for certain characters, optional or disallowed closing tags and whatnot that's in HTML5.

I think pushing in both the layout design conveniences and the strictness of XML data transfer in the same standard would be quite bulky at best. In practice we'd likely see a lot of nasty security issues in implementations and so on.
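
A few of the HTML-only constructs mentioned above can be checked against a strict XML parser with a short sketch like the following. It uses only Python's standard library and rests on the same caveat as any stdlib demo: `html.parser` is a lenient tokenizer, not a full HTML5 parser. `SNIPPETS`, `AttrCollector`, `is_well_formed_xml` and `html_view` are hypothetical names chosen for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Constructs that are ordinary HTML but not well-formed XML.
SNIPPETS = {
    "attribute without a value": "<input disabled>",
    "unquoted attribute value": "<a href=page.html>link</a>",
    "named character reference": "<p>a&nbsp;b</p>",
    "omitted closing tag": "<ul><li>one<li>two</ul>",
}


class AttrCollector(HTMLParser):
    """Collects attribute pairs and text content reported by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.attrs = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

    def handle_data(self, data):
        self.text.append(data)


def is_well_formed_xml(markup):
    """Return True only if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_view(markup):
    """Return (attribute pairs, concatenated text) as the HTML parser sees them."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs, "".join(collector.text)


if __name__ == "__main__":
    for label, markup in SNIPPETS.items():
        print(label, "| XML ok:", is_well_formed_xml(markup), "| HTML:", html_view(markup))
```

Each snippet is rejected by the XML parser, while the HTML parser reports a valueless attribute as `None`, reads the unquoted value as-is, and decodes `&nbsp;` to a non-breaking space.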

thomasfromcdnjs · 2024-06-12T17:11:14 1718212274

Ah this answers my question in another comment.

Thanks!

hliyan · 2024-06-12T16:23:46 1718209426

Rarely does one see a C++ quick start guide that's actually this quick: https://lexbor.com/docs/lexbor/#quick_start

lelanthran · 2024-06-12T17:36:43 1718213803

> Rarely does one see a C++ quick start guide that's actually this quick: https://lexbor.com/docs/lexbor/#quick_start

Could be because it isn't C++?

zamadatix · 2024-06-12T20:55:25 1718225725

Step 1 is a bit of a "draw the rest of the owl" step in that it's either done for you on your specific platform with default settings already or you have to go do all of the actually hard stuff of building the app (and sure enough that's where the typical cmake build step is hidden as well). Step 2 is just "and remember to link your code against the hard part when you compile it, by the way here's a single minimal example".

Maxatar · 2024-06-13T16:52:30 1718297550

Step 1 is:

  cmake .
  make
  make install

boxed · 2024-06-12T16:35:09 1718210109

C, not C++

hartator · 2024-06-13T10:46:52 1718275612

We open sourced our Ruby bindings and port:

- https://github.com/serpapi/nokolexbor

- https://serpapi.com/blog/nokolexbor-a-performance-focused-ht...

It is super fast compared to Nokogiri with libxml.

thomasfromcdnjs · 2024-06-12T17:04:21 1718211861

Inspiring infrastructure.

The modular design is super cool. Is there much adoption by other projects that use the individual modules, e.g. a web parser using only the DOM module?

troupo · 2024-06-12T20:31:48 1718224308

Quite unusual to see Elixir among languages supported via bindings

lelanthran · 2024-06-12T20:44:23 1718225063

> Quite unusual to see Elixir among languages supported via bindings

Not due to difficulty, usually. Bindings to non-mainstream languages are unusual to see.

I never heard of a language that couldn't interface to C in one way or another; it's one of the advantages of using C over (say) C++.
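
As a small illustration of how little is needed to call into a C library, here is a sketch using Python's standard `ctypes` module to call `strlen` from the system C library. It assumes a POSIX-like system (Linux or macOS); it deliberately does not touch Lexbor's own API, and `c_strlen` is a made-up helper name.

```python
import ctypes
import ctypes.util

# Locate the C library. find_library may return None on some systems;
# on POSIX, CDLL(None) then falls back to the symbols already loaded
# into the current process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C library's strlen on a NUL-terminated byte string."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"lexbor"))
```

Real bindings wrap many more functions and have to manage memory ownership across the boundary, but the mechanism is the same: load the shared library, declare the C signatures, and call.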