Lexbor – An open source HTML Renderer library (github.com/lexbor)
235 points by bratao 10 months ago | 24 comments



The title made me think this could actually lay out and paint HTML, but I couldn't find anything remotely layout-related in the source tree. Then I found this comment saying even block sizing isn't done: https://github.com/lexbor/lexbor/issues/219#issuecomment-207.... Looks like nice groundwork, though. It's good to see things like parsing and Unicode being part of the same source tree.


We have a decent chunk of layout and paint implemented in an HTML renderer I'm working on (https://github.com/DioxusLabs/blitz), which is targeting the "electron" use case (but with a rust scripting interface rather than a JS one).

The implementation is currently very immature and there are a lot of bugs and missing features (I only got a first cut of inline layout working yesterday, although we already have flexbox and grid implemented), but we're already seeing pretty decent results on a bunch of real-world web pages and hope to be at the point where we can render most of the web (excl. JS) in the next 6-12 months.

There are some screenshots on the PR for the inline layout branch https://github.com/DioxusLabs/blitz/pull/63


Sometimes it's really hard to tell the exact boundary between current day software development and elaborate jokes:

> Blitz builds upon:

> Parley for text/inline-level layout

> Currently, Parley directly depends on four crates: Fontique, Swash, Skrifa, and Peniko.

> Peniko builds on top of kurbo


I interpreted your comment as this being unfinished, but then I heard that PHP has already switched from libxml2 to Lexbor, so I guess it's production-ready.


I guess PHP isn't using it for Rendering (as in the title), just the parsing parts.


We have been using https://github.com/rushter/selectolax as a faster alternative to BeautifulSoup with html5lib because many malformed webpages in the wild don't work with lxml.


The problem is that libxml2's 20-year-old HTML parser never supported HTML5 [1], leading to more and more problems with downstream consumers like lxml, PHP or Nokogiri. PHP recently switched to Lexbor [2] and Nokogiri to libgumbo [3]. That said, I'm hopeful to receive enough funding to implement an HTML5 parser in libxml2.

[1] https://gitlab.gnome.org/GNOME/libxml2/-/issues/211

[2] https://wiki.php.net/rfc/domdocument_html5_parser

[3] https://github.com/sparklemotion/nokogiri/issues/2204


libxml is an XML parser; HTML5 is not XML.


It's a bit late to be saying that to people already using libxml because "It should be able to parse "real world" HTML." https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2...


Speaking of which, I don't understand why not. It seems like it would have been trivial to keep html5 a true xml. I do not understand what the actual technical reason for not doing that was. Naively, it just seems like breaking compatibility out of disdain rather than actually useful progress. Saving a couple of characters every once in a while does not justify the change, so I presume there must be a better reason?


There was XHTML, and HTML5 is a direct result of finding out that it was not the right solution. The main issue being solved there was that browsers do not parse invalid plain HTML consistently, which XHTML solved by requiring invalid XHTML to be rejected outright. This did not work. HTML5 solves this by defining the parsing rules such that, for the parser, there is no concept of a document being invalid: every sequence of bytes deterministically maps to one particular DOM tree. This feature essentially precluded basing HTML5 on either XML (simply impossible) or SGML (that might be possible, but it is in fact a redundant formalism, and describing the syntax in prose makes more sense, as everybody is going to hand-craft the parser anyway).
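
To make the "reject outright" vs. "always produce something" difference concrete, here is a minimal sketch using only the Python standard library. Big caveat: html.parser is a lenient event-based parser, not an implementation of the HTML5 tree-construction algorithm, so it only stands in for the idea and does not build the DOM tree the spec defines. The helper names (EventCollector, xml_accepts, html_events) are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Mis-nested markup: <b> is never closed before </p>.
TAG_SOUP = "<p>one<b>two</p>three"


class EventCollector(HTMLParser):
    """Records the start/end/data events the lenient parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def xml_accepts(markup):
    """XML-style parsing: a well-formedness error rejects the whole document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_events(markup):
    """Lenient parsing: no error is raised, the same input gives the same events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    print("XML parser accepts tag soup:", xml_accepts(TAG_SOUP))
    for event in html_events(TAG_SOUP):
        print(event)
```

A real HTML5 parser (Lexbor, html5lib, a browser) goes further and specifies exactly which tree comes out of this input, which is the part XHTML's draconian error handling never had to define.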


They specified how tag soup gets forced into HTML5.

They could have just as well defined how tag soup gets forced into XHTML.


I felt XHTML had fairly limited adoption on the web, and in many cases web page authors seem to have preferred the »render tag soup« approach, which in most cases did the intended thing, to having to deal with XML namespaces, proper nesting and escaping, etc. That said, most HTML nowadays seems to be authored as if it were XML, with every element painstakingly closed and often even elements that need no closing written as self-closing.


Probably because XML would need to be extended quite a bit to accommodate all of the multimedia stuff, attributes without values or quotes, special names for certain characters, optional or disallowed closing tags and whatnot that's in HTML5.

I think pushing in both the layout design conveniences and the strictness of XML data transfer in the same standard would be quite bulky at best. In practice we'd likely see a lot of nasty security issues in implementations and so on.
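
A small sketch of the syntax differences listed above, again with only the Python standard library and with html.parser as a lenient stand-in rather than a spec-complete HTML5 parser. The sample strings and helper names (AttrCollector, html_events, xml_accepts, XML_REJECTS) are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Unquoted attribute, valueless attribute, void element, named character references.
SNIPPET = "<input type=checkbox checked><br>caf&eacute;&nbsp;bar"

# Each construct on its own, wrapped so that it is the only problem an XML parser sees.
XML_REJECTS = {
    "unquoted attribute": "<r a=b/>",
    "valueless attribute": "<r><input checked/></r>",
    "named character reference": "<r>&nbsp;</r>",
    "unclosed void element": "<r><br></r>",
}


class AttrCollector(HTMLParser):
    """Records start tags with their attributes, plus decoded text."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes &eacute; and &nbsp;
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))


def html_events(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


def xml_accepts(markup):
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for event in html_events(SNIPPET):
        print(event)
    for name, doc in XML_REJECTS.items():
        print(f"{name}: XML accepts = {xml_accepts(doc)}")
```

The lenient parser reports the valueless attribute with a value of None and decodes the named references into text, while the XML parser refuses each construct unless it is rewritten (quoted values, numeric references, explicit closing).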


Ah this answers my question in another comment.

Thanks!


Rarely does one see a C++ quick start guide that's actually this quick: https://lexbor.com/docs/lexbor/#quick_start


> Rarely does one see a C++ quick start guide that's actually this quick: https://lexbor.com/docs/lexbor/#quick_start

Could be because it isn't C++?


Step 1 is a bit of a "draw the rest of the owl" step in that it's either done for you on your specific platform with default settings already or you have to go do all of the actually hard stuff of building the app (and sure enough that's where the typical cmake build step is hidden as well). Step 2 is just "and remember to link your code against the hard part when you compile it, by the way here's a single minimal example".


Step 1 is:

  cmake .
  make
  make install


C, not C++


We open sourced our Ruby bindings and port:

- https://github.com/serpapi/nokolexbor

- https://serpapi.com/blog/nokolexbor-a-performance-focused-ht...

It is super fast compared to Nokogiri with libxml.


Inspiring infrastructure.

The module aspect is super cool. Is there much adoption by other projects using the individual modules? E.g. a web parser using the dom module.


Quite unusual to see Elixir among languages supported via bindings


> Quite unusual to see Elixir among languages supported via bindings

Not due to difficulty, usually. Bindings to non-mainstream languages are unusual to see.

I never heard of a language that couldn't interface to C in one way or another; it's one of the advantages of using C over (say) C++.



