Oh Yes You Can Use Regexes to Parse HTML (stackoverflow.com)
239 points by draegtun on July 8, 2011 | 77 comments



I classify this as a parser using regular expressions rather than parsing using regular expressions. That is, his regular expressions don't parse the document. He wrote a parser, and uses regular expressions in that parser.


This is correct. As soon as I saw his lex_html function, I could see that he was using regexes to tokenize and then basically dropping into different contexts while consuming the document, thereby tracking multiple levels of state throughout. That's what a full-blown parser does, and it goes beyond the scope of regular languages.

By "parsing with regular expressions" most people mean applying one or two regexes to the entire document and using, for example, the group capture facility to extract information. That is and will always be a bad idea for HTML, because HTML is not a regular language.

Applying regular expressions to tokenize a string for parsing is actually a fairly standard pattern.
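A hedged sketch of that standard pattern (again my own illustration, not tchrist's lex_html): the regex only chunks the input into tokens, and ordinary flow control would do the actual parsing.

    use strict;
    use warnings;

    my $html = '<p>Hello <b>world</b></p>';
    my @tokens;

    # \G with /gc consumes the string one token at a time: either a tag
    # or a run of text. A parser would then walk this token stream.
    while ($html =~ /\G (?: (<[^>]*>) | ([^<]+) )/gcx) {
        push @tokens, defined $1 ? [ tag => $1 ] : [ text => $2 ];
    }

    printf "%-5s %s\n", @$_ for @tokens;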


Exactly. Gnu's Flex lets the user define regexes for each state of the parser... http://flex.sourceforge.net/manual/Patterns.html#Patterns


Flex is used by the GNU System, but it is not GNU. It is incorrect to call it "Gnu's Flex". http://www.gnu.org/software/flex/


You're quite right, thanks


Right. Pyparsing is a nice parsing library, and it uses regular expressions. I'd expect most parsing libraries let you define things with regexes. But they can't do recursion, right?


What do you mean by "can't do recursion"? A regular grammar can definitely be recursive.


Negative, a regular grammar which describes a regular language cannot be recursive. For recursion you need to be context-free at least.

Of course, most modern "regex" engines provide features which go beyond the strictly regular languages.


No, kd0amg is right. Here's a simple example of a recursive regular grammar:

    S → aS | ε
This grammar is right linear (see http://en.wikipedia.org/wiki/Regular_grammar).
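(To see why that recursion stays regular: S → aS | ε generates exactly a*, which a plain, extension-free regex already matches.)

    # The language of S → aS | ε is a*; no stack required.
    print "in the language\n" if 'aaaa' =~ /\Aa*\z/;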


You can in fact do recursion with regular grammars, as you've just demonstrated, but if you mix right regular grammars with left regular grammars (as you would need to do in order to, for example, match parentheses) your grammar is no longer regular (not equivalent to regular expressions). I suspect that is what wisty meant.

Honestly though, this entire conversation is fuzzed by the original author's blurred distinction between "parsing with regular expressions" and "writing a parser that uses regular expressions to tokenize". I'm not really sure where just about every other person in this discussion is coming from as a result.


I didn't know that a grammar featuring recursion can still be called a "regular" grammar if the language it describes is regular.

However, no regular grammar (not even a recursive regular grammar) can describe a recursive language.


That's basically the standard definition of regular grammar: a grammar that describes a regular language.

All this subthread is fairly confusing because people are using the term "recursive" in a non-standard way. Recursive languages are the languages decidable by a Turing machine. They're a strict superset of regular languages, i.e., not all regular languages can be described by regular grammars, but all regular grammars describe recursive languages.

Now, some regular grammars are left-recursive or right-recursive, which means sometimes the same symbol appears on both sides of a rule. It doesn't mean they have the power of full recursion that we find in full-blown programming languages, since they don't have the equivalent of a stack.


Typo: I meant not all recursive languages can be described by regular grammars.


Errr... Did he just write an HTML parser (and then use it) to prove to everyone that you can use regexes to solve the use-an-html-parser-instead-of-regexes problem?!

It's as if I suggested someone use Python instead of ASM to solve a simple problem, and then someone tried to prove me wrong by writing a Python interpreter in ASM and then USING it to solve the same problem!

Also, that being said, I feel like the post is more of a brag: "I'm the creator of a popular Perl book, and Perl rocks your language, here's why, blah blah blah".


Yeah, this is pretty much what seems to be going on.

I don't think anyone has ever said that regexes can't be used in the parsing of html, just that they can't be used for the parsing of html. It's like someone saying "You can't use bricks to keep warm!" and countering that by saying "Observe! I have built a house using, among other things, bricks, and it shelters me from the weather and keeps me warm!" Deliberately missing the point in order to show off your housebuilding skills.

Still, it was an informative and well-written article. Article, rather than an answer to a question. Why do people write these thousands of words on Stack Overflow when they could publish them on their blogs? Are SO points from confused corporate employees really worth more than the adulation of the blogosphere? Actually, who cares. I almost lost the will to live just writing that sentence...


I'd rather see it on Stack Overflow. It has much more visibility there than if each person wrote it on their own blog (assuming they have one), which may not be seen by more than a handful of friends.


Look at all the people on HN saying they want to hire based on SO rep and a Github account.


Actually I'm more interested in the answer to this question:

Was he able to write an arbitrary HTML parser using regexes because HTML is a regular language, or because Perl's string-matching syntax, which can match some languages that are not regular, is called "regular expressions" anyway?

I am pretty sure that, for example, an arbitrary number of nested, balanced "<div></div>" pairs is not a regular language; however, you can use modern PCRE to match it anyway.
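For the curious, a sketch of that (mine, not from the answer), using Perl's named-group recursion from 5.10+, which is exactly one of those non-regular extensions:

    use strict;
    use warnings;

    # (?&div) re-enters the named group, so the pattern matches
    # arbitrarily deep balanced nesting, which a true regular
    # expression (having no stack) cannot do.
    my $divs = qr{
        (?<div>
            <div>
            (?: [^<]++ | (?&div) )*
            </div>
        )
    }x;

    for my $s ('<div></div>', '<div><div>x</div></div>', '<div></span>') {
        printf "%-26s %s\n", $s, $s =~ /\A$divs\z/ ? 'balanced' : 'no match';
    }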


If I'm understanding it correctly (which I'm not convinced of; this is much more advanced Perl than I'm used to seeing), neither is quite accurate.

The first option is easy to rule out: HTML is not a regular language. Evidence: S-Expressions are not a regular language. XML is isomorphic to S-expressions of the form (<tagname> ((<attr1> <value1>) (<attr2> <value2>) ...) <contents>). HTML is isomorphic to XHTML, which is XML.[1]

I can't (read: don't feel like trying to) rigorously disprove that Perl's "regular expression" facility could be an HTML parser all by itself, but it looks like what's going on here is closer to a regex-based tokenizer (certainly possible, and in fact this is a very common pattern). Then, regular Perl flow control constructs are used to interpret the tokens as tags and such.

[1] Edit: Looking at that, it's not as solid as it was in my head when I was writing it. HTML is still not a regular language, because regular languages cannot key on nesting with uncapped depth.


It goes deeper than that. He has used back references, so formally they aren't real regular expressions anymore. (Back references effectively rewrite the grammar on the fly.)

Perl regular expressions are pretty powerful. I wrote a small Perl program (just for fun, after reading the post) which can parse S-expressions. I wrote it quickly, but it shows it is possible:

http://pastebin.com/ndrHTZJB

I know it has bugs.


It's interesting, I'll take a look at it more seriously. Thanks for sharing.


And for the next ten years, flawed attempts at imitating this will show up in production code all around the world...

I'm not saying he shouldn't have (I've certainly learned something I didn't know), but let's face it, posting this on StackOverflow is like handing a loaded gun to a bunch of children and telling them not to pull the trigger.


  but let's face it, posting this on StackOverflow is like 
  handing a loaded gun to a bunch of children and telling 
  them not to pull the trigger.
No, let's not face it. How is sharing well-thought-out, well-designed solutions to problems ever a bad thing? Sure, there are a handful of junior coders who now feel overly confident and will mess this up, but they'll learn.

And then there is an equal pool of good developers that are now even better thanks to the info-share.

I acknowledge I'm being pedantic, but the "let's all nod and pretend we are way smarter than THAT group over there" mindset gives me brain-diarrhea... you see it in every community (Reddit, Slashdot, HN, Digg, etc.) and I've never seen it help anybody accomplish anything, anywhere... ever.

</takes off his internet-police hat>


No, using regexes for this is terrible. I could write you a program that will parse html employing GOTOs, but it's been recognized as a bad programming practice. I'm not going to do your research for you, but this is not opinion, it's the result of people studying large code bases and defect rates within them. Same goes for overly long functions, bad variable naming, poor commenting, and so on.


A correctly written regex is not a bad programming practice. The insinuation that most programmers can't write or debug a regex correctly is disingenuous.


First, correctly written code can be bad practice. Regexes are a powerful tool, and have appropriate uses. I disagree this is one of those cases, but at -4 on my previous comment, I guess most don't agree. Second, I would bet the majority of programmers are mediocre with regular expressions at best, and even worse at reading regexes written by other programmers, contributing to code maintenance issues.

Finally, I may have been "incorrect" but "disingenuous" is an insult. I'll be charitable and assume you're using the word wrong.


Perhaps it's a difference of experience, but I really haven't met a professional programmer who doesn't understand regexes but still insists on using them for non-trivial tasks. Thus my usage of "disingenuous", because I have trouble believing that such people exist, and I felt you were trying to make a point insincerely, perhaps out of confirmation bias. I apologize if it came across as an insult -- it wasn't intended as one.


No harm, no foul. They really do exist, unfortunately.


Using "regexex" for doing what is terrible? Using regexes for a lexer is quite common, and if you read his code, that's what he is doing, and then feeding the tokens to the parser.


> I'm not saying he shouldn't have (I've certainly learned something I didn't know), but let's face it, posting this on StackOverflow is like handing a loaded gun to a bunch of children and telling them not to pull the trigger.

True, but that's the case for posting anything non-trivial where Google can find it. And isomorphic to using non-trivial programming languages or patterns in industrial programs.

At a certain point, you have to let go of the idea that if you hide anything requiring thought and judgment from people, they won't hurt themselves.

Some people are going to foul things up no matter what you do or don't tell them, and some will excel even if you make it hard for them to find knowledge. The interesting group are those who are open to being influenced by what you have to share.

I suggest that even if only a small proportion of SO readers walk away enlightened, this article is a win. The folks who use it to justify their own failed attempts to manage arbitrary HTML with regexen would have pooched the task in some other way even if they stuck to using a parser library.


I disagree. He went through a lot of effort to lay down the foundations of parsing html with regexps. If anything, he's demonstrated exactly how difficult doing this is and how finicky the solution is. In short, he's stripped the sexy out of using regexps to parse html: we now know it can be done, but it's a pain in the ass, and specially written parsers are faster and easier to use. So, there should be no illusion that regexps are a shortcut in this situation.

At least that's my takeaway.


  > You can write a novel like tchrist did
  > [...] – meder Nov 20 '10 at 19:36

  > That was kinda my point, actually. I
  > wanted to show how hard it is. – tchrist
  > Nov 20 '10 at 19:38
Seems that you are correctly channeling the author.


He keeps saying that, but he also keeps railing on anyone who suggests that it has anything to do with HTML not being a regular language.


How many problems would a novice have if he decided to parse HTML with regexes in Perl?


Which one is this: a serious expression of concern or some kind of a lightbulb joke?


Pretty sure it's a reference to this famous[1] joke:

>> Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

[1] http://regex.info/blog/2006-09-15/247


The correct response to that saying is to point out that regular expressions are significantly simpler than general-purpose, Turing-complete programming languages.


You can use Perl's string-matching facilities (which really are not regular expressions at all[0]) to parse HTML.

[0] In fact it was a rather neat idea to rename those "patterns" (or something along those lines) in Perl 6. Unfortunately this name change has been rolled back. Shame.


There was some form of consensus in a recent discussion (not particularly pertaining to Perl) that the term "regex" no longer equates to "regular expression".


Meanwhile, the highest-rated answer in StackOverflow history says otherwise:

http://stackoverflow.com/questions/1732348/regex-match-open-...


The OP doesn't actually do regex parsing of HTML, so all is well. It does regular ol' parsing, and uses regexes to chunk HTML into consumables.


One of the best and most academically proficient answers I've seen on SO. And, if I understand correctly, it turns on its head the old refrain "You can't use regexes to parse HTML", of which I've always been a staunch proponent.

Now I understand that the reason _why_ you can't use regular expressions to parse HTML is that HTML is usually not regular. Is this true? Does this solution in Perl work because of the extended capabilities of Perl regexes?


> Now I understand that the reason _why_ you can't use regular expressions to parse HTML is that HTML is usually not regular. Is this true?

From the comments:

Q. "The answer is you can't. HTML is not regular so be definition it can't be described by a regular expression."

A. "Your use of REGULAR in regular expressions has been irrelevant and wrong since Ken Thompson first put backrefs into regexes around 40 years ago. /(.)\1/ parses non-REGULAR languages perfectly well. Please stop repeating this nonsense. – tchrist"


Backreferences in some sense don't work here because real HTML contains misnested tags and other bugs; there are so many of them, in fact, that even the HTML5 spec explicitly specifies a correction algorithm.


Why is the language matched by /(.)\1/ non-regular? Isn't it finite (aa, bb, cc, dd, ..., zz if your alphabet is the lowercase letters a-z)? And all finite languages are trivially regular.

Maybe he meant /(.*)\1/?

Note: I know that backreferences make Perl regexes capable of matching non-regular languages, but this doesn't seem like an example of that.
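(A quick check of that distinction: anchored /(.+)\1/ with its backreference matches the classic non-regular copy language ww, while /(.)\1/ really is just the finite set of doubled characters.)

    use strict;
    use warnings;

    # The backreference lets the pattern compare the two halves of the
    # string, which no finite automaton can do.
    for my $s ('abab', 'abcabc', 'abca') {
        printf "%-7s %s\n", $s, $s =~ /\A(.+)\1\z/ ? 'in ww' : 'not in ww';
    }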


> Does this solution in perl work because of the extended capabilities of perl regexes?

Yes. It works because Perl's "regexes" are not regular expressions (nb: almost no "regex engine" these days is limited to regular expressions). Actual regular expressions can express type-3 Chomsky grammars, whereas HTML's grammar (like that of most programming languages) is of type 2.


HTML isn't even actually context-free, though you can possibly parse it as context-free (type-2) with some kludges. The kludges are the same as the kludges for parsing it in a regular language, though! (I.e. "all" you need to do is add backreferences.)

Recognizing paired start/end tags is equivalent to recognizing palindromes, and the palindrome language is only context-free if there's a finite, fixed alphabet. But the set of HTML tags a parser needs to match isn't finite. While there are a finite number of defined tags in any particular version of the HTML specification, a parser is still supposed to be able to match start/end tags it doesn't recognize, to distinguish the case of malformed syntax from the case of correctly formed syntax that uses unrecognized tags.

An easy way to see it is to try to write out CFG productions for HTML. The intuitive way requires something like backreferences, which would make the language no longer context-free:

   html = '<' + tag + '>' + html + '</' + $0 + '>'
But in a context-free language you don't have that, so you instead have to do something like:

   html = '<head>' + html + '</head>'
        | '<body>' + html + '</body>'
        | '<b>' + html + '</b>'
        | ...
which of course only works if the number of tags is finite so that you can exhaustively enumerate them.


> While there are a finite number of defined tags in any particular version of the HTML specification, a parser is still supposed to be able to match start/end tags it doesn't recognize, to distinguish the case of malformed syntax from the case of correctly formed syntax that uses unrecognized tags.

I don't think this is necessary in HTML, which expects every unknown tag to be ignored regardless of nesting. The exception is when you plan to render the content using CSS or support scripting, which may expect the unknown element to be in the DOM or to be styleable; but even there, historically, different browsers tended to do different things with unknown tags.


I don't wish to be rude, but how can you call yourself a staunch proponent of something you admit two sentences later not to really understand? That seems like an odd position to me.

More on topic, XML is not even a type 2 language since you have to check for matching of tags, and the tags can come from an arbitrary set, so you can't write a context-free grammar that recognises well-formed XML.

You'd want rules like:

    XML -> TAG
    TAG -> "<" ([a-z]+) ">" (TAG|Text)* "<" "/" \1 ">"
but backreferences are not possible. However, sane parsers allow a lexing phase and semantic rules or post-processing, which is why I think any academic assertions about which level of grammar you need for a certain file format are essentially a waste of time.
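To make that concrete, here is a toy sketch of the lex-then-post-process split (my own illustration with a simplified tag syntax, not a real XML checker): a regex lexes the tags, and a plain stack enforces the name matching that the grammar alone cannot express.

    use strict;
    use warnings;

    sub well_formed {
        my ($doc) = @_;
        my @stack;
        # The regex does the lexing; the stack does the "semantic rule".
        while ($doc =~ m{<(/?)([a-z]+)>}g) {
            my ($close, $name) = ($1, $2);
            if ($close) {
                return 0 unless @stack && pop(@stack) eq $name;
            } else {
                push @stack, $name;
            }
        }
        return !@stack;
    }

    print well_formed('<a><b></b></a>') ? "ok\n" : "bad\n";   # ok
    print well_formed('<a><b></a></b>') ? "ok\n" : "bad\n";   # bad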


Yes, HTML is not a regular language, and so cannot be matched with true regular expressions. As mentioned elsewhere, most "regex" libraries in existence today are not limited to simple regular expressions -- they simply wouldn't be powerful enough for many common tasks otherwise.

HTML is more accurately a context-free language (CFL). A regular expression does not allow you to do any sort of counting or stack-based matching within a match, which is required to do things like "I just saw <div>...<a>...<img>, so I had better see </img>...</a>...</div> later on."

The reason the linked solution works is because of the extended capabilities of PCREs, such as backreferences and recursive subpatterns.


> Now I understand that the reason _why_ you can't use regular expressions to parse HTML is that HTML is usually not regular. Is this true?

I believe the reason is that HTML is a Type 2 grammar in the Chomsky hierarchy (requiring a push-down automaton), whereas a regexp is a Type 3 grammar (a finite state automaton). To put it simply, HTML needs a frame/state stack, and a regexp isn't advanced enough for that (instead it reads "left to right": no subroutines or recursion).

http://en.wikipedia.org/wiki/Chomsky_hierarchy

I suspect you might be right about him "cheating" using perl, but not knowing a lick of perl, I can't say for sure one way or the other.

Edit: Apparently, back-references mean regexps aren't regular; that actually makes more sense now, as they've never quite meshed with my understanding of regular languages.


As the comment by tchrist on this [0] downvoted answer says, as soon as back references were introduced (40 years ago, apparently), regexes were not regular anymore.

[0] http://stackoverflow.com/questions/4231382/regular-expressio...


Some discussion, plus links, from an earlier HN posting about a related SO post (which argued, roughly, that parsing HTML with regexes opened the door to Hell): http://news.ycombinator.com/item?id=1487975


A slightly more accurate summary of Tom Christiansen's excellent answer there would be: "Oh yes you can use regexes to parse HTML, but you usually shouldn't, unless what you want to do is really, really simple."

Actual quotations: "Even if my program is taken as illustrative of why you should not use regexes for parsing general HTML -- which is ok, because I kinda meant for it to be that"; "That was kinda my point, actually. I wanted to show how hard it is." (the latter in response to someone else who said "You can write a novel, like tchrist did, or you can use a DOM library and write one line of XPath").


That said, his HTML chunker is, dare I say, gorgeous.

If that is an example of what should not be done, I wish there were more like it around.

Besides, lexing HTML in 234 lines grand total, most of them whitespace (169 SLOC according to sloccount), is impressive. Writing even a basic non-regex-based parser is bound to take quite some space.

To me the real conclusion is not: "don't try to parse random HTML using regexes" but "don't try to write your own wide-purpose HTML parser".

Or, as Tom put it in his SO answer:

> The correct and honest answer is that they shouldn’t attempt [trying to parse arbitrary HTML] because it is too much of a bother to figure out from scratch


"Besides, lexing HTML in 234 lines grand total, most of them being whitespace, (169 SLOCs according to sloccount) is impressive."

I mean no disrespect at all to tchrist, but it isn't impressive at all; not because tchrist is wrong, but because lexing isn't hard. If you understand the problem, you can almost literally read the lexer right off the standard; indeed, that's part of the purpose of the standard. Look at it (taking HTML4 here as it's easier to see): http://www.w3.org/TR/html401/types.html#h-6.2 You can literally read off the lexer expression for ID and NAME right from 'must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").'

Generally, if you're having a hard time putting a lexer together for some language you're creating (bearing in mind this includes the broader definition of "language" beyond just "programming language", which includes things like JSON or text formats you may create ad hoc), that's a sign that you've got an overcomplicated language on your hands. (Hi, C++! I see you over there!)


You still can't use a single regular expression to parse HTML. Of course you can use a set of regular expressions and some other logic to parse HTML. There shouldn't be anything surprising about this post.


I think the point here is don't use regexps for this. Some time ago I got into a spat with Eric Raymond about this very subject. Bottom line is that there are nice libraries for HTML parsing out there. Use one: http://blog.jgc.org/2009/11/parsing-html-in-python-with.html


And related: this previous HN discussion on a different SO question, "RegEx match open tags except XHTML self-contained tags" (http://news.ycombinator.com/item?id=1487695).


In the spirit of Tom Christiansen's lexer solution, here's a link to Robert Cameron's seemingly forgotten 1998 article, "REX: XML Shallow Parsing with Regular Expressions".

  http://www.cs.sfu.ca/~cameron/REX.html
"""

Abstract

The syntax of XML is simple enough that it is possible to parse an XML document into a list of its markup and text items using a single regular expression. Such a shallow parse of an XML document can be very useful for the construction of a variety of lightweight XML processing tools. However, complex regular expressions can be difficult to construct and even more difficult to read. Using a form of literate programming for regular expressions, this paper documents a set of XML shallow parsing expressions that can be used as a basis for simple, correct, efficient, robust and language-independent XML shallow parsing. Complete shallow parser implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex are also given.

"""


If you enjoy reading about regular expressions, Cameron's paper is fascinating. His writing is concise, thorough, and very detailed. He's not simply showing you how to construct the REX regular expression but also an approach for constructing any complex regex.

I've been using the REX regular expression on and off for 10 years to solve the sort of problem the initial poster on Stack Overflow asked about (how do I match this particular tag but not some other very similar tag?). I've found the regex he developed to be completely reliable.


REX is most useful when you're focusing on lexical details of a document -- for example, when transforming one kind of text document (e.g., plain text, XML, SGML, HTML) into another, where the document may not be valid, well formed, or even parsable for most of the transformation. It lets you target islands of markup anywhere within a document without disturbing the rest of the document.


No you can't.

HTML is a Context Free Language (Type 2 in the Chomsky hierarchy) that is defined by a Context Free Grammar, and parsed by a stack machine.

Regular expressions can describe regular (Type 3) languages. They do not have a stack.

Note that there is a loop in the code, so it's not just regular expressions.


Isn't the theory that for a regular grammar you can use a regexp, while for a context-free grammar you need something with a stack (i.e., a parser)?

HTML isn't regular, though, is it? So if there's not (even an implicit) stack in his example, this won't work for the general case.


For quick-and-dirty extraction of data from HTML documents, lynx -dump can be useful.
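For example (assuming lynx is installed; the URL is just an illustration):

    lynx -dump 'http://news.ycombinator.com/' | head -n 20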


When I need to extract data from HTML I use XPath. But to do so I have to use a combination of the following tools:

iconv: necessary only when the page is NOT encoded in UTF-8.

tidy: used to convert from HTML to XHTML, which is XML.

xmlstarlet: to extract data from the XML file using XPath.

I find XPath a much better and more reliable tool for HTML data extraction.


> When I need to extract data from HTML I use XPath. But to do so I have to use a combination of the following tools

I just use lxml.html (it handles 99.999% of the HTML out there; add BeautifulSoup's UnicodeDammit[0] for wonky encodings) and then use lxml's built-in XPath support on top of that.

Plus, as an extra bonus, if the data-mining paths are simple enough you can use CSS queries instead of XPath.

[0] http://lxml.de/elementsoup.html#using-only-the-encoding-dete...


Thanks for pointing this out. I forgot to mention that I am using XPath on the Unix/Cygwin command line or in shell scripts. Here's an example of extracting the Hacker News title;url;points;user data:

   curl -s http://news.ycombinator.com/news | tidy -quiet -asxml -numeric -utf8 -file /dev/null | xmlstarlet sel -N x=http://www.w3.org/1999/xhtml -t -m "//x:tr[x:td[1][@class='title']]" -v "normalize-space(x:td[3][@class='title']/x:a)" -o ";" -v "x:td[3][@class='title']/x:a/@href" -o ";" -v "str:tokenize(following-sibling::x:tr[1]/x:td[2]/x:span[1], ' ')[1]" -o ";" -v "following-sibling::x:tr[1]/x:td[2]/x:a[1]" -n | xmlstarlet unesc


Could you put some newlines in that? For some reason, my browser (IE9) is rendering that construct as a VERY long line and extending the page to match it. (no code-specific scroll-bar for me)


  curl -s http://news.ycombinator.com/news | \
	tidy -quiet -asxml -numeric -utf8 -file /dev/null | \
	xmlstarlet sel \
		-N x=http://www.w3.org/1999/xhtml \
		-t -m "//x:tr[x:td[1][@class='title']]" \
		-v "normalize-space(x:td[3][@class='title']/x:a)" -o ";" \
		-v "x:td[3][@class='title']/x:a/@href" -o ";" \
		-v "str:tokenize(following-sibling::x:tr[1]/x:td[2]/x:span[1], ' ')[1]" -o ";" \
		-v "following-sibling::x:tr[1]/x:td[2]/x:a[1]" -n | \
	xmlstarlet unesc


Thank you.


It's also much faster than BeautifulSoup, and uses much less memory. Matters if you're running in a memory/CPU constrained VPS like I was :)


It doesn't handle script tags correctly for the example cited. Anything between <script> and </script> shouldn't be interpreted as HTML:

    <html>
      <head>
        <script type='text/javascript'>
          var tag = '<input type="hidden" name="foo" value="bar"/>';
        </script>
      </head>
      <body> body </body>
    </html>

    ./html_input_rx test.html
    input tag #1 at character 57: name => "foo" type => "hidden" value => "bar"

But it's very cool, and it could probably be fixed to handle that case quite easily.


For parsing HTML, I'd recommend using a purpose-built HTML-parsing library instead of bothering with regular expressions. (Though, as the author of that answer wrote, regexes can work just fine for parsing small snippets.)

An interesting fact: you can parse HTML/XHTML (correctly) with some of the popular regex implementations. (Note the word implementations.)


I'm just disappointed that he didn't use the flip-flop operator ("..").


The best part is in the answer just below this one: "1. You can write a novel like tchrist did..."


Now you have two problems.



