Lexing is strictly simpler than parsing - lexing is done with state machines*, while many parsers can be viewed as state machines with stacks, AKA "pushdown automata" (http://en.wikipedia.org/wiki/Pushdown_automata). Lexers don't have stacks, which is why basic regular expressions can't do things like match arbitrarily nested parentheses.
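To make the stack distinction concrete, here's a minimal sketch in C (my example, not part of the original point): checking balanced parentheses needs only a depth counter, but that counter can grow without bound, which is exactly the memory a finite state machine - and therefore a plain regular expression - doesn't have:

    #include <stdbool.h>
    #include <stdio.h>

    /* True iff every '(' in s has a matching ')'.  The depth counter is a
       degenerate stack: it can grow without bound, which is what a finite
       state machine cannot track. */
    static bool parens_balanced(const char *s)
    {
        long depth = 0;
        for (; *s != '\0'; s++) {
            if (*s == '(')
                depth++;
            else if (*s == ')' && --depth < 0)
                return false;              /* a ')' with nothing open */
        }
        return depth == 0;                 /* everything opened was closed */
    }

    int main(void)
    {
        printf("%d %d\n", parens_balanced("((a)(b))"), parens_balanced("(()"));  /* 1 0 */
        return 0;
    }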
* Basically: make a byte-by-byte flowchart ("an int is a digit, followed by more digits, stop when it reaches a non-digit character and return the accumulated string as an int"), then convert that to code. Pretty simple. Graphviz (http://graphviz.org) makes pretty state machine diagrams.
Just knowing a little bit about automata / state machines will be enough to write a decent lexer. The Wirth PDF above explains it in just a few pages (chapter 3). People often use tools like lex or ragel because writing lexers by hand can be rather tedious (being able to say e.g. "[+-]?[1-9][0-9]* { return LEXEME_INT(atoi(str)); }" beats writing out the tests byte by byte), but hand-written lexers aren't hard.
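For comparison, here's roughly what the hand-written version of that rule looks like in C; a sketch only - lex_int and the "return how much was consumed" convention are mine, not lex's:

    #include <stdlib.h>
    #include <stddef.h>

    /* Hand-coded equivalent of [+-]?[1-9][0-9]*.  Stores the value in *out and
       returns how many characters were consumed, or 0 if there was no match. */
    static size_t lex_int(const char *p, long *out)
    {
        const char *start = p;
        if (*p == '+' || *p == '-')        /* [+-]?  */
            p++;
        if (*p < '1' || *p > '9')          /* [1-9]  */
            return 0;                      /* not an int token */
        p++;
        while (*p >= '0' && *p <= '9')     /* [0-9]* */
            p++;
        *out = strtol(start, NULL, 10);    /* the accumulated string as an int */
        return (size_t)(p - start);
    }

Calling lex_int("-42;", &v) consumes three characters and sets v to -42; the byte-by-byte tests are exactly the flowchart from the footnote, written out.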
And yeah, the Lispers are able to sidestep parsing entirely.
Writing a state machine such that state transitions map to instruction pointer changes can be a viable technique. For example, using Pascal - handy because of sets - your example could be recognized thusly, with cp: PChar, or ^Char:
    start := cp;                                           { remember where the token starts }
    if cp^ in ['+', '-'] then Inc(cp);                     { [+-]?  }
    if cp^ in ['1'..'9'] then Inc(cp) else goto no_match;  { [1-9]  }
    while cp^ in ['0'..'9'] do Inc(cp);                    { [0-9]* }
    SetString(str, start, cp - start);                     { copy the lexeme into str }
I don't think this is tedious at all. The various components of a regular expression have analogues in code: * and + become loops, ? and | become if statements, and character classes become set membership tests (which is nice in Pascal). For most token definitions used in practical programming languages, writing lexers for them by hand is fairly trivial; and when things get more complicated, the freedom of having hand-coded the lexer usually gives you more tools to solve those complications.
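To illustrate that mapping on a slightly bigger token (a sketch in C, my own example, for something like [0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?): the + parts become loops, the ? parts become ifs, and the e|E alternation is just another test:

    #include <stddef.h>

    /* Hand-coded recognizer for [0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?.
       Returns the number of characters matched, or 0 if none. */
    static size_t lex_number(const char *p)
    {
        const char *start = p;

        if (*p < '0' || *p > '9')                 /* need at least one digit  */
            return 0;
        while (*p >= '0' && *p <= '9')            /* [0-9]+        -> loop    */
            p++;

        if (*p == '.') {                          /* (\.[0-9]+)?   -> if      */
            const char *dot = p++;
            if (*p < '0' || *p > '9')
                return (size_t)(dot - start);     /* "12." matches as "12"    */
            while (*p >= '0' && *p <= '9')
                p++;
        }

        if (*p == 'e' || *p == 'E') {             /* ([eE]...)?    e|E -> if  */
            const char *exp = p++;
            if (*p == '+' || *p == '-')           /* [+-]?         -> if      */
                p++;
            if (*p < '0' || *p > '9')
                return (size_t)(exp - start);     /* "12e" matches as "12"    */
            while (*p >= '0' && *p <= '9')        /* [0-9]+        -> loop    */
                p++;
        }

        return (size_t)(p - start);
    }

On "3.14e-2foo" it matches the first seven characters, and the shape of the code is a direct transliteration of the regular expression.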
Having a lexer coded as a loop driven by state transition tables can be useful in other scenarios, however; for example, if you only receive a little bit of input at a time, and you want to continuously lex it until a whole token has formed. There, you want to be able to suspend tokenizing at any point, and resume it later. Without coroutines, continuations or threads, this is more awkward when hand-coding.
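A minimal sketch of that style, assuming a toy [0-9]+ token (the names, like lexer_feed, are mine): the entire lexer state lives in a small struct, so you can feed it whatever bytes have arrived and come back later with more:

    #include <stddef.h>

    enum { S_START, S_INT, S_DONE, S_ERROR, N_STATES };
    enum { C_DIGIT, C_OTHER, N_CLASSES };

    /* Transition table for the token [0-9]+ */
    static const unsigned char transition[N_STATES][N_CLASSES] = {
        /*            digit    other   */
        [S_START] = { S_INT,   S_ERROR },
        [S_INT]   = { S_INT,   S_DONE  },
    };

    struct lexer {
        unsigned char state;    /* survives between calls, i.e. the "suspension" */
        size_t        length;   /* bytes accepted so far                         */
    };

    static int classify(char c) { return (c >= '0' && c <= '9') ? C_DIGIT : C_OTHER; }

    /* Feed a chunk; returns how many bytes were consumed.  Stops as soon as the
       token is complete (S_DONE) or rejected (S_ERROR); if the chunk runs out
       first, just call again when more input arrives. */
    static size_t lexer_feed(struct lexer *lx, const char *buf, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++) {
            if (lx->state == S_DONE || lx->state == S_ERROR)
                break;
            lx->state = transition[lx->state][classify(buf[i])];
            if (lx->state == S_DONE || lx->state == S_ERROR)
                break;                          /* leave the terminating byte unconsumed */
            lx->length++;
        }
        return i;
    }

Hand-coding the same thing as straight-line code would need some way to remember which if or while it was in the middle of when the input ran dry, which is exactly the awkwardness described above.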
That rule doesn't hold for some languages. For example, a Python lexer needs to remember a stack of indentation levels to know whether a given line's indentation should be tokenized as an INDENT, a DEDENT, or no token at all.
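A sketch of that bookkeeping (mine; CPython's tokenizer additionally handles tabs, line continuations and inconsistent dedents): keep a stack of indentation widths, push and emit INDENT when a line is deeper than the top of the stack, and pop and emit DEDENT until it matches when it's shallower:

    #include <stdio.h>

    #define MAX_DEPTH 100
    static int indent_stack[MAX_DEPTH] = { 0 };   /* bottom entry is always 0 (top level) */
    static int depth = 0;

    static void handle_line_indent(int width)
    {
        if (width > indent_stack[depth]) {
            indent_stack[++depth] = width;        /* deeper: one INDENT              */
            puts("INDENT");
        } else {
            while (width < indent_stack[depth]) { /* shallower: pop until it matches */
                depth--;
                puts("DEDENT");
            }
            /* width == indent_stack[depth]: same level, no token at all */
        }
    }

    int main(void)
    {
        int widths[] = { 0, 4, 8, 0 };            /* top level, body, nested body, back out */
        for (int i = 0; i < (int)(sizeof widths / sizeof widths[0]); i++)
            handle_line_indent(widths[i]);        /* prints INDENT INDENT DEDENT DEDENT */
        return 0;
    }

(A dedent that doesn't match any level on the stack is an IndentationError in real Python; the sketch ignores that.)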
True. Strictly speaking, that means the lexical grammar isn't regular in the usual sense (a stack takes you beyond a finite state machine, right?), but it's a practical extension.
Matt Might uses Python's indentation as a lexing/parsing example in his ongoing "Scripting Language Design and Implementation" course (http://matt.might.net/teaching/scripting-languages/spring-20...), which is worth following if you're reading this thread.
That first paragraph was eye-opening all on its own. I actually think I've collected masses of links to compiler resources from one or more of your comments in the past, so thanks for this one too.