Re: Interesting paper on regex NFA matching

Christopher F Clark <christopher.f.clark@compiler-resources.com>
Sat, 27 Jan 2024 12:47:33 +0200

          From comp.compilers

Related articles
Interesting paper on regex NFA matching johnl@taugh.com (John R Levine) (2024-01-25)
Re: Interesting paper on regex NFA matching cross@spitfire.i.gajendra.net (2024-01-26)
Re: Interesting paper on regex NFA matching 433-929-6894@kylheku.com (Kaz Kylheku) (2024-01-26)
Re: Interesting paper on regex NFA matching christopher.f.clark@compiler-resources.com (Christopher F Clark) (2024-01-27)
Re: Interesting paper on regex NFA matching 433-929-6894@kylheku.com (Kaz Kylheku) (2024-01-27)
Re: Interesting paper on regex NFA matching christopher.f.clark@compiler-resources.com (Christopher F Clark) (2024-01-29)
Re: Interesting paper on regex NFA matching 433-929-6894@kylheku.com (Kaz Kylheku) (2024-01-29)
Re: Interesting paper on regex NFA matching christopher.f.clark@compiler-resources.com (Christopher F Clark) (2024-02-01)
Re: Interesting paper on regex NFA matching drikosev@gmail.com (Ev Drikos) (2024-02-04)
| List of all articles for this month |

From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Sat, 27 Jan 2024 12:47:33 +0200
Organization: Compilers Central
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="71672"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, NFA
Posted-Date: 27 Jan 2024 12:52:49 EST

Kaz Kylheku <433-929-6894@kylheku.com> wrote:


> (I personaly am not so interested in backtracking regexes; and they are
> not directly applicable to compilers. You'd never want to design a
> lexical analyzer that requires "lookaround" or whatnot in order to
> recognize tokens. It's nice they are generating Arxiv papers for
> someone; no genuine intellectual inquiry is invalid; good luck! It's
> not applicable to the "Programming Languages" category it has been filed
> under. I will skim through it, though.)


You are mistaken that they are not applicable to programming languages.


Even Pascal [originally] had cases where lookahead and backtracking
are required to properly tokenize the input.


The well-known case is "1." if followed by another "." it is two tokens,
and if followed by a digit it is a floating point number.
I don't recall what it should be if followed by any other character.


In any case, parser generators like ANTLR use backtracking to great
effect (and yes, we "borrowed" that idea in Yacc++) and so did the
entire PEG (parsing expression grammar) community, using memoizartion
to give linear time performance in many cases.


Thus, to me, it is no surprise to see this reintegrated back into the
PCRE world.


Moreover, even if it is of little interest [mistakenly in my opinion] to the
"programming language" community, it is of definite interest in the
"real world:" PCRE regular expressions are both used (and abused)
in many applications, including SNORT where linear performance and
preventing DDOS attacks would be of prime importance. (While at Intel,
I had the honor of working with the Hyperscan (Sensory Networks) team
and improving PCRE performance for SNORT was a major "selling point"


And "greedy" (longest match), "non-greedy" (shortest match) and other
variations like "lexically first in the grammar" matching all have their
applications. Saying that only one of them is correct is ignoring the
needs of actual users.


And, yes, we had discussions about "automata theory correct" interpretations
of what certain PCRE expressions meant and how even "canonical"
implementations had various "bugs" that made their meaning ambiguous.


By the way, to give credit where due, we "borrowed" the idea of
"tunnel automatons" from Josef Grosch to implement controlled
backtracking in Yacc++. It lets one do subset construction (i.e.
the LR(0) algorithm) for items, breaking out for "syntactic predicates"
exactly as needed.


--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.