released 4.1.0
improved lazy quantifiers for POSIX regex lazy pattern matching
genivia-inc committed Mar 5, 2024
1 parent c2b0ab3 commit 187103a
Showing 14 changed files with 315 additions and 327 deletions.
20 changes: 14 additions & 6 deletions README.md
@@ -13,11 +13,17 @@ Two example use cases:
 2. The RE/flex C++ regex engines are used by [ugrep](https://ugrep.com).
 
 The RE/flex lexical analyzer generator extends Flex++ with Unicode support,
-indent/dedent anchors, lazy quantifiers, word boundaries, functions for lex and
-syntax error reporting and other new features. RE/flex is faster than Flex and
-much faster than regex libraries such as Boost.Regex, C++11 std::regex, PCRE2
-and RE2. For example, tokenizing a 2 KB representative C source code file into
-244 tokens takes only 8.7 microseconds:
+indent/dedent anchors, POSIX regex lazy quantifiers, word boundaries, functions
+for lex and syntax error reporting, lexer rule execution performance profiling,
+and other new features.
+
+Only RE/flex supports POSIX regex lazy matching in linear time using an
+advanced DFA transformation algorithm invented by Dr. van Engelen. By
+contrast, Perl regex lazy quantifiers require backtracking to match.
+
+RE/flex is faster than Flex and much faster than regex libraries such as
+Boost.Regex, C++11 std::regex, PCRE2 and RE2. For example, tokenizing a 2 KB
+representative C source code file into 244 tokens takes only 8.7 microseconds:
 
 <table>
 <tr><th>Command / Function</th><th>Software</th><th>Time (μs)</th></tr>
@@ -81,7 +87,8 @@ Features
 - Generates scanners for lexical analysis on files, C++ streams, (wide)
   strings, and memory such as mmap files.
 - Indent/nodent/dedent anchors to match indentation levels to tokenize.
-- Lazy quantifiers, no hacks are needed to work around greedy repetitions.
+- Lazy quantifiers for POSIX regex matching, i.e. no hacks are needed to work
+  around greedy repetitions.
 - Word boundary anchors.
 - Freespace mode option to improve readability of lexer specifications.
 - `%class` and `%init` to customize the generated Lexer classes.
@@ -582,6 +589,7 @@ Changelog
 - Nov 5, 2023: 3.5.1 minor improvements.
 - Feb 17, 2024: 4.0.0 faster `Matcher::find()` with a new DFA cut algorithm to optimize match prediction speed and accuracy, see also ugrep 5.0; apply Unicode pattern canonicalization with `reflex::convert(..., reflex::convert_flag::unicode)`.
 - Feb 23, 2024: 4.0.1 new `rawk` example to demonstrate awk-like fast search in C++; enable `<<EOF>>` rules for option `find` to generate a fast search engine.
+- Mar 5, 2024: 4.1.0 improved lazy quantifiers for POSIX regex lazy matching in linear time using an advanced DFA transformation algorithm introduced in RE/flex in 2016.
 
 [logo-url]: https://www.genivia.com/images/reflex-logo.png
 [reflex-url]: https://www.genivia.com/reflex.html
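
The README hunks above contrast lazy quantifiers with greedy repetitions. As a rough illustration of that difference only, the sketch below uses the standard `std::regex` class (ECMAScript syntax, a backtracking matcher), not the RE/flex matcher or its linear-time DFA algorithm. The helper name `first_match` and the sample input are assumptions made for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration: return the first match of `pattern`
// in `text`, or an empty string when nothing matches.
std::string first_match(const std::string& pattern, const std::string& text)
{
  std::smatch m;
  const std::regex re(pattern);
  return std::regex_search(text, m, re) ? m.str() : std::string();
}

// A greedy `.*` runs to the LAST closing `*/` on the line,
// while a lazy `.*?` stops at the FIRST closing `*/`.
const char *const greedy_comment = R"(/\*.*\*/)";
const char *const lazy_comment   = R"(/\*.*?\*/)";
```

With the input `/* a */ x /* b */`, `first_match(greedy_comment, ...)` returns the whole line, whereas `first_match(lazy_comment, ...)` returns only `/* a */`. This is the kind of greedy over-matching that lazy quantifiers avoid.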
Binary file modified bin/win32/reflex.exe
Binary file not shown.
Binary file modified bin/win64/reflex.exe
Binary file not shown.
11 changes: 7 additions & 4 deletions doc/man/reflex.1
@@ -1,4 +1,4 @@
-.TH REFLEX "1" "February 23, 2024" "reflex 4.0.1" "User Commands"
+.TH REFLEX "1" "March 05, 2024" "reflex 4.1.0" "User Commands"
 .SH NAME
 \fBreflex\fR -- regex\-centric, fast and flexible lexical analyzer generator
 .SH SYNOPSIS
@@ -131,12 +131,15 @@ NOTE: adds functions only, reflex scanners are always reentrant
 .TP
 \fB\-y\fR, \fB\-\-yy\fR
 same as \fB\-\-flex\fR and \fB\-\-bison\fR, also generate global yyin, yyout
+.TP
+\fB\-\-yypanic\fR
+call yypanic() when scanner jams, requires \fB\-\-flex\fR \fB\-\-nodefault\fR
 .TP
 \fB\-\-noyywrap\fR
-do not call global yywrap() on EOF, requires option \fB\-\-flex\fR
+do not call yywrap() on EOF, requires option \fB\-\-flex\fR
 .TP
 \fB\-\-exception\fR=\fIVALUE\fR
-use exception VALUE to throw in the default rule of the scanner
+use exception VALUE to throw as the default rule
 .TP
 \fB\-\-token\-type\fR=\fINAME\fR
 use NAME as the return type of lex() and yylex() instead of int
@@ -150,7 +153,7 @@ enable debug mode in scanner
 scanner reports detailed performance statistics to stderr
 .TP
 \fB\-s\fR, \fB\-\-nodefault\fR
-disable the default rule in scanner that echoes unmatched text
+disable the default rule that echoes unmatched text
 .TP
 \fB\-v\fR, \fB\-\-verbose\fR
 report summary of scanner statistics to stdout
12 changes: 4 additions & 8 deletions fuzzy/README.md
@@ -3,6 +3,8 @@ FuzzyMatcher
 
 A C++ class extension of the [RE/flex](https://github.com/Genivia/RE-flex)
 Matcher class for efficient fuzzy matching and fuzzy search with regex patterns.
+Regex patterns are of the POSIX ERE type, but also support Unicode matching,
+lazy quantifiers, word boundaries and lookaheads.
 
 - specify max error as a parameter, i.e. the max edit distance or
   [Levenshstein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
@@ -153,20 +155,14 @@ in frequently executed functions:
     static const reflex::Pattern pattern(reflex::Matcher::convert("PATTERN", reflex::convert_flag::unicode));
     reflex::FuzzyMatcher matcher(pattern, [MAX,] INPUT);
 
-Requires
---------
-
-[RE/flex](https://github.com/Genivia/RE-flex) downloaded and locally built or
-globally installed to access the `reflex/include` and `reflex/lib` files.
-
 Compiling
 ---------
 
-Assuming `reflex` dir with source code is locally built in the project dir:
+Assuming `reflex` dir with RE/flex source code is locally built:
 
     c++ -o myapp myapp.cpp -Ireflex/include reflex/lib/libreflex.a
 
-Or when the `libreflex` library is installed:
+When the `libreflex` library is built and installed:
 
     c++ -o myapp myapp.cpp -lreflex
 
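
The fuzzy README text above defines the max error parameter as a maximum edit distance (Levenshtein distance). To make that metric concrete, here is a minimal sketch of the textbook dynamic-programming computation between two plain strings. It is not how `FuzzyMatcher` works internally, since that class matches regex patterns rather than comparing whole strings; the names `edit_distance` and `within_max_error` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn `a` into `b`.
std::size_t edit_distance(const std::string& a, const std::string& b)
{
  // prev[j] holds the distance between the first i-1 chars of a and the
  // first j chars of b; curr[j] is the same for the first i chars of a.
  std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j)
    prev[j] = j; // turning "" into b[0..j) takes j insertions
  for (std::size_t i = 1; i <= a.size(); ++i)
  {
    curr[0] = i; // turning a[0..i) into "" takes i deletions
    for (std::size_t j = 1; j <= b.size(); ++j)
    {
      const std::size_t subst = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
      curr[j] = std::min({subst, prev[j] + 1, curr[j - 1] + 1});
    }
    prev.swap(curr);
  }
  return prev[b.size()];
}

// Hypothetical helper: does `input` equal `word` up to `max_error` edits?
bool within_max_error(const std::string& word, const std::string& input, std::size_t max_error)
{
  return edit_distance(word, input) <= max_error;
}
```

For instance, `kitten` and `sitting` are three edits apart, so a max error of 3 accepts the pair and a max error of 2 rejects it.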
39 changes: 14 additions & 25 deletions include/reflex/pattern.h
@@ -54,9 +54,6 @@
 #include <bitset>
 #include <vector>
 
-// ugrep 3.7: use vectors instead of sets to store positions to compile DFAs
-#define WITH_VECTOR
-
 // ugrep 3.7.0a: use a map to construct fixed string pattern trees
 // #define WITH_TREE_MAP
 // ugrep 3.7.0b: use a DFA as a tree to bypass DFA construction step when possible
@@ -503,7 +500,7 @@ class Pattern {
 static const value_type RES3 = 1ULL << 50; ///< reserved
 static const value_type NEGATE = 1ULL << 51; ///< marks negative patterns
 static const value_type TICKED = 1ULL << 52; ///< marks lookahead ending ) in (?=X)
-static const value_type GREEDY = 1ULL << 53; ///< force greedy quants
+static const value_type RES4 = 1ULL << 53; ///< reserved
 static const value_type ANCHOR = 1ULL << 54; ///< marks begin of word (\b,\<,\>) and buffer (\A,^) anchors
 static const value_type ACCEPT = 1ULL << 55; ///< accept, not a regex position
 Position() : k(NPOS) { }
@@ -514,7 +511,6 @@ class Pattern {
 Position iter(Iter i) const { return Position(k + (static_cast<value_type>(i) << 32)); }
 Position negate(bool b) const { return b ? Position(k | NEGATE) : Position(k & ~NEGATE); }
 Position ticked(bool b) const { return b ? Position(k | TICKED) : Position(k & ~TICKED); }
-Position greedy(bool b) const { return b ? Position(k | GREEDY) : Position(k & ~GREEDY); }
 Position anchor(bool b) const { return b ? Position(k | ANCHOR) : Position(k & ~ANCHOR); }
 Position accept(bool b) const { return b ? Position(k | ACCEPT) : Position(k & ~ACCEPT); }
 Position lazy(Lazy l) const { return Position((k & 0x00FFFFFFFFFFFFFFULL) | static_cast<value_type>(l) << 56); }
@@ -524,30 +520,20 @@ class Pattern {
 Iter iter() const { return static_cast<Index>((k >> 32) & 0xFFFF); }
 bool negate() const { return (k & NEGATE) != 0; }
 bool ticked() const { return (k & TICKED) != 0; }
-bool greedy() const { return (k & GREEDY) != 0; }
 bool anchor() const { return (k & ANCHOR) != 0; }
 bool accept() const { return (k & ACCEPT) != 0; }
 Lazy lazy() const { return static_cast<Lazy>(k >> 56); }
 value_type k;
 };
-typedef std::vector<Lazy> Lazyset;
-#ifdef WITH_VECTOR
+typedef std::vector<Position> Lazypos;
 typedef std::vector<Position> Positions;
-#else
-typedef std::set<Position> Positions;
-#endif
 typedef std::map<Position,Positions> Follow;
 typedef std::pair<Chars,Positions> Move;
 typedef std::list<Move> Moves;
-#ifdef WITH_VECTOR
 inline static void pos_insert(Positions& s1, const Positions& s2) { s1.insert(s1.end(), s2.begin(), s2.end()); }
 inline static void pos_add(Positions& s, const Position& e) { s.insert(s.end(), e); }
-#else
-inline static void pos_insert(Positions& s1, const Positions& s2) { s1.insert(s2.begin(), s2.end()); }
-inline static void pos_add(Positions& s, const Position& e) { s.insert(e); }
-#endif
-inline static void lazy_insert(Lazyset& s1, const Lazyset& s2) { s1.insert(s1.end(), s2.begin(), s2.end()); }
-inline static void lazy_add(Lazyset& s, const Lazy& e) { s.insert(s.end(), e); }
+inline static void lazy_insert(Lazypos& s1, const Lazypos& s2) { s1.insert(s1.end(), s2.begin(), s2.end()); }
+inline static void lazy_add(Lazypos& s, const Lazy i, Location p) { s.insert(s.end(), Position(p).lazy(i)); }
 #ifndef WITH_TREE_DFA
 /// Tree DFA constructed from string patterns.
 struct Tree {
@@ -859,6 +845,7 @@ class Pattern {
 void parse(
 Positions& startpos,
 Follow& followpos,
+Lazypos& lazypos,
 Mods modifiers,
 Map& lookahead);
 void parse1(
@@ -869,7 +856,7 @@ class Pattern {
 bool& nullable,
 Follow& followpos,
 Lazy& lazyidx,
-Lazyset& lazyset,
+Lazypos& lazypos,
 Mods modifiers,
 Locations& lookahead,
 Iter& iter);
@@ -881,7 +868,7 @@ class Pattern {
 bool& nullable,
 Follow& followpos,
 Lazy& lazyidx,
-Lazyset& lazyset,
+Lazypos& lazypos,
 Mods modifiers,
 Locations& lookahead,
 Iter& iter);
@@ -893,7 +880,7 @@ class Pattern {
 bool& nullable,
 Follow& followpos,
 Lazy& lazyidx,
-Lazyset& lazyset,
+Lazypos& lazypos,
 Mods modifiers,
 Locations& lookahead,
 Iter& iter);
@@ -905,7 +892,7 @@ class Pattern {
 bool& nullable,
 Follow& followpos,
 Lazy& lazyidx,
-Lazyset& lazyset,
+Lazypos& lazypos,
 Mods modifiers,
 Locations& lookahead,
 Iter& iter);
@@ -915,21 +902,23 @@ class Pattern {
 void compile(
 DFA::State *start,
 Follow& followpos,
+const Lazypos& lazypos,
 const Mods modifiers,
 const Map& lookahead);
 void lazy(
-const Lazyset& lazyset,
+const Lazypos& lazypos,
 Positions& pos) const;
 void lazy(
-const Lazyset& lazyset,
+const Lazypos& lazypos,
 const Positions& pos,
 Positions& pos1) const;
 void greedy(Positions& pos) const;
 void trim_anchors(Positions& follow, const Position p) const;
-void trim_lazy(Positions *pos) const;
+void trim_lazy(Positions *pos, const Lazypos& lazypos) const;
 void compile_transition(
 DFA::State *state,
 Follow& followpos,
+const Lazypos& lazypos,
 const Mods modifiers,
 const Map& lookahead,
 Moves& moves) const;
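
The `pattern.h` hunks above replace `Lazyset` (a vector of lazy indices) with `Lazypos` (a vector of `Position` values), where `lazy_add` stores `Position(p).lazy(i)`, i.e. a location tagged with a lazy index in the top byte of the 64-bit value. The following stand-alone sketch mirrors that bit packing under stated assumptions: the struct name `PackedPos`, the 32-bit location width, the 8-bit lazy index and the `loc()` accessor are guesses made for illustration, and only the flag bit numbers and the `lazy()` masks are taken from the diff.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for Pattern::Position (hypothetical name).
struct PackedPos {
  typedef std::uint64_t value_type;
  static const value_type NEGATE = 1ULL << 51; // bit numbers as in the diff
  static const value_type TICKED = 1ULL << 52;
  static const value_type ANCHOR = 1ULL << 54;
  static const value_type ACCEPT = 1ULL << 55;
  explicit PackedPos(value_type v = 0) : k(v) { }
  // setters return a modified copy, like the Position methods in the diff
  PackedPos negate(bool b) const { return PackedPos(b ? k | NEGATE : k & ~NEGATE); }
  PackedPos anchor(bool b) const { return PackedPos(b ? k | ANCHOR : k & ~ANCHOR); }
  PackedPos lazy(std::uint8_t l) const { return PackedPos((k & 0x00FFFFFFFFFFFFFFULL) | static_cast<value_type>(l) << 56); }
  // getters
  std::uint32_t loc() const { return static_cast<std::uint32_t>(k & 0xFFFFFFFFULL); } // assumed: location in the low 32 bits
  bool negate() const { return (k & NEGATE) != 0; }
  bool anchor() const { return (k & ANCHOR) != 0; }
  std::uint8_t lazy() const { return static_cast<std::uint8_t>(k >> 56); }
  value_type k;
};

typedef std::vector<PackedPos> PackedLazypos;

// Counterpart of the new lazy_add(): record location `p` tagged with lazy index `i`.
inline void lazy_add(PackedLazypos& s, std::uint8_t i, std::uint32_t p)
{
  s.insert(s.end(), PackedPos(p).lazy(i));
}
```

Keeping the lazy index inside the position value means a single vector element identifies both the lazy quantifier and the regex location it belongs to, which appears to be why the new signatures pass one `Lazypos` argument instead of a separate set of indices.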
2 changes: 1 addition & 1 deletion lib/convert.cpp
@@ -1452,7 +1452,7 @@ static void convert_escape(const char *pattern, size_t len, size_t& loc, size_t&
 }
 loc = pos + 1;
 }
-else
+else if (c != ' ' && c != '\t')
 {
 convert_escape_char(pattern, len, loc, pos, flags, signature, mod, par, regex, nl);
 }