pell_re is a pure-PL/SQL regex engine. It’s a Thompson-NFA matcher
(linear-time, no catastrophic backtracking) with bytecode held in
RAW(32767) via UTL_RAW. No Java, no extproc — runs identically
on 19c, 23ai, Autonomous Database, and AWS RDS for Oracle.
Installation
# from the repo root
sqlplus user/pass @runtime/pell_re.sql
The package is self-contained; install once per schema.
Public types
t_text_list
TYPE t_text_list IS TABLE OF VARCHAR2(4000) INDEX BY PLS_INTEGER;
1-based list of strings. Returned by find_all, split.
t_int_list
TYPE t_int_list IS TABLE OF PLS_INTEGER INDEX BY PLS_INTEGER;
1-based list of integers. Returned by find_span (2 entries: start
and end position).
t_capture
TYPE t_capture IS RECORD (
start_pos PLS_INTEGER,
end_pos PLS_INTEGER,
match_text VARCHAR2(4000)
);
A single capture: span (1-based positions, end-exclusive) and the
matched substring. Both start_pos and match_text are NULL if the
group didn’t participate.
t_capture_list
TYPE t_capture_list IS TABLE OF t_capture INDEX BY PLS_INTEGER;
Index 0 is the whole match; indices 1..N are the parenthesized groups in source order.
t_capture_map
TYPE t_capture_map IS TABLE OF t_capture INDEX BY VARCHAR2(64);
Captures keyed by group name (only (?<name>...) groups appear here).
Public functions
matches(s, pat)
FUNCTION matches(p_s IN VARCHAR2, p_pat IN VARCHAR2) RETURN BOOLEAN;
True if any substring of s matches pat (unanchored).
find(s, pat[, pos])
FUNCTION find(p_s IN VARCHAR2, p_pat IN VARCHAR2,
p_pos IN PLS_INTEGER DEFAULT 1) RETURN VARCHAR2;
Returns the substring of the first match starting at or after pos.
NULL if no match.
find_all(s, pat)
FUNCTION find_all(p_s IN VARCHAR2, p_pat IN VARCHAR2) RETURN t_text_list;
All non-overlapping matches, left-to-right.
find_span(s, pat[, pos])
FUNCTION find_span(p_s IN VARCHAR2, p_pat IN VARCHAR2,
p_pos IN PLS_INTEGER DEFAULT 1) RETURN t_int_list;
A 2-element list with [start_pos, end_pos] (1-based, end-exclusive)
of the first match. Empty list if no match.
capture(s, pat[, pos])
FUNCTION capture(p_s IN VARCHAR2, p_pat IN VARCHAR2,
p_pos IN PLS_INTEGER DEFAULT 1) RETURN t_capture_list;
Returns the capture spans for the first match. Index 0 is the whole
match; indices 1..N are the parenthesized groups (capturing only —
(?:...) is skipped). Empty list if no match.
capture_by_name(s, pat[, pos])
FUNCTION capture_by_name(p_s IN VARCHAR2, p_pat IN VARCHAR2,
p_pos IN PLS_INTEGER DEFAULT 1) RETURN t_capture_map;
Named-group captures only, keyed by name.
replace_all(s, pat, rep)
FUNCTION replace_all(p_s IN VARCHAR2, p_pat IN VARCHAR2,
p_rep IN VARCHAR2) RETURN VARCHAR2;
Replace every non-overlapping match with rep (literal, no
backreferences in v0).
split(s, pat)
FUNCTION split(p_s IN VARCHAR2, p_pat IN VARCHAR2) RETURN t_text_list;
Split on every match of pat. Leading/trailing matches produce
empty fields (matching PCRE-style split semantics).
clear_cache()
PROCEDURE clear_cache;
Drop the per-session compiled-pattern cache. Useful for tests; rarely needed otherwise.
Supported regex syntax
| Construct | Notes |
|---|---|
. |
any char except \n |
* + ? |
greedy quantifiers |
{n} {n,} {n,m} |
counted quantifiers |
| |
alternation |
(...) |
capturing group |
(?:...) |
non-capturing group |
(?<name>...) |
named capture |
[...] [^...] |
char class / negated |
[a-z] |
range |
\d \D \w \W \s \S |
shorthand classes (ASCII) |
^ $ |
start / end of line |
\A \z |
start / end of string |
\b \B |
word boundary / non-boundary |
\n \r \t \f \v \0 |
char escapes |
\. \( \) \\ |
escape literals |
Not supported in v0: backreferences \1, lookbehinds (?<=...),
possessive quantifiers *+, Unicode property classes \p{L}, inline
flags (?i) (planned), /pattern/flags suffixes (rejected at lex
time with a helpful error).
Bytecode layout
For the curious: the regex compiler produces variable-length opcodes
in RAW(32767):
| Op | Hex | Operands | Size | Semantics |
|---|---|---|---|---|
| MATCH | 01 | - | 1 | accept |
| CHAR | 02 | cp:u32 | 5 | match codepoint |
| ANY | 03 | - | 1 | . (no \n) |
| ANYD | 04 | - | 1 | . in dotall mode |
| CLS | 05 | pool_offset:u16 | 3 | character class |
| JMP | 06 | addr:u16 | 3 | goto |
| SPLIT | 07 | addr1:u16, addr2:u16 | 5 | NFA fork (prefer addr1) |
| SAVE | 08 | slot:u8 | 2 | record current pos |
| BOL | 09 | - | 1 | ^ |
| EOL | 0A | - | 1 | $ |
| BOS | 0B | - | 1 | \A |
| EOS | 0C | - | 1 | \z |
| WB | 0D | - | 1 | \b |
| NWB | 0E | - | 1 | \B |
Character classes live in a pool after the program. Each class:
1 byte negated flag + 2 bytes range count + N×8 bytes of
(start_cp:u32, end_cp:u32) ranges.
The matcher uses Pike’s algorithm with per-thread capture arrays. SAVE opcodes copy and update the capture array; seen-by-PC dedup gives leftmost-greedy semantics.
Cache
Each call goes through get_compiled(pat), which:
- Hashes the pattern:
RAWTOHEX(STANDARD_HASH(pat, 'SHA256')). - Checks
g_cache(hash)— return if hit. - Otherwise compiles to bytecode and stores in
g_cache.
The cache is package state; it lives for the duration of the session.
Call clear_cache to drop it manually.
Performance characteristics
- Compilation: ~5-20µs for typical patterns. Cache wipes this out for repeat calls.
- Matching: linear in input length. ~50-200ns per input character for typical patterns. Worst case is the same as best case — no catastrophic backtracking.
- Memory: bounded by
O(N × P)where N is input length and P is the number of NFA states (usually small).
For a 1KB input and a typical pattern, expect <1ms total. For 1MB inputs, expect ~10-50ms. Both are well within what’s tolerable for OLTP workloads.
Why pure PL/SQL?
Pell was designed to run on managed Oracle instances (Autonomous Database, RDS for Oracle, etc.) where Java SP and extproc are either disabled or impractical. Pure PL/SQL is portable, comes with no “works on prem only” footnotes, and runs identically on every Oracle version pell targets.
Cost: ~10-100× slower than C-implemented PCRE. But for the patterns and inputs typical of OLTP workloads (validation, extraction, log parsing), the absolute timings stay in the sub-millisecond range, where the overhead is invisible to users.