pell_re is a pure-PL/SQL regex engine. It’s a Thompson-NFA matcher (linear-time, no catastrophic backtracking) with bytecode held in RAW(32767) via UTL_RAW. No Java, no extproc — runs identically on 19c, 23ai, Autonomous Database, and AWS RDS for Oracle.

Installation

# from the repo root
sqlplus user/pass @runtime/pell_re.sql

The package is self-contained; install once per schema.

Public types

t_text_list

TYPE t_text_list IS TABLE OF VARCHAR2(4000) INDEX BY PLS_INTEGER;

1-based list of strings. Returned by find_all, split.

t_int_list

TYPE t_int_list IS TABLE OF PLS_INTEGER INDEX BY PLS_INTEGER;

1-based list of integers. Returned by find_span (2 entries: start and end position).

t_capture

TYPE t_capture IS RECORD (
  start_pos  PLS_INTEGER,
  end_pos    PLS_INTEGER,
  match_text VARCHAR2(4000)
);

A single capture: span (1-based positions, end-exclusive) and the matched substring. Both start_pos and match_text are NULL if the group didn’t participate.

t_capture_list

TYPE t_capture_list IS TABLE OF t_capture INDEX BY PLS_INTEGER;

Index 0 is the whole match; indices 1..N are the parenthesized groups in source order.

t_capture_map

TYPE t_capture_map IS TABLE OF t_capture INDEX BY VARCHAR2(64);

Captures keyed by group name (only (?<name>...) groups appear here).

Public functions

matches(s, pat)

FUNCTION matches(p_s IN VARCHAR2, p_pat IN VARCHAR2) RETURN BOOLEAN;

True if any substring of s matches pat (unanchored).

find(s, pat[, pos])

FUNCTION find(p_s IN VARCHAR2, p_pat IN VARCHAR2,
              p_pos IN PLS_INTEGER DEFAULT 1) RETURN VARCHAR2;

Returns the substring of the first match starting at or after pos. NULL if no match.

find_all(s, pat)

FUNCTION find_all(p_s IN VARCHAR2, p_pat IN VARCHAR2) RETURN t_text_list;

All non-overlapping matches, left-to-right.

find_span(s, pat[, pos])

FUNCTION find_span(p_s IN VARCHAR2, p_pat IN VARCHAR2,
                   p_pos IN PLS_INTEGER DEFAULT 1) RETURN t_int_list;

A 2-element list with [start_pos, end_pos] (1-based, end-exclusive) of the first match. Empty list if no match.

capture(s, pat[, pos])

FUNCTION capture(p_s IN VARCHAR2, p_pat IN VARCHAR2,
                 p_pos IN PLS_INTEGER DEFAULT 1) RETURN t_capture_list;

Returns the capture spans for the first match. Index 0 is the whole match; indices 1..N are the parenthesized groups (capturing only — (?:...) is skipped). Empty list if no match.

capture_by_name(s, pat[, pos])

FUNCTION capture_by_name(p_s IN VARCHAR2, p_pat IN VARCHAR2,
                         p_pos IN PLS_INTEGER DEFAULT 1) RETURN t_capture_map;

Named-group captures only, keyed by name.

replace_all(s, pat, rep)

FUNCTION replace_all(p_s IN VARCHAR2, p_pat IN VARCHAR2,
                     p_rep IN VARCHAR2) RETURN VARCHAR2;

Replace every non-overlapping match with rep (literal, no backreferences in v0).

split(s, pat)

FUNCTION split(p_s IN VARCHAR2, p_pat IN VARCHAR2) RETURN t_text_list;

Split on every match of pat. Leading/trailing matches produce empty fields (matching PCRE-style split semantics).

clear_cache()

PROCEDURE clear_cache;

Drop the per-session compiled-pattern cache. Useful for tests; rarely needed otherwise.

Supported regex syntax

Construct Notes
. any char except \n
* + ? greedy quantifiers
{n} {n,} {n,m} counted quantifiers
| alternation
(...) capturing group
(?:...) non-capturing group
(?<name>...) named capture
[...] [^...] char class / negated
[a-z] range
\d \D \w \W \s \S shorthand classes (ASCII)
^ $ start / end of line
\A \z start / end of string
\b \B word boundary / non-boundary
\n \r \t \f \v \0 char escapes
\. \( \) \\ escape literals

Not supported in v0: backreferences \1, lookbehinds (?<=...), possessive quantifiers *+, Unicode property classes \p{L}, inline flags (?i) (planned), /pattern/flags suffixes (rejected at lex time with a helpful error).

Bytecode layout

For the curious: the regex compiler produces variable-length opcodes in RAW(32767):

Op Hex Operands Size Semantics
MATCH 01 - 1 accept
CHAR 02 cp:u32 5 match codepoint
ANY 03 - 1 . (no \n)
ANYD 04 - 1 . in dotall mode
CLS 05 pool_offset:u16 3 character class
JMP 06 addr:u16 3 goto
SPLIT 07 addr1:u16, addr2:u16 5 NFA fork (prefer addr1)
SAVE 08 slot:u8 2 record current pos
BOL 09 - 1 ^
EOL 0A - 1 $
BOS 0B - 1 \A
EOS 0C - 1 \z
WB 0D - 1 \b
NWB 0E - 1 \B

Character classes live in a pool after the program. Each class: 1 byte negated flag + 2 bytes range count + N×8 bytes of (start_cp:u32, end_cp:u32) ranges.

The matcher uses Pike’s algorithm with per-thread capture arrays. SAVE opcodes copy and update the capture array; seen-by-PC dedup gives leftmost-greedy semantics.

Cache

Each call goes through get_compiled(pat), which:

  1. Hashes the pattern: RAWTOHEX(STANDARD_HASH(pat, 'SHA256')).
  2. Checks g_cache(hash) — return if hit.
  3. Otherwise compiles to bytecode and stores in g_cache.

The cache is package state; it lives for the duration of the session. Call clear_cache to drop it manually.

Performance characteristics

  • Compilation: ~5-20µs for typical patterns. Cache wipes this out for repeat calls.
  • Matching: linear in input length. ~50-200ns per input character for typical patterns. Worst case is the same as best case — no catastrophic backtracking.
  • Memory: bounded by O(N × P) where N is input length and P is the number of NFA states (usually small).

For a 1KB input and a typical pattern, expect <1ms total. For 1MB inputs, expect ~10-50ms. Both are well within what’s tolerable for OLTP workloads.

Why pure PL/SQL?

Pell was designed to run on managed Oracle instances (Autonomous Database, RDS for Oracle, etc.) where Java SP and extproc are either disabled or impractical. Pure PL/SQL is portable, comes with no “works on prem only” footnotes, and runs identically on every Oracle version pell targets.

Cost: ~10-100× slower than C-implemented PCRE. But for the patterns and inputs typical of OLTP workloads (validation, extraction, log parsing), the absolute timings stay in the sub-millisecond range, where the overhead is invisible to users.