Skip to main content

Command Palette

Search for a command to run...

WideLips: A Multi GB/S LISP Parser And Parsing Framework

Updated
11 min read
WideLips: A Multi GB/S LISP Parser And Parsing Framework

Introduction

Parsers have been around since the dawn of computing. If you've worked with compiler frontends or semi-structured data processors like JSON or XML, you’ve likely implemented or maintained one.

Traditional parsers scan text character-by-character, producing tokens for the parser to match against a grammar. WideLips belongs to a different breed: hyper-optimized parsers that utilize SIMD instructions and advanced optimization techniques to achieve exceptional throughput. Better yet, WideLips isn't just a generic S-expression parser—it's a framework for building high-performance LISP parsers.

Wait, What Does the Name Mean?

The ‘Wide’ refers to the ‘wide’ words in VLIW (Very Long Instruction Word) architectures. The ‘Lips’ part? Well, LISP lists () arguably look like lips. 😊

The Problem: Serial Loops Leave Performance on the Table

Traditional LISP parsers follow a simple, familiar pattern:

while (pos < end) {
    char c = source[pos++];
    if (IsWhitespace(c)) SkipWhitespace();
    else if (IsIdentifier(c)) TokenizeIdentifier();
    /* ... rest of state checks here */
}

This works, but it is fundamentally serial. While modern CPUs fetch data in 64-byte cache lines (meaning we aren't waiting on slow DRAM for every byte), the loop still forces the CPU to execute a separate mov instruction for each character.

This creates a rigid dependency chain: the CPU cannot process iteration N+1 until it resolves the branch and state update of iteration N. Modern CPUs have wide decoders and deep Reorder Buffers (ROB) designed for Out-of-Order execution, but this serial dependency renders those features useless.

Many modern architectures like X86 (starting from Intel Haswell+, AMD Excavator+) feature 256-bit SIMD register files capable of processing 32 bytes simultaneously and depending on the microarchitecture it can dispatch multiple of those and execute them in parallel. Character-by-character loops utilize almost none of this capability.

The Insight: LISP's Regularity Enables High-Performance Indexing

LISP syntax is based on S-expressions, which are extraordinarily regular:

  • Parentheses: ( )

  • Whitespace: spaces, tabs, newlines

  • Symbols: alphanumeric + some punctuation

  • Numbers: digits, decimal points, signs

  • Comments: ; to end of line

  • Strings: "..." with escapes

By leveraging vectorized/SIMD classification, we can build indices based purely on the structure of LISP lists. A single index entry marks where a list starts (, where it ends ), and contains all necessary source location metadata.

Architecture: Two-Phase Blue/Green Pass Design

Figure 1: Two-phase pipeline separating structural analysis from tokenization

Phase 1: Blue Pass (Eager, Full-File Scan)

The Blue Pass runs once per file. It performs structural analysis (including both list and atom validation) without generating full tokens:

  1. SIMD Character Classification: Processes the file in 32-byte chunks, classifying characters (parentheses, whitespace, identifiers, etc.) using vectorized table lookups. This produces compact TokenizationBlock[] arrays where character classes are encoded as bitmasks.

  2. Validation: Maintains a stack to check for balanced parentheses and report structural errors immediately and it also validates atoms like string literals , integers, etc.. .

  3. S-Expression Indexing: Builds a sparse SExprIndex[] array mapping every opening parenthesis to its closing mate—essentially a map of where every S-expression lives.

The Blue Pass typically completes in single-digit milliseconds or less, even for multi-megabyte files.

Phase 2: Green Pass (Lazy, On-Demand)

Tokenization happens strictly on-demand via TokenizeSExpr() or TokenizeNext():

  1. On-Demand Tokenization: Using the SExprIndex[] from the Blue Pass, we tokenize only the requested S-expressions.

  2. SWAR Keyword Recognition: For identifiers ≤8 characters (covering most keywords like define, lambda, let), we use SIMD-Within-A-Register tricks to recognize keywords without standard string comparisons.

  3. Parse Tree Construction: Parse nodes are built only for the specific S-expressions accessed.

Why This Split Matters

Consider analyzing one function in a 50,000-line codebase:

  1. Blue Pass scans the full file (taking a few (wall clock) milliseconds for large multi MB file).

  2. We tokenize only that specific function.

  3. Everything else is skipped.

  4. We avoid expensive memory writes (memcpys) to store tokens that are never used.

For IDE tools, refactoring engines, and code analysis, this is transformative. For compilers, it allows time-sharing between parsing and semantic analysis, avoiding full tree expansion when early errors occur.

SIMD Deep Dive

Here is what almost happens under the hood:


// Table duplicated for high/low 128-bit lanes
__m256i sexprAndOpsTable { 
            '=','/','.','-',',','+','*',')',
            '(','\'','&','%','$','#',0,'!',
            '=','/','.','-',',','+','*',')',
            '(','\'','&','%','$','#',0,'!',
        };

// ... other tables ...

// The SIMD Loop 
// 1. Load 32 bytes
__m256i chunk = _mm256_loadu_si256((const __m256i*)(source + pos));

// 2. Hash (Range Reduction) and Lookup
// Subtract 0x30 to map ASCII chars to 0-15 index range
__m256i hashed = _mm256_subs_epu8(_mm256_set1_epi8(0x30), chunk);

// Shuffle uses low 4 bits for lookup. 
// If high bit (0x80) is set, result is 0.
__m256i looked_up = _mm256_shuffle_epi8(sexprAndOpsTable, hashed);

// 3. Compare and Mask
__m256i matches = _mm256_cmpeq_epi8(looked_up, chunk);
uint32_t paren_mask = _mm256_movemask_epi8(matches);

// ... (similar logic for identifiers) ...

// 4. Store in TokenizationBlock
blocks[block_idx] = TokenizationBlock {
    .FragmentsMask = whitespace_mask,
    .SExprAndOpsMask = paren_mask,
    .IdentifierMask = identifier_mask,
    // ...
};

The Mechanism: We use _mm256_shuffle_epi8 (vpshufb) as a parallel lookup table. However, vpshufb only supports 16-entry tables (indexed by the lower 4 bits of the byte).

To make this work for ASCII, we perform range reduction: the reduction differ between tables because we are dealing with different kind of ranges where elements values are not next to each other but in case of the code provided we do saturated subtract (mm256subs_epu8) 0x30 from the input chunk. This maps our target characters into the 0-15 range the shuffle instruction can handle. If a character is outside this range the byte at position N would naturally be one of the 0-15 values due to saturation, other hashing or range reduction functions may have different semantics like setting up the high-nibble which based on _mm256_shuffle_epi8 such byte will be zeroed out,

We then compare the looked-up values against the original chunk to verify the match! if the value at index N is of tokens that we are interested in ('$','+','-' etc...) then the byte at index N on both the original source chunk and the looked-up vector from before would be equal, the _mm256_cmpeq_epi8 will do this check and when equal the high-nibble of the byte at index N in the comparison vector would be set (0xF0) otherwise it's zero then we extract the mask .

//at the high level here is what is going

void Func(){
    char table[] {
    '=','/','.','-',',','+','*',')',
    '(','\'','&','%','$','#',0,'!'
    };
    char content[N] {'(','+',' ','1',' ','2',')','(','&',' ','2',' ','4',')'...}

    //the range reduction with saturated subtraction would be equivalent to something like 
    //let hashedTable = foreach element in content => element - 0x30
    const uint8_t hashedContent[] = {
        0x08, // '(' (48 - 40 = 8)
        0x05, // '+' (48 - 43 = 5)
        0x10, // ' ' (48 - 32 = 16) -> Note: 0x10 & 0x0F is index 0 ('=')
        0x00, // '1' (48 - 49 = -1 -> Clamps to 0)
        0x10, // ' '
        0x00, // '2' (48 - 50 = -2 -> Clamps to 0)
        0x07, // ')' (48 - 41 = 7)
        0x08, // '('
        0x0A, // '&' (48 - 38 = 10)
        0x10, // ' '
        0x00, // '2' (Clamps to 0)
        0x10, // ' '
        0x00, // '4' (Clamps to 0)
        0x07  // ')'
    };
    //now the lookup/vpshufb look conceptually like
    for(int i=0;i<vecWidth;++i){
        if (hashedContent[i] & 0x80) {
             resultVec[i] = 0; // vpshufb handles high bit set
        } else {
             resultVec[i] = table[hashedContent[i] & 0x0F];
        }
    }

    //then we do the equality comparison and extract the mask
}

This classification loop is branch-free and effectively saturates the CPU's vectorization units.

Figure 2: Parallel classification

Bitmask-Based State Machine

Once you have bitmasks, finding token boundaries becomes a series of lightweight bit-twiddling operations. Most m,odern CPUs execute tzcnt (count trailing zeros), popcnt, and bitwise instructions in 1–3 cycles.

// Find first '(' after position
uint32_t mask = paren_mask >> bit_offset;
int offset = __builtin_ctz(mask);  // Count trailing zeros - single instruction

The lexer loops through tokenization blocks. While a loop exists, it is canonical. Multiple profiling sessions showed the branch predictor had no issues with these simple loops—exactly what we expect from a high-end desktop CPU.

Memory Management: Zero-Copy + Arenas

Zero-Copy Token Design

This is a foundational optimization used by industry-standard compilers.

struct Token {
    const char* TextPtr;  // Points directly into source
    uint32_t Length;
    // ...
};

There is no allocation, no copying, and no string management. Even the token text is computed lazily (TextPtr + Length).

Arena Allocators

If SIMD classification is the first optimization, Arena allocation is the second. Dynamic memory allocation is a performance killer. To avoid it, the core lexer pre-allocates sufficient memory for its frequently used vectors (represented by MonoBumpVector). When memory is needed, we simply increment an internal pointer.

However, we cannot escape the cost of memory transfers (memcpys) from stack to arena. The good news is that the classification loop has no dependency on the data being transferred to the arena ,but it’s not always the case for other loops.

Note: MonoBumpVector only works with trivially-copyable and trivially-destructible structs. This is crucial because it allows us to use the built-in memcpy, which compilers highly optimize.

It looks roughly like this:

template<typename T>
class MonoBumpVector {
     ALWAYS_INLINE PointerType EmplaceBack(T&& element) noexcept {
            std::memcpy(++_pin, &element, sizeof(T));
            return _pin;
        }
};

Lazy Parsing: Only Pay for What You Use

After the Blue Pass, you have:

  1. TokenizationBlock[] encoding character classes.

  2. SExprIndex[] mapping S-expression locations.

The entry point is LispParseTree::Parse(...), which returns the parse tree root (the first List Expression). From there:

  • Call GetSubExpression(): This normalizes all atom tokens and marks nested lists for future visits. It returns the first token in the list.

  • Call NextNode(): Accesses sibling nodes. If the next node is a list, GetSubExpression hasn't been called on it yet, so its subtree is not yet created.

We use the tokenization blocks again in the Green Pass to determine token kind, but the actual token object is created only then. Because we know where the list starts and ends (thanks to the Blue Pass index), we can navigate lazily.

Crucial Detail: If a LispList node is const qualified, calling GetSubExpression() or NextNode() will recompute and allocate nodes, creating a fresh copy (preserving immutability). For unqualified nodes, the created sub-trees and sibling nodes are cached.

SWAR Keyword Recognition

For keywords ≤8 characters, we load the text as a 64-bit integer and compare:

const auto op = *reinterpret_cast<const uintptr_t*>(identifier.data());
// Mask out bytes beyond the identifier length
const auto operand = std::bit_cast<uintptr_t>(
    op & ~(UINTPTR_MAX << (identifier.size() << 3)));

constexpr uintptr_t let = /* "let" as uint64 */;
constexpr uintptr_t lambda = /* "lambda" as uint64 */;

switch (operand)
{
    case let: ... return LispTokenKind::Let;
    case lambda: ... return LispTokenKind::Lambda;
}

This makes comparison incredibly fast. The operand fits in an x64 general-purpose register, so the equality check costs a single cmp instruction. If the switch arms are ordered well, the compiler may even optimize it as a binary search.

Benchmarks

Unlike JSON or XML, there aren't many standard benchmarking datasets for S-expression languages. I created a suite covering various complexities.

Figure 3: Top 8 benchmarks showing mean and peak throughput

Figure 4: Complete throughput across all benchmarks

Figure 5: Size of each sample

Environment:

  • CPU: i9-13900K (Raptor Lake)

  • Compiler: LLVM Clang 21

  • OS: Windows 11 x64

Key Results:

  • LongSymbols: 10.61 GB/s mean (Ideal for SIMD).

  • 1GB AdjacentSExpressions: 6.49 GB/s mean (Demonstrates linear scaling).

  • DeeplyNested: 3.91 GB/s mean (High structural complexity).

A Note on Thread Scheduling: You might notice some variance between mean and peak throughput. During benchmarking, I verified that manually pinning the process to a specific Performance (P) core actually caused a significant slowdown.

It turns out that modern CPU hardware thread directors and the OS scheduler are far better at workload distribution than we are. The benchmarks presented here reflect the natural OS scheduling, which consistently outperformed manual pinning.

Why the variance? You might wonder about the gap between Quoted Expression and Long Symbols. In Quoted Expression (e.g., `(a b c 0)), every position is a token boundary. The state machine has to stop and start constantly—it can't skip blocks. It's effectively advancing byte-by-byte. Even in this "worst case," WideLips exceeds 1 GB/s throughout .

Conversely, Long Symbols allows the state machine to jump over many blocks in minimal iterations, reaching a throughput of 13+ GB/s. In the wild, you should expect 2.5–3 GB/s on similar hardware.

What Made the Difference

No single technique here is revolutionary. SIMD parsers, bump allocators, and lazy parsing have all existed for decades. What makes WideLips different is the compounding effect of combining them coherently:

  1. SIMD classification → Compact bitmasks.

  2. Bitmask state machine → Minimal branching; batched jumps.

  3. Fast structural validation → Enables safe lazy parsing.

  4. Lazy two-phase design → Pay only for what you use.

  5. Zero-copy tokens → No allocation overhead.

  6. Arena allocators → Minimal system memory management.

Remove any one of these, and performance drops. SIMD gains would be eaten by allocation overhead; lazy parsing wouldn't matter if the Blue Pass was slow.

Closing Thoughts

Building WideLips involved trial and error, dead ends, and 3 AM bug hunts and profiling. But the core lesson is this: there's huge headroom in "solved" problems when we give a fuck about performance.

WideLips may be a LISP specific parsing framework but if you are developing a new programming language or improving an existing one or anything that parse some text! then most if not all of the optimizations and paradigms used in WideLips can be imported to your parser.

Thank you for reading and happy coding ❤️.

More from this blog

Y

Yazan The Compiler Guy Blog

7 posts