An intro to nom

parsing made easy for Rustaceans

Rust Dublin Meetup 2024-03-06

Roberto Gambuzzi - Luciano Mammino

๐Ÿš€ย Principal Software Engineer here and there

Let's connect!ย linktr.ee/gambuzzi

Hello, I'm Roberto...

๐Ÿ’ค Lazy by nature, Pythonic by choice

๐Ÿฆ€ Old enough to be rusted

๐Ÿ‘‹ I'm Lucianoย (๐Ÿ‡ฎ๐Ÿ‡น๐Ÿ•๐Ÿ๐ŸคŒ)

๐Ÿ‘จโ€๐Ÿ’ป Senior Architect @ fourTheorem

๐Ÿ“” Co-Author of Node.js Design Patternsย  ๐Ÿ‘‰

Let's connect!ย linktr.ee/loige

What the heck is a parser? ๐Ÿค“

What the heck is a parser? ๐Ÿค“

A program that can turn text (or bytes) into structured information

What the heck is a parser? ๐Ÿค“

A program that can turn text (or bytes) into structured information

"Point: (22, 17, -11)"

struct Point3D {
  x: i32,
  y: i32,
  z: i32,
}

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"

label

remainder

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point:ย (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)
[numbers,] = split(remainder, ")", 2)

Naive approach

String split like a mad-man ๐Ÿคช

[label, remainder] = split(input, ": ", 2)
input = "Point: (22, 17, -11)"
[, remainder] = split(remainder, "(", 2)
[numbers,] = split(remainder, ")", 2)
[x, y, z] = split(numbers, ", ", 3)

Problems with this approach

  1. Not always suitable (you need "delimiters")
  2. Simple for simple stuff, increasingly hard for "realistic" use cases
  3. Hard to handle errors consistently

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/ /

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: /

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \(/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+)/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+)/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((-?\d+), (-?\d+), (-?\d+)\)$/

Capture groups

Using a Regex ๐Ÿง 

Point: (22, 17, -11)
/^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$/

Named capture groups

x

y

z

$ cargo add regex

yes, the Rust standard library doesn't have built-in support for regex (yet)! ๐Ÿคทโ€โ™€๏ธ

*

use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}
use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}
use regex::Regex;

#[derive(Debug)]
struct Point3D {
    x: i32,
    y: i32,
    z: i32,
}

fn main() {
    let re = Regex::new(r"^Point: \((?<x>-?\d+), (?<y>-?\d+), (?<z>-?\d+)\)$").unwrap();
    let caps = re.captures("Point: (22, 17, -11)").unwrap();

    let point = Point3D {
        x: caps["x"].parse().unwrap(),
        y: caps["y"].parse().unwrap(),
        z: caps["z"].parse().unwrap(),
    };

    println!("{:?}", point);
}

Problems with this approach

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

โ€” Jamie Zawinski

Problems with this approach

  • Regex are hard to debug:
    • it's either a full match (with captures)
    • or nothing!
  • Learning to write regex takes a lot of practice (possibly years!)
  • Regex might not be suitable for more advanced parsing cases
    • E.g. nested structures

Consuming parser

Tries to "eat and match" the source text (or bytes)

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

Parsed data

Remainder string

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point: (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point:ย (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

3. readSpaces(remainder)

" ", "(22, 17, -11)"

Consuming parser

Tries to "eat and match" the source text (or bytes)

input = "Point:ย (22, 17, -11)"

1. readLabel(input)

"Point", ": (22, 17, -11)"

2. readSeparator(remainder)

":", " (22, 17, -11)"

3. readSpaces(remainder)

" ", "(22, 17, -11)"

4. readIntTuple(remainder)

[22,17,-11], ""

Consuming parser

Handling errors: if we fail to match we have to return an error

input = "Hello World"

readNumber(input)

"Cannot match number on 'Hello World'"
$ cargo add nom
fn main() {
    let s = "Point: (22, 17, -11)";
    
    let (_, point) = parse_point(s).unwrap();
    println!("{:?}", point);
}
fn parse_point(input: &str) -> IResult<&str, Point3D> {
    // ... parsing logic goes here

    Ok((input, Point3D { x, y, z }))
}

the value to parse from

A nom Result type

Input type

Output type

fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    
    // ...

    Ok((input, Point3D { x, y, z }))
}

nom combinator: exact match

what to match

the value to match against

If it fails to match

we propagate the error

the parsed value

(in this case "Point: ")

the remainder string

Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;

    // ...

    Ok((input, Point3D { x, y, z }))
}
Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;

    // ...

    Ok((input, Point3D { x, y, z }))
}

combinator that parses a sign (+ or -) and sequence of digits and converts them to a i32

first coordinate

Point: (22, 17, -11)
fn parse_point(input: &str) -> IResult<&str, Point3D> {
	let (input, _) = tag("Point: ")(input)?;
    let (input, _) = tag("(")(input)?;
    let (input, x) = i32(input)?;
	let (input, _) = tag(", ")(input)?;
    let (input, y) = i32(input)?;
    let (input, _) = tag(", ")(input)?;
    let (input, z) = i32(input)?;
    let (input, _) = tag(")")(input)?;

    Ok((input, Point3D { x, y, z }))
}
Point: (22, 17, -11)

With nom you generally break parsers into smaller and reusable parsers ๐Ÿค

/// parses "," followed by 0 or more spaces
fn separator(input: &str) -> IResult<&str, ()> {
    let (input, _) = pair(tag(","), space0)(input)?;
    Ok((input, ()))
}

nom combinator that applies 2 parsers in sequence and returns a tuple with the 2 parsed values

/// parses 3 numbers separated by a separator (e.g. "22, 17, -11")
fn parse_coordinates(input: &str) -> IResult<&str, (i32, i32, i32)> {
    let (input, (x, _, y, _, z)) = tuple((
        i32,
        separator,
        i32,
        separator,
        i32
    ))(input)?;

    Ok((input, (x, y, z)))
}

nom combinator that applies a sequence of parsers in sequence and returns a tuple with the parsed values

Note how we are reusing the parser we just created!

fn parse_point2(input: &str) -> IResult<&str, Point3D> {
    let (input, _) = tag("Point: ")(input)?;
    let (input, (x, y, z)) = delimited(
        tag("("),
        parse_coordinates,
        tag(")")
    )(input)?;
    
    Ok((input, Point3D { x, y, z }))
}

We already learned about these useful nomย combinators:

  • tag: exact string match
  • i32: parse a number from text as i32
  • space0: matches 0 or more spaces
  • pair: apply 2 parsers in sequence
  • tuple: apply N parsers in sequence
  • delimited: combines 3 parsers to extract a value wrapped around 2 delimiters

Let's learn more by trying to parse RESP
the Redis Serialization Protocol ๐Ÿค 

RESP Primer (spec)

  • Binary protocol that uses control sequences encoded in standard ASCII
  • The \r\nย (CRLF) is the protocol's terminator, which always separates its parts
  • The first byte in a payload identifies its type. Subsequent bytes constitute the type's contents.
  • Different data types (simple, bulk, or aggregate)

RESP Primer (spec)

RESP data type First byte
Simple strings +
Simple Errors -
Integers :
Bulk strings $
Arrays *
Nulls _
Booleans #
Doubles ,
Big numbers (
Bulk errors !
Verbatim strings =
Maps %
Sets ~
Pushes >

Example messages

+OK\r\n
-ERR unknown command 'asdf'\r\n
:1000\r\n
$5\r\nhello\r\n
*2\r\n$5\r\nhello\r\n$5\r\nworld\r\n
,1.23\r\n
(3492890328409238509324850943850943825024385\r\n
%2\r\n+first\r\n:1\r\n+second\r\n:2\r\n

Simple string

Simple error

Integer

Bulk string

Array

Double

Big number

Map

["hello", "world"]
{"first": 1, "second": 2}
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)]
pub enum Value<'a> {
    SimpleString(&'a str),
    SimpleError(&'a str),
    Integer(i64),
    BulkString(&'a str),
    Array(Vec<Value<'a>>),
    Null,
    Boolean(bool),
    Double(String),
    BigNumber(&'a str),
    BulkError(&'a str),
    VerbatimString(&'a str, &'a str),
    Map(Vec<Value<'a>>, Vec<Value<'a>>),
    Set(BTreeSet<Value<'a>>),
    Pushes(Vec<Value<'a>>),
}

Recursive data type

pub fn parse_value(input: &str) -> IResult<&str, Value> {
    let (input, type_char) = one_of("+-:$*_#,(!=%~>")(input)?;
    let parser = match type_char {
        '+' => parse_simple_string,
        '-' => parse_simple_error,
        ':' => parse_integer,
        '$' => parse_bulk_string,
        '*' => parse_array,
        '_' => parse_null,
        '#' => parse_bool,
        ',' => parse_double,
        '(' => parse_bignumber,
        '!' => parse_bulk_error,
        '=' => parse_verbatim_string,
        '%' => parse_map,
        '~' => parse_set,
        '>' => parse_push,
        _ => unreachable!("Invalid type"),
    };

    parser(input)
}

Alt-ernative approach ๐Ÿ™„

pub fn parse_value(input: &str) -> IResult<&str, Value> {
    alt((
        parse_simple_string,
        parse_simple_error,
        parse_integer,
        parse_bulk_string,
        parse_array,
        parse_null,
        parse_bool,
        parse_double,
        parse_bignumber,
        parse_bulk_error,
        parse_verbatim_string,
        parse_map,
        parse_set,
        parse_pushes,
    ))(input)
}
fn crlf(input: &str) -> IResult<&str, &str> {
    tag("\r\n")(input)
}

fn parse_simple_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("+")(input)?;
    let (input, value) = terminated(
        take_while(|c| c != '\r' && c != '\n'), 
        crlf
    )(input)?;
    Ok((input, Value::SimpleString(value)))
}
+Simple String Example\r\n
fn parse_bulk_string(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("$")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, value) = terminated(
        take(length as usize),
        crlf
    )(input)?;
    Ok((input, Value::BulkString(value)))
}
$19\r\nBulk String Example\r\n
fn parse_array(input: &str) -> IResult<&str, Value> {
    let (input, _) = tag("*")(input)?;
    let (input, length) = terminated(u32, crlf)(input)?;
    let (input, values) = count(parse_value, length as usize)(input)?;
    Ok((input, Value::Array(values)))
}
*2\r\n$5\r\nExample\r\n$5\r\nArray\r\n

Other useful nomย combinators:

  • one_of: match one char from a list of allowed ones
  • terminated: applies a parser and then a second one on the remaining input but returns only the value of the first parser.
  • eof: checks that the remaining input has length 0
  • alt: tries a sequence of parsers until one matches
  • take_while: take characters until a given condition matches
  • take: takes N characters
  • count: repeats a parser N times and collects the results in a vector.

Can we parse binaryย file formats? ๐Ÿค”

UINT8[80]    โ€“ Header                 -     80 bytes
UINT32       โ€“ Number of triangles    -      4 bytes
foreach triangle                      - 50 bytes:
    REAL32[3] โ€“ Normal vector             - 12 bytes
    REAL32[3] โ€“ Vertex 1                  - 12 bytes
    REAL32[3] โ€“ Vertex 2                  - 12 bytes
    REAL32[3] โ€“ Vertex 3                  - 12 bytes
    UINT16    โ€“ Attribute byte count      -  2 bytes
end

STL file format

๐Ÿคทโ€โ™‚๏ธ Did you know there's a monumentย of this teapot in Dublin?

use std::{fs, io};

use nom::bytes::complete::take;
use nom::multi::count;
use nom::number::complete::{le_f32, le_u16, le_u32};
use nom::sequence::tuple;
use nom::IResult;

#[derive(Debug)]
struct Triangle {
    normal_vector: [f32; 3],   // REAL32[3] - Normal vector
    vertex_1: [f32; 3],        // REAL32[3] - Vertex 1
    vertex_2: [f32; 3],        // REAL32[3] - Vertex 2
    vertex_3: [f32; 3],        // REAL32[3] - Vertex 3
    attribute_byte_count: u16, // UINT16 - Attribute byte count
}

#[derive(Debug)]
pub struct StlFile<'a> {
    header: &'a [u8],         // UINT8[80] - Header
    number_of_triangles: u32, // UINT32 - Number of triangles
                              // little endian
    triangles: Vec<Triangle>, // Variable number of triangles
}

fn parse_triangle(input: &[u8]) -> IResult<&[u8], Triangle> {
    let mut three_floats = tuple((le_f32, le_f32, le_f32));
    let (input, normal_vector) = three_floats(input)?;
    let (input, vertex_1) = three_floats(input)?;
    let (input, vertex_2) = three_floats(input)?;
    let (input, vertex_3) = three_floats(input)?;
    let (input, attribute_byte_count) = le_u16(input)?;
    Ok((
        input,
        Triangle {
            normal_vector: [normal_vector.0, normal_vector.1, 
                            normal_vector.2],
            vertex_1: [vertex_1.0, vertex_1.1, vertex_1.2],
            vertex_2: [vertex_2.0, vertex_2.1, vertex_2.2],
            vertex_3: [vertex_3.0, vertex_3.1, vertex_3.2],
            attribute_byte_count,
        },
    ))
}

pub fn parse_stl(data: &[u8]) -> IResult<&[u8], StlFile> {
    let (data, header) = take(80usize)(data)?;
    let (data, number_of_triangles) = le_u32(data)?;
    let (data, triangles) = count(parse_triangle,
                            number_of_triangles as usize)(data)?;
    Ok((
        data,
        StlFile {
            header,
            number_of_triangles,
            triangles,
        },
    ))
}

fn main() -> io::Result<()> {
    let file_path = "example.stl";
    let data: Vec<u8> = fs::read(file_path)?;
    let bin_data: &[u8] = &data;
    let (_, stl_file) = parse_stl(bin_data).unwrap();
    dbg!("{:?}", stl_file);

    Ok(())
}

Result

[src/main.rs:61:5] stl_file = StlFile {
    header: [
        34,
        74,
        101,
        119,
        101,
    ...
        0,
        0,
        0,
        0,
        0,
        0,
    ],
    number_of_triangles: 2466,
    triangles: [
        Triangle {
            normal_vector: [
                -0.0,
                0.0,
                1.0,
            ],
            vertex_1: [
                146.41861,
                -0.44428897,
                0.0,
            ],
            vertex_2: [
                151.41861,
                -0.44428897,
                0.0,
            ],
            vertex_3: [
                146.56938,
                0.4107614,
                0.0,
            ],
            attribute_byte_count: 0,
        },
        Triangle {
...

It's a wrap! ๐ŸŒฏ

  • If you need to do parsing, string splitting is an option but it gets messy really quick
  • Regex can be a better approach, but they are hard to learn, debug, and to get right
  • `nom` offers a more interesting approach through parser combinators
  • Once you learn how to create a parser function and some of the combinators available you are good to go!
  • Nom also allows you to parse binary data!

Bonus content ๐Ÿ˜ฑ

Why is it called nom? ๐Ÿค”

Why is it called nom? ๐Ÿค”

Bonus content ๐Ÿ˜ฑ

nom will happily take a byte out of your files :)

Alternative parser combinator crates

  • winnow: fork of nom. It seems to be updated more frequently and the team claims they are working to provide a better DX.
  • chumsky: Better error reporting and error recovery. Favors a function chaining style rather than composition.

Links

THANK YOU!

Questions?