SMUG
is a library for parsing text, based on monadic parser
combinators 1. Using a simple technique from the functional
programming camp, SMUG
makes it simple to create quick extensible
recursive descent parsers without funky syntax or impenetrable
macrology.
Features
- parsers are first class values written in basic lisp
- non-deterministic infinite look-ahead
- easy to learn, easy to use
- extensible input protocol : parse strings, streams, lists etc
- in-depth tutorial : no previous experience required.
Tutorial and Technical Report
There is an extensive Tutorial available that starts from
scratch and literately implements SMUG
itself.
If that is not enough, the paper that SMUG
is based on, Monadic
Parser Combinators, has been updated with Common Lisp code. This is a
Work In Progress and does not yet reflect the source code of SMUG
itself, but it may.
Example : Parsing S-Expressions
SMUG
itself is written using the Common Lisp style of "symbolic
expression"s. Using smug
to parse smug
is circular enough to be
lisp-y.
Definition
S-expression From Wikipedia, the free encyclopedia
Tree data structure representing the s-expression for (* 2 (+ 3 4)) In computing, s-expressions, sexprs or sexps (for "symbolic expression") are a notation for nested list (tree-structured) data, invented for and popularized by the programming language Lisp, which uses them for source code as well as data.
In the usual parenthesized syntax of Lisp, an s-expression is classically defined inductively as
- an atom, or
- an expression of the form
(x . y)
wherex
andy
are s-expressions.[…]
Most modern sexpr notations in addition use an abbreviated notation to represent lists in s-expressions, so that
(x y z)
stands for
(x . (y . (z . NIL)))
Parser .SEXP
All an s-expression could contain is :
- an atom
- an expression.
The expression will now be known as a CONS
, so really the parser is
simply this.
(.or (.atom sexp) (.cons sexp))
A (.optional (.whitespace))
is added to the start, and the parser is
complete.
(defun .sexp (&optional (sexp *sexp*)) (.progn (.optional (.whitespace)) (.or (.atom sexp) (.cons sexp))))
A Note on Recursion
A temptation could be to use .AND
as depicted in the
tutorial.
(defun .sexp (&optional (sexp *sexp*)) (.and (.optional (.whitespace)) (.or (.atom sexp) (.cons sexp))))
The issue is that .AND
as defined is not a Special Operator, or
even a macro, but a simple function. Now, in .CONS
we call .SEXP
recursively. The brain seems to think that it should simply work, but
the INFO: Control stack guard page unprotected
says something
different.
To avoid infinite recursion, it is generally a good idea to have at
least one .LET*
binding before the recursion. .PROGN
is a macro
that contains one.
Structure Class SEXP
The definition itself has x
and y
, which themselves are
s-expressions, and #\)
, #\(
, #\Space
and #\.
. That can simply
recorded in a struct.
(defstruct sexp (open #\() (separate #\.) (close #\)) (whitespace '(#\Newline #\Space #\Tab)))
Variable *SEXP*
Most of the time "normal" s-expressions will be parsed., so a default special variable is on order.
(defparameter *sexp* (make-sexp))
Parser .whitespace
(defun .whitespace (&key (sexp *sexp*) (result-type nil)) (.first (.map result-type (.is 'member (sexp-whitespace sexp)))))
Parser .ATOM
(defun .atom (&optional (sexp *sexp*)) (.or (|.#\\|) (.first (.map 'string (.constituent sexp)))))
Parser |.#\\|
: "Sharpsign Backslash"
Syntax:
#\<<x>>
When the token x is a single character long, this parses as the literal character char. Uppercase and lowercase letters are distinguished after #\; #\A and #\a denote different character objects. Any single character works after #\, even those that are normally special to read, such as left-parenthesis and right-parenthesis.
In the single character case, the x must be followed by a non-constituent character. After #\ is read, the reader backs up over the slash and then reads a token, treating the initial slash as a single escape character (whether it really is or not in the current readtable).
When the token x is more than one character long, the x must have the syntax of a symbol with no embedded package markers. In this case, the sharpsign backslash notation parses as the character whose name is (string-upcase x);
(defun |.#\\| () (.let* ((sb (.string= "#\\")) (char (.or (.prog1 (.item) (.or (.whitespace) (.not (.item)))) (.first (.map 'string (.constituent)))))) (.result (concatenate 'string sb (string char)))))
|\.
Parser .CONSTITUENT
constituent n., adj.
- a. n. :: the syntax type of a character that is part of a token. For details, see Section 2.1.4.1 (Constituent Characters).
b. adj. (of a character) having the constituent[1a] syntax type.
c. n. a constituent character.
– http://www.lispworks.com/documentation/lw60/CLHS/Body/26_glo_c.htm#constituent
The SEXP
struct says what is not a member of a sexp, so anything
that is not not a member is double negatively a constituent
character.
Strangely enough, what immediately arises adds another negative. The
definition of the Structure Class SEXP
includes non-constituent characters
yet is itself an s-expression. As luck would have it, they are
preceeded by a \
character.
By allowing that form of escaping, the .CONSTITUENT
can be DEFUN
'd.
(defun .constituent (&optional (sexp *sexp*)) (.or (.and (.char= #\\) (.item)) (.is-not 'member (list* (sexp-open sexp) (sexp-close sexp) (sexp-whitespace sexp)))))
\.CONSTITUENT
Parser .CONS
#+name : test-sexp.cons
(defun .cons (&optional (sexp *sexp*)) (.let* ((car (.progn (.char= (sexp-open sexp)) (.sexp))) (dot (.optional (.progn (.whitespace) (.char= (sexp-separate sexp))))) (cdr (.prog1 (if dot (.sexp) (.first (.map 'list (.sexp) :at-least 0))) (.char= #\))))) (.result (cons car cdr))))
Example
So, there is a struct definition sexp.
(defstruct sexp (open #\() (separate #\.) (close #\)) (whitespace '(#\Newline #\Space #\Tab)))
When the CL:STRING
actually make a out of it, we will see two
backslashes where in reality there is only one.
(≻ (alexandria:read-file-into-string (merge-pathnames "doc/test/sexp.lisp-expr" (asdf:system-source-directory :smug))) :=> "(defstruct sexp (open #\\() (separate #\\.) (close #\\)) (whitespace '(#\\Newline #\\Space #\\Tab))) ")
This is simply because the \
is already used as a \
.
(≻ (length "\\") :=> 1)
Source File
(defpackage :smug/parse/sexp (:use :cl :smug)) (in-package :smug/parse/sexp) (defstruct sexp (open #\() (separate #\.) (close #\)) (whitespace '(#\Newline #\Space #\Tab))) (defparameter *sexp* (make-sexp)) (defun .whitespace (&key (sexp *sexp*) (result-type nil)) (.first (.map result-type (.is 'member (sexp-whitespace sexp))))) (defun |.#\\| () (.let* ((sb (.string= "#\\")) (char (.or (.prog1 (.item) (.or (.whitespace) (.not (.item)))) (.first (.map 'string (.constituent)))))) (.result (concatenate 'string sb (string char))))) (defun .constituent (&optional (sexp *sexp*)) (.or (.and (.char= #\\) (.item)) (.is-not 'member (list* (sexp-open sexp) (sexp-close sexp) (sexp-whitespace sexp))))) (defun .atom (&optional (sexp *sexp*)) (.or (|.#\\|) (.first (.map 'string (.constituent sexp))))) (defun .sexp (&optional (sexp *sexp*)) (.progn (.optional (.whitespace)) (.or (.atom sexp) (.cons sexp))))
(defpackage :smug/test/sexp (:use :cl :smug)) (in-package :smug/test/sexp) (defstruct sexp (open #\() (separate #\.) (close #\)) (whitespace '(#\Newline #\Space #\Tab))) (defparameter *sexp* (make-sexp)) (defun .whitespace (&key (sexp *sexp*) (result-type nil)) (.first (.map result-type (.is 'member (sexp-whitespace sexp))))) (defun |.#\\| () (.let* ((sb (.string= "#\\")) (char (.or (.prog1 (.item) (.or (.whitespace) (.not (.item)))) (.first (.map 'string (.constituent)))))) (.result (concatenate 'string sb (string char))))) (defun .constituent (&optional (sexp *sexp*)) (.or (.and (.char= #\\) (.item)) (.is-not 'member (list* (sexp-open sexp) (sexp-close sexp) (sexp-whitespace sexp))))) (defun .atom (&optional (sexp *sexp*)) (.or (|.#\\|) (.first (.map 'string (.constituent sexp))))) (defun .sexp (&optional (sexp *sexp*)) (.progn (.optional (.whitespace)) (.or (.atom sexp) (.cons sexp))))
Source Code
smug.lisp
(require 'org-id) (org-babel-lob-ingest "tutorial.org")
(defpackage :smug/smug (:nicknames :smug) (:use :cl) (:export #:.identity #:.fail #:.plus #:.item #:.bind #:input-empty-p #:input-first #:input-rest #:run #:parse #:.let* #:.or #:.not #:.map #:.concatenate #:.is #:.is-not #:.char= #:.char-equal #:.string-equal #:.string= #:.progn #:.prog1 #:.prog2 #:.and #:.or #:.not #:.first #:.optional #:.read-line)) (in-package :smug/smug) ;; * Monad ;; Bind and Identity does a monad make. ;; ** /Parser Function/ Identity ;; It is very simple, and can be inlined if needed. ;; "The inline proclamation preceding the defun form ensures that the ;; compiler has the opportunity save the information necessary for ;; inline expansion, and the notinline proclamation following the ;; defun form prevents f from being expanded inline everywhere." (declaim (inline .identity)) (defun .identity (value) (lambda (input) (list (cons value input)))) (declaim (notinline .identity)) ;; ** BIND (defun .bind (parser function) (lambda (input) (loop :for (value . input) :in (run parser input) :append (run (funcall function value) input)))) ;; * let* : The sequential binding macro (defmacro .let* (bindings &body body) (if bindings (let ((symbol (first (first bindings)))) `(.bind ,@(cdr (first bindings)) (lambda (,symbol) ,@(when (or (string-equal (symbol-name symbol) "_") (null (symbol-package symbol))) `((declare (ignorable ,symbol)))) (.let* ,(cdr bindings) ,@body)))) `(progn ,@body))) ;; * Fail (aka zero) and Plus ;; Fail is also Zero when making this a Zero/Plus type. While this ;; does not really matter in this implementation, it is an "Interface" ;; type nonetheless. (defun .fail () (lambda (input) (declare (ignore input)) nil)) (defun .plus (first-parser second-parser) (lambda (input) (append (funcall first-parser input) (funcall second-parser input)))) ;; * Run ;; Interface type for run-ables. (defun run (parser input) (funcall parser input)) ;; * Parser (defgeneric input-empty-p (input) (:method ((input string)) (zerop (length input)))) (defgeneric input-first (input) (:method ((input string)) (aref input 0))) (defgeneric input-rest (input) (:method ((input string)) (multiple-value-bind (string displacement) (array-displacement input) (make-array (1- (length input)) :displaced-to (or string input) :displaced-index-offset (1+ displacement) :element-type (array-element-type input))))) ;; ** Item makes it a parser (defun .item () (lambda (input) (unless (input-empty-p input) (list (cons (input-first input) (input-rest input)))))) ;; ** Parse makes it VALUESd (defun parse (parser input) (let ((result (run parser input))) (when result (destructuring-bind ((result . input) &rest rest) result (apply #'values result input rest))))) (defun .or (parser &rest parsers) (lambda (input) (or (funcall parser input) (when parsers (funcall (apply #'.or parsers) input))))) (defun .not (parser) (lambda (input) (let ((result (funcall parser input))) (if result nil (list (cons t input)))))) (defun .satisfies (predicate &rest args) (.bind (.item) (lambda (x) (if (apply predicate x args) (.identity x) (.fail))))) (defun .optional (parser) (.or parser (.identity nil))) (defun .and (p1 &rest ps) (.let* ((result p1)) (if ps (apply #'.and ps) (.identity result)))) (defmacro .progn (&rest parsers) (if (rest parsers) (let ((name (gensym))) `(.let* ((,name ,(first parsers))) (.progn ,@(rest parsers)))) (first parsers))) (defmacro .prog1 (parser &rest parsers) (let ((name (gensym)) (ignore (gensym))) `(.let* ((,name ,parser) (,ignore (.progn ,@parsers))) (.identity ,name)))) (defmacro .prog2 (parser1 parser2 &rest parsers) (let ((name (gensym)) (ignore (gensym))) `(.let* ((,ignore ,parser1) (,name ,parser2) (,ignore (.progn ,@parsers))) (.identity ,name)))) (defun .is-not (predicate &rest args) (.satisfies (lambda (i) (cl:not (apply predicate i args))))) (defun .is (predicate &rest args) (apply #'.satisfies predicate args)) (defun .mapcar (parser) (.plus (.let* ((x parser) (xs (.mapcar parser))) (.identity (cons x xs))) (.identity ()))) (defun .mapc (parser) (.plus (.let* ((_ parser) (_ (.mapc parser))) (.identity parser)) (.identity parser))) (defun .make-list (size &key (initial-element (.item))) (if (zerop size) (.identity nil) (.let* ((first initial-element) (rest (.make-list (1- size) :initial-element initial-element))) (.identity (list* first rest))))) (defun .concatenate (output-type-spec &rest parsers) (if (not parsers) (.fail) (.let* ((first (first parsers)) (rest (if (rest parsers) (apply #'.concatenate output-type-spec (rest parsers)) (.identity nil)))) (.identity (cl:concatenate output-type-spec first rest))))) (defun .map (result-type parser &key (at-least 1)) "=> a ~result-type~ of /parser/ results." (.let* ((list-1 (.make-list at-least :initial-element parser)) (list-2 (funcall (if result-type #'.mapcar #'.mapc) parser))) (.identity (when result-type (concatenate result-type list-1 list-2))))) (defun .char= (x) (.is #'cl:char= x)) (defun .digit-char-p () (.is #'cl:digit-char-p)) (defun .lower-case-p () (.is #'cl:lower-case-p)) (defun .upper-case-p () (.is #'cl:upper-case-p)) (defun .read-line (&optional (eof-error-p t) eof-value) (.let* ((text (.optional (.first (.map 'list (.is-not #'char= #\Newline))))) (newline (.or (.char= #\Newline) (.and (.not (.item)) (.identity '()))))) (if (or text newline) (.identity (concatenate 'string text (when newline (string newline)))) (if eof-error-p (.fail) (.identity eof-value))))) (defun .string= (string) (if (string= string "") (.identity string) (.let* ((_ (.is 'char= (aref string 0))) (_ (.string= (input-rest string)))) (.identity string)))) (defun .char-equal (char) (.is #'cl:char-equal char)) (defun .string-equal (string) (labels ((%string-equal (string) (.let* ((first (.char-equal (aref string 0))) (rest (if (> (length string) 1) (%string-equal (subseq string 1)) (.identity nil)))) (.identity (cons first rest))))) (.let* ((list (%string-equal string))) (.identity (coerce list 'string))))) (defun .first (parser) (lambda (input) (let ((results (run parser input))) (when results (list (cl:first results))))))
footnotes
Footnotes:
Monadic parser combinators (pdf, ps, bibtex) Graham Hutton and Erik Meijer. Technical Report NOTTCS-TR-96-4, Department of Computer Science, University of Nottingham, 1996.