SMUG is a library for parsing text, based on monadic parser combinators 1. Using a simple technique from the functional programming camp, SMUG makes it simple to create quick extensible recursive descent parsers without funky syntax or impenetrable macrology.

Features

  • parsers are first class values written in basic lisp
  • non-deterministic infinite look-ahead
  • easy to learn, easy to use
  • extensible input protocol : parse strings, streams, lists etc
  • in-depth tutorial : no previous experience required.

Tutorial and Technical Report

There is an extensive Tutorial available that starts from scratch and literately implements SMUG itself.

If that is not enough, the paper that SMUG is based on, Monadic Parser Combinators, has been updated with Common Lisp code. This is a Work In Progress and does not yet reflect the source code of SMUG itself, but it may.

Example : Parsing S-Expressions

SMUG itself is written using the Common Lisp style of "symbolic expression"s. Using smug to parse smug is circular enough to be lisp-y.

Definition

S-expression From Wikipedia, the free encyclopedia

Tree data structure representing the s-expression for (* 2 (+ 3 4)) In computing, s-expressions, sexprs or sexps (for "symbolic expression") are a notation for nested list (tree-structured) data, invented for and popularized by the programming language Lisp, which uses them for source code as well as data.

In the usual parenthesized syntax of Lisp, an s-expression is classically defined inductively as

  • an atom, or
  • an expression of the form (x . y) where x and y are s-expressions.

[…]

Most modern sexpr notations in addition use an abbreviated notation to represent lists in s-expressions, so that

(x y z) stands for

(x . (y . (z . NIL)))

Parser .SEXP

All an s-expression could contain is :

  • an atom
  • an expression.

The expression will now be known as a CONS, so really the parser is simply this.

(.or (.atom sexp) (.cons sexp))

A (.optional (.whitespace)) is added to the start, and the parser is complete.

(defun .sexp (&optional (sexp *sexp*))
  (.progn (.optional (.whitespace))
          (.or (.atom sexp) (.cons sexp))))

A Note on Recursion

A temptation could be to use .AND as depicted in the tutorial.

(defun .sexp (&optional (sexp *sexp*))
  (.and (.optional (.whitespace))
          (.or (.atom sexp) (.cons sexp))))

The issue is that .AND as defined is not a Special Operator, or even a macro, but a simple function. Now, in .CONS we call .SEXP recursively. The brain seems to think that it should simply work, but the INFO: Control stack guard page unprotected says something different.

To avoid infinite recursion, it is generally a good idea to have at least one .LET* binding before the recursion. .PROGN is a macro that contains one.

Structure Class SEXP

The definition itself has x and y, which themselves are s-expressions, and #\), #\(, #\Space and #\.. That can simply recorded in a struct.

(defstruct sexp
  (open #\()
  (separate #\.)
  (close #\))
  (whitespace '(#\Newline #\Space #\Tab)))

Variable *SEXP*

Most of the time "normal" s-expressions will be parsed., so a default special variable is on order.

(defparameter *sexp* (make-sexp))

Parser .whitespace

(defun .whitespace (&key (sexp *sexp*)
                         (result-type nil))
  (.first (.map result-type (.is 'member (sexp-whitespace sexp)))))

Parser .ATOM

(defun .atom (&optional (sexp *sexp*))
  (.or (|.#\\|) 
       (.first (.map 'string (.constituent sexp)))))

Parser |.#\\| : "Sharpsign Backslash"

Syntax: #\<<x>>

When the token x is a single character long, this parses as the literal character char. Uppercase and lowercase letters are distinguished after #\; #\A and #\a denote different character objects. Any single character works after #\, even those that are normally special to read, such as left-parenthesis and right-parenthesis.

In the single character case, the x must be followed by a non-constituent character. After #\ is read, the reader backs up over the slash and then reads a token, treating the initial slash as a single escape character (whether it really is or not in the current readtable).

When the token x is more than one character long, the x must have the syntax of a symbol with no embedded package markers. In this case, the sharpsign backslash notation parses as the character whose name is (string-upcase x);

(defun |.#\\| ()
  (.let* ((sb (.string= "#\\"))
          (char 
           (.or (.prog1 (.item) (.or (.whitespace) 
                                     (.not (.item))))
                (.first (.map 'string (.constituent))))))
    (.result (concatenate 'string sb (string char)))))
|\.

Parser .CONSTITUENT

constituent n., adj.

  1. a. n. :: the syntax type of a character that is part of a token. For details, see Section 2.1.4.1 (Constituent Characters).

    b. adj. (of a character) having the constituent[1a] syntax type.

    c. n. a constituent character.

http://www.lispworks.com/documentation/lw60/CLHS/Body/26_glo_c.htm#constituent

The SEXP struct says what is not a member of a sexp, so anything that is not not a member is double negatively a constituent character.

Strangely enough, what immediately arises adds another negative. The definition of the Structure Class SEXP includes non-constituent characters yet is itself an s-expression. As luck would have it, they are preceeded by a \ character.

By allowing that form of escaping, the .CONSTITUENT can be DEFUN'd.

(defun .constituent (&optional (sexp *sexp*))
 (.or (.and (.char= #\\) (.item))
      (.is-not 'member (list* (sexp-open sexp)
                              (sexp-close sexp)
                              (sexp-whitespace sexp)))))
\.CONSTITUENT

Parser .CONS

#+name : test-sexp.cons

(defun .cons (&optional (sexp *sexp*))
  (.let* ((car (.progn (.char= (sexp-open sexp))
                       (.sexp)))
          (dot (.optional 
                (.progn 
                 (.whitespace)
                 (.char= (sexp-separate sexp)))))
          (cdr (.prog1 (if dot 
                           (.sexp)
                           (.first (.map 'list (.sexp) 
                                         :at-least 0)))
                       (.char= #\)))))

    (.result (cons car cdr))))

Example

So, there is a struct definition sexp.

(defstruct sexp
  (open #\()
  (separate #\.)
  (close #\))
  (whitespace '(#\Newline #\Space #\Tab)))

When the CL:STRING actually make a out of it, we will see two backslashes where in reality there is only one.

(≻ (alexandria:read-file-into-string 
    (merge-pathnames "doc/test/sexp.lisp-expr"
       (asdf:system-source-directory :smug)))
 :=> 
"(defstruct sexp
  (open #\\()
  (separate #\\.)
  (close #\\))
  (whitespace '(#\\Newline #\\Space #\\Tab)))
")

This is simply because the \ is already used as a \.

(≻ (length "\\") 
  :=> 1)

Source File

(defpackage :smug/parse/sexp
  (:use :cl :smug))
(in-package :smug/parse/sexp)

(defstruct sexp
  (open #\()
  (separate #\.)
  (close #\))
  (whitespace '(#\Newline #\Space #\Tab)))

(defparameter *sexp* (make-sexp))

(defun .whitespace (&key (sexp *sexp*)
                         (result-type nil))
  (.first (.map result-type (.is 'member (sexp-whitespace sexp)))))

(defun |.#\\| ()
  (.let* ((sb (.string= "#\\"))
          (char 
           (.or (.prog1 (.item) (.or (.whitespace) 
                                     (.not (.item))))
                (.first (.map 'string (.constituent))))))
    (.result (concatenate 'string sb (string char)))))


(defun .constituent (&optional (sexp *sexp*))
 (.or (.and (.char= #\\) (.item))
      (.is-not 'member (list* (sexp-open sexp)
                              (sexp-close sexp)
                              (sexp-whitespace sexp)))))


(defun .atom (&optional (sexp *sexp*))
  (.or (|.#\\|) 
       (.first (.map 'string (.constituent sexp)))))



(defun .sexp (&optional (sexp *sexp*))
  (.progn (.optional (.whitespace))
          (.or (.atom sexp) (.cons sexp))))
(defpackage :smug/test/sexp
  (:use :cl :smug))
(in-package :smug/test/sexp)

(defstruct sexp
  (open #\()
  (separate #\.)
  (close #\))
  (whitespace '(#\Newline #\Space #\Tab)))

(defparameter *sexp* (make-sexp))

(defun .whitespace (&key (sexp *sexp*)
                         (result-type nil))
  (.first (.map result-type (.is 'member (sexp-whitespace sexp)))))

(defun |.#\\| ()
  (.let* ((sb (.string= "#\\"))
          (char 
           (.or (.prog1 (.item) (.or (.whitespace) 
                                     (.not (.item))))
                (.first (.map 'string (.constituent))))))
    (.result (concatenate 'string sb (string char)))))


(defun .constituent (&optional (sexp *sexp*))
 (.or (.and (.char= #\\) (.item))
      (.is-not 'member (list* (sexp-open sexp)
                              (sexp-close sexp)
                              (sexp-whitespace sexp)))))


(defun .atom (&optional (sexp *sexp*))
  (.or (|.#\\|) 
       (.first (.map 'string (.constituent sexp)))))



(defun .sexp (&optional (sexp *sexp*))
  (.progn (.optional (.whitespace))
          (.or (.atom sexp) (.cons sexp))))

Source Code

smug.lisp

(require 'org-id)
(org-babel-lob-ingest "tutorial.org")
(defpackage :smug/smug
  (:nicknames :smug)
  (:use :cl)
  (:export 
   #:.identity
   #:.fail
   #:.plus
   #:.item
   #:.bind

   #:input-empty-p
   #:input-first
   #:input-rest
   #:run
   #:parse  
   #:.let*


   #:.or
   #:.not

   #:.map
   #:.concatenate
   #:.is
   #:.is-not
   #:.char=
   #:.char-equal
   #:.string-equal
   #:.string=
   #:.progn
   #:.prog1
   #:.prog2
   #:.and
   #:.or
   #:.not
   #:.first
   #:.optional
   #:.read-line))
(in-package :smug/smug) 

;; * Monad 
;; Bind and Identity does a monad make.

;; ** /Parser Function/ Identity 

;; It is very simple, and can be inlined if needed.  

;; "The inline proclamation preceding the defun form ensures that the
;; compiler has the opportunity save the information necessary for
;; inline expansion, and the notinline proclamation following the
;; defun form prevents f from being expanded inline everywhere."

(declaim (inline .identity))
(defun .identity (value)
  (lambda (input)
    (list (cons value input))))
(declaim (notinline .identity))

;; ** BIND

(defun .bind (parser function)
  (lambda (input)
    (loop :for (value . input) :in (run parser input)
          :append (run (funcall function value) input))))

;; * let* : The sequential binding macro

(defmacro .let* (bindings &body body)
  (if bindings
      (let ((symbol (first (first bindings))))
        `(.bind ,@(cdr (first bindings))
               (lambda (,symbol)
                 ,@(when (or (string-equal (symbol-name symbol) "_")
                             (null (symbol-package symbol)))
                         `((declare (ignorable ,symbol))))
                 (.let* ,(cdr bindings)
                   ,@body))))
      `(progn ,@body)))

;; * Fail (aka zero) and Plus

;; Fail is also Zero when making this a Zero/Plus type. While this
;; does not really matter in this implementation, it is an "Interface"
;; type nonetheless.

(defun .fail ()
  (lambda (input) (declare (ignore input)) nil))

(defun .plus (first-parser second-parser)
  (lambda (input)
    (append (funcall first-parser input) (funcall second-parser input))))

;; * Run
;; Interface type for run-ables.

(defun run (parser input)
  (funcall parser input))

;; * Parser

(defgeneric input-empty-p (input)
  (:method ((input string)) (zerop (length input))))

(defgeneric input-first (input)
  (:method ((input string)) (aref input 0)))

(defgeneric input-rest (input)
  (:method ((input string))
    (multiple-value-bind (string displacement) 
        (array-displacement input)      
      (make-array (1- (length input))
                  :displaced-to (or string input)
                  :displaced-index-offset (1+ displacement)
                  :element-type (array-element-type input)))))

;; ** Item makes it a parser
(defun .item ()
  (lambda (input)
    (unless (input-empty-p input)
      (list (cons (input-first input)
                  (input-rest input))))))


;; ** Parse makes it VALUESd
(defun parse (parser input)
  (let ((result (run parser input)))
    (when result 
      (destructuring-bind ((result . input) &rest rest)
          result      
        (apply #'values result input rest)))))


(defun .or (parser &rest parsers)
  (lambda (input)
    (or (funcall parser input) 
        (when parsers 
          (funcall (apply #'.or parsers) input)))))

(defun .not (parser)
  (lambda (input)
    (let ((result (funcall parser input)))
      (if result
          nil
          (list (cons t input))))))








(defun .satisfies (predicate &rest args)
  (.bind (.item) 
        (lambda (x) 
          (if (apply predicate x args)
              (.identity x)
              (.fail)))))

(defun .optional (parser)
  (.or parser (.identity nil)))

(defun .and (p1 &rest ps)
  (.let* ((result p1))
    (if ps
        (apply #'.and ps)
        (.identity result))))


(defmacro .progn (&rest parsers)
    (if (rest parsers)
        (let ((name (gensym)))
          `(.let* ((,name ,(first parsers)))
             (.progn ,@(rest parsers))))
        (first parsers)))

(defmacro .prog1 (parser &rest parsers)
  (let ((name (gensym))
        (ignore (gensym)))
    `(.let* ((,name ,parser)
             (,ignore (.progn ,@parsers)))
       (.identity ,name))))

(defmacro .prog2 (parser1 parser2 &rest parsers)
  (let ((name (gensym))
        (ignore (gensym)))
    `(.let* ((,ignore ,parser1)
             (,name ,parser2)
             (,ignore (.progn ,@parsers)))
       (.identity ,name))))

(defun .is-not (predicate &rest args)
  (.satisfies (lambda (i) 
                 (cl:not (apply predicate i args)))))

(defun .is (predicate &rest args)
  (apply #'.satisfies predicate args))

(defun .mapcar (parser)
    (.plus (.let* ((x parser)
                   (xs (.mapcar parser)))
             (.identity (cons x xs)))
           (.identity ())))

(defun .mapc (parser)
    (.plus (.let* ((_ parser)
                   (_ (.mapc parser)))
             (.identity parser))
           (.identity parser)))

(defun .make-list (size &key (initial-element (.item)))
  (if (zerop size) 
      (.identity nil)
      (.let* ((first initial-element)
              (rest (.make-list (1- size) 
                                :initial-element initial-element)))
        (.identity (list* first rest)))))               

(defun .concatenate (output-type-spec &rest parsers)
  (if (not parsers)
      (.fail)
      (.let* ((first (first parsers))
              (rest (if (rest parsers)
                        (apply 
                         #'.concatenate output-type-spec (rest parsers))
                        (.identity nil))))
        (.identity (cl:concatenate output-type-spec first rest)))))


(defun .map (result-type parser
             &key 
               (at-least 1))
  "=> a ~result-type~ of /parser/ results."
  (.let* ((list-1 (.make-list at-least :initial-element parser))
          (list-2 (funcall (if result-type #'.mapcar #'.mapc) parser)))
    (.identity (when result-type (concatenate result-type list-1 list-2)))))

(defun .char= (x)
  (.is #'cl:char= x))

(defun .digit-char-p ()
  (.is #'cl:digit-char-p))

(defun .lower-case-p ()
  (.is #'cl:lower-case-p))

(defun .upper-case-p ()
  (.is #'cl:upper-case-p))  

(defun .read-line (&optional 
                     (eof-error-p t)
                     eof-value)
  (.let* ((text (.optional 
                 (.first (.map 'list (.is-not #'char= #\Newline)))))
          (newline (.or (.char= #\Newline)
                        (.and (.not (.item)) 
                              (.identity '())))))
    (if (or text newline)
        (.identity (concatenate 'string text (when newline (string newline))))
        (if eof-error-p 
            (.fail)
            (.identity eof-value)))))



(defun .string= (string)
  (if (string= string "")
      (.identity string)
      (.let* 
          ((_ (.is 'char= (aref string 0)))
           (_ (.string= (input-rest string))))
        (.identity string))))

(defun .char-equal (char)
  (.is #'cl:char-equal char))

(defun .string-equal (string)
  (labels ((%string-equal (string)
             (.let* ((first (.char-equal (aref string 0)))
                     (rest (if (> (length string) 1)
                                  (%string-equal (subseq string 1))
                                  (.identity nil))))
               (.identity (cons first rest)))))
    (.let* ((list (%string-equal string)))
      (.identity (coerce list 'string)))))

(defun .first (parser)
  (lambda (input)
    (let ((results (run parser input)))
       (when results (list (cl:first results))))))

footnotes

Footnotes:

1

Monadic parser combinators (pdf, ps, bibtex) Graham Hutton and Erik Meijer. Technical Report NOTTCS-TR-96-4, Department of Computer Science, University of Nottingham, 1996.

http://www.cs.nott.ac.uk/~gmh/bib.html#monparsing