sedlex

Unicode-friendly lexer generator for OCaml.

This package is licensed by LexiFi under the terms of the MIT license.

sedlex was originally written by Alain Frisch <alain.frisch@lexifi.com> and is now maintained as part of the ocaml-community repositories on GitHub.

API

The API is documented here.

Overview

sedlex is a lexer generator for OCaml, similar to ocamllex, but supporting Unicode. Contrary to ocamllex, lexer specifications for sedlex are embedded in regular OCaml source files.

The lexers work with a new kind of "lexbuf", similar to ocamllex Lexing lexbufs, but designed to support Unicode, and abstracting from a specific encoding. A single lexer can work with arbitrary encodings of the input stream.

sedlex is the successor of the ulex project. Unlike ulex, which was implemented as a Camlp4 syntax extension, sedlex is based on the "-ppx" technology of OCaml, which allows rewriting OCaml parse trees through external rewriters. (And what better name than "sed" for a rewriter?)

Like any -ppx rewriter, sedlex does not touch the concrete syntax of the language: lexer specifications are written in source files that comply with the standard grammar of OCaml programs. sedlex reuses the syntax for pattern matching in order to describe lexers (regular expressions are encoded within OCaml patterns). A nice consequence is that your editor (vi, emacs, ...) won't get confused (indentation, coloring) and you don't need to learn new priority rules. Moreover, sedlex is compatible with any front-end parsing technology: it works fine even if you use camlp4 or camlp5, with the standard or revised syntax.

Lexer specifications

sedlex adds a new kind of expression to OCaml: lexer definitions. The syntax for the new construction is:

  match%sedlex lexbuf with
  | R1 -> e1
  ...
  | Rn -> en
  | _  -> def

or:

  [%sedlex match lexbuf with 
  | R1 -> e1
  ...
  | Rn -> en
  | _  -> def
  ]

(The first vertical bar is optional as in any OCaml pattern matching. Guard expressions are not allowed.)

where:

lexbuf is an arbitrary lowercase identifier, which must refer to an existing value of type Sedlexing.lexbuf.
the Ri are regular expressions (see below);
the ei and def are OCaml expressions (called actions) of the same type (the type for the whole lexer definition).

Unlike ocamllex, lexers work on streams of Unicode code points, not bytes.

Like ocamllex, sedlex uses longest match with first rule priority:

The lexer always tries to match the longest possible prefix of the input. It does so by continuing to read characters as long as some rule can still match a longer string, while remembering the last position at which a rule did match.
When two or more rules match the same longest prefix (a tie), the rule that appears first in the match%sedlex definition wins. For example, given the rules | "if" -> ... and | Plus ('a'..'z') -> ..., the input "if" is matched by the first rule because it is listed first, even though the second rule also accepts "if".

The actions can call functions from the Sedlexing module to extract (parts of) the matched lexeme, in the desired encoding.
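The longest-match and first-rule-priority behaviour described above can be sketched as follows. This is a minimal illustration, assuming the file is compiled with the sedlex ppx; the token type and the function names token and tokens_of_string are made up for the example and are not part of sedlex.

```ocaml
(* Illustrative token type; not provided by sedlex. *)
type token = IF | IDENT of string | EOF

let rec token buf =
  match%sedlex buf with
  | "if" -> IF (* listed first, so it wins when both rules match "if" *)
  | Plus ('a' .. 'z') -> IDENT (Sedlexing.Utf8.lexeme buf)
  | Plus (' ' | '\t' | '\n') -> token buf (* skip blanks *)
  | eof -> EOF
  | _ -> failwith "unexpected character"

(* Run the lexer over a UTF-8 string and collect tokens until EOF. *)
let tokens_of_string s =
  let buf = Sedlexing.Utf8.from_string s in
  let rec loop acc =
    match token buf with
    | EOF -> List.rev acc
    | t -> loop (t :: acc)
  in
  loop []
```

On the input "if iffy", the first word is a tie between the two rules and is resolved in favour of IF, while for the second word the identifier rule matches a longer prefix ("iffy") than the keyword rule ("if") and therefore wins.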

Regular expressions are syntactically OCaml patterns:

"...." (string constant): recognize the specified string.
'....' (character constant) : recognize the specified character
i (integer constant) : recognize the specified codepoint
'...' .. '...': character range
i1 .. i2: range between two codepoints
R1 | R2 : alternation
R1, R2, ..., Rn : concatenation
Star R : Kleene star (0 or more repetition)
Plus R : equivalent to R, R*
Opt R : equivalent to ("" | R)
Rep (R, n) : equivalent to R{n}
Rep (R, n .. m) : equivalent to R{n, m}
Chars "..." : recognize any character in the string
Compl R : assume that R is a single-character length regexp (see below) and recognize the complement set
Sub (R1,R2) : assume that R1 and R2 are single-character length regexps (see below) and recognize the set of items in R1 but not in R2 ("subtract")
Intersect (R1,R2) : assume that R1 and R2 are single-character length regexps (see below) and recognize the set of items which are in both R1 and R2
Utf8 R : string literals inside R are assumed to be utf-8 encoded.
Latin1 R : string literals inside R are assumed to be latin1 encoded.
Ascii R : string literals inside R are assumed to be ascii encoded.
lid (lowercase identifier) : reference a named regexp (see below)

A single-character length regexp is a regexp which does not contain (after expansion of references) concatenation, Star, Plus, Opt or string constants with a length different from one.

Note:

The OCaml source is assumed to be encoded in UTF-8.
String and character literals will be interpreted in ASCII unless otherwise specified by the Latin1, Ascii and Utf8 constructors in patterns.
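As a rough illustration of how several of the constructors listed above combine, here is a small sketch that classifies the longest prefix of a string. The function name classify and the returned labels are invented for this example; only the pattern constructors and the Sedlexing calls come from sedlex.

```ocaml
(* Returns a label for the rule that matches the longest prefix of [s]. *)
let classify s =
  let buf = Sedlexing.Utf8.from_string s in
  match%sedlex buf with
  (* concatenation of a string constant and a repeated character class *)
  | "0x", Plus ('0' .. '9' | 'a' .. 'f' | 'A' .. 'F') -> "hex"
  (* Rep (R, n): exactly n repetitions *)
  | Rep ('0' .. '9', 4), '-', Rep ('0' .. '9', 2) -> "year-month"
  (* Opt R: zero or one occurrence *)
  | Opt '-', Plus ('0' .. '9') -> "int"
  (* Compl R: any single character except a double quote *)
  | '"', Star (Compl '"'), '"' -> "string"
  (* Sub (R1, R2): lowercase ASCII letters minus the vowels *)
  | Plus (Sub ('a' .. 'z', Chars "aeiou")) -> "consonants"
  | _ -> "other"
```

Note that Compl and Sub are applied only to single-character regexps here (a character, a range, or a Chars set), in line with the restriction stated above.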

Named regular expressions

You can give names to regular expressions with [%sedlex.regexp? ...] and reference them by name in lexer rules.

Top-level definitions are visible for the rest of the module:

let digit = [%sedlex.regexp? '0' .. '9']
let number = [%sedlex.regexp? Plus digit]

let rec token buf =
  match%sedlex buf with
  | number -> INT (Sedlexing.Utf8.lexeme buf)
  | _ -> ...

Local definitions with let ... in are scoped to the body expression:

let hex_digit =
  let digit = [%sedlex.regexp? '0' .. '9'] in
  let hex_letter = [%sedlex.regexp? 'a' .. 'f' | 'A' .. 'F'] in
  [%sedlex.regexp? digit | hex_letter]

Local definitions also work inside expressions:

let token buf =
  let int_lit =
    let digit = [%sedlex.regexp? '0' .. '9'] in
    [%sedlex.regexp? Plus digit]
  in
  match%sedlex buf with
  | int_lit -> ...
  | _ -> ...

Predefined regexps

sedlex provides a set of predefined regexps:

any: any character
eof: the virtual end-of-file character
xml_letter, xml_digit, xml_extender, xml_base_char, xml_ideographic, xml_combining_char, xml_blank: as defined by the XML recommendation
tr8876_ident_char: characters allowed in identifiers, from ISO TR8876
cc, cf, cn, co, cs, ll, lm, lo, lt, lu, mc, me, mn, nd, nl, no, pc, pd, pe, pf, pi, po, ps, sc, sk, sm, so, zl, zp, zs: as defined by the Unicode standard (categories)
alphabetic, ascii_hex_digit, hex_digit, id_continue, id_start, lowercase, math, other_alphabetic, other_lowercase, other_math, other_uppercase, uppercase, white_space, xid_continue, xid_start: as defined by the Unicode standard (properties)
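The predefined regexps can be used like any named regexp. The following sketch, which assumes the sedlex ppx is enabled, extracts Unicode identifiers from a UTF-8 string using id_start, id_continue, white_space, any and eof. The names ident and identifiers are hypothetical helpers introduced for the example.

```ocaml
(* An identifier: one id_start code point followed by id_continue ones. *)
let ident = [%sedlex.regexp? id_start, Star id_continue]

(* Collect every identifier found in a UTF-8 encoded string. *)
let identifiers s =
  let buf = Sedlexing.Utf8.from_string s in
  let rec loop acc =
    match%sedlex buf with
    | ident -> loop (Sedlexing.Utf8.lexeme buf :: acc)
    | Plus white_space -> loop acc (* skip blanks *)
    | eof -> List.rev acc
    | any -> loop acc (* skip any other single code point *)
    | _ -> List.rev acc
  in
  loop []
```

Because the rules are defined over code points rather than bytes, accented letters such as U+00E9 are treated as ordinary identifier characters without any encoding-specific handling in the lexer itself.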

Running a lexer

See the interface of the Sedlexing module for a description of how to create lexbuf values (from strings, streams or channels encoded in Latin1, UTF-8 or UTF-16, or from integer arrays or streams representing Unicode code points).
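As a small sketch of the point that a single lexer is independent of the input encoding, the same function below is run on lexbufs built from a UTF-8 string, a Latin1 string and an array of code points. It assumes Sedlexing.Utf8.from_string, Sedlexing.Latin1.from_string and Sedlexing.from_int_array are available in your version of sedlex (check the Sedlexing interface); count_code_points and the buffer names are illustrative.

```ocaml
(* Count the code points remaining in a lexbuf, whatever its source. *)
let count_code_points buf =
  let rec loop n =
    match%sedlex buf with
    | eof -> n
    | any -> loop (n + 1)
    | _ -> n
  in
  loop 0

(* Three encodings of the same text: 'h' followed by U+00E9. *)
let utf8_buf = Sedlexing.Utf8.from_string "h\xc3\xa9"
let latin1_buf = Sedlexing.Latin1.from_string "h\xe9"
let ints_buf = Sedlexing.from_int_array [| 0x68; 0xE9 |]
```

Each buffer holds two code points even though the underlying byte representations differ, so count_code_points gives the same answer for all three.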

It is possible to work with a custom implementation for lex buffers. To do this, you just have to ensure that a module called Sedlexing is in scope of your lexer specifications, and that it defines at least the following functions: start, next, mark, backtrack. See the interface of the Sedlexing module for more information.

Using sedlex

The quick way:

   opam install sedlex

Otherwise, the first thing to do is to compile and install sedlex. You need a recent version of OCaml and dune.

  make

With findlib

If you have findlib, you can use it to install and use sedlex. The name of the findlib package is "sedlex".

Installation (after "make"):

  make install

Compilation of OCaml files with lexer specifications:

  ocamlfind ocamlc -c -package sedlex.ppx my_file.ml

When linking, you must also include the sedlex package:

  ocamlfind ocamlc -o my_prog -linkpkg -package sedlex.ppx my_file.cmo

There is also a sedlex.ppx subpackage containing the code of the ppx filter. This can be used to build custom drivers (combining several ppx transformations in a single process).

Without findlib

You can use sedlex without findlib. To compile, you need to run the source file through the -ppx rewriter ppx_sedlex. Moreover, you need to link the application with the runtime support library for sedlex (sedlexing.cma / sedlexing.cmxa).

With utop

Once sedlex is installed as per above, simply type

#require "sedlex.ppx";;

Integration with ocamlyacc and menhir

sedlex uses its own Sedlexing.lexbuf type, while ocamlyacc and menhir (classic API) expect a lexer function of type Lexing.lexbuf -> token. To bridge the two, create a dummy Lexing.lexbuf and update its position fields after each token:

(* In lexer.ml — the sedlex lexer *)
let rec token buf =
  match%sedlex buf with
    | Plus ('0'..'9') -> Parser.INT (int_of_string (Sedlexing.Utf8.lexeme buf))
    | '+' -> Parser.PLUS
    | Plus white_space -> token buf
    | eof -> Parser.EOF
    | _ -> failwith "Unexpected character"

(* Wrap for ocamlyacc / menhir classic API *)
let tokenize buf =
  let lexbuf = Lexing.from_string "" in
  let tokenize lexbuf =
    let tok = token buf in
    let start_pos, end_pos = Sedlexing.lexing_positions buf in
    lexbuf.Lexing.lex_start_p <- start_pos;
    lexbuf.Lexing.lex_curr_p <- end_pos;
    tok
  in
  (tokenize, lexbuf)

(* In main.ml *)
let () =
  let buf = Sedlexing.Utf8.from_string "1 + 2" in
  let tokenize, lexbuf = Lexer.tokenize buf in
  let result = Parser.main tokenize lexbuf in
  ...

For menhir's incremental API, use Sedlexing.with_tokenizer, which directly returns a supplier of type unit -> token * position * position:

let supplier = Sedlexing.with_tokenizer token buf in
let result =
  Parser.MenhirInterpreter.loop supplier
    (Parser.Incremental.main Lexing.dummy_pos)
in
...

Complete working examples are in examples/with_ocamlyacc/ and examples/with_menhir/.

Examples

The examples/ subdirectory contains several samples of sedlex in use.

Contributors

Benus Becker: implementation of Utf16
sghost: for Unicode 6.3 categories and properties
Peter Zotov:
- improvements to the build system
- switched parts of ppx_sedlex to using concrete syntax (with ppx_metaquot)
Steffen Smolka: port to dune
Romain Beauxis:
- Implementation of the unicode table extractors
- General maintenance
