[manpage_begin string::token n 1]
[keywords lexing]
[keywords regex]
[keywords string]
[keywords tokenization]
[moddesc   {Text and string utilities}]
[titledesc {Regex based iterative lexing}]
[category  {Text processing}]
[require Tcl 8.5]
[require string::token [opt 1]]
[require fileutil]
[description]

This package provides commands for regular expression based lexing
(tokenization) of strings.

[para] The complete set of procedures is described below.

[list_begin definitions]
[comment {- - -- --- ----- -------- ------------- ---------------------}]
[call [cmd {::string token text}] [arg lex] [arg string]]

This command takes an ordered dictionary [arg lex] mapping regular
expressions to labels, and tokenizes the [arg string] according to
this dictionary. See [sectref Examples] for a short sketch.

[para] The result of the command is a list of tokens, where each token
is a 3-element list containing the token's label and its start- and
end-index in the [arg string].

[para] The command will throw an error if it is not able to tokenize
the whole string.

[comment {- - -- --- ----- -------- ------------- ---------------------}]
[call [cmd {::string token file}] [arg lex] [arg path]]

This command is a convenience wrapper around
[cmd {::string token text}] above and [cmd {fileutil::cat}], enabling
the easy tokenization of whole files.

[emph Note] that this command loads the file wholly into memory before
starting to process it.

[para] If the file is too large for this mode of operation, a command
directly based on [cmd {::string token chomp}] below will be
necessary.

[comment {- - -- --- ----- -------- ------------- ---------------------}]
[call [cmd {::string token chomp}] [arg lex] [arg startvar] [arg string] [arg resultvar]]

This command is the workhorse underlying [cmd {::string token text}]
above. It is exposed to enable users to write their own lexers which,
for example, may apply different lexing dictionaries according to some
internal state. A sketch of such a custom lexer loop can be found in
[sectref Examples].

[para] The command takes an ordered dictionary [arg lex] mapping
regular expressions to labels, a variable [arg startvar] which
indicates where to start lexing in the input [arg string], and a
result variable [arg resultvar] to extend.

[para] The result of the command is a tri-state numeric code
indicating one of

[list_begin definitions]
[def [const 0]] No token found.
[def [const 1]] Token found.
[def [const 2]] End of string reached.
[list_end]

Note that recognition of a token from [arg lex] is started at the
character index in [arg startvar].

[para] If a token was recognized (status [const 1]) the command will
update the index in [arg startvar] to point to the first character of
the [arg string] past the recognized token, and it will further extend
the [arg resultvar] with a 3-element list containing the label
associated with the regular expression of the token, and the start-
and end-character-indices of the token in [arg string].

[para] Neither [arg startvar] nor [arg resultvar] will be updated if
no token is recognized at all.

[para] Note that the regular expressions are applied (tested) in the
order they are specified in [arg lex], and that the first matching
pattern stops the process. Because of this it is recommended to order
the patterns in [arg lex] from the most specific to the most general.

[para] Further note that all regex patterns are implicitly prefixed
with the constraint escape [const \A] to ensure that a match starts
exactly at the character index found in [arg startvar].

[list_end]
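[section Examples]

The following is a minimal sketch of tokenizing a small arithmetic
expression with [cmd {::string token text}]. The lexing dictionary and
the input are purely illustrative:

[example {
    package require string::token

    # Ordered dictionary mapping regular expressions to labels.
    set lex {
        {[0-9]+}  NUMBER
        {[-+*/]}  OPERATOR
        {\s+}     SPACE
    }

    # Result: a list of {label start end} triples, here
    # {NUMBER 0 1} {SPACE 2 2} {OPERATOR 3 3} {SPACE 4 4} {NUMBER 5 6}
    set tokens [string token text $lex {12 + 34}]
}]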
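[para] Tokenizing a whole file with [cmd {::string token file}] works
the same way; the file name below is hypothetical:

[example {
    set tokens [string token file $lex input.txt]
}]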
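[para] Finally, a sketch of a custom lexer loop built directly on
[cmd {::string token chomp}], reusing the dictionary [var lex] from
the first example. Such a loop is the place where a lexer could, for
example, switch to a different dictionary depending on the tokens
seen so far:

[example {
    set input  {12 + 34}
    set start  0
    set result {}
    while {1} {
        set code [string token chomp $lex start $input result]
        if {$code == 2} break ;# end of string reached, done
        if {$code == 0} {
            error "unable to match any pattern at index $start"
        }
        # code == 1: a token was appended to $result and $start
        # now points past it. Adapt lexer state here if needed.
    }
}]

[vset CATEGORY textutil]
[include ../doctools2base/include/feedback.inc]
[manpage_end]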