Built-in rules

Besides ANY, which matches any single Unicode character, pest provides several built-in rules that make parsing text more convenient.

ASCII rules

Among the printable ASCII characters, it is often useful to match alphabetic characters and numbers. For numbers, pest provides digits in common radixes (bases):

Built-in rule          Equivalent
ASCII_DIGIT            '0'..'9'
ASCII_NONZERO_DIGIT    '1'..'9'
ASCII_BIN_DIGIT        '0'..'1'
ASCII_OCT_DIGIT        '0'..'7'
ASCII_HEX_DIGIT        '0'..'9' | 'a'..'f' | 'A'..'F'
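
As a minimal sketch of how the digit rules in the table above might be combined, the following grammar fragment defines integer literals in several radixes. The rule names (decimal_literal, hex_literal, and so on) and the "0x", "0o", and "0b" prefixes are hypothetical choices for illustration, not something pest provides.

```pest
// Hypothetical rule names; only the ASCII_* rules are built in.

// A decimal integer with no leading zeros: either "0" alone,
// or a nonzero digit followed by any number of digits.
decimal_literal = @{ "0" | (ASCII_NONZERO_DIGIT ~ ASCII_DIGIT*) }

// Prefixed literals in other radixes (the prefixes are an assumption).
hex_literal = @{ "0x" ~ ASCII_HEX_DIGIT+ }
oct_literal = @{ "0o" ~ ASCII_OCT_DIGIT+ }
bin_literal = @{ "0b" ~ ASCII_BIN_DIGIT+ }
```

The rules are marked atomic with @ so that implicit whitespace is not allowed between the digits of a single literal.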

For alphabetic characters, pest distinguishes between uppercase and lowercase:

Built-in rule        Equivalent
ASCII_ALPHA_LOWER    'a'..'z'
ASCII_ALPHA_UPPER    'A'..'Z'
ASCII_ALPHA          'a'..'z' | 'A'..'Z'
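
The case-specific rules can be useful when a language gives meaning to capitalization. The sketch below assumes a hypothetical convention in which constants are written in upper case and variables start with a lowercase letter; the rule names are illustrative only.

```pest
// Hypothetical rule names; only the ASCII_* rules are built in.

// An upper-case constant such as MAX_SIZE or PORT8080.
constant_name = @{ ASCII_ALPHA_UPPER ~ (ASCII_ALPHA_UPPER | ASCII_DIGIT | "_")* }

// A variable that starts with a lowercase letter, such as maxSize.
variable_name = @{ ASCII_ALPHA_LOWER ~ (ASCII_ALPHA | ASCII_DIGIT | "_")* }
```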

And for miscellaneous use:

Built-in rule         Meaning                   Equivalent
ASCII_ALPHANUMERIC    any digit or letter       ASCII_DIGIT | ASCII_ALPHA
NEWLINE               any line ending format    "\n" | "\r\n" | "\r"
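
To illustrate NEWLINE, here is a minimal sketch of a grammar that splits its input into lines regardless of which line ending is used. The rule names line and file are hypothetical; SOI and EOI are pest's built-in start-of-input and end-of-input rules.

```pest
// Hypothetical rule names; NEWLINE, ANY, SOI and EOI are built in.

// Everything up to, but not including, the next line ending.
// A line may be empty.
line = @{ (!NEWLINE ~ ANY)* }

// One or more lines separated by "\n", "\r\n" or "\r".
file = { SOI ~ line ~ (NEWLINE ~ line)* ~ EOI }
```

Because the equivalent expression tries "\r\n" before "\r", a Windows-style line ending is consumed as a single separator rather than as a carriage return followed by an empty line.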

Unicode rules

To make it easier to correctly parse arbitrary Unicode text, pest includes a large number of rules corresponding to Unicode character properties. These rules are divided into general category and binary property rules.

Unicode characters are partitioned into categories based on their general purpose. Every character belongs to a single category, in the same way that every ASCII character is a control character, a digit, a letter, a symbol, or a space.

In addition, every Unicode character has a list of binary properties (true or false) that it does or does not satisfy. Characters can belong to any number of these properties, depending on their meaning.

For example, the character "A", "Latin capital letter A", is in the general category "Uppercase Letter" because its general purpose is being a letter. It has the binary property "Uppercase" but not "Emoji". By contrast, the character "🅰", "negative squared Latin capital letter A", is in the general category "Other Symbol" because it does not generally occur as a letter in text. It has both the binary properties "Uppercase" and "Emoji".

For more details, consult Chapter 4 of The Unicode Standard.

General categories

Formally, categories are non-overlapping: each Unicode character belongs to exactly one category, and no category contains another. However, since certain groups of categories are often useful together, pest exposes the hierarchy of categories below. For example, the rule CASED_LETTER is not technically a Unicode general category; it instead matches characters that are UPPERCASE_LETTER or LOWERCASE_LETTER, which are general categories.

  • LETTER
    • CASED_LETTER
      • UPPERCASE_LETTER
      • LOWERCASE_LETTER
    • TITLECASE_LETTER
    • MODIFIER_LETTER
    • OTHER_LETTER
  • MARK
    • NONSPACING_MARK
    • SPACING_MARK
    • ENCLOSING_MARK
  • NUMBER
    • DECIMAL_NUMBER
    • LETTER_NUMBER
    • OTHER_NUMBER
  • PUNCTUATION
    • CONNECTOR_PUNCTUATION
    • DASH_PUNCTUATION
    • OPEN_PUNCTUATION
    • CLOSE_PUNCTUATION
    • INITIAL_PUNCTUATION
    • FINAL_PUNCTUATION
    • OTHER_PUNCTUATION
  • SYMBOL
    • MATH_SYMBOL
    • CURRENCY_SYMBOL
    • MODIFIER_SYMBOL
    • OTHER_SYMBOL
  • SEPARATOR
    • SPACE_SEPARATOR
    • LINE_SEPARATOR
    • PARAGRAPH_SEPARATOR
  • OTHER
    • CONTROL
    • FORMAT
    • SURROGATE
    • PRIVATE_USE
    • UNASSIGNED
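
As a rough sketch of how the category rules above might be used, the fragment below matches words and numbers written in any script rather than only in ASCII. The rule names word and number are hypothetical, and allowing marks after the first letter is an assumption about what should count as part of a word.

```pest
// Hypothetical rule names; LETTER, MARK and DECIMAL_NUMBER are built in.

// A word in any script: a letter followed by further letters or
// combining marks (for example, accents encoded as separate characters).
word = @{ LETTER ~ (LETTER | MARK)* }

// A run of decimal digits from any script, not just '0'..'9'.
number = @{ DECIMAL_NUMBER+ }
```

Using a parent rule such as LETTER is equivalent to listing all of its subcategories as alternatives, which is the convenience the hierarchy is meant to provide.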

Binary properties

Many binary properties are used to define Unicode text algorithms, such as the bidirectional algorithm and the text segmentation algorithm. Such properties are unlikely to be useful in most parsers.

However, the properties XID_START and XID_CONTINUE are particularly notable because they are defined "to assist in the standard treatment of identifiers", "such as programming language variables". See Unicode Standard Annex #31 (formerly Technical Report 31) for more details.

  • ALPHABETIC
  • BIDI_CONTROL
  • CASE_IGNORABLE
  • CASED
  • CHANGES_WHEN_CASEFOLDED
  • CHANGES_WHEN_CASEMAPPED
  • CHANGES_WHEN_LOWERCASED
  • CHANGES_WHEN_TITLECASED
  • CHANGES_WHEN_UPPERCASED
  • DASH
  • DEFAULT_IGNORABLE_CODE_POINT
  • DEPRECATED
  • DIACRITIC
  • EXTENDER
  • GRAPHEME_BASE
  • GRAPHEME_EXTEND
  • GRAPHEME_LINK
  • HEX_DIGIT
  • HYPHEN
  • IDS_BINARY_OPERATOR
  • IDS_TRINARY_OPERATOR
  • ID_CONTINUE
  • ID_START
  • IDEOGRAPHIC
  • JOIN_CONTROL
  • LOGICAL_ORDER_EXCEPTION
  • LOWERCASE
  • MATH
  • NONCHARACTER_CODE_POINT
  • OTHER_ALPHABETIC
  • OTHER_DEFAULT_IGNORABLE_CODE_POINT
  • OTHER_GRAPHEME_EXTEND
  • OTHER_ID_CONTINUE
  • OTHER_ID_START
  • OTHER_LOWERCASE
  • OTHER_MATH
  • OTHER_UPPERCASE
  • PATTERN_SYNTAX
  • PATTERN_WHITE_SPACE
  • PREPENDED_CONCATENATION_MARK
  • QUOTATION_MARK
  • RADICAL
  • REGIONAL_INDICATOR
  • SENTENCE_TERMINAL
  • SOFT_DOTTED
  • TERMINAL_PUNCTUATION
  • UNIFIED_IDEOGRAPH
  • UPPERCASE
  • VARIATION_SELECTOR
  • WHITE_SPACE
  • XID_CONTINUE
  • XID_START
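
As a minimal sketch of the identifier use case mentioned above, the following rule accepts identifiers in any script using XID_START and XID_CONTINUE. The rule name identifier is hypothetical, and permitting a leading underscore is a language design choice assumed here for illustration, not something the properties themselves prescribe.

```pest
// Hypothetical rule name; XID_START and XID_CONTINUE are built in.

// An identifier starts with an XID_START character (or, by assumption,
// an underscore) and continues with any number of XID_CONTINUE characters.
identifier = @{ (XID_START | "_") ~ XID_CONTINUE* }
```

With this rule, names such as total, größe, and 変数 would all be accepted, while a name beginning with a digit would not, since digits have XID_CONTINUE but not XID_START.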