JazzJackrabbit Community Forums - View Single Post

Seren · 07:28 AM

Escape sequences

You've probably not only encountered strings but already used them in your own scripts. They're a basic built-in object class designed for storing text. String literals¹ begin and end with either single quotation marks (') or double quotation marks ("). No surprise so far. However, have you ever considered what you'd have to do if you wanted a string that contained both of those characters? If the text is surrounded by double quotation marks, you can't simply place another double quotation mark inside it, because the compiler would treat it as the end of the string and the intended result wouldn't be achieved. Instead, you should make use of what are called "escape sequences".

Escape sequences are character sequences with a special meaning that the compiler replaces with a specified character. In AngelScript, escape sequences are created by typing a backslash (\) followed by a certain character or characters. The sequence \" will be replaced by a double quotation mark and the sequence \' will be replaced by a single quotation mark, but those aren't the only escape sequences there are: \t is replaced by a single tab character, \n is replaced by a line feed (new line) character, which you may find useful for drawing multi-line text on screen, \r becomes a carriage return character, which you probably won't ever need unless you decide to work with text files, and \0 is a null character (not to be confused with the digit character '0'; null is a type of control character²). Granted, these aren't something you commonly use in AngelScript, but it's worth knowing they exist.

Now if you've been reading carefully, you should be wondering how to include the backslash character itself in a string if the compiler expects it to start an escape sequence, and the answer is more escape sequences: \\ is an escape sequence that produces a single backslash. This is very important to know if you happen to write a string that's meant to resemble a Windows file path. Consider a naively written string "C:\Games\". When the compiler reaches it, it will see a double quotation mark and realize it's dealing with a string. It will parse the first two characters and then reach \G. That's not an escape sequence it knows, so it will ignore that bit and throw a warning that will be displayed by the chatlogger. Then it reads further and finds \". It understands it as an instruction to insert a double quotation mark into the string, and then continues parsing all following code as a string. Eventually it reaches the end of the script file, and because the string should have been closed but wasn't, it submits an error to the chatlogger. Thus, backslashes in strings must always be escaped, and the valid form of that string is "C:\\Games\\".
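To make the above a bit more concrete, here's a small sketch of string literals using the escape sequences mentioned so far. The variable names are made up for illustration, and each comment shows the text the literal should end up storing.

```angelscript
// Each literal is followed by the text it actually stores.
string quote   = "She said \"hi\" and left.";  // She said "hi" and left.
string mixed   = 'It\'s a "quoted" word';      // It's a "quoted" word
string path    = "C:\\Games\\";                // C:\Games\
string twoRows = "First line\nSecond line";    // a line break separates the two sentences
string columns = "Name\tScore";                // a tab separates Name and Score
```

Note that a single quotation mark inside a double-quoted literal (and the other way round) doesn't need escaping, which is why mixed only escapes the apostrophe.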

Finally, there are some universal escape sequences that allow you to insert any ASCII or Unicode character into a string as long as you know its hexadecimal code point. Or well, not really, since JJ2+ doesn't actually support Unicode strings, but the sequences can still come in useful. Those are \xFFFF, \uFFFF, and \UFFFFFFFF, where F's should be replaced by hexadecimal (base 16) digits. The number of digits must exactly match the number of F's given here, except for \x, which accepts anywhere between 1 and 4 digits. One case when this is particularly worth remembering is when you need to insert a section sign character (§) to modify spacing between letters. Simply type \xA7 or \u00A7 and it's there.
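A quick sketch of the two spellings of the section sign described above; the variable names are hypothetical. The third line follows from the rule that \x takes between 1 and 4 digits: if the text right after the sign starts with a character that's also a hexadecimal digit, \x may absorb it, so the fixed-width \u form looks like the safer choice there. Treat that as a precaution rather than documented behavior.

```angelscript
string viaX = "\xA7";    // section sign written with \x (1 to 4 hex digits)
string viaU = "\u00A7";  // the same character written with \u (exactly 4 hex digits)

// The text after the sign starts with "1", which is a valid hex digit,
// so the fixed-width \u form avoids any ambiguity about where the sequence ends.
string label = "\u00A7" + "1st place";
```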

Heredoc strings

Escape sequences can be useful, but there are times when they simply make your string look really terrible. On other occasions you have a really big chunk of multi-line text and it would be inconvenient to replace characters with their corresponding escape sequences. AngelScript offers a solution to that, called heredoc strings. Heredoc strings are placed between two sets of triple double quotation marks (""") and can safely contain backslashes, quotation marks, and new line characters. What should you do if you want your heredoc string to also contain triple double quotation marks? AngelScript doesn't say, which is unfortunate because most programming languages that have heredoc strings offer a solution to this kind of problem, but luckily you're unlikely to ever need to do that.
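For illustration, here's a minimal heredoc string holding two lines of made-up help text (the variable name and the text itself are just examples). The backslashes and quotation marks inside are stored exactly as typed.

```angelscript
// Everything between the triple quotation marks is taken literally,
// including the line break, the backslashes and the double quotation marks.
string help = """Usage: !ban <player> <time>
Paths such as C:\Games\ and "quoted" words need no escaping here.""";
```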

Methods and supporting functions

Strings come with some functions and methods that are listed but not explained on the AngelScript website, and not mentioned at all within the JJ2+ scripting API, but definitely worth knowing as they can save a lot of time. Here they are:

Methods:

uint length() const - Returns the number of characters in the string.
void resize(uint n) - Sets the number of characters in the string to n. If n is less than the current size, the string is reduced to its first n characters. If n is more than the current size, an appropriate number of null characters (\0) is appended at the end of the string.
bool isEmpty() const - Returns whether the string is empty, i.e. whether the number of characters is zero.
string substr(uint start = 0, int count = -1) const - Returns a substring of the string, starting at the character with index start and containing count characters, or if count equals -1, containing the remainder of the string.
int findFirst(const string &in str, uint start = 0) const - Finds the first occurrence of substring str within the string, starting the search from start. Returns index of the first character of the found substring or -1 if the substring is not found.
int findLast(const string &in str, int start = -1) const - Finds the last occurrence of substring str within the string, starting the search from start, or if start equals -1, from the end of the string. Returns index of the first character of the found substring or -1 if the substring is not found.
array<string>@ split(const string &in delimiter) const - Separates the string into substrings wherever the character sequence delimiter is found and returns the resulting array. The substrings don't contain the delimiter. This can be very useful for creating custom commands, as splitting user input by spaces (as in, text.split(" ")) greatly simplifies parsing. For example, if text is "!ban 5 60m", the resulting array will be {"!ban", "5", "60m"}, which is much easier to implement logic for. This method can also be used to divide the content of a loaded text file into separate lines, using text.split("\r\n") for Windows files or text.split("\n") for Unix files³.
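Here's a short sketch that runs the methods listed above on the example text "!ban 5 60m". The function name is hypothetical, and the comments give the values I'd expect from the descriptions above.

```angelscript
void stringMethodsDemo() {
  string text = "!ban 5 60m";

  uint size      = text.length();           // 10
  bool empty     = text.isEmpty();          // false
  string command = text.substr(0, 4);       // "!ban"
  string rest    = text.substr(5);          // "5 60m" (count defaults to -1)
  int firstSpace = text.findFirst(" ");     // 4
  int lastSpace  = text.findLast(" ");      // 6
  int missing    = text.findFirst("kick");  // -1, not found

  array<string>@ parts = text.split(" ");   // {"!ban", "5", "60m"}

  text.resize(4);                           // text is now "!ban"
}
```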

Functions:

string join(const array<string> &in arr, const string &in delimiter) - Joins the array of strings arr into a single string, inserting delimiter between substrings. This essentially reverts the operation of split. Like split, it can come in useful during command parsing, mainly if the final argument of a command is allowed to contain multiple words, for example after text "!ban 5 60m airboard bug abuse" is split, the resulting array is {"!ban", "5", "60m", "airboard", "bug", "abuse"}. After processing the first 3 arguments of the command we can remove them from the array and end up with {"airboard", "bug", "abuse"}, which we can then join with spaces to obtain "airboard bug abuse".
int64 parseInt(const string &in str, uint base = 10, uint &out byteCount = 0) - Returns an integer represented by the string str. For example for the input string "1056", the function will return 1056. The base argument determines the numerical base used for parsing, which must be either 10 or 16, otherwise the function fails. If the function encounters a character that is not a valid digit in the provided base or a '+' or '-' sign after the first position, it stops parsing and returns the integer calculated up to that point. byteCount may be used to obtain the number of valid characters parsed.
double parseFloat(const string &in str, uint &out byteCount = 0) - Returns a floating-point number represented by the string str. For example for the input string "0.635", the function will return 0.635. If the function encounters a character that shouldn't occur in a valid representation of a floating-point number, it stops parsing and returns the value calculated up to that point. byteCount may be used to obtain the number of valid characters parsed.
string formatInt(int64 val, const string &in options = "", uint width = 0) - Returns a text representation of the provided integer val. Available options are explained below. The width argument specifies the minimum length (in characters) of the result. If the result is shorter than the minimum length, an appropriate number of spaces will be added on its left (or on its right if left justified; see options below).
string formatFloat(double val, const string &in options = "", uint width = 0, uint precision = 0) - Returns a text representation of the provided floating-point number val. Available options are explained below. The width argument specifies the minimum length (in characters) of the result. If the result is shorter than the minimum length, an appropriate number of spaces will be added on its left (or on its right if left justified; see options below). precision specifies the number of digits to appear after the decimal point character.
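The command parsing idea described for split and join could be sketched roughly like this. The function parseBan and its parameters are hypothetical, made up for this example, and a real command handler would likely need more validation (for instance of the duration).

```angelscript
// Hypothetical parser for text such as "!ban 5 60m airboard bug abuse".
// Returns false if the text is not a well-formed !ban command.
bool parseBan(const string &in text, int &out player, string &out duration, string &out reason) {
  array<string>@ args = text.split(" ");
  if (args.length() < 3 || args[0] != "!ban")
    return false;

  uint parsed = 0;
  player = int(parseInt(args[1], 10, parsed));
  if (parsed == 0 || parsed != args[1].length())  // reject "" or input such as "5x"
    return false;

  duration = args[2];

  for (uint i = 0; i < 3; i++)  // drop "!ban", the player number and the duration
    args.removeAt(0);
  reason = join(args, " ");     // "airboard bug abuse", or "" if nothing is left
  return true;
}
```

Comparing byteCount against the length of the argument is one way to notice trailing garbage, since parseInt on its own would simply stop at the first invalid character and return what it has.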

In formatInt and formatFloat, options should be a string that's a combination of the following characters:

'l' (lowercase L) - Left justify. If the length of the result is less than width, space characters will be added on the right, justifying the result to the left rather than the default of right.
'0' (zero) - Pad with zeroes. If the length of the result is less than width, an appropriate number of zeroes will be added in front of the result.
'+' (plus) - Always include the sign, even if positive. If the number is positive, a plus sign will be added in front of the result.
' ' (space) - Add a space in case of a positive number. If the number is positive, a single space will be added in front of the result.
'h' - Hexadecimal integer small letters. The integer is formatted in base 16 and lowercase letters a-f are used as digits. Not available in formatFloat.
'H' - Hexadecimal integer capital letters. The integer is formatted in base 16 and uppercase letters A-F are used as digits. Not available in formatFloat.
'e' - Exponent character with small e. Scientific notation is used and signified by a lowercase e. Only available in formatFloat.
'E' - Exponent character with capital E. Scientific notation is used and signified by an uppercase E. Only available in formatFloat.
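A few sample calls may make the options easier to picture. This is only a sketch with a hypothetical function name; the results in the comments are what I'd expect from the descriptions above.

```angelscript
void formatDemo() {
  string padded  = formatInt(42, "", 6);            // "    42"
  string left    = formatInt(42, "l", 6);           // "42    "
  string zeroes  = formatInt(42, "0", 6);           // "000042"
  string signed  = formatInt(42, "+");              // "+42"
  string hexLow  = formatInt(255, "h");             // "ff"
  string hexHigh = formatInt(255, "H");             // "FF"
  string rounded = formatFloat(3.14159, "", 0, 2);  // "3.14"
  string both    = formatFloat(3.14159, "0", 8, 3); // "0003.142"
}
```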

A recent update of AngelScript also added formatUInt, analogous to formatInt but operating on unsigned integers. However, this function is not available in JJ2+ yet (and only comes in useful if you work with really big numbers; we're talking 18-19 digits).

Operators

Finally, strings support a number of operators, some obvious and some less so. It's very useful to know them as it can save you a lot of redundant code. Naturally, strings can be assigned (=) and tested for equality (== and !=). It's less common knowledge that they can also be compared using other relational operators (<, >, <=, >=). Such comparison is performed according to character codes, i.e. for most common uses, ASCII. It's useful to notice, for example, that if both compared strings are lowercase English words, the one considered "smaller" is the one that precedes the other alphabetically. This has the advantage of allowing you to quickly sort an array of words alphabetically using the array method sortAsc (which may be discussed in another tutorial).

However, in ASCII, uppercase letters will always be considered "less than" lowercase ones (take a look at an ASCII table). This is quite unfortunate for the purpose of sorting alphabetically in case our words happen to contain mixed lowercase and uppercase characters. Luckily, ASCII is actually a fairly well designed system, in which the difference between an uppercase letter and its lowercase equivalent is always exactly 32. This is essential information, because it means that we can switch the case of a letter as long as we know its character code.
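To illustrate both points, here's a small sketch of comparing and sorting strings. The function and variable names are made up, and the comments show the order I'd expect from comparing character codes.

```angelscript
void sortingDemo() {
  string first  = "apple";
  string second = "pear";
  bool precedes = first < second;  // true: "a" (97) is less than "p" (112)

  array<string> lower = {"pear", "apple", "fig"};
  lower.sortAsc();                 // {"apple", "fig", "pear"}

  array<string> mixed = {"pear", "Zebra", "apple"};
  mixed.sortAsc();                 // {"Zebra", "apple", "pear"}: "Z" (90) comes before "a" (97)
}
```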

Most programming languages provide a way to convert between characters and character codes. AngelScript does too, but it makes it particularly bothersome and doesn't really document it. You do it using the index operator ([]). For example, by typing text[0], we obtain the character code of the first character of text. Like most functionalities, it's available even for string literals, so "A"[0] is a valid expression that has the value of the character code of the letter A (65). That means you don't have to look up the ASCII code of a specific character when you need it in your script (useful in connection with jjKey, as keyboard codes of digits 0-9 and letters A-Z are the same as their ASCII codes, and jjKey["A"[0]], even if not pretty, will be easier to understand than jjKey[65]).

The index operator can be used not only to read, but also to set a character code at the selected position, e.g. text[0] = "!"[0]; will set the first character of text to an exclamation mark. In either case, like when working with arrays, keep in mind that the index has to be valid, i.e. non-negative and less than the string length, otherwise a run-time "index out of bounds" error will occur.
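As a small sketch of reading and writing character codes, the hypothetical helper below capitalizes the first letter of a string. It checks isEmpty first so that text[0] is never evaluated for an empty string, and it relies on the difference of 32 between uppercase and lowercase ASCII letters.

```angelscript
// Hypothetical helper: returns a copy of text with its first letter capitalized.
string capitalize(string text) {
  if (!text.isEmpty() && text[0] >= "a"[0] && text[0] <= "z"[0])
    text[0] -= 32;  // e.g. "a" (97) becomes "A" (65)
  return text;
}
```

With this, capitalize("jazz") should give "Jazz", while capitalize("") and capitalize("5 lives") come back unchanged.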

With this information, you should be able to understand the following function that converts a string to lowercase:

Code:

string lowercase(string text) {
  for (uint i = 0; i < text.length(); i++) {    // for each character
    if (text[i] >= "A"[0] && text[i] <= "Z"[0]) // if the character is an uppercase letter
      text[i] += 32;                            // convert it to its lowercase equivalent
  }
  return text;
}

It's not the most efficient way to do it, but it works.

The last pair of operators supported by strings are the concatenation operators + and +=, which you've probably already encountered. Although they are represented by the same characters as the arithmetic addition operators (such as in the expression 1 + 1), they serve a completely different purpose and have different properties. Concatenation joins two strings together, simply appending one at the end of the other. As obvious as it may sound, you should notice that in contrast to arithmetic addition, the order of operands in concatenation does matter: "A" + "B" is not the same as "B" + "A"; one equals "AB" and the other "BA".

AngelScript simplifies syntax by allowing one of the operands of the concatenation operator to be not a string but another data type, such as a Boolean, integer, or float. Booleans will be appropriately converted to strings that say either "true" or "false", while integers and floats will be formatted using formatInt and formatFloat respectively, implicitly called with default parameters. This can shorten your code, but remember to get the order of operations right: the expression "two: " + 1 + 1 will result in a string saying "two: 11", and to achieve "two: 2" instead, you should write "two: " + (1 + 1). The expression "zero: " + 1 - 1 won't compile at all, because the concatenation will occur first, and an attempt to compile the operation "zero: 1" - 1 will cause an error as the minus sign doesn't have a meaning for strings.
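The cases above, put together in one sketch (the function and variable names are hypothetical):

```angelscript
void concatenationDemo() {
  string ab    = "A" + "B";          // "AB"
  string ba    = "B" + "A";          // "BA"
  string wrong = "two: " + 1 + 1;    // "two: 11", evaluated left to right
  string right = "two: " + (1 + 1);  // "two: 2", the parentheses add the numbers first
  string flag  = "alive: " + true;   // "alive: true"

  int lives = 3;
  string status = "Lives: ";
  status += lives;                   // "Lives: 3"
}
```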

Summary

Hopefully that's all you could ever want to know about strings. To sum it up, strings are a data type for storing text, string literals can contain escape sequences for representing characters you can't otherwise include in them, and a special syntax called heredoc strings exists for when you don't want escape sequences to be interpreted or want to include long multi-line text. Strings have many useful methods and supporting functions, among them ways to obtain and modify length, search for desired terms, split into an array and join it back into a string, and convert between strings and arithmetic data types. They also support several operators, in particular ones that allow lexicographical comparison according to their internal ASCII code representation, obtaining and modifying the character code at a selected position, and concatenation.

Useful links:
AngelScript documentation of strings
ASCII chart