|
| 1 | +\hsection{Text Strings}% |
| 2 | +\label{sec:str}% |
| 3 | +% |
| 4 | +The fourth important datatype in \python\ is the text string. |
| 5 | +Text strings are sequences of characters of an arbitrary length. |
| 6 | +In \python, they are represented by the datatype \pythonilIdx{str}. |
| 7 | +Indeed, we have already used it before, even back in our very first example program, which simply printed \pythonil{"Hello World"} in \cref{lst:very_first_program} in \cref{sec:ourFirstProgram}. |
| 8 | +\pythonil{"Hello World"} is such a text string.% |
| 9 | +% |
| 10 | +\hsection{Basic String Operations}% |
| 11 | +% |
| 12 | +\begin{figure}% |
| 13 | +\centering% |
| 14 | +\includegraphics[width=0.8\linewidth]{\currentDir/strIndexing}% |
| 15 | +\caption{Specifying string literals and indexing their characters.}% |
| 16 | +\label{fig:strIndexing}% |
| 17 | +\end{figure}% |
| 18 | +% |
| 19 | +As \cref{fig:strIndexing} shows, there are two basic ways to specify a text string literal\pythonIdx{str!literal}: |
| 20 | +Either enclosed by double quotes, e.g., \pythonil{"Hello World!"}\pythonIdx{\textquotedbl} or enclosed by single quotes, e.g., \pythonil{'Hello World!'}\pythonIdx{\textquotesingle}. |
| 21 | +The double-quote variant is usually preferred and we should always use it in our programs. |
| 22 | +The quotation marks are only used to delimit the strings, i.e., to tell \python\ where the string begins or ends. |
| 23 | +They are not themselves part of the string. |
| 24 | + |
| 25 | +One basic operation is string concatenation\pythonIdx{str!concatenation}\pythonIdx{str!+}\pythonIdx{+}: |
| 26 | +\pythonil{"Hello" + ' ' + "World"}\pythonIdx{\textquotedbl}\pythonIdx{\textquotesingle} concatenates the three strings \pythonil{"Hello"}, \pythonil{" "}, and \pythonil{"World"}. |
| 27 | +The result is \pythonil{"Hello World"}\pythonIdx{\textquotedbl}. |
| 28 | +Notice how the single-space string is needed, because \pythonil{"Hello" + "World"} would just yield \pythonil{"HelloWorld"}. |
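The concatenation described above can be tried out with a minimal sketch like the following. The variable names `greeting` and `no_space` are chosen only for this illustration and do not appear in the original text.

```python
# Concatenate three string literals; both quote styles create the same type.
greeting = "Hello" + ' ' + "World"
print(greeting)  # Hello World

# Without the single-space string, the two words are glued together.
no_space = "Hello" + "World"
print(no_space)  # HelloWorld

# The quotation marks only delimit the literal and are not part of the string.
print("Hello" == 'Hello')  # True
```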
| 29 | + |
| 30 | +Strings are different from the other datatypes we have seen so far. |
| 31 | +They are \emph{sequences}\pythonIdx{Sequence}, meaning that they are linear arrays composed of elements. |
| 32 | +These elements are the single characters, which correspond to letters, numbers, punctuation marks, white space, etc. |
| 33 | + |
| 34 | +One basic set of things that we can do with strings is to extract these single characters. |
| 35 | +First, we need to know the length of a string. |
| 36 | +For this purpose, we can invoke the \pythonilIdx{len}\pythonIdx{str!len}\pythonIdx{str!length} function: |
| 37 | +\pythonil{len("Hello")} is \pythonil{5}, because there are five characters in \inQuotes{Hello}. |
| 38 | +\pythonil{len("Hello World!")} would give us \pythonil{12}, because \pythonil{"Hello"} has five characters, \pythonil{"World!"} has six characters (the \pythonil{"!"} does count!) and there is the single space character in the middle, so $5+6+1=12$. |
| 39 | + |
| 40 | +Knowing the length\pythonIdx{str!length} of a string, we can now safely access its single characters. |
| 41 | +These characters are obtained using the square brackets \pythonil{[]}\pythonIdx{str![]}\pythonIdx{[}\pythonIdx{]} with the character index in between. |
| 42 | +The character indexes start at~0. |
| 43 | +Therefore, \pythonil{"Hello"[0]}\pythonIdx{str![]}\pythonIdx{[}\pythonIdx{]} returns the first character of \pythonil{"Hello"} as a \pythonilIdx{str}, which is \pythonil{"H"}\pythonIdx{\textquotedbl}. |
| 44 | +\pythonil{"Hello"[1]} returns the second character, which is \pythonil{"e"}. |
| 45 | +\pythonil{"Hello"[2]} returns the third character, which is \pythonil{"l"}. |
| 46 | +\pythonil{"Hello"[3]}\pythonIdx{str![]}\pythonIdx{[}\pythonIdx{]} gives us the second \pythonil{"l"}. |
| 47 | +Finally, \pythonil{"Hello"[4]} gives us the fifth and last character, namely \pythonil{"o"}\pythonIdx{\textquotedbl}. |
| 48 | +If we try to access a character outside of the valid range of the string, say \pythonil{"Hello"[5]}, this results in an \pythonilIdx{IndexError}. |
| 49 | +We will learn later what errors are and how to handle them -- for now, it is sufficient to know that they will stop your program. |
| 50 | +And rightly so, because \pythonil{"Hello"}\pythonIdx{\textquotedbl} has only five characters and accessing the sixth one is not possible and would have an undefined result. |
| 51 | + |
| 52 | +Negative indices, however, are permitted: |
| 53 | +The index \pythonil{-1} just means \inQuotes{last character}, so \pythonil{"Hello"[-1]} yields the string \pythonil{"o"}. |
| 54 | +The index \pythonil{-2} then refers to the \inQuotes{second-to-last character}, so \pythonil{"Hello"[-2]} gives us \pythonil{"l"}. |
| 55 | +The third character from the end, accessed via index \pythonil{-3}, is again \pythonil{"l"}. |
| 56 | +\pythonil{"Hello"[-4]} gives us \pythonil{"e"} and \pythonil{"Hello"[-5]} gives us \pythonil{"H"}. |
| 57 | +Of course, using a negative index that would bring us out of the string's valid range, such as \pythonil{-6}, again yields an \pythonilIdx{IndexError}. |
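The indexing rules above can be summarized in a short sketch. The variable names are illustrative assumptions, and the `try`/`except` construct is used here only to show the error without stopping the program; it is not explained in this section.

```python
text = "Hello"
length = len(text)          # 5 characters, so valid indices are 0..4

first = text[0]             # "H", indices start at 0
last = text[length - 1]     # "o", the last valid non-negative index
also_last = text[-1]        # "o", -1 means "last character"
also_first = text[-length]  # "H", the smallest valid negative index
print(first, last, also_last, also_first)

# Index 5 is outside the valid range and raises an IndexError.
try:
    text[length]
    out_of_range = False
except IndexError:
    out_of_range = True
print(out_of_range)  # True
```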
| 58 | + |
| 59 | +We can also obtain whole substrings by using index ranges, where the inclusive starting index and the \emph{exclusive} end index are separated by a~\pythonilIdx{:}. |
| 60 | +In other words, applying the index \pythonil{[a:b]} to a string results in all characters in the index range from \pythonil{a} to \pythonil{b - 1}. |
| 61 | +\pythonil{"Hello"[0:3]} yields a string composed of the characters at positions~0, 1, and~2 inside \pythonil{"Hello"}, i.e., \pythonil{"Hel"}. |
| 62 | +The end index is always excluded, so the character at index~3 is not part of the result. |
| 63 | +If we do \pythonil{"Hello"[1:3]}, we get \pythonil{"el"}, because only the characters at indices~1 and~2 are included. |
| 64 | +If we do not specify an end index, then everything starting at the start index until the end of the string is included. |
| 65 | +This means that \pythonil{"Hello"[2:]} will return all the text starting at index~2, which is \pythonil{"llo"}. |
| 66 | +We can also use negative indices, if we want. |
| 67 | +Therefore, \pythonil{"Hello"[1:-2]} yields \pythonil{"el"}. |
| 68 | +Finally, we can also omit the start index, in which case everything until right before the end index is returned. |
| 69 | +Therefore, \pythonil{"Hello"[:-2]} will return everything from the beginning of the string until right before the second-to-last character. |
| 70 | +This gives us \pythonil{"Hel"}. |
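The slicing examples above can be checked with the following small sketch; the variable name `text` is an assumption made for the illustration.

```python
text = "Hello"

print(text[0:3])   # Hel  (indices 0, 1, 2; index 3 is excluded)
print(text[1:3])   # el   (indices 1 and 2)
print(text[2:])    # llo  (from index 2 to the end)
print(text[1:-2])  # el   (from index 1 up to, but excluding, index -2)
print(text[:-2])   # Hel  (from the start up to, but excluding, index -2)

# Because the end index is exclusive, splitting at the same index
# and concatenating both parts rebuilds the original string.
print(text[:2] + text[2:])  # Hello
```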
| 71 | + |
| 72 | +\begin{figure}% |
| 73 | +\centering% |
| 74 | +\includegraphics[width=0.8\linewidth]{\currentDir/strBasicOps}% |
| 75 | +\caption{Some more basic string operations.}% |
| 76 | +\label{fig:strBasicOps}% |
| 77 | +\end{figure}% |
| 78 | + |
| 79 | +Besides concatenating and extracting substrings, the \pythonilIdx{str} datatype supports many other operations. |
| 80 | +Here, we can only discuss a few of the most commonly used ones. |
| 81 | + |
| 82 | +There are several ways to check whether one string is contained in another one. |
| 83 | +The first method is to use the \pythonilIdx{in} keyword. |
| 84 | +As \cref{fig:strBasicOps} shows, \pythonil{"World" in "Hello World!"} yields \pythonilIdx{True}, as it checks whether \pythonil{"World"} is contained in \pythonil{"Hello World!"}, which is indeed the case. |
| 85 | +\pythonil{"Earth" in "Hello World!"} is \pythonilIdx{False}, because \pythonil{"Earth"} is not contained in \pythonil{"Hello World!"}. |
| 86 | + |
| 87 | +Often, however, we do not just want to know whether a string is contained in another one, but also \emph{where} it is contained. |
| 88 | +For this, the \pythonilIdx{find} method exists. |
| 89 | +\pythonil{"Hello World!".find("World")} tries to find the position of \pythonil{"World"} inside \pythonil{"Hello World!"}. |
| 90 | +It returns \pythonil{6}, because the \inQuotes{W} of \inQuotes{World} is the seventh character in this string and the indices are zero-based. |
| 91 | +Trying to find the \pythonil{"world"} in \pythonil{"Hello World!"} yields~\pythonil{-1}, however. |
| 92 | +\pythonil{-1} means that the string cannot be found. |
| 93 | +We learn that string operations are case-sensitive\pythonIdx{str!case-sensitive}: |
| 94 | +\pythonil{"World" != "world"} would be \pythonilIdx{True}. |
| 95 | +We also learn that we must not use the result of \pythonilIdx{find} as an index into a string without first checking that it is \pythonil{>= 0}! |
| 96 | +As you have learned, \pythonil{-1} is a perfectly fine index into a string, even though it means that the string we tried to find was not found. |
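The pitfall described above can be illustrated with a minimal sketch. The variable names `position` and `missing` are assumptions for this example, and the `if` statement is used only to show the check.

```python
text = "Hello World!"

print("World" in text)  # True
print("Earth" in text)  # False

position = text.find("World")  # 6
if position >= 0:
    # Safe: the substring was found, so position is a real index.
    print(text[position])  # W

missing = text.find("world")  # -1, because the search is case-sensitive
if missing >= 0:
    print(text[missing])
else:
    print("not found")

# Without the check, -1 would silently index the last character.
print(text[missing])  # !
```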
| 97 | + |
| 98 | +Sometimes, the text we are looking for is contained multiple times in a given string. |
| 99 | +For example, \pythonil{"Hello World!".find("l")} returns~\pythonil{2}, because \inQuotes{l} is the third character in the string. |
| 100 | +However, it is also the fourth character in the string. |
| 101 | +\pythonilIdx{find} accepts an optional second parameter, namely the starting index where the search should begin. |
| 102 | +\pythonil{"Hello World!".find("l", 3)} begins to search for \pythonil{"l"} inside \pythonil{"Hello World!"} starting at index~3. |
| 103 | +Right at that index, the second~\inQuotes{l} is found, so that \pythonil{3} is also returned. |
| 104 | +If we search for another~\inQuotes{l} after that, we would do \pythonil{"Hello World!".find("l", 4)}, which returns index~9, identifying the~\inQuotes{l} in~\inQuotes{World}. |
| 105 | +After that, no more~\inQuotes{l} can be found in the string, so \pythonil{"Hello World!".find("l", 10)} results in a~\pythonil{-1}.% |
| 106 | +% |
| 107 | +\begin{sloppypar}% |
| 108 | +While \pythonilIdx{find} returns the first occurrence of a string in the supplied range, we sometimes want the last occurrence instead. |
| 109 | +If we want to search from the end of the string, we use \pythonilIdx{rfind}. |
| 110 | +\pythonil{"Hello World!".rfind("l")} gives us~\pythonil{9} directly. |
| 111 | +If we want to search for the~\inQuotes{l} before that one, we need to supply an inclusive starting and exclusive ending index of the range to be searched. |
| 112 | +\pythonil{"Hello World!".rfind("l", 0, 9)} searches for any~\inQuotes{l} from index~8 down to~0 and thus returns~\pythonil{3}. |
| 113 | +\pythonil{"Hello World!".rfind("l", 0, 3)} gives us~\pythonil{2} and since there is no~\inQuotes{l} before that, \pythonil{"Hello World!".rfind("l", 0, 2)} yields~\pythonil{-1}. |
| 114 | +\end{sloppypar}% |
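The step-by-step searches above can be combined into a small sketch that collects every index at which "l" occurs. The loop and the list `positions` are assumptions made for this illustration and rely on language features not introduced in this section.

```python
text = "Hello World!"

# Repeatedly call find, each time starting right after the previous hit.
positions = []
index = text.find("l")
while index >= 0:
    positions.append(index)
    index = text.find("l", index + 1)
print(positions)  # [2, 3, 9]

# rfind searches from the end; the optional range is [start, end).
print(text.rfind("l"))        # 9
print(text.rfind("l", 0, 9))  # 3
print(text.rfind("l", 0, 3))  # 2
print(text.rfind("l", 0, 2))  # -1
```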
| 115 | +% |
| 116 | +\begin{sloppypar}% |
| 117 | +Another common operation is to replace substrings with something else. |
| 118 | +\pythonil{"Hello World!".replace("Hello", "Hi")}\pythonIdx{replace} replaces all occurrences of \inQuotes{Hello} in \inQuotes{Hello World!} with \inQuotes{Hi}. |
| 119 | +The result is \pythonil{"Hi World!"} and \pythonil{"Hello Hello World!".replace("Hello", "Hi")} becomes \pythonil{"Hi Hi World!"}. |
| 120 | +\end{sloppypar}% |
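As a brief sketch of the replacement behavior described above: the optional third argument of `replace`, which limits the number of replacements, is not discussed in the text and is shown here only as an aside. The variable names are illustrative.

```python
text = "Hello Hello World!"

replaced_all = text.replace("Hello", "Hi")
print(replaced_all)  # Hi Hi World!

# Optional third argument: replace at most this many occurrences.
replaced_one = text.replace("Hello", "Hi", 1)
print(replaced_one)  # Hi Hello World!

# replace returns a new string; the original is left unchanged.
print(text)  # Hello Hello World!
```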
| 121 | +% |
| 122 | +\begin{sloppypar}% |
| 123 | +Often, we want to remove all leading or trailing whitespace characters from a string. |
| 124 | +The \pythonilIdx{strip} method does this for us: |
| 125 | +\pythonil{" Hello World! ".strip()} returns \pythonil{"Hello World!"}, i.e., the same string, but with the leading and trailing space removed. |
| 126 | +If we only want to remove the spaces on the left-hand side, we use \pythonilIdx{lstrip} and if we only want to remove those on the right-hand side, we use \pythonilIdx{rstrip} instead. |
| 127 | +Therefore, \pythonil{" Hello World! ".lstrip()} yields \pythonil{"Hello World! "} and \pythonil{" Hello World! ".rstrip()} gives us \pythonil{" Hello World!"}. |
| 128 | +\end{sloppypar}% |
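The three stripping methods can be compared side by side in the following sketch. It uses `repr` so that the remaining spaces are visible in the output; the variable name `text` and the tab and newline example are assumptions for this illustration.

```python
text = " Hello World! "

print(repr(text.strip()))   # 'Hello World!'
print(repr(text.lstrip()))  # 'Hello World! '
print(repr(text.rstrip()))  # ' Hello World!'

# Without arguments, other whitespace such as tabs and newlines is removed, too.
print(repr("\t Hello World!\n".strip()))  # 'Hello World!'
```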
| 129 | +% |
| 130 | +In alphabet-based languages, we usually can distinguish between uppercase\pythonIdx{str!uppercase} characters, such as \inQuotes{H} and \inQuotes{W}, and lowercase\pythonIdx{str!lowercase}, such as \inQuotes{e}, \inQuotes{l}, and~\inQuotes{o}. |
| 131 | +The method \pythonilIdx{lower} transforms all characters in a string to lowercase and \pythonilIdx{upper} translates them to uppercase instead. |
| 132 | +Thus, \pythonil{"Hello World!".lower()} returns \pythonil{"hello world!"}, whereas \pythonil{"Hello World!".upper()} yields \pythonil{"HELLO WORLD!"}. |
| 133 | + |
| 134 | +Finally, to check whether a string begins or ends with another one, we can use the methods \pythonilIdx{startswith} and \pythonilIdx{endswith}, respectively. |
| 135 | +\pythonil{"Hello World!".startswith("hello")} is \pythonilIdx{False} whereas \pythonil{"Hello World!".startswith("Hello")} is \pythonilIdx{True}. |
| 136 | +\pythonil{"Hello World!".endswith("Hello")} is \pythonilIdx{False}, too, but \pythonil{"Hello World!".endswith("World!")} is \pythonilIdx{True}. |
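The case conversion and prefix/suffix checks can be combined, as in this minimal sketch. Converting to lowercase before calling `startswith` is one possible way to make the check ignore case; it is an illustration and not something prescribed by the text above.

```python
text = "Hello World!"

print(text.lower())  # hello world!
print(text.upper())  # HELLO WORLD!

print(text.startswith("hello"))  # False, the check is case-sensitive
print(text.startswith("Hello"))  # True
print(text.endswith("World!"))   # True

# Lowercasing first makes the prefix check ignore case.
print(text.lower().startswith("hello"))  # True
```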
| 137 | + |
| 138 | +Of course, this was just a small selection of the many string operations available in \python. |
| 139 | +You can find more in the \href{https://docs.python.org/3/library/stdtypes.html\#textseq}{official documentation}~\cite{PSF2024TSTS}.% |
| 140 | +\endhsection% |
| 141 | +% |
| 142 | +\endhsection% |
| 143 | +% |