Strings and Regexes

John Gabriele

Strings are immutable sequences of Unicode code points. You can index into them and take slices of them.

len('hi')   # => 2
'hello'[1]  # => 'e'
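
A small sketch of slicing and immutability, since only indexing is shown above. The variable names are just for illustration.

```python
s = 'hello'

middle = s[1:4]      # => 'ell'  (start inclusive, end exclusive)
last = s[-1]         # => 'o'    (negative indexes count from the end)
backwards = s[::-1]  # => 'olleh'

# Strings are immutable, so item assignment raises TypeError.
try:
    s[0] = 'j'
    assignment_failed = False
except TypeError:
    assignment_failed = True
```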

"""This is
a multi-line
string."""

"x" * 3  # => "xxx"

"""Ignore the newline\
at the end of that line."""
# ^^ No space between "newline" and "at".

Embedding unicode:

"this → that"
# or
"this \u2192 that"
# or
"this \N{RIGHTWARDS ARROW} that"

Some methods:

capitalize
lower
upper
title

replace
splitlines
strip, lstrip, rstrip       Can take arg specifying what to strip!

count         how many non-overlapping times a given substring is present
find          returns -1 on failure
index         raises exception on failure

format
format_map

startswith    can take a tuple
endswith      same as above
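
A quick sketch of a few of the methods above, focusing on the less obvious behaviors (the strip argument, non-overlapping counts, find vs. index, and tuple arguments). The sample strings are made up.

```python
# strip() with an argument removes any of those characters from both ends.
stripped = 'xxhixyx'.strip('xy')   # => 'hi'

# count() counts non-overlapping occurrences.
n = 'aaaa'.count('aa')             # => 2

# find() returns -1 on failure; index() raises ValueError.
pos = 'hello'.find('z')            # => -1
try:
    'hello'.index('z')
    index_failed = False
except ValueError:
    index_failed = True

# startswith()/endswith() accept a tuple of candidates.
is_image = 'photo.png'.endswith(('.png', '.jpg'))  # => True
```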

Easy way to get a list of words:

'foo bar baz moo'.split()

Converting hex strings ↔︎ ints:

hex(255)            # => '0xff'
import sys
hex(sys.maxunicode) # => '0x10ffff'
int('ff', 16)       # => 255
# Base 0 means to look at the 2-char prefix to determine the base.
int('0xff', 0)      # => 255
# And, of course:
str(0xff)           # => '255'

By default, str.replace does global search/replace, but you can pass a num-times arg to limit how many it performs.
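
For example (the sample string is made up), the optional third argument caps the number of replacements, counting from the left:

```python
s = 'a-b-c-d'

replaced_all = s.replace('-', '+')     # => 'a+b+c+d'
replaced_two = s.replace('-', '+', 2)  # => 'a+b+c-d'
```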

Use repr to get a string representation of a Python object. That is, how the object would be written in Python code (to be eval’d).
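
A small illustration of repr on built-in values. The round trip through eval works for simple built-ins like these; it is not guaranteed for arbitrary objects.

```python
s = 'hi\n'

as_code = repr(s)            # => "'hi\\n'"  (quotes and the escape are part of the result)
round_trip = eval(as_code)   # evaluates back to the original string

list_code = repr([1, 'a'])   # => "[1, 'a']"
```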

format

todo: mention format strings (“f-strings”).

The str.format method. There is also a built-in format function, which formats a single value (it just calls its first arg’s __format__ method).

'a {} b'.format('XX')                           # => 'a XX b'
'a {foo} b {bar} c'.format(foo='XX', bar='YY')  # => 'a XX b YY c'
d = {'a': 1, 'b': 2}
'{a} and {b}'.format(**d)  # => '1 and 2'

x = 12.348
format(x, '0.2f')  # => '12.35'

Previously, you would have used '...' % ... instead of the format function.

Both format and str.format take format specs that are similar (but not identical) to the sprintf formatting codes.
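
A few format specs as a rough sketch. Most of these look like their printf counterparts, but some (such as ',' and '%') have no direct printf equivalent. The values are arbitrary.

```python
two_places = format(3.14159, '.2f')              # => '3.14'
padded = format(42, '5d')                        # => '   42'
as_hex = format(255, 'x')                        # => 'ff'
aligned = '{:>6}|{:<6}|'.format('ab', 'cd')      # => '    ab|cd    |'

# Not available with printf-style codes:
grouped = format(1234567, ',')                   # => '1,234,567'
percent = format(0.25, '.0%')                    # => '25%'

# The older %-style, for comparison:
old_style = '%0.2f' % 12.348                     # => '12.35'
```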

For more, see the docs at library/string.html. See also https://mkaz.github.io/2012/10/10/python-string-format/.

Regular Expressions

import re
s = 'foo123bar456baz'
re.split(r'\d+', s, flags=re.M|re.S) # => ['foo', 'bar', 'baz']

re.findall(r'\d+', s)  # => ['123', '456']
re.findall(r'xxx', s)  # => []

\A is beginning of string
\Z is end of string

Use (?:...) for a non-capturing group.
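
A sketch of how a non-capturing group changes what findall returns, and of how \A differs from ^ under re.M. The sample text is made up.

```python
import re

text = 'abab1 ab2'

# With a capturing group around 'ab', findall returns a tuple per match.
capturing = re.findall(r'(ab)+(\d)', text)        # => [('ab', '1'), ('ab', '2')]

# With (?:...) the 'ab' part still has to match, but is not captured.
non_capturing = re.findall(r'(?:ab)+(\d)', text)  # => ['1', '2']

# \A only matches at the very start of the string, even with re.M;
# ^ also matches right after each newline when re.M is set.
lines = 'foo\nbar'
caret = re.search(r'^bar', lines, flags=re.M)     # a match object
anchor = re.search(r'\Abar', lines, flags=re.M)   # => None
```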

re.match(r'\d{3}', s)   # looks for that at start of `s`
re.search(r'\d{3}', s)  # looks for that anywhere in `s`

re.match and re.search return None if no match.
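
Because of that, it is usually worth checking for None before touching the result. A minimal sketch, reusing the sample string from above:

```python
import re

s = 'foo123bar456baz'

# `s` does not start with digits, so match() finds nothing.
at_start = re.match(r'\d{3}', s)    # => None

# search() scans the whole string.
m = re.search(r'\d{3}', s)
if m is not None:
    found = m.group(0)   # => '123' (the matched text)
    where = m.start()    # => 3     (index where the match begins)
else:
    found, where = None, None
```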

some_match = re.search(r'some-regex', line)
some_match.group(0)    # Gets you the entire matched text (some_match itself is the match object).

# This is the one you usually want.
re.findall(r'...', s) # Global search. Returns a possibly-empty list.

re.findall(r'\{\{(.+?)}}', 'foo 12 {{bar}} 123{{baz}}45moo {{oof}}')
# => ['bar', 'baz', 'oof']

To search/replace: re.sub. Does a global search/replace. Use \1, \2, etc. to use groups in the replace-text.

# re.sub(regex, replacement, text)
re.sub(r'...(\d)-(\d)...', r'...\2-\1...', some_text)
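
A concrete, runnable version of the same idea, with made-up input text. The count argument shown at the end limits the number of replacements.

```python
import re

text = 'swap 12-34 and 5-6'

# \1 and \2 refer to the first and second captured groups.
swapped = re.sub(r'(\d+)-(\d+)', r'\2-\1', text)   # => 'swap 34-12 and 6-5'

# count= limits how many replacements are made (the default is all of them).
limited = re.sub(r'\d+', '#', 'a1b22c333', count=2)  # => 'a#b#c333'
```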

TODO:

replace non-breaking spaces (U+00A0) with regular space characters.
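
One possible way to do this, assuming the character in question is U+00A0 (NO-BREAK SPACE). Other space-like characters would need their own handling.

```python
# Two spellings of the same character: \u00a0 and \N{NO-BREAK SPACE}.
s = 'one\u00a0two\N{NO-BREAK SPACE}three'

# Plain str.replace is enough here; no regex needed.
cleaned = s.replace('\u00a0', ' ')   # => 'one two three'
```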

Unicode

Read http://nedbatchelder.com/text/unipain.html.

Unicode code points are written as 4, 5, or 6 hex digits prefixed with “U+”. Every character has an unambiguous full name in uppercase ASCII (for example, “CHECK MARK”).
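
The standard library's unicodedata module can be used to look up names and characters. A short sketch using the check mark mentioned above:

```python
import unicodedata

ch = '\u2713'

name = unicodedata.name(ch)                # => 'CHECK MARK'
code_point = f'U+{ord(ch):04X}'            # => 'U+2713'
by_name = unicodedata.lookup('CHECK MARK') # the same character, found by name
literal = '\N{CHECK MARK}'                 # the same character, as a string literal
```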

Code points map to bytes via an encoding. Use UTF-8.

Legacy: Back in Python 2, "this" gave you a str (a sequence of bytes), and u"this" gave you a unicode (a sequence of code points). You could then do unicode_s.encode('utf-8') to get a str (bytes), and s.decode('utf-8') to get a unicode (u"one of these"). Concatenating u"this" + "that" gets you u"thisthat" (a unicode). Python 2 tries to be helpful by doing implicit conversions, but this can result in pain.

In Python 3: "this" is a str, which is a sequence of code points. b"this" is a bytes, a sequence of bytes.

Python 3 does not try to implicitly convert for you; 'this' + b'that' fails (with a TypeError). b'this' != 'this'.
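
A quick check of that behavior in Python 3. The variable names are only for illustration.

```python
# Mixing str and bytes raises TypeError instead of converting silently.
try:
    'this' + b'that'
    mixing_failed = False
except TypeError:
    mixing_failed = True

# A bytes object never compares equal to a str, even with the "same" content.
different = b'this' != 'this'                   # => True

# You have to convert explicitly.
same_after_decode = b'this'.decode('utf-8') == 'this'   # => True
```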

open('foo.txt', 'r').read() gets you unicode/str (using the default encoding on this machine as reported by locale.getpreferredencoding()). open('foo.txt', 'rb').read() gets you bytes.

import locale
locale.getpreferredencoding()  # => 'UTF-8'

Careful: on Windows, the default encoding may be cp1252 (also known as Windows-1252).
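
One way to avoid depending on the machine's default is to pass encoding explicitly to open. A sketch using a throwaway temp directory (the file name is arbitrary):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'foo.txt')

    # Write and read as text, naming the encoding instead of relying on the default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('this → that')
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()    # a str

    # Binary mode gives you the raw bytes that are actually on disk.
    with open(path, 'rb') as f:
        raw = f.read()     # => b'this \xe2\x86\x92 that'
```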

Data coming into or going out of your program is all bytes. Decode incoming bytes into str, and encode outgoing str into bytes:

'hi there'.encode()  # => b'hi there'
b'hey'.decode()      # => 'hey'
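
The same round trip with a non-ASCII character, where the difference between code points and bytes becomes visible. Decoding with the wrong codec fails:

```python
text = 'this → that'

data = text.encode('utf-8')    # => b'this \xe2\x86\x92 that'
n_code_points = len(text)      # => 11
n_bytes = len(data)            # => 13 (the arrow takes 3 bytes in UTF-8)

back = data.decode('utf-8')    # => 'this → that'

# ASCII cannot represent the arrow's bytes, so decoding raises.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```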