Python Notes and Examples

Strings and Regexes

Strings are immutable sequences of Unicode code points. You can index into them and slice them.

Embedding unicode:
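A minimal sketch of the escape forms available in a str literal. CHECK MARK (U+2713) is only an example character chosen for illustration:

```python
# Three equivalent ways to write U+2713 CHECK MARK in a str literal.
by_hex = "\u2713"            # exactly 4 hex digits
by_long_hex = "\U00002713"   # exactly 8 hex digits; needed above U+FFFF
by_name = "\N{CHECK MARK}"   # the character's official Unicode name

print(by_hex == by_long_hex == by_name)  # True
print(len(by_hex))                       # 1 (one code point)
```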

Some methods:

capitalize
lower
upper
title

replace
splitlines
strip, lstrip, rstrip       Can take arg specifying what to strip!

count         how many non-overlapping times a given substring is present
find          returns -1 on failure
index         raises exception on failure

format
format_map

startswith    can take a tuple
endswith      same as above
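A few of the methods above in action, as a rough sketch. The sample strings are made up for illustration:

```python
s = "xxhello worldxx"

# strip takes a set of characters to remove, not a literal prefix/suffix.
stripped = s.strip("x")                     # 'hello world'
stripped_set = "abcHELLOcba".strip("abc")   # 'HELLO'

# count only counts non-overlapping occurrences.
pairs = "aaaa".count("aa")                  # 2

# find signals failure with -1; index raises ValueError.
missing = "hello".find("z")                 # -1
index_raised = False
try:
    "hello".index("z")
except ValueError:
    index_raised = True

# startswith/endswith accept a tuple of alternatives.
is_image = "photo.png".endswith((".png", ".jpg"))                    # True
is_web = "https://example.com".startswith(("http://", "https://"))   # True
```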

Easy way to get a list of words:
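The original snippet is missing here; the sketch below assumes it was str.split with no arguments, which splits on any run of whitespace and drops empty strings:

```python
text = "  the quick\tbrown\nfox  "

# No argument: split on runs of whitespace, ignoring leading/trailing space.
words = text.split()
print(words)  # ['the', 'quick', 'brown', 'fox']

# With an explicit separator, empty fields are kept.
fields = "a,,b".split(",")
print(fields)  # ['a', '', 'b']
```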

Converting between hex strings and ints:
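One way to do the conversion, sketched with the built-ins int, hex, and format (the value 255 is arbitrary):

```python
# Hex string -> int: pass base 16. The '0x' prefix is optional.
n = int("ff", 16)          # 255
m = int("0xff", 16)        # 255

# Int -> hex string.
with_prefix = hex(255)         # '0xff'
bare = format(255, "x")        # 'ff'
padded = f"{255:04x}"          # '00ff' (zero-padded to width 4)
upper = format(255, "X")       # 'FF'
```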

By default, str.replace does global search/replace, but you can pass a num-times arg to limit how many it performs.
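For example, a small sketch with a made-up input string:

```python
s = "a-b-c-d"

all_replaced = s.replace("-", "+")      # 'a+b+c+d'
first_only = s.replace("-", "+", 1)     # 'a+b-c-d'
first_two = s.replace("-", "+", 2)      # 'a+b+c-d'

# Strings are immutable, so the original is unchanged.
print(s)  # a-b-c-d
```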

Use repr to get a string representation of a Python object: ideally, how the object would be written in Python source code (so that it could be eval'd). Not every type's repr round-trips this way.
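A quick sketch of the difference between str and repr, using an arbitrary string that contains a quote and a newline:

```python
s = "it's\n"

# str gives the text itself; repr gives a literal you could paste into code.
as_repr = repr(s)
print(as_repr)  # "it's\n"  (quotes and the escaped newline are shown)

# For simple built-in types, eval(repr(x)) gives back an equal object.
round_trip = eval(as_repr)
list_repr = repr([1, "a"])  # "[1, 'a']"
```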

format

str.format fills in a template string. There is also a built-in format function, which formats a single value (it just calls its first argument's __format__ method).

Previously, the '...' % ... operator was used instead of str.format.

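Both format and str.format use the format specification mini-language, which resembles the sprintf formatting codes but is not identical to them (for example, {:>8.2f} rather than %8.2f).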
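A short sketch comparing the built-in format, str.format, format_map, and the older % operator. The names and values are illustrative only:

```python
# Built-in format: one value, one format spec.
two_places = format(3.14159, ".2f")                  # '3.14'

# str.format: a template with replacement fields.
positional = "{} is {:>5}".format("x", 42)           # 'x is    42'
named = "{name}: {value:08.3f}".format(name="pi", value=3.14159)  # 'pi: 0003.142'

# format_map takes a mapping directly instead of keyword arguments.
from_map = "{name}".format_map({"name": "pi"})       # 'pi'

# The older %-style equivalent of the positional example.
old_style = "%s is %5d" % ("x", 42)                  # 'x is    42'
```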

For more, see the docs at library/string.html. See also https://mkaz.github.io/2012/10/10/python-string-format/.

Regular Expressions

\A is beginning of string
\Z is end of string

Use (?:...) for a non-capturing group.

re.match and re.search return None if no match.
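A small sketch of the points above. The patterns and sample strings are invented for illustration:

```python
import re

# \A anchors to the start of the whole string, even with re.MULTILINE,
# whereas ^ then also matches at the start of each line.
m1 = re.search(r"\Afoo", "bar\nfoo", re.MULTILINE)   # None
m2 = re.search(r"^foo", "bar\nfoo", re.MULTILINE)    # a match object

# (?:...) groups without capturing, so the name is group 1.
m3 = re.match(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace")
name = m3.group(1) if m3 else None                   # 'Lovelace'

# re.match only tries at the start of the string; re.search scans.
m4 = re.match(r"world", "hello world")               # None
m5 = re.search(r"world", "hello world")              # match at index 6

# Always check for None before using the result.
if m4 is None:
    print("no match at start")
```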

To search/replace: re.sub. Does a global search/replace. Use \1, \2, etc. to use groups in the replace-text.
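For instance, a sketch that reorders date fields with backreferences (the date format and input are made up):

```python
import re

text = "2020-01-31 and 2021-12-25"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Use a raw string for the replacement so \1 is not read as an escape.
all_dates = re.sub(pattern, r"\3/\2/\1", text)
print(all_dates)   # 31/01/2020 and 25/12/2021

# count limits the number of replacements.
first_date = re.sub(pattern, r"\3/\2/\1", text, count=1)
print(first_date)  # 31/01/2020 and 2021-12-25
```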

TODO:

  • replace non-breaking spaces (U+00A0) with ordinary space characters.

Unicode

Read http://nedbatchelder.com/text/unipain.html.

Unicode code points are written as 4, 5, or 6 hex digits prefixed with “U+”. Every character has an unambiguous full name in uppercase ASCII (for example, “CHECK MARK”).

Code points map to bytes via an encoding. Use UTF-8.

Legacy: Back in Python 2, "this" gave you a str (a sequence of bytes), and u"this" gave you a unicode (a sequence of code points). You could call unicode_s.encode('utf-8') to get a str (bytes), and s.decode('utf-8') to get a unicode. Concatenating u"this" + "that" gave you u"thisthat" (a unicode). Python 2 tried to be helpful by doing implicit conversions with the ASCII codec, which caused pain as soon as non-ASCII data showed up.

In Python 3: "this" is a str, which is a sequence of code points. b"this" is a bytes, a sequence of bytes.

Python 3 does not try to implicitly convert for you; 'this' + b'that' raises TypeError, and b'this' != 'this'.
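A quick sketch of how Python 3 keeps str and bytes apart (the literals are arbitrary):

```python
text = "this"
data = b"that"

# Mixing str and bytes raises TypeError instead of converting silently.
concat_failed = False
try:
    text + data
except TypeError:
    concat_failed = True

# A bytes object never compares equal to a str.
equal = (b"this" == "this")              # False

# Convert explicitly, naming the encoding.
joined = text + data.decode("utf-8")     # 'thisthat'
```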

open('foo.txt', 'r').read() gets you unicode/str (using the default encoding on this machine as reported by locale.getpreferredencoding()). open('foo.txt', 'rb').read() gets you bytes.

Careful: on Windows, the default encoding may be cp1252 (also known as Windows-1252), not UTF-8.
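One way to sidestep the platform default is to pass encoding explicitly. A sketch, using a temporary file whose name (foo.txt) is just a placeholder:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "foo.txt")

# Write and read as text with an explicit encoding, not the locale default.
with open(path, "w", encoding="utf-8") as f:
    f.write("café")

with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()       # str: 'café'

# Binary mode skips decoding and returns the raw bytes.
with open(path, "rb") as f:
    as_bytes = f.read()      # bytes: b'caf\xc3\xa9'
```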

Data coming into or going out of your program is all bytes. Decode incoming bytes into unicode:
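A minimal sketch of decoding at the input boundary and encoding at the output boundary, assuming the data is UTF-8 (the byte string is a made-up example):

```python
# Bytes as they might arrive from a file, socket, or subprocess.
incoming = b"caf\xc3\xa9"

# Decode once, at the edge of the program.
text = incoming.decode("utf-8")     # 'café'
print(len(incoming), len(text))     # 5 4 (bytes vs. code points)

# Work with str internally.
shouted = text.upper()              # 'CAFÉ'

# Encode once, on the way out.
outgoing = shouted.encode("utf-8")  # b'CAF\xc3\x89'
```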