Strings and Regexes
Strings are immutable sequences of Unicode code points. You can index into them and take slices of them.
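A minimal sketch of indexing, slicing, and immutability. The variable names here are only for illustration:

```python
s = 'hello'

first = s[0]          # 'h'
last = s[-1]          # 'o'
middle = s[1:4]       # 'ell'
backwards = s[::-1]   # 'olleh'

# Strings are immutable: item assignment raises TypeError.
try:
    s[0] = 'J'
except TypeError as exc:
    error = str(exc)

# Build a new string instead of modifying the old one.
changed = 'J' + s[1:]  # 'Jello'
```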
len('hi') # => 2
'hello'[1] # => 'e'
"""This is
a multi-line
string."""
"x" * 3 # => "xxx"
"""Ignore the newline\
at the end of that line."""
# ^^ No space between "newline" and "at".
Embedding Unicode:
"this → that"
# or
"this \u2192 that"
# or
"this \N{RIGHTWARDS ARROW} that"
Some methods:
capitalize
lower
upper
title
replace
splitlines
strip, lstrip, rstrip Can take arg specifying what to strip!
count how many non-overlapping times a given substring is present
find returns -1 on failure
index raises exception on failure
format
format_map
startswith can take a tuple
endswith same as above
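A few of the methods above in action. This is a small sketch with made-up sample strings, not an exhaustive tour:

```python
# strip() with an argument removes any of those characters from both ends.
stripped = '--==hello==--'.strip('-=')   # 'hello'

# count() counts non-overlapping occurrences.
n = 'aaaa'.count('aa')                   # 2

# find() returns -1 on failure; index() raises ValueError.
pos = 'hello'.find('z')                  # -1
try:
    'hello'.index('z')
    index_failed = False
except ValueError:
    index_failed = True

# startswith()/endswith() accept a tuple of candidates.
is_image = 'photo.jpeg'.endswith(('.jpg', '.jpeg', '.png'))  # True

# format_map() takes a mapping directly, without ** unpacking.
greeting = '{name} is {age}'.format_map({'name': 'Ann', 'age': 30})  # 'Ann is 30'
```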
Easy way to get a list of words:
'foo bar baz moo'.split() # => ['foo', 'bar', 'baz', 'moo']
Converting hex strings ↔︎ ints:
hex(255) # => '0xff'
import sys
hex(sys.maxunicode) # => '0x10ffff'
int('ff', 16) # => 255
# Base 0 means to look at the 2-char prefix to determine the base.
int('0xff', 0) # => 255
# And, of course:
str(0xff) # => '255'
By default, str.replace does a global search/replace, but
you can pass a count argument to limit how many replacements it performs.
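For instance, with an illustrative sample string:

```python
s = 'a-b-c-d'

all_replaced = s.replace('-', '+')     # 'a+b+c+d'
first_two = s.replace('-', '+', 2)     # 'a+b+c-d' (only the first two)
```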
Use repr to get a string representation of a Python
object. That is, how the object would be written in Python code (to be
eval’d).
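A quick sketch of the round trip. It uses ast.literal_eval rather than eval as a safer stand-in, since it only accepts literals:

```python
import ast

s = "it's\n"
r = repr(s)                      # the source-code form, quotes and escapes included

# Evaluating the repr gives back an equal object.
round_trip = ast.literal_eval(r)

numbers = repr([1, 'two', 3.0])  # "[1, 'two', 3.0]"
```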
format
todo: mention format strings (“f-strings”).
The str.format method. There is also a built-in
format function, which formats a single value (it just
calls its first arg’s __format__ method).
'a {} b'.format('XX') # => 'a XX b'
'a {foo} b {bar} c'.format(foo='XX', bar='YY') # => 'a XX b YY c'
d = {'a': 1, 'b': 2}
'{a} and {b}'.format(**d) # => '1 and 2'
x = 12.348
format(x, '0.2f') # => '12.35'
Previously, '...' % ... was used instead of the
format function.
Both format and str.format take format specs that closely resemble
(but are not identical to) the sprintf formatting codes.
For more, see the docs at library/string.html. See also https://mkaz.github.io/2012/10/10/python-string-format/.
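The same format spec works across format, str.format, and f-strings (Python 3.6+). A minimal side-by-side sketch, with sample values chosen only for illustration:

```python
x = 12.348
name = 'pi-ish'

a = format(x, '0.2f')        # '12.35'
b = '{:0.2f}'.format(x)      # '12.35'
c = f'{x:0.2f}'              # '12.35' (f-string)
d = '%0.2f' % x              # '12.35' (older %-style)

# A few other specs: alignment/width, alternate-form hex, thousands separator.
padded = f'{name:>10}|'      # '    pi-ish|'
hexed = f'{255:#06x}'        # '0x00ff'
thousands = f'{1234567:,}'   # '1,234,567'
```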
Regular Expressions
import re
s = 'foo123bar456baz'
re.split(r'\d+', s, flags=re.M|re.S) # => ['foo', 'bar', 'baz']
re.findall(r'\d+', s) # => ['123', '456']
re.findall(r'xxx', s) # => []
\A is beginning of string
\Z is end of string
Use (?:...) for a non-capturing group.
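A short sketch of how \A and \Z differ from ^ in multi-line mode, and of what a non-capturing group changes for findall. The sample strings are made up:

```python
import re

text = 'first line\nsecond line'

# With re.M, ^ matches at the start of every line; \A only at the very start.
caret = re.findall(r'^\w+', text, flags=re.M)       # ['first', 'second']
anchored = re.findall(r'\A\w+', text, flags=re.M)   # ['first']

# \Z matches only at the very end of the string.
at_end = re.findall(r'\w+\Z', text, flags=re.M)     # ['line']

# findall returns the captured group when the pattern has one...
capturing = re.findall(r'(foo|bar)\d+', 'foo1 bar22')        # ['foo', 'bar']
# ...and the whole match when the group is non-capturing.
non_capturing = re.findall(r'(?:foo|bar)\d+', 'foo1 bar22')  # ['foo1', 'bar22']
```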
re.match(r'\d{3}', s) # looks for that at start of `s`
re.search(r'\d{3}', s) # looks for that anywhere in `s`
re.match and re.search return None if no match.
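Because of that, it is usually worth checking the result before using it. A minimal sketch:

```python
import re

s = 'foo123bar456baz'

m = re.match(r'\d{3}', s)        # None: s does not start with three digits
found = re.search(r'\d{3}', s)   # a match object for the first hit

# Guard against None before calling methods on the result.
if found is not None:
    whole = found.group(0)       # '123'
    span = found.span()          # (3, 6)
```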
some_match = re.search(r'some-regex', line) # A match object, or None.
some_match.group(0) # Gets you the entire matched text.
# This is the one you usually want.
re.findall(r'...', s) # Global search. Returns a possibly-empty list.
re.findall(r'\{\{(.+?)}}', 'foo 12 {{bar}} 123{{baz}}45moo {{oof}}')
# => ['bar', 'baz', 'oof']
To search/replace: re.sub. Does a global search/replace.
Use \1, \2, etc. to use groups in the
replace-text.
# re.sub(regex, replacements, text)
re.sub(r'...(\d)-(\d)...', r'...\2-\1...', some_text)
TODO:
- replace unbreakable whitespace with space character.
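A concrete re.sub sketch with group references. The last line is one possible take on the TODO item, assuming "unbreakable whitespace" means the non-breaking space U+00A0:

```python
import re

# Swap the two numbers around each dash using group references.
swapped = re.sub(r'(\d+)-(\d+)', r'\2-\1', 'rooms 12-34 and 5-6')
# 'rooms 34-12 and 6-5'

# count= limits the number of replacements.
once = re.sub(r'\d+', '#', 'a1 b2 c3', count=1)    # 'a# b2 c3'

# Assumption: map non-breaking spaces (U+00A0) to plain spaces.
cleaned = re.sub('\u00a0', ' ', 'no\u00a0break')   # 'no break'
```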
Unicode
Read http://nedbatchelder.com/text/unipain.html.
Unicode code points are written as 4, 5, or 6 hex digits prefixed with “U+”. Every character has an unambiguous full name in uppercase ASCII (for example, “CHECK MARK”).
Code points map to bytes via an encoding. Use UTF-8.
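To make the code point / name / bytes distinction concrete, here is a small sketch using the CHECK MARK character and the standard unicodedata module:

```python
import unicodedata

ch = '\u2713'

name = unicodedata.name(ch)        # 'CHECK MARK'
code_point = f'U+{ord(ch):04X}'    # 'U+2713'
same = '\N{CHECK MARK}' == ch      # True

# One code point, different byte sequences depending on the encoding.
utf8 = ch.encode('utf-8')          # 3 bytes: e2 9c 93
utf16 = ch.encode('utf-16-le')     # 2 bytes: 13 27
```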
Legacy: Back in Python 2,
"this" gave you a str, a sequence of bytes. u"this" gave you a unicode, a sequence of code points. You could then do unicode_s.encode('utf-8') to get a str (bytes), and s.decode('utf-8') to get a unicode (u"one of these"). Concatenating u"this" + "that" gets you u"thisthat" (a unicode). Python 2 tries to be helpful by doing implicit conversions, but this can result in pain.
In Python 3: "this" is a str, which is a
sequence of code points. b"this" is a bytes, a
sequence of bytes.
Python 3 does not try to implicitly convert for you;
'this' + b'that' fails. b'this' != 'this'.
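A minimal demonstration, along with the explicit conversions that do work:

```python
# Mixing str and bytes raises TypeError.
try:
    'this' + b'that'
    mixed_failed = False
except TypeError:
    mixed_failed = True

# bytes and str never compare equal, even with the same ASCII content.
equal = (b'this' == 'this')                    # False

# Convert explicitly, in whichever direction you need.
joined = 'this' + b'that'.decode('utf-8')      # 'thisthat'
as_bytes = 'this'.encode('utf-8') + b'that'    # b'thisthat'
```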
open('foo.txt', 'r').read() gets you unicode/str (using
the default encoding on this machine as reported by
locale.getpreferredencoding()).
open('foo.txt', 'rb').read() gets you bytes.
import locale
locale.getpreferredencoding() # => 'UTF-8'
Careful: on Windows, the default encoding may be CP-1252 (also known as Windows-1252).
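One way to sidestep the platform default is to pass encoding= explicitly. A sketch that writes and reads a scratch file (the temporary path is only for illustration):

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('this \u2192 that')

# Text mode with an explicit encoding gives a str.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Binary mode gives the raw bytes, undecoded.
with open(path, 'rb') as f:
    raw = f.read()
```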
Data coming into or going out of your program is all bytes. Decode incoming bytes into str (Unicode), and encode outgoing str into bytes:
'hi there'.encode() # => b'hi there'
b'hey'.decode() # => 'hey'
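With non-ASCII data, the encoding matters. A sketch of the decode-at-the-edges pattern, including what happens when the bytes are not valid UTF-8 (the sample bytes are made up):

```python
incoming = b'caf\xc3\xa9'                  # UTF-8 bytes from outside the program
text = incoming.decode('utf-8')            # 'café'
outgoing = text.upper().encode('utf-8')    # b'CAF\xc3\x89'

# Bytes that are not valid UTF-8 raise UnicodeDecodeError by default...
bad = b'caf\xe9'                           # Latin-1 encoded, not UTF-8
try:
    bad.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# ...unless an errors= policy is given, or the right encoding is used.
replaced = bad.decode('utf-8', errors='replace')   # 'caf\ufffd'
latin = bad.decode('latin-1')                      # 'café'
```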