Python Notes and Examples

Strings are immutable. You can index into them and get slices of them.

len('hi')  #=> 2
'hello'[1] #=> 'e'
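
A small sketch of what immutability means in practice (the variable names here are made up for illustration): slices give you new strings, and assigning to an index raises TypeError.

```python
s = 'hello'

# Slices return new strings; the original is left untouched.
first_three = s[:3]     # 'hel'
last_two = s[-2:]       # 'lo'
reversed_s = s[::-1]    # 'olleh'

# Item assignment is not allowed on a str.
try:
    s[0] = 'j'
    mutated = True
except TypeError:
    mutated = False

# Build a new string instead.
jello = 'j' + s[1:]     # 'jello'
```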

Some methods:

capitalize
lower
upper
title

replace
splitlines
strip, lstrip, rstrip       Can take arg specifying what to strip!

count         how many non-overlapping times a given substring is present
find          returns -1 on failure
index         raises exception on failure

format
format_map

startswith    can take a tuple
endswith      same as above
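
A few of the methods above in action. This is only a sketch with throwaway example strings, meant to show the behaviors noted in the list (strip's argument, non-overlapping counts, find vs. index, tuple arguments).

```python
# strip with an argument removes any of the given characters from both ends.
stripped = 'xxhixyx'.strip('xy')     # 'hi'
left_only = '--hi--'.lstrip('-')     # 'hi--'

# count reports non-overlapping occurrences.
n = 'aaaa'.count('aa')               # 2

# find returns -1 on failure; index raises ValueError.
pos = 'hello'.find('z')              # -1
try:
    'hello'.index('z')
    raised = False
except ValueError:
    raised = True

# startswith / endswith accept a tuple of candidates.
is_image = 'photo.png'.endswith(('.png', '.jpg'))          # True
is_web = 'ftp://host'.startswith(('http://', 'https://'))  # False
```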

Easy way to get a list of words:

'foo bar baz moo'.split()

Converting between hex strings and ints:

hex(255)            #=> '0xff'
import sys
hex(sys.maxunicode) #=> '0x10ffff'
int('ff', 16)       #=> 255
int('0xff', base=0) #=> 255
# And, of course:
str(0xff)           #=> '255'

By default, str.replace replaces every occurrence, but you can pass a count arg to limit how many replacements it performs.
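
For instance (a minimal sketch with a made-up string):

```python
s = 'a-b-c-d'

all_replaced = s.replace('-', '+')     # 'a+b+c+d'
first_two = s.replace('-', '+', 2)     # 'a+b+c-d' (only the first two)
```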

Use repr to get a string representation of a Python object. That is, how the object would be written in Python code.
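
A quick comparison of str and repr, as a sketch. The round-trip through eval shown at the end works for many built-in values but is not guaranteed for arbitrary objects.

```python
text = 'hi\tthere'

as_str = str(text)      # the text itself, containing a real tab character
as_repr = repr(text)    # includes the quotes, and shows the tab as \t

# For many built-in values, eval(repr(x)) gives back an equal object.
values = [1.5, 'a b', [1, 2], {'k': None}]
round_trips = all(eval(repr(v)) == v for v in values)
```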

1 format

The str.format method. There is also a built-in format function, which formats a single value (it just calls its first arg's __format__ method).

'a {} b'.format('XX')                          #=> 'a XX b'
'a {foo} b {bar} c'.format(foo='XX', bar='YY') #=> 'a XX b YY c'
d = {'a': 1, 'b': 2}
'{a} and {b}'.format(**d) #=> '1 and 2'

x = 12.348
format(x, '0.2f') #=> '12.35'

Older code used '...' % ... instead of str.format.

Both format and str.format take format specs that closely resemble the sprintf formatting codes (the docs call this the format specification mini-language).
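
A sketch of the same formatting code used three ways, plus a few specs that go beyond plain printf-style codes. The values are arbitrary examples.

```python
x = 12.348

old_style = '%0.2f' % x              # '12.35'
new_style = '{:0.2f}'.format(x)      # '12.35'
builtin = format(x, '0.2f')          # '12.35'

# Alignment and width: > is right-align, < is left-align.
padded = '{:>6}|{:<6}|'.format('a', 'b')   # '     a|b     |'

# Thousands separator, and hex with the 0x prefix.
thousands = format(1234567, ',')     # '1,234,567'
as_hex = format(255, '#x')           # '0xff'
```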

For more, see the docs at library/string.html. See also https://mkaz.github.io/2012/10/10/python-string-format/.

2 Regular Expressions

import re
s = 'foo123bar456baz'
re.split(r'\d+', s, flags=re.M|re.S)
#=> ['foo', 'bar', 'baz']

re.findall(r'\d+', s)
#=> ['123', '456']

\A is beginning of string
\Z is end of string
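
To see how these differ from ^ and $, here is a sketch using a two-line string with re.M turned on. With that flag, ^ and $ match at every line, while \A and \Z still match only at the ends of the whole string.

```python
import re

text = 'one\ntwo'

line_starts = re.findall(r'^\w+', text, flags=re.M)     # ['one', 'two']
string_start = re.findall(r'\A\w+', text, flags=re.M)   # ['one']

line_ends = re.findall(r'\w+$', text, flags=re.M)       # ['one', 'two']
string_end = re.findall(r'\w+\Z', text, flags=re.M)     # ['two']
```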

Use (?:...) for a non-capturing group.
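
This matters most with re.findall, which returns the captured groups rather than the whole match when the pattern has any. A sketch with a made-up string:

```python
import re

s = 'ab12 cd34'

# Capturing group: findall returns only what the group matched.
captured = re.findall(r'(ab|cd)\d+', s)       # ['ab', 'cd']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'(?:ab|cd)\d+', s)        # ['ab12', 'cd34']
```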

re.match and re.search return None if no match.
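
So check the result before calling methods on it. A minimal sketch (the string and variable names are just for illustration); note that re.match only tries to match at the start of the string, while re.search scans through it.

```python
import re

line = 'error code 42'

m = re.search(r'\d+', line)
if m is not None:
    number = m.group(0)     # '42'
else:
    number = None

# re.match anchors at the start of the string, so this is None.
at_start = re.match(r'\d+', line)

# No digits anywhere, so re.search also gives None.
missing = re.search(r'\d+', 'no digits')
```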

m = re.search(r'some-regex', line)  # m is the match object (or None)
m.group(0)    # Gets you the entire matched text.

# This is the one you usually want.
re.findall(r'...', s) # Global search. Returns a possibly-empty list.

re.findall(r'\{\{(.+?)}}', 'foo 12 {{bar}} 123{{baz}}45moo {{oof}}')
#=> ['bar', 'baz', 'oof']

To search/replace: re.sub. Does a global search/replace. Use \1, \2, etc. to use groups in the replace-text.

# re.sub(regex, replacement, text)
re.sub(r'...(\d)-(\d)...', r'...\2-\1...', some_text)
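
A concrete version of the above, as a sketch with made-up input. It also shows the count argument and the \g<name> syntax for named groups.

```python
import re

text = 'a 1-2 b 3-4'

# Swap the two digits around each dash.
swapped = re.sub(r'(\d)-(\d)', r'\2-\1', text)            # 'a 2-1 b 4-3'

# count limits the number of replacements.
once = re.sub(r'(\d)-(\d)', r'\2-\1', text, count=1)      # 'a 2-1 b 3-4'

# Named groups are referenced with \g<name> in the replacement.
iso = re.sub(r'(?P<d>\d{2})/(?P<m>\d{2})/(?P<y>\d{4})',
             r'\g<y>-\g<m>-\g<d>', '31/12/2024')          # '2024-12-31'
```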

TODO:

  • replace non-breaking spaces (U+00A0) with the ordinary space character.

3 Unicode

Read http://nedbatchelder.com/text/unipain.html.

Unicode code points are written as 4, 5, or 6 hex digits prefixed with “U+”. Every character has an unambiguous full name in uppercase ASCII (for example, “CHECK MARK”).
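
The standard unicodedata module lets you go between characters, names, and code points. A short sketch using CHECK MARK (U+2713):

```python
import unicodedata

check = '\u2713'                                  # written by code point

name = unicodedata.name(check)                    # 'CHECK MARK'
by_name = '\N{CHECK MARK}'                        # written by name
looked_up = unicodedata.lookup('CHECK MARK')      # same character again

code_point = 'U+{:04X}'.format(ord(check))        # 'U+2713'
```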

Code points map to bytes via an encoding. Use UTF-8.
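
A sketch of that mapping: the same two code points become five bytes in UTF-8, and a different encoding gives different bytes (or fails outright if it cannot represent the character).

```python
s = '\u00e9\u2713'                  # 'é' and CHECK MARK: two code points

utf8 = s.encode('utf-8')            # b'\xc3\xa9\xe2\x9c\x93'
n_chars, n_bytes = len(s), len(utf8)    # 2 and 5

latin1 = '\u00e9'.encode('latin-1')     # b'\xe9' (one byte in Latin-1)

# Latin-1 has no CHECK MARK, so encoding it fails.
try:
    '\u2713'.encode('latin-1')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```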

Legacy: Back in Python 2, "this" gave you a str, a sequence of bytes. u"this" gave you a unicode, a sequence of code points. You could then do unicode_s.encode('utf-8') to get a str (bytes), and s.decode('utf-8') to get a unicode (u"one of these"). Concatenating u"this" + "that" gets you u"thisthat" (a unicode). Python 2 tries to be helpful by doing implicit conversions, but this can result in pain.

In Python 3: "this" is a str, which is a sequence of code points. b"this" is a bytes, a sequence of bytes.

Python 3 does not try to implicitly convert for you; 'this' + b'that' fails. b'this' != 'this'.
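
A sketch of what that looks like, and of the explicit conversion you do instead:

```python
# Mixing str and bytes raises TypeError.
try:
    'this' + b'that'
    failed = False
except TypeError:
    failed = True

# A bytes object never compares equal to a str.
equal = (b'this' == 'this')                    # False

# Convert explicitly, then combine.
joined = 'this' + b'that'.decode('utf-8')      # 'thisthat'
```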

open('foo.txt', 'r').read() gets you unicode/str (using the default encoding on this machine as reported by locale.getpreferredencoding()). open('foo.txt', 'rb').read() gets you bytes.

import locale
locale.getpreferredencoding()  #=> 'UTF-8'

Careful: on Windows, the default encoding may be cp1252 (also known as “Windows-1252”).
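
One way to avoid depending on the machine's default is to pass encoding= to open explicitly. A sketch, writing to a throwaway temp file (the path and contents are made up for the demo):

```python
import os
import tempfile

# Throwaway file just for this demo.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('caf\u00e9')

# Text mode with an explicit encoding gives you str.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Binary mode gives you the raw bytes.
with open(path, 'rb') as f:
    raw = f.read()
```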

Data coming into or going out of your program is all bytes. Decode incoming bytes into str, and encode str back into bytes on the way out (both default to UTF-8):

'hi there'.encode() #=> b'hi there'
b'hey'.decode() #=> 'hey'