Python etc
Regular tips about Python and programming in general

Owner — @pushtaev

© CC BY-SA 4.0 — mention if repost
Pagination is a pretty standard problem that countless developers solve every day. If you use a relational database, you can explicitly set the offset with LIMIT:

SELECT *
FROM table
LIMIT 1000, 100


That indeed returns a hundred records: rows 1001 to 1100. The thing is, it's as hard for the database as selecting all 1100 tuples. So the later the page your user requests, the slower the result comes back.

The solution is to use WHERE instead of an offset, asking the client to provide the last result of her current page ($last_seen_id in the example):

SELECT *
FROM table
WHERE id > $last_seen_id
ORDER BY id ASC
LIMIT 100


See this excellent article on the subject.
If you have a CPU-heavy task and want to utilize all the cores you have, then multiprocessing.Pool is for you. It spawns multiple processes and delegates tasks to them automatically. Simply create a pool with Pool(number_of_processes) and run p.map with the list of inputs.

In : import math
In : from multiprocessing import Pool
In : inputs = [i ** 2 for i in range(100, 130)]
In : def f(x):
...:     return len(str(math.factorial(x)))
...:

In : %timeit [f(x) for x in inputs]
1.44 s ± 19.2 ms per loop (...)

In : p = Pool(4)
In : %timeit p.map(f, inputs)
451 ms ± 34 ms per loop (...)
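As a side note, a standalone script needs the if __name__ == '__main__' guard, because multiprocessing may re-import the main module in child processes. A minimal sketch of the same measurement as a script (illustrative, not from the original post):

import math
from multiprocessing import Pool

def f(x):
    return len(str(math.factorial(x)))

if __name__ == '__main__':
    inputs = [i ** 2 for i in range(100, 130)]
    # the with statement closes the pool automatically
    with Pool(4) as p:
        print(p.map(f, inputs)[:3])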
The Python multiprocessing module allows you to spawn not only processes but threads as well. Mind, however, that CPython is notorious for its GIL (global interpreter lock), an interpreter feature that doesn't allow different threads to run Python bytecode simultaneously.

That means that threads are only useful when your program spends time outside of the Python interpreter, usually waiting for IO. For example, downloading three different Wikipedia articles at once with threads is as efficient as with processes (and three times as efficient as downloading them one by one):

import requests
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def download_wiki_article(article):
    url = 'http://de.wikipedia.org/wiki/'
    return requests.get(url + article)

process_pool = Pool(3)
thread_pool = ThreadPool(3)

thread_pool.map(download_wiki_article, ['a', 'b', 'c'])
# 376 ms ± 11 ms

process_pool.map(download_wiki_article, ['a', 'b', 'c'])
# 373 ms ± 3.17 ms

[download_wiki_article(a) for a in ['a', 'b', 'c']]
# 1.09 s ± 27.9 ms


On the other hand, it doesn't make much sense to solve CPU-heavy tasks with threads:

import math
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def f(x):
    return len(str(math.factorial(x)))

process_pool = Pool(4)
thread_pool = ThreadPool(4)
inputs = [i ** 2 for i in range(100, 130)]

[f(x) for x in inputs]
# 1.48 s ± 7.61 ms

thread_pool.map(f, inputs)
# 1.48 s ± 7.78 ms

process_pool.map(f, inputs)
# 478 ms ± 7.55 ms
When Python executes a method call, say a.f(b, c, d), it must first select the right f function. Due to polymorphism, the choice depends on the type of a. The process of choosing the method is usually called dynamic dispatch.

Python supports only single-dispatch polymorphism because a single object alone (a in the example) affects the method selection. Some other languages, however, may also consider the types of b, c and d. This mechanism is called multiple dispatch. C# is a notable example of a language that supports this technique.

However, multiple dispatch can be emulated with single dispatch. The visitor design pattern was created exactly for this: what a visitor does is essentially call single dispatch twice to imitate double dispatch.
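
Here is a minimal sketch of the idea (the shape and visitor names are illustrative, not from the original post):

class Circle:
    def accept(self, visitor):
        # the first dispatch: on the type of the shape
        return visitor.visit_circle(self)

class Square:
    def accept(self, visitor):
        return visitor.visit_square(self)

class AreaCalculator:
    # the second dispatch: on the type of the visitor
    def visit_circle(self, circle):
        return 'circle area'

    def visit_square(self, square):
        return 'square area'

Square().accept(AreaCalculator())
# 'square area'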

Mind that the ability to overload methods (as in Java and C++) is not the same as multiple dispatch: dynamic dispatch works at runtime while overloading only affects compile time.
Though decorators and context managers are quite similar and often interchangeable, context managers are far more limited: you can't skip the wrapped block or execute it twice; it always runs exactly once.

However, you can control whether an exception raised inside the context is propagated to the caller. It's done in a slightly obscure way: the exception is suppressed if __exit__ returns a true value:

class Atomic:
    def __enter__(self):
        print('BEGIN')

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type:
            print(
                'ROLLBACK due to {}({})'.format(
                    exc_type, exc_value
                )
            )
        else:
            print('COMMIT')

        return True

with Atomic():
    print('A')

with Atomic():
    print('B')
    raise RuntimeError('C')


The output is:

BEGIN
A
COMMIT
BEGIN
B
ROLLBACK due to <class 'RuntimeError'>(C)
If you have to search through a sorted collection, binary search is what you need. This simple algorithm compares the target value to the middle of the array; the result determines which half should be searched next.

The Python standard library provides a way to use binary search without implementing it yourself. The bisect_left function returns the leftmost position where the element can be inserted into a sorted list, while bisect_right returns the rightmost one.

In : from random import randrange
In : from bisect import bisect_left
In : n = 1000000
In : look_for = 555555
In : lst = sorted(randrange(0, n) for _ in range(n))

In : %timeit look_for in lst
69.7 ms ± 449 µs per loop

In : %timeit look_for == lst[bisect_left(lst, look_for)]
927 ns ± 2.28 ns per loop
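Mind that the one-liner above can raise IndexError when the element is greater than everything in the list. A small helper (the name is mine, not from the original post) makes the membership test safe:

from bisect import bisect_left

def binary_contains(sorted_list, value):
    # guard against the insertion point falling past the end
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value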
To store any information in memory or on a storage device, you should represent it in bytes. Python usually provides the level of abstraction where you can think about data itself, not its byte form.

Still, when you write, say, a string to a file, you deal with the physical structure of the data. To put characters into a file, you have to transform them into bytes; that is called encoding. When you get bytes from a file, you probably want to convert them into meaningful characters; that is called decoding.

There are hundreds of encoding methods out there. The most popular one is probably Unicode, but you can't transform anything into bytes with it: in the sense of byte representation, Unicode is not even an encoding. Unicode defines a mapping between characters and their integer codes; 🐍 is 128 013, for example.

But to put integers into a file, you need a real encoding. Unicode is usually paired with utf-8, which is (usually) the default in Python. When you read from a file, Python automatically decodes utf-8. You can choose any other encoding with the encoding= parameter of the open function, or you can read plain bytes by appending b to its mode.
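
A small illustration of the difference (the file name here is made up):

text = 'snake: \N{SNAKE}'

data = text.encode('utf-8')           # str -> bytes: encoding
assert data.decode('utf-8') == text   # bytes -> str: decoding
assert ord('\N{SNAKE}') == 128013     # the Unicode code point

with open('example.txt', 'w', encoding='utf-8') as f:
    f.write(text)                     # encoded implicitly

with open('example.txt', 'rb') as f:  # b: plain bytes, no decoding
    assert f.read().decode('utf-8') == text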
Is Python interpreted or compiled? The simple answer here is interpreted; the right one is — it's both.

Python compiles your source code to bytecode (.pyc files). It does that implicitly, but it's still an essential phase of Python code execution. Java, for example, does the same but explicitly: you compile with javac and run with java.

Despite that, Python is usually called an interpreted language while Java is called a compiled one, which is, strictly speaking, not entirely correct.
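
If you want to look at that bytecode yourself, the standard dis module disassembles it (a quick illustration, not from the original post):

import dis

dis.dis('x + 1')  # prints the bytecode the interpreter actually runs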

Here is an article on the subject with more details and explanations.
You can use any object as a dictionary key in Python as long as it implements the __hash__ method. This method can return any integer as long as the only requirement is met: equal objects must have equal hashes (the reverse is not required).

You should also avoid using mutable objects as keys: once an object is no longer equal to its old self, it can't be found in the dictionary anymore.

There is also one bizarre thing that might surprise you during debugging or unit testing:

In : class A:
...:     def __init__(self, x):
...:         self.x = x
...:
...:     def __hash__(self):
...:         return self.x
...:
In : hash(A(2))
Out: 2
In : hash(A(1))
Out: 1
In : hash(A(0))
Out: 0
In : hash(A(-1)) # sic!
Out: -2
In : hash(A(-2))
Out: -2


In CPython, the hash value -1 is internally reserved for signaling errors, so it's implicitly converted to -2.
Creating an external process in Python is an easy task: you can do it with the subprocess module. However, reading both stdout and stderr of the spawned process can be more challenging.

Let's suppose we ask Popen to create two pipes, one for stdout and one for stderr:

p = subprocess.Popen(
    ["python", "-c", "..."],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)


Now we have to read from them. The problem is, you can't just do readline() on either of those pipes, since it can cause deadlocks. Consider a more concrete example:

import subprocess

SUBPROCESS_CODE = """
import sys
sys.stderr.write('err')
print('out')
"""

p = subprocess.Popen(
    ["python", "-c", SUBPROCESS_CODE],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

print(p.stdout.readline())


The primary process creates the child and waits for stdout. The child process first writes to the stderr pipe and then to stdout. The main process successfully receives 'out' from the pipe. But what about 'err'? It's stored in the pipe buffer until the main process does p.stderr.readline(). But what if the buffer is full?

import subprocess

SUBPROCESS_CODE = """
import sys
for _ in range(100000):
    sys.stderr.write('err')
print('out')
"""

p = subprocess.Popen(
    ["python", "-c", SUBPROCESS_CODE],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

print(p.stdout.readline())


In this case sys.stderr.write('err') will block at some point until someone reads from the buffer. But no one ever will: the main process waits for data in stdout. This is the deadlock we are talking about.

To solve this problem, you should read from both stdout and stderr at once. You can do it with the select module or simply use p.communicate(). The second approach is much more straightforward but doesn't let you read the data line by line.
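
For completeness, a deadlock-free sketch of the example above using communicate() (same SUBPROCESS_CODE as before):

p = subprocess.Popen(
    ["python", "-c", SUBPROCESS_CODE],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# communicate() drains both pipes concurrently,
# so neither side of the child can block forever
out, err = p.communicate()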
Sometimes you need to create a function from a more universal one.
For example, int() has a base parameter which we would like to freeze to get a new base2 function:

>>> int("10")
10
>>> int("10", 2)
2
>>> def base2(x):
...     return int(x, 2)
...
>>> base2("10")
2


functools.partial allows you to do the same in a more accurate and semantically clear way:

from functools import partial

base2 = partial(int, base=2)


It can be helpful when you need to pass a function as an argument to another, higher-order function, but some of its arguments should be locked:

>>> list(map(partial(int, base=2), ["1", "10", "100"]))
[1, 2, 4]


Without partial you'd write something like this:

>>> list(map(lambda x: int(x, base=2), ["1", "10", "100"]))
[1, 2, 4]
The Liskov substitution principle tells us that if S is a subtype of T, then all occurrences of T may be replaced with S without breaking any code. That means that S should satisfy all the guarantees that T introduces.

Whenever you work with dict you usually assume that as long as x in d returns False, d[x] raises KeyError. But if your dict is a defaultdict, that's simply not true. Does defaultdict violate the LSP then?

Strictly speaking, it doesn't. The dict documentation explicitly says that d[x] may return something even though x is not present in d (if the __missing__ method is defined). So you may still get something from the dictionary even if the key is not in it. That means you should never rely on x in d to predict what d[x] does, as long as you want to support all dict subclasses.
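A quick illustration of the behavior in question:

from collections import defaultdict

d = defaultdict(int)
print('x' in d)  # False
print(d['x'])    # 0: __missing__ supplied a default instead of KeyError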

It may seem that defaultdict violates the LSP, but it “breaks” only guarantees that were never there.
Sometimes you want to know whether something is a function or not. The obvious solution is to check the object class with isinstance. The class of functions is function, but you can't access it directly. You can instead get the type of any existing function:

FunctionType = type(lambda: None)


Now you can do the check:

def isfunction(object):
    return isinstance(object, FunctionType)


Luckily, all the above code is already written for you: FunctionType is an existing member of the types module, and isfunction already exists in the inspect module.

Note that you usually don't care whether something is a function, but rather whether it's callable. That can be checked with callable:

>>> callable(int)
True
>>> callable(42)
False
>>> callable(callable)
True
The format method of Python strings is a mighty tool that supports a lot of things you are probably not even aware of. Each replacement placeholder ({...}) may contain three parts: a field name, a conversion and a format specification.

The field name is used to specify which argument exactly should be used as a replacement:

>>> '{}'.format(42)
'42'
>>> '{1}'.format(1, 2)
'2'
>>> '{y}'.format(x=1, y=2)
'2'


The conversion lets you ask format to use repr() (or ascii()) instead of str() when converting objects to strings:

>>> from datetime import datetime
>>> '{!r}'.format(datetime.now())
'datetime.datetime(2018, 5, 3, 23, 48, 49, 157037)'
>>> '{}'.format(datetime.now())
'2018-05-03 23:49:01.060852'


Finally, the format specification is a way to define how values are presented:

>>> '{:+,}'.format(1234567)
'+1,234,567'
>>> '{:>19}'.format(1234567)
'            1234567'


This specification may be applied to a single object with the format built-in function (not the str method):

>>> format(5000000, '+,')
'+5,000,000'


The format function internally calls the __format__ method of the object, so you can alter its behavior for your own types.
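
For example, a hypothetical Money class (an illustrative sketch, not from the original post) can delegate to the numeric rules and add its own touch:

class Money:
    def __init__(self, amount):
        self.amount = amount

    def __format__(self, spec):
        # reuse the numeric format specification, then add a sign
        return '$' + format(self.amount, spec)

format(Money(5000000), '+,')
# '$+5,000,000'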
Let's suppose you have some datetime object and want to know how much time has passed since the start of the day. How do you do that?

First of all, to know what day we are talking about, we need the timezone; having the datetime alone is not enough. As long as you have the timezone, you can convert the datetime and strip the time with the date() method:

def date_of_time(datetime_object, tz):
    return datetime_object.astimezone(tz).date()


Having the date, we can get its midnight time. To do this, we glue 00:00:00 back to the date and attach the original timezone:

import datetime

def midnight_of_date(date_in_given_timezone, tz):
    midnight = datetime.datetime.combine(
        date_in_given_timezone, datetime.time()
    )

    return tz.localize(midnight)


And now we put things together:

def midnight(datetime_object, tz):
    return midnight_of_date(
        date_of_time(datetime_object, tz), tz
    )
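
A usage sketch, assuming the pytz package (the localize method above is its API):

import datetime
import pytz

tz = pytz.timezone('Europe/Moscow')
now = datetime.datetime.now(tz)

print(midnight(now, tz))
# e.g. 2018-05-04 00:00:00+03:00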
PEP 424 allows generators and other iterable objects that don't have the exact predefined size to expose a length hint. For example, the following generator will likely return ~50 elements:

(x for x in range(100) if random() > 0.5)


If you write an iterable and want to add the hint, define the __length_hint__ method. If the length is known for sure, use __len__ instead.

If you use an iterable and want to know its expected length, use operator.length_hint.
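
A minimal sketch (the Lottery class is made up for illustration):

from random import random
import operator

class Lottery:
    def __iter__(self):
        return (x for x in range(100) if random() > 0.5)

    def __length_hint__(self):
        return 50  # the expected, not exact, number of elements

print(operator.length_hint(Lottery()))  # 50
print(operator.length_hint([1, 2, 3]))  # 3, taken from __len__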
In Python, for lets you fetch elements from a collection without thinking about their indexes:

def find_odd(lst):
    for x in lst:
        if x % 2 == 1:
            return x

    return None


If you do care about indexes, you can iterate over range(len(lst)):

def find_odd(lst):
    for i in range(len(lst)):
        x = lst[i]
        if x % 2 == 1:
            return i, x

    return None, None


But perhaps the most semantically correct and expressive way to do the same is to use enumerate:

def find_odd(lst):
    for i, x in enumerate(lst):
        if x % 2 == 1:
            return i, x

    return None, None
The itertools.chain function is a way to iterate over many iterables as though they are glued together:

In : from itertools import chain
In : list(chain(['a', 'b'], range(3), set('xyz')))
Out: ['a', 'b', 0, 1, 2, 'x', 'z', 'y']


Sometimes you want to know whether a generator is empty (or rather, exhausted). To find out, you have to try getting the next element from the generator. If that works, you'd like to put the element back, which of course is not possible. You can glue it back with chain instead:

def sum_of_odd(gen):
    try:
        first = next(gen)
    except StopIteration:
        raise ValueError('Empty generator')

    return sum(
        x for x in chain([first], gen)
        if x % 2 == 1
    )


Usage example:

In : sum_of_odd(x for x in range(1, 6))
Out: 9
In : sum_of_odd(x for x in range(2, 3))
Out: 0
In : sum_of_odd(x for x in range(2, 2))
...
ValueError: Empty generator
In Python, an else block can appear not only after if, but after for and while as well. The code inside else is executed unless the loop was interrupted by break.

A common usage is to search for something in a loop and break when it's found:

In : first_odd = None
In : for x in [2,3,4,5]:
...:     if x % 2 == 1:
...:         first_odd = x
...:         break
...: else:
...:     raise ValueError('No odd elements in list')
...:
In : first_odd
Out: 3

In : for x in [2,4,6]:
...:     if x % 2 == 1:
...:         first_odd = x
...:         break
...: else:
...:     raise ValueError('No odd elements in list')
...:
...
ValueError: No odd elements in list
Since loops in Python don't create scopes, you usually need an extra function to create closures. The straightforward way doesn't work:

multipliers = []
for i in range(10):
    multipliers.append(lambda x: x * i)

[multipliers[i](2) for i in range(5)]
# [18, 18, 18, 18, 18]


Let's add the extra function:

multiplier_creator = lambda i: lambda x: x * i
for i in range(10):
    multipliers.append(multiplier_creator(i))


It works this way, but the code can be clumsy, especially if you need def, not lambda:

def multiplier_creator(i):
    def multiplier(x):
        return x * i

    return multiplier

for i in range(10):
    multipliers.append(multiplier_creator(i))


To make it slightly more readable, you can write a universal function and create partials of it:

from functools import partial

multiplier = lambda x, i: x * i
for i in range(10):
    multipliers.append(partial(multiplier, i=i))


You can always emulate partial with a custom lambda, but the repr of a partial is generally more readable:

In : partial(int, base=2)
Out: functools.partial(<class 'int'>, base=2)

In : lambda x: int(x, base=2)
Out: <function __main__.<lambda>>


Fun fact: thanks to the operator module, this particular example can be expressed in an even more appealing way:

import operator

for i in range(10):
    multipliers.append(partial(operator.mul, i))