Item 21: Be Defensive when Iterating Over Arguments

Notes

  • Often when a function accepts a list of values it wants to iterate over them
    • Sometimes multiple times

Example: American Tourism Numbers

  • Given a dataset of visitors to a city (in the millions)
  • Calculate what percentage of overall tourism each city receives
  • Have to normalise the dataset
    • First sum the dataset
    • Second divide through the dataset
def normalise(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


visits = [15, 35, 80]
percentages = normalise(visits)
print(percentages)
assert sum(percentages) == 100.0
[11.538461538461538, 26.923076923076923, 61.53846153846154]
  • To scale this up use a generator from a file
    • However, returns an empty list!
import os


def normalise(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


def read_visits(path):
    with open(path) as f:
        for line in f:
            yield int(line)


try:
    # generate some fake data
    visits = [15, 35, 80]
    path = "temp.txt"
    with open(path, "w") as f:
        f.writelines([f"{visit}\n" for visit in visits])

    # use the fake data
    it = read_visits(path)
    percentages = normalise(it)
    print(percentages)
except Exception as e:
    print(f"{type(e)}: {e}")
finally:
    if os.path.exists(path):
        os.remove(path)
[]
  • Example above fails because an iterator only produces it’s results once
  • Once it raises a StopIteration exception it won’t return anything further
    • Note, that a StopIteration is raised once
  • Constructors that expect StopIteration can’t tell the difference between an iterator that returns nothing and an exhausted iterator
import os


def read_visits(path):
    with open(path) as f:
        for line in f:
            yield int(line)


try:
    # generate some fake data
    visits = [15, 35, 80]
    path = "temp.txt"
    with open(path, "w") as f:
        f.writelines([f"{visit}\n" for visit in visits])

    it = read_visits(path)
    print(list(it))  # generates a list
    print(list(it))  # already consumed
except Exception:
    pass
finally:
    if os.path.exists(path):
        os.remove(path)
[15, 35, 80]
[]
  • To get around the problem
    • Can explicitly exhaust a generator into a list
    • Has the downside that the generator’s content need to be read into memory
    • Doesn’t work for large or arbitrarily long generators
import os


def read_visits(path):
    with open(path) as f:
        for line in f:
            yield int(line)


def normalise_copy_into_list(numbers):
    numbers_copy = list(numbers)  # copy the iterator
    total = sum(numbers_copy)
    result = []
    for value in numbers_copy:
        percent = 100 * value / total
        result.append(percent)
    return result


try:
    # generate some fake data
    visits = [15, 35, 80]
    path = "temp.txt"
    with open(path, "w") as f:
        f.writelines([f"{visit}\n" for visit in visits])

    it = read_visits(path)
    percentages = normalise_copy_into_list(it)
    print(percentages)
    assert sum(percentages) == 100.0
except Exception:
    pass
finally:
    if os.path.exists(path):
        os.remove(path)
[11.538461538461538, 26.923076923076923, 61.53846153846154]
  • We wrote the generator approach to avoid having to load the entire dataset into memory in the first place!
  • Alternative solution, accept a function that returns an iterator each time it’s called
import os


def read_visits(path):
    with open(path) as f:
        for line in f:
            yield int(line)


def normalise_receive_iterator(get_iter):
    total = sum(get_iter())
    result = []
    for value in get_iter():
        percent = 100 * value / total
        result.append(percent)
    return result


try:
    # generate some fake data
    visits = [15, 35, 80]
    path = "temp.txt"
    with open(path, "w") as f:
        f.writelines([f"{visit}\n" for visit in visits])

    percentages = normalise_receive_iterator(lambda: read_visits(path))
    print(percentages)
    assert sum(percentages) == 100.0
except Exception:
    pass
finally:
    if os.path.exists(path):
        os.remove(path)
[11.538461538461538, 26.923076923076923, 61.53846153846154]
  • This works, but it’s clumsy
    • We have to pass a lambda lambda : read_visits(path) to our normalise function
  • Better solution to use a container that supports the iterator protocol
    • The protocol that for loops and related syntax use to traverse a container
    • for x in foo: is syntactic sugar for the iter(foo)
      • Returns the iterator to loop through
    • iter built-in calls the foo.__iter__ dunder method
    • __iter__ method must return an iterator object
      • Iterator object must support the __next__ method
    • for loop repeatedly calls __next__ built-in function on the iterator until its exhausted
      • Via the StopIteration exception
  • The easy way to implement this is to implement __iter__ as a method
    • See the sample implementation below
import os


class ReadVisits:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield int(line)


def normalise(numbers):
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


try:
    # generate some fake data
    visits = [15, 35, 80]
    path = "temp.txt"
    with open(path, "w") as f:
        f.writelines([f"{visit}\n" for visit in visits])

    it = ReadVisits(path)
    percentages = normalise(it)
    print(percentages)
    assert sum(percentages) == 100.0
except Exception:
    pass
finally:
    if os.path.exists(path):
        os.remove(path)
[11.538461538461538, 26.923076923076923, 61.53846153846154]
  • We define a lightweight class that yields lines from the specified file
  • We can then pass this to the original normalise implementation
    • The built-in sum function calls the ReadVisits.__iter__ method
    • Creates a new iterator object
    • the normalisation for loop again calls ReadVisits.__iter__
      • Accesses a second iterator object
    • Each iterator advances and exhausts independently
  • Downside of the above is that it reads the data twice
  • We can invert the order of control so that the functions ensure the parameters aren’t just iterators
    • An iterator passed to iter must return the iterator itself
    • When a container is passed to iter a new iterator is returned
    • Provides a testable behaviour we can use to switch on
import os


def normalise_defensive(numbers):

    if iter(numbers) is numbers:
        raise TypeError("Must supply a container")
    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


# works on a list
numbers_list = [15, 35, 80]
numbers_percentage = normalise_defensive(numbers_list)
assert sum(numbers_percentage) == 100.0
print("Successfully normalised numbers list")

# fails on an iterator
numbers_iter = iter(numbers_list)
normalise_defensive(numbers_iter)  # should raise an exception
Successfully normalised numbers list
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 24
     22 # fails on an iterator
     23 numbers_iter = iter(numbers_list)
---> 24 normalise_defensive(numbers_iter)  # should raise an exception

Cell In[7], line 7, in normalise_defensive(numbers)
      4 def normalise_defensive(numbers):
      6     if iter(numbers) is numbers:
----> 7         raise TypeError("Must supply a container")
      8     total = sum(numbers)
      9     result = []

TypeError: Must supply a container
  • Alternatively can import the Iterator class from the abstract collections built-in module (collections.abc)
from collections.abc import Iterator


def normalise_defensive(numbers):

    if isinstance(numbers, Iterator):
        raise TypeError("Must supply a container")

    total = sum(numbers)
    result = []
    for value in numbers:
        percent = 100 * value / total
        result.append(percent)
    return result


# works on a list
numbers_list = [15, 35, 80]
numbers_percentage = normalise_defensive(numbers_list)
assert sum(numbers_percentage) == 100.0
print("Successfully normalised numbers list")


# works on our custom class
class ReadVisits:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield int(line)


try:
    # generate some fake data
    path = "temp.txt"
    with open(path, "w") as f:
        f.writelines([f"{visit}\n" for visit in numbers_list])

    it = ReadVisits(path)
    percentages = normalise_defensive(it)
    print(percentages)
    assert sum(percentages) == 100.0
    assert percentages == numbers_percentages
    print("Successfully normalised ReadVisits object")
except Exception:
    pass
finally:
    if os.path.exists(path):
        os.remove(path)

# fails on an iterator
numbers_iter = iter(numbers_list)
normalise_defensive(numbers_iter)  # should raise an exception
Successfully normalised numbers list
[11.538461538461538, 26.923076923076923, 61.53846153846154]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 55
     53 # fails on an iterator
     54 numbers_iter = iter(numbers_list)
---> 55 normalise_defensive(numbers_iter)  # should raise an exception

Cell In[8], line 7, in normalise_defensive(numbers)
      4 def normalise_defensive(numbers):
      6     if isinstance(numbers, Iterator):
----> 7         raise TypeError("Must supply a container")
      9     total = sum(numbers)
     10     result = []

TypeError: Must supply a container
  • Expecting a container works if you don’t want to copy the full iterator
    • Requires us to perform multiple iterations over the data though
  • As you can see above, normalise_defensive accepts the built-in list and our user-defined ReadVisits class
    • In theory any other Iterator class
  • Can also extend this approach to asynchronous iterators

Things to Remember

  • Beware of functions and methods that iterate over input arguments multiple times
    • If arguments are iterators you might see strange behaviour and missing values
  • Python’s iterator protocol defines the iter and next built-ins
    • Defines how for loops and related constructs operate
  • You can define your own iterable container via the __iter__ dunder method
  • You can detect that a value is an iterator
    • Calling iter on an iterator returns itself
    • Calling iter on a container returns a new iterator
    • Alternatively containers.abc supplies the Iterator class
      • You can test for it with isinstance