Python dictionary implementation

August 29, 2011

This post describes how dictionaries are implemented in the Python language.

Dictionaries are indexed by keys and they can be seen as associative arrays. Let’s add 3 key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

The values can be accessed this way:

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

The key ‘d’ does not exist so a KeyError exception is raised.

Hash tables

Python dictionaries are implemented using hash tables. It is an array whose indexes are obtained using a hash function on the keys. The goal of a hash function is to distribute the keys evenly in the array. A good hash function minimizes the number of collisions e.g. different keys having the same hash. Python does not have this kind of hash function. Its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

We are going to assume that we are using strings as keys for the rest of this post. The hash function for strings in Python is defined as:

arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len >= 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

If you run hash(‘a’) in Python, it will execute string_hash() and return 12416037344. Here we assume we are using a 64-bit machine.

If an array of size x is used to store the key/value pairs then we use a mask equal to x-1 to calculate the slot index of the pair in the array. This makes the computation of the slot index fast. The probability to find an empty slot is high due to the resizing mechanism described below. This means that having a simple computation makes sense in most of the cases. If the size of the array is 8, the index for ‘a’ will be: hash(‘a’) & 7 = 0. The index for ‘b’ is 3, the index for ‘c’ is 2, the index for ‘z’ is 3 which is the same as ‘b’, here we have a collision.

hash table

We can see that the Python hash function does a good job when the keys are consecutive which is good because it is quite common to have this type of data to work with. However, once we add the key ‘z’, there is a collision because it is not consecutive enough.

We could use a linked list to store the pairs having the same hash but it would increase the lookup time e.g. not O(1) anymore. The next section describes the collision resolution method used in the case of Python dictionaries.

Open addressing

Open addressing is a method of collision resolution where probing is used. In case of ‘z’, the slot index 3 is already used in the array so we need to probe for a different index to find one which is not already used. Adding a key/value pair might take more time because of the probing but the lookup will be O(1) and this is the desired behavior.

A quadratic probing sequence is used to find a free slot. The code is the following:

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;

Recurring on 5*j+1 quickly magnifies small differences in the bits that didn’t affect the initial index. The variable “perturb” gets the other bits of the hash code into play.

Just out of curiosity, let’s look at the probing sequence when the table size is 32 and j = 3.
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

You can read more about this probing sequence by looking at the source code of dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.

open addressing

Now, let’s look at the Python internal code along with an example.

Dictionary C structures

The following C structure is used to store a dictionary entry: key/value pair. The hash, key and value are stored. PyObject is the base class of the Python objects.

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

The following structure represents a dictionary. ma_fill is the number of used slots + dummy slots. A slot is marked dummy when a key pair is removed. ma_used is the number of used slots (active). ma_mask is equal to the array’s size minus 1 and is used to calculate the slot index. ma_table is the array and ma_smalltable is the initial array of size 8.

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Dictionary initialization

When you first create a dictionary, the function PyDict_New() is called. I removed some of the lines and converted the C code to pseudocode to concentrate on the key concepts.

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_value) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object

Adding items

When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair. It checks if the key is a string and calculates the hash or reuses the one cached if it exists. insertdict() is called to add the new key/value pair and the dictionary is resized if the number of used slots + dummy slots is greater than 2/3 of the array’s size.
Why 2/3? It is to make sure the probing sequence can find a free slot fast enough. We will look at the resizing function later.

arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table

inserdict() uses the lookup function lookdict_string() to find a free slot. This is the same function used to find a key. lookdict_string() calculates the slot index using the hash and the mask values. If it cannot find the key in the slot index = hash & mask, it starts probing using the loop described above, until it finds a free slot. At the first probing try, if the key is null, it returns the dummy slot if found during the first lookup. This gives priority to re-use the previously deleted slots.

We want to add the following key/value pairs: {‘a’: 1, ‘b’: 2′, ‘z’: 26, ‘y’: 25, ‘c’: 5, ‘x’: 24}. This is what happens:

A dictionary structure is allocated with internal table size of 8.

  • PyDict_SetItem: key = ‘a’, value = 1
    • hash = hash(‘a’) = 12416037344
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12416037344 & 7 = 0
        • slot 0 is not used so return it
      • init entry at index 0 with key, value and hash
      • ma_used = 1, ma_fill = 1
  • PyDict_SetItem: key = ‘b’, value = 2
    • hash = hash(‘b’) = 12544037731
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12544037731 & 7 = 3
        • slot 3 is not used so return it
      • init entry at index 3 with key, value and hash
      • ma_used = 2, ma_fill = 2
  • PyDict_SetItem: key = ‘z’, value = 26
    • hash = hash(‘z’) = 15616046971
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15616046971 & 7 = 3
        • slot 3 is used so probe for a different slot: 5 is free
      • init entry at index 5 with key, value and hash
      • ma_used = 3, ma_fill = 3
  • PyDict_SetItem: key = ‘y’, value = 25
    • hash = hash(‘y’) = 15488046584
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15488046584 & 7 = 0
        • slot 0 is used so probe for a different slot: 1 is free
      • init entry at index 1 with key, value and hash
      • ma_used = 4, ma_fill = 4
  • PyDict_SetItem: key = ‘c’, value = 3
    • hash = hash(‘c’) = 12672038114
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12672038114 & 7 = 2
        • slot 2 is free so return it
      • init entry at index 2 with key, value and hash
      • ma_used = 5, ma_fill = 5
  • PyDict_SetItem: key = ‘x’, value = 24
    • hash = hash(‘x’) = 15360046201
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15360046201 & 7 = 1
        • slot 1 is used so probe for a different slot: 7 is free
      • init entry at index 7 with key, value and hash
      • ma_used = 6, ma_fill = 6

This is what we have so far:

python dictionary insert

6 slots on 8 are used now so we are over 2/3 of the array’s capacity. dictresize() is called to allocate a larger array. This function also takes care of copying the old table entries to the new table.

dictresize() is called with minused = 24 in our case which is 4 * ma_used. 2 * ma_used is used when the number of used slots is very large (greater than 50000). Why 4 times the number of used slots? It reduces the number of resize steps and it increases sparseness.

The new table size needs to be greater than 24 and it is calculated by shifting the current size 1 bit left until it is greater than 24. It ends up being 32 e.g. 8 -> 16 -> 32.

This is what happens with our table during resizing: a new table of size 32 is allocated. Old table entries are inserted into the new table using the new mask value which is 31. We end up with the following:

python dictionary table resizing

Removing items

PyDict_DelItem() is called to remove an entry. The hash for this key is calculated and the lookup function is called to return the entry. The slot is now a dummy slot.

We want to remove the key ‘c’ from our dictionary. We end up with the following array:

Python dictionary delete key

Note that the delete item operation doesn’t trigger an array resize if the number of used slots is much less that the total number of slots. However, when a key/value pair is added, the need for resize is based on the number of used slots + dummy slots so it can shrink the array too.

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback. If you need help with a project written in Python or with building a new web service, I am available as a freelancer: LinkedIn profile. Follow me on Twitter @laurentluce.

tags:
posted in Uncategorized by Laurent Luce

Follow comments via the RSS Feed | Leave a comment | Trackback URL

27 Comments to "Python dictionary implementation"

  1. Tweets that mention Python’s dictionary implementation | Laurent Luce Blog -- Topsy.com wrote:

    [...] This post was mentioned on Twitter by HN Firehose, newsery1. newsery1 said: Python's dictionary implementation – http://bit.ly/g4Q8y4 – [Hacker News FH] [...]

  2. Valentin wrote:

    Hi, thank you very much for comprehensive coverage of the topic. I really enjoy it. But you left me with one minor question. Does python dict resized when we remove an item? In fact my experience with large JSON objects tells me that it never shrinks.

  3. Laurent Luce wrote:

    @Valentin: There is no resize when an item is removed but if you look at dictresize, you will see that if the new size is equal to 8 then it means it is a large table getting reduced to an array of size 8. The need for resize is only checked when an element is added (as far as I know). Shrinkage happens when there are too many unused slots.

  4. Ram Rachum wrote:

    Question: Isn’t that memory-hungry? I mean, if you have 1000 keys, wouldn’t it take a huge array to ensure there are no collisions?

  5. Marius Gedminas wrote:

    “If you run hash(‘a’) in Python, it will execute string_hash() and return 12416037344.”

    Unless you’re on a 32-bit system, in which case hash(‘a’) returns -468864544.

  6. Laurent Luce wrote:

    @Marius: Good point. I updated the post to indicate that I am using a 64-bit machine.

  7. Laurent Luce wrote:

    @Ram: More memory is used to make sure the probing doesn’t take too long so inserting a new item is fast enough. Taking your example of inserting 1000 keys, the following will happen:
    1- array starts with size 8.
    2- when used slots == 6 (2/3 of capacity), size increased to 32
    3- when used slots == 22 (2/3 of capacity), size increased to 128
    4- when used slots == 86 (2/3 of capacity), size increased to 512
    5- when used slots == 342 (2/3 of capacity), size increased to 2048
    The array will be resized 4 times to accommodate 1000 items and at that time you will have 1048 free slots but this helps with sparseness which is key.

  8. Benjamin wrote:

    I like the detail in the explanation, but you left out one hugely important detail: how lookups work. It appears that just using the object hash with the bitmap is insufficient to determine slot reference in the case of collisions.

    Also, partial code excerpts tend to be terribly unhelpful. Without the context of knowing what a variable is, seeing what’s done to them doesn’t have meaning. I’d much rather see a written explanation, or simplified pseudo-code, personally.

  9. Ram Rachum wrote:

    Are you sure that 2048 cells are enough to accommodate 1000 items? What if two items have a collision? Won’t you need to enlarge the array (say to 10,000 or 100,000, who knows) in order that each of the 1000 items will have its own slot?

  10. Carlos Costa wrote:

    Thank you for this excelente article :-)

    I’d like to point out a little mistake (not serious, but awful for newcomers):
    >>> d = {‘a’: 1, ‘b’, 2}
    File “”, line 1
    d = {‘a’: 1, ‘b’, 2}
    ^
    SyntaxError: invalid syntax

    Best regards,
    Carlos.

  11. Laurent Luce wrote:

    @Benjamin: Lookup is taken care by the same lookup function used when a new item is added. It uses the probing technique until it finds the correct slot. I will add a section on that. Thanks for the feedback.

    When I get a chance, I will change the code excerpts to pseudo-code as I agree it can be easier to read. Some people do like to see real code though e.g. me :)

  12. Laurent Luce wrote:

    @Ram: When 2 keys collide, probing is used to find a free slot. Please see the section “Open Addressing” to learn more about it. A quadratic formula is used to probe for new slots until a free one is found. This probing technique works faster when 1/3 of the array is free so this is why the array is resized each time 2/3 of the capacity is reached.

    In your example, only when the 1366th item is added, the array is resized to from 2048 to 8192 slots.

  13. Laurent Luce wrote:

    @Carlos: Thanks for pointing that out. I updated the post.

  14. Weekly Python News: 02/11/2011 « The Mouse Vs. The Python wrote:

    [...] Python’s dictionary implementation [...]

  15. Filipe wrote:

    Nice explanation, thanks.

    On the first line under the title “Hash tables”, this: “It is an array which indexes are obtain” could be fixed to: “It is an array whose indexes are obtained”.

  16. Laurent Luce wrote:

    @Filipe. Thanks for noticing that.

  17. Sean Holdsworth wrote:

    Thank you for the interesting article, but of course there are a few niggles… ;)

    The use of the term ‘power’ in your first block of pseudo code should actually be xor

    It would be interesting to understand how precomputed hash values are cached and how those values are used. I suppose you could use a hash table, but that way lies madness; recursion, see recursion.

  18. Michael wrote:

    In your pseudocode for string_hash you translate ^ as power, but of course you mean xor.

  19. Laurent Luce wrote:

    @Sean Holdsworth: Thanks for reporting the power/xor mistake. If you are referring to the string’s hash value, it is cached in the string object structure using the attribute ob_shash. See Include/stringobject.h.

  20. David wrote:

    Hi,

    The following algorithm is from function lookdict_string(PyDictObject *mp, PyObject *key, long hash);
    1 i is the current slot index
    2 set perturb to hash
    3 forever loop:
    4 set i to i << 2 + i + perturb + 1
    5 set slot index to i & mask
    6 if slot is free:
    7 return it
    8 right shift perturb by 5 bits

    I want to implement this algorithm in my own program to store a large number (about 13 M) of key/value pairs. (The keys are strings.) But I find that this algorithm fails to find a free slot after inserting 70,000 key/value pairs, though I set the hash table size to 30 M.

    So my question is, does python support a dictionary at this scale (10M entries)?

    Thanks!

  21. David wrote:

    Hi,

    “Just out of curiosity, let’s look at the probing sequence when the table size is 32 e.g. mask = 31
    3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…”

    By my calculation, the sequence is:
    3 11 21 29 30 8 11 24 25 30 23 20 5 26 3 16 17 22 15 12 29 18 27 8 9 14 7 4 …

    Can you please confirm your results?
    Thanks!

  22. Laurent Luce wrote:

    @David: I simulated the probing loop with the following Python code and I am getting the same result as previously noted.

    i = 3
    mask = 31
    perturb = hash(‘z’)
    while True:
    i = (i < < 2) + i + perturb + 1
    slot = i & mask
    print slot
    perturb >>= 5

  23. Want it faster? Hash Table is the answer! « My Developed World wrote:

    [...] see how Python implements Hash Table, have a look at this link Share this:Like this:LikeBe the first to like [...]

  24. Andreas Paul wrote:

    Hi Laurent,

    thanks for this great article, it really helps understanding Python’s dict implementation.

    I tried to implement the probing algorithm in python:

    https://gist.github.com/3724747

    As you can see I’m trying to insert several ‘z’ chars into the table, which results in multiple collisions.
    What I don’t understand is, after the 5th ‘z’ gets inserted, why doesn’t the probing algorithm ever terminate and never reach one of the empty table slots 1, 2 or 4?
    When perturb reaches -1 (I’m on a 32bit OS btw) the probing slot just toggles between slot 3 and 7.

    Can this somehow be the correct behaviour, because more than 5 consecutive collisions is very unlikely?
    Do you see any mistakes?

    BTW @David
    Your calculated sequence is the sequence for a table with 32 slots, which makes the mask 31 and the first slot lookup hash(‘z’) & 31 = 27
    Whereas Laurent used the 3 for the initial slot, which came from hash(‘z’) & 7 example, but then used 31 for the mask from the 32 slot table example.

  25. Andreas Paul wrote:

    Okay, I can answer myself.
    When I run my script on a 64bit machine it works just like it’s supposed to. :)

    The problem was indeed that I was running it on a 32 bit machine, where the hash function return a negative number for the char ‘z’, that’s the reason, why perturb doesn’t reach 0 and every table slot is probed.

    And the hash function returns a different number, because the char pointer size is 4 bytes on 32bit and 8 bytes on 64bit.

  26. Shriram wrote:

    Thank you so much Laurent for writing this post.
    I’ve some knowledge of Hash Table in Java and learnt about techniques and its implementation via wiki and other sources.
    Then I started using dict in Python.
    It was good until I had my own question on how it’s implemented.
    I came across this post. didn’t read this entirely.
    I jumped straight to the comments which’re excellent questions.
    Thank you for patiently answering them !
    It was pleasure, enjoyable, and very entertaining to read :)

  27. David wrote:

    @Andreas Paul and @Laurent Luce, thank you you both. I know where my problem is.

Leave Your Comment

 
Powered by Wordpress and MySQL. Theme by Shlomi Noach, openark.org