Least frequently used cache eviction scheme with complexity O(1) in Python

June 10, 2015

This post describes the implementation in Python of a “Least Frequently Used” (LFU) algorithm cache eviction scheme with complexity O(1). The algorithm is described in this paper written by Prof. Ketan Shah, Anirban Mitra and Dhruv Matani. The naming in the implementation follows the naming in the paper.

An LFU cache eviction scheme is useful for an HTTP caching network proxy, for example, where we want the least frequently used items to be removed from the cache.

The goal here is for the LFU cache algorithm to have a runtime complexity of O(1) for all of its operations, which include insertion, access and deletion (eviction).

Doubly linked lists are used in this algorithm: one list for the access frequencies, where each frequency node contains a list of the elements with that access frequency. Let's say we have five elements in our cache: two have been accessed one time and three have been accessed two times. In that case, the access frequency list has two nodes (frequency = 1 and frequency = 2). The first frequency node has two nodes in its list and the second frequency node has three nodes in its list.

LFU doubly linked lists.

How do we build that? The first object we need is a node:

class Node(object):
    """Node containing data, pointers to previous and next node."""
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

Next, our doubly linked list. Each node has a prev and next attribute equal to the previous node and next node respectively. The head is set to the first node and the tail to the last node.

LFU doubly linked list.

We can define our doubly linked list with methods to add a node at the end of the list, insert a node, remove a node and get a list of the nodes' data.

class DoublyLinkedList(object):
    def __init__(self):
        self.head = None
        self.tail = None
        # Number of nodes in list.
        self.count = 0

    def add_node(self, cls, data):
        """Add node instance of class cls."""
        return self.insert_node(cls, data, self.tail, None)

    def insert_node(self, cls, data, prev, next):
        """Insert node instance of class cls."""
        node = cls(data)
        node.prev = prev
        node.next = next
        if prev:
            prev.next = node
        if next:
            next.prev = node
        if not self.head or next is self.head:
            self.head = node
        if not self.tail or prev is self.tail:
            self.tail = node
        self.count += 1
        return node

    def remove_node(self, node):
        if node is self.tail:
            self.tail = node.prev
        else:
            node.next.prev = node.prev
        if node is self.head:
            self.head = node.next
        else:
            node.prev.next = node.next
        self.count -= 1

    def get_nodes_data(self):
        """Return the nodes data as a list."""
        data = []
        node = self.head
        while node:
            data.append(node.data)
            node = node.next
        return data
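As a quick check, here is a minimal usage sketch of the list above (the variable names are just for illustration):

dll = DoublyLinkedList()
first = dll.add_node(Node, 1)            # list: 1
second = dll.add_node(Node, 2)           # list: 1, 2
dll.insert_node(Node, 3, first, second)  # list: 1, 3, 2
dll.remove_node(first)                   # list: 3, 2
print dll.get_nodes_data()               # [3, 2]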

Each node in the access frequency doubly linked list is a frequency node (Freq Node on the diagram below). It is a node and also a doubly linked list containing the elements (Item nodes on the diagram below) of the same frequency. Each item node has a pointer to its frequency node parent.

LFU frequency and items doubly linked list.

class FreqNode(DoublyLinkedList, Node):
    """Frequency node containing a linked list of item nodes with
       the same frequency."""
    def __init__(self, data):
        DoublyLinkedList.__init__(self)
        Node.__init__(self, data)

    def add_item_node(self, data):
        node = self.add_node(ItemNode, data)
        node.parent = self
        return node

    def insert_item_node(self, data, prev, next):
        node = self.insert_node(ItemNode, data, prev, next)
        node.parent = self
        return node

    def remove_item_node(self, node):
        self.remove_node(node)

class ItemNode(Node):
    def __init__(self, data):
        Node.__init__(self, data)
        self.parent = None

The item node data is equal to the key of the element we are storing, an HTTP request could be the key. The content itself (HTTP response for example) is stored in a dictionary. Each value in this dictionary is of type LfuItem where “data” is the content cached, “parent” is a pointer to the frequency node and “node” is a pointer to the item node under the frequency node.

LFU Item.

class LfuItem(object):
    def __init__(self, data, parent, node):
        self.data = data
        self.parent = parent
        self.node = node

We have defined our data objects classes, now we can define our cache object class. It has a doubly linked list (access frequency list) and a dictionary to contain the LFU items (LfuItem above). We defined two methods: one to insert a frequency node and one to remove a frequency node.

class Cache(DoublyLinkedList):
    def __init__(self):
        DoublyLinkedList.__init__(self)
        self.items = dict()

    def insert_freq_node(self, data, prev, next):
        return self.insert_node(FreqNode, data, prev, next)

    def remove_freq_node(self, node):
        self.remove_node(node)

Next step is to define methods to insert to the cache, access the cache and delete from the cache.

Let’s look at the insert method logic. It takes a key and a value, for example HTTP request and response. If the frequency node with frequency one does not exist, it is inserted at the beginning of the access frequency linked list. An item node is added to the frequency node items linked list. The key and value are added to the dictionary. Complexity is O(1).

def insert(self, key, value):
    if key in self.items:
        raise DuplicateException('Key exists')
    freq_node = self.head
    if not freq_node or freq_node.data != 1:
        freq_node = self.insert_freq_node(1, None, freq_node)
    item_node = freq_node.add_item_node(key)
    self.items[key] = LfuItem(value, freq_node, item_node)

If we insert two elements into our cache, we end up with:

LFU insert method.

Let’s look at the access method logic. If the key does not exist, we raise an exception. If the key exists, we move the item node to the frequency node list with frequency + 1 (adding the frequency node if it does not exist). Complexity is O(1).

def access(self, key):
    try:
        tmp = self.items[key]
    except KeyError:
        raise NotFoundException('Key not found')

    freq_node = tmp.parent
    next_freq_node = freq_node.next

    if not next_freq_node or next_freq_node.data != freq_node.data + 1:
        next_freq_node = self.insert_freq_node(freq_node.data + 1,
            freq_node, next_freq_node)
    item_node = next_freq_node.add_item_node(key)
    tmp.parent = next_freq_node

    freq_node.remove_item_node(tmp.node)
    if freq_node.count == 0:
        self.remove_freq_node(freq_node)

    tmp.node = item_node
    return tmp.data

If we access the item with Key 1, the item node with data Key 1 is moved to the frequency node with frequency equal to 2. We end up with:

LFU access method.

If we access the item with Key 2, the item node with data Key 2 is moved to the frequency node with frequency equal to 2. The frequency node 1 is removed. We end up with:

LFU access 2 method.

Let’s look at the delete_lfu method. It removes the least frequently used item from the cache. To do that, it removes the first item node from the first frequency node and also the LfuItem object from the dictionary. If, after this operation, the frequency node's item list is empty, the frequency node is removed.

def delete_lfu(self):
    """Remove the first item node from the first frequency node and
    remove the item from the dictionary."""
    if not self.head:
        raise NotFoundException('No frequency nodes found')
    freq_node = self.head
    item_node = freq_node.head
    del self.items[item_node.data]
    freq_node.remove_item_node(item_node)
    if freq_node.count == 0:
        self.remove_freq_node(freq_node)

If we call delete_lfu on our cache, the item node with data equal to Key 1 is removed and its LFUItem too. We end up with:

LFU delete method.
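To wrap up, here is a minimal usage sketch. It assumes simple DuplicateException and NotFoundException classes derived from Exception, which are not shown in this post:

cache = Cache()
cache.insert('key 1', 'value 1')  # both items start under frequency node 1
cache.insert('key 2', 'value 2')
cache.access('key 1')             # 'key 1' moves under frequency node 2
cache.access('key 2')             # 'key 2' follows; frequency node 1 is removed
cache.delete_lfu()                # evicts 'key 1', the first item under the lowest frequency node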

Github repo for the complete implementation.

Cambridge city geospatial statistics

February 17, 2014

Using the Cambridge (Massachusetts) GIS data, we can compute some interesting geospatial statistics.

Subway stations

Using the address points (20844 points) and the subway stations data files, we can find out how many Cambridge addresses are in a certain radius of a Cambridge subway station. More particularly, we want to find the percentage of Cambridge addresses less than 1/8, 1/4 and 1/2 mile from a Cambridge subway station.

  • Less than 1/8 of a mile: 3%
  • Less than 1/4 of a mile: 14%
  • Less than 1/2 of a mile: 57%

Note: You can have many people living at the same address point.

Note: This does not take into consideration subway stations outside Cambridge.

Note: Distance is a straight line between two points.

Bus shelters

We can perform the same calculation as above but with bus shelters (55 in Cambridge).

  • Less than 1/8 of a mile: 26%
  • Less than 1/4 of a mile: 68%
  • Less than 1/2 of a mile: 99%

Note: This does not take into consideration bus shelters outside Cambridge.

Sports areas

We can find the percentage of Cambridge addresses less than 1/8, 1/4 and 1/2 mile from a sports area. In addition, we can do the same calculation for a specific sports area type:

Any sport area (265 in Cambridge)

  • Less than 1/8 of a mile: 66.8%
  • Less than 1/4 of a mile: 95.8%
  • Less than 1/2 of a mile: 99.9%

Basketball sport area (34 in Cambridge)

  • Less than 1/8 of a mile: 31.3%
  • Less than 1/4 of a mile: 75.3%
  • Less than 1/2 of a mile: 99.3%

Tennis sport area (18 in Cambridge)

  • Less than 1/8 of a mile: 7%
  • Less than 1/4 of a mile: 31%
  • Less than 1/2 of a mile: 80%

Park sport area (1 in Cambridge)

  • Less than 1/8 of a mile: 0.02%
  • Less than 1/4 of a mile: 1.48%
  • Less than 1/2 of a mile: 10.17%

Field sport area (24 in Cambridge)

  • Less than 1/8 of a mile: 10%
  • Less than 1/4 of a mile: 38%
  • Less than 1/2 of a mile: 89%

Baseball sport area (23 in Cambridge)

  • Less than 1/8 of a mile: 14%
  • Less than 1/4 of a mile: 48%
  • Less than 1/2 of a mile: 92%

Playground sport area (138)

  • Less than 1/8 of a mile: 58%
  • Less than 1/4 of a mile: 93%
  • Less than 1/2 of a mile: 100%

Note: This does not take into consideration sports areas outside Cambridge.


Neighborhoods

One data file lists the different neighborhoods, 13 in total:

  • Area Four
  • Neighborhood Nine
  • Area 2/MIT
  • Cambridgeport
  • Riverside
  • Mid-Cambridge
  • Wellington-Harrington
  • East Cambridge
  • Agassiz
  • Cambridge Highlands
  • Strawberry Hill
  • West Cambridge
  • North Cambridge

Cambridge neighborhoods

Single-Family assessed value

Using two data files, we can find out the average assessed value for a single-family house per neighborhood. Here are the results sorted from most expensive to least expensive:

  • West Cambridge – $1,507,544
  • Neighborhood Nine – $1,451,155
  • Agassiz – $1,075,459
  • Mid-Cambridge – $786,673
  • Riverside – $712,027
  • Strawberry Hill – $707,192
  • Area 2/MIT – $599,264
  • Cambridgeport – $592,257
  • North Cambridge – $534,844
  • Cambridge Highlands – $507,051
  • Area Four – $505,535
  • Wellington-Harrington – $456,673
  • East Cambridge – $407,907

Cambridge Single-Family assessed value

Single-Family assessed value per sq. ft.

We can also perform a similar calculation per sq. ft. Here are the results sorted from most expensive to least expensive:

  • West Cambridge – $318
  • Neighborhood Nine – $318
  • Agassiz – $287
  • Mid-Cambridge – $261
  • Riverside – $253
  • Area 2/MIT – $225
  • Cambridgeport – $223
  • Cambridge Highlands – $222
  • Wellington-Harrington – $208
  • Strawberry Hill – $207
  • North Cambridge – $204
  • East Cambridge – $203
  • Area Four – $200

Condo assessed value

  • Agassiz – $603,000
  • West Cambridge – $591,502
  • Neighborhood Nine – $581,251
  • Mid-Cambridge – $483,449
  • Riverside – $470,751
  • Cambridge Highlands – $449,725
  • Area 2/MIT – $431,406
  • Cambridgeport – $422,049
  • North Cambridge – $410,023
  • Strawberry Hill – $387,065
  • Area Four – $384,520
  • Wellington-Harrington – $366,460
  • East Cambridge – $357,547

Condo assessed value per sq. ft.

  • West Cambridge – $395
  • Neighborhood Nine – $387
  • Agassiz – $382
  • Mid-Cambridge – $376
  • Riverside – $371
  • Cambridge Highlands – $338
  • Area 2/MIT – $334
  • Cambridgeport – $334
  • Wellington-Harrington – $335
  • Area Four – $335
  • East Cambridge – $335
  • North Cambridge – $330
  • Strawberry Hill – $313


Vegetation

There is a data file listing areas with a concentration of vegetation. Here is their definition: “Vegetation is a wooded area with a group or high density of trees.”

Each of those is a polygon. We can find which neighborhoods contain which vegetation polygons and calculate the total area of vegetation per neighborhood. Which neighborhood has the most vegetation? It is “Cambridge Highlands” thanks to the Fresh Pond vegetation. “Cambridge Highlands” contains 32% of the total vegetation in Cambridge followed by “North Cambridge” with 26%. Here is the full ranking:

  • Cambridge Highlands – 31.7%
  • North Cambridge – 25.7%
  • West Cambridge – 22.1%
  • Strawberry Hill – 7.5%
  • Neighborhood Nine – 4.1%
  • Agassiz – 3.1%
  • Area 2/MIT – 2.3%
  • East Cambridge – 1.6%
  • Cambridgeport – 1.5%
  • Riverside – 0.2%
  • Mid-Cambridge – 0.1%
  • Wellington-Harrington – 0%
  • Area Four – 0%

Fire Stations

There are eight fire stations in Cambridge:

  • Fire Headquarters
  • Fire Company 2
  • Fire Company 3
  • Fire Company 4
  • Fire Company 5
  • Fire Company 6
  • Fire Company 8
  • Fire Company 9

Going back to each of our neighborhoods, we calculate the geometric center of each one and find the closest fire station (a sketch of this computation follows the list below). Here are the results ranked by distance to the closest fire station.

  • Neighborhood Nine center is 164 meters from Fire Company 8.
  • East Cambridge center is 328 meters from Fire Company 3.
  • Area Four center is 407 meters from Fire Company 2.
  • Cambridgeport center is 492 meters from Fire Company 6.
  • West Cambridge center is 534 meters from Fire Company 9.
  • Riverside center is 559 meters from Fire Company 6.
  • Agassiz center is 602 meters from Fire Headquarters.
  • Mid-Cambridge center is 613 meters from Fire Headquarters.
  • Wellington-Harrington center is 694 meters from Fire Company 5.
  • Area 2/MIT center is 910 meters from Fire Company 6.
  • Strawberry Hill center is 987 meters from Fire Company 9.
  • North Cambridge center is 1271 meters from Fire Company 4.
  • Cambridge Highlands center is 1340 meters from Fire Company 9.

Note: 1 meter ~= 1.1 yard.
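Here is a rough sketch of that computation using the SnowFloat client calls shown elsewhere in this post. The layer UUIDs and the field names ('name', 'site_name') are placeholders and may differ from the actual data files:

import snowfloat.client

client = snowfloat.client.Client()
neighborhoods_uuid = 'neighborhoods layer uuid'
fire_stations_uuid = 'fire stations layer uuid'

# geometric center of each neighborhood
neighborhoods = client.get_features(neighborhoods_uuid,
    spatial_operation='centroid')
for neighborhood in neighborhoods:
    # distance from that center to each fire station, sorted ascending,
    # keeping only the closest one
    stations = client.get_features(fire_stations_uuid,
        spatial_operation='distance',
        spatial_geometry=neighborhood.spatial,
        order_by=('spatial',),
        query_slice=(0, 1))
    station = stations[0]
    print '%s center is %d meters from %s' % (
        neighborhood.fields['name'], station.spatial,
        station.fields['site_name'])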

Emergency Shelters

We can do the same with emergency shelters. There are 15 in Cambridge. Here are the results ranked by distance to the closest shelter.

  • Wellington-Harrington center is 68 meters from Frisoli Youth Center.
  • Strawberry Hill center is 111 meters from Russell / West Cambridge Youth Ctr.
  • Area Four center is 138 meters from Fletcher / Maynard Academy.
  • Agassiz center is 212 meters from Baldwin School.
  • Mid-Cambridge center is 223 meters from Cambridge Rindge and Latin.
  • Neighborhood Nine center is 469 meters from Graham & Parks School.
  • Cambridgeport center is 567 meters from Morse School.
  • East Cambridge center is 634 meters from Kennedy / Longfellow School.
  • North Cambridge center is 715 meters from Peabody School.
  • Area 2/MIT center is 782 meters from CASPAR Shelter.
  • Riverside center is 838 meters from Cambridge Rindge and Latin.
  • West Cambridge center is 917 meters from Tobin School.
  • Cambridge Highlands center is 1038 meters from Tobin School.


Street intersections

One of the data files contains the street/road intersections. An interesting attribute is the intersecting street count. Here is the breakdown based on this attribute:

  • 2 intersecting streets: 1392
  • 3 intersecting streets: 109
  • 4 intersecting streets: 2

The two “4 intersecting streets” intersections are the “Coolidge Hill Rd & Fresh Pond Pkwy & Gerrys Landing Rd & Mt Auburn St” intersection and the “Magazine St & Massachusetts Ave & Prospect St & Western Ave” intersection. After all, Cambridge is notorious for those types of intersections.

Note: There are 357 intersections without an intersecting street count, so we dismissed those.

Metered parking spaces

Based on one of the data files, Cambridge has 3094 metered parking spaces. A different data file lists street segments. Using both, we can find out how much of the street network is covered by metered parking: 13% of street segments in Cambridge have metered parking spaces.


Python and the SnowFloat API are used here to compute those statistics. Here is how the subway stations 1/8 mile proximity statistics were calculated:

import snowfloat.client

client = snowfloat.client.Client()

address_points_uuid = 'ecf03fb0e1f34773b53959a2c603e214'
subway_stations_uuid = '32d771c0bc0641eb9fc5c2f04305ecb9'

points = []
subway_stations = client.get_features(subway_stations_uuid)
for subway_station in subway_stations:
    address_points = client.get_features(address_points_uuid,
        query='distance_lte', geometry=subway_station.geometry,
        distance=201)  # 1/8 mile expressed in meters
    for address_point in address_points:
        # count each address point only once, even if it is close to
        # more than one station
        for point in points:
            if address_point.uuid == point.uuid:
                break
        else:
            points.append(address_point)
print len(points) / 20844.

I can publish more of the code used if needed.

More data to analyze

Here is the full list of data files the city of Cambridge made public in GeoJSON format. Also listed are the fields for each data file. Let me know which other statistics you would like to see added to this post.


  • address_id
  • ml
  • stname
  • stnm
  • bldgid
  • full_addr
  • entry
  • type
  • editdate


  • unq_id2
  • unq_id
  • nhood
  • globalid


  • dist_name
  • globalid
  • district


  • dist_name
  • district


  • globalid



  • bsize_inch_to_foot
  • dsize_map_scale
  • globalid
  • bsize_map_scale
  • pcix_no
  • rotation
  • dsize_inch_to_foot


  • bsize_inch_to_foot
  • dsize_map_scale
  • bsize_map_scale
  • pcix_no
  • rotation
  • dsize_inch_to_foot


  • map
  • globalid
  • ml
  • fcode
  • lot
  • uyear
  • editdate
  • editor
  • approved


  • map
  • uyear
  • loc_id
  • lot
  • ml
  • editdate
  • editor
  • poly_type
  • source
  • plan_id


  • globalid
  • district


  • globalid



  • type


  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl
  • bldgid
  • editdate


  • type
  • name


  • type


  • elev_sl
  • base_elev
  • top_gl
  • elev_gl
  • top_sl


  • type


  • type


  • type


  • type


  • type


  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl


  • type


  • type


  • elev_sl
  • base_elev
  • top_gl
  • elev_gl
  • top_sl


  • type


  • type


  • type


  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl


  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl


  • type
  • editdate


  • top_gl
  • stair_type
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl


  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl


  • type


  • type


  • webpage
  • n_hood
  • name
  • globalid


  • globalid


  • globalid
  • zip_code


  • zone_type
  • zcode
  • globalid
  • pud_type_new
  • pud_type


  • name
  • overlap
  • globalid
  • type
  • label
  • ordnum
  • editdate


  • blkgrpid
  • globalid


  • blkgrps_id
  • globalid
  • tract
  • group_
  • fipsstco
  • stfid


  • awater10
  • tractce10
  • aland10
  • blkgrpce10
  • funcstat10
  • namelsad10
  • countyfp10
  • statefp10
  • mtfcc10
  • geoid10
  • intptlat10
  • intptlon10


  • pop100
  • tractblk
  • globalid


  • pop_lab
  • fipsstco
  • block_id
  • globalid
  • stfid


  • name10
  • blockce10
  • tractce10
  • complete
  • awater10
  • aland10
  • statefp10
  • funcstat10
  • countyfp10
  • intptlat10
  • mtfcc10
  • geoid10
  • intptlon10
  • followup
  • comments


  • globalid
  • tract


  • fipsstco
  • stfid
  • tractid
  • globalid
  • trt2000


  • name10
  • awater10
  • tractce10
  • aland10
  • statefp10
  • funcstat10
  • namelsad10
  • countyfp10
  • intptlat10
  • mtfcc10
  • geoid10
  • intptlon10


  • globalid
  • park_distr
  • location
  • owner
  • id
  • man
  • type


  • ramp_line_
  • f_puddle
  • f_total
  • f_slope
  • globalid
  • f_cond
  • f_total06
  • sw_width
  • f_align
  • lip
  • location
  • slope_comp
  • total_ramp
  • puddling
  • material_t
  • opening_wi
  • condition
  • f_open
  • date
  • contract


  • diameter
  • trunks
  • creator
  • growspac
  • location
  • streetnumber
  • species
  • inspectr
  • otherriskfactors
  • riskrating
  • globalid
  • sitetype
  • sidewalkobstruction
  • treeid
  • probabilityoffailure
  • probabilityoftargetimpact
  • overheadwires
  • sizeodefectiveparts
  • memtree
  • created
  • modified
  • def_cult
  • streetname
  • notes
  • ownership
  • formersitetype
  • removaldate
  • treecondition
  • cultivar
  • treegrate
  • plantdate
  • sidewalkwidth
  • treegrateactionreq
  • adacompliant
  • posd
  • dw
  • rt
  • speciesshort
  • d
  • lean
  • structuralissues
  • cr


  • editdate
  • congressman
  • name
  • district
  • party


  • w_p
  • location_note
  • location
  • globalid
  • address
  • editdate


  • editdate
  • party
  • rep
  • district


  • senate
  • editdate
  • senator
  • party


  • ward
  • totalpop
  • wardprecinct
  • precinct


  • phone
  • site_name
  • globalid
  • address
  • editdate


  • phone
  • site_name
  • globalid
  • address


  • name
  • globalid


  • dollaramnt
  • globalid
  • project
  • printedmapid
  • granttype
  • buildingid
  • fiscalyear
  • grantid


  • name
  • neighborhood
  • marker_type
  • globalid
  • address
  • description


  • globalid
  • ml
  • property_t
  • historic_n
  • lot
  • address
  • block
  • editdate


  • globalid
  • nrdis
  • location
  • labelname
  • shape_leng
  • histname
  • street
  • aka
  • st_num
  • nrind
  • pres_restr
  • nhl
  • doe


  • name
  • globalid


  • sfha_tf
  • fld_ar_id
  • percentage
  • source_cit
  • fld_zone
  • floodway


  • type
  • name


  • complants1
  • globalid
  • wtdescript
  • datasource
  • wtsystem
  • namelocati
  • wtsubsyste
  • wtlnd_id
  • complants2



  • type
  • name


  • globalid
  • subtype
  • contractauthority
  • lasteditdate
  • enabled
  • ancillaryrole
  • hydrant_id
  • lasteditor
  • gate
  • rotation
  • lifecyclestatus
  • hydrant_gpm
  • installdate
  • lastsource




  • elev


  • point_subt
  • npole_tag
  • symbol
  • lmp_type
  • oldnode_ta
  • ver_end
  • oldpoint_t
  • globalid
  • str_des
  • equip_locn
  • phase__id
  • object__id
  • no_lamp
  • ver_id
  • lock__id
  • str_name
  • group_id
  • pole_type_
  • intersecti
  • oid_
  • lumens
  • feature_st
  • ntag
  • point_tag
  • point_type
  • bracket_cd
  • lmp_class_
  • cntl_ty
  • err_desc
  • err_flag
  • lamp_id


  • type
  • site_name
  • globalid
  • address


  • elev


  • site_name
  • globalid
  • shel_cap
  • phone
  • publicpriv
  • address
  • owner
  • shelter


  • owner
  • publicpriv
  • site_name
  • globalid
  • address


  • source
  • type
  • name
  • globalid
  • pole_id
  • editdate


  • type
  • site_name
  • globalid
  • address
  • editdate


  • site_name
  • globalid
  • address


  • owner
  • phone
  • site_name
  • globalid
  • address


  • phone
  • site_name
  • globalid
  • address


  • site_name
  • globalid
  • shel_cap
  • shelter
  • phone
  • address
  • shel_food


  • phone
  • site_name
  • globalid
  • address


  • capacity
  • name
  • globalid
  • shel_cap
  • shelter
  • concateid
  • zipcode
  • phone
  • med_fac
  • address
  • shel_food
  • type
  • editdate


  • phone
  • site_name
  • globalid
  • address


  • phone
  • site_name
  • globalid
  • address


  • bike_fac
  • globalid
  • st_type
  • pbike_fac
  • length
  • street
  • comments
  • editdate


  • mapid
  • park
  • location
  • globalid
  • editdate


  • globalid
  • fcode
  • owner
  • oscode
  • rcode
  • join_id
  • name
  • editdate


  • location
  • globalid


  • owner
  • name


  • type


  • status
  • join_id
  • park
  • location
  • size


  • globalid
  • ml
  • descriptio
  • accessvia
  • totalsp
  • commspac
  • address
  • id
  • structure
  • owntype


  • space_id
  • src_date
  • globalid
  • editdate
  • editor


  • sitename
  • ml
  • type
  • globalid
  • address


  • type


  • elev_sl
  • direction
  • oid_
  • globalid
  • top_sl
  • top_gl
  • style
  • xls_no
  • x_street
  • street
  • elev_gl
  • address
  • base_elev
  • stop_name
  • stop_id
  • quantity
  • editdate


  • potential_r_to
  • street_name
  • tonode
  • label
  • street_id
  • zip_right
  • direction
  • globalid
  • fromnode
  • roadways
  • potential_l_from
  • zip_left
  • l_to
  • r_from
  • street_type
  • l_from
  • id
  • majorroad
  • restriction
  • r_to
  • street
  • potential_r_from
  • potential_l_to
  • alias
  • editdate


  • nodenumber
  • globalid
  • p_x
  • p_y
  • intersection
  • intersectingstreetcount
  • editdate


  • majorroad
  • street
  • st_name
  • globalid
  • st_type


  • type


  • type


  • elev_sl
  • base_elev
  • top_gl
  • elev_gl
  • top_sl


  • grade
  • source
  • line
  • globalid


  • line
  • station
  • globalid

API to access the Cambridge city geospatial data

December 28, 2013

The city of Cambridge, MA made a large set of geospatial data publicly available.

I uploaded the layers to the geospatial web service I have running on EC2. You can now access the Cambridge geospatial data easily using the Python client or the raw API. Here is a tutorial.

REST service + Python client to access geographic data

November 12, 2013

This post shows how we can query geographic data using a simple Python client pointing to a GIS web service: SnowFloat.

The data file here is the Vermont towns geographic data file. Its name is “Boundary_BNDHASH_region_towns”. We first upload the data file to the web service. The web service here is a layer on top of GeoDjango on top of PostGIS.
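The upload step is not shown here; a minimal sketch, assuming the same import call used in the Massachusetts Census post further down and a zipped data file name, would look like this:

>>> import snowfloat.client
>>> client = snowfloat.client.Client()
>>> res = client.import_geospatial_data('Boundary_BNDHASH_region_towns.zip')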


Let’s retrieve the layer object:

>>> layer = client.get_layers(name='Boundary_BNDHASH_region_towns')[0]

Number of features:

>>> layer.num_features

Layer’s fields:

>>> layer.fields
[{u'name': u'gdb_vcgi_v', u'type': u'real'},
 {u'name': u'perimeter', u'type': u'real'},
 {u'name': u'towns_', u'type': u'integer'},
 {u'name': u'towns_id', u'type': u'integer'},
 {u'name': u'fips6', u'type': u'integer'},
 {u'name': u'townname', u'size': 18, u'type': u'string'},
 {u'name': u'townnamemc', u'size': 18, u'type': u'string'},
 {u'name': u'cnty', u'type': u'integer'},
 {u'name': u'shape_area', u'type': u'real'},
 {u'name': u'shape_len', u'type': u'real'}]

Layer’s extent:

>>> layer.extent
[424788.84, 25211.8370313, 581562.0, 279799.0]


Towns starting with the letter ‘B':

>>> features = layer.get_features(field_townnamemc_startswith='B')
>>> len(features)
>>> features[0].fields['townnamemc']

Towns less than 10 kms from the town of Burlington:

>>> burlington = layer.get_features(field_townnamemc_exact='Burlington')[0]
>>> features = layer.get_features(query='distance_lte',
...                               geometry=burlington.geometry,
...                               distance=10000)
>>> len(features)
>>> [feature.fields['townnamemc'] for feature in features]
 u'South Hero',
 u'St. George',
 u'South Burlington',

Ten furthest towns + their distance to Burlington:

>>> features = layer.get_features(spatial_operation='distance',
...                               spatial_geometry=burlington.geometry,
...                               order_by=('-spatial',),
...                               query_slice=(0, 10))
>>> ['%s - %d kms' % (feature.fields['townnamemc'], feature.spatial / 1000) for feature in features]
[u'Vernon - 187 kms',
 u'Guilford - 184 kms',
 u'Halifax - 182 kms',
 u'Whitingham - 180 kms',
 u'Stamford - 179 kms',
 u'Pownal - 178 kms',
 u'Brattleboro - 177 kms',
 u'Readsboro - 177 kms',
 u'Marlboro - 172 kms',
 u'Dummerston - 170 kms']

Towns inside the Robert T. Stafford White Rocks area:

>>> features = layer.get_features(query='intersects', geometry=forest_feature.geometry)
>>> [feature.fields['townnamemc'] for feature in features]
[u'Wallingford', u'Mount Holly', u'Mount Tabor', u'Weston', u'Dorset', u'Peru']

As you can see, the goal here is to build a light tool for easy access to geographic data. As I am continuing to work on the web service back-end, I would really appreciate getting feedback from people working in the GIS world.

Massachusetts Census 2010 Towns maps and statistics using Python

September 10, 2013

Using the Massachusetts Census 2010 Towns data source, we generated the following maps and statistics. The software used relies on technologies such as Python and PostGIS.

Note: The data source has been generated and published by MassGIS.

Population change

Population change 2000-2010

We can see that some towns in Cape Cod and Western Massachusetts are not growing as fast as the rest of the state (a few are even losing population). What is driving the increase just west of 495? Is it the expensive real estate closer to Boston?

Population per town

Population in 2010

No surprise that Western Massachusetts has less populated towns than the Greater Boston area.

Most populated towns:

  1. Boston – 617594
  2. Worcester – 181045
  3. Springfield – 153060
  4. Lowell – 106519
  5. Cambridge – 105162

Least populated towns:

  1. Gosnold – 75
  2. Monroe – 121
  3. Mount Washington – 167
  4. New Ashford – 228
  5. Aquinnah – 311

Residents per house unit in 2010

Residents per house unit 2010

Could the low number in Cape Cod be explained by the higher number of secondary residences?

Population density in 2010

Population density in 2010

We are looking at the number of inhabitants per acre.

Towns area

Largest towns in acres:

  1. Plymouth – 65726
  2. Middleborough – 46179
  3. Petersham – 43689
  4. Barnstable – 40001
  5. Dartmouth – 39513

Smallest towns in acres:

  1. Nahant – 671
  2. Winthrop – 1295
  3. Chelsea – 1416
  4. Hull – 1804
  5. Swampscott – 1945

Furthest towns from Boston

  1. Mount Washington – 193 kms
  2. Alford – 190 kms
  3. Egremont – 190 kms
  4. West Stockbridge – 185 kms
  5. Sheffield – 182 kms

There are many more distance and spatial computations we could make. Which one do you think would be interesting?


We have been using the SnowFloat Python client and its web service running on Amazon EC2.

First step is to import the data source:

>>> import snowfloat.client
>>> client = snowfloat.client.Client()
>>> data_source = 'CENSUS2010TOWNS_POLY.zip'
>>> res = client.import_geospatial_data(data_source)
# res is a dictionary with the ID of the layer created and the number of features (districts) added.
>>> res
{u'layer_uuids': [u'b2b2660e43ef4196b999840984480338'], u'num_features': 351}
>>> layer_uuid = res['layer_uuids'][0]

Population per town

We define a map task and send it to the server.

>>> task = snowfloat.task.Task(
...     operation='map',
...     task_filter={'layer_uuid_exact': layer_uuid},
...     extras={'size': [800, 600],
...             'title': 'Population 2010',
...             'colors': {'type': 'gradient'},
...             'operation': {'type': 'nop',
...                           'value_1': 'field_pop2010',
...                           'normalize': 'log'}})

>>> res = client.execute_tasks([task,])
# res contains the URL of the map stored online.
>>> res
[[{u'url': u'http://snowfloat.results.s3.amazonaws.com/95391554fb6b4ba6bb1cffaf196f83ae?Signature=BQeDD2Ued1N2JolDwedfzjLAuoA%3D&Expires=1378834201&AWSAccessKeyId=AKIAI4JN4Z5ER2EY3A2Q'}]]

Furthest towns from Boston

First step is to calculate the centroid for the Boston town geometry:

>>> feature = client.get_features(layer_uuid=layer_uuid, field_town2_exact='Boston', spatial_operation='centroid')[0]
>>> point = feature.spatial
>>> point
Point(coordinates=[233705.31415324158, 896147.8312876453])

Now, we can retrieve the five furthest towns from this point.

>>> features = client.get_features(layer_uuid=layer_uuid, spatial_operation='distance', spatial_geometry=point, order_by=('-spatial',), query_slice=(0,5))
>>> ['%s - %d kms' % (feature.fields['town2'], feature.spatial / 1000) for feature in features]
[u'Mount Washington - 193 kms',
 u'Alford - 190 kms',
 u'Egremont - 190 kms',
 u'West Stockbridge - 185 kms',
 u'Sheffield - 182 kms']

You can learn more about the SnowFloat service by reading its tutorial.

Python, Twitter statistics and the 2012 French presidential election

August 29, 2012

This post describes how Pytolab was designed to process Tweets related to the 2012 French presidential election, in real-time. This post also goes over some of the statistics computed over a period of 9 months.

Note: I presented this project at EuroSciPy 2012: abstract.



Architecture

The posts are received from the Twitter streaming API and sent to a messaging exchange. The posts are read from the messaging queue and processed by the computing unit. The most frequently accessed data is stored in an in-memory DB (Redis) and long-term data is stored in MySQL. See the diagram below.

Twitter statistics

Tweets receiver

The Twitter streaming API filter feature is used here to receive the tweets we are interested in: Tweets referring to at least one of the candidates. The helper library Tweepy facilitates that task.

The first thing we do is set up a stream listener: we get a listener instance, set the callback to be called when a new post arrives, and finally get a stream instance by passing our listener instance to it. We will see next how those different objects are defined.

def setup_stream_listener(self):
    """Setup Twitter API streaming listener."""
    listener = Listener()
    # the callback is the messaging queue producer publish method
    # (see the messaging queue section below)
    listener.set_callback(self.producer.publish)
    self.stream = tweepy.Stream(
        self.config.get('twitter', 'userid'),
        self.config.get('twitter', 'password'),
        listener)

Note: We use ConfigParser for the configuration file management.
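For reference, here is a minimal sketch of that configuration handling; the file name and section layout are assumptions based on the config.get calls above:

import ConfigParser

config = ConfigParser.SafeConfigParser()
config.read('pytolab.cfg')  # hypothetical configuration file name
# expects a section like:
# [twitter]
# userid = ...
# password = ...
userid = config.get('twitter', 'userid')
password = config.get('twitter', 'password')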

The Listener class is derived from the tweepy.StreamListener class. We overwrite some of the methods to indicate what to do when a new post arrives or when an error is detected.

class Listener(tweepy.StreamListener):
    def on_status(self, status):
        # Do things with the post received. Post is the status object.
        pass

    def on_error(self, status_code):
        # If an error is thrown during streaming.
        pass

    def on_timeout(self):
        # If no post is received for too long.
        pass

    def on_limit(self, track):
        # If too many posts match our filter criteria and only a subset is
        # sent to us.
        pass

    def on_delete(self, status_id, user_id):
        # When a delete notice arrives for a post.
        pass

    def set_callback(self, callback):
        # Callback to call when a new post arrives.
        self.callback = callback

We need to add a few lines of code to the on_status method. We parse what we are interested in and publish the data to our messaging queue. We filter out the posts written by an author whose language is not French. The callback is our messaging queue producer publish method.

def on_status(self, status):
    if status.author.lang == 'fr':
        message = {'author_name': status.author.screen_name,
                   'author_id': status.author.id,
                   'id': status.id,
                   'text': status.text,
                   'retweeted': status.retweeted,
                   'coordinates': status.coordinates,
                   'time': int(time.time())}
        self.callback(json.dumps(message), 'posts')

We will see later how the messaging queue producer and consumer are built.

There is one more thing we need to do: set up our streaming filter so we start receiving the posts from Twitter we are interested in. We have the presidential candidates in the list self.persons. We build a list of names and start listening for them. The call to stream.filter is blocking and the on_status method of the listener class is called each time a new post arrives.

Keep in mind that the streaming filter returns at most 1% of all posts processed by Twitter. This means that if the posts referring to our candidates represent more than 1% of all posts on Twitter at instant t, then the number of posts will be capped at 1%. We encountered this case only twice: during the first round results and during the second round results. We lost less than 10% of the posts when that situation happened. How do you make sure this does not happen? You will have to subscribe to the complete stream which is provided by some Twitter partners like DataSift and Gnip. Those solutions are not cheap.

Note that we are catching all exceptions. There is no guarantee that you will get continuous streaming with no errors so catching all exceptions is important here.

def stream_filter(self):
    track_list = [data.normalize(p['name']) for p in self.persons]
    while True:
        try:
            self.stream.filter(track=track_list)
        except Exception:
            logging.exception('stream filter')

Some examples of errors I saw in the past:

File "/usr/local/lib/python2.6/dist-packages/tweepy-1.7.1-py2.6.egg/tweepy/streaming.py", line 148, in _read_loop
    c = resp.read(1)
File "/usr/lib/python2.6/httplib.py", line 518, in read
    return self._read_chunked(amt)
File "/usr/lib/python2.6/httplib.py", line 561, in _read_chunked
    raise IncompleteRead(''.join(value))
IncompleteRead: IncompleteRead(0 bytes read)
File "/usr/local/lib/python2.6/dist-packages/tweepy-1.7.1-py2.6.egg/tweepy/streaming.py", line 148, in _read_loop
    c = resp.read(1)
File "/usr/lib/python2.6/ssl.py", line 96, in <lambda>
    self.recv = lambda buflen=1024, flags=0: SSLSocket.recv(self, buflen, flags)
File "/usr/lib/python2.6/ssl.py", line 222, in recv
    raise x
SSLError: The read operation timed out
File "/usr/local/lib/python2.6/dist-packages/tweepy-1.7.1-py2.6.egg/tweepy/streaming.py", line 148, in _read_loop
    c = resp.read(1)
File "/usr/lib/python2.6/ssl.py", line 136, in read
    return self._sslobj.read(len)
error: [Errno 104] Connection reset by peer

Let’s take a look at what happens when the method stream.filter is called. An HTTPS POST request is made using the following URL: https://stream.twitter.com/1/statuses/filter.json?delimited=length and the following body: track=candidate1, candidate2,… The stream of data is then read in a loop until there is an error.

The data arrives in the following format: “length\npost data\nlength\npost data\n…”. The length comes before the post data because we ask for it with the URL parameter ‘delimited=length’.
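To make that format concrete, here is a rough sketch of reading one length-delimited item; Tweepy does this internally, and the resp object (the HTTP response stream) is an assumption:

def read_item(resp):
    # read the length digits one byte at a time until the newline
    length = ''
    while True:
        c = resp.read(1)
        if c == '\n':
            break
        length += c
    # then read exactly that many bytes of post data
    return resp.read(int(length.strip()))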

Here is an example of a post content (truncated):

{
    "in_reply_to_user_id": null,
    "in_reply_to_status_id": null,
    "text": "bla bla bla",
    "favorited": false,
    ...
}

A more complete example: https://gist.github.com/900964.

The Tweepy library formats that data as a status object and passes it to the on_status method of the listener object.

See the full Tweets receiver module.

Messaging queue

We are using RabbitMQ for our messaging system plus the Python helper library py-amqplib. An exchange is created to receive the posts and a consumer reads the messages from a queue. Those messages are processed by the computing unit. The advantage of using a messaging queue is that we can handle surges of posts.

First is the producer. We create a connection to the messaging server and get a channel on that connection. This channel is used to publish messages to the exchange.

class Producer(object):
    def __init__(self, exchange_name, host, userid, password):
        self.exchange_name = exchange_name
        self.connection = amqp.Connection(
            host=host, userid=userid, password=password, virtual_host="/",
            insist=False)
        self.channel = self.connection.channel()

Our publisher class has a publish method to send a message to the exchange. Messages marked as ‘persistent’ that are delivered to ‘durable’ queues will be logged to disk. We use the routing key ‘posts’ which will also be used when we create the queue to route the messages properly.

def publish(self, message, routing_key):
    msg = amqp.Message(message)
    msg.properties["content_type"] = "text/plain"
    msg.properties["delivery_mode"] = 2
    self.channel.basic_publish(msg, exchange=self.exchange_name,
        routing_key=routing_key)

Next is the consumer. We also get a connection to the messaging server and get a channel on that connection.

class Consumer(object):
    def __init__(self, host, userid, password):
        self.connection = amqp.Connection(host=host, userid=userid, password=password, virtual_host="/", insist=False)
        self.channel = self.connection.channel()

We also have a method that creates the queue and one that registers the method to be called each time there is a message to consume in the queue; a rough sketch of both follows.
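Here is what those two methods can look like with py-amqplib; the method and parameter names here are illustrative, not the project's exact code:

def create_queue(self, queue_name, exchange_name, routing_key):
    # durable queue + persistent messages: posts survive a broker restart
    self.channel.queue_declare(queue=queue_name, durable=True,
        exclusive=False, auto_delete=False)
    self.channel.queue_bind(queue=queue_name, exchange=exchange_name,
        routing_key=routing_key)

def wait(self, queue_name, callback):
    # callback is called with an amqp.Message for each message consumed
    self.channel.basic_consume(queue=queue_name, callback=callback)
    while True:
        self.channel.wait()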

See the full Messaging queue module.

DB interface

Before we go over the computing unit, let’s look at the DB interface we created to interface with the in-memory DB Redis and MySQL.

Regarding Redis, our interface is built on top of the helper library redis-py. It adds retries around DB commands.

We use the following Redis commands (complexity of the command is shown next to it):

  • GET key – O(1)
  • SET key – O(1)
  • DELETE key – O(1)
  • EXISTS key – O(1)
  • INCR key – O(1)
  • RPUSH key value – O(1)
  • LSET key index value – O(N)
  • LINDEX key index – O(N)
  • LRANGE key start stop – O(S+N)

The key used to store a post is ‘post:<id>’. We dump the JSON post data as the key’s value. For ease of access, we also have a Redis list per person and per hour with the following key: ‘person:<id>:posts:<hour>’. This list contains the ids of the posts referring to this person during that hour.
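A minimal redis-py sketch of that layout (the key names follow the description above; the person id and hour values are made up, and the project wrapper adds retries on top of calls like these):

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

post = {'id': 123, 'text': 'bla bla bla'}
post_id = r.incr('nextPostId')
# store the JSON post data under post:<id>
r.set('post:%d' % post_id, json.dumps(post))
# append the post id to the person's hourly list: person:<id>:posts:<hour>
r.rpush('person:%d:posts:%d' % (42, 1335006000), post_id)
# read back all post ids for that person and hour
post_ids = r.lrange('person:%d:posts:%d' % (42, 1335006000), 0, -1)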

Regarding MySQL, our interface is built on top of the helper library MySQLdb.

Here is the method to execute a MySQL command. If the command throws an operational error or an internal error, we try to reconnect to the MySQL server. If it throws a SQL error, we retry multiple times before raising a DBError.

def mysql_command(self, cmd, sql, writer, *args):
    retry = 0
    while retry < self.cmd_retries:
        try:
            r = getattr(self.db_cursor, cmd)(sql, args)
            if writer:
                return r
            else:
                return self.db_cursor.fetchall()
        except (MySQLdb.OperationalError, MySQLdb.InternalError):
            self.log.error('MySQL cmd %s DB error', cmd)
            # reconnect to the MySQL server (not shown here)
            retry = 0
        except MySQLdb.Error:
            self.log.error('MySQL cmd %s sql %s failed', cmd, sql)
            retry += 1
            if retry <= self.cmd_retries:
                time.sleep(1)  # short pause before retrying
        except AttributeError:
            self.log.error('MySQL cmd %s does not exist', cmd)
            raise exceptions.DbError()
    raise exceptions.DbError()

We keep smaller and more recent data in Redis. MySQL is used for larger and long-term data.

We added a thin layer on top of the Redis and MySQL commands to make the dual DB setup transparent. When we request some data, it is read from Redis and/or MySQL based on its age or type.

Twitter statistics

See the full DB module.

Computing unit

We defined a method called when there is a message to read from the queue. When a post is received, we process it the following way:

  • Filter out posts marked as fr (French) language but containing common English words. In most cases, such a post is fully written in English and we need to bypass it.
  • For each person, check if this post is really about that person and not something unrelated.
  • Add post ID to the person’s hourly posts list.
  • Store post data in DB.
def process_post(self, post):
    """Process post received from the message queue."""
    # is this a post matching one or more persons?
    post_add = False
    # remove accents and lowercase everything
    text = data.normalize(post['text']).lower()
    # check post language
    if data.get_text_language(text) == 'fr':
        for person in self.persons:
            # get person's first name, last name and nickname
            names = data.get_names(person)
            # check if the post is really about that person
            if data.check_names(names, text, person['words']) == 1:
                # one more post for this person
                if not post_add:
                    post_add = True
                    # get next post id
                    post_id = self.db.incr('nextPostId')
                # add post to person's posts list for the current hour
                key = 'person:%d:posts:%d' % (person['id'],
                    self.stats_last_update)
                self.db.rpush(key, post_id)
        if post_add:
            # add post to db
            self.db.set('post:%d' % post_id, json.dumps(post))
            # add post id to current hour
            key = 'posts:%d' % (self.stats_last_update)
            self.db.rpush(key, post_id)
    else:
        logging.debug('found english word in %s', text)

Filtering out unrelated messages is key here. For example, “Je vais en Hollande demain” (I am going to Holland tomorrow) is not really about the presidential candidate “Hollande” but more about the country “Holland”; both are spelled the same way in French. We defined a list of words and rules per person to help filter out the unrelated posts. The sketch below illustrates the idea.
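The exact words and rules are project specific; a simplified sketch of the idea (the names, phrases and function below are illustrative, not the project's actual data module) could look like this:

# phrases that suggest the post is about something other than the candidate
exclude_phrases = {
    'hollande': ['je vais en hollande', 'vacances en hollande'],
}

def post_is_about(candidate, text):
    """Rough relevance check: the candidate name must appear and none
    of the exclusion phrases may appear."""
    if candidate not in text:
        return False
    for phrase in exclude_phrases.get(candidate, []):
        if phrase in text:
            return False
    return True

print post_is_about('hollande', 'je vais en hollande demain')   # False
print post_is_about('hollande', 'hollande en meeting a paris')  # True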

See the full compute module.

High availability

Each element above can be highly-available with the use of an extra server. We can add one more server receiving the tweets in case the active one fails over. We can detect this type of failure using an heartbeat between the active and the stand-by instance. RabbitMQ supports mirror queues. Redis and MySQL supports a master/slave architecture.

Twitter statistics


Performance

During peak traffic (the first round results for example), the bottleneck in our system was the Twitter streaming listener. The code reads the length of the post data byte by byte from the stream and then reads the post data using the length value. This is quite CPU intensive and we had to switch from a small instance (1 compute unit) on Amazon EC2 to a large one (4 compute units) to read the posts in real time during traffic peaks.

The messaging system we used (RabbitMQ) can handle way more than what we used it for so no issue on that side.

Here is some comparison between Redis and MySQL when it comes to storing data on a small EC2 instance.


– MySQL: insert into table for each value, final commit.
– Redis: SET command for each value. Redis persists changes to disk in the background.

Adding 10k posts:

– MySQL: 4.0 seconds.
– Redis: 2.6 seconds – 1.53x faster.

Adding 100k posts:

– MySQL: 42.0 seconds.
– Redis: 23.7 seconds – 1.77x faster.


Statistics

Over 8 million tweets (8442728) related to the different candidates were analyzed by Pytolab from Sep 1st 2011 to June 1st 2012. We analyzed the posts referring to at least one candidate, which is different from the posts posted by the candidates themselves.

Here are some key dates from the presidential campaign:

  • 1st round of the Socialist primaries: October 9th 2011
  • 2nd round of the Socialist primaries: October 16th 2011
  • 1st round of the presidential election: April 22nd 2012
  • 2nd round of the presidential election: May 6th 2012

The following chart represents the number of posts per day for each candidate. The key dates described above are shown in red.

Twitter statistics

Here is the list of candidates we tracked:

  • Nathalie Arthaud
  • Martine Aubry
  • Jean-Michel Baylet
  • François Bayrou
  • Christine Boutin
  • Nicolas Dupont Aignan
  • François Hollande
  • Nicolas Hulot
  • Eva Joly
  • Marine Le Pen
  • Jean-Luc Mélenchon
  • Arnaud Montebourg
  • Philippe Poutou
  • Ségolène Royal
  • Nicolas Sarkozy
  • Manuel Valls
  • Dominique Villepin

Here is the number of posts where each candidate’s name appears.

Twitter statistics

We noticed that Nicolas Sarkozy is referred to in 41% of all posts we analyzed, and François Hollande in 35% of all posts. There is no strong correlation between the number of posts per candidate and their polling results. The candidate with the most posts was the sitting president at the time, so those numbers are expected.

Posts count            Polls
Nicolas Sarkozy        François Hollande
François Hollande      Nicolas Sarkozy
François Bayrou        Marine Le Pen
Marine Le Pen          Jean-Luc Mélenchon
Jean-Luc Mélenchon     François Bayrou

We noticed something interesting: the number of posts matched the polls during the 11 hours preceding the first round results and during the 6 hours preceding the second round results.

Twitter statistics
Twitter statistics

Let’s look at the authors of the posts now. We counted 388628 different authors. 98.3% of the authors posted fewer than 200 posts during those 9 months, which is less than one post per day. 0.7% of the authors (2720) posted more than 500 posts each and produced 45% of all posts.

The top 10 authors in number of posts are:

  • sarkoactu: 26356
  • bayrouactu: 26345
  • Elysee_2012: 21076
  • sarkozy_info: 18868
  • FlashPresse: 16349
  • Scrutin2012: 16229
  • Eric_vds: 14667
  • democrates: 14528
  • akemoi: 14403
  • blabalade: 14119

Here is the distribution of posts per hour for all posts over our period of 9 months:

Twitter statistics

If we look at the number of posts from “sarkoactu”, it is about 96 posts per day. Looking at the distribution of posts per hour for that author, we notice that it is probably an automatic feed.

Twitter statistics

Looking at the full list of authors and their posts distribution per hour, we found out that 26 authors are probably automatic feeds and that they represent 215783 posts which is 2.5% of all posts.

Location metadata is attached to only 0.5% of all posts. In our case, this represents 40799 posts. There is not much difference between the candidates in regard to post location. We do notice that the posts are issued mainly from French-speaking countries: France, Belgium, Canada, Algeria, Tunisia… It makes sense as we analyzed posts written in French.

Twitter statistics

This Europe map shows that this event is mainly followed in France and only a little in the rest of Europe. The fact that we tracked the posts written in French contributes to this result.

Twitter statistics

Next, we looked at which other candidates an author talks about when their most talked-about candidate is person A. Below, you can see that if an author's most talked-about candidate is Nathalie Arthaud, then that author also talks about François Hollande, Nicolas Sarkozy and Marine Le Pen.

In 11 of 17 cases, the most talked-about candidate is Nicolas Sarkozy. Reciprocity is not a rule: when an author talks about Nicolas Hulot, they also talk about Eva Joly (2nd most), but the opposite is not true.

  • Nathalie Arthaud
    • François Hollande – 19.2%
    • Nicolas Sarkozy – 18.8%
    • Marine Le Pen – 11.1%
    • Philippe Poutou – 11.1%
    • Eva Joly – 9.2%
  • Martine Aubry
    • François Hollande – 31.4%
    • Nicolas Sarkozy – 19.5%
    • Ségolène Royal – 9.2%
    • Arnaud Montebourg – 8.6%
    • Marine Le Pen – 7.8%
  • Jean-Michel Baylet
    • François Hollande – 21.9%
    • Nicolas Sarkozy – 19.7%
    • Marine Le Pen – 10.1%
    • Arnaud Montebourg – 7.9%
    • Eva Joly – 7.5%
  • François Bayrou
    • François Hollande – 24.2%
    • Nicolas Sarkozy – 23.9%
    • Marine Le Pen – 10.2%
    • Jean-Luc Mélenchon – 8.2%
    • Eva Joly – 6.6%
  • Christine Boutin
    • Nicolas Sarkozy – 31.5%
    • François Hollande – 24.8%
    • Marine Le Pen – 11.2%
    • Ségolène Royal – 6.7%
    • Eva Joly – 4.9%
  • Nicolas Dupont Aignan
    • Nicolas Sarkozy – 24.1%
    • François Hollande – 23.1%
    • Marine Le Pen – 14.9%
    • Jean-Luc Mélenchon – 8.6%
    • Eva Joly – 8.3%
  • François Hollande
    • Nicolas Sarkozy – 32.8%
    • Marine Le Pen – 13.8%
    • Jean-Luc Mélenchon – 7.6%
    • François Bayrou – 7.5%
    • Eva Joly – 6.9%
  • Nicolas Hulot
    • Nicolas Sarkozy – 31.6%
    • Eva Joly – 18.3%
    • François Hollande – 10.7%
    • Marine Le Pen – 10.2%
    • Ségolène Royal – 8.9%
  • Eva Joly
    • Nicolas Sarkozy – 27.4%
    • Marine Le Pen – 15.2%
    • François Hollande – 14.5%
    • Jean-Luc Mélenchon – 7.6%
    • François Bayrou – 5.8%
  • Marine Le Pen
    • Nicolas Sarkozy – 33.6%
    • François Hollande – 19.8%
    • Jean-Luc Mélenchon – 14.0%
    • Eva Joly – 6.1%
    • Ségolène Royal – 5.5%
  • Jean-Luc Mélenchon
    • Nicolas Sarkozy – 23.3%
    • François Hollande – 15.9%
    • Marine Le Pen – 14.7%
    • François Bayrou – 8.8%
    • Eva Joly – 6.4%
  • Arnaud Montebourg
    • Nicolas Sarkozy – 25.2%
    • François Hollande – 15.8%
    • Ségolène Royal – 8.1%
    • Martine Aubry – 7.9%
    • Manuel Valls – 6.5%
  • Philippe Poutou
    • Nicolas Sarkozy – 32.0%
    • François Hollande – 20.6%
    • Marine Le Pen – 12.1%
    • Eva Joly – 6.8%
    • Ségolène Royal – 6.4%
  • Ségolène Royal
    • Nicolas Sarkozy – 32.4%
    • François Hollande – 19.2%
    • Marine Le Pen – 9.4%
    • Martine Aubry – 6.0%
    • Eva Joly – 5.2%
  • Nicolas Sarkozy
    • François Hollande – 21.4%
    • Marine Le Pen – 13.8%
    • François Bayrou – 9.2%
    • Jean-Luc Mélenchon – 8.8%
    • Eva Joly – 7.9%
  • Manuel Valls
    • François Hollande – 24.4%
    • Nicolas Sarkozy – 19.1%
    • Martine Aubry – 8.8%
    • Arnaud Montebourg – 7.9%
    • Marine Le Pen – 7.6%
  • Dominique Villepin
    • Nicolas Sarkozy – 18.9%
    • François Hollande – 16.4%
    • François Bayrou – 11.3%
    • Marine Le Pen – 9.7%
    • Eva Joly – 6.6%

The following graph shows connections between candidates based on the number of identical words used in posts referring to them. The wider the connection, the more words in common. An obvious link is Hollande – Sarkozy. Bayrou, being in the center politically, has two strong links: with Sarkozy and with Hollande, which is also expected. We notice another strong link between Le Pen and Sarkozy; this link makes sense based on some subjects discussed by both candidates.

Twitter statistics

Next is a similar chart but based on posts referring to multiple candidates. What is interesting here are the links going across political boundaries. Mélenchon has two major links: one with a candidate on the left and one with a candidate on the right. Joly has two links with candidates on the left and one with a candidate on the right. It makes sense when you know that Joly is more on the left of the political spectrum.

Twitter statistics

Let’s look at the major events for each candidate. As we are tracking the number of posts for each candidate, we can find out the events and what those were about by looking at the most used words in the posts.

The next chart shows that about 22500 posts talked about François Bayrou on March 9th. Looking at the most used words, we can see that the candidate was participating in a TV show called “Des paroles et des actes”, also abbreviated “dpda”. “soir” means evening and the TV show takes place in the evening.

Twitter statistics

See the section events in the annexes for the complete list of events for each candidate.

Next is a bar chart showing the number of authors mainly talking about one candidate (80% or more of their posts relate to that candidate only). We notice a strong online presence of authors mainly talking about François Hollande, and two other strong online presences: Marine Le Pen and Ségolène Royal.

Twitter statistics



Annexes

Twitter statistics: per-candidate event charts.

Twitter sentiment analysis using Python and NLTK

January 2, 2012

This post describes the implementation of sentiment analysis of tweets using Python and the natural language toolkit NLTK. The post also describes the internals of NLTK related to this implementation.


The purpose of the implementation is to be able to automatically classify a tweet as positive or negative, sentiment-wise.

The classifier needs to be trained and to do that, we need a list of manually classified tweets. Let’s start with 5 positive tweets and 5 negative tweets.

Positive tweets:

  • I love this car.
  • This view is amazing.
  • I feel great this morning.
  • I am so excited about the concert.
  • He is my best friend.

Negative tweets:

  • I do not like this car.
  • This view is horrible.
  • I feel tired this morning.
  • I am not looking forward to the concert.
  • He is my enemy.

In the full implementation, I use about 600 positive tweets and 600 negative tweets to train the classifier. I store those tweets in a Redis DB. Even with those numbers, it is quite a small sample and you should use a much larger set if you want good results.

Next is a test set so we can assess the accuracy of the trained classifier.

Test tweets:

  • I feel happy this morning. positive.
  • Larry is my friend. positive.
  • I do not like that man. negative.
  • My house is not great. negative.
  • Your song is annoying. negative.


The following list contains the positive tweets:

pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

The following list contains the negative tweets:

neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

We take both of those lists and create a single list of tuples, each containing two elements. The first element is a list containing the words and the second element is the type of sentiment. We get rid of words shorter than 3 characters and we lowercase everything.

tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3] 
    tweets.append((words_filtered, sentiment))

The list of tweets now looks like this:

tweets = [
    (['love', 'this', 'car'], 'positive'),
    (['this', 'view', 'amazing'], 'positive'),
    (['feel', 'great', 'this', 'morning'], 'positive'),
    (['excited', 'about', 'the', 'concert'], 'positive'),
    (['best', 'friend'], 'positive'),
    (['not', 'like', 'this', 'car'], 'negative'),
    (['this', 'view', 'horrible'], 'negative'),
    (['feel', 'tired', 'this', 'morning'], 'negative'),
    (['not', 'looking', 'forward', 'the', 'concert'], 'negative'),
    (['enemy'], 'negative')]

Finally, the list with the test tweets:

test_tweets = [
    (['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]


The list of word features needs to be extracted from the tweets. It is a list of every distinct word ordered by frequency of appearance. We use the following helper functions to build the list.

import nltk

def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_tweets(tweets))

If we take a peek inside the function get_word_features, the variable ‘wordlist’ contains:

    'this': 6,
    'car': 2,
    'concert': 2,
    'feel': 2,
    'morning': 2,
    'not': 2,
    'the': 2,
    'view': 2,
    'about': 1,
    'amazing': 1,

We end up with the following list of word features:

word_features = ['this', 'car', 'concert', 'feel', 'morning', 'not', 'the',
                 'view', 'about', 'amazing', ...]

As you can see, ‘this’ is the most used word in our tweets, followed by ‘car’, followed by ‘concert’…

To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor. The one we are going to use returns a dictionary indicating what words are contained in the input passed. Here, the input is the tweet. We use the word features list defined above along with the input to create the dictionary.

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

As an example, let’s call the feature extractor with the document [‘love’, ‘this’, ‘car’] which is the first positive tweet. We obtain the following dictionary which indicates that the document contains the words: ‘love’, ‘this’ and ‘car’.

{'contains(not)': False,
 'contains(view)': False,
 'contains(best)': False,
 'contains(excited)': False,
 'contains(morning)': False,
 'contains(about)': False,
 'contains(horrible)': False,
 'contains(like)': False,
 'contains(this)': True,
 'contains(friend)': False,
 'contains(concert)': False,
 'contains(feel)': False,
 'contains(love)': True,
 'contains(looking)': False,
 'contains(tired)': False,
 'contains(forward)': False,
 'contains(car)': True,
 'contains(the)': False,
 'contains(amazing)': False,
 'contains(enemy)': False,
 'contains(great)': False}

With our feature extractor, we can build our training set using the method apply_features. We pass the feature extractor along with the tweets list defined above.

training_set = nltk.classify.apply_features(extract_features, tweets)

The variable ‘training_set’ contains the labeled feature sets. It is a list of tuples, each tuple containing the feature dictionary and the sentiment string for each tweet. The sentiment string is also called the ‘label’.

[({'contains(not)': False,
   'contains(this)': True,
   'contains(love)': True,
   'contains(car)': True,
   ...
   'contains(great)': False},
  'positive'),
 ({'contains(not)': False,
   'contains(view)': True,
   'contains(this)': True,
   'contains(amazing)': True,
   ...
   'contains(enemy)': False,
   'contains(great)': False},
  'positive'),
 ...]

Now that we have our training set, we can train our classifier.

classifier = nltk.NaiveBayesClassifier.train(training_set)

Here is a summary of what we just saw:

Twitter sentiment analysis with Python and NLTK

The Naive Bayes classifier uses the prior probability of each label which is the frequency of each label in the training set, and the contribution from each feature. In our case, the frequency of each label is the same for ‘positive’ and ‘negative’. The word ‘amazing’ appears in 1 of 5 of the positive tweets and none of the negative tweets. This means that the likelihood of the ‘positive’ label will be multiplied by 0.2 when this word is seen as part of the input.

Let’s take a look inside the classifier train method in the source code of the NLTK library. ‘label_probdist’ is the prior probability of each label and ‘feature_probdist’ is the feature/value probability dictionary. Those two probability objects are used to create the classifier.

def train(labeled_featuresets, estimator=ELEProbDist):
    # ... (counting of label and feature frequencies from
    # labeled_featuresets elided) ...
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    # ... (feature_probdist filled using the estimator) ...
    return NaiveBayesClassifier(label_probdist, feature_probdist)

In our case, the probability of each label is 0.5 as we can see below. label_probdist is of type ELEProbDist.

print label_probdist.prob('positive')
0.5
print label_probdist.prob('negative')
0.5

The feature/value probability dictionary associates expected likelihood estimate to a feature and label. We can see that the probability for the input to be negative is about 0.077 when the input contains the word ‘best’.

print feature_probdist
{('negative', 'contains(view)'): <ELEProbDist based on 5 samples>,
 ('positive', 'contains(excited)'): <ELEProbDist based on 5 samples>,
 ('negative', 'contains(best)'): <ELEProbDist based on 5 samples>, ...}
print feature_probdist[('negative', 'contains(best)')].prob(True)

We can display the most informative features for our classifier using the method show_most_informative_features. Here, we see that if the input does not contain the word ‘not’, then the positive to negative ratio is 1.6.

print classifier.show_most_informative_features(32)
Most Informative Features
           contains(not) = False          positi : negati =      1.6 : 1.0
         contains(tired) = False          positi : negati =      1.2 : 1.0
       contains(excited) = False          negati : positi =      1.2 : 1.0
         contains(great) = False          negati : positi =      1.2 : 1.0
       contains(looking) = False          positi : negati =      1.2 : 1.0
          contains(like) = False          positi : negati =      1.2 : 1.0
          contains(love) = False          negati : positi =      1.2 : 1.0
       contains(amazing) = False          negati : positi =      1.2 : 1.0
         contains(enemy) = False          positi : negati =      1.2 : 1.0
         contains(about) = False          negati : positi =      1.2 : 1.0
          contains(best) = False          negati : positi =      1.2 : 1.0
       contains(forward) = False          positi : negati =      1.2 : 1.0
        contains(friend) = False          negati : positi =      1.2 : 1.0
      contains(horrible) = False          positi : negati =      1.2 : 1.0


Now that we have our classifier initialized, we can try to classify a tweet and see what the sentiment type output is. Our classifier is able to detect that this tweet has a positive sentiment because of the word ‘friend’ which is associated with the positive tweet ‘He is my best friend’.

tweet = 'Larry is my friend'
print classifier.classify(extract_features(tweet.split()))
positive

Let’s take a look at how the classify method works internally in the NLTK library. What we pass to the classify method is the feature set of the tweet we want to analyze. The feature set dictionary indicates that the tweet contains the word ‘friend’.

print extract_features(tweet.split())
{'contains(not)': False,
 'contains(view)': False,
 'contains(best)': False,
 'contains(excited)': False,
 'contains(morning)': False,
 'contains(about)': False,
 'contains(horrible)': False,
 'contains(like)': False,
 'contains(this)': False,
 'contains(friend)': True,
 'contains(concert)': False,
 'contains(feel)': False,
 'contains(love)': False,
 'contains(looking)': False,
 'contains(tired)': False,
 'contains(forward)': False,
 'contains(car)': False,
 'contains(the)': False,
 'contains(amazing)': False,
 'contains(enemy)': False,
 'contains(great)': False}
def classify(self, featureset):
    # Discard any feature names that we've never seen before.
    # Find the log probability of each label, given the features.
    # Then add in the log probability of features given labels.
    # Generate a probability distribution dictionary using the dict logprob
    # Return the sample with the greatest probability from the probability
    # distribution dictionary

Let’s go through that method using our example. The parameter passed to the method classify is the feature set dictionary we saw above. The first step is to discard any feature names that are not known by the classifier. This step does nothing in our case so the feature set stays the same.

The next step is to find the log probability of each label. The probability of each label (‘positive’ and ‘negative’) is 0.5. The log probability is the log base 2 of that, which is -1. We end up with logprob containing the following:

{'positive': -1.0, 'negative': -1.0}

The log probability of features given labels is then added to logprob. This means that for each label, we go through the items in the feature set and we add the log probability of each item to logprob[label]. For example, we have the feature name ‘friend’ and the feature value True. Its log probability for the label ‘positive’ in our classifier is -2.12. This value is added to logprob[‘positive’]. We end up with the following logprob dictionary.

{'positive': -5.4785441837188511, 'negative': -14.784261334886439}

The probability distribution dictionary of type DictionaryProbDist is generated:

DictionaryProbDist(logprob, normalize=True, log=True)

The label with the greatest probability is returned, which is ‘positive’. Our classifier finds out that this tweet has a positive sentiment based on the training we did.
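To make that accumulation concrete, here is a small standalone sketch in plain Python. The feature log probabilities below are made-up values for illustration only; in NLTK they come from the trained feature_probdist objects. This is a sketch of the idea, not the NLTK implementation itself.

import math

# Hypothetical log2 probabilities, for illustration only.
feature_logprob = {
    ('positive', 'contains(friend)', True): -2.12,
    ('negative', 'contains(friend)', True): -3.70,
}

featureset = {'contains(friend)': True}

# Start with the log prior of each label: log2(0.5) = -1.
logprob = {'positive': math.log(0.5, 2), 'negative': math.log(0.5, 2)}
for label in logprob:
    for fname, fval in featureset.items():
        # Add the log probability of each (feature, value) pair given the label.
        logprob[label] += feature_logprob.get((label, fname, fval), 0.0)

# The label with the greatest (least negative) log probability wins.
print(max(logprob, key=logprob.get))   # 'positive' with these numbers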

Another example is the tweet ‘My house is not great’. The word ‘great’ weighs more on the positive side, but the word ‘not’ is part of two negative tweets in our training set, so the output from the classifier is ‘negative’. Of course, the following tweet: ‘The movie is not bad’ would also return ‘negative’ even though it is positive. Again, a large and well chosen sample will help with the accuracy of the classifier.

Take the following test tweet: ‘Your song is annoying’. The classifier thinks it is positive. The reason is that we don’t have any information on the feature name ‘annoying’. The larger the training sample of tweets, the better the classifier will be.

tweet = 'Your song is annoying'
print classifier.classify(extract_features(tweet.split()))
positive

There is an accuracy method we can use to check the quality of our classifier by using our test tweets. We get 0.8 in our case which is high because we picked our test tweets for this article. The key is to have a very large number of manually classified positive and negative tweets.
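For reference, this is how that number can be computed, assuming the test_tweets list defined earlier; nltk.classify.accuracy takes the classifier and a list of labeled feature sets:

test_set = nltk.classify.apply_features(extract_features, test_tweets)
print(nltk.classify.accuracy(classifier, test_set))
# 0.8 with our hand-picked test tweets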

Voilà. Don’t hesitate to post a comment if you have any feedback.

Python dictionary implementation

August 29, 2011

This post describes how dictionaries are implemented in the Python language.

Dictionaries are indexed by keys and they can be seen as associative arrays. Let’s add 3 key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

The values can be accessed this way:

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

The key ‘d’ does not exist so a KeyError exception is raised.

Hash tables

Python dictionaries are implemented using hash tables. It is an array whose indexes are obtained using a hash function on the keys. The goal of a hash function is to distribute the keys evenly in the array. A good hash function minimizes the number of collisions, i.e. different keys having the same hash. Python does not have this kind of hash function: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

We are going to assume that we are using strings as keys for the rest of this post. The hash function for strings in Python is defined as:

arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

If you run hash(‘a’) in Python, it will execute string_hash() and return 12416037344. Here we assume we are using a 64-bit machine.
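As a sanity check, here is the pseudocode above transcribed to plain Python. It is only a sketch: the real implementation is in C (stringobject.c) and works on signed C longs, but for ‘a’ on a 64-bit build without hash randomization it produces the same value:

def string_hash(s):
    if not s:
        return 0
    x = ord(s[0]) << 7
    for c in s:
        # 64-bit wrap-around, mimicking the C long multiplication overflow
        x = ((1000003 * x) ^ ord(c)) & 0xFFFFFFFFFFFFFFFF
    x ^= len(s)
    return x

print(string_hash('a'))   # 12416037344, same as hash('a') on Python 2, 64-bit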

If an array of size x is used to store the key/value pairs, then we use a mask equal to x-1 to calculate the slot index of the pair in the array. This makes the computation of the slot index fast. The probability of finding an empty slot is high due to the resizing mechanism described below, so a simple computation makes sense in most cases. If the size of the array is 8, the index for ‘a’ will be: hash(‘a’) & 7 = 0. The index for ‘b’ is 3, the index for ‘c’ is 2, and the index for ‘z’ is 3, which is the same as ‘b’: we have a collision.
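We can check those slot indexes from the interpreter (on a 64-bit Python 2 build without hash randomization; Python 3 randomizes string hashes, so the values differ there):

mask = 8 - 1
for key in ('a', 'b', 'c', 'z'):
    print('%s -> slot %d' % (key, hash(key) & mask))
# a -> slot 0
# b -> slot 3
# c -> slot 2
# z -> slot 3   (collision with 'b')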

hash table

We can see that the Python hash function does a good job when the keys are consecutive which is good because it is quite common to have this type of data to work with. However, once we add the key ‘z’, there is a collision because it is not consecutive enough.

We could use a linked list to store the pairs having the same hash but it would increase the lookup time e.g. not O(1) average anymore. The next section describes the collision resolution method used in the case of Python dictionaries.

Open addressing

Open addressing is a method of collision resolution where probing is used. In the case of ‘z’, the slot at index 3 is already used in the array so we need to probe for a different index, one which is not already used. Adding a key/value pair and looking one up both run in O(1) on average.

A probing sequence based on the following recurrence is used to find a free slot. The code is the following:

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;

Recurring on 5*j+1 quickly magnifies small differences in the bits that didn’t affect the initial index. The variable “perturb” gets the other bits of the hash code into play.

Just out of curiosity, let’s look at the probing sequence when the table size is 32 and j = 3.
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

You can read more about this probing sequence by looking at the source code of dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.

open addressing

Now, let’s look at the Python internal code along with an example.

Dictionary C structures

The following C structure is used to store a dictionary entry: key/value pair. The hash, key and value are stored. PyObject is the base class of the Python objects.

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

The following structure represents a dictionary. ma_fill is the number of used slots + dummy slots. A slot is marked dummy when a key pair is removed. ma_used is the number of used slots (active). ma_mask is equal to the array’s size minus 1 and is used to calculate the slot index. ma_table is the array and ma_smalltable is the initial array of size 8.

typedef struct _dictobject PyDictObject;
struct _dictobject {
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Dictionary initialization

When you first create a dictionary, the function PyDict_New() is called. I removed some of the lines and converted the C code to pseudocode to concentrate on the key concepts.

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_mask) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object

Adding items

When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair. It checks if the key is a string and calculates the hash or reuses the one cached if it exists. insertdict() is called to add the new key/value pair and the dictionary is resized if the number of used slots + dummy slots is greater than 2/3 of the array’s size.
Why 2/3? It is to make sure the probing sequence can find a free slot fast enough. We will look at the resizing function later.

arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table
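Transcribed to Python, the 2/3 check is tiny. This is a sketch of the condition used in the C code, where ma_fill counts used + dummy slots and ma_mask + 1 is the table size:

def needs_resize(ma_fill, ma_mask):
    # Resize when used + dummy slots reach 2/3 of the table size.
    return ma_fill * 3 >= (ma_mask + 1) * 2

print(needs_resize(5, 7))   # False: 5 slots filled out of 8
print(needs_resize(6, 7))   # True: 6 out of 8 is over 2/3, dictresize() is called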

insertdict() uses the lookup function lookdict_string() to find a free slot. This is the same function used to find a key. lookdict_string() calculates the slot index using the hash and the mask values. If it cannot find the key at slot index = hash & mask, it starts probing, using the recurrence described above, until it finds a free slot. If a dummy slot was encountered during the probing, that dummy slot is returned first, which gives priority to reusing previously deleted slots.
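Before walking through an example, here is a toy Python sketch of that probe-and-insert logic. It is illustration only: the real insertdict()/lookdict_string() are in C and also deal with dummy slots, comparison errors and resizing, and the table here is assumed to never be full.

PERTURB_SHIFT = 5

def toy_insert(table, key, value):
    """Insert (hash, key, value) into a power-of-two sized table of slots."""
    mask = len(table) - 1
    h = hash(key)
    perturb = h & 0xFFFFFFFFFFFFFFFF   # treat the hash as unsigned, like the C code
    i = h & mask
    slot = i
    # Probe until we find a free slot or a slot already holding the same key.
    while table[slot] is not None and table[slot][1] != key:
        i = (5 * i) + 1 + perturb
        perturb >>= PERTURB_SHIFT
        slot = i & mask
    table[slot] = (h, key, value)

table = [None] * 8
for k, v in (('a', 1), ('b', 2), ('z', 26)):
    toy_insert(table, k, v)
print(table)   # on Python 2 without hash randomization, 'z' collides with 'b'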

We want to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}. This is what happens:

A dictionary structure is allocated with internal table size of 8.

  • PyDict_SetItem: key = ‘a’, value = 1
    • hash = hash(‘a’) = 12416037344
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12416037344 & 7 = 0
        • slot 0 is not used so return it
      • init entry at index 0 with key, value and hash
      • ma_used = 1, ma_fill = 1
  • PyDict_SetItem: key = ‘b’, value = 2
    • hash = hash(‘b’) = 12544037731
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12544037731 & 7 = 3
        • slot 3 is not used so return it
      • init entry at index 3 with key, value and hash
      • ma_used = 2, ma_fill = 2
  • PyDict_SetItem: key = ‘z’, value = 26
    • hash = hash(‘z’) = 15616046971
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15616046971 & 7 = 3
        • slot 3 is used so probe for a different slot: 5 is free
      • init entry at index 5 with key, value and hash
      • ma_used = 3, ma_fill = 3
  • PyDict_SetItem: key = ‘y’, value = 25
    • hash = hash(‘y’) = 15488046584
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15488046584 & 7 = 0
        • slot 0 is used so probe for a different slot: 1 is free
      • init entry at index 1 with key, value and hash
      • ma_used = 4, ma_fill = 4
  • PyDict_SetItem: key = ‘c’, value = 3
    • hash = hash(‘c’) = 12672038114
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12672038114 & 7 = 2
        • slot 2 is free so return it
      • init entry at index 2 with key, value and hash
      • ma_used = 5, ma_fill = 5
  • PyDict_SetItem: key = ‘x’, value = 24
    • hash = hash(‘x’) = 15360046201
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15360046201 & 7 = 1
        • slot 1 is used so probe for a different slot: 7 is free
      • init entry at index 7 with key, value and hash
      • ma_used = 6, ma_fill = 6

This is what we have so far:

python dictionary insert

6 slots out of 8 are now used, so we are over 2/3 of the array’s capacity. dictresize() is called to allocate a larger array. This function also takes care of copying the old table entries to the new table.

dictresize() is called with minused = 24 in our case which is 4 * ma_used. 2 * ma_used is used when the number of used slots is very large (greater than 50000). Why 4 times the number of used slots? It reduces the number of resize steps and it increases sparseness.

The new table size needs to be greater than 24 and it is calculated by shifting the current size 1 bit left until it is greater than 24. It ends up being 32 e.g. 8 -> 16 -> 32.
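The size computation itself can be sketched in a few lines of Python (mirroring the loop in dictresize()):

def new_table_size(minused, minsize=8):
    newsize = minsize
    while newsize <= minused:
        newsize <<= 1
    return newsize

print(new_table_size(24))   # 32, i.e. 8 -> 16 -> 32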

This is what happens with our table during resizing: a new table of size 32 is allocated. Old table entries are inserted into the new table using the new mask value which is 31. We end up with the following:

python dictionary table resizing

Removing items

PyDict_DelItem() is called to remove an entry. The hash for this key is calculated and the lookup function is called to return the entry. The slot is now a dummy slot.

We want to remove the key ‘c’ from our dictionary. We end up with the following array:

Python dictionary delete key

Note that the delete operation doesn’t trigger an array resize, even if the number of used slots becomes much smaller than the total number of slots. However, when a key/value pair is added next, the need for a resize is based on the number of used slots + dummy slots, so that addition can shrink the array too.

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback. If you need help with a project written in Python or with building a new web service, I am available as a freelancer: LinkedIn profile. Follow me on Twitter @laurentluce.

Python string objects implementation

June 19, 2011

This article describes how string objects are managed by Python internally and how string search is done.

PyStringObject structure
New string object
Sharing string objects
String search

PyStringObject structure

A string object in Python is represented internally by the structure PyStringObject. “ob_shash” is the hash of the string if it has been calculated. “ob_sval” contains the string of size “ob_size”. The string is null terminated. The initial size of “ob_sval” is 1 byte and ob_sval[0] = 0. If you are wondering where “ob_size” is defined, take a look at PyObject_VAR_HEAD in object.h. “ob_sstate” indicates if the string object is in the interned dictionary, which we are going to see later.

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;

New string object

What happens when you assign a new string to a variable like this one?

>>> s1 = 'abc'

The internal C function “PyString_FromString” is called and the pseudo code looks like this:

arguments: string object: 'abc'
returns: Python string object with ob_sval = 'abc'
function PyString_FromString:
    size = length of string
    allocate string object + size for 'abc'. ob_sval will be of size: size + 1
    copy string to ob_sval
    return object

Each time a new string is used, a new string object is allocated.

Sharing string objects

There is a neat feature where small strings are shared between variables. This reduces the amount of memory used. Small strings are strings of size 0 or 1 byte. The global variable “interned” is a dictionary referencing those small strings. The array “characters” is also used to reference the strings of length 1 byte: i.e. single characters. We will see later how the array “characters” is used.

static PyStringObject *characters[UCHAR_MAX + 1];
static PyObject *interned;

Let’s see what happens when a new small string is assigned to a variable in your Python script.

>>> s2 = 'a'

The string object containing ‘a’ is added to the dictionary “interned”. The key is a pointer to the string object and the value is the same pointer. This new string object is also referenced in the array “characters” at offset 97 because the value of ‘a’ is 97 in ASCII. The variable “s2” is pointing to this string object.

Python string object internals

What happens when a different variable is assigned to the same string ‘a’?

>>> s3 = 'a'

The same string object previously created is returned, so both variables point to the same string object. The “characters” array is used during that process to check if the string already exists; if it does, the pointer to the existing string object is returned.

if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL)
    return (PyObject *)op;
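We can observe this sharing directly from the interpreter. Keep in mind this is a CPython implementation detail, not something the language guarantees:

>>> s2 = 'a'
>>> s3 = 'a'
>>> s2 is s3
True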

Python string object internals

Let’s create a new small string containing the character ‘c’.

>>> s4 = 'c'

We end up with the following:

Python string object internals

We also find the “characters” array in use when a string’s item is requested, like in the following Python script:

>>> s5 = 'abc'
>>> s5[0]
'a'

Instead of creating a new string containing ‘a’, the pointer at the offset 97 of the “characters” array is returned. Here is the code of the function “string_item” which is called when we request a character from a string. The argument “a” is the string object containing ‘abc’ and the argument “i” is the index requested: 0 in our case. A pointer to a string object is returned.

static PyObject *
string_item(PyStringObject *a, register Py_ssize_t i)
{
    char pchar;
    PyObject *v;
    pchar = a->ob_sval[i];
    v = (PyObject *)characters[pchar & UCHAR_MAX];
    if (v == NULL)
        // not shared yet: allocate a new string object of size 1
        v = PyString_FromStringAndSize(&pchar, 1);
    else {
        // shared object found: just increment its reference count
        Py_INCREF(v);
    }
    return v;
}

The “characters” array is also used for function names of length 1:

>>> def a(): pass

String search

Let’s take a look at what happens when you perform a string search like in the following Python code:

>>> s = 'adcabcdbdabcabd'
>>> s.find('abcab')
9

The “find” function returns the index where the string ‘abcab’ is found in the string “s”. It returns -1 if the string is not found.

So, what happens internally? The function “fastsearch” is called. It is a mix between the Boyer-Moore and Horspool algorithms, plus a couple of neat tricks.

Let’s call “s” the string to search in and “p” the string to search for. s = ‘adcabcdbdabcabd’ and p = ‘abcab’. “n” is the length of “s” and “m” is the length of “p”: n = 15 and m = 5.

The first check in the code is obvious, if m > n then we know that we won’t be able to find the index so the function returns -1 right away as we can see in the following code:

w = n - m;
if (w < 0)
    return -1;

When m = 1, the code goes through “s” one character at a time and returns the index when there is a match. mode = FAST_SEARCH in our case as we are looking for the first index where the string is found and not the number of times the string is found.

if (m <= 1) {
    if (mode == FAST_COUNT) {
        /* count the number of occurrences (not our case) */
    } else {
        for (i = 0; i < n; i++)
            if (s[i] == p[0])
                return i;
    }
    return -1;
}

For the other cases, i.e. m > 1, the first step is to create a compressed Boyer-Moore delta 1 table. Two variables will be assigned during that step: “mask” and “skip”.

“mask” is a 32-bit bitmask, using the 5 least significant bits of the character as the key. It is generated using the string to search “p”. It is a bloom filter which is used to test if a character is present in this string. It is really fast but there are false positives. You can read more about bloom filters here. This is how the bitmask is generated in our case:

mlast = m - 1;
/* process pattern[:-1] */
for (mask = i = 0; i < mlast; i++) {
    mask |= (1 << (p[i] & 0x1F));
}
/* process pattern[-1] outside the loop */
mask |= (1 << (p[mlast] & 0x1F));

The first character of “p” is ‘a’. The value of ‘a’ is 97 = 1100001 in binary. Using the 5 least significant bits, we get 00001, so “mask” is first set to: 1 << 1 = 10. Once the entire string “p” is processed, mask = 1110. How do we use this bitmask? With the following test, where “c” is the character to look for in the string “p”:

if ((mask & (1 << (c & 0x1F))))

Is ‘a’ in “p” where p = ‘abcab’? Is 1110 & (1 << (‘a’ & 0x1F)) true? 1110 & (1 << (‘a’ & 0x1F)) = 1110 & 10 = 10, so yes, ‘a’ is in ‘abcab’. If we test with ‘d’, we get false, and the same goes for the characters from ‘e’ to ‘z’, so this filter works pretty well in our case.

“skip” is set to the number of characters we can skip over, starting from the end of “p”, before finding the last character of “p” again; it is set to the length of “p” minus 2 if that character appears only once. The last character in the string to search for is ‘b’, which means “skip” will be set to 2 because this character can also be found by skipping over 2 characters down. This variable is used in a skip method called the bad-character skip method.

In the following example, p = ‘abcab’ and s = ‘adcabcaba’. The search starts at index 4 of “s” and checks backward if there is a string match. This first test fails at index = 1, where ‘b’ is different from ‘d’. We know that the character ‘b’ in “p” is also found 3 characters down starting from the end. Because ‘c’ is part of “p”, we skip to the following ‘b’. This is the bad-character skip.

Python string object internals
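Here is a quick Python check of the mask arithmetic described above (illustration only, the real code is in C):

p = 'abcab'
mask = 0
for c in p:
    mask |= 1 << (ord(c) & 0x1F)
print(bin(mask))                               # 0b1110
print(bool(mask & (1 << (ord('a') & 0x1F))))   # True: 'a' may be in p
print(bool(mask & (1 << (ord('d') & 0x1F))))   # False: 'd' is definitely not in p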

Next is the search loop itself (real code is in C instead of Python):

for i = 0 to n - m = 10:
    if s[i+m-1] == p[m-1]:
        if s[i:i+mlast] == p[0:mlast]:
            return i
        if s[i+m] not in p:
            i += m
        else:
            i += skip
    else:
        if s[i+m] not in p:
            i += m
return -1

The test “s[i+m] not in p” is done using the bitmask. “i += skip” is the bad-character skip. “i += m” is done when the next character is not found in “p”.

Let’s see how this search algorithm works with our strings “p” and “s”. At the first attempt, the last character matches but the full comparison fails, so we shift using the bad-character skip. At the second attempt, the comparison fails again and the character ‘d’ following the window is not in the string “p”, so we skip the length of “p” and find the match right after.

Python string object internals
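To tie everything together, here is a compact Python sketch of the simplified search described by the pseudocode above. It is not the actual C implementation (which handles counting mode, sign issues and more), but it follows the same mask and skip logic and finds the same index as str.find() on our example:

def fastsearch(s, p):
    """Simplified sketch of the mask + skip search loop described above."""
    n, m = len(s), len(p)
    if m > n:
        return -1
    if m <= 1:
        return s.find(p)   # trivial cases, handled separately in the real code
    mlast = m - 1
    # Build the 32-bit bloom mask and the bad-character skip.
    mask = 0
    skip = mlast - 1
    for i in range(mlast):
        mask |= 1 << (ord(p[i]) & 0x1F)
        if p[i] == p[mlast]:
            skip = mlast - i - 1
    mask |= 1 << (ord(p[mlast]) & 0x1F)

    i = 0
    while i <= n - m:
        if s[i + m - 1] == p[mlast]:
            if s[i:i + mlast] == p[:mlast]:
                return i                      # full match
            if i + m < n and not (mask & (1 << (ord(s[i + m]) & 0x1F))):
                i += m                        # next character not in p
            else:
                i += skip                     # bad-character skip
        else:
            if i + m < n and not (mask & (1 << (ord(s[i + m]) & 0x1F))):
                i += m
        i += 1
    return -1

print(fastsearch('adcabcdbdabcabd', 'abcab'))   # 9, same as str.find()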

You can take a look at the code for the string objects here.

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback. If you need help with a project written in Python or with building a new web service, I am available as a freelancer: LinkedIn profile. Follow me on Twitter @laurentluce.

Python integer objects implementation

May 15, 2011

This article describes how integer objects are managed by Python internally.

An integer object in Python is represented internally by the structure PyIntObject. Its value is an attribute of type long.

typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;

To avoid allocating a new integer object each time a new integer object is needed, Python allocates a block of free unused integer objects in advance.

The following structure is used by Python to allocate integer objects, also called PyIntObjects. Once this structure is initialized, the integer objects are ready to be used when new integer values are assigned to objects in a Python script. This structure is called “PyIntBlock” and is defined as:

struct _intblock {
    struct _intblock *next;
    PyIntObject objects[N_INTOBJECTS];
};
typedef struct _intblock PyIntBlock;

When a block of integer objects is allocated by Python, the objects have no value assigned to them yet. We call them free integer objects ready to be used. A value will be assigned to the next free object when a new integer value is used in your program. No memory allocation will be required when a free integer object’s value is set so it will be fast.

The integer objects inside the block are linked together back to front using their internal pointer called ob_type. As noted in the source code, this is an abuse of this internal pointer so do not pay too much attention to the name.

Each block of integers contains the number of integer objects which can fit in a block of 1K bytes, about 40 PyIntObject objects on my 64-bit machine. When all the integer objects inside a block are used, a new block is allocated with a new list of integer objects available.

A singly-linked list is used to keep track of the integers blocks allocated. It is called “block_list” internally.

Python integer object internals

A specific structure is used to refer to small integers and share them so access is fast. It is an array of 262 pointers to integer objects. Those integer objects are allocated during initialization in a block of integer objects like the one we saw above. The small integers range from -5 to 256. Many Python programs spend a lot of time using integers in that range, so this is a smart decision.

#define NSMALLPOSINTS           257
#define NSMALLNEGINTS           5
static PyIntObject *small_ints[NSMALLNEGINTS + NSMALLPOSINTS];

Python integer object internals

The integer object representing the integer -5 is at offset 0 inside the small integers array, the integer object representing -4 is at offset 1, and so on.
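We can observe the small integers sharing from the interpreter (again, a CPython implementation detail; the exact behavior can vary between versions and between interactive and compiled code):

>>> a = 256
>>> b = 256
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a is b
False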

What happens when an integer is defined in a Python script like this one?

>>> a=1
>>> a
1

When you execute the first line, the function PyInt_FromLong is called and its logic is the following:

arguments: long value
returns: integer object
function PyInt_FromLong:
    if integer value in range -5,256:
        return the integer object pointed by the small integers array at the
        offset (value + 5)
    else:
        if no free integer object available:
            allocate new block of integer objects
        set value of the next free integer object in the current block
        of integers
        return integer object

With our example: integer 1 object is pointed by the small integers array at offset: 1+5 = 6. A pointer to this integer object will be returned and the variable “a” will be pointing to that integer object.

Python integer object internals

Let’s take a look at a different example:

>>> a=300
>>> a
300

300 is not in the range of the small integers array so the next free integer object’s value is set to 300.

Python integer object internals

If you take a look at the file intobject.c in the Python 2.6 source code, you will see a long list of functions taking care of operations like addition, multiplication, conversion… The comparison function looks like this:

static int
int_compare(PyIntObject *v, PyIntObject *w)
{
    register long i = v->ob_ival;
    register long j = w->ob_ival;
    return (i < j) ? -1 : (i > j) ? 1 : 0;
}

The value of an integer object is stored in its ob_ival attribute which is of type long. Each value is placed in a register to optimize access and the comparison is done between those 2 registers. -1 is returned if the integer object pointed by v is less than the one pointed by w. 1 is returned for the opposite and 0 is returned if they are equal.

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback.
