Cambridge city geospatial statistics

February 17, 2014

Using the Cambridge (Massachusetts) GIS data, we can compute some interesting geospatial statistics.

Subway stations

Using the address points (20844 points) and the subway stations data files, we can find out how many Cambridge addresses are within a certain radius of a Cambridge subway station. More specifically, we want to find the percentage of Cambridge addresses less than 1/8, 1/4 and 1/2 mile from a Cambridge subway station.

  • Less than 1/8 of a mile: 3%
  • Less than 1/4 of a mile: 14%
  • Less than 1/2 of a mile: 57%

Note: You can have many people living at the same address point.

Note: This does not take into consideration subway stations outside Cambridge.

Note: Distance is a straight line between two points.

Bus shelters

We can perform the same calculation as above but with bus shelters (55 in Cambridge).

  • Less than 1/8 of a mile: 26%
  • Less than 1/4 of a mile: 68%
  • Less than 1/2 of a mile: 99%

Note: This does not take into consideration bus shelters outside Cambridge.

Sports areas

We can find the percentage of Cambridge addresses less than 1/8, 1/4 and 1/2 mile from a sports area. In addition, we can do the same calculation for a specific sports area type:

Any sports area (265 in Cambridge)

  • Less than 1/8 of a mile: 66.8%
  • Less than 1/4 of a mile: 95.8%
  • Less than 1/2 of a mile: 99.9%

Basketball sport area (34 in Cambridge)

  • Less than 1/8 of a mile: 31.3%
  • Less than 1/4 of a mile: 75.3%
  • Less than 1/2 of a mile: 99.3%

Tennis sport area (18 in Cambridge)

  • Less than 1/8 of a mile: 7%
  • Less than 1/4 of a mile: 31%
  • Less than 1/2 of a mile: 80%

Park sport area (1 in Cambridge)

  • Less than 1/8 of a mile: 0.02%
  • Less than 1/4 of a mile: 1.48%
  • Less than 1/2 of a mile: 10.17%

Field sport area (24 in Cambridge)

  • Less than 1/8 of a mile: 10%
  • Less than 1/4 of a mile: 38%
  • Less than 1/2 of a mile: 89%

Baseball sport area (23 in Cambridge)

  • Less than 1/8 of a mile: 14%
  • Less than 1/4 of a mile: 48%
  • Less than 1/2 of a mile: 92%

Playground sport area (138)

  • Less than 1/8 of a mile: 58%
  • Less than 1/4 of a mile: 93%
  • Less than 1/2 of a mile: 100%

Note: This does not take into consideration sports areas outside Cambridge.

Neighborhoods

One data file lists the different neighborhoods, 13 in total:

  • Area Four
  • Neighborhood Nine
  • Area 2/MIT
  • Cambridgeport
  • Riverside
  • Mid-Cambridge
  • Wellington-Harrington
  • East Cambridge
  • Agassiz
  • Cambridge Highlands
  • Strawberry Hill
  • West Cambridge
  • North Cambridge

Cambridge neighborhoods

Single-Family assessed value

Using two data files, we can find out the average assessed value for a single-family house per neighborhood. Here are the results sorted from most expensive to least expensive:

  • West Cambridge – $1507544
  • Neighborhood Nine – $1451155
  • Agassiz – $1075459
  • Mid-Cambridge – $786673
  • Riverside – $712027
  • Strawberry Hill – $707192
  • Area 2/MIT – $599264
  • Cambridgeport – $592257
  • North Cambridge – $534844
  • Cambridge Highlands – $507051
  • Area Four – $505535
  • Wellington-Harrington – $456673
  • East Cambridge – $407907

Cambridge Single-Family assessed value

Single-Family assessed value per sq. ft.

We can also perform a similar calculation per sq. ft. Here are the results sorted from most expensive to least expensive:

  • West Cambridge – $318
  • Neighborhood Nine – $318
  • Agassiz – $287
  • Mid-Cambridge – $261
  • Riverside – $253
  • Area 2/MIT – $225
  • Cambridgeport – $223
  • Cambridge Highlands – $222
  • Wellington-Harrington – $208
  • Strawberry Hill – $207
  • North Cambridge – $204
  • East Cambridge – $203
  • Area Four – $200

Condo assessed value

  • Agassiz – $603000
  • West Cambridge – $591502
  • Neighborhood Nine – $581251
  • Mid-Cambridge – $483449
  • Riverside – $470751
  • Cambridge Highlands – $449725
  • Area 2/MIT – $431406
  • Cambridgeport – $422049
  • North Cambridge – $410023
  • Strawberry Hill – $387065
  • Area Four – $384520
  • Wellington-Harrington – $366460
  • East Cambridge – $357547

Condo assessed value per sq. ft.

  • West Cambridge – $395
  • Neighborhood Nine – $387
  • Agassiz – $382
  • Mid-Cambridge – $376
  • Riverside – $371
  • Cambridge Highlands – $338
  • Area 2/MIT – $334
  • Cambridgeport – $334
  • Wellington-Harrington – $335
  • Area Four – $335
  • East Cambridge – $335
  • North Cambridge – $330
  • Strawberry Hill – $313

Vegetation

There is a data file listing areas with a concentration of vegetation. Here is the definition: “Vegetation is a wooded area with a group or high density of trees.”

Each of those is a polygon. We can find which neighborhoods contain which vegetation polygons and calculate the total area of vegetation per neighborhood. Which neighborhood has the most vegetation? It is “Cambridge Highlands”, thanks to the Fresh Pond vegetation: it contains 32% of the total vegetation in Cambridge, followed by “North Cambridge” with 26%. Here is the full ranking, followed by a rough sketch of the calculation:

  • Cambridge Highlands – 31.7%
  • North Cambridge – 25.7%
  • West Cambridge – 22.1%
  • Strawberry Hill – 7.5%
  • Neighborhood Nine – 4.1%
  • Agassiz – 3.1%
  • Area 2/MIT – 2.3%
  • East Cambridge – 1.6%
  • Cambridgeport – 1.5%
  • Riverside – 0.2%
  • Mid-Cambridge – 0.1%
  • Wellington-Harrington – 0%
  • Area Four – 0%
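
For illustration, here is a rough sketch of how this could be computed with the same SnowFloat calls used elsewhere in this post. The layer UUIDs are placeholders and polygon_area() is a hypothetical helper (for example backed by Shapely) that is not defined here.

# A rough sketch, not the actual code: layer UUIDs are placeholders and
# polygon_area() is a hypothetical helper returning a polygon's area.
import collections

import snowfloat.client

client = snowfloat.client.Client()
neighborhoods_uuid = '<BOUNDARY_CDDNeighborhoods layer UUID>'
vegetation_uuid = '<BASEMAP_Vegetation layer UUID>'

totals = collections.defaultdict(float)
for neighborhood in client.get_features(neighborhoods_uuid):
    # Vegetation polygons intersecting this neighborhood polygon.
    polygons = client.get_features(vegetation_uuid,
        query='intersects', geometry=neighborhood.geometry)
    for polygon in polygons:
        totals[neighborhood.fields['name']] += polygon_area(polygon.geometry)

total = sum(totals.values())
for name, area in sorted(totals.items(), key=lambda item: -item[1]):
    print '%s - %.1f%%' % (name, 100 * area / total)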

Fire Stations

There are eight fire stations in Cambridge:

  • Fire Headquarters
  • Fire Company 2
  • Fire Company 3
  • Fire Company 4
  • Fire Company 5
  • Fire Company 6
  • Fire Company 8
  • Fire Company 9

Going back to each of our neighborhoods, we calculate the geometric center of each neighborhood and find the closest fire station. Here are the results ranked by distance to the closest fire station.

  • Neighborhood Nine center is 164 meters from Fire Company 8.
  • East Cambridge center is 328 meters from Fire Company 3.
  • Area Four center is 407 meters from Fire Company 2.
  • Cambridgeport center is 492 meters from Fire Company 6.
  • West Cambridge center is 534 meters from Fire Company 9.
  • Riverside center is 559 meters from Fire Company 6.
  • Agassiz center is 602 meters from Fire Headquarters.
  • Mid-Cambridge center is 613 meters from Fire Headquarters.
  • Wellington-Harrington center is 694 meters from Fire Company 5.
  • Area 2/MIT center is 910 meters from Fire Company 6.
  • Strawberry Hill center is 987 meters from Fire Company 9.
  • North Cambridge center is 1271 meters from Fire Company 4.
  • Cambridge Highlands center is 1340 meters from Fire Company 9.

Note: 1 meter ~= 1.1 yard.
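
Here is a minimal sketch of how these distances could be computed. The layer UUIDs are placeholders, the field names come from the layer listings later in this post, and the call signatures mirror the SnowFloat examples used elsewhere on this blog; this is not the actual code.

# A minimal sketch, not the actual code: layer UUIDs are placeholders and
# the call signatures follow the SnowFloat examples shown in this post.
import snowfloat.client

client = snowfloat.client.Client()
neighborhoods_uuid = '<BOUNDARY_CDDNeighborhoods layer UUID>'
fire_stations_uuid = '<PUBLICSAFETY_FireStations layer UUID>'

# Geometric center of each neighborhood.
neighborhoods = client.get_features(neighborhoods_uuid,
    spatial_operation='centroid')
for neighborhood in neighborhoods:
    # Fire stations sorted by distance to the centroid; keep the closest.
    station = client.get_features(fire_stations_uuid,
        spatial_operation='distance',
        spatial_geometry=neighborhood.spatial,
        order_by=('spatial',),
        query_slice=(0, 1))[0]
    print '%s center is %d meters from %s.' % (
        neighborhood.fields['name'], station.spatial,
        station.fields['site_name'])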

Emergency Shelters

We can do the same with emergency shelters. There are 15 in Cambridge. Here are the results ranked by distance to the closest shelter.

  • Wellington-Harrington center is 68 meters from Frisoli Youth Center.
  • Strawberry Hill center is 111 meters from Russell / West Cambridge Youth Ctr.
  • Area Four center is 138 meters from Fletcher / Maynard Academy.
  • Agassiz center is 212 meters from Baldwin School.
  • Mid-Cambridge center is 223 meters from Cambridge Rindge and Latin.
  • Neighborhood Nine center is 469 meters from Graham & Parks School.
  • Cambridgeport center is 567 meters from Morse School.
  • East Cambridge center is 634 meters from Kennedy / Longfellow School.
  • North Cambridge center is 715 meters from Peabody School.
  • Area 2/MIT center is 782 meters from CASPAR Shelter.
  • Riverside center is 838 meters from Cambridge Rindge and Latin.
  • West Cambridge center is 917 meters from Tobin School.
  • Cambridge Highlands center is 1038 meters from Tobin School.

Intersections

One of the data files contains the street/road intersections. An interesting attribute is the intersecting street count. Here is the breakdown based on this attribute:

  • 2 intersecting streets: 1392
  • 3 intersecting streets: 109
  • 4 intersecting streets: 2

The two “4 intersecting streets” intersections are the “Coolidge Hill Rd & Fresh Pond Pkwy & Gerrys Landing Rd & Mt Auburn St” intersection and the “Magazine St & Massachusetts Ave & Prospect St & Western Ave” intersection. After all, Cambridge is notorious for those types of intersections.

Note: There are 357 intersections without an intersecting street count, so we dismissed those.

Metered parking spaces

Based on one of the data files, Cambridge has 3094 metered parking spaces. A different data file lists street segments. Using both, we can measure how much of the street network has metered parking: we found that 13% of street segments in Cambridge have metered parking spaces.

Implementation

Python and the SnowFloat API are used here to compute those statistics. Here is how the subway stations 1/8 mile proximity statistics were calculated:

import snowfloat.client

client = snowfloat.client.Client()

address_points_uuid = 'ecf03fb0e1f34773b53959a2c603e214'
subway_stations_uuid = '32d771c0bc0641eb9fc5c2f04305ecb9'

# 1/8 of a mile is roughly 200 meters.
points = []
subway_stations = client.get_features(subway_stations_uuid)
for subway_station in subway_stations:
    # Address points within 200 meters of this subway station.
    address_points = client.get_features(address_points_uuid,
        query='distance_lte', geometry=subway_station.geometry,
        distance=200)
    # Only count each address point once across stations.
    for address_point in address_points:
        for point in points:
            if address_point.uuid == point.uuid:
                break
        else:
            points.append(address_point)
# 20844 is the total number of address points in Cambridge.
print len(points) / 20844.

I can publish more of the code used if needed.
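
As another example, here is a rough sketch of how the metered-parking street coverage mentioned above could be derived with the same API. This is not the actual code: the layer UUIDs are placeholders and the 10-meter tolerance for matching a metered space to a segment is an arbitrary choice for illustration.

# A rough sketch, not the actual code.
import snowfloat.client

client = snowfloat.client.Client()
segments_uuid = '<street segments layer UUID>'
meters_uuid = '<TRAFFIC_MeteredParkingSpaces layer UUID>'

segments = client.get_features(segments_uuid)
covered = 0
for segment in segments:
    # Is there at least one metered space on (or very close to) this segment?
    spaces = client.get_features(meters_uuid,
        query='distance_lte', geometry=segment.geometry, distance=10)
    if spaces:
        covered += 1
print '%.0f%% of street segments have metered parking spaces' % (
    100. * covered / len(segments))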

More data to analyze

Here is the full list of data files the city of Cambridge made public in the geojson format. Also listed are the fields for each data file. Let me know other statistics you would like to see added to this post.

ADDRESS_AddressPoints

  • address_id
  • ml
  • stname
  • stnm
  • bldgid
  • full_addr
  • entry
  • type
  • editdate

ADDRESS_MasterAddressBlocks

  • unq_id2
  • unq_id
  • nhood
  • globalid

ASSESSING_CommercialDistrictsFY2013

  • dist_name
  • globalid
  • district

ASSESSING_CommercialDistrictsFY2014

  • dist_name
  • district

ASSESSING_EasementsFY2013

  • globalid

ASSESSING_EasementsFY2014

ASSESSING_ParcelMapIndexFY2013

  • bsize_inch_to_foot
  • dsize_map_scale
  • globalid
  • bsize_map_scale
  • pcix_no
  • rotation
  • dsize_inch_to_foot

ASSESSING_ParcelMapIndexFY2014

  • bsize_inch_to_foot
  • dsize_map_scale
  • bsize_map_scale
  • pcix_no
  • rotation
  • dsize_inch_to_foot

ASSESSING_ParcelsFY2013

  • map
  • globalid
  • ml
  • fcode
  • lot
  • uyear
  • editdate
  • editor
  • approved

ASSESSING_ParcelsFY2014

  • map
  • uyear
  • loc_id
  • lot
  • ml
  • editdate
  • editor
  • poly_type
  • source
  • plan_id

ASSESSING_ResidentialDistrictFY2013

  • globalid
  • district

ASSESSING_SubParcelLinesFY2013

  • globalid

ASSESSING_SubParcelLinesFY2014

BASEMAP_Bridges

  • type

BASEMAP_Buildings

  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl
  • bldgid
  • editdate

BASEMAP_Cemeteries

  • type
  • name

BASEMAP_Curbs

  • type

BASEMAP_Decks

  • elev_sl
  • base_elev
  • top_gl
  • elev_gl
  • top_sl

BASEMAP_Docks

  • type

BASEMAP_Driveways

  • type

BASEMAP_Fences

  • type

BASEMAP_Firewalls

  • type

BASEMAP_Headstones

  • type

BASEMAP_MiscStructures

  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl

BASEMAP_ParkingLots

  • type

BASEMAP_Plazas

  • type

BASEMAP_Porches

  • elev_sl
  • base_elev
  • top_gl
  • elev_gl
  • top_sl

BASEMAP_PrivateWalkways

  • type

BASEMAP_PublicFootpaths

  • type

BASEMAP_Roads

  • type

BASEMAP_RooftopMechanicals

  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl

BASEMAP_RooftopSolarPanels

  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl

BASEMAP_Sidewalks

  • type
  • editdate

BASEMAP_Stairs

  • top_gl
  • stair_type
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl

BASEMAP_SwimmingPools

  • top_gl
  • elev_sl
  • top_sl
  • base_elev
  • type
  • elev_gl

BASEMAP_Vegetation

  • type

BASEMAP_Walls

  • type

BOUNDARY_CDDNeighborhoods

  • webpage
  • n_hood
  • name
  • globalid

BOUNDARY_CityBoundary

  • globalid

BOUNDARY_Zipcodes

  • globalid
  • zip_code

CDD_ZoningDistricts

  • zone_type
  • zcode
  • globalid
  • pud_type_new
  • pud_type

CDD_ZoningOverlayDistricts

  • name
  • overlap
  • globalid
  • type
  • label
  • ordnum
  • editdate

DEMOGRAPHICS_BlockGroups1990

  • blkgrpid
  • globalid

DEMOGRAPHICS_BlockGroups2000

  • blkgrps_id
  • globalid
  • tract
  • group_
  • fipsstco
  • stfid

DEMOGRAPHICS_BlockGroups2010

  • awater10
  • tractce10
  • aland10
  • blkgrpce10
  • funcstat10
  • namelsad10
  • countyfp10
  • statefp10
  • mtfcc10
  • geoid10
  • intptlat10
  • intptlon10

DEMOGRAPHICS_Blocks1990

  • pop100
  • tractblk
  • globalid

DEMOGRAPHICS_Blocks2000

  • pop_lab
  • fipsstco
  • block_id
  • globalid
  • stfid

DEMOGRAPHICS_Blocks2010

  • name10
  • blockce10
  • tractce10
  • complete
  • awater10
  • aland10
  • statefp10
  • funcstat10
  • countyfp10
  • intptlat10
  • mtfcc10
  • geoid10
  • intptlon10
  • followup
  • comments

DEMOGRAPHICS_Tracts1990

  • globalid
  • tract

DEMOGRAPHICS_Tracts2000

  • fipsstco
  • stfid
  • tractid
  • globalid
  • trt2000

DEMOGRAPHICS_Tracts2010

  • name10
  • awater10
  • tractce10
  • aland10
  • statefp10
  • funcstat10
  • namelsad10
  • countyfp10
  • intptlat10
  • mtfcc10
  • geoid10
  • intptlon10

DPW_LitterBarrels

  • globalid
  • park_distr
  • location
  • owner
  • id
  • man
  • type

DPW_PedestrianRamps

  • ramp_line_
  • f_puddle
  • f_total
  • f_slope
  • globalid
  • f_cond
  • f_total06
  • sw_width
  • f_align
  • lip
  • location
  • slope_comp
  • total_ramp
  • puddling
  • material_t
  • opening_wi
  • condition
  • f_open
  • date
  • contract

DPW_StreetTrees

  • diameter
  • trunks
  • creator
  • growspac
  • location
  • streetnumber
  • species
  • inspectr
  • otherriskfactors
  • riskrating
  • globalid
  • sitetype
  • sidewalkobstruction
  • treeid
  • probabilityoffailure
  • probabilityoftargetimpact
  • overheadwires
  • sizeodefectiveparts
  • memtree
  • created
  • modified
  • def_cult
  • streetname
  • notes
  • ownership
  • formersitetype
  • removaldate
  • treecondition
  • cultivar
  • treegrate
  • plantdate
  • sidewalkwidth
  • treegrateactionreq
  • adacompliant
  • posd
  • dw
  • rt
  • speciesshort
  • d
  • lean
  • structuralissues
  • cr

ELECTIONS_CongressionalDistricts

  • editdate
  • congressman
  • name
  • district
  • party

ELECTIONS_PollingLocations

  • w_p
  • location_note
  • location
  • globalid
  • address
  • editdate

ELECTIONS_StateRepDistricts

  • editdate
  • party
  • rep
  • district

ELECTIONS_StateSenateDistricts

  • senate
  • editdate
  • senator
  • party

ELECTIONS_WardsPrecincts

  • ward
  • totalpop
  • wardprecinct
  • precinct

HEALTH_HealthClinics

  • phone
  • site_name
  • globalid
  • address
  • editdate

HEALTH_Hospitals

  • phone
  • site_name
  • globalid
  • address

HISTORICAL_HistoricDistricts

  • name
  • globalid

HISTORICAL_HistoricalCPA

  • dollaramnt
  • globalid
  • project
  • printedmapid
  • granttype
  • buildingid
  • fiscalyear
  • grantid

HISTORICAL_HistoricalMarkers

  • name
  • neighborhood
  • marker_type
  • globalid
  • address
  • description

HISTORICAL_LandmarksEasements

  • globalid
  • ml
  • property_t
  • historic_n
  • lot
  • address
  • block
  • editdate

HISTORICAL_NationalRegisterHistoricPlaces

  • globalid
  • nrdis
  • location
  • labelname
  • shape_leng
  • histname
  • street
  • aka
  • st_num
  • nrind
  • pres_restr
  • nhl
  • doe

HISTORICAL_NhoodConservationDist

  • name
  • globalid

HYDRO_Floodplains

  • sfha_tf
  • fld_ar_id
  • percentage
  • source_cit
  • fld_zone
  • floodway

HYDRO_WaterBodies

  • type
  • name

HYDRO_Wetlands

  • complants1
  • globalid
  • wtdescript
  • datasource
  • wtsystem
  • namelocati
  • wtsubsyste
  • wtlnd_id
  • complants2

INFRA_Billboards

INFRA_Drainage

  • type
  • name

INFRA_Hydrants

  • globalid
  • subtype
  • contractauthority
  • lasteditdate
  • enabled
  • ancillaryrole
  • hydrant_id
  • lasteditor
  • gate
  • rotation
  • lifecyclestatus
  • hydrant_gpm
  • installdate
  • lastsource

INFRA_OverheadSigns

INFRA_ParkBenches

INFRA_StreetLights2010Flyover

  • elev

INFRA_StreetLightsNSTAR2005

  • point_subt
  • npole_tag
  • symbol
  • lmp_type
  • oldnode_ta
  • ver_end
  • oldpoint_t
  • globalid
  • str_des
  • equip_locn
  • phase__id
  • object__id
  • no_lamp
  • ver_id
  • lock__id
  • str_name
  • group_id
  • pole_type_
  • intersecti
  • oid_
  • lumens
  • feature_st
  • ntag
  • point_tag
  • point_type
  • bracket_cd
  • lmp_class_
  • cntl_ty
  • err_desc
  • err_flag
  • lamp_id

INFRA_UtilityFacilities

  • type
  • site_name
  • globalid
  • address

INFRA_UtilityPoles

  • elev

LANDMARK_DayCareFacilities

  • site_name
  • globalid
  • shel_cap
  • phone
  • publicpriv
  • address
  • owner
  • shelter

LANDMARK_ElderlyFacilities

  • owner
  • publicpriv
  • site_name
  • globalid
  • address

LANDMARK_MemorialPoles

  • source
  • type
  • name
  • globalid
  • pole_id
  • editdate

LANDMARK_MunicipalBuildings

  • type
  • site_name
  • globalid
  • address
  • editdate

LANDMARK_PlacesOfWorship

  • site_name
  • globalid
  • address

LANDMARK_PrivateSchools

  • owner
  • phone
  • site_name
  • globalid
  • address

LANDMARK_PublicLibraries

  • phone
  • site_name
  • globalid
  • address

LANDMARK_PublicSchools

  • site_name
  • globalid
  • shel_cap
  • shelter
  • phone
  • address
  • shel_food

LANDMARK_YouthCenters

  • phone
  • site_name
  • globalid
  • address

PUBLICSAFETY_EmergencyShelters

  • capacity
  • name
  • globalid
  • shel_cap
  • shelter
  • concateid
  • zipcode
  • phone
  • med_fac
  • address
  • shel_food
  • type
  • editdate

PUBLICSAFETY_FireStations

  • phone
  • site_name
  • globalid
  • address

PUBLICSAFETY_PoliceStation

  • phone
  • site_name
  • globalid
  • address

RECREATION_BikeFacilities

  • bike_fac
  • globalid
  • st_type
  • pbike_fac
  • length
  • street
  • comments
  • editdate

RECREATION_CommunityGardens

  • mapid
  • park
  • location
  • globalid
  • editdate

RECREATION_OpenSpace

  • globalid
  • fcode
  • owner
  • oscode
  • rcode
  • join_id
  • name
  • editdate

RECREATION_Playgrounds

  • location
  • globalid

RECREATION_PublicPools

  • owner
  • name

RECREATION_SportsAreas

  • type

RECREATION_Waterplay

  • status
  • join_id
  • park
  • location
  • size

TRAFFIC_CommercialParking

  • globalid
  • ml
  • descriptio
  • accessvia
  • totalsp
  • commspac
  • address
  • id
  • structure
  • owntype

TRAFFIC_MeteredParkingSpaces

  • space_id
  • src_date
  • globalid
  • editdate
  • editor

TRAFFIC_MunicipalParkingLots

  • sitename
  • ml
  • type
  • globalid
  • address

TRAFFIC_PavementMarkings

  • type

TRANS_BusShelters

  • elev_sl
  • direction
  • oid_
  • globalid
  • top_sl
  • top_gl
  • style
  • xls_no
  • x_street
  • street
  • elev_gl
  • address
  • base_elev
  • stop_name
  • stop_id
  • quantity
  • editdate

TRANS_Centerlines

  • potential_r_to
  • street_name
  • tonode
  • label
  • street_id
  • zip_right
  • direction
  • globalid
  • fromnode
  • roadways
  • potential_l_from
  • zip_left
  • l_to
  • r_from
  • street_type
  • l_from
  • id
  • majorroad
  • restriction
  • r_to
  • street
  • potential_r_from
  • potential_l_to
  • alias
  • editdate

TRANS_Intersections

  • nodenumber
  • globalid
  • p_x
  • p_y
  • intersection
  • intersectingstreetcount
  • editdate

TRANS_MajorRoads

  • majorroad
  • street
  • st_name
  • globalid
  • st_type

TRANS_Rail

  • type

TRANS_SidewalkCenterlines

  • type

TRANS_SubwayHeadhouses

  • elev_sl
  • base_elev
  • top_gl
  • elev_gl
  • top_sl

TRANS_SubwayLines

  • grade
  • source
  • line
  • globalid

TRANS_SubwayStations

  • line
  • station
  • globalid

API to access the Cambridge city geospatial data

December 28, 2013

The city of Cambridge, MA made a large set of geospatial data publicly available.

I uploaded the layers to the geospatial web service I have running on EC2. You can now access the Cambridge geospatial data easily using the Python client or the raw API. Here is a tutorial.

REST service + Python client to access geographic data

November 12, 2013

This post shows how we can query geographic data using a simple Python client pointing to a GIS web service: SnowFloat.

The data file used here is the Vermont towns geographic data file, named “Boundary_BNDHASH_region_towns”. We first upload the data file to the web service. The web service is a layer on top of GeoDjango, which itself sits on top of PostGIS.
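
As a sketch, the upload step looks roughly like this; the import_geospatial_data call is the one used in the Massachusetts Census post below, and the exact archive name here is an assumption:

>>> import snowfloat.client
>>> client = snowfloat.client.Client()
>>> # Archive name assumed for illustration.
>>> res = client.import_geospatial_data('Boundary_BNDHASH_region_towns.zip')
>>> layer_uuid = res['layer_uuids'][0]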

Layer

Let’s retrieve the layer object:

>>> layer = client.get_layers(name='Boundary_BNDHASH_region_towns')[0]

Number of features:

>>> layer.num_features
255

Layer’s fields:

>>> layer.fields
[{u'name': u'gdb_vcgi_v', u'type': u'real'},
 {u'name': u'perimeter', u'type': u'real'},
 {u'name': u'towns_', u'type': u'integer'},
 {u'name': u'towns_id', u'type': u'integer'},
 {u'name': u'fips6', u'type': u'integer'},
 {u'name': u'townname', u'size': 18, u'type': u'string'},
 {u'name': u'townnamemc', u'size': 18, u'type': u'string'},
 {u'name': u'cnty', u'type': u'integer'},
 {u'name': u'shape_area', u'type': u'real'},
 {u'name': u'shape_len', u'type': u'real'}]

Layer’s extent:

>>> layer.extent
[424788.84, 25211.8370313, 581562.0, 279799.0]

Features

Towns starting with the letter ‘B’:

>>> features = layer.get_features(field_townnamemc_startswith='B')
>>> len(features)
30
>>> features[0].fields['townnamemc']
Berkshire

Towns less than 10 kms from the town of Burlington:

>>> burlington = layer.get_features(field_townnamemc_exact='Burlington')
>>> features = layer.get_features(query='distance_lte',
...                               geometry=burlington,
...                               distance=10000)
>>> len(features)
12
>>> [feature.fields['townnamemc'] for feature in features]
[u'Milton',
 u'South Hero',
 u'St. George',
 u'Charlotte',
 u'Colchester',
 u'Winooski',
 u'Shelburne',
 u'Hinesburg',
 u'Burlington',
 u'Essex',
 u'South Burlington',
 u'Williston']

Ten furthest towns + their distance to Burlington:

>>> features = layer.get_features(spatial_operation='distance',
...                               spatial_geometry=burlington.geometry,
...                               order_by=('spatial',),
...                               query_slice=(0, 10))
>>> ['%s - %d kms' % (feature.fields['townnamemc'], feature.spatial / 1000) for feature in features]
[u'Vernon - 187 kms',
 u'Guilford - 184 kms',
 u'Halifax - 182 kms',
 u'Whitingham - 180 kms',
 u'Stamford - 179 kms',
 u'Pownal - 178 kms',
 u'Brattleboro - 177 kms',
 u'Readsboro - 177 kms',
 u'Marlboro - 172 kms',
 u'Dummerston - 170 kms']

Towns inside the Robert T. Stafford White Rocks area (forest_feature below comes from a separate layer, not shown here):

>>> features = layer.get_features(query='intersects', geometry=forest_feature.geometry)
>>> [feature.fields['townnamemc'] for feature in features]
[u'Wallingford', u'Mount Holly', u'Mount Tabor', u'Weston', u'Dorset', u'Peru']

As you can see, the goal here is to build a light tool for easy access to geographic data. As I am continuing to work on the web service back-end, I would really appreciate getting feedback from people working in the GIS world.

Massachusetts Census 2010 Towns maps and statistics using Python

September 10, 2013

Using the Massachusetts Census 2010 Towns data source, we generated the following maps and statistics. The software used relies on technologies such as Python and PostGIS.

Note: The data source has been generated and published by MassGIS.

Population change

Population change 2000-2010

We can see that some towns in Cape Cod and Western Massachusetts are not growing as fast as the rest of the state (a few are even losing population). What is driving the increase just west of 495? Is it the expensive real estate closer to Boston?

Population per town

Population in 2010

No surprise that Western Massachusetts has less populated towns than the Greater Boston area.

Most populated towns:

  1. Boston – 617594
  2. Worcester – 181045
  3. Springfield – 153060
  4. Lowell – 106519
  5. Cambridge – 105162

Least populated towns:

  1. Gosnold – 75
  2. Monroe – 121
  3. Mount Washington – 167
  4. New Ashford – 228
  5. Aquinnah – 311

Residents per house unit in 2010

Residents per house unit 2010

Could the low number in Cape Cod be explained by the higher number of secondary residences?

Population density in 2010

Population density 2010

We are looking at the number of inhabitants per acre.

Towns area

Largest towns in acres:

  1. Plymouth – 65726
  2. Middleborough – 46179
  3. Petersham – 43689
  4. Barnstable – 40001
  5. Dartmouth – 39513

Smallest towns in acres:

  1. Nahant – 671
  2. Winthrop – 1295
  3. Chelsea – 1416
  4. Hull – 1804
  5. Swampscott – 1945

Furthest towns from Boston

  1. Mount Washington – 193 kms
  2. Alford – 190 kms
  3. Egremont – 190 kms
  4. West Stockbridge – 185 kms
  5. Sheffield – 182 kms

There are many more distance and spatial computations we could make. Which one do you think would be interesting?

Software

We have been using the SnowFloat Python client and its web service running on Amazon EC2.

First step is to import the data source:

>>> import snowfloat.client
>>> client = snowfloat.client.Client()
>>> data_source = 'CENSUS2010TOWNS_POLY.zip'
>>> res = client.import_geospatial_data(data_source)
# res is a dictionary with the ID of the layer created and the number of features (districts) added.
>>> res
{u'layer_uuids': [u'b2b2660e43ef4196b999840984480338'], u'num_features': 351}
>>> layer_uuid = res['layer_uuids'][0]

Population per town

We define a map task and send it to the server.

>>> task = snowfloat.task.Task(
...     operation='map',
...     task_filter={'layer_uuid_exact': layer_uuid},
...     extras={'size': [800, 600],
...             'title': 'Population 2010',
...             'colors': {'type': 'gradient'},
...             'operation': {'type': 'nop',
...                           'value_1': 'field_pop2010',
...                           'normalize': 'log'}})

>>> res = client.execute_tasks([task,])
# res contains the URL of the map stored online.
>>> res
[[{u'url': u'http://snowfloat.results.s3.amazonaws.com/95391554fb6b4ba6bb1cffaf196f83ae?Signature=BQeDD2Ued1N2JolDwedfzjLAuoA%3D&Expires=1378834201&AWSAccessKeyId=AKIAI4JN4Z5ER2EY3A2Q'}]]

Furthest towns from Boston

First step is to calculate the centroid for the Boston town geometry:

>>> feature = client.get_features(layer_uuid=layer_uuid, field_town2_exact='Boston', spatial_operation='centroid')
>>> point = feature.spatial
>>> point
Point(coordinates=[233705.31415324158, 896147.8312876453])

Now, we can retrieve the five furthest towns from this point.

>>> features = client.get_features(layer_uuid=layer_uuid, spatial_operation='distance', spatial_geometry=point, order_by=('-spatial',), query_slice=(0,5))
>>> ['%s - %d kms' % (feature.fields['town2'], feature.spatial / 1000) for feature in features]
[u'Mount Washington - 193 kms',
 u'Alford - 190 kms',
 u'Egremont - 190 kms',
 u'West Stockbridge - 185 kms',
 u'Sheffield - 182 kms']

You can learn more about the SnowFloat service by reading its tutorial.

Python, Twitter statistics and the 2012 French presidential election

August 29, 2012

This post describes how Pytolab was designed to process Tweets related to the 2012 French presidential election, in real-time. This post also goes over some of the statistics computed over a period of 9 months.

Note: I presented this project at EuroSciPy 2012: abstract.

Architecture
Statistics

Architecture

The posts are received from the Twitter streaming API and sent to a messaging exchange. The posts are read from the messaging queue and processed by the computing unit. Most frequently accessed data is stored in an in-memory DB (Redis) and long term data is stored in MySQL. See diagram below.

Twitter statistics

Tweets receiver

The Twitter streaming API filter feature is used here to receive the tweets we are interested in: Tweets referring to at least one of the candidates. The helper library Tweepy facilitates that task.

The first thing we do is set up a stream listener. We get a listener instance, set the callback to be called when a new post arrives and finally get a stream instance by passing our listener instance to it. We will see next how those different objects are defined.

def setup_stream_listener(self):
    """
    Setup Twitter API streaming listener
    """
    listener = Listener()
    listener.set_callback(self.mq.producer.publish)
    self.stream = tweepy.Stream(
        self.config.get('twitter', 'userid'),
        self.config.get('twitter', 'password'),
        listener,
        timeout=3600
    )

Note: We use ConfigParser for the configuration file management.

The Listener class is derived from the tweepy.StreamListener class. We override some of its methods to indicate what to do when a new post arrives or when an error is detected.

class Listener(tweepy.StreamListener):
    def on_status(self, status):
        # Do things with the post received. Post is the status object.
        ...
    
    def on_error(self, status_code):
        # If error thrown during streaming.
        ...

    def on_timeout(self):
        # If no post received for too long
        ...
        
    def on_limit(self, track):
        # If too many posts match our filter criteria and only a subset is
        # sent to us
        ...

    def on_delete(self, status_id, user_id):
         # When a delete notice arrives for a post.
         ...

    def set_callback(self, callback):
        # Pass callback to call when a new post arrives
        self.callback = callback

We need to add a few lines of code to the on_status method. We parse what we are interested in and publish the data to our messaging queue. We filter out the posts whose author's language is not French. The callback is our messaging queue producer's publish method.

def on_status(self, status):
    if status.author.lang == 'fr':
        message = {'author_name': status.author.screen_name,
                   'author_id': status.author.id,
                   'id': status.id,
                   'text': status.text,
                   'retweeted': status.retweeted,
                   'coordinates': status.coordinates,
                   'time': int(time.time())}
        self.callback(json.dumps(message), 'posts')

We will see later how the messaging queue producer and consumer are built.

There is one more thing we need to do: set up our streaming filter so we start receiving the posts from Twitter we are interested in. We have a list of presidential candidates in the list self.persons. We build a list of names and start listening for them. The call to stream.filter is blocking, and the on_status method of the listener class is called each time a new post arrives.

Keep in mind that the streaming filter returns at most 1% of all posts processed by Twitter. This means that if the posts referring to our candidates represent more than 1% of all posts on Twitter at instant t, the number of posts returned is capped at 1%. We encountered this case only twice: during the first round results and during the second round results. We lost less than 10% of the posts when that happened. How do you make sure this does not happen? You have to subscribe to the complete stream, which is provided by Twitter partners like DataSift and Gnip. Those solutions are not cheap.

Note that we are catching all exceptions. There is no guarantee that you will get continuous streaming with no errors so catching all exceptions is important here.

def stream_filter(self):
    track_list = [data.normalize(p['name']) for p in self.persons]
    while True:
        try:
            self.stream.filter(track=track_list)
        except Exception:
            logging.exception('stream filter')
            time.sleep(10)

Some examples of errors I saw in the past:

File "/usr/local/lib/python2.6/dist-packages/tweepy-1.7.1-py2.6.egg/tweepy/streaming.py", line 148, in _read_loop
    c = resp.read(1)
...
File "/usr/lib/python2.6/httplib.py", line 518, in read
    return self._read_chunked(amt)
File "/usr/lib/python2.6/httplib.py", line 561, in _read_chunked
    raise IncompleteRead(''.join(value))
IncompleteRead: IncompleteRead(0 bytes read)
File "/usr/local/lib/python2.6/dist-packages/tweepy-1.7.1-py2.6.egg/tweepy/streaming.py", line 148, in _read_loop
    c = resp.read(1)
...
File "/usr/lib/python2.6/ssl.py", line 96, in <lambda>
    self.recv = lambda buflen=1024, flags=0: SSLSocket.recv(self, buflen, flags)
File "/usr/lib/python2.6/ssl.py", line 222, in recv
    raise x
SSLError: The read operation timed out
File "/usr/local/lib/python2.6/dist-packages/tweepy-1.7.1-py2.6.egg/tweepy/streaming.py", line 148, in _read_loop
    c = resp.read(1)
...
File "/usr/lib/python2.6/ssl.py", line 136, in read
    return self._sslobj.read(len)
error: [Errno 104] Connection reset by peer

Let’s take a look at what happens when the method stream.filter is called. An HTTPS POST request is made to the following URL: https://stream.twitter.com/1/statuses/filter.json?delimited=length with the following body: track=candidate1,candidate2,… The stream of data is then read in a loop until there is an error.

The data arrives length-delimited: the length of the post data comes first, followed by a newline and then the post data itself, repeated for each post. The length is included because we ask for it with the URL parameter ‘delimited=length’.
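
For illustration only (this is not the Tweepy source), reading one length-delimited post from the response stream looks roughly like this:

def read_post(resp):
    # Read the length field byte by byte until the newline.
    length = ''
    while True:
        c = resp.read(1)
        if c == '\n':
            break
        length += c
    length = length.strip()
    if not length.isdigit():
        return None  # blank keep-alive line
    # Read exactly that many bytes of post data.
    return resp.read(int(length))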

Here is an example of a post content:

{
    "in_reply_to_user_id": null,
    "in_reply_to_status_id": null,
    "text": "bla bla bla",
    "favorited": false,
    ...
}

A more complete example: https://gist.github.com/900964.

The Tweepy library formats that data as a status object and passes it to the on_status method of the listener object.

See the full Tweets receiver module.

Messaging queue

We are using RabbitMQ for our messaging system plus the Python helper library py-amqplib. An exchange is created to receive the posts and a consumer reads the messages from a queue. Those messages are processed by the computing unit. The advantage of using a messaging queue is that we can handle surges of posts.

First is the producer. We create a connection to the messaging server and get a channel on that connection. This channel is used to publish messages to the exchange.

class Producer(object):
    def __init__(self, exchange_name, host, userid, password):
        self.exchange_name = exchange_name
        self.connection = amqp.Connection(
            host=host, userid=userid, password=password, virtual_host="/",
            insist=False)
        self.channel = self.connection.channel()

Our producer class has a publish method to send a message to the exchange. Messages marked as ‘persistent’ that are delivered to ‘durable’ queues are logged to disk. We use the routing key ‘posts’, which is also used when we create the queue so the messages are routed properly.

def publish(self, message, routing_key):
    msg = amqp.Message(message)
    msg.properties["content_type"] = "text/plain"
    msg.properties["delivery_mode"] = 2
    self.channel.basic_publish(exchange=self.exchange_name,
                     routing_key=routing_key,
                     msg=msg)

Next is the consumer. We also get a connection to the messaging server and get a channel on that connection.

class Consumer(object):
    def __init__(self, host, userid, password):
        self.connection = amqp.Connection(host=host, userid=userid, password=password, virtual_host="/", insist=False)
        self.channel = self.connection.channel()

We also have a method creating the queue and another registering the callback to be invoked each time there is a message to consume from the queue. A sketch of both follows.
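
Here is a sketch of those two methods, assuming py-amqplib; the names and defaults are illustrative rather than the actual module.

def create_queue(self, queue_name, exchange_name, routing_key):
    # Durable, non-auto-deleted queue so messages survive a broker restart.
    self.channel.queue_declare(queue=queue_name, durable=True,
        auto_delete=False)
    # Route messages published with this routing key to the queue.
    self.channel.queue_bind(queue=queue_name, exchange=exchange_name,
        routing_key=routing_key)

def register_callback(self, queue_name, callback):
    # callback is called with an amqp.Message for each delivery.
    self.channel.basic_consume(queue=queue_name, callback=callback)
    while True:
        self.channel.wait()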

See the full Messaging queue module.

DB interface

Before we go over the computing unit, let’s look at the DB interface we created to interface with the in-memory DB Redis and MySQL.

Regarding Redis, our interface is built on top of the helper library redis-py. It adds retries around DB commands.

We use the following Redis commands (complexity of the command is shown next to it):

  • GET key – O(1)
  • SET key – O(1)
  • DELETE key – O(1)
  • EXISTS key – O(1)
  • INCR key – O(1)
  • RPUSH key value – O(1)
  • LSET key index value – O(N)
  • LINDEX key index – O(N)
  • LRANGE key start stop – O(S+N)

The key used to store a post is ‘post:<post id>’. We dump the JSON post data as the key’s value. For ease of access, we also have a Redis list per person and per hour with the following key: ‘person:<person id>:posts:<hour>’. This list contains the IDs of the posts referring to this person during that hour.
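
Concretely, storing one post for one person looks roughly like this with redis-py (our module adds retries around these calls; the values below are examples):

import json
import time

import redis

r = redis.Redis()

post = {'id': 123, 'text': 'bla bla bla'}   # example post data
person_id = 1                               # example person ID
hour = int(time.time()) / 3600 * 3600       # start of the current hour

# Get the next post ID.
post_id = r.incr('nextPostId')
# Post data stored as JSON under 'post:<post id>'.
r.set('post:%d' % post_id, json.dumps(post))
# Post ID appended to the per-person, per-hour list
# 'person:<person id>:posts:<hour>'.
r.rpush('person:%d:posts:%d' % (person_id, hour), post_id)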

Regarding MySQL, our interface is built on top of the helper library MySQLdb.

Here is the method to execute a MySQL command. If the command throws an operational error or an internal error, we try to reconnect to the MySQL server. If it throws a SQL error, we retry multiple times before raising a DBError.

def mysql_command(self, cmd, sql, writer, *args):
    retry = 0
    while retry < self.cmd_retries:
        try:
            r = getattr(self.db_cursor, cmd)(sql, args)
            if writer:
                self.db_disk_posts.commit()
                return r
            else:
                return self.db_cursor.fetchall() 
        except (MySQLdb.OperationalError, MySQLdb.InternalError):
            self.log.error('MySQL cmd %s DB error', cmd)
            # reconnect
            self.setup_mysql_loop()
            retry = 0
        except MySQLdb.Error:
            self.log.error('MySQL cmd %s sql %s failed', cmd, sql)
            retry += 1
            if retry <= self.cmd_retries:
                time.sleep(self.cmd_retry_wait)
        except AttributeError:
            self.log.error('MySQL cmd %s does not exist', cmd)
            raise exceptions.DbError()
    raise exceptions.DbError()

We keep smaller and more recent data in Redis. MySQL is used for larger and long-term data.

We added a thin layer on top of the Redis and MySQL commands to make the dual DB setup transparent. When we request some data, it is read from Redis and/or MySQL based on its age or type.
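
Here is a sketch of that layer for reading a post back, assuming a hypothetical redis_get() wrapper (the retrying GET mentioned above) and the mysql_command() method shown earlier; the table and column names are illustrative:

def get_post(self, post_id):
    # Recent posts live in Redis under 'post:<post id>'.
    value = self.redis_get('post:%d' % post_id)  # hypothetical wrapper
    if value is not None:
        return json.loads(value)
    # Older posts have been moved to MySQL (illustrative table/column names).
    rows = self.mysql_command('execute',
        'select data from posts where post_id = %s', False, post_id)
    return json.loads(rows[0][0]) if rows else None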

Twitter statistics

See the full DB module.

Computing unit

We defined a method called when there is a message to read from the queue. When a post is received, we process it the following way:

  • Filter out posts marked as French (‘fr’) but containing common English words. In most cases, such a post is actually fully written in English and we need to bypass it.
  • For each person, check if this post is really about that person and not something unrelated.
  • Add post ID to the person’s hourly posts list.
  • Store post data in DB.
def process_post(self, post):
    """
    Process post received from the message queue.
    """
    # is this a post matching one or more persons?
    post_add = False
    # remove accents and lowercase everything
    text = data.normalize(post['text']).lower()
    ...
    # check post language
    if data.get_text_language(text) == 'fr':
        for person in self.persons:
            # get person's first name, last name and nickname
            names = data.get_names(person)
            # check if the post is really about that person
            if data.check_names(names, text, person['words']) == 1:
                # one more post for this person
                if not post_add:
                    post_add = True
                    # get next post id
                    post_id = self.db.incr('nextPostId')
                # add post to person's posts list
                key = 'person:%d:posts:%d' % (person['id'],
                        self.stats_last_update)
                self.db.rpush(key, post_id)
                ...
        if post_add:
            # add post to db
            self.db.set_post(int(post_id),
                json.dumps(post))
            # add post id to current hour
            key = 'posts:%d' % (self.stats_last_update)
            self.db.rpush(key, post_id)
    else:
        logging.debug('found english word in %s', text)

Filtering out unrelated messages is key here. For example, “Je vais en Hollande demain” (“I am going to Holland tomorrow”) is not really about the presidential candidate Hollande but about the country Holland; both are spelled the same way in French. We defined a list of words and rules per person to help filter out the unrelated posts.
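
To illustrate the kind of rule involved, here is a toy example; it is not the actual data.check_names() implementation, and the phrase list is made up for the example.

# Toy example only: not the actual rules used by Pytolab.
def is_about_hollande(text):
    # 'hollande' alone is ambiguous (the candidate vs. the country),
    # so accept the full name and reject country-style phrases.
    text = text.lower()
    if 'hollande' not in text:
        return False
    if 'francois hollande' in text:
        return True
    country_phrases = ('en hollande', 'la hollande')
    return not any(phrase in text for phrase in country_phrases)

print is_about_hollande('je vais en hollande demain')        # False
print is_about_hollande('francois hollande est en meeting')  # True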

See the full compute module.

High availability

Each element above can be made highly available with the use of an extra server. We can add one more server receiving the tweets in case the active one fails. We can detect this type of failure using a heartbeat between the active and the stand-by instance. RabbitMQ supports mirrored queues. Redis and MySQL support a master/slave architecture.

Twitter statistics

Performance

During peak traffic (the first round results for example), the bottleneck in our system was the Twitter streaming listener. The code reads the length of the post data byte by byte from the stream and then reads the post data using that length. This is quite CPU intensive and we had to switch from a small instance (1 computing unit) on Amazon EC2 to a large one (4 computing units) to keep reading the posts in real time during traffic peaks.

The messaging system we used (RabbitMQ) can handle way more than what we used it for so no issue on that side.

Here is some comparison between Redis and MySQL when it comes to storing data on a small EC2 instance.

Method:

- MySQL: insert into table for each value, final commit.
- Redis: SET command for each value. Redis persists changes to disk in the background.

Adding 10k posts:

- MySQL: 4.0 seconds.
- Redis: 2.6 seconds – 1.53x faster.

Adding 100k posts:

- MySQL: 42.0 seconds.
- Redis: 23.7 seconds – 1.77x faster.
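
Here is a sketch of the benchmark method described above; the table, column and key names are illustrative, not the actual benchmark script.

# Illustrative benchmark sketch.
import json
import time

import MySQLdb
import redis

posts = [{'id': i, 'text': 'bla bla bla'} for i in range(10000)]

# MySQL: one insert per post, a single commit at the end.
db = MySQLdb.connect(host='localhost', user='user', passwd='password',
                     db='posts')
cursor = db.cursor()
start = time.time()
for post in posts:
    cursor.execute('insert into posts (post_id, data) values (%s, %s)',
                   (post['id'], json.dumps(post)))
db.commit()
print 'MySQL: %.1f seconds' % (time.time() - start)

# Redis: one SET per post, persistence to disk happens in the background.
r = redis.Redis()
start = time.time()
for post in posts:
    r.set('post:%d' % post['id'], json.dumps(post))
print 'Redis: %.1f seconds' % (time.time() - start)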

Statistics

Over 8 million tweets (8442728) related to the different candidates were analyzed by Pytolab from September 1st 2011 to June 1st 2012. Posts referring to at least one candidate were analyzed; this is different from the posts written by the candidates themselves.

Here are some key dates from the presidential campaign:

  • 1st round of the Socialist primaries: October 9th 2011
  • 2nd round of the Socialist primaries: October 16th 2011
  • 1st round of the presidential election: April 22nd 2012
  • 2nd round of the presidential election: May 6th 2012

The following chart represents the number of posts per day for each candidate. The key dates described above are shown in red.

Twitter statistics

Here is the list of candidates we tracked:

List of candidates:

  • Nathalie Arthaud
  • Martine Aubry
  • Jean-Michel Baylet
  • François Bayrou
  • Christine Boutin
  • Nicolas Dupont Aignan
  • François Hollande
  • Nicolas Hulot
  • Eva Joly
  • Marine Le Pen
  • Jean-Luc Mélenchon
  • Arnaud Montebourg
  • Philippe Poutou
  • Ségolène Royal
  • Nicolas Sarkozy
  • Manuel Valls
  • Dominique Villepin

Here are the number of posts where each candidate’s name appears.

Twitter statistics

We noticed that Nicolas Sarkozy is referred to in 41% of all posts we analyzed, and François Hollande in 35%. There is no strong correlation between the number of posts per candidate and their polling results. The candidate with the most posts was the president at that time, so those numbers are expected.

Posts count              Polls
1. Nicolas Sarkozy       1. François Hollande
2. François Hollande     2. Nicolas Sarkozy
3. François Bayrou       3. Marine Le Pen
4. Marine Le Pen         4. Jean-Luc Mélenchon
5. Jean-Luc Mélenchon    5. François Bayrou

We noticed something interesting: the ranking by number of posts matched the polls during the 11 hours preceding the first round results and during the 6 hours preceding the second round results.

Twitter statistics
Twitter statistics

Let’s look at the authors of the posts now. We counted 388628 different authors. 98.3% of authors posted fewer than 200 posts during those 9 months, which is less than 1 post per day. 0.7% of authors (2720) posted more than 500 posts each, accounting for 45% of all posts.

The top 10 authors in number of posts are:

  • sarkoactu: 26356
  • bayrouactu: 26345
  • Elysee_2012: 21076
  • sarkozy_info: 18868
  • FlashPresse: 16349
  • Scrutin2012: 16229
  • Eric_vds: 14667
  • democrates: 14528
  • akemoi: 14403
  • blabalade: 14119

Here is the distribution of posts per hour for all posts over our period of 9 months:

Twitter statistics

If we look at the number of posts from “sarkoactu”, it is about 96 posts per day. Looking at the distribution of that author's posts per hour, we notice that it is probably an automatic feed.

Twitter statistics

Looking at the full list of authors and their posts distribution per hour, we found that 26 authors are probably automatic feeds; they represent 215783 posts, which is 2.5% of all posts.

Location metadata is attached to only 0.5% of all posts, which in our case represents 40799 posts. There is not much difference between candidates with regard to post location. We do notice that the posts are issued mainly from French-speaking countries: France, Belgium, Canada, Algeria, Tunisia… This makes sense as we analyzed posts written in French.

Twitter statistics

This Europe map shows that the event was mainly followed in France and only a little in the rest of Europe. The fact that we tracked posts written in French contributes to this result.

Twitter statistics

Next, we looked at which other candidates an author talks about when their most-mentioned candidate is person A. Below, you can see that if an author's most-mentioned candidate is Nathalie Arthaud, then that author also talks about François Hollande, Nicolas Sarkozy and Marine Le Pen.

In 11 of 17 cases, the most-mentioned other candidate is Nicolas Sarkozy. Reciprocity is not a rule: when an author talks about Nicolas Hulot, they also talk about Eva Joly (second most), but the opposite is not true.

  • Nathalie Arthaud
    • François Hollande – 19.2%
    • Nicolas Sarkozy – 18.8%
    • Marine Le Pen – 11.1%
    • Philippe Poutou – 11.1%
    • Eva Joly – 9.2%
  • Martine Aubry
    • François Hollande – 31.4%
    • Nicolas Sarkozy – 19.5%
    • Ségolène Royal – 9.2%
    • Arnaud Montebourg – 8.6%
    • Marine Le Pen – 7.8%
  • Jean-Michel Baylet
    • François Hollande – 21.9%
    • Nicolas Sarkozy – 19.7%
    • Marine Le Pen – 10.1%
    • Arnaud Montebourg – 7.9%
    • Eva Joly – 7.5%
  • François Bayrou
    • François Hollande – 24.2%
    • Nicolas Sarkozy – 23.9%
    • Marine Le Pen – 10.2%
    • Jean-Luc Mélenchon – 8.2%
    • Eva Joly – 6.6%
  • Christine Boutin
    • Nicolas Sarkozy – 31.5%
    • François Hollande – 24.8%
    • Marine Le Pen – 11.2%
    • Ségolène Royal – 6.7%
    • Eva Joly – 4.9%
  • Nicolas Dupont Aignan
    • Nicolas Sarkozy – 24.1%
    • François Hollande – 23.1%
    • Marine Le Pen – 14.9%
    • Jean-Luc Mélenchon – 8.6%
    • Eva Joly – 8.3%
  • François Hollande
    • Nicolas Sarkozy – 32.8%
    • Marine Le Pen – 13.8%
    • Jean-Luc Mélenchon – 7.6%
    • François Bayrou – 7.5%
    • Eva Joly – 6.9%
  • Nicolas Hulot
    • Nicolas Sarkozy – 31.6%
    • Eva Joly – 18.3%
    • François Hollande – 10.7%
    • Marine Le Pen – 10.2%
    • Ségolène Royal – 8.9%
  • Eva Joly
    • Nicolas Sarkozy – 27.4%
    • Marine Le Pen – 15.2%
    • François Hollande – 14.5%
    • Jean-Luc Mélenchon – 7.6%
    • François Bayrou – 5.8%
  • Marine Le Pen
    • Nicolas Sarkozy – 33.6%
    • François Hollande – 19.8%
    • Jean-Luc Mélenchon – 14.0%
    • Eva Joly – 6.1%
    • Ségolène Royal – 5.5%
  • Jean-Luc Mélenchon
    • Nicolas Sarkozy – 23.3%
    • François Hollande – 15.9%
    • Marine Le Pen – 14.7%
    • François Bayrou – 8.8%
    • Eva Joly – 6.4%
  • Arnaud Montebourg
    • Nicolas Sarkozy – 25.2%
    • François Hollande – 15.8%
    • Ségolène Royal – 8.1%
    • Martine Aubry – 7.9%
    • Manuel Valls – 6.5%
  • Philippe Poutou
    • Nicolas Sarkozy – 32.0%
    • François Hollande – 20.6%
    • Marine Le Pen – 12.1%
    • Eva Joly – 6.8%
    • Ségolène Royal – 6.4%
  • Ségolène Royal
    • Nicolas Sarkozy – 32.4%
    • François Hollande – 19.2%
    • Marine Le Pen – 9.4%
    • Martine Aubry – 6.0%
    • Eva Joly – 5.2%
  • Nicolas Sarkozy
    • François Hollande – 21.4%
    • Marine Le Pen – 13.8%
    • François Bayrou – 9.2%
    • Jean-Luc Mélenchon – 8.8%
    • Eva Joly – 7.9%
  • Manuel Valls
    • François Hollande – 24.4%
    • Nicolas Sarkozy – 19.1%
    • Martine Aubry – 8.8%
    • Arnaud Montebourg – 7.9%
    • Marine Le Pen – 7.6%
  • Dominique Villepin
    • Nicolas Sarkozy – 18.9%
    • François Hollande – 16.4%
    • François Bayrou – 11.3%
    • Marine Le Pen – 9.7%
    • Eva Joly – 6.6%

The following graph shows connections between candidates based on the number of identical words used in posts referring to them. The wider the edge, the more words in common. An obvious link is Hollande – Sarkozy. Bayrou, being in the center politically, has two strong links: one with Sarkozy and one with Hollande. This is also expected. We notice another strong link between Le Pen and Sarkozy; it makes sense based on some of the subjects discussed by both candidates.

Twitter statistics

Next is a similar chart but based on posts referring to multiple candidates. What is interesting here are the links going across political boundaries. Mélenchon has two major links: one with a candidate on the left and one with a candidate on the right. Joly has two links with candidates on the left and one with a candidate on the right. It makes sense when you know that Joly is more to the left on the political spectrum.

Twitter statistics

Let’s look at the major events for each candidate. As we are tracking the number of posts for each candidate, we can find out the events and what those were about by looking at the most used words in the posts.

The next chart shows that about 22500 posts talked about François Bayrou on March 9th. Looking at the most used words, we can see that the candidate was participating in a TV show called “Des paroles et des actes”, also abbreviated “dpda”. “soir” means evening, and the TV show takes place in the evening.

Twitter statistics

See the Events section in the annexes for the complete list of events for each candidate.

Next is a bar chart showing the number of authors mainly talking about one candidate (80% or more of their posts relate to that candidate). We notice a strong online presence of authors mainly talking about François Hollande. We notice two other strong presences online: Marine Le Pen and Ségolène Royal.

Twitter statistics

Annexes

Events

Twitter statistics (one event chart per candidate)

Twitter sentiment analysis using Python and NLTK

January 2, 2012

This post describes the implementation of sentiment analysis of tweets using Python and the natural language toolkit NLTK. The post also describes the internals of NLTK related to this implementation.

Background

The purpose of the implementation is to be able to automatically classify a tweet as a positive or negative tweet sentiment wise.

The classifier needs to be trained and to do that, we need a list of manually classified tweets. Let’s start with 5 positive tweets and 5 negative tweets.

Positive tweets:

  • I love this car.
  • This view is amazing.
  • I feel great this morning.
  • I am so excited about the concert.
  • He is my best friend.

Negative tweets:

  • I do not like this car.
  • This view is horrible.
  • I feel tired this morning.
  • I am not looking forward to the concert.
  • He is my enemy.

In the full implementation, I use about 600 positive tweets and 600 negative tweets to train the classifier. I store those tweets in a Redis DB. Even with those numbers, it is quite a small sample and you should use a much larger set if you want good results.

Next is a test set so we can assess the accuracy of the trained classifier.

Test tweets:

  • I feel happy this morning. positive.
  • Larry is my friend. positive.
  • I do not like that man. negative.
  • My house is not great. negative.
  • Your song is annoying. negative.

Implementation

The following list contains the positive tweets:

pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

The following list contains the negative tweets:

neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

We take both of those lists and create a single list of tuples, each containing two elements: the first element is a list of the words and the second element is the sentiment. We get rid of words shorter than 3 characters and we lowercase everything.

tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3] 
    tweets.append((words_filtered, sentiment))

The list of tweets now looks like this:

tweets = [
    (['love', 'this', 'car'], 'positive'),
    (['this', 'view', 'amazing'], 'positive'),
    (['feel', 'great', 'this', 'morning'], 'positive'),
    (['excited', 'about', 'the', 'concert'], 'positive'),
    (['best', 'friend'], 'positive'),
    (['not', 'like', 'this', 'car'], 'negative'),
    (['this', 'view', 'horrible'], 'negative'),
    (['feel', 'tired', 'this', 'morning'], 'negative'),
    (['not', 'looking', 'forward', 'the', 'concert'], 'negative'),
    (['enemy'], 'negative')]

Finally, the list with the test tweets:

test_tweets = [
    (['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]

Classifier

The list of word features needs to be extracted from the tweets. It is a list of every distinct word ordered by frequency of appearance. We use the following function, plus two helper functions, to get the list.

import nltk

def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    # Frequency distribution of all the words, most frequent first.
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = get_word_features(get_words_in_tweets(tweets))

If we take a peek inside the function get_word_features, the variable ‘wordlist’ contains:

<FreqDist:
    'this': 6,
    'car': 2,
    'concert': 2,
    'feel': 2,
    'morning': 2,
    'not': 2,
    'the': 2,
    'view': 2,
    'about': 1,
    'amazing': 1,
    ...
>

We end up with the following list of word features:

word_features = [
    'this',
    'car',
    'concert',
    'feel',
    'morning',
    'not',
    'the',
    'view',
    'about',
    'amazing',
    ...
]

As you can see, ‘this’ is the most used word in our tweets, followed by ‘car’, followed by ‘concert’…

To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor. The one we are going to use returns a dictionary indicating what words are contained in the input passed. Here, the input is the tweet. We use the word features list defined above along with the input to create the dictionary.

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

As an example, let’s call the feature extractor with the document ['love', 'this', 'car'] which is the first positive tweet. We obtain the following dictionary which indicates that the document contains the words: ‘love’, ‘this’ and ‘car’.

{'contains(not)': False,
 'contains(view)': False,
 'contains(best)': False,
 'contains(excited)': False,
 'contains(morning)': False,
 'contains(about)': False,
 'contains(horrible)': False,
 'contains(like)': False,
 'contains(this)': True,
 'contains(friend)': False,
 'contains(concert)': False,
 'contains(feel)': False,
 'contains(love)': True,
 'contains(looking)': False,
 'contains(tired)': False,
 'contains(forward)': False,
 'contains(car)': True,
 'contains(the)': False,
 'contains(amazing)': False,
 'contains(enemy)': False,
 'contains(great)': False}

With our feature extractor, we can build the training set by applying the extractor to each tweet using the method apply_features. We pass the feature extractor along with the tweets list defined above.

training_set = nltk.classify.apply_features(extract_features, tweets)

The variable ‘training_set’ contains the labeled feature sets. It is a list of tuples, each containing the feature dictionary and the sentiment string for a tweet. The sentiment string is also called the ‘label’.

[({'contains(not)': False,
   ...
   'contains(this)': True,
   ...
   'contains(love)': True,
   ...
   'contains(car)': True,
   ...
   'contains(great)': False},
  'positive'),
 ({'contains(not)': False,
   'contains(view)': True,
   ...
   'contains(this)': True,
   ...
   'contains(amazing)': True,
   ...
   'contains(enemy)': False,
   'contains(great)': False},
  'positive'),
  ...]

Now that we have our training set, we can train our classifier.

classifier = nltk.NaiveBayesClassifier.train(training_set)

Here is a summary of what we just saw:

Twitter sentiment analysis with Python and NLTK

The Naive Bayes classifier uses the prior probability of each label which is the frequency of each label in the training set, and the contribution from each feature. In our case, the frequency of each label is the same for ‘positive’ and ‘negative’. The word ‘amazing’ appears in 1 of 5 of the positive tweets and none of the negative tweets. This means that the likelihood of the ‘positive’ label will be multiplied by 0.2 when this word is seen as part of the input.
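
To see how the prior and those per-word contributions combine, we can ask the classifier for its full probability distribution over the labels using prob_classify, which NLTK’s NaiveBayesClassifier provides (the exact numbers depend on the smoothing used, so treat this as a way to poke around rather than exact output):

prob_dist = classifier.prob_classify(extract_features(['love', 'this', 'car']))
for label in prob_dist.samples():
    print label, prob_dist.prob(label)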

Let’s take a look inside the classifier train method in the source code of the NLTK library. ‘label_probdist’ is the prior probability of each label and ‘feature_probdist’ is the feature/value probability dictionary. Those two probability objects are used to create the classifier.

def train(labeled_featuresets, estimator=ELEProbDist):
    ...
    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)
    ...
    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    ...
    return NaiveBayesClassifier(label_probdist, feature_probdist)

In our case, the probability of each label is 0.5 as we can see below. label_probdist is of type ELEProbDist.

print label_probdist.prob('positive')
0.5
print label_probdist.prob('negative')
0.5

The feature/value probability dictionary associates an expected likelihood estimate with each feature and label. We can see below that the probability of the feature ‘contains(best)’ being True, given the label ‘negative’, is about 0.077.

print feature_probdist
{('negative', 'contains(view)'): <ELEProbDist based on 5 samples>,
 ('positive', 'contains(excited)'): <ELEProbDist based on 5 samples>,
 ('negative', 'contains(best)'): <ELEProbDist based on 5 samples>, ...}
print feature_probdist[('negative', 'contains(best)')].prob(True)
0.076923076923076927

We can display the most informative features for our classifier using the method show_most_informative_features. Here, we see that if the input does not contain the word ‘not’, the positive to negative ratio is 1.6 to 1.

print classifier.show_most_informative_features(32)
Most Informative Features
           contains(not) = False          positi : negati =      1.6 : 1.0
         contains(tired) = False          positi : negati =      1.2 : 1.0
       contains(excited) = False          negati : positi =      1.2 : 1.0
         contains(great) = False          negati : positi =      1.2 : 1.0
       contains(looking) = False          positi : negati =      1.2 : 1.0
          contains(like) = False          positi : negati =      1.2 : 1.0
          contains(love) = False          negati : positi =      1.2 : 1.0
       contains(amazing) = False          negati : positi =      1.2 : 1.0
         contains(enemy) = False          positi : negati =      1.2 : 1.0
         contains(about) = False          negati : positi =      1.2 : 1.0
          contains(best) = False          negati : positi =      1.2 : 1.0
       contains(forward) = False          positi : negati =      1.2 : 1.0
        contains(friend) = False          negati : positi =      1.2 : 1.0
      contains(horrible) = False          positi : negati =      1.2 : 1.0
...

Classify

Now that we have our classifier initialized, we can try to classify a tweet and see what the sentiment type output is. Our classifier is able to detect that this tweet has a positive sentiment because of the word ‘friend’, which is associated with the positive tweet ‘He is my best friend’.

tweet = 'Larry is my friend'
print classifier.classify(extract_features(tweet.split()))
positive

Let’s take a look at how the classify method works internally in the NLTK library. What we pass to the classify method is the feature set of the tweet we want to analyze. The feature set dictionary indicates that the tweet contains the word ‘friend’.

print extract_features(tweet.split())
{'contains(not)': False,
 'contains(view)': False,
 'contains(best)': False,
 'contains(excited)': False,
 'contains(morning)': False,
 'contains(about)': False,
 'contains(horrible)': False,
 'contains(like)': False,
 'contains(this)': False,
 'contains(friend)': True,
 'contains(concert)': False,
 'contains(feel)': False,
 'contains(love)': False,
 'contains(looking)': False,
 'contains(tired)': False,
 'contains(forward)': False,
 'contains(car)': False,
 'contains(the)': False,
 'contains(amazing)': False,
 'contains(enemy)': False,
 'contains(great)': False}
def classify(self, featureset):
    # Discard any feature names that we've never seen before.
    # Find the log probability of each label, given the features.
    # Then add in the log probability of features given labels.
    # Generate a probability distribution dictionary using the dict logprob
    # Return the sample with the greatest probability from the probability
    # distribution dictionary

Let’s go through that method using our example. The parameter passed to the method classify is the feature set dictionary we saw above. The first step is to discard any feature names that are not known by the classifier. This step does nothing in our case so the feature set stays the same.

The next step is to find the log probability for each label. The probability of each label (‘positive’ and ‘negative’) is 0.5. The log probability is the log base 2 of that, which is -1. We end up with logprob containing the following:

{'positive': -1.0, 'negative': -1.0}

The log probability of features given labels is then added to logprob. This means that for each label, we go through the items in the feature set and we add the log probability of each item to logprob[label]. For example, we have the feature name ‘friend’ and the feature value True. Its log probability for the label ‘positive’ in our classifier is -2.12. This value is added to logprob['positive']. We end up with the following logprob dictionary.

{'positive': -5.4785441837188511, 'negative': -14.784261334886439}

The probability distribution dictionary of type DictionaryProbDist is generated:

DictionaryProbDist(logprob, normalize=True, log=True)

The label with the greatest probability is returned, which is ‘positive’. Our classifier finds that this tweet has a positive sentiment based on the training we did.

Another example is the tweet ‘My house is not great’. The word ‘great’ weighs more on the positive side, but the word ‘not’ is part of two negative tweets in our training set, so the output from the classifier is ‘negative’. Of course, the tweet ‘The movie is not bad’ would also return ‘negative’ even though it is positive. Again, a large and well-chosen sample will help with the accuracy of the classifier.

Take the following test tweet: ‘Your song is annoying’. The classifier thinks it is positive. The reason is that we don’t have any information on the feature name ‘annoying’. The larger the training sample of tweets, the better the classifier will be.

tweet = 'Your song is annoying'
print classifier.classify(extract_features(tweet.split()))
positive

There is an accuracy method we can use to check the quality of our classifier by using our test tweets. We get 0.8 in our case which is high because we picked our test tweets for this article. The key is to have a very large number of manually classified positive and negative tweets.
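
For example, using the apply_features helper on our test_tweets list and NLTK’s accuracy helper (a minimal sketch, assuming the training code from earlier in this article is in scope):

test_set = nltk.classify.apply_features(extract_features, test_tweets)
print nltk.classify.util.accuracy(classifier, test_set)
0.8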

Voilà. Don’t hesitate to post a comment if you have any feedback.

Python dictionary implementation

August 29, 2011

This post describes how dictionaries are implemented in the Python language.

Dictionaries are indexed by keys and they can be seen as associative arrays. Let’s add 3 key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

The values can be accessed this way:

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

The key ‘d’ does not exist so a KeyError exception is raised.

Hash tables

Python dictionaries are implemented using hash tables. A hash table is an array whose indexes are obtained using a hash function on the keys. The goal of a hash function is to distribute the keys evenly in the array. A good hash function minimizes the number of collisions, i.e. different keys having the same hash. Python does not use this kind of hash function: its most important hash functions (for strings and ints) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

We are going to assume that we are using strings as keys for the rest of this post. The hash function for strings in Python is defined as:

arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

If you run hash(‘a’) in Python, it will execute string_hash() and return 12416037344. Here we assume we are using a 64-bit machine.

If an array of size x is used to store the key/value pairs then we use a mask equal to x-1 to calculate the slot index of the pair in the array. This makes the computation of the slot index fast. The probability to find an empty slot is high due to the resizing mechanism described below. This means that having a simple computation makes sense in most of the cases. If the size of the array is 8, the index for ‘a’ will be: hash(‘a’) & 7 = 0. The index for ‘b’ is 3, the index for ‘c’ is 2, the index for ‘z’ is 3 which is the same as ‘b’, here we have a collision.
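
As a quick illustration, the same slot index computation can be done from Python. The hash values match the ones used in this article only on a 64-bit CPython 2 build, so the output below is specific to that setup:

mask = 8 - 1                      # table size 8 -> mask 7
for key in ('a', 'b', 'c', 'z'):
    print key, hash(key) & mask
# expected on this build: a 0, b 3, c 2, z 3  ('z' collides with 'b')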

hash table

We can see that the Python hash function does a good job when the keys are consecutive which is good because it is quite common to have this type of data to work with. However, once we add the key ‘z’, there is a collision because it is not consecutive enough.

We could use a linked list to store the pairs having the same hash but it would increase the lookup time, i.e. it would not be O(1) anymore. The next section describes the collision resolution method used in the case of Python dictionaries.

Open addressing

Open addressing is a method of collision resolution where probing is used. In case of ‘z’, the slot index 3 is already used in the array so we need to probe for a different index to find one which is not already used. Adding a key/value pair might take more time because of the probing but the lookup will be O(1) and this is the desired behavior.

A probing sequence based on the following recurrence is used to find a free slot. The code is the following:

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;

Recurring on 5*j+1 quickly magnifies small differences in the bits that didn’t affect the initial index. The variable “perturb” gets the other bits of the hash code into play.

Just out of curiosity, let’s look at the probing sequence when the table size is 32 and j = 3.
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

You can read more about this probing sequence by looking at the source code of dictobject.c. A detailed explanation of the probing mechanism can be found at the top of the file.

open addressing

Now, let’s look at the Python internal code along with an example.

Dictionary C structures

The following C structure is used to store a dictionary entry: key/value pair. The hash, key and value are stored. PyObject is the base class of the Python objects.

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

The following structure represents a dictionary. ma_fill is the number of used slots + dummy slots. A slot is marked dummy when a key pair is removed. ma_used is the number of used slots (active). ma_mask is equal to the array’s size minus 1 and is used to calculate the slot index. ma_table is the array and ma_smalltable is the initial array of size 8.

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Dictionary initialization

When you first create a dictionary, the function PyDict_New() is called. I removed some of the lines and converted the C code to pseudocode to concentrate on the key concepts.

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_mask) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object

Adding items

When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair. It checks if the key is a string and calculates the hash or reuses the one cached if it exists. insertdict() is called to add the new key/value pair and the dictionary is resized if the number of used slots + dummy slots is greater than 2/3 of the array’s size.
Why 2/3? It is to make sure the probing sequence can find a free slot fast enough. We will look at the resizing function later.

arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table

insertdict() uses the lookup function lookdict_string() to find a free slot. This is the same function used to find a key. lookdict_string() calculates the slot index using the hash and the mask values. If it cannot find the key at slot index = hash & mask, it starts probing, using the loop described above, until it finds a free slot. If a dummy slot was encountered during the probing, that dummy slot is returned instead, which gives priority to reusing previously deleted slots.

We want to add the following key/value pairs: {‘a’: 1, ‘b’: 2, ‘z’: 26, ‘y’: 25, ‘c’: 3, ‘x’: 24}. This is what happens:

A dictionary structure is allocated with internal table size of 8.

  • PyDict_SetItem: key = ‘a’, value = 1
    • hash = hash(‘a’) = 12416037344
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12416037344 & 7 = 0
        • slot 0 is not used so return it
      • init entry at index 0 with key, value and hash
      • ma_used = 1, ma_fill = 1
  • PyDict_SetItem: key = ‘b’, value = 2
    • hash = hash(‘b’) = 12544037731
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12544037731 & 7 = 3
        • slot 3 is not used so return it
      • init entry at index 3 with key, value and hash
      • ma_used = 2, ma_fill = 2
  • PyDict_SetItem: key = ‘z’, value = 26
    • hash = hash(‘z’) = 15616046971
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15616046971 & 7 = 3
        • slot 3 is used so probe for a different slot: 5 is free
      • init entry at index 5 with key, value and hash
      • ma_used = 3, ma_fill = 3
  • PyDict_SetItem: key = ‘y’, value = 25
    • hash = hash(‘y’) = 15488046584
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15488046584 & 7 = 0
        • slot 0 is used so probe for a different slot: 1 is free
      • init entry at index 1 with key, value and hash
      • ma_used = 4, ma_fill = 4
  • PyDict_SetItem: key = ‘c’, value = 3
    • hash = hash(‘c’) = 12672038114
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 12672038114 & 7 = 2
        • slot 2 is free so return it
      • init entry at index 2 with key, value and hash
      • ma_used = 5, ma_fill = 5
  • PyDict_SetItem: key = ‘x’, value = 24
    • hash = hash(‘x’) = 15360046201
    • insertdict
      • lookdict_string
        • slot index = hash & mask = 15360046201 & 7 = 1
        • slot 1 is used so probe for a different slot: 7 is free
      • init entry at index 7 with key, value and hash
      • ma_used = 6, ma_fill = 6

This is what we have so far:

python dictionary insert

6 slots out of 8 are now used, so we are over 2/3 of the array’s capacity. dictresize() is called to allocate a larger array. This function also takes care of copying the old table entries to the new table.

dictresize() is called with minused = 24 in our case which is 4 * ma_used. 2 * ma_used is used when the number of used slots is very large (greater than 50000). Why 4 times the number of used slots? It reduces the number of resize steps and it increases sparseness.

The new table size needs to be greater than 24 and it is calculated by shifting the current size 1 bit left until it is greater than 24. It ends up being 32, i.e. 8 -> 16 -> 32.
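
Here is a small sketch of that size computation, modeled on dictresize() in the CPython 2.x source (the names are illustrative, not the exact C code):

PyDict_MINSIZE = 8

def new_table_size(minused):
    newsize = PyDict_MINSIZE
    while newsize <= minused:     # shift left until the size is large enough
        newsize <<= 1
    return newsize

print new_table_size(24)          # 32, i.e. 8 -> 16 -> 32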

This is what happens with our table during resizing: a new table of size 32 is allocated. Old table entries are inserted into the new table using the new mask value which is 31. We end up with the following:

python dictionary table resizing

Removing items

PyDict_DelItem() is called to remove an entry. The hash for this key is calculated and the lookup function is called to return the entry. The slot is now a dummy slot.

We want to remove the key ‘c’ from our dictionary. We end up with the following array:

Python dictionary delete key

Note that the delete item operation doesn’t trigger an array resize, even if the number of used slots is much less than the total number of slots. However, when a key/value pair is added next, the need to resize is based on the number of used slots + dummy slots, so the resize can shrink the array too.

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback. If you need help with a project written in Python or with building a new web service, I am available as a freelancer: LinkedIn profile. Follow me on Twitter @laurentluce.

Python string objects implementation

June 19, 2011

This article describes how string objects are managed by Python internally and how string search is done.

PyStringObject structure
New string object
Sharing string objects
String search

PyStringObject structure

A string object in Python is represented internally by the structure PyStringObject. “ob_shash” is the hash of the string if it has been calculated. “ob_sval” contains the string of size “ob_size”. The string is null terminated. The initial size of “ob_sval” is 1 byte and ob_sval[0] = 0. If you are wondering where “ob_size” is defined, take a look at PyObject_VAR_HEAD in object.h. “ob_sstate” indicates if the string object is in the interned dictionary, which we are going to see later.

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;

New string object

What happens when you assign a new string to a variable like this one?

>>> s1 = 'abc'

The internal C function “PyString_FromString” is called and the pseudo code looks like this:

arguments: string object: 'abc'
returns: Python string object with ob_sval = 'abc'
PyString_FromString(string):
    size = length of string
    allocate string object + size for 'abc'. ob_sval will be of size: size + 1
    copy string to ob_sval
    return object

Each time a new string is used, a new string object is allocated.

Sharing string objects

There is a neat feature where small strings are shared between variables. This reduces the amount of memory used. Small strings are strings of size 0 or 1 byte. The global variable “interned” is a dictionary referencing those small strings. The array “characters” is also used to reference the strings of length 1 byte: i.e. single characters. We will see later how the array “characters” is used.

static PyStringObject *characters[UCHAR_MAX + 1];
static PyObject *interned;

Let’s see what happens when a new small string is assigned to a variable in your Python script.

>>> s2 = 'a'

The string object containing ‘a’ is added to the dictionary “interned”. The key is a pointer to the string object and the value is the same pointer. This new string object is also referenced in the array “characters” at offset 97 because the value of ‘a’ is 97 in ASCII. The variable “s2” is pointing to this string object.

Python string object internals

What happens when a different variable is assigned to the same string ‘a’?

>>> s3 = 'a'

The same string object previously created is returned so both variables are pointing to the same string object. The “characters” array is used during that process to check if the string already exists and returns the pointer to the string object.

if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL)
{
    ...
    return (PyObject *)op;
}
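
You can observe this sharing directly from the interpreter; both variables end up pointing to the same object (a CPython 2 implementation detail, not a language guarantee):

>>> s2 = 'a'
>>> s3 = 'a'
>>> s2 is s3
True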

Python string object internals

Let’s create a new small string containing the character ‘c’.

>>> s4 = 'c'

We end up with the following:

Python string object internals

We also find the “characters” array at use when a string’s item is requested like in the following Python script:

>>> s5 = 'abc'
>>> s5[0]
'a'

Instead of creating a new string containing ‘a’, the pointer at the offset 97 of the “characters” array is returned. Here is the code of the function “string_item” which is called when we request a character from a string. The argument “a” is the string object containing ‘abc’ and the argument “i” is the index requested: 0 in our case. A pointer to a string object is returned.

static PyObject *
string_item(PyStringObject *a, register Py_ssize_t i)
{
    char pchar;
    PyObject *v;
    ...
    pchar = a->ob_sval[i];
    v = (PyObject *)characters[pchar & UCHAR_MAX];
    if (v == NULL) {
        /* allocate a new string object */
        ...
    }
    else {
        ...
        Py_INCREF(v);
    }
    return v;
}

The “characters” array is also used for function names of length 1:

>>> def a(): pass

String search

Let’s take a look at what happens when you perform a string search like in the following Python code:

>>> s = 'adcabcdbdabcabd'
>>> s.find('abcab')
9

The “find” function returns the index where the string ‘abcab’ is first found in the string “s”. It returns -1 if the string is not found.

So, what happens internally? The function “fastsearch” is called. It is a mix between the Boyer-Moore and Horspool algorithms plus a couple of neat tricks.

Let’s call “s” the string to search in and “p” the string to search for: s = ‘adcabcdbdabcabd’ and p = ‘abcab’. “n” is the length of “s” and “m” is the length of “p”: n = 15 and m = 5.

The first check in the code is obvious, if m > n then we know that we won’t be able to find the index so the function returns -1 right away as we can see in the following code:

w = n - m;
if (w < 0)
    return -1;

When m = 1, the code goes through “s” one character at a time and returns the index when there is a match. mode = FAST_SEARCH in our case as we are looking for the index where the string is first found and not the number of times the string is found.

if (m <= 1) {
    ...
    if (mode == FAST_COUNT) {
        ...
    } else {
        for (i = 0; i < n; i++)
            if (s[i] == p[0])
                return i;
    }
    return -1;
}

For the other cases, i.e. m > 1, the first step is to create a compressed Boyer-Moore delta 1 table. Two variables will be assigned during that step: “mask” and “skip”.

“mask” is a 32-bit bitmask, using the 5 least significant bits of the character as the key. It is generated using the string to search “p”. It is a bloom filter which is used to test if a character is present in this string. It is really fast but there are false positives. You can read more about bloom filters here. This is how the bitmask is generated in our case:

mlast = m - 1
/* process pattern[:-1] */
for (mask = i = 0; i < mlast; i++) {
    mask |= (1 << (p[i] & 0x1F));
}
/* process pattern[-1] outside the loop */
mask |= (1 << (p[mlast] & 0x1F));

The first character of “p” is ‘a’. The value of ‘a’ is 97, which is 1100001 in binary. Using the 5 least significant bits, we get 00001, so “mask” is first set to 1 << 1 = 10. Once the entire string "p" is processed, mask = 1110.

How do we use this bitmask? By using the following test where "c" is the character to look for in the string "p".

if ((mask & (1 << (c & 0x1F))))

Is 'a' in "p" where p = 'abcab'? Is 1110 & (1 << ('a' & 0X1F)) true? 1110 & (1 << ('a' & 0X1F)) = 1110 & 10 = 10. So, yes 'a' is in 'abcab'. If we test with 'd', we get false and also with the characters from 'e' to 'z' so this filter works pretty well in our case.

"skip" is set to the index of the character with the same value as the last character in the string to search for. "skip" is set to the length of "p" - 1 if the last character is not found. The last character in the string to search for is 'b' which means "skip" will be set to 2 because this character can also be found by skipping over 2 characters down. This variable is used in a skip method called the bad-character skip method.

In the following example: p = 'abcab' and s = 'adcabcaba'. The search starts at index 4 of "s" and checks backward if there is a string match. This first test fails at index = 1 where 'b' is different than 'd'. We know that the character 'b' in "p" is also found 3 characters down starting from the end. Because 'c' is part of "p", we skip to the following 'b'. This is the bad-character skip.

Python string object internals

Next is the search loop itself (real code is in C instead of Python):

for i = 0 to n - m = 10:
    if s[i+m-1] == p[m-1]:
        if s[i:i+mlast] == p[0:mlast]:
            return i
        if s[i+m] not in p:
            i += m
        else:
            i += skip
    else:
        if s[i+m] not in p:
            i += m
return -1

The test "s[i+m] not in p" is done using the bitmask. "i += skip" is the bad-character skip. "i += m" is done when the next character is not found in "p".

Let's see how this search algorithm works with our strings "p" and "s". The first 3 steps are familiar. After that, the character 'd' is not in the string "p" so we skip the length of "p" and quickly find a match after that.

Python string object internals

You can take a look at the code for the string objects here.

That's it for now. I hope you enjoyed the article. Please write a comment if you have any feedback. If you need help with a project written in Python or with building a new web service, I am available as a freelancer: LinkedIn profile. Follow me on Twitter @laurentluce.

Python integer objects implementation

May 15, 2011

This article describes how integer objects are managed by Python internally.

An integer object in Python is represented internally by the structure PyIntObject. Its value is an attribute of type long.

typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;

To avoid allocating a new integer object each time a new integer object is needed, Python allocates a block of free unused integer objects in advance.

The following structure is used by Python to allocate integer objects, also called PyIntObjects. Once this structure is initialized, the integer objects are ready to be used when new integer values are assigned to objects in a Python script. This structure is called “PyIntBlock” and is defined as:

struct _intblock {
    struct _intblock *next;
    PyIntObject objects[N_INTOBJECTS];
};
typedef struct _intblock PyIntBlock;

When a block of integer objects is allocated by Python, the objects have no value assigned to them yet. We call them free integer objects ready to be used. A value will be assigned to the next free object when a new integer value is used in your program. No memory allocation will be required when a free integer object’s value is set so it will be fast.

The integer objects inside the block are linked together back to front using their internal pointer called ob_type. As noted in the source code, this is an abuse of this internal pointer so do not pay too much attention to the name.

Each block of integers contains the number of integer objects which can fit in a block of 1K bytes, about 40 PyIntObject objects on my 64-bit machine. When all the integer objects inside a block are used, a new block is allocated with a new list of integer objects available.
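
As a rough sanity check of that figure, the arithmetic looks like this (the sizes assume a common 64-bit CPython 2 build and the BLOCK_SIZE/BHEAD_SIZE constants from intobject.c; they vary by platform):

block_size = 1000       # BLOCK_SIZE: 1K less typical malloc overhead
block_head = 8          # BHEAD_SIZE: room for the pointer to the next block
int_object = 16 + 8     # PyObject_HEAD (16 bytes) + long ob_ival (8 bytes)
print (block_size - block_head) / int_object    # 41, i.e. about 40 objects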

A singly-linked list is used to keep track of the integers blocks allocated. It is called “block_list” internally.

Python integer object internals

A specific structure is used to refer to small integers and share them so access is fast. It is an array of 262 pointers to integer objects. Those integer objects are allocated during initialization in a block of integer objects we saw above. The small integers range is from -5 to 256. Many Python programs spend a lot of time using integers in that range so this is a smart decision.

#define NSMALLPOSINTS           257
#define NSMALLNEGINTS           5
static PyIntObject *small_ints[NSMALLNEGINTS + NSMALLPOSINTS];

Python integer object internals

The integer object representing the integer -5 is at offset 0 inside the small integers array. The integer object representing -4 is at offset 1, and so on.

What happens when an integer is defined in a Python script like this one?

>>> a=1
>>> a
1

When you execute the first line, the function PyInt_FromLong is called and its logic is the following:

if integer value in range -5,256:
    return the integer object pointed by the small integers array at the 
    offset (value + 5).
else:
    if no free integer object available:
        allocate new block of integer objects 
    set value of the next free integer object in the current block 
    of integers.
    return integer object

With our example: integer 1 object is pointed by the small integers array at offset: 1+5 = 6. A pointer to this integer object will be returned and the variable “a” will be pointing to that integer object.

Python integer object internals

Let’s take a look at a different example:

>>> a=300
>>> a
300

300 is not in the range of the small integers array so the next free integer object’s value is set to 300.

Python integer object internals
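
We can see the difference between the two cases from the interpreter: integers in the small range are shared, larger ones get their own object each time (again a CPython 2 implementation detail, not something to rely on):

>>> a = 1
>>> b = 1
>>> a is b
True
>>> a = 300
>>> b = 300
>>> a is b
False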

If you take a look at the file intobject.c in the Python 2.6 source code, you will see a long list of functions taking care of operations like addition, multiplication, conversion… The comparison function looks like this:

static int
int_compare(PyIntObject *v, PyIntObject *w)
{
    register long i = v->ob_ival;
    register long j = w->ob_ival;
    return (i < j) ? -1 : (i > j) ? 1 : 0;
}

The value of an integer object is stored in its ob_ival attribute which is of type long. Each value is placed in a register to optimize access and the comparison is done between those 2 registers. -1 is returned if the integer object pointed by v is less than the one pointed by w. 1 is returned for the opposite and 0 is returned if they are equal.
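
From the interpreter, the built-in cmp() exercises this comparison for two plain ints and follows the same -1/0/1 convention:

>>> cmp(1, 2)
-1
>>> cmp(2, 1)
1
>>> cmp(1, 1)
0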

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback.

Python and cryptography with pycrypto

April 22, 2011

We are going to talk about the toolkit pycrypto and how it can help us speed up development when cryptography is involved.

Hash functions
Encryption algorithms
Public-key algorithms

Hash functions

A hash function takes a string and produces a fixed-length string based on the input. The output string is called the hash value. Ideal hash functions obey the following:

  • It should be very difficult to guess the input string based on the output string.
  • It should be very difficult to find 2 different input strings having the same hash output.
  • It should be very difficult to modify the input string without modifying the output hash value.

Cryptography and Python

Hash functions can be used to calculate the checksum of some data. They can be used in digital signatures and authentication. We will see some applications in detail later on.

Let’s look at one example of a hash function: MD5.

MD5

This hash function dates back to 1991. The hash output value is 128 bits.

The algorithm’s steps are as follows:

  • Pad the string to a length congruent to 448 bits, modulo 512.
  • Append a 64-bit representation of the length of the input string.
  • Process the message in 16-word blocks. There are 4 rounds, one more than MD4’s 3.

You can get more details regarding the algorithm here.

Hashing a value using MD5 is done this way:

>>> from Crypto.Hash import MD5
>>> MD5.new('abc').hexdigest()
'900150983cd24fb0d6963f7d28e17f72'

It is important to know that MD5 is vulnerable to collision attacks. A collision attack is when 2 different inputs result in the same hash output. It is also vulnerable to some preimage attacks found in 2004 and 2008. A preimage attack is: given a hash h, you can find a message m where hash(m) = h.

Applications

Hash functions can be used in password management and storage. Web sites usually store the hash of a password and not the password itself so only the user knows the real password. When the user logs in, the hash of the password input is generated and compared to the hash value stored in the database. If it matches, the user is granted access. The code looks like this:

from Crypto.Hash import MD5
def check_password(clear_password, password_hash):
    return MD5.new(clear_password).hexdigest() == password_hash
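
As a quick usage example, with a hypothetical stored hash for the password ‘secret’ (the values here are only for illustration):

stored_hash = MD5.new('secret').hexdigest()
print check_password('secret', stored_hash)   # True
print check_password('wrong', stored_hash)    # False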

It is recommended to use a module like py-bcrypt to hash passwords as it is more secure than using MD5.

Another application is file integrity checking. Many downloadable files include an MD5 checksum to verify the integrity of the file once downloaded. Here is the code to calculate the MD5 checksum of a file. We work on chunks to avoid using too much memory when the file is large.

import os
from Crypto.Hash import MD5
def get_file_checksum(filename):
    h = MD5.new()
    chunk_size = 8192 
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if len(chunk) == 0:
                break
            h.update(chunk)
    return h.hexdigest()

Hash functions comparison

Hash function   Hash output size (bits)   Secure?
MD2             128                       No
MD4             128                       No
MD5             128                       No
SHA-1           160                       No
SHA-256         256                       Yes

Encryption algorithms

Encryption algorithms take some text as input and produce ciphertext using a variable key. You have 2 types of ciphers: block and stream. Block ciphers work on blocks of a fixed size (8 or 16 bytes). Stream ciphers work byte-by-byte. Knowing the key, you can decrypt the ciphertext.

Block ciphers

Let’s look at one of the block ciphers: DES. The key size used by this cipher is 8 bytes and the block of data it works with is 8 bytes long. The simplest mode for this block cipher is the electronic code book (ECB) mode where each block is encrypted independently to form the encrypted text.

Cryptography and Python

It is easy to encrypt text using DES/ECB with pycrypto. The key ‘01234567’ is 8 bytes and the text’s length needs to be a multiple of 8 bytes. We picked ‘abcdefgh’ in this example.

>>> from Crypto.Cipher import DES
>>> des = DES.new('01234567', DES.MODE_ECB)
>>> text = 'abcdefgh'
>>> cipher_text = des.encrypt(text)
>>> cipher_text
'\xec\xc2\x9e\xd9] a\xd0'
>>> des.decrypt(cipher_text)
'abcdefgh'

A stronger mode is CFB (Cipher Feedback), which combines each plain block with the encryption of the previous cipher block (or of the initial feedback value for the first block) before producing the cipher text.

Cryptography and Python

Here is how to use DES CFB mode. The plain text is 16 bytes long (multiple of 8 bytes). We need to specify an initial feedback value: we use a random string 8 bytes long, same size as the block size. It is better to use a random string for each new encryption to avoid chosen-ciphertext attacks. Note how we use 2 DES objects, 1 to encrypt and 1 to decrypt. This is required because of the feedback value getting modified each time a block is encrypted.

>>> from Crypto.Cipher import DES
>>> from Crypto import Random
>>> iv = Random.get_random_bytes(8)
>>> des1 = DES.new('01234567', DES.MODE_CFB, iv)
>>> des2 = DES.new('01234567', DES.MODE_CFB, iv)
>>> text = 'abcdefghijklmnop'
>>> cipher_text = des1.encrypt(text)
>>> cipher_text
"?\\\x8e\x86\xeb\xab\x8b\x97'\xa1W\xde\x89!\xc3d"
>>> des2.decrypt(cipher_text)
'abcdefghijklmnop'

Stream ciphers

Those algorithms work on a byte-by-byte basis. The block size is always 1 byte. 2 algorithms are supported by pycrypto: ARC4 and XOR. Only one mode is available: ECB.

Let’s look at an example with the algorithm ARC4 using the key ‘01234567’.

>>> from Crypto.Cipher import ARC4
>>> obj1 = ARC4.new('01234567')
>>> obj2 = ARC4.new('01234567')
>>> text = 'abcdefghijklmnop'
>>> cipher_text = obj1.encrypt(text)
>>> cipher_text
'\xf0\xb7\x90{#ABXY9\xd06\x9f\xc0\x8c '
>>> obj2.decrypt(cipher_text)
'abcdefghijklmnop'

Applications

It is easy to write code to encrypt and decrypt a file using pycrypto ciphers. Let’s do it using DES3 (Triple DES). We encrypt and decrypt the data in chunks to avoid using too much memory when the file is large. If a chunk’s length is not a multiple of 16 bytes, we pad it with spaces before encrypting it.

import os
from Crypto.Cipher import DES3

def encrypt_file(in_filename, out_filename, chunk_size, key, iv):
    des3 = DES3.new(key, DES3.MODE_CFB, iv)

    with open(in_filename, 'rb') as in_file:
        with open(out_filename, 'wb') as out_file:
            while True:
                chunk = in_file.read(chunk_size)
                if len(chunk) == 0:
                    break
                elif len(chunk) % 16 != 0:
                    chunk += ' ' * (16 - len(chunk) % 16)
                out_file.write(des3.encrypt(chunk))

def decrypt_file(in_filename, out_filename, chunk_size, key, iv):
    des3 = DES3.new(key, DES3.MODE_CFB, iv)

    with open(in_filename, 'rb') as in_file:
        with open(out_filename, 'wb') as out_file:
            while True:
                chunk = in_file.read(chunk_size)
                if len(chunk) == 0:
                    break
                out_file.write(des3.decrypt(chunk))

Next is a usage example of the 2 functions defined above:

from Crypto import Random
key = 'Sixteen byte key'  # example 16-byte DES3 key (must be 16 or 24 bytes)
iv = Random.get_random_bytes(8)
with open('to_enc.txt', 'r') as f:
    print 'to_enc.txt: %s' % f.read()
encrypt_file('to_enc.txt', 'to_enc.enc', 8192, key, iv)
with open('to_enc.enc', 'rb') as f:
    print 'to_enc.enc: %s' % f.read()
decrypt_file('to_enc.enc', 'to_enc.dec', 8192, key, iv)
with open('to_enc.dec', 'r') as f:
    print 'to_enc.dec: %s' % f.read()

The output of this script:

to_enc.txt: this content needs to be encrypted.

to_enc.enc: ??~?E??.??]!=)??"t?
                                JpDw???R?UN0?=??R?UN0?}0r?FV9
to_enc.dec: this content needs to be encrypted.

Public-key algorithms

One disadvantage with the encryption algorithms seen above is that both sides need to know the key. With public-key algorithms, there are 2 different keys: 1 to encrypt and 1 to decrypt. You only need to share the encryption key, and only you can decrypt the message with your private decryption key.

Cryptography and Python

Public/private key pair

It is easy to generate a private/public key pair with pycrypto. We need to specify the size of the key in bits: we picked 1024 bits. Larger is more secure. We also need to specify a random number generator function, we use the Random module of pycrypto for that.

>>> from Crypto.PublicKey import RSA
>>> from Crypto import Random
>>> random_generator = Random.new().read
>>> key = RSA.generate(1024, random_generator)
>>> key
<_RSAobj @0x7f60cf1b57e8 n(1024),e,d,p,q,u,private>

Let’s take a look at some methods supported by this key object. can_encrypt() checks the capability of encrypting data using this algorithm. can_sign() checks the capability of signing messages. has_private() returns True if the private key is present in the object.

>>> key.can_encrypt()
True
>>> key.can_sign()
True
>>> key.has_private()
True

Encrypt

Now that we have our key pair, we can encrypt some data. First, we extract the public key from the key pair and use it to encrypt some data. 32 is a random parameter used by the RSA algorithm to encrypt the data. This step simulates us publishing the encryption key and someone using it to encrypt some data before sending it to us.

>>> public_key = key.publickey()
>>> enc_data = public_key.encrypt('abcdefgh', 32)
>>> enc_data
('\x11\x86\x8b\xfa\x82\xdf\xe3sN ~@\xdbP\x85
\x93\xe6\xb9\xe9\x95I\xa7\xadQ\x08\xe5\xc8$9\x81K\xa0\xb5\xee\x1e\xb5r
\x9bH)\xd8\xeb\x03\xf3\x86\xb5\x03\xfd\x97\xe6%\x9e\xf7\x11=\xa1Y<\xdc
\x94\xf0\x7f7@\x9c\x02suc\xcc\xc2j\x0c\xce\x92\x8d\xdc\x00uL\xd6.
\x84~/\xed\xd7\xc5\xbe\xd2\x98\xec\xe4\xda\xd1L\rM`\x88\x13V\xe1M\n X
\xce\x13 \xaf\x10|\x80\x0e\x14\xbc\x14\x1ec\xf6Rs\xbb\x93\x06\xbe',)

Decrypt

We have the private decryption key so it is easy to decrypt the data.

>>> key.decrypt(enc_data)
'abcdefgh'

Sign

Signing a message can be useful to check the author of a message and make sure we can trust its origin. Next is an example on how to sign a message. The hash for this message is calculated first and then passed to the sign() method of the RSA key. You can use other algorithms like DSA or ElGamal.

>>> from Crypto.Hash import MD5
>>> from Crypto.PublicKey import RSA
>>> from Crypto import Random
>>> random_generator = Random.new().read
>>> key = RSA.generate(1024, random_generator)
>>> text = 'abcdefgh'
>>> hash = MD5.new(text).digest()
>>> hash
'\xe8\xdc@\x81\xb144\xb4Q\x89\xa7 \xb7{h\x18'
>>> signature = key.sign(hash, '')
>>> signature
(1549358700992033008647390368952919655009213441715588267926189797
14352832388210003027089995136141364041133696073722879839526120115
25996986614087200336035744524518268136542404418603981729787438986
50177007820700181992412437228386361134849096112920177007759309019
6400328917297225219942913552938646767912958849053L,)

Verify

Knowing the public key, it is easy to verify a message. The plain text is sent to the user along with the signature. The receiving side calculates the hash value and then uses the public key verify() method to validate its origin.

>>> text = 'abcdefgh'
>>> hash = MD5.new(text).digest()
>>> public_key = key.publickey()
>>> public_key.verify(hash, signature)
True

That’s it for now. I hope you enjoyed the article. Please write a comment if you have any feedback.

 