About


This is Ian Ozsvald’s blog (@IanOzsvald). I’m an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O’Reilly’s High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook and The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps, and a Londoner. Here’s a little more about me.


25 November 2014 - 19:11 We’re running more Data Science Training in 2015 Q1 in London

A couple of weeks ago Bart and I ran two very successful training courses in London through my ModelInsight: one introduced data science with pandas and numpy by building a recommender engine, the other was a two-day course on High Performance Python (yes, that was somewhat based on my book, with a lot of hands-on exercises). Based on feedback from those courses we’re looking to introduce up to five courses at the start of next year.

If you’d like to hear about our London data science training then sign up to our (very low volume) announcement list. I posted an anonymous survey to the mailing list; if you’d like to vote on the courses we should run then jump over here (no sign-up, only one question, no commitment).

If you’d like to talk about these in person then you can find me (probably on-stage) co-running the PyDataLondon meetups.

Here are the synopses for the proposed courses:

“Playing with data – pandas and matplotlib” (1 day)

Aimed at beginner Pythonista data scientists who want to load, manipulate and visualise data
We’ll use pandas, with many practical exercises on different sorts of data (including messy data that needs fixing), to manipulate, visualise and join data. You’ll be able to work with your own data sets after this course; we’ll also look at other visualisation tools like Seaborn and Bokeh. This will suit people who haven’t used pandas and want a practical introduction, such as data journalists, engineers and semi-technical managers.
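
To give a flavour of the level, here’s a minimal sketch of the kind of exercise covered (the file name and column names below are made up for illustration, they aren’t course material):

import pandas as pd
import matplotlib.pyplot as plt

# "sales.csv" and its column names are made up for this sketch.
df = pd.read_csv("sales.csv", parse_dates=["date"])
df = df.dropna(subset=["amount"])            # drop rows missing the amount
df["amount"] = df["amount"].astype(float)    # coerce to a numeric dtype

# Aggregate and visualise: total amount per month.
monthly = df.set_index("date")["amount"].resample("M").sum()
monthly.plot(kind="bar", title="Monthly totals")
plt.tight_layout()
plt.show()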

“Building a recommender system with Python” (1 day)

Aimed at intermediate Pythonistas who want to use pandas and numpy to build a working recommender engine, this covers everything from working with the data through to delivering a working data science product. You already know a little linear algebra and have used numpy lightly, and you want to see how to deploy a data science product as a microservice (Flask) that could reliably be put into production.
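
As a rough, illustrative sketch of where the day ends up (not the course’s actual code – the ratings matrix and endpoint below are invented for the example), a toy item-similarity recommender served from a tiny Flask microservice might look like:

import numpy as np
from flask import Flask, jsonify

# Invented user-item ratings matrix (rows: users, columns: items).
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 0.0, 5.0, 4.0],
])

def top_similar_items(item_idx, n=2):
    """Rank the other items by cosine similarity of their rating columns."""
    cols = ratings.T
    norms = np.linalg.norm(cols, axis=1)
    sims = cols @ cols[item_idx] / (norms * norms[item_idx] + 1e-9)
    ranked = np.argsort(-sims)
    return [int(i) for i in ranked if i != item_idx][:n]

app = Flask(__name__)

@app.route("/recommend/<int:item_idx>")
def recommend(item_idx):
    # Serve the recommendation as JSON from a tiny Flask microservice.
    return jsonify(similar_items=top_similar_items(item_idx))

if __name__ == "__main__":
    app.run(port=5000)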

“Statistics and Big Data using scikit-learn” (2 days)

Aimed at beginner/intermediate Pythonistas with some mathematical background and a desire to learn everyday statistics and get started with machine learning
Day 1 – Probability, distributions, Frequentist and Bayesian approaches, Inference and Regression, Experiment Design – part discussion and part practical
Day 2 – Applying these approaches with scikit-learn to everyday problems; examples may include (note: *examples may change*, this just gives a flavour) Bayesian spam detection, predicting political campaigns, quality testing, clustering and weather forecasting; tools will include Statsmodels and matplotlib.
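
To give a flavour of the Day 2 labs (purely illustrative – the tiny corpus below is made up, the real labs use proper datasets), a Bayesian spam detector in scikit-learn can be sketched as:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus; real labs would use a proper labelled dataset.
texts = ["win a free prize now", "meeting at 10am tomorrow",
         "free offer click now", "lunch with the team?"]
labels = [1, 0, 1, 0]                 # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(texts)          # bag-of-words counts
clf = MultinomialNB().fit(X, labels)  # a simple Bayesian classifier

print(clf.predict(vec.transform(["free prize tomorrow"])))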

“Hands on with Scikit-Learn” (5 days)

Aimed at intermediate Pythonistas who need a practical and comprehensive introduction to machine learning in Python; you already have a basic statistics and linear algebra background
This course will cover all the terminology and stages that make up the machine learning pipeline and the fundamental skills needed to perform machine learning successfully. Aided by many hands-on labs with Python and scikit-learn, the course will enable you to understand the basic concepts and become confident in applying the tools and techniques, and will provide a firm foundation from which to dig deeper and explore more advanced methods.
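
As a hint of the sort of pipeline the labs build towards (a minimal sketch using scikit-learn’s bundled iris data and a current scikit-learn, not the course’s own material):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# A Pipeline chains the preprocessing and model stages so they are fitted
# together and evaluated consistently under cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
print(scores.mean())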

“High Performance Python” (2 days)

Aimed at intermediate Pythonistas whose code is too slow
Day 1 – Profiling (CPU and RAM), compiling with Cython, using Numba, PyPy and Pythran (all the way through to using OpenMP)
Day 2 – Going multicore (multiprocessing) and multi-machine (IPython parallel), fitting more into RAM, probabilistic counting, storage engines, Test Driven Development and several debugging exercises
A mix of theory and practical exercises; you’ll be able to use the main Python tools to confidently and reliably make your code run faster
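
For a flavour of the Day 2 multicore material, here’s a minimal sketch (the prime-counting workload is just an invented CPU-bound stand-in):

import math
from multiprocessing import Pool
from timeit import default_timer as timer

def count_primes(limit):
    """Deliberately naive, CPU-bound work used purely for demonstration."""
    return sum(1 for n in range(2, limit)
               if all(n % d for d in range(2, int(math.sqrt(n)) + 1)))

if __name__ == "__main__":
    chunks = [200_000] * 4

    start = timer()
    serial = [count_primes(c) for c in chunks]
    print("serial:  ", timer() - start)

    start = timer()
    with Pool(processes=4) as pool:
        parallel = pool.map(count_primes, chunks)   # spread the work across cores
    print("parallel:", timer() - start)

    assert serial == parallel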


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

No Comments | Tags: Data science, Python

26 August 2014 - 21:35 Why are technical companies not using data science?

Here’s a quick question. How come more technical companies aren’t making use of data science? By “technical” I mean any company with data and the smarts to spot that it has value, by “data science” I mean any technical means to exploit this data for financial gain (e.g. visualisation to guide decisions, machine learning, prediction).

I’m guessing that it comes down to an economic question – either it isn’t as valuable as some other activity (making mobile apps? improving UX on the website? paid marketing? expanding sales to new territories?) or it is perceived as being valuable but cannot be exploited (maybe due to lack of skills and training or data problems).

I’m thinking about this for my upcoming keynote at PyConIreland, would you please give me some feedback in the survey below (no sign-up required)?

To be clear – this is an anonymous survey, I’ll have no idea who gives the answers.


If the above is interesting then note that we’ve got a data science training list where we make occasional announcements about our upcoming training, and we have two upcoming training courses. We also discuss these topics at our PyDataLondon meetups. I also have a slightly longer survey (it’ll take you two minutes, no sign-up required); I’ll be discussing the results at the next PyDataLondon, so please share your thoughts.



No Comments | Tags: ArtificialIntelligence, Data science, pydata, Python

20 August 2014 - 21:24 Data Science Training Survey

I’ve put together a short survey to figure out what’s needed for Python-based Data Science training in the UK. If you want to be trained in strong data science, analysis and engineering skills please complete the survey; it doesn’t need any sign-up and will take just a couple of minutes. I’ll share the results at the next PyDataLondon meetup.

If you want training you probably want to be on our training announcement list; this is a low-volume list (run by MailChimp) where we announce upcoming dates and suggest topics that you might want training around. You can unsubscribe at any time.

I’ve written about the two courses that run in October through ModelInsight: one focuses on improving skills around data science using Python (including numpy, scipy and TDD), the second on high performance Python (I’ve now finished writing O’Reilly’s High Performance Python book). Both courses focus on practical skills; you’ll walk away with working systems and a stronger understanding of key Python skills. Your development and debugging skills will be stronger, and in the longer run you’ll build stronger software with fewer defects.

If you want to talk about this, come have a chat at the next PyData London meetup or in the pub after.



No Comments | Tags: Data science, pydata, Python

1 August 2014 - 13:13 Python Training courses: Data Science and High Performance Python coming in October

I’m pleased to say that via ModelInsight we’ll be running two Python-focused training courses in October. The goal is to give you strong new research and development skills; they’re aimed at folks in companies but would suit folks in academia too. UPDATE: the training courses are ready to buy (1-day Data Science, 2-day High Performance).

UPDATE: we have a <5 minute anonymous survey which helps us learn your needs for Data Science training in London; please click through and answer the few questions so we know what training you need.

“Highly recommended – I attended in Aalborg in May” – @ThomasArildsen on our upcoming Python DataSci/HighPerf training courses

These and future courses will be announced on our London Python Data Science Training mailing list; sign up for occasional announcements about our upcoming courses (no spam, just occasional updates, and you can unsubscribe at any time).

Intro to Data Science with Python (1 day) on Friday 24th October

Students: Basic to Intermediate Pythonistas (you can already write scripts and you have some basic matrix experience)

Goal: Solve a complete data science problem (building a working and deployable recommendation engine) by working through the entire process – using numpy and pandas, applying test driven development, visualising the problem, deploying a tiny web application that serves the results (great for when you’re back with your team!)

  • Learn basic numpy, pandas and data cleaning
  • Be confident with Test Driven Development and debugging strategies
  • Create a recommender system and understand its strengths and limitations
  • Use a Flask API to serve results
  • Learn Anaconda and conda environments
  • Take home a working recommender system that you can confidently customise to your data
  • £300 including lunch, central London (24th October)
  • Additional announcements will come via our London Python Data Science Training mailing list
  • Buy your ticket here

High Performance Python (2 day) on Thursday+Friday 30th+31st October

Students: Intermediate Pythonistas (you need higher performance for your Python code)

Goal: learn techniques for high performance computing – a mix of background theory and lots of hands-on pragmatic exercises

  • Profiling (CPU, RAM) to understand bottlenecks
  • Compilers and JITs (Cython, Numba, Pythran, PyPy) to pragmatically run code faster
  • Learn R&D and engineering approaches to efficient development
  • Multicore and clusters (multiprocessing, IPython parallel) for scaling
  • Debugging strategies, numpy techniques, lowering memory usage, storage engines
  • Learn Anaconda and conda environments
  • Take home years of hard-won experience so you can develop performant Python code
  • Cost: £600 including lunch, central London (30th & 31st October)
  • Additional announcements will come via our London Python Data Science Training mailing list
  • Buy your ticket here

The High Performance course is built on many years of teaching and talking at conferences (including PyDataLondon 2013, PyCon 2013 and EuroSciPy 2012) and in companies, along with my High Performance Python book (O’Reilly). The data science course is built on techniques we’ve used over the last few years to help clients solve data science problems. Both courses are very pragmatic and hands-on, and will leave you with new skills that have been battle-tested by us (we use these approaches to quickly deliver correct and valuable data science solutions for our clients via ModelInsight). At PyCon 2012 my students rated me 4.64/5.0 for overall happiness with my High Performance teaching.

“@ianozsvald [..] Best tutorial of the 4 I attended was yours. Thanks for your time and preparation!” – @cgoering

We’d also like to know which other courses you’d like to attend; we can partner with trainers as needed to deliver new courses in London. We’re focused on Python, data science, high performance and pragmatic engineering. Drop me an email (via ModelInsight) and let me know if we can help.

Do please join our London Python Data Science Training mailing list to be kept informed about upcoming training courses.



1 Comment | Tags: Data science, High Performance Python Book, Python

26 June 2014 - 14:08 PyDataLondon second meetup (July 1st)

Our second PyDataLondon meetup will be running on Tuesday July 1st at Pivotal in Shoreditch. The announcement went out to the meetup group and the event was at capacity within 7 hours – if you’d like to attend future meetups please join the group (the wait-list is open for our next event). Our speakers:

  1. Kyran Dale on “Getting your Python data onto a Browser” – Python+JavaScript from an ex-academic turned Brighton-based freelance JavaScript/Python whiz
  2. Laurie Clark-Michalek – “Defence of the Ancients Analysis: Using Python to provide insight into professional DOTA2 matches” – game analysis using the full range of Python tools from data munging, high performance with Cython and visualisation

We’ll also have several lightning talks; these are described on the meetup page.

We’re open to submissions for future talks and lightning talks; please send us an email via the meetup group (we might have room for one more lightning talk at the upcoming PyData – get in contact if you’ve something interesting to present in 5 minutes).

Some other events might interest you – Brighton has a Data Visualisation event, and Yves Hilpisch recently ran a QuantFinance training session for which the slides are available. Also remember PyDataBerlin in July and EuroSciPy in Cambridge in August.

 



No Comments | Tags: Data science, Life, pydata, Python

23 June 2014 - 22:47 High Performance Python manuscript submitted to O’Reilly

I’m super-happy to say that Micha and I have submitted the manuscript to O’Reilly for our High Performance Python book. Here’s the final chapter list:

  • Understanding Performant Python
  • Profiling to find bottlenecks (%timeit, cProfile, line_profiler, memory_profiler, heapy and more)
  • Lists and Tuples (how they work under the hood)
  • Dictionaries and Sets (under the hood again)
  • Iterators and Generators (introducing intermediate-level Python techniques)
  • Matrix and Vector Computation (numpy and scipy and Linux’s perf)
  • Compiling to C (Cython, Shed Skin, Pythran, Numba, PyPy) and building C extensions
  • Concurrency (getting past IO bottlenecks using Gevent, Tornado, AsyncIO)
  • The multiprocessing module (pools, IPC and locking)
  • Clusters and Job Queues (IPython, ParallelPython, NSQ)
  • Using less RAM (ways to store text with far less RAM, probabilistic counting – a small counting sketch follows this list)
  • Lessons from the field (stories from experienced developers on all these topics)
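
As a flavour of the probabilistic counting topic mentioned above, here’s a minimal sketch of a classic approximate counter (a Morris counter); it’s illustrative rather than the book’s implementation:

import random

class MorrisCounter:
    """Approximate counter that stores an exponent rather than the full count."""
    def __init__(self):
        self.exponent = 0

    def add(self):
        # Increment the exponent with probability 2**-exponent, so only a few
        # bits of state are needed however many events are seen.
        if random.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def __len__(self):
        # The estimated count is 2**exponent - 1.
        return int(2 ** self.exponent - 1)

counter = MorrisCounter()
for _ in range(100_000):
    counter.add()
print(len(counter))   # roughly the right order of magnitude from tiny storage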

August is still the expected publication date; a soon-to-follow Early Release will have all the chapters included. Next up I’ll be teaching some of this in August at EuroSciPy in Cambridge.

Some related (but not covered in the book) bits of High Performance Python news:

  • PyPy.js is now faster than CPython (but not as fast as PyPy) – a crazy and rather cutting-edge effort to get Python code running on a JavaScript engine through the RPython/PyPy toolchain
  • MicroPython runs in tiny-memory environments; it aims to run on embedded devices (e.g. ARM boards) with low RAM where CPython couldn’t possibly run. It is pretty advanced and lets us use Python code in a new class of environment
  • cytoolz offers Cython-compiled versions of the pytoolz extended iterator objects, running faster than pytoolz and, via iterators, probably using significantly less RAM than standard Python containers


No Comments | Tags: Data science, High Performance Python Book, Python

19 June 2014 - 16:34 Flask + mod_uwsgi + Apache + Continuum’s Anaconda

I’ve spent the morning figuring out how to use Flask through Anaconda with Apache and uWSGI on an Amazon EC2 machine, side-stepping the system’s default Python. I’ll log the main steps here; I found lots of hints on the web but nothing that tied it all together for someone like me who lacks Apache config experience. The reason for deploying using Anaconda is to keep the environment consistent with our dev machines.

First it is worth noting that mod_wsgi and mod_uwsgi (the latter is what I’m using) are different things; Flask’s Apache instructions talk about mod_wsgi and describe mod_uwsgi for nginx. Continuum’s Anaconda forum had a hint but not a worked solution.

I’ve used mod_wsgi before with a native (non-Anaconda) Python installation (plus a virtualenv of numpy, scipy etc); I wanted to do something similar using an Anaconda install for an internal recommender system for a client. The following summarises my working notes; please add a comment if you can improve any of the steps.

# Set up an Ubuntu 12.04 AMI on EC2, then:

source activate production
    # activate the Anaconda environment (I'm assuming you've set up an
    # environment and put your src onto this machine)

conda install -c https://conda.binstar.org/travis uwsgi
    # install uwsgi 2.0.2 into your Anaconda environment using binstar
    # (other, newer versions might be available)

uwsgi --http :9090 --uwsgi-socket localhost:56708 --wsgi-file <path>/server.wsgi
    # run uwsgi locally on a specified TCP/IP port

curl localhost:9090   # calls localhost:9090/ to test that your Flask app
                      # is responding via uwsgi

If you get uwsgi running locally and you can talk to it via curl then you’ve got an installed uwsgi gateway running with Anaconda – that’s the less-discussed-on-the-web part done.

Now setup Apache:

sudo apt-get install lamp-server^           # install the LAMP stack
sudo a2dissite 000-default                  # disable the default Apache app

# I believe the following is sensible, but if there's an easier or better way
# to talk to uwsgi please leave me a comment (should I prefer unix sockets maybe?)
sudo apt-get install libapache2-mod-uwsgi   # install mod_uwsgi
sudo a2enmod uwsgi                          # activate mod_uwsgi in Apache

# create myserver.conf (see below) to configure Apache
sudo a2ensite myserver.conf                 # enable your server configuration in Apache

service apache2 reload
    # somewhere around now you'll have to reload Apache so it sees the new
    # configurations (you might have had to do it earlier)

My server.wsgi lives with my source (outside of the Apache folders); as noted in the Flask wsgi page it contains:

import sys
sys.path.insert(0, "<path>/mysource")
from server import app as application

Note that it doesn’t need the virtualenv hack as we’re not using virtualenv; you’ve already got uwsgi running with Anaconda’s Python (rather than the system’s default Python).

The Apache configuration lives in /etc/apache2/sites-available/myserver.conf and has only the following lines (credit: the Django uwsgi doc); note that the specified port is the same one we used when running uwsgi:

<VirtualHost *:80>
  <Location />
    SetHandler uwsgi-handler
    uWSGISocket 127.0.0.1:56708
  </Location>
</VirtualHost>

Once Apache is running, if you stop your uwsgi process then you’ll get 502 Bad Gateway errors; if you restart the uwsgi process then your server will respond again. There’s no need to restart Apache when you restart your uwsgi process.

For debugging note that /etc/apache2/mods-available/ will contain uwsgi.load once mod_uwsgi is installed. The uwsgi binary lives in your Anaconda environment (for me it is ~/anaconda/envs/production/bin/uwsgi), it’ll only be active once you’ve activated this environment. Useful(ish) error messages should appear in /var/log/apache2/error.log. uWSGI has best practices and a FAQ.

Having made this run at the command line, it now needs to be automated. I’m using Circus. I’ve installed this via the system Python (not via Anaconda) as I wanted to treat it as being outside of the Anaconda environment (just as Upstart, cron etc would be), which means a bit of tweaking is needed. Specifically, PATH must be configured to point at Anaconda and a fully qualified path to uwsgi must be provided:

#circus.ini
[circus]
check_delay = 5
endpoint = tcp://127.0.0.1:5555
pubsub_endpoint = tcp://127.0.0.1:5556

[env:myserver]
PATH=/home/ubuntu/anaconda/bin:$PATH

[watcher:myserver]
cmd = <path_anaconda>/envs/production/bin/uwsgi
args = --http :9090 --uwsgi-socket localhost:56708  
  --wsgi-file <config_dir>/server.wsgi 
  --chdir <working_dir>
warmup_delay = 0
numprocesses = 1

 

This can be run with “circusd <config>/circus.ini --log-level debug”, which prints out a lot of debug info to the console; remember to run this from a login shell and not inside the Anaconda environment if you’ve installed Circus without Anaconda.

Once this works it can be configured for control by the system; I’m using Upstart on Ubuntu via the Circus deployment instructions with an /etc/init/circus.conf script, configured to its own directory.

If you know that mod_wsgi would have been a better choice then please let me know (though development on that project looks very slow [it says “it is resting”]). I’m experimenting with mod_uwsgi (it seems to be more actively developed) but this is a foreign area for me, and I’d be happy to learn of better ways to crack this nut. A quick glance suggests that both support Python 3.



No Comments | Tags: Data science, Python

1 November 2013 - 12:10 “Introducing Python for Data Science” talk at SkillsMatter

On Wednesday Bart and I spoke at SkillsMatter to 75 Pythonistas with an Introduction to Data Science using Python. A video of the 4 talks is now online. We covered:

Since the group is more of a general programming community we wanted to talk at a high level about the various ways that Python can be used for data science. It was lovely to have such a large turn-out, and the pub conversation afterwards was much fun.



16 Comments | Tags: Data science, Life, Python

7 October 2013 - 17:10 Future Cities Hackathon (@ds_ldn) Oct 2013 on Parking Usage Inefficiencies

On Saturday six of us attended the Future Cities Hackathon organised by Carlos and DataScienceLondon (@ds_ldn). I counted about 100 people in the audience (see lots of photos, original meetup thread); from asking around there seemed to be a very diverse skill set (Python and R as expected, lots of Java/C, Excel and other tools). There were several newly-released data sets to choose from. We spoke with Len Anderson of SocITM, who works with Local Government; he suggested that the parking datasets for Westminster Ward might be interesting, as results with an economic outcome might actually do something useful for Government Policy. This seemed like a sensible reason to tackle the data. Other data sets included flow-of-people and ASBO/dog-mess/graffiti recordings.

Overall we won ‘honourable mention’ for proposing the idea that the data supported a method of changing parking behaviour whilst introducing the idea of a dynamic pricing model so that parking spaces might be better utilised and used to generate increased revenue for the council. I suspect that there are more opportunities for improving the efficiency of static systems as the government opens more data here in the UK.

Sidenote – previously I’ve thought about the replacement of delivery drivers with self-driving cars and other outcomes of self-driving vehicles, the efficiencies discussed here connect with those ideas.

With the parking datasets we have over 4 million lines of cashless parking-meter payments for 2012-13 in Westminster to analyse, tagged with duration (you buy a ticket at a certain time for fixed periods of time like 30 minutes, 2 hours etc) and a latitude/longitude for location. We also had a smaller dataset with parking offence tickets (with date/time and location – but only street name, not latitude/longitude) and a third set with readings from the small number of parking sensors in Westminster.

Ultimately we produced a geographic plot of over 1000 parking bays, coloured by average percentage occupancy in Westminster. The motivation was to show that some bays are well used (i.e. often have a car parked in them) whilst other areas are under-utilised and could take a higher load (darker means better utilised):

Westminster Parking Bays by Percentage Occupancy

At first we thought we’d identified a striking result. After a few more minutes hacking (around 9.30pm on the Saturday) we pulled out the variance in pricing per bay and noted that this was actually quite varied and confusing, so a visitor to the area would have a hard time figuring out which bays were likely to be both under-utilised and cheap (darker means more expensive):

Westminster parking bays by cost

If we’d had more time we’d have checked which bays were likely to be both under-utilised and cheap and ranked the best bays in various areas. One can imagine turning this into a smartphone app to help visitors and locals find available parking.

The video below shows the cost and availability of parking over the course of the day. Opacity (how see-through it is) represents the expense – darker means more expensive (so you want to find very-see-through areas). Size represents the number of free spaces, bigger means more free space, smaller (i.e. during the working day) shows that there are few free spaces:

Behind this model we captured the minute-by-minute stream of ticket purchases by lat/lng to model the occupancy of bays; the data also records the maximum number of bays that can be used (but the payment machines don’t know how many are actually in use – we had to model this). Using pandas we modelled usage over time (+1 for each ticket purchase and -1 for each expiry). The red line shows the maximum number of bays available; the sections over the line suggest that people aren’t parking for their full allocation (e.g. you might buy an hour’s ticket but only stay for 20 minutes, then someone else buys a ticket and uses the same bay):

Modelled parking bay occupancy over time (ticket starts and expiries)

We extended the above model for one Tuesday over all the 1,000+ parking bays in Westminster.
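
A minimal sketch of that +1/-1 model (the file and column names here are made-up stand-ins for the Westminster data):

import pandas as pd
import matplotlib.pyplot as plt

# Made-up file and column names: each row is one cashless payment with a
# start time, a purchased duration in minutes and a bay identifier.
df = pd.read_csv("westminster_payments.csv", parse_dates=["start"])
df["end"] = df["start"] + pd.to_timedelta(df["duration_mins"], unit="min")

some_bay = "bay_0001"                  # invented bay identifier
bay = df[df["bay_id"] == some_bay]

# +1 at each ticket start, -1 at each expiry, cumulative-summed over time
# to give the modelled number of occupied spaces in the bay.
events = pd.concat([
    pd.Series(1, index=bay["start"]),
    pd.Series(-1, index=bay["end"]),
]).sort_index()
occupancy = events.cumsum()

occupancy.plot()   # sections above the bay's capacity suggest early departures
plt.show()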

Additionally this analysis shows the times and days when parking tickets are most likely to be issued. The 1am and 3am results were odd; Sunday (day 6) is clearly the quietest, and weekdays at 9am are obviously the worst:

Parking fines bucketed by hour and day over many weeks

Conclusion:

We believe that the carrot and stick approach to parking management (showing where to park – and noting that you’ll likely get fined if you don’t do it properly) should increase the correct utilisation of parking bays in Westminster which would help to reduce congestion and decrease driver-frustration, whilst increasing income for the local council.

Update – at least one parking area in New Zealand is experimenting with truly dynamic demand-based pricing.

We also believe the data could be used by Traffic Wardens to better patrol the high-risk areas to deter poor parking (e.g. double-parking) which can be a traffic hazard (e.g. by obstructing a road for larger vehicles like Fire Engines). The static dataset we used could certainly be processed for use in a smartphone app for easy use, and updated as new data sets are released.

Our code is available in this github repo: ParkingWestminster.

Here’s our presentation:

Team:

Tools used:

  • Python and IPython
  • Pandas
  • QGIS (visualisation of shapefiles backed by OpenLayers maps from Google and OSM)
  • pyshp to handle shapefiles
  • Excel (quick analysis of dates and times, quick visualisation of lat/lng co-ords)
  • HackPad (useful for lightweight note/URL sharing and code snippet collaboration)

 Some reflections for future hackathons:

  • Pre-cleaning of data would speed team productivity (we all hacked various approaches to fixing the odd Date and separate Time fields in the CSV data, and I suspect many in the room solved this same problem over the first hour or two…we should have flagged this issue early on, had a couple of us solve it, and written out a new 1.4GB fixed CSV file for all to share – see the sketch of this clean-up after the list)
  • Decide early on a goal – for us it was “work to show that a dynamic pricing model is feasible” – that lets you frame and answer early questions (quite possibly an hour in we’d have discovered that the data didn’t support our hypothesis – thankfully it did!)
  • Always visualise quickly – whilst I wrote a new shapefile to represent the lat/lng data Bart just loaded it into Excel and did a scatter plot – super quick and easy (and shortly after I added the Map layer via QGIS so we could line up street names and validate we had sane data)
  • Check for outliers and odd data – we discovered lots of NaN lines (easily caught and either deleted or fixed using Pandas); when output and visualised these were interpreted by QGIS as extreme but legal values, so early on we had some odd visuals until we eyeballed the generated CSV files. Always watch for NaNs!
  • It makes sense to print a list of extreme and normal values for a column, again as a sanity check – histograms are useful, also sets of unique values if you have categories
  • Question whether the result you see would actually match reality – having spent hours on a problem it is nice to think you’ve visualised something new and novel, but probably the insight you’re drawing out is already factored in (e.g. in our case at least some drivers in Westminster would already know where the cheap/under-utilised parking spaces are – so there shouldn’t be too many)
  • Setup a github repo early and make sure all the team can contribute (some of our team weren’t experienced with github so we deferred this step and ended up emailing code…that was a poor use of time!)
  • Go visit the other teams – we hacked so intently we forgot to talk to anyone else…I’m sure we’d have learned and skill-shared had we actually stepped away from our keyboards!
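
As a sketch of the clean-up step mentioned in the first point above (the column names are made-up stand-ins for the hackathon CSV):

import pandas as pd

# Made-up column names; the hackathon CSV had separate Date and Time fields
# plus NaN rows that confused downstream visualisation.
df = pd.read_csv("payments.csv", dtype=str)
df = df.dropna(subset=["Date", "Time", "lat", "lng"])    # discard incomplete rows

# Combine the two text fields into one proper timestamp column (UK-style dates).
df["timestamp"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                 dayfirst=True, errors="coerce")
df = df.dropna(subset=["timestamp"])     # rows that still failed to parse

df.to_csv("payments_fixed.csv", index=False)   # one cleaned file for the team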

Update – Stephan Hügel has a nice article on various Python tools for making maps of London wards, his notes are far more in-depth than the approach we took here.

Update – nice picture of London house prices by postcode, this isn’t strictly related to the above but it is close enough. Visualising the workings of the city feels rather powerful. I wonder how the house prices track availability of public transport and local amenities?



6 Comments | Tags: Data science, Life, Python

17 June 2013 - 20:13 Demonstrating the first Brand Disambiguator (a hacky, crappy classifier that does something useful)

Last week I had the pleasure of talking at both BrightonPython and DataScienceLondon to about 150 people in total (Robin East wrote up the DataScience night). The updated code is on github.

The goal is to disambiguate the word-sense of a token (e.g. “Apple”) in a tweet as being either the-brand-I-care-about (in this case – Apple Inc.) or anything-else (e.g. apple sauce, Shabby Apple clothing, apple juice etc). This is related to named entity recognition; I’m exploring simple techniques for disambiguation. In both talks people asked if this could classify an arbitrary tweet as being “about Apple Inc or not”, and whilst this is possible, for this project I’m restricting myself to the (achievable, I think) goal of robust disambiguation within the one-month timeline I’ve set myself.

Below are the slides from the longer of the two talks at BrightonPython:

As noted in the slides for week 1 of the project I built a trivial LogisticRegression classifier using the default CountVectorizer, applied a threshold and tested the resulting model on a held-out validation set. Now I have a few more weeks to build on the project before returning to consulting work.

Currently I use a JSON file of tweets filtered on the term ‘apple’, obtained using the free streaming API from Twitter using cURL. I then annotate the tweets as being in-class (apple-the-brand) or out-of-class (any other use of the term “apple”). I used the Chromium Language Detector to filter non-English tweets and to discard English tweets that I can’t disambiguate for this data set. In total I annotated 2014 tweets. This set contains many duplicates (e.g. retweets) which I’ll probably thin out later, as they possibly over-represent the real frequency of important tokens.

Next I built a validation set using 100 in- and 100 out-of-class tweets at random and created a separate test/train set with 584 tweets of each class (a balanced set from the two classes but ignoring the issue of duplicates due to retweets inside each class).

To convert the tweets into a dense matrix for learning I used the CountVectorizer with all the defaults (simple tokenizer [which is not great for tweets], minimum document frequency=1, unigrams only).

Using the simplest possible approach that could work, I trained a LogisticRegression classifier with all its defaults on the dense matrix of 1168 inputs. I then applied this classifier to the held-out validation set using a confidence threshold (>92% for in-class, anything less is assumed to be out-of-class). It classifies 51 of the 100 in-class examples as in-class and makes no errors (100% precision, 51% recall). This threshold was chosen arbitrarily on the validation set rather than derived from the test/train set (poor hackery on my part), but it satisfied me that this basic approach was learning something useful from this first data set.
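
The core of that approach looks roughly like this (a sketch with tiny made-up stand-ins for the annotated tweet sets rather than the project’s actual code; the feature-name extraction assumes a recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up stand-ins for the annotated tweet sets described above.
train_texts = ["apple ceo tim cook on stage", "apple sauce recipe with pork",
               "new apple iphone launch today", "apple juice for breakfast"]
train_labels = [1, 0, 1, 0]             # 1 = Apple Inc., 0 = anything else
validation_texts = ["tim cook announces new apple products", "baked apple pie"]

vec = CountVectorizer()                 # default tokenizer, unigrams, min_df=1
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

# High-confidence threshold: only call a tweet in-class when the predicted
# probability of the in-class label exceeds 0.92; anything less is out-of-class.
probs = clf.predict_proba(vec.transform(validation_texts))[:, 1]
print([p > 0.92 for p in probs])

# Inspect the most heavily weighted unigrams (e.g. "tim", "cook", "iphone").
names = vec.get_feature_names_out()
top_features = sorted(zip(clf.coef_[0], names), reverse=True)[:5]
print(top_features)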

The strong (but not generalised at all!) result for the very basic LogisticRegression classifier will be due to token artefacts in the time period I chose (March 13th 2013 around 7pm for the 2014 tweets). Extracting the top features from LogisticRegression shows that it is identifying terms like “Tim”, “Cook”, “CEO” as significant features (along with other features that you’d expect to see like “iphone” and “sauce” and “juice”) – this is due to their prevalence in this small dataset (in this set examples like this are very frequent). Once a larger dataset is used this advantage will disappear.

I’ve added some TODO items to the README, maybe someone wants to tinker with the code? Building an interface to the open source DBPediaSpotlight (based on WikiPedia data using e.g. this python wrapper) would be a great start for validating progress, along with building some naive classifiers (a capital-letter-detecting one and a more complex heuristic-based one, to use as controls against the machine learning approach).

Looking at the data, 6% of the out-of-class examples are retweets and 20% of the in-class examples are retweets. I suspect that the repeated strings are distorting each class, so I think they need to be thinned out until we have just one unique example of each tweet.

Counting the number of capital letters in-class and out-of-class might be useful, in this set a count of <5 capital letters per tweet suggests an out-of-class example:

Number of capital letters per tweet for in-class and out-of-class examples
This histogram of tweet lengths for in-class and out-of-class tweets might also suggest that shorter tweets are more likely to be out-of-class (though the evidence is much weaker):

Histogram of tweet lengths for in-class and out-of-class examples

Next I need to:

  • Update the docs so that a contributor can play with the code, this includes exporting a list of tweet-ids and class annotations so the data can be archived and recreated
  • Spend some time looking at the most-important features (I want to properly understand the numbers so I know what is happening), I’ll probably also use a Decision Tree (and maybe RandomForests) to see what they identify (since they’re much easier to debug)
  • Improve the tokenizer so that it respects some of the structure of tweets (preserving #hashtags and @users would be a start, along with URLs)
  • Build a bigger data set that doesn’t exhibit the easily-fitted unigrams that appear in the current set

Longer term I’ve got a set of Homeland tweets (to disambiguate the TV show vs references to the US Department and various sayings related to the term) which I’d like to play with – I figure making some progress here opens the door to analysing media commentary in tweets.



2 Comments | Tags: ArtificialIntelligence, Data science, Life, Python, SocialMediaBrandDisambiguator