About


This is Ian Ozsvald's blog. I'm an entrepreneurial geek, an AI consultant, founder of the Annotate.io social media mining API, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Brightonian. Here's a little more about me.


5 May 2013 - 14:32
June project: Disambiguating “brands” in Social Media

Having returned from Chile last year, settled into consulting in London and got married, and now being on honeymoon, I’m planning a change for June.

I’m taking the month off from clients to work on my own project, an open sourced brand disambiguator for social media. As an example this will detect that the following tweet mentions Apple-the-brand:
“I love my apple, though leopard can be a pain”
and that this tweet does not:
“Really enjoying this apple, very tasty”

I’ve used AlchemyAPI, OpenCalais, DBPedia Spotlight and others for client projects and it turns out that these APIs expect long-form text (e.g. Reuters articles) written with good English.

Tweets are short-form, messy, use colloquialisms, can be compressed (e.g. using contractions) and rely on local context (both local in time and social group). Linguistically a lot is expressed in 140 characters and it doesn’t look like “good English”.

A second problem with existing APIs is that they cannot be trained and often don’t know about European brands, products, people and places. I plan to build a classifier that learns whatever you need to classify.

Examples for disambiguation will include Apple vs apple (brand vs e.g. fruit/drink/pie), Seat vs seat (brand vs furniture), cold vs cold (illness vs temperature), ba (when used as an abbreviation for British Airways).

The goal of the June project will be to out-perform existing Named Entity Recognition APIs for well-specified brands on Tweets, developed openly with a liberal licence. The aim will be to solve new client problems that can’t be solved with existing APIs.

I’ll be using Python, NLTK, scikit-learn and Tweet data. I’m speaking on progress at BrightonPy and DataScienceLondon in June.
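To make the idea of a trainable disambiguator concrete, here is a minimal sketch of a bag-of-words naive Bayes classifier in plain Python 3. It is not the project's code: the training tweets, the labels "brand" and "not-brand" and the helper names are all invented for illustration, and a real attempt would more likely use scikit-learn with far more labelled data.

```python
import math
import re
from collections import Counter

# Hypothetical hand-labelled tweets, invented for this sketch.
TRAINING = [
    ("I love my apple, though leopard can be a pain", "brand"),
    ("new apple iphone and macbook announced today", "brand"),
    ("my apple laptop needs an update", "brand"),
    ("Really enjoying this apple, very tasty", "not-brand"),
    ("apple pie and cider for tea", "not-brand"),
    ("picked a fresh apple from the tree", "not-brand"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count word frequencies per label for a multinomial naive Bayes model."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of tweets
    vocabulary = set()
    for text, label in examples:
        words = tokenize(text)
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest Laplace-smoothed log-probability."""
    word_counts, label_counts, vocabulary = model
    total_tweets = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_tweets)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("my apple macbook is great", model))
print(classify("this apple is very tasty", model))
```

The point of the sketch is only that the classifier learns from whatever labelled examples it is given, so the same code could be retrained for Seat vs seat or cold vs cold by swapping the training list.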

Probably for now I should focus on having no computer on my honeymoon…

1 Comment | Tags: ArtificialIntelligence, Life, Python

17 April 2013 - 11:38
Visualising London, Brighton and the UK using Geo-Tweets

Recently I’ve been grabbing Tweets for some natural language processing analysis (in Python using NetworkX and NLTK) – see this PyCon and PyData conversation analysis. Using the London dataset (visualised in the PyData post) I wondered if the geo-tagged tweets would give a good-looking map of London. It turns out that they do:

london_all_r1_nomap

You can see the bright centre of London, the Thames is visible wiggling left-to-right through the centre. The black region to the left of the centre is Hyde Park. If you look around the edges you can even see the M25 motorway circling the city. This is about a week’s worth of geo-filtered Tweets from the Twitter 10% firehose. It is easier to locate using the following Stamen tiles:

london_all_r5

Can you see Canary Wharf and the O2 arena to its east? How about Heathrow to the west edge of the map? And the string of reservoirs heading north north east from Tottenham?

Here’s a zoom around Victoria and London Bridge, we see a lot of Tweets around the railway stations, Oxford Street and Soho. I’m curious about all the dots in the Thames – presumably people Tweeting about their pleasure trips?

centrallondon_r3_map

Here’s a zoom around the Shoreditch/Tech City area. I was surprised by the cluster of Tweets in the roundabout (Old Street tube station), there’s a cluster in Bonhill Street (where Google’s Campus is located – I work above there in Central Working). The cluster off of Old Street onto Rivington Street seems to be at the location of the new and fashionable outdoor eatery spot (with Burger Bear). Further to the east is a more pubby/restauranty area.

london_shoreditch_all

I’ve yet to analyse the content of these tweets (doing something like phrase extraction from the PyCon/PyData tweets onto this map would be great). As such I’m not sure what’s being discussed, probably a bunch of the banal along with chitchat between people (“I’m on my way”…). Hopefully some of it discusses the nearby environment.

I’m using Seth’s Python heatmap (inspired by his lovely visuals). In addition I’m using Stamen map tiles (via OpenStreetMap). I’m using curl to consume the Twitter firehose via a geo-defined area for London, saving the results to a JSON file which I consume later (shout if you’d like the code and I’ll put it in github) – here’s a tutorial.
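The step between the saved JSON file and a heatmap can be sketched as below. This is a rough Python 3 stand-in for the binning stage only, not the heatmap library's code. It assumes the capture file holds one tweet per line and that geo-tagged tweets carry a GeoJSON "coordinates" field; the bounding box, grid size and sample lines are invented for illustration.

```python
import json
from collections import Counter


def bin_geo_tweets(lines, lat_range, lon_range, grid_size=100):
    """Count geo-tagged tweets per grid cell inside a bounding box.

    `lines` is an iterable of JSON strings, one tweet per line.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # streaming captures contain blank keep-alive lines
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip truncated or malformed records
        point = tweet.get("coordinates")
        if not point:
            continue  # tweet has no exact location
        lon, lat = point["coordinates"]  # GeoJSON order is [longitude, latitude]
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue
        row = int((lat - lat_min) / (lat_max - lat_min) * grid_size)
        col = int((lon - lon_min) / (lon_max - lon_min) * grid_size)
        counts[(row, col)] += 1
    return counts


# Invented sample records: one located tweet, one without a location,
# one blank keep-alive line and one broken record.
sample_lines = [
    '{"text": "at Old Street", "coordinates": '
    '{"type": "Point", "coordinates": [-0.0876, 51.5256]}}',
    '{"text": "no location", "coordinates": null}',
    "",
    "not json",
]
grid = bin_geo_tweets(sample_lines, lat_range=(51.0, 52.0),
                      lon_range=(-1.0, 1.0), grid_size=10)
print(dict(grid))
```

Each cell count can then be mapped to a brightness value, which is roughly what gives the bright centre and dark parks in the images above.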

During London Fashion Week I grabbed the tagged tweets (for “#lfw” and those mentioning “london fashion week” in the London area), if you zoom on the official event map you’ll see that the primary Tweet locations correspond to the official venue sites.

lfw

What about Brighton? Down on the south coast (about 1 hour on the train south of London), it is where I’ve spent the last 10 years (before my recent move to London). You can see the coastline, also Sussex University’s campus (north east corner). Western Road (the thick line running west a little way back from the sea) is the main shopping street with plenty of bars.

brighton_gps_to0103_nomap

It’ll make more sense with the Stamen tiles, Brighton Marina (south east corner) is clear along with the small streets in the centre of Brighton:

brighton_gps_to0403_map

Zooming to the centre is very nice, the North Laines are obvious (to the north) and the pedestrianised area below (the “south laines”) is clear too. Further south we see the Brighton Pier reaching into the sea. To the north west on the edge of the map is another cluster inside Brighton Station:

brighton_gps_to0403_map_zoomed

Finally – what about all the geo-tagged Tweets for the UK (annoyingly I didn’t go far enough west to get Ireland)? I’m pleased to see that the entirety of the mainland is well defined, I’m guessing many of the tweets around the coastline are more from pretty visiting points.

uk_gps_to0404_map_r5_zoomed

How might this compare with a satellite photograph of the UK at night? Population centres are clearly visible but tourist spots are far less visible, the edge of the country is much less defined (via dailymail):

Europe satellite

I’m guessing we can use these Tweets for:

  • Understanding what people talk about in certain areas (e.g. Oxford Street at rush-hour?)
  • Learning why foursquare check-ins (below) aren’t in the same place as tweet locations (can we filter locations away by using foursquare data?)
  • Seeing how people discuss the weather – is it correlated with local weather reports?
  • Learning if people talk about their environment (e.g. too many cars, poor London tube climate control, bad air, too noisy, shops and signs, events)
  • Seeing how shops, gigs and events are discussed – could we recommend places and events in real time based on their discussion?
  • Figuring out how people discuss landmarks and tourist spots – maybe this helps with recommending good spots to visit?
  • Looking at the trail people leave as they Tweet over time – can we figure out their commute and what they talk about before and after? Maybe this is a sort of survey process that happens using public data?

Here are some other geo-based visualisations I’ve recently seen:

If you want help with this sort of work then note that I run my own AI consultancy, analysing and visualising social media like Twitter is an active topic for me at present (and will be more so via my planned API at annotate.io).


Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

12 Comments | Tags: ArtificialIntelligence, Data science, Entrepreneur, Life, Python

15 April 2013 - 14:03
More Python 3.3 downloads than Python 2.7 for past 3 months

Since PyCon 2013 I’ve been in a set of conversations that start with “should I be using Python 3.3 for science work?”. Here’s a recent reddit thread on the subject. Last year I solidly recommended using Python 2.7 for scientific work (as many key libraries weren’t yet supported). I’m on the cusp of changing my recommendation.

Update: there’s a nice thread on Reddit/r/python discussing what’s required and where the numbers are coming from.

I last looked at the rate of Python downloads via ShowMeDo during 2008 when Python 2.5 was the top dog. The Windows 2.5.1 installer was getting 500,000 downloads a month. In the last 3 months I’m pleasantly surprised to see that Python 3.3 for Windows is downloaded more each month than Python 2.7. We can see:

  • March 2013 Python 3.3 for Windows has 647k downloads vs Python 2.7 with 630k
  • February 2013 Python 3.3 for Windows has 553k downloads vs Python 2.7 with 498k
  • January 2013 Python 3.3 for Windows has 533k downloads vs Python 2.7 with 495k (Python 2.7 less popular since January 2013)
  • December 2012 Python 3.3 for Windows has 412k downloads vs Python 2.7 with 525k

These figures only tell a part of the story of course. For Windows you have to download Python. On Linux and Mac it comes pre-installed (so we can’t measure those numbers).

Python 2.7 has been the default on Ubuntu for a while, that’s changing with Ubuntu 13.04. There are two lists of Python-3 compatible packages, it seems that Django is on this list and at PyCon there was a how-to-port-to-py3 video (not Flask yet; update: Armin is tweeting for sprint help for Py3 support), SQLAlchemy is (but not MySQL-python), Fabric isn’t ready yet. For web-dev it seems to be a mixed bag but I’m guessing Python 3 support will be across the board this year.

For scientific use we already have Python-3 compatible numpy, scipy and matplotlib. scikit-learn is ‘nearly’ ported, Pillow (the recent fork of PIL) is ready for Python 3. NLTK is also being ported.

For scientific use around natural language processing the switch to unicode-by-default looks most attractive (the mix of strings and unicode datatypes has burnt hours for me over the years in Python 2.x). Here’s a PyCon video on the use of Python 3 for text processing and this reviews why Python 3.3 is superior to Python 2.7.

It is slightly too early for me yet to want to switch but I’m starting to experiment. I’ve added some __future__ imports to new code so I know I’m writing Python 2.7 in a 3-like style. I’m also increasingly using Ned Batchelder’s coverage.py via nosetests to make sure I have good coverage. I currently run 2to3 to check that things convert cleanly to Python 3 but rarely run the result with Python 3 (I haven’t needed to do this yet). There’s a set of useful advice on python3porting including various __future__ imports (including division, print_function, unicode_literals, absolute_import).
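As a small illustration of the four __future__ imports named above, the snippet below runs unchanged on Python 2.7 and Python 3. The variable names are just for the example.

```python
from __future__ import absolute_import, division, print_function, unicode_literals

# With `division`, / is true division on both 2.7 and 3.x;
# use // when floor division is really wanted.
half = 1 / 2
floored = 1 // 2

# With `unicode_literals`, plain string literals are unicode text on 2.7,
# matching the Python 3 default, so len() counts characters rather than bytes.
greeting = "caf\u00e9"

# With `print_function`, print is a function, so keyword arguments work.
print(half, floored, len(greeting), sep=" | ")
```

Note that a `from __future__` import has to be the first statement in a module (only a docstring and comments may come before it), which is why it sits at the very top here.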



28 Comments | Tags: Life, Python

2 April 2013 - 8:32
Applied Parallel Computing (PyCon 2013 Tutorial) slides and code

Minesh B. Amin (MBASciences) and I (Mor Consulting Ltd) taught Applied Parallel Computing over 3 hours at PyCon 2013. PyCon this year was a heck of a lot of fun, I did the fun run (mentioned below), received one of the 2,500 free RaspberryPis that were given away, met an awful lot of interesting people and ran two birds-of-a-feather sessions (parallel computing for our tutorial, another on natural language processing).

I held posting this entry until the video was ready (it came out yesterday). All the code and slides are in the github repo. Currently (but not indefinitely) there’s a VirtualBox image with everything (Redis, Disco etc) pre-installed.

After the conference, partly as a result of the BoF NLP session I created a Twitter graph “Concept Map” based on #pycon tweets, then another for #pydata. They neatly summarise many of the topics of conversation.

Here’s our room of 60+ students, slides and video are below:

Applied Parallel Computing PyCon 2013 (left side of room)

Applied Parallel Computing PyCon 2013 (left side)

The video runs for 2 hours 40:

Here’s a list of our slides:

  1. Intro to Parallelism (Minesh)
  2. Lessons Learned (Ian)
  3. List of Tasks with Mandelbrot set (Ian)
  4. Map/Reduce with Disco (Ian)
  5. Hyperparameter optimisation with grid and random search (Minesh)

These are each of the slide decks:

 

I also had fun in the 5k fun run (coming around 77th of 150 runners), we raised $7k or so for cancer research and the John Hunter Memorial Fund.



5 Comments | Tags: ArtificialIntelligence, Life, Python

22 March 2013 - 1:16
Analysing #pydata, London and Brighton tweets for concept mapping

Below I’ve visualised tweets for #PyData conference and the cities of London and Brighton – this builds on my ‘concept cloud’ from a few days ago at the #PyCon conference. Props to Maksim for his Social Media Analysis tutorial for inspiration.

Update – Maksim’s Analysing Social Networks tutorial video is online.

For the earlier #PyCon 2013 analysis I visualised #hashtags and @usernames from #pycon tagged tweets during the conference. I’ve built upon this to add some natural language processing for ‘noun phrase extraction’ which I detail below – this helps me to pull out phrases that are descriptive but haven’t been tagged. It also helps us to see which people are connected with which subjects. For the PyCon analysis I collected 22k tweets, after removing retweets I was left with 7,853 for analysis.

#PyData (PyData Santa Clara 2013)

pydata_weds_afternoon

PyData 2013 is a much smaller conference than PyCon (PyCon had 2,500 people and 20% female attendance, PyData had around 400 with 10% female attendance). Being smaller it had far fewer tweets – after removing retweets I had just 225 tweets to analyse. Cripes! This is clearly not big data. The other problem was that people weren’t using many #hashtags, they were referring to topics using natural language. For example:

“Peter Norvig was giving a talk at PyData in Santa Clara, CA on the topic of innovation in education.” (source)

Clearly some natural language processing was required. I took two approaches:

  • Extract capitalised sub-phrases (e.g. “Peter Norvig”, “Santa Clara”) of one or more words
  • Use NLTK’s bigram collocation analyser (to find lowercased phrases such as “ipython notebook”, “machine learning”)
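The two approaches above can be approximated with the standard library alone. This is a hedged Python 3 sketch rather than the code behind the plot: the regular expression is an assumption about what counts as a capitalised sub-phrase, and the bigram function uses raw pair counts as a simple stand-in for NLTK's collocation analyser, which ranks pairs with statistical association measures instead. The three short example tweets are invented.

```python
import re
from collections import Counter

# One or more consecutive words that each start with a capital letter
# followed by lowercase letters (so "PyData" and "CA" are skipped).
CAPITALISED_PHRASE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def capitalised_phrases(text):
    """Return runs of one or more capitalised words, e.g. 'Peter Norvig'."""
    return CAPITALISED_PHRASE.findall(text)


def frequent_bigrams(tweets, min_count=2):
    """Count lowercased adjacent word pairs across tweets, keep repeated ones."""
    counts = Counter()
    for tweet in tweets:
        words = re.findall(r"[a-z]+", tweet.lower())
        counts.update(zip(words, words[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


quote = ("Peter Norvig was giving a talk at PyData in Santa Clara, CA "
         "on the topic of innovation in education.")
print(capitalised_phrases(quote))

# Invented example tweets for the bigram step.
example_tweets = [
    "loving the ipython notebook demo",
    "ipython notebook plus machine learning",
    "machine learning on a raspberrypi",
]
print(frequent_bigrams(example_tweets))
```

A capitalised word at the start of a sentence would also be picked up by this pattern, so some extra filtering (for example a stop-word list) would be needed on real tweets.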

Starting at the bottom of the plot we see three types of colour:

  • white is for #hashtags
  • light blue is for @usernames
  • dark green is for phrases (extracted using natural language processing)

We see a cluster of references around @fperez_org (Fernando Perez of IPython), one cluster is around @swcarpentry (the scientist-friendly software carpentry movement), the other is around IPython and the IPython Notebook (@minrk of IPython/parallel is linked too). I like the connection to Julia – Fernando discussed during his keynote that Julia now interoperates with Python.

The day before we had Peter Norvig (Director of research at Google) giving a keynote on the use of Python in education at Udacity including a discussion of how machine learning could be used to identify the mistakes that new coders make so we could make friendlier error messages to help students correct their code. See the clustering around this at the top of the graph.

Later the same day Henrik (@brinkar) spoke on Wise.io‘s Random Forest classifier. Their approach was efficient enough to demo live on a RaspberryPi. The connection from Peter to Henrik goes via #venturebeat who covered wise.io’s new software release during the conference.

Connecting IPython and Wise.io is @ogrisel (Olivier Grisel) of scikit-learn. He gave an impressive (and given the variability of conference wifi – slightly ballsy) live demo of scaling a machine learning system via IPython Parallel on EC2.

In the middle we see @teoliphant (Travis Oliphant) joined to Continuum (his company). Off to the right I get to blow my own trumpet – the phrases “awesome python” and “network analysis” connect to “russel brand” which is how one wag described my lightning talk. I got a chance to demo the earlier version of this at the end of @katychuang‘s talk on networkx.

London (geo-tagged tweets)

londonout

For the last month I’ve been grabbing tweets in the London geo area for another project. I had to raise my filtering levels to bring the network down to a sane (and easily visualised) number of nodes. After removing ReTweets I have 497,771 tweets from just a subset of my data. Some obvious clusters can be seen:

  • #weather and #rain and (presumably a rather wet) “St Albans” (a very British discussion)
  • The “O2 Arena” near the centre with “Justin Beiber” and #believetour, linked with #amazing, #excited, #nowplaying
  • @onedirection must have been playing (connected with band members @louis_tomlinson and @real_liam_payne amongst others)
  • To the top-right we have a football cluster with “Manchester United”, “Champions League”, #cpfc, #realmadrid and “Old Trafford”
  • The usual tourist spots like “Tower Bridge”, “Covent Garden”, “Hyde Park”, “Big Ben”, “Trafalgar Square” are discussed with #happy #sun #loveit, linked just off of here is “London Heathrow Airport” and “New York”

Brighton (geo-tagged tweets)

brighton

This is my favourite, analysed using 40,379 tweets after removing ReTweets. The nature of the two cities (Brighton is 50 miles south of London on the coast, it is a university town with a young & party-friendly population) is quite apparent:

  • Top left there is discussion around “One Direction”, #justinbeiber and #seo (a particular Brighton tech thing)
  • Just south of @justinbieber is a single chain of not-safe-for-work ranting (another particular Brighton thing)
  • If you jump to the bottom right you’ll see #underwear, #lingerie, #teenagers – not as dodgy as you might expect, Sweetling were doing a social media bra campaign
  • #hove is joined with #sunny #morning and nearby places #lewes #shoreham
  • #brightonbeach and “Brighton Pier” connect with #birds (Seagulls – a bane!) and #sun
  • #friends, #memories, #happy, #goodtimes, #marina, #fun, #girls cluster around the centre (Brighton does like a party)
  • Off down to the bottom left is some sort of political discussion (what were they doing in Brighton?)

Reproducing this

All the code is in github at twitter_networkx_concept_map including the one line cURL command to capture the data. An example .gephi file is included for visualisation in Gephi. The built-in networkx viewer (optionally using GraphViz) works reasonably well but isn’t interactive. Maksim’s tutorial and utils class were jolly useful (utils is in my repo), I’m also using twitter-text-python for parsing @usernames, #hashtags and URLs from the tweets.

If you want some custom work around this, give me a shout via Mor Consulting.



19 Comments | Tags: ArtificialIntelligence, Life, Python

18 March 2013 - 7:36
Semantic map of PyCon2013 Twitter Topics

Maksim taught a lovely Social Graph Analytics course at PyCon the day before I taught Applied Parallel Computing. I took his demo for a “poor man’s LDA/LSI analysis” of a Twitter topic (rather than using full LDA it just uses co-incident hashtags) and added usernames to produce the plot below.

Update: Analysing #pydata conference (and the cities London and Brighton) tweets using NLTK and NetworkX added as a second post.

pycon2013_hashtags_usernames

White nodes are hashtags (e.g. #raspberrypi centres the left white cluster), purple is for usernames (e.g. organiser @jessenoller is in the centre, Python’s creator @gvanrossum is between #raspberrypi and Jesse, @dabeaz and @raymondh are near the centre). We see a strongly connected cluster of people and hashtags along with several disconnected sets.

Over the course of PyCon I’ve collected all the #pycon tagged Tweets using the 1% Twitter Firehose (via a 1 line curl command). I have some Tweet parsing code which transforms this data into useful subsets (originally I was working on 2D geo-tagged plots of London and Brighton – to be posted later), in this case I extract the hashtags and usernames from each tweet using twitter-text-python and then build edges in a graph for each pair of mentions that occur in a tweet. E.g.:

“really cool stenography talk by @stenoknight at #PyCon – she still uses #vim with #plover”

will cause a link to form between #pycon and #vim, #pycon and #plover, #vim and #plover. The width of edges in the diagram corresponds to the number of times the same hashtags (and users) are linked in each tweet. To understand which people are related to each concept I added usernames so in the above example edges are also formed between @stenoknight and the three hashtags.
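The edge-building rule described above can be sketched in a few lines of Python 3. This is an illustration rather than the repository's code: a crude regular expression stands in for twitter-text-python, and a Counter of pairs stands in for a networkx graph with weighted edges.

```python
import re
from collections import Counter
from itertools import combinations

# Crude stand-in for a real tweet parser: a # or @ followed by word characters.
ENTITY = re.compile(r"[#@]\w+")


def add_tweet_edges(tweet_text, edge_weights):
    """Add one edge per pair of distinct #hashtags/@usernames in a tweet."""
    entities = sorted({entity.lower() for entity in ENTITY.findall(tweet_text)})
    for pair in combinations(entities, 2):
        edge_weights[pair] += 1  # weight = number of tweets linking the pair


edge_weights = Counter()
add_tweet_edges(
    "really cool stenography talk by @stenoknight at #PyCon – "
    "she still uses #vim with #plover",
    edge_weights,
)
for pair, weight in sorted(edge_weights.items()):
    print(pair, weight)
```

Four entities in one tweet give six edges: three between the hashtags and three joining @stenoknight to each hashtag. Sorting the entities before pairing means the same two nodes always produce the same key, so repeated co-mentions add to one weight.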

If you open up a larger version of the image (click the main image) you can follow some of the detail. The #raspberrypi tag is interesting – lots of prominent projects are mentioned alongside (e.g. #pandas, #django). Just below the main cluster is a subcluster on #robots #vision #hackers – these are joined to the main #raspberrypi cluster by the adjective #awesome (rather lovely!). All 2,500 attendees of PyCon were given a full Raspberry Pi Model B during the Friday morning keynote by Eben Upton and during the weekend a RaspberryPi hacklab taught many people how to add hardware and use Python on the device.

In the centre we see a lot of people – many people mention each other or are linked by others (e.g. prominent speakers) in their tweets. I filter out ReTweets so we’re only looking at mentions of people inside one tweet if someone has written that tweet afresh. The legendary Testing in Python Birds of a Feather session (#tipbof) on the right is linked to a few prominent folk.

#openscience and #openaccess are well linked to the south of the main cluster, connected to the main group via clusters of people.

I’m quite intrigued by the @styleseat link out to #nailjerks #pixiedust #nailart to the north, they ran a manicure/pedicure session in connection with #pyladies.

pycon_tags_people2_annotated_sundayguido

Guido gave a keynote this morning and discussed async programming – a new cluster formed (see zoom from earlier analysis shown above) from yesterday’s data with #tulip #sunday #pep3156 whilst talking about PEP 3156. It is interesting to note the time-based nature of the clusters (which we can’t see in this single 2D image, maybe I ought to animate it?).

Update: I’ve added the plot below using the Community Detection feature of Gephi, it shows Guido’s async tag set as a separate cluster. #raspberrypi has a nicely large cluster, web servers have their own too.

pycon_tags_people_communities

Due to yesterday’s PyCon 5k Fun Run there’s a disconnected cluster for #10minutemile #shootforthestars #ugh to the north – 150 of us (of 2500 attendees) ran at 7am, we raised $7k towards cancer research.

It is worth mentioning that I removed some of the more prominent nodes as many of the other topics connect to these so they add little information:

  • #pycon
  • #python
  • #pycon2013
  • @pycon
  • @top_webtech @inowgb (spammy)
  • any username node with fewer than 50 occurrences
  • any hashtag node with only 1 occurrence
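The removal rules in this list could be expressed roughly as follows. It is a Python 3 sketch under assumptions: nodes are lowercase strings starting with # or @, `occurrences` maps each node to how often it appeared, edges are stored as a dict of pairs to weights, and the sample counts are invented.

```python
from collections import Counter

# Nodes the list above names for removal.
EXCLUDED = {"#pycon", "#python", "#pycon2013", "@pycon", "@top_webtech", "@inowgb"}


def keep_node(node, occurrences, min_user_count=50, min_tag_count=2):
    """Decide whether a node survives the filtering rules listed above."""
    if node in EXCLUDED:
        return False
    if node.startswith("@"):
        return occurrences[node] >= min_user_count  # drop rarely seen users
    return occurrences[node] >= min_tag_count  # drop hashtags seen only once


def filter_edges(edge_weights, occurrences):
    """Keep only edges whose two endpoints both survive the node filter."""
    return {
        pair: weight
        for pair, weight in edge_weights.items()
        if all(keep_node(node, occurrences) for node in pair)
    }


# Invented counts and edges, purely to exercise the rules.
occurrences = Counter({"#pycon": 900, "#vim": 12, "#plover": 1,
                       "@stenoknight": 60, "@quietuser": 3})
edges = {
    ("#pycon", "#vim"): 10,
    ("#vim", "@stenoknight"): 4,
    ("#plover", "@stenoknight"): 1,
    ("#vim", "@quietuser"): 2,
}
print(filter_edges(edges, occurrences))
```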

I’ll add the code to github tomorrow. Tools used include twitter-text-python, networkx and Gephi. Update: the code is in github as twitter_networkx_concept_map.

If I get time whilst here I’ll do some more analysis on the data. I’d love to use a named entity tool or some parsing to extract obvious nouns (e.g. packages and topics) that aren’t #hashtagged.



19 Comments | Tags: Life, Python

15 March 2013 - 23:22
Use of VirtualBox to prepare students (PyCon tutorials)

Minesh and I ran a tutorial (Applied Parallel Computing) at PyCon 2013 yesterday, we’ve been working on building and distributing a VirtualBox (7GB) for students to simplify the teaching with a unified, preconfigured environment. This process took a while, below are my notes. Others (e.g. Kat teaching SimpleCV) also had a VirtualBox.

The upside of a VirtualBox is that everyone has a unified environment, so students see on their screen exactly what you have on your screen. The downside is that this doesn’t help them install the tools onto their laptop for normal use. If you’re teaching a medley of tools (as we were) and especially if some require non-trivial installation (e.g. Disco map/reduce for us) then VirtualBoxes are a clear win.

  • We zipped the directory containing the VDI file, Kat used a single OVF file (both for VirtualBox), I think the single OVF file might be easier to distribute and might work in other (non-VirtualBox) environments. Our zip took 7GB down to 2.2GB
  • Your VirtualBox will be configured for you…but students might have foreign keyboards (e.g. Minesh made our VBox image with a US keyboard, I have a UK keyboard, some students have German etc keyboards) – provide notes on how to reconfigure the Guest OS so the student can setup their keyboard
  • git clone a read-only repo into the VBox, students can then just git pull to get updates
  • We added a run_this_to_confirm_you_have_the_correct_libraries.py script, it checks that everything is installed, students can run this to double check that their install is good
  • Use a standard user and password – we used “pycon:pycon”
  • I made a YouTube screencast using RecordMyDesktop (with desktop compositing disabled to reduce flicker)
  • Bundle everything into a blog post that you can easily update – here are our install notes and video
  • A large zip is harder to distribute – I linked to the zip on my blog (I have lots of bandwidth) and created a torrent using the super-easy burnbit site (here’s my download page) – you can see the torrent link on the install notes page linked above
  • You probably want to use a 32 bit OS for the Guest OS (we used Linux Mint 14 32 bit), a 64 bit Guest OS won’t run on a 32 bit system (but a 32 bit Guest OS will run on a 64 bit host)
  • Despite linking our tutorial notes to the tutorial page on the PyCon website (and mailing students), many didn’t have a preinstalled environment – we had a set of USB Thumb Drives which simplified the setup. Our first 30 minutes was talking so students had time to install the VBox
  • Github is a great place to store code, data (if not huge) and slides


6 Comments | Tags: Life, Python

7 March 2013 - 17:10
PowerPoint: Brief Introduction to NLProc. for Social Media

For my client (AdaptiveLab) I recently gave an internal talk on the state of the art of Natural Language Processing around Social Media (specifically Twitter and Facebook), having spent a few days digesting recent research papers. The area is fascinating (I want to do some work here via my Annotate.io) as the text is so much dirtier than in long form entries such as we might find with Reuters and BBC News.

The Powerpoint below is just the outline, I also gave some brief demos using NLTK (great Python NLP library).

 



2 Comments | Tags: ArtificialIntelligence, Data science, Life

7 March 2013 - 16:54
ANN: twitter-text-python 1.0.0.2 release (Python Tweet parsing library)

A few weeks back I took over as maintainer of the twitter-text-python library (source on github). This library lets you take a tweet like:

"@ianozsvald, you now support #IvoWertzel's tweet ...
parser! https://github.com/ianozsvald/"

and extract the Twitter entities as defined in the Twitter conformance tests. The entities in the above tweet would be:

  • reply: 'ianozsvald'
  • users: ['ianozsvald']
  • tags: ['IvoWertzel']
  • urls: ['https://github.com/ianozsvald/']
  • lists: []  # no lists in this tweet
  • output html: u'<a href="http://twitter.com/ianozsvald">@ianozsvald</a>, ...
      you now support <a href="http://search.twitter.com/search?q=%23IvoWertzel">#IvoWertzel</a>\'s
      tweet parser! <a href="https://github.com/ianozsvald/">https://github.com/ianozsvald/</a>'

If you’re parsing Tweets or status-update-like-entities (from e.g. App.net) in Python then this library makes it easy to extract @people, URLs and #hashtags. You can also request the spans (character locations) for each entity, very useful if you have repeated phrases and you’re doing a search/replace.
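To show why spans help with repeated phrases, here is a standard-library Python 3 sketch of the idea. It does not use twitter-text-python and makes no claim about the exact shape of that library's span output (check its README for that); the regular expression, helper names and example tweet are assumptions for illustration.

```python
import re

# Simplified hashtag pattern; the real conformance rules are stricter.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_spans(text):
    """Return (tag, (start, end)) for every hashtag, in order of appearance."""
    return [(match.group(1), match.span()) for match in HASHTAG.finditer(text)]


def replace_span(text, span, replacement):
    """Replace exactly one occurrence, identified by its character span."""
    start, end = span
    return text[:start] + replacement + text[end:]


tweet = "#python tips: learn #python with #python3"
spans = hashtag_spans(tweet)

# Rewrite only the second #python, leaving the first and #python3 untouched.
second_tag, second_span = spans[1]
rewritten = replace_span(tweet, second_span, "#pycon")
print(rewritten)
```

A plain `tweet.replace("#python", ...)` would have changed all three tags here, including the prefix of #python3, which is the problem that character locations avoid.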

The library is easily installed using “$ pip install twitter-text-python” (MIT license) via the Python Package Index, currently at version 1.0.0.2.

Credit – the library was developed by Ivo Wertzel (BonsaiDen on github), I merged a few pull requests after forking to fix some bugs and have now taken over official maintenance.



7 Comments | Tags: Life, Python

18 February 2013 - 23:43
PyCon Tutorial Notes for Applied Parallel Computing

This post is for students of the Applied Parallel Computing tutorial that Minesh B. Amin and I will run during March 2013 at PyCon. This is a wiki-post, I’ll update it over the next month. If you are attending the tutorial you must check this post in the run-up to the tutorial. Important notes are below for you to read now. This is linked to from our PyCon Tutorial Support page.

If you come to this after the tutorial you’ll probably find this useful for setup. The following is for my students:

  • Check this post before you come to PyCon, you will be expected to have followed instructions and installed the software and updates before the tutorial
  • You won’t have time to install/setup during the tutorial, you must arrive prepared, we have a lot to work through and we’ll start immediately
  • Accepting that the PyCon wifi has been great in past years you must assume that wifi will be broken – come prepared with a fully working environment
  • We recommend strongly that you use our VirtualBox (it has all the libs and the github repo pre-installed, it is open source, it’ll run on Win/Mac/Linux), if you install your own package set then we can’t help you if it doesn’t work as expected (it is also quite fiddly to setup yourself) – you can of course buddy-up with someone else during the tutorial if required

You will be able to get the VirtualBox (about 7GB) from this post in the next week, you’ll be better off using the torrent that we’ll provide (please seed if you can, if possible all the way until the tutorial runs to help fellow students).

Download link for VirtualBox (required!) for the tutorial:

(v1.1 torrent deleted as it didn’t run cleanly on Macs)

PyCON-2013_AppliedParallelComputing1.2.zip torrent (very robust – resume if download breaks, 2.2GB zip decompresses to 6.9GB) or via direct download (more brittle – no resume if the download breaks).


md5sum: ce43b52a18ca913e62842ae72cc8df74
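If the md5sum command isn’t available (e.g. on Windows), the checksum can be verified with a few lines of Python 3. This is a minimal sketch; the local file path in the commented-out line is an assumption about where the zip was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in 1MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for a multi-gigabyte download.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED = "ce43b52a18ca913e62842ae72cc8df74"  # md5sum quoted above

# Hypothetical local path; adjust to wherever the zip was saved.
# print(md5_of_file("PyCON-2013_AppliedParallelComputing1.2.zip") == EXPECTED)
```

If the comparison prints False the download is incomplete or corrupted and should be fetched again before unzipping.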

NOTE – I had the v1.1 version linked in the torrent above for a few days – if you got that and you can’t start the VirtualBox, just right-click in VirtualBox and discard the saved state, then restart the image. If you have the v1.2 version (linked as of March 4th) then you’re fine.

Video – this YouTube Video Demo (7 minutes) shows you how to install the image.

Instructions:

  1. Unzip to a directory with 7GB of disk space (MAC USERS – the built-in unzip doesn’t seem to handle 64 bit files, use 7zip for success [maybe Windows users too?])
  2. Open VirtualBox (optional but useful – add the extension pack for host integration)
  3. Machine | Add and open the directory that contains the .vdi and .vbox files
  4. Start the machine, it’ll boot to the Linux desktop
  5. Open the web link on the Desktop if you want to see the latest version of this blog post
  6. Double click the “Download GITHUB Repo” script on the desktop and it’ll refresh the repository (in case we’ve added new code)
  7. Familiarise yourself with the environment (Linux Mint 14), GTK Vim and emacs are installed
  8. Open a terminal and run ./pycon2013_applied_parallel_computing/run_this_to_confirm_you_have_the_correct_libraries.py (from the home directory) which confirms to you that the necessary Python libraries are installed (I’ve done this, you can do it for confirmation)

The VirtualBox is a fully configured Linux Mint 14 32 bit (based on Ubuntu 12.10) distribution, with gui, also with gvim installed. Feel free to add anything else. You don’t need to bother installing further system updates, the OS was up to date when we released it. It is configured to provide 2 CPUs and 3GB RAM – you might need to reduce these figures to get it running on your machine.

It runs on my 64 bit laptop (Linux Mint 13 64 bit) and on 32 bit machines, it should work equally well on Windows and Mac (we’ve tested it on both). You should install the Guest Additions (when the Ubuntu installation has booted use the Devices menu at the top of the VirtualBox window and “install guest additions”) as they provide integration features like a shared clipboard for copy/paste with your host OS.

Instructions if you can’t/won’t use our VirtualBox (but you’re on your own in this case):

You can get the github repo here – if you set this up yourself then we can’t offer help if it doesn’t work (go to the relevant forums and ask there). There is a test script in the root of the repo (run_this_to_confirm_you_have_the_correct_libraries.py) which will confirm if you have the right libraries installed (it only checks for the presence of Disco, it doesn’t confirm that it is configured correctly). The README will give you some guidance but we really recommend that you get our VirtualBox (to be released in the next week via this post).



4 Comments | Tags: Life, Python