Entrepreneurial Geekiness

Ian is a London-based independent Chief Data Scientist who coaches teams, teaches and creates data products. More about Ian here.

Open Sourcing “The Screencasting Handbook”

Back in 2010 I released the finished version of my first commercial eBook, The Screencasting Handbook. It was 129 pages of distilled knowledge for the budding screencaster, written in part to introduce my (then) screencasting company ProCasts to the world (I sold it years back) and based on my experience teaching through ShowMeDo. Today I’m releasing the Handbook under a Creative Commons License. After three years the content is showing its age (the procedures are still good, but the software-specific information is well out of date); I moved out of screencasting a while back and have no plans to update this book.

The download link for the open-sourced version is at thescreencastinghandbook.com.

I’m using the Creative Commons Unported license. It allows anyone to derive a new version and/or make commercial use of the book without any additional permission from me, but it does require attribution. This is the most open license I can give that still returns a little bit of value to me (by way of attribution). The license must not be modified.

If someone would like to derive an updated version (with or without a price tag) you are very welcome to – just remember to attribute the original site and this site with my name, please (as noted at the download point). You cannot change the license (but if you wanted to make a derived, non-open-source version of the book for commercial use, I’m sure we can come to an arrangement).

Previously I’ve discussed how I wrote the Handbook in an open, collaborative fashion (with monthly chapter releases to the preview audience); this was a good process that I’d use again. Other posts discussing the Handbook are under the “screencasting-handbook” tag.


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

Social Media Brand Disambiguator first steps

As noted a few days back, I’m spending June working on a social-media-focused brand disambiguator using Python, NLTK and scikit-learn. This project has grown out of frustrations using existing Named Entity Recognition tools (like OpenCalais and DBPedia Spotlight) to recognise brands in social media messages. These tools are generally trained to work on long-form, clean text, and tweets are anything but long or cleanly written!

The problem is this: in a short tweet (e.g. “Loving my apple, like how it werks with the iphon”) we have little context to differentiate the sense of the word “apple”. As humans we see the typos and deliberate spelling errors and know that this use of “apple” refers to the brand, not the fruit. Existing APIs don’t make this distinction; typically they want a lot more text with fewer errors. I’m hypothesising that with a supervised learning system (using scikit-learn and NLTK) and hand-tagged data I can outperform the existing APIs.
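To make that concrete, here is a minimal sketch of the kind of supervised classifier I have in mind, assuming a small set of hand-tagged (tweet, label) pairs; the examples, feature choice and model below are purely illustrative, not the project’s actual code:

```python
# Minimal sketch: bag-of-words features + Naive Bayes to separate "apple" the brand
# from "apple" the fruit. All data and choices below are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-tagged tweets: 1 = brand usage, 0 = not the brand
tweets = [
    "Loving my apple, like how it werks with the iphon",
    "New apple macbook arrived at the office today",
    "Really enjoying this apple, very tasty",
    "Apple crumble and custard for pudding",
]
labels = [1, 1, 0, 0]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(tweets, labels)

print(model.predict(["my apple keeps crashing since the update"]))  # expect [1]
```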

I started on Saturday (freshly back from honeymoon) and a very small GitHub repo is online. Currently I can ingest tweets from a JSON file (captured using curl), marking the ones containing a brand and those containing the same word but not-a-brand (in-class and out-of-class) in a SQLite db. I’ll benchmark my results against my hand-tagged Gold Standard to see how I do.
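For illustration, the ingest-and-label step could look something like the sketch below; the file names, table layout and the assumption of one JSON tweet per line are mine, not the repo’s actual schema:

```python
# Sketch of loading curl-captured tweets into SQLite with an in-class/out-of-class
# label column. Names and schema are assumptions made for this example.
import json
import sqlite3

conn = sqlite3.connect("tweets.db")  # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, text TEXT, in_class INTEGER)"
)

with open("apple_tweets.json") as f:  # hypothetical capture, one JSON tweet per line
    for line in f:
        tweet = json.loads(line)
        # in_class is filled in later by hand tagging: 1 = brand, 0 = not the brand
        conn.execute(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
            (tweet["id_str"], tweet["text"], None),
        )
conn.commit()
```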

Currently I’m using my Python template to provide environment-variable-controlled configuration, simple logging, argparse and unit tests. I’ll also be using the twitter-text-python module (which I’m now supporting) to parse some structure out of the tweets.
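As a quick example, twitter-text-python pulls mentions, hashtags and URLs out of a tweet’s text; the snippet below is a small illustration of that parsing, not code from the disambiguator itself:

```python
# Illustration of twitter-text-python (ttp) extracting structure from a tweet
from ttp import ttp

parser = ttp.Parser()
result = parser.parse("Loving my apple! @emily have you seen http://example.com #shiny")
print(result.users)  # ['emily']
print(result.tags)   # ['shiny']
print(result.urls)   # ['http://example.com']
```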

I’ll be presenting my progress next week at Brighton Python. My goal is to have a useful MIT-licensed tool by the end of this month that is pre-trained with some obvious brands (e.g. Apple, Orange, Valve, Seat) and software names (e.g. Python, vine, Elite), with instructions so anyone can train their own models. Assuming all goes well, I can then plumb it into my planned annotate.io online service later.



Thoughts from a month’s backpacking honeymoon

I’m publishing this on the hoof – right now we’re in Istanbul, near the end of our honeymoon and on our way back home. Here are some notes on the travel apps we used (on our Nexus 4 Androids).

Google Translate offers offline dictionaries for all the European languages; each is 150 MB. We downloaded new ones before each country hop. Generally they were very useful, though some phrases were wrong or not colloquial (often for things like “the bill, please”). Some languages had pronunciation guides; they were OK, but a phrase book would be better. Overall it worked well as a glorified language dictionary.

Google Maps’ offline maps were great, except in Hungary where offline use wasn’t allowed (the app didn’t explain why).

The lack of phrase or dictionary apps was a pain; there’s a real dearth on Android. Someone should fill this gap!

WiFi was fairly common throughout our travels, so we rarely used our paper guides. It was free in all hotels, sometimes in train stations, and often in cafes and bars, even in Romania.

WikiSherpa caches recent search results pulled from Wikipedia and Wikivoyage, which makes it work like a poor man’s Rough Guide. It doesn’t link to any maps or cache images, but if you search for a city you can read up on it (e.g. landmarks, how to get a taxi, etc.) whilst you travel.

The official Wikipedia app has page saving, which is useful for background info on a city when reading offline.

AnyMemo is useful for learning phrases in new languages. It is chaotic, as the learning files aren’t curated, but you can edit the files to remove the phrases you don’t need and to add useful new ones.

Emily notes that TripAdvisor on Android doesn’t work well (the iPhone version was better but still not great). Emily also notes that hotels.com, lastminute and booking.com were all useful for booking most of our travels and hotels.

We used Foursquare when we had WiFi; sadly there is no offline mode, so I just starred locations using Google Maps. Foursquare needs a language-independent way of reading reviews – trying to figure out whether a series of Turkish reviews was positive based on the prevalence of smileys wasn’t easy (Google Translate integration would have helped). An offline Foursquare would have been useful (e.g. for finding cafes near our spot).

We really should have bought a WiFi 3G dongle. The lack of data was a pain. We used Emily’s £5 travel data day plans on occasion (via Three). It works for most of Europe but not Switzerland or Turkey.

Given that we have Wikipedia and Wiktionary, how come we don’t have a “WikiPhrases” (“wikilingo”?) with multi-language forms of common phrases? It would be just like the travel phrase books we can buy, but with good local phrases and idioms for any language that gets written up. This feels like it’d have a lot of value.



June project: Disambiguating “brands” in Social Media

Having returned from Chile last year, settled into consulting in London, got married and gone on honeymoon, I’m planning a change for June.

I’m taking the month off from clients to work on my own project: an open-sourced brand disambiguator for social media. As an example, it will detect that the following tweet mentions Apple-the-brand:
“I love my apple, though leopard can be a pain”
and that this tweet does not:
“Really enjoying this apple, very tasty”

I’ve used AlchemyAPI, OpenCalais, DBPedia Spotlight and others for client projects, and it turns out that these APIs expect long-form text (e.g. Reuters articles) written in good English.

Tweets are short-form, messy, use colloquialisms, can be compressed (e.g. using contractions) and rely on local context (local both in time and to the social group). Linguistically, a lot is expressed in 140 characters, and it doesn’t look like “good English”.

A second problem with existing APIs is that they cannot be trained and often don’t know about European brands, products, people and places. I plan to build a classifier that learns whatever you need to classify.

Examples for disambiguation will include Apple vs apple (brand vs e.g. fruit/drink/pie), Seat vs seat (brand vs furniture), cold vs cold (illness vs temperature), ba (when used as an abbreviation for British Airways).

The goal of the June project is to outperform existing Named Entity Recognition APIs at recognising well-specified brands in tweets, with the tool developed openly under a liberal licence. The aim is to solve new client problems that can’t be solved with the existing APIs.
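As a sketch of how “outperforming the existing APIs” could be measured, precision and recall against a hand-tagged gold standard are the obvious yardstick; the labels and numbers below are made up purely for illustration:

```python
# Illustrative benchmark: compare predicted labels against a hand-tagged gold standard.
from sklearn.metrics import precision_score, recall_score, f1_score

gold = [1, 1, 0, 0, 1, 0]        # hand-tagged labels (1 = brand, 0 = not the brand)
api_guess = [1, 0, 0, 1, 1, 0]   # hypothetical output from an existing NER API
my_guess = [1, 1, 0, 0, 1, 1]    # hypothetical output from the trained classifier

for name, pred in [("existing API", api_guess), ("my classifier", my_guess)]:
    print(name,
          "precision=%.2f" % precision_score(gold, pred),
          "recall=%.2f" % recall_score(gold, pred),
          "f1=%.2f" % f1_score(gold, pred))
```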

I’ll be using Python, NLTK, scikit-learn and Tweet data. I’m speaking on progress at BrightonPy and DataScienceLondon in June.

Probably for now I should focus on having no computer on my honeymoon…

