About


This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, founder of the Annotate.io social media mining API, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, co-founder of the SocialTies App, author of the A.I.Cookbook, author of The Screencasting Handbook, a Pythonista, co-founder of ShowMeDo and FivePoundApps and also a Londoner. Here's a little more about me.


10 May 2016 - 22:43 PyDataLondon 2016 Conference Write-up

We’ve just run our 3rd PyDataLondon Conference (2016) – 3 days, 4 tracks, 330 people. This builds on PyDataLondon 2015. It was ace! If you’d like to be notified about PyDataLondon 2017 then join this announce list (it’ll be super low volume like it has been for the last 2 years).

Big thanks to the organizers, sponsors and speakers, such a great conference it was. Being super tired going home on the train, but it was totally worth it. – Brigitta

We held it at Bloomberg UK again – many thanks to our hosts! I’d also like to thank my colleagues, review committee and all our volunteers for their hard work, the weekend went incredibly smoothly and that’s because our team is so on-top-of-everything – thanks!

Our keynote speakers were:

Our videos are being uploaded to YouTube. Slides will be linked against each author’s entry. There are an awful lot of happy comments on Twitter too. Our speakers covered Python, Julia, R, MCMC, clustering, geodata, financial modeling, visualisation, deployment, pipelines and a whole lot more. I spoke on Statistically Solving Sneezes and Sniffles (a citizen science project using ML to try to diagnose the causes of Rhinitis). Our Beginner Bootcamp (led by Conrad) had over 50 attendees!

…Let me second that. My first PyData also. It was incredible. Well organised – kudos to everyone who helped make it happen; you guys are pros. I found Friday useful as well, are the meetups like that? I’d love to be more involved in this community. –  lewis

We had two signing sessions for five authors with a ton of free books to give away:

  • Kyran Dale – Data Visualisation with Python and Javascript (these were the first copies in the UK!)
  • Amit Nandi – Spark for Python Developers
  • Malcolm Sherrington – Mastering Julia
  • Rui Miguel Forte – Mastering Predictive Analytics with R
  • Ian Ozsvald (me!) – High Performance Python (now in Italian, Polish and Japanese)

 

Some achievements

  • We used slack for all members at the conference – attendees started side-channels to share tutorial files, discuss the meets and recommend lunch venues (!)
  • We added an Unconference track (7 blank slots that anyone could sign-up for on the day), this brought us a nice random mix of new topics and round-table discussions
  • A new bioinformatics slack channel is likely to be formed due to collaborations at the conference
  • We signed up a ton of new volunteers to help us next year (thanks!)
  • An impromptu jobs board appeared on a notice board and was rapidly filled (if useful – also see my jobs list)

Thank you to all the organisers and speakers! It’s been my first PyData and it’s been great! – raffo

We had 15-20% female attendance this year, a slight drop on last year’s numbers (we’ll keep working to do better).

On a personal note it was great to see colleagues who I’ve coached in the past – especially as some were speaking or were a part of our organising committee.

With thanks to our sponsors and via ticket sales we raised more money this year for the NumFOCUS non-profit that backs the scientific Python stack (they give grants and stipends for contributors). We’d love to have more sponsors next year (this is especially useful if you’re hiring!). Thanks to:

Let me know if you do a write-up so I can link it here please:

If you’d like to hear about next year’s event then join this announce list (it’ll be super low volume). You probably also want to join our PyDataLondon meetup.

There are other upcoming PyData conferences including Berlin, Paris and Cologne. Take a look and get involved!

As an aside – if your data science team needs coaching, do drop me a line (and take a look at my coaching testimonials on LinkedIn). If you want a job in data science, take a look at my London Python data science jobs list.


Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight, sign-up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

31 Comments | Tags: Data science, Life, pydata, Python

11 January 2016 - 23:57 Allergic Rhinitis (“Why do I always sneeze?!”) research project using Machine Learning

Since April my wife (@fluffyemily) and I have been running a research project around her allergies. She sneezes all year and we’re trying to figure out the cause. Allergic Rhinitis affects 10-30% of Westerners, in Emily’s case it is all-year so it isn’t just pollen related. We figure that a good data-collection process coupled with robust analysis might reveal some of the causes of sneezing such that Emily’s in better control of her Rhinitis.

Emily’s a senior iOS developer with Mozilla, she wrote an open source App for her iPhone to log her sneezes, antihistamine use and interactions with “things” like animals. The App gives us a time-stamp and geolocation. Since she’s mostly in London we’ve got a rich source of events to join to other datasets.

This post is just to put down a marker. I’ve made some progress using Machine Learning to predict when an antihistamine might be used. Currently I can out-predict a Dummy (majority-class) classifier using many cross-validation runs, this is hardly brilliant but we didn’t expect diagnosing a long-term allergy to be a simple affair! Exploratory data analysis on the data shows lots of interesting behaviours, I hope to talk about some of these in the future.
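A minimal sketch of what "out-predicting a Dummy classifier over many cross-validation runs" means, using only the standard library. Everything here is an assumption made for illustration: the data is synthetic, `humidity` is a made-up single feature, and `fit_threshold` is a hypothetical stand-in for whatever model the real project uses. The baseline always predicts the most common training label, and a model is only interesting if it beats that score when averaged over repeated shuffled folds.

```python
import random
from collections import Counter
from statistics import mean


def fit_majority(train_rows, train_labels):
    """Baseline: always predict the most common label seen in training."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common


def fit_threshold(train_rows, train_labels):
    """Hypothetical stand-in model: predict 1 when humidity is above the
    midpoint between the two class means seen in training."""
    pos = [r["humidity"] for r, y in zip(train_rows, train_labels) if y == 1]
    neg = [r["humidity"] for r, y in zip(train_rows, train_labels) if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda row: 1 if row["humidity"] > cut else 0


def cross_val_accuracy(rows, labels, fit, k=5, repeats=10, seed=0):
    """Mean accuracy over `repeats` shuffled k-fold cross-validation runs."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        for fold in range(k):
            test_idx = indices[fold::k]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            predict = fit([rows[i] for i in train_idx],
                          [labels[i] for i in train_idx])
            hits = sum(predict(rows[i]) == labels[i] for i in test_idx)
            scores.append(hits / len(test_idx))
    return mean(scores)


# Synthetic data only: antihistamine use is made more likely on humid days.
data_rng = random.Random(42)
rows, labels = [], []
for _ in range(400):
    humidity = data_rng.uniform(30, 90)
    probability = 0.8 if humidity > 65 else 0.1
    rows.append({"humidity": humidity})
    labels.append(1 if data_rng.random() < probability else 0)

baseline_acc = cross_val_accuracy(rows, labels, fit_majority)
model_acc = cross_val_accuracy(rows, labels, fit_threshold)
print(f"majority baseline: {baseline_acc:.3f}  threshold model: {model_acc:.3f}")
```

The point of the baseline is that raw accuracy means little on imbalanced labels: if antihistamines are rarely used, "never" is already a high-scoring guess.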

We’ve tried (and so far rejected) airborne particulates as a reason for her allergies via King’s College LondonAir data (thanks!). Weather data is more promising using a local wunderground station (Emily seems to be a little sensitive to humidity and windspeed). I’ve recently started work on MyFitnessPal logged data (the Python 3.4 port was thankfully easy) to start to look at alcohol (a known histamine modifier) and possibly other food.

Behind the scenes I’ve got a collaborative group (thanks Frank and Giles!) in Slack and a private github repo, I plan to talk a little on how this works. I think talking about ways we can collaborate on research projects has value, anything that helps us move on from just working in an office seems like a good idea.

If you’re interested in hearing updates about this project and maybe getting involved to log your own allergy data, join this email announce list. Your email will be kept private, I’ll just send you an email every now and again when we’ve made some progress (which will probably appear here) and when we need volunteers.

Ultimately we’d like to help predict the causes of allergies for other folk. We’ve been talking about this for around 2 years, it is encouraging to see research like this pointing to the use of ML to predict and model the body’s behaviours.



15 Comments | Tags: Data science, Life, Python

14 October 2015 - 13:21 Opening Plenary at BudapestBI Forum 2015

I’ve just given my final talk for the year – I’m “at my other home” in Budapest (I’m half-Hungarian) and have had the honour of opening Bence and team’s BudapestBI Forum 2015. This conference has both an open-source-day and (tomorrow) an enterprise-day, all around analytics and with lots of Python and R.

This talk is an iteration of my previous Shipping talks. It is backed in part by results from our latest PyDataLondon survey of our 2,000 members, where we asked about member frustrations, and I’ve integrated some of the results into this talk:

Shipping Data Science Products
(source)

Here are my slides:

In the room we had roughly 2/3 ‘engineers/builders’ and 1/3 ‘researchers/analysts’, it seems that Python and R are used by a large number of folk here today.

I also ‘released’ a set of my notes that I’ve tentatively entitled “Data Science Delivered” – this is a github doc with a series of the notes that I wish I’d learned years ago. Right now these notes are super-rough, I figure “release early, release often” will help me refine these.

It is based in part on my talking, teaching and coaching over the last couple of years. I intend to add more in the next couple of weeks (so hopefully by November 2015 it’ll be far less rough!), I’d like to add some Notebooks as examples. You’re welcome to post bugs/requests and I’ll try to add notes, if I know about those areas. Please feel free to share some of your experiences (via @ianozsvald, via email, via Bugs etc).



8 Comments | Tags: Data science, Life, pydata, Python

20 September 2015 - 17:23 “Ship Data Science Products!” at PyConUK2015

PyConUK2015 is over, it was another year of happy Pythonistic hobbitness in Coventry. I spoke on shipping data science products on the new Science track (organised by Sarah):

It was nice to hear some polite-abuse being thrown at folk stuck on Python 2.x reminding them that it is high time to upgrade to Python 3. Propaganda was given away to support this move.

Obviously I plugged PyDataLondon and our upcoming meetups – if you like data science then come along to our meetups.



8 Comments | Tags: Data science, Life, pydata, Python

13 May 2015 - 16:42 Data Science Deployed – Opening Keynote for PyConSE 2015

I’ve just had a fab couple of days at PyConSE in Stockholm, I really enjoyed giving the opening keynote (thanks!) and attending two days of interesting talks. The Saturday was packed with data science talks (see below), it felt like a mini PyData or EuroSciPy, most cool!

The goal of my talk was to show use-cases for why you should do data science, why it is valuable, how to do it successfully with Python and how to get the data products deployed. The whole shebang in 40 minutes. Tools mentioned include scikit-learn, statsmodels, textract, pandas, matplotlib, seaborn, bokeh, IPython and Notebooks, Spyder, PyCharm, Flask and Spyre.

Sidenote – this is the follow-on to my “The Real Unsolved Problems in Data Science” opening keynote at PyConIreland 2014.

My main points seemed to make it through, phew!

What I take from @ianozsvald talk:
“How can i turn our data into business value?”
“Log everything!”
Think + hypothesize + test @pythse

Exploiting your data is key to staying relevant in your business! Listening to @ianozsvald at #pyconse @scalior

Note – I’ll be updating this write-up a little over the next couple of days (it is the end of the conf and I’m rather shattered right now!).

The slides and video for my Data Science Deployed talk are below:

I’d like to acknowledge Ollie Glass along with Ferenc Huszár (Balderton) and Thomas Stone (Prediction.io) for feedback on early ideas for my talk – cheers gents!

I also plugged PyDataBerlin, our upcoming PyDataLondon (June 19-21, CfP open for just 1 more week) and EuroSciPy on stage, hopefully we’ll see a few more international visitors. I should have plugged PyConUK too as there’s now a Science Track!

The following talks from yesterday will interest you, I hope the videos come online soon:

  • Analyzing data with Pandas
  • Data processing and machine learning with Python (slides)
  • Deep Learning and Deep Data Science
  • Hacking Human Language
  • IPython: How a notebook is changing science
  • The Hitchhikers Guide to Python

Here’s a couple of extra links that might be interesting:

Here’s Ilian Iliev’s review of the conference too.

I have a vague idea to write-up these topics more in the future, I’m calling this Building Data Science Products with Python. There’s a mailing list, I’ll email to ask questions a little over the coming months to figure out if/how I should write this.

Thanks everyone for a lovely conference!



11 Comments | Tags: Life, pydata, Python

3 April 2015 - 11:05 PyDataParis 2015 and “Cleaning Confused Collections of Characters”

I’m at PyDataParis, this is the first PyData in France and we have a 300-strong turn-out. In my talk I asked about the split of academic and industrial folk, we have 70% industrialists here (at least – in my talk of 70 folk). The bulk of the attendees are in the Intro track and maybe the split is different in there. All slides are up, videos are following, see them here.

Here’s a photo of Gael giving a really nice opening keynote on Scikit-Learn:

I spoke on data cleaning with text data, I packed quite a bit into my 40 minutes and got a nice set of questions. The slides are below, they cover:

  • Data extraction from text files, PDF, HTML/XML and images
  • Merging on columns of data
  • Correctly processing datetimes from files and the dangers of relying on the pandas defaults
  • Normalising text columns so we could join on otherwise messy data
  • Automated data transformation using my annotate.io (Python demo)
  • Ideas on automated feature extraction
  • Ideas on automating visualisation for new, messy datasets to get a “bird’s eye view”
  • Tips on getting started – make a Gold Standard!
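As an illustration of the "normalise text columns so we can join" idea from the list above, here is a minimal standard-library sketch. It is not the code from the talk: the function names, the `company` column and the two tiny tables are made up for the example, and real data usually needs more rules than this (abbreviations, legal suffixes and so on).

```python
import re
import unicodedata


def normalise_key(text):
    """Reduce a messy label to a canonical join key."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, turn runs of punctuation/whitespace into a single space
    lowered = no_accents.casefold()
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


def join_on_normalised(left_rows, right_rows, key):
    """Inner-join two lists of dicts on the normalised value of `key`."""
    index = {normalise_key(row[key]): row for row in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(normalise_key(row[key]))
        if match is not None:
            merged = dict(match)
            merged.update(row)  # values from the left table win on clashes
            joined.append(merged)
    return joined


# Hypothetical example tables
left = [{"company": "Café  Nero Ltd.", "revenue": 10}]
right = [{"company": "CAFE NERO LTD", "city": "London"}]
joined = join_on_normalised(left, right, "company")
print(joined)
```

Keeping the original value alongside the normalised key (rather than overwriting it) makes it easier to audit which messy rows were matched to which.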

One question concerned the parsing of datetime strings from unusual sources. I’d mentioned dateutil’s parser in the talk and a second parser is delorean. In addition I’ve also seen arrow (an extension of the standard datetime) which has a set of parsers including one for ISO8601. The parsedatetime module has an NLP module to convert statements like “tomorrow” into a datetime.

I don’t know of other, better parsers – do you? In particular I want one that’ll take a list of datetimes and return one consistent converter that isn’t confused by individual instances (e.g. “1/1” is MM/DD or DD/MM ambiguous).
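One way to approximate the "consistent converter" wished for above, sketched with only the standard library: try each candidate format against the whole column and keep the formats that parse every value. The candidate list and function names are assumptions for illustration, and this is not how dateutil, delorean, arrow or parsedatetime work internally.

```python
from datetime import datetime

# Hypothetical candidate formats; extend to suit the data source
CANDIDATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d")


def consistent_formats(date_strings, candidates=CANDIDATE_FORMATS):
    """Return the candidate formats that parse every string in the column."""
    surviving = []
    for fmt in candidates:
        try:
            for text in date_strings:
                datetime.strptime(text, fmt)
        except ValueError:
            continue  # at least one value rejects this format
        surviving.append(fmt)
    return surviving


def make_converter(date_strings, candidates=CANDIDATE_FORMATS):
    """Build one converter for the column, refusing to guess when ambiguous."""
    formats = consistent_formats(date_strings, candidates)
    if len(formats) != 1:
        raise ValueError(f"expected exactly one consistent format, got {formats}")
    fmt = formats[0]
    return lambda text: datetime.strptime(text, fmt)


# "13/2/2015" can only be day-first, which settles the whole column
column = ["1/1/2015", "13/2/2015", "3/4/2015"]
convert = make_converter(column)
print(convert("3/4/2015"))
```

A column where every value is ambiguous (say only "1/1/2015" and "3/4/2015") leaves two surviving formats, so `make_converter` raises rather than silently picking one.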

I’m also asking for feedback on the subject of automated feature extraction and automated column-join tools for messy data. If you’ve got ideas on these subjects I’d love to hear from you.

In addition I was reminded of DiffBot, it uses computer vision and NLP to extract meaning from web pages. I’ve never tried it, can any of you comment on its effectiveness? Olivier Grisel mentioned pyquery to me, it is an lxml parser which lets you make jquery-like queries on HTML.

update I should have mentioned chardet, it detects encodings (UTF8, CP1252 etc) from raw text, very useful if you’re trying to figure out the encoding for a collection of bytes off of a random data source! libextract (write-up) looks like a young but nice tool for extracting text blocks from HTML/XML sources, also goose. boltons is a nice collection of bolton-tools to the standard library (e.g. timeutils, strutils, tableutils). Possibly mETL is a useful tool to think about the extract, transform and load process.
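For a feel of the encoding problem, here is a deliberately crude standard-library sketch: it tries a short list of candidate encodings with strict decoding and returns the first that succeeds. This is an assumption-laden stand-in, not chardet's approach (chardet makes a statistical guess over many encodings); the function name and candidate order are made up for the example.

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that strictly decodes `raw`, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


print(guess_encoding("café".encode("utf-8")))
print(guess_encoding("café".encode("cp1252")))
```

Order matters: UTF-8 goes first because it rejects most byte sequences that are not UTF-8, whereas a permissive single-byte encoding would accept almost anything.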

update It might also be worth noting some useful data sources from which you can extract semi-structured data, e.g. ‘tech tags’ from stackexchange’s forums (and I also see a new hackernews dump). Here’s a big list of “awesome public datasets”.

update Peadar Coyle (@springcoil) gave a nice talk at PyConItaly 2015 on “Data Products – how to get models into production” which is related.

Camilla Montonen has just spoken on Rush Hour Dynamics, visualising London Underground behaviour. She noted graph-tool, a nice graphing/viz library I’d not seen before. Fabian has just shown me his new project, it collects NLP IPython Notebooks and lists them, it tries to extract titles or summaries (which is a gnarly sub-problem!). The AXA Data Innovation Lab have a nice talk on explaining machine learned models.

Gilles Louppe’s slides for his ML/sklearn talk on trees and boosting are online, as are Alexandre Gramfort’s on sklearn linear models.



14 Comments | Tags: Data science, Life, pydata, Python

21 February 2015 - 21:05 Data-Science stuff I’m doing this year

2014 was an interesting year, 2015 looks to be even richer. Last year I got to publish my High Performance Python book, help co-organise the rather successful PyDataLondon2014 conference, teach High Performance in public (slides online) and in private, keynote on The Real Unsolved Problems in Data Science and start my ModelInsight AI agency. That was a busy year (!) but deeply rewarding.

My High Performance Python published with O’Reilly in 2014

 

This year our consulting is branching out – we’ve already helped a new medical start-up define their data offering, I’m mentoring another data scientist (to avoid 10 years of my mistakes!) and we’re deploying new text mining IP for existing clients. We’ve got new private training this April for Machine Learning (scikit-learn) and High Performance Python (announce list) and Spark is on my radar.

Apache Spark maxing out 8 cores on my laptop

Python’s role in Data Science has grown massively (I think we have 5 euro-area Python-Data-Science conferences this year) and I’m keen to continue building the London and European scenes.

I’m particularly interested in dirty data and ways we can efficiently clean it up (hence my Annotate.io lightning talk a week back). If you have problems with dirty data I’d love to chat and maybe I can share some solutions.

For PyDataLondon-the-conference we’re getting closer to fixing our date (late May/early June), join this announce list to hear when we have our key dates. In a few weeks we have our 10th monthly PyDataLondon meetup, you should join the group as I write up each event for those who can’t attend so you’ll always know what’s going on. To keep the meetup from degenerating into a shiny-suit-fest I’ve setup a separate data science jobs list, I curate it and only send relevant contract/permie job announces.

This year I hope to be at PyDataParis, PyConSweden, PyDataLondon, EuroSciPy and PyConUK – do come say hello if you’re around!



5 Comments | Tags: ArtificialIntelligence, Data science, High Performance Python Book, Life, pydata, Python

19 February 2015 - 11:35 Starting Spark 1.2 and PySpark (and ElasticSearch and PyPy)

The latest PySpark (1.2) is feeling genuinely useful. Late last year I had a crack at running Apache Spark 1.0 and PySpark and it felt a bit underwhelming (too much fanfare, too many bugs). The media around Spark continues to grow: today’s hackernews thread on the new DataFrame API has a lot of positive discussion, and the lazily evaluated pandas-like dataframes built from a wide variety of data sources feel very powerful. Continuum have also just announced PySpark+GlusterFS.

One surprising fact is that Spark is Python 2.7 only at present, feature request 4897 is for Python 3 support (go vote!) which requires some cloud pickling to be fixed. Using the end-of-line Python release feels a bit daft. I’m using Linux Mint 17.1 which is based on Ubuntu 14.04 64bit. I’m using the pre-built spark-1.2.0-bin-hadoop2.4.tgz via their downloads page and ‘it just works’. Using my global Python 2.7.6 and additional IPython install (via apt-get):

spark-1.2.0-bin-hadoop2.4 $ IPYTHON=1 bin/pyspark
...
IPython 1.2.1 -- An enhanced Interactive Python.
...
 Welcome to
 ____              __
 / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
 /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
 /_/
Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
 SparkContext available as sc.
 >>>

Note the IPYTHON=1, without that you get a vanilla shell, with it it’ll use IPython if it is in the search path. IPython lets you interactively explore the “sc” Spark context using tab completion which really helps at the start. To run one of the included demos (e.g. wordcount) you can use the spark-submit script:

spark-1.2.0-bin-hadoop2.4/examples/src/main/python $ ../../../../bin/spark-submit wordcount.py kmeans.py  # count words in kmeans.py

For my use case we were initially after sparse matrix support, sadly they’re only available for Scala/Java at present. By stepping back from my sklearn/scipy sparse solution for a minute and thinking a little more map/reduce I could just as easily split the problem into a number of counts, and that parallelises very well in Spark (though I’d love to see sparse matrices in PySpark!).
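To show the shape of "split the problem into counts", here is a plain-Python sketch of the map and reduce steps. The partitions and job-title strings are invented, and this is not the client code. Each partition is counted independently (the part a cluster can do in parallel) and the partial counts are then merged with an associative operation. In PySpark the same pattern is commonly written with `flatMap`, `map` and `reduceByKey` on an RDD.

```python
from collections import Counter
from functools import reduce


def count_partition(lines):
    """Map step: count tokens within one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts (associative and commutative)."""
    return left + right


# Hypothetical data, already split into two partitions
partitions = [
    ["python developer london", "java developer"],
    ["python data scientist", "data engineer london"],
]

partial_counts = [count_partition(p) for p in partitions]  # independent per partition
total = reduce(merge_counts, partial_counts, Counter())
print(total.most_common(3))
```

Because the merge is associative and commutative, the order in which partial results arrive does not matter, which is what lets the work spread across workers.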

I’m doing this with my contract-recruitment client via my ModelInsight as we automate recruitment, there’s a press release out today outlining a bit of what we do. One of the goals is to move to a more unified research+deployment approach, rather than lots of tooling in R&D which we then streamline for production, instead we hope to share similar tooling between R&D and production so deployment and different scales of data are ‘easier’.

I tried the latest PyPy 2.5 (running Python 2.7) and it ran PySpark just fine. Using PyPy 2.5 a prime-search example takes 6s vs 39s with vanilla Python 2.7, so in-memory processing using RDDs rather than numpy objects might be quick and convenient (has anyone trialled this?). To run using PyPy set PYSPARK_PYTHON:

$ PYSPARK_PYTHON=~/pypy-2.5.0-linux64/bin/pypy ./pyspark
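The prime-search code itself isn't shown in this post, so the following is only a hypothetical stand-in for that kind of workload: a tight pure-Python numeric loop with no numpy involved, which is the sort of code a JIT such as PyPy's tends to speed up. No timings are claimed for this sketch.

```python
def is_prime(n):
    """Trial division by odd numbers up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def count_primes(limit):
    """Count the primes in range(limit)."""
    return sum(1 for n in range(limit) if is_prime(n))


print(count_primes(100))
```

Inside a pyspark shell, something like `sc.parallelize(range(limit)).filter(is_prime).count()` would be one way to spread the same check over an RDD (not run here).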

I’m used to working with Anaconda environments and for Spark I’ve setup a Python 2.7.8 environment (“conda create -n spark27 anaconda python=2.7”) & IPython 2.2.0. Whichever Python is in the search path or is specified at the command line is used by the pyspark script.

The next challenge to solve was integration with ElasticSearch for storing outputs. The official docs are a little tough to read as a non-Java/non-Hadoop programmer and they don’t mention PySpark integration, thankfully there’s a lovely 4-part blog sequence which “just works”:

  1. ElasticSearch and Python (no Spark but it sets the groundwork)
  2. Reading & Writing ElasticSearch using PySpark
  3. Sparse Matrix Multiplication using PySpark
  4. Dense Matrix Multiplication using PySpark

To summarise the above with a trivial example, to output to ElasticSearch using a trivial local dictionary and no other data dependencies:

$ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.0.Beta2/elasticsearch-hadoop-2.1.0.Beta2.jar
$ ~/spark-1.2.0-bin-hadoop2.4/bin/pyspark --jars elasticsearch-hadoop-2.1.0.Beta2.jar
>>> res = sc.parallelize([1, 2, 3, 4])
>>> res2 = res.map(lambda x: ('key', {'name': str(x), 'sim': 0.22}))
>>> res2.collect()
[('key', {'name': '1', 'sim': 0.22}),
 ('key', {'name': '2', 'sim': 0.22}),
 ('key', {'name': '3', 'sim': 0.22}),
 ('key', {'name': '4', 'sim': 0.22})]

>>> res2.saveAsNewAPIHadoopFile(path='-',
...     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
...     keyClass="org.apache.hadoop.io.NullWritable",
...     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
...     conf={"es.resource": "myindex/mytype"})

The above creates a list of 4 dictionaries and then sends them to a local ES store using “myindex” and “mytype” for each new document.  Before I found the above I used this older solution which also worked just fine.

Running the local interactive session using a mock cluster was pretty easy. The docs for spark-standalone are a good start:

sbin $ ./start-master.sh
 #  the log (full path is reported by the script so you could `tail -f `) shows
 # 15/02/17 14:11:46 INFO Master:
 # Starting Spark master at spark://ian-Latitude-E6420:7077
 # which gives the link to the browser view of the master machine which is
 # probably on :8080 (as shown here http://www.mccarroll.net/blog/pyspark/).
# Next start a single worker:
sbin $ ./start-slave.sh 0 spark://ian-Latitude-E6420:7077
 # and the logs will show a link to another web page for each worker
 # (probably starting at :4040).
# Next you can start a PySpark IPython shell for local experimentation:
$ IPYTHON=1 ~/data/libraries/spark-1.2.0-bin-hadoop2.4/bin/pyspark \
  --master spark://ian-Latitude-E6420:7077
 # (and similarly you could run a spark-shell to do the same with Scala)
# Or we can run their demo code using the master node you've set up:
$ ~/spark-1.2.0-bin-hadoop2.4/bin/spark-submit \
  --master spark://ian-Latitude-E6420:7077 \
  ~/spark-1.2.0-bin-hadoop2.4/examples/src/main/python/wordcount.py README.txt

Note if you tried to run the above spark-submit (which specifies the --master to connect to) and you didn’t have a master node, you’d see log messages like:

15/02/17 14:14:25 INFO AppClient$ClientActor: 
 Connecting to master spark://ian-Latitude-E6420:7077...
15/02/17 14:14:25 WARN AppClient$ClientActor: 
 Could not connect to akka.tcp://sparkMaster@ian-Latitude-E6420:7077: 
 akka.remote.InvalidAssociation: 
 Invalid address: akka.tcp://sparkMaster@ian-Latitude-E6420:7077
15/02/17 14:14:25 WARN Remoting: Tried to associate with 
 unreachable remote address 
 [akka.tcp://sparkMaster@ian-Latitude-E6420:7077]. 
 Address is now gated for 5000 ms, all messages to this address will 
 be delivered to dead letters. 
 Reason: Connection refused: ian-Latitude-E6420/127.0.1.1:7077

If you had a master node running but you hadn’t setup a worker node then after doing the spark-submit it’ll hang for 5+ seconds and then start to report:

15/02/17 14:16:16 WARN TaskSchedulerImpl: 
 Initial job has not accepted any resources; 
 check your cluster UI to ensure that workers are registered and 
 have sufficient memory

and if you google that without thinking about the worker node then you’d come to this diagnostic page  which leads down a small rabbit hole…

Stuff I’d like to know:

  • How do I read easily from MongoDB using an RDD (in Hadoop format) in PySpark (do you have a link to an example?)
  • Who else in London is using (Py)Spark? Maybe catch-up over a coffee?


10 Comments | Tags: ArtificialIntelligence, Data science, Life, pydata, Python

10 December 2014 - 13:31 New Relic, uWSGI and “Cannot perform a data harvest for ‘<appname>’ as there is no active session.”

This is more a note-to-self and maybe to another confused soul – if you’re using New Relic (it seems to be really rather nice for web app monitoring) with uWSGI, note that by default uWSGI’s Python plugin doesn’t initialise the GIL. This means threads started by your app (including New Relic’s harvest thread) don’t get to run, so New Relic won’t report anything, which leads to a confusing first try.

Specifically read the Best Practices notes for uWSGI around “--enable-threads”. You have to add “--enable-threads” if you’re using New Relic’s Python agent, this is documented on their Python Agent Integration docs for uWSGI but for me the clue was in their log (by default in /tmp/newrelic-python-agent.log if you enable it in newrelic.ini) which showed:

(3717/NR-Harvest-Thread) newrelic.core.agent DEBUG 
 - Commencing harvest of all application data.
(3717/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Cannot perform a data harvest for '<appname>' as there is no active session.
(3717/NR-Harvest-Thread) newrelic.core.agent DEBUG 
 - Completed harvest of all application data in 0.00 seconds.

Once I’d added “--enable-threads” to uWSGI the logs looked a lot healthier, particularly:

(3292/NR-Harvest-Thread) newrelic.core.agent DEBUG 
 - Commencing harvest of all application data.
(3292/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Commencing data harvest of '<appname>'.
 ...
(3292/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Send profiling data for harvest of '<appname>'.
(3292/NR-Harvest-Thread) newrelic.core.application DEBUG 
 - Done sending data for harvest of '<appname>'.
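To see why thread support matters here, consider a minimal WSGI app that, like a monitoring agent, does its reporting from a background thread. This is a hypothetical sketch and not New Relic's agent: `harvest_loop`, `harvest_count` and the two events are invented names. Run under a plain Python process the counter climbs. Under uWSGI without “--enable-threads” the background thread would be expected to get little or no chance to run, so the counter would stay at or near zero.

```python
import threading

harvest_count = 0
first_harvest_done = threading.Event()
stop_harvesting = threading.Event()


def harvest_loop(interval=0.01):
    """Stand-in for an agent's periodic harvest: just bump a counter."""
    global harvest_count
    while not stop_harvesting.is_set():
        harvest_count += 1
        first_harvest_done.set()
        stop_harvesting.wait(interval)  # sleep, but wake early if asked to stop


# Started at import time, as an agent would do when the app module loads
worker = threading.Thread(target=harvest_loop, daemon=True)
worker.start()


def application(environ, start_response):
    """WSGI entry point reporting how many harvests have happened so far."""
    body = f"harvests so far: {harvest_count}\n".encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Pointing uWSGI at this module with and without “--enable-threads” and requesting the page a few times is a quick way to check the behaviour on your own setup.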


No Comments | Tags: Life, Python

30 August 2014 - 12:06 Slides for High Performance Python tutorial at EuroSciPy2014 + Book signing!

Yesterday I taught an excerpt of my 2 day High Performance Python tutorial as a 1.5 hour hands-on lesson at EuroSciPy 2014 in Cambridge with 70 students:


We covered profiling (down to line-by-line CPU & memory usage), Cython (pure-py and OpenMP with numpy), Pythran, PyPy and Numba. This is an abridged set of slides from my 2 day tutorial, take a look at those details for the upcoming courses (including an intro to data science) we’re running in October.

I’ll add the video in here once it is released, the slides are below.

I also got to do a book-signing for our High Performance Python book (co-authored with Micha Gorelick), O’Reilly sent us 20 galley copies to give away. The finished printed book will be available via O’Reilly and Amazon in the next few weeks.

Book signing at EuroSciPy 2014

If you want to hear about our future courses then join our low-volume training announce list. I have a short (no-signup) survey about training needs for Pythonistas in data science, please fill that in to help me figure out what we should be teaching.

I also have a further survey on how companies are using (or not using!) data science, I’ll be using the results of this when I keynote at PyConIreland in October, your input will be very useful.

Here are the slides (License: CC By NonCommercial), there’s also source on github:



No Comments | Tags: Life, pydata, Python