Archives of #Classifier

Overfitting with a Decision Tree

Below is a plot of training versus testing errors using a precision metric (actually 1.0 - precision, so lower is better) that shows how easy it is to overfit a decision tree to the detriment of generalisation. It is important to check that a classifier isn’t overfitting to the training data such that it is just learning […]
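
A minimal sketch of this kind of depth-versus-error comparison (illustrative synthetic data and a max_depth sweep, not the post's actual code):

```python
# Sketch: training vs testing error (1.0 - precision, lower is better)
# for a decision tree as its depth grows. Data and depth range are assumptions.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

depths = range(1, 21)
train_err, test_err = [], []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
    train_err.append(1.0 - precision_score(y_train, clf.predict(X_train)))
    test_err.append(1.0 - precision_score(y_test, clf.predict(X_test)))

# Training error keeps falling with depth while testing error flattens or rises,
# which is the overfitting gap the plot is meant to expose.
plt.plot(depths, train_err, label="training error (1 - precision)")
plt.plot(depths, test_err, label="testing error (1 - precision)")
plt.xlabel("max_depth")
plt.ylabel("1.0 - precision (lower is better)")
plt.legend()
plt.show()
```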

Visualising True Positives and False Positives against Features with scikit-learn

Here I’m starting to look into the classification errors made in the social media brand disambiguator project. Below I look at true and false positives (correct and mistaken is-a-brand classifications) and plot them against the number of features that two different classifiers can use to calculate their class membership probabilities. First I’m using the default LogisticRegression […]
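
A rough sketch of this kind of plot, using hypothetical stand-in tweets and the default LogisticRegression rather than the project's own labelled data:

```python
# Sketch: scatter the is-a-brand probability against the number of active
# features per document, marking true and false positives separately.
# The texts and labels below are placeholders, not project data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "just bought the new apple phone, love it",
    "apple pie recipe with cinnamon",
    "apple stock is up again today",
    "picked an apple from the tree",
    "queueing outside the apple store",
    "an apple a day keeps the doctor away",
] * 20
labels = np.array([1, 0, 1, 0, 1, 0] * 20)  # 1 = tweet is about the brand

vec = CountVectorizer(binary=True)
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

probs = clf.predict_proba(X)[:, 1]     # probability of the is-a-brand class
preds = (probs >= 0.5).astype(int)
n_features = X.getnnz(axis=1)          # number of active terms per document

tp = (preds == 1) & (labels == 1)
fp = (preds == 1) & (labels == 0)
# With this toy data the classifier may separate perfectly, leaving the
# false-positive set empty; real tweets would populate both groups.
plt.scatter(n_features[tp], probs[tp], c="g", label="true positives")
plt.scatter(n_features[fp], probs[fp], c="r", label="false positives")
plt.xlabel("number of active features in document")
plt.ylabel("P(is-a-brand)")
plt.legend()
plt.show()
```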

Visualising the internals of Logistic Regression on a Text Matrix

Below I have some plots that visualise the term matrix (as a binary matrix and as a TF-IDF matrix) for the brand disambiguation project, followed by a visualisation of the coefficients learned by scikit-learn’s LogisticRegression classifier with l1 and l2 penalties. Using a CountVectorizer with binary=True, we can mark the absence or presence of a […]
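
A small sketch along these lines, with placeholder texts rather than the project's tweets; note that recent scikit-learn versions need solver="liblinear" for the l1 penalty:

```python
# Sketch: show the binary and TF-IDF term matrices, then compare the
# LogisticRegression coefficients under l1 and l2 penalties.
# Texts and labels are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "new apple phone announced today",
    "baked an apple crumble",
    "apple shares rally after launch",
    "apple orchard in the autumn",
]
labels = [1, 0, 1, 0]  # 1 = is-a-brand (hypothetical labels)

X_bin = CountVectorizer(binary=True).fit_transform(texts)
X_tfidf = TfidfVectorizer().fit_transform(texts)

# Binary matrix marks term presence/absence; TF-IDF weights terms by rarity.
fig, axes = plt.subplots(1, 2)
axes[0].imshow(X_bin.toarray(), interpolation="nearest", cmap="gray_r")
axes[0].set_title("binary term matrix")
axes[1].imshow(X_tfidf.toarray(), interpolation="nearest", cmap="gray_r")
axes[1].set_title("TF-IDF term matrix")
plt.show()

# l1 tends to zero out many coefficients; l2 shrinks them all more smoothly.
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X_bin, labels)
clf_l2 = LogisticRegression(penalty="l2").fit(X_bin, labels)
plt.plot(clf_l1.coef_.ravel(), "o", label="l1 coefficients")
plt.plot(clf_l2.coef_.ravel(), "x", label="l2 coefficients")
plt.xlabel("feature index")
plt.ylabel("coefficient value")
plt.legend()
plt.show()
```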

June project: Disambiguating “brands” in Social Media

Having returned from Chile last year, settled into consulting in London and got married, I’m now on honeymoon and planning a change for June. I’m taking the month off from clients to work on my own project, an open-sourced brand disambiguator for social media. As an example, this will detect that the following […]