
The battle for Greece’s next top model may have been over for this year…😂
However, the battle’s still on in almost every machine learning task a data scientist comes across in their daily (and nightly) lives.
The long-standing questions in these cases are:
- Which model is the best?
- Can I know that in advance?
- If not, how am I supposed to try every possible different model and decide afterwards, without creating the messiest 🍝 spaghetti code in the whole of Italy?
- And most importantly, should I eat more cake?…err sorry, that’s always a yes!
Fortunately, the machine learning community can offer some real engineering gems for deploying your machine learning models easily and with the least possible pain; my favourites being keras/tensorflow and scikit-learn (sklearn). Today, I’ll focus on scikit-learn only, which offers great implementations for a large set of popular “traditional” machine learning classifiers (aka non-deep-learning based).
Yet, if I had to try every possible classifier integrated within scikit-learn, would I be able to do that in a coherent/non-redundant/straightforward way?…🤔 i.e. write my code once and seamlessly run it for every possible classifier? (unlike Java…)
Today’s our lucky day as the answer to that is: (you guessed it) yes!
The very first step for doing that is defining a class (let’s call it SklearnWrapper) that can invoke standard built-in methods, such as fit, predict, etc., seamlessly across multiple sklearn classifiers.
This class would look like this:
class SklearnWrapper(object):
    def __init__(self, clf, params={}):
        # Instantiate the underlying sklearn classifier with its parameters
        self.clf = clf(**params)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def predict(self, x):
        return self.clf.predict(x)

    def predict_proba(self, x):
        return self.clf.predict_proba(x)

    def evaluate(self, x, y, verbose=0):
        # sklearn estimators have no evaluate(); score() returns mean accuracy
        return self.clf.score(x, y)

    def feature_importances(self, x, y):
        return self.clf.fit(x, y).feature_importances_

    def get_coef_(self):
        return self.clf.coef_
The critical part here is the pair of clf and params arguments: the classifier class clf is instantiated into a concrete sklearn object by calling the clf(**params) constructor.
For example, you can create a new Random Forest classifier like this:
from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifier parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 'auto',
    'max_depth': 15,
    'min_samples_leaf': 2,
    'min_samples_split': 4,
    'warm_start': False,
    'verbose': 0
}

rf_model = SklearnWrapper(clf=eval("RandomForestClassifier"), params=rf_params)
Note that the clf argument’s input value is a string representing a classifier that is eval‘ed and thus interpreted into the respective sklearn classifier class. This means that the string passed to clf has to be the name of a valid sklearn class implementing a machine learning classifier! The clf variable/argument in that case is merely a placeholder for a real sklearn classifier class (which needs to be explicitly imported), while params is a dictionary of classifier-specific input parameters.
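Side note: since eval simply looks the name up in the current namespace, passing the class object itself is equivalent to passing its name as a string. A minimal sketch, reusing the rf_params defined above:

from sklearn.ensemble import RandomForestClassifier

# Equivalent to clf=eval("RandomForestClassifier"): pass the class directly
rf_model = SklearnWrapper(clf=RandomForestClassifier, params=rf_params)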
Similarly, we can create a Support Vector Classifier model like so:
from sklearn.svm import SVC

# Support Vector Classifier parameters
svc_params = {
    'C': 0.01,
    'kernel': 'linear',
    'gamma': 'auto',
    'probability': True,
    'shrinking': True
}

svc_model = SklearnWrapper(clf=eval("SVC"), params=svc_params)
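Just to give a taste of the payoff, here’s a minimal sketch of running the exact same fit/predict code over both wrapped models; it uses sklearn’s built-in iris dataset and accuracy_score purely as stand-ins for your own data and evaluation metric:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data, purely for illustration
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

# The exact same code path runs for every wrapped classifier
for name, model in [("RandomForestClassifier", rf_model), ("SVC", svc_model)]:
    model.fit(x_train, y_train)
    preds = model.predict(x_test)
    print(name, accuracy_score(y_test, preds))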
So far so good… All we’re missing now is hooking up our model definitions with real data objects (train/test sets) to actually train our models and make predictions. In an upcoming post, I’ll talk about creating a higher-level class that implements further functionality on top of SklearnWrapper to provide a standard interface for building a model, training it, evaluating it and extracting predictions. 🙂