HashingVectorizer – DAN FOLKES

pickle and python

I took this piece of code out of a project I am working on. I wanted to guess the tag based on keywords in the body of text. So, I take the text, apply a hash vectorizer and then pass the hashed values into a AdaBoostClassifier that uses DecisionTreeClassifier. I wanted to build it once and use it over and over again, so I used Pickle to save it on the file system to reuse.

This code assumes you have a dataframe populated already.

Includes:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import pickle
import os.path

Setting up filesystem and parameters stuff:

resetPickle = False
foundPickle = False
"""This is where you would load the dataset"""
df_tags = pd.DataFrame()
pick_model_path1='pickles/modelAdaDecTreeClassifier.pickle'
pick_model_tags_root_pre = 'pickles/model_tag_'
pick_model_tags_root_post = '_DecTreeClassifier.pickle'
tag_pickle_path = pick_model_tags_root_pre + 'PIC' + pick_model_tags_root_post

Create HashingVectorizer. The ngrams 1,2 means that it will use words like “Richmond” and “Richmond VA” as tokens:

vctrizr_tag = HashingVectorizer(ngram_range=(1, 2))

This will check to see if the pickle exists. It will load it into the model if it exists:

if resetPickle == False and os.path.isfile(tag_pickle_path):
    pickle_in = open(tag_pickle_path,'rb')
    model_tag = pickle.load(pickle_in)
    foundPickle = True

If the pickle does not exist, it will go and train the AdaBoostClassifier and save it into the pickle:

if foundPickle == False:
    y_tag = df_tags
    vctr_tag= vctrizr_tag.transform(df_tags['Text'])
    X_tag = vctrizr_tag.transform(df_tags['Text'])
    X_train_tag, X_test_tag, y_train_tag, y_test_tag = train_test_split(X_tag, y_tag, test_size=0.2, random_state=1)
    model_tag = AdaBoostClassifier(DecisionTreeClassifier(max_depth=44),n_estimators=25)
    model_tag = model_tag.fit(X_train_tag, y_train_tag)
    score = model_tag.score(X_test_tag, y_test_tag)
    print('score',score)
    with open(tag_pickle_path, 'wb') as f:
        pickle.dump(model_tag, f)

All together now:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import pickle
import os.path
resetPickle = False
foundPickle = False
"""This is where you would load the dataset"""
df_tags = pd.DataFrame()
 
pick_model_path1='pickles/modelAdaDecTreeClassifier.pickle'
pick_model_tags_root_pre = 'pickles/model_tag_'
pick_model_tags_root_post = '_DecTreeClassifier.pickle'
tag_pickle_path = pick_model_tags_root_pre + 'PIC' + pick_model_tags_root_post
vctrizr_tag = HashingVectorizer(ngram_range=(1, 2))
if resetPickle == False and os.path.isfile(tag_pickle_path):
    pickle_in = open(tag_pickle_path,'rb')
    model_tag = pickle.load(pickle_in)
    foundPickle = True
if foundPickle == False:
    y_tag = df_tags
    vctr_tag= vctrizr_tag.transform(df_tags['Text'])
    X_tag = vctrizr_tag.transform(df_tags['Text'])
    X_train_tag, X_test_tag, y_train_tag, y_test_tag = train_test_split(X_tag, y_tag, test_size=0.2, random_state=1)
    model_tag = AdaBoostClassifier(DecisionTreeClassifier(max_depth=44),n_estimators=25)
    model_tag = model_tag.fit(X_train_tag, y_train_tag)
    score = model_tag.score(X_test_tag, y_test_tag)
    print('score',score)
    with open(tag_pickle_path, 'wb') as f:
        pickle.dump(model_tag, f)

Tag: HashingVectorizer

Python: Pickle and Unpickle Tree Classifier with Hashing Vectorizer