# Tradeshift Text Classification

The Tradeshift Text Classification challenge was sponsored by a company that processes and converts paper documents into electronic form. The challenge was to “to create and open source an algorithm that correctly predicts the probability that a piece of text belongs to a given class” where the class may be, e.g., a date, address, name, etc.

I wasn’t particularly interested in this competition, but I thought I’d enter it anyway to learn a bit more about multi-classification problems with a huge feature space.

My strategy was to use the simple and elegant online-learning algorithm provided by tinrtgu.

The advantage of this algorithm is (a) it has a small footprint and (b) it’s fast. Although takes hours to run in native Python, but less than 15 minutes using PyPy.

Since PyPy is only 32-bit for Windows, I was limited to 2 Gb of RAM. I decided to move to Numba, which proved to be slightly faster, while also giving me access to my entire system RAM.

The original code is very well documented. I stripped out the comments for my use, and will highlight the changes I made as part of the competition.

I’ll discuss the code block by block.

## Files and Features

from datetime import datetime from math import log, exp, sqrt import numpy as np from numba import jit, i4, f4, f8, void D = 5770166 train = 'train_enc.csv' label = 'trainLabels.csv' test = 'test_enc.csv' train05 = 'train05.csv' label05 = 'labels05.csv' train95 = 'train95.csv' label95 = 'labels95.csv' feature_columns = 340

D is the total number of features. The original code did hashing in-line. Since I did hashing in a pre-processing step, I knew exactly how many features I had and set the value accordingly.

I also took off 5% of the training data to use for validation. I did not cross-validate; in other words, I always used the same set for validation.

The original data had 145 columns. The additional columns I added were feature interactions.

## Reading the Data Files

def data(path, label_path=None): for t, line in enumerate(open(path)): if t == 0: x = np.zeros(total_features+1) if label_path: label = open(label_path) label.readline() continue row = line.rstrip().split(',') for m, feat in enumerate(row): if m == 0: ID = int(feat) else: x[m] = feat # The hashing was completed in the pre-processing step if label_path: y = np.asarray([float(y) for y in label.readline().split(',')[1:]]) yield (ID, x, y) if label_path else (ID, x)

The key change in this function was the commented line above, where I changed the hashing function to just reading the input file. I’ll talk more about my pre-processing in a bit.

## Functions

@jit(f8(f8,i4)) def logloss(p, y): p = max(min(p, 1. - 10e-15), 10e-15) return -log(p) if y == 1. else -log(1. - p) @jit(f8(f8[:],f8[:])) def predict(x, w): wTx = 0. for i in x: wTx += w[i] * 1. return 1. / (1. + exp(-max(min(wTx, 20.), -20.))) @jit(void(f4,f8[:],f8[:],f8[:],f8,i4)) def update(alpha, w, n, x, p, y): for i in x: n[i] += abs(p - y) w[i] -= (p - y) * 1. * alpha / sqrt(n[i])

Numba makes is *so* easy to speed up functions. I just added the @jit decorators, and everything else was taken care of automatically.

## Training and Validation

I borrowed an idea posted by John 3:16 to cycle through the data until convergence (or, in my case, until the validation score got worse).

start = datetime.now() alpha = .10 K = [k for k in range(33) if k != 13] w = np.zeros((33,D)) n = np.zeros((33,D)) lastCV = 5 thisCV = 1 passes = 0 while (lastCV - thisCV) > 0.000001: lastCV = copy(thisCV) passes += 1 loss = 0. loss_y14 = log(1. - 10**-15) # Training on 95% seen = 0 for ID, x, y in data(train95, label95): seen += 1 for k in K: p = predict(x, w[k]) update(alpha, w[k], n[k], x, p, y[k]) loss += logloss(p, y[k]) loss += loss_y14 if seen % 100000 == 0: print('%s\tcount: %d\tlogloss: %f' % ( datetime.now(), seen, (loss/33.)/seen)) # Validating on the 5% loss = 0. seen = 0 for ID, x, y in data(train05, label05): seen += 1 for k in K: p = predict(x, w[k]) loss += logloss(p, y[k]) loss += loss_y14 thisCV = (loss/33.)/seen print(passes, 'CV:\tlogloss: %f' % ((loss/33.)/seen))

In the original code, `w`

and `n`

were Python lists. I modified them to Numpy arrays in order to get the benefit of Numba jit.

Once I determined the optimum number of passes, I trained again on the full data set, and write the predictions to file (code not shown).

## Pre-processing

There were more lines of pre-processing code than algorithm code, so I’m just going to hit the highlights.

import numpy as np import pandas as pd Xt = pd.read_csv('train.csv', index_col='id') Xs = pd.read_csv('test.csv', index_col='id') D = 2**26 text = [3, 4, 34, 35, 61, 64, 65, 91, 94, 95] text = ['x'+str(i) for i in text] for i in range(10): for j in range(i+1,10): ci = text[i] cj = text[j] Xt[ci+cj] = (Xt[ci] + Xt[cj]).apply(lambda x: abs(hash(x)) % D)

The text-text interactions were some of the most important. I used a large number for `D`

since I just wanted to avoid collisions. There were other ways to do this, but this was fairly straight forward. Once the interactions were encoded, I encoded the text columns, and did some “across the board” interactions:

Xt['text1'] = Xt.x3 + Xt.x34 + Xt.x64 + Xt.x94 + Xt.x61

Similarly, I created interaction features for the binary and categorical inputs. (code not shown)

I threw away any attributes not included in both the test and training data.

atts_to_keep = {} for col in Xt.columns: Xt_atts = set(Xt[col].unique()) Xs_atts = set(Xs[col].unique()) atts_to_keep[col] = Xt_atts & Xs_atts print('{} - {}, {}, {}'.format(col, len(Xt_atts), len(Xs_atts), len(atts_to_keep[col]))) Xt.loc[Xt[~Xt[col].isin(atts_to_keep[col])].index, col] = 0 Xs.loc[Xs[~Xs[col].isin(atts_to_keep[col])].index, col] = 0

And then I ensured each column had unique attributes:

le = LabelEncoder() # Rename attributes in each column from 0, ..., N-1 # where N is the number of attributes in the column for x in Xt.columns: print('--- Encoding: {} ---'.format(x)) le.fit(pd.concat([Xt[x],Xs[x]]).astype(str)) Xt[x] = le.transform(Xt[x].astype(str)) Xs[x] = le.transform(Xs[x].astype(str)) # Starting with the second column, add the maximum from the # previous column for i in range(1,len(cols)): max_feat = max(Xt[cols[i-1]].max(), Xs[cols[i-1]].max()) + 1 Xt[cols[i]] = Xt[cols[i]] + max_feat Xs[cols[i]] = Xs[cols[i]] + max_feat Xt.to_csv('train_enc.csv') Xs.to_csv('test_enc.csv')

(Of course, I did the same processing on the train95 and train05 data files.)

## Results

I finished a fairly-decent 69th out of 375 competitors, which gave me another Top 25% finish and boosted my Kaggle rank from 604th to 466th out of 224,564, putting me in the top 0.2% of competitors.

My final score was 0.0060263, while the winners scored 0.0043324. All of the top 10 scores were below 0.0050000, so I had a long way to go. On the other hand, the median score was 0.0085000, so I did considerably better than average.

## Final Thoughts

This contest was very rewarding in terms of knowledge.

It was my first time running PyPy, as well as my first time speeding up code with Numba.

My validation very closely matched my leader board scores, so I was able to very quickly know whether code changes were making progress. The submissions I chose for my final score provided the best two results on the private leader board, indicating that my validation was completely successful.

Had been able to spend a bit more time, I would have liked to try a completely different algorithm, such a neural net or random forest, and try ensemble averaging.

A no-brainer idea that I had intended to implement, but did not, was to train on different data ordering. Others reported in the forum (after the contest ended) that this improved their scored by 0.0002 or so, which would have moved me up about 10 places or so. As I’m sure Kaggle with provide another opportunity to use this online-learning approach, that will be the first thing I do next time.