CS 243 Project - Hardware Counters and Data Augmentation for Predicting Vectorization

Data modeling

Build supervised machine learning models to predict vectorization (with and without data augmentation).

Data dictionary:
BR_INST_EXEC.ALL_BRANCHES: Speculative and retired branches
Cycles (CPU_CLK_UNHALTED.THREAD_P): Thread cycles when thread is not in halt state
ICACHE.MISSES: Number of instruction cache, victim cache, and streaming buffer misses; uncacheable accesses included
Instructions (INST_RETIRED.ANY_P): Number of instructions retired
IPC: Instructions/Cycles
ITLB_MISSES.MISS_CAUSES_A_WALK: Misses at all ITLB levels that cause a page walk
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles while L1 cache miss demand load is outstanding
L1D.REPLACEMENT: L1D data line replacements
L2_cache_misses (L2_RQSTS.MISS): All requests that miss L2 cache
L2_cache_accesses (L2_RQSTS.REFERENCES): All L2 requests
MACHINE_CLEARS.COUNT: Number of machine clears (nukes) of any type
MACHINE_CLEARS.CYCLES: Cycles during which there was a nuke (counts both thread-specific and all-thread nukes)
MEM_LOAD_UOPS_RETIRED.L1_MISS: Retired load uops with L1 cache misses as data sources
MISALIGN_MEM_REF.LOADS: Speculative cache-line-split load uops dispatched to the L1 cache
RESOURCE_STALLS.ANY: Resource-related stall cycles
UOPS_EXECUTED.CORE: Number of uops executed on the core
DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK: Load misses in all DTLB levels that cause page walks
UOPS_EXECUTED.THREAD: Counts the number of uops to be executed per thread each cycle
UOPS_ISSUED.ANY: Uops that resource allocation table (RAT) issues to reservation station (RS)
UOPS_ISSUED.STALL_CYCLES: Cycles when RAT does not issue uops to RS for the thread
UOPS_RETIRED.ALL: Actually retired uops

In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import minmax_scale
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, train_test_split, cross_validate, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import mutual_info_classif
plt.style.use('ggplot')
np.random.seed(1) # To reproduce results

Import data from all runs

In [3]:
data = pd.read_pickle('Intermediate/Data_final')
print data.shape
data.head()
(151, 23)
Out[3]:
Symbol Name BR_INST_EXEC.ALL_BRANCHES Cycles (CPU_CLK_UNHALTED.THREAD_P) ICACHE.MISSES Instructions (INST_RETIRED.ANY_P) IPC ITLB_MISSES.MISS_CAUSES_A_WALK CYCLE_ACTIVITY.CYCLES_L1D_PENDING L1D.REPLACEMENT L2_cache_misses (L2_RQSTS.MISS) ... MEM_LOAD_UOPS_RETIRED.L1_MISS MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_ISSUED.ANY UOPS_ISSUED.STALL_CYCLES UOPS_RETIRED.ALL Vectorizable
0 vdotr 0.780054 1.000000 0.846777 0.953049 0.333166 0.091284 0.009918 0.352163 0.038586 ... 0.006236 0.001107 0.576490 0.825337 1.000000 0.844710 0.989464 0.814133 0.846817 1
1 vsumr 0.363920 0.994448 1.000000 0.371609 0.120150 0.091871 0.008573 0.177088 0.004370 ... 0.007573 0.000633 0.855463 0.582767 0.769073 0.443357 0.410981 0.703915 0.446982 1
2 s312 0.361293 0.989835 0.758251 0.372705 0.121151 0.254541 0.018059 0.175171 0.011072 ... 0.009501 0.000273 0.923626 0.645358 0.742128 0.431895 0.398940 0.982609 0.427202 1
3 s311 0.348797 0.974573 0.585715 0.366408 0.120901 0.406182 0.028118 0.173145 0.018200 ... 0.009169 0.000294 0.922912 0.631597 0.773651 0.432357 0.405950 0.936244 0.427589 1
4 s233 0.137980 0.992884 0.970062 0.164879 0.043805 1.000000 1.000000 1.000000 1.000000 ... 1.000000 0.000075 1.000000 0.118286 0.098588 0.208184 0.227743 1.000000 0.202813 0

5 rows × 23 columns
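The counter values above already lie in [0, 1], so the raw counters were presumably min-max scaled per column upstream (note the otherwise unused minmax_scale import). A minimal sketch of that normalization, using a toy stand-in frame (raw counter values here are made up for illustration):

# Illustration only: per-column min-max scaling on a toy stand-in for the raw counters
raw = pd.DataFrame({'counter_a': [2., 5., 11.], 'counter_b': [0.1, 0.4, 0.2]})
scaled = pd.DataFrame(minmax_scale(raw), columns=raw.columns) # Each column mapped to [0, 1]
print scaled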

In [4]:
data['Vectorizable'].value_counts()
Out[4]:
0    85
1    66
Name: Vectorizable, dtype: int64
In [5]:
# Target class proportion
66./(66+85)
Out[5]:
0.4370860927152318
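For context, a trivial classifier that always predicts the majority class (0) would already be right 85 times out of 151, so that is the accuracy floor the models below should beat:

# Majority-class baseline accuracy (class 0 has 85 of the 151 samples)
print 85. / 151  # ~0.563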

Model experimentation

Using all features

In [7]:
# Create training and test data splits
n_feats_orig = data.shape[1]-1
seed = 0
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, 1:n_feats_orig], data.iloc[:, n_feats_orig], \
                                                    random_state=seed, test_size=50)
X_all, y_all = data.iloc[:, 1:n_feats_orig], data.iloc[:, n_feats_orig]
print X_train.shape, X_test.shape, y_train.shape, y_test.shape, X_all.shape, y_all.shape
feats = X_train.columns.values.tolist()
print y_train.value_counts()
(101, 21) (50, 21) (101,) (50,) (151, 21) (151,)
0    59
1    42
Name: Vectorizable, dtype: int64
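One caveat: this split is not stratified (train ends up 59/42, test 26/24). With only 151 samples, passing stratify keeps the 85/66 class ratio identical on both sides; a sketch of the alternative:

# Sketch: a stratified split preserves the class ratio in train and test
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    data.iloc[:, 1:n_feats_orig], data.iloc[:, n_feats_orig],
    random_state=seed, test_size=50, stratify=data.iloc[:, n_feats_orig])
print y_tr_s.value_counts()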
In [8]:
# Get the names of all features 
all_feats = X_train.columns.values.tolist()

Mutual information based feature selection

In [9]:
# Calculate mutual information to filter features
# num_feats = 10
mut_info = mutual_info_classif(X_all, y_all, n_neighbors=10, random_state=seed)
# feats = [all_feats[idx] for idx in np.argsort(-mut_info)]
# feats = feats[:num_feats]

# Getting top p features based on a threshold
feats = [all_feats[idx] for idx in np.where(mut_info >= mut_info.mean())[0]] # Threshold = Mean
X_train = X_train[feats]
X_test = X_test[feats]
X_all = X_all[feats]
print X_train.shape, X_test.shape, y_train.shape, y_test.shape, X_all.shape, y_all.shape
print feats
n_feats = len(feats)
(101, 10) (50, 10) (101,) (50,) (151, 10) (151,)
[u'ICACHE.MISSES', u'IPC', u'L1D.REPLACEMENT', u'L2_cache_accesses (L2_RQSTS.REFERENCES)', u'MISALIGN_MEM_REF.LOADS', u'RESOURCE_STALLS.ANY', u'UOPS_EXECUTED.CORE', u'DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK', u'UOPS_EXECUTED.THREAD', u'UOPS_RETIRED.ALL']

Variables (from 10 neighbor M.I. calculation)
[u'ICACHE.MISSES', u'IPC', u'L1D.REPLACEMENT', u'L2_cache_accesses (L2_RQSTS.REFERENCES)', u'MISALIGN_MEM_REF.LOADS', u'RESOURCE_STALLS.ANY', u'UOPS_EXECUTED.CORE', u'DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK', u'UOPS_EXECUTED.THREAD', u'UOPS_RETIRED.ALL']
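To sanity-check the mean cutoff, the scores can be ranked against it; a quick sketch using the mut_info array computed above:

# Rank all 21 features by mutual information; '*' marks those above the mean threshold
for idx in np.argsort(-mut_info):
    flag = '*' if mut_info[idx] >= mut_info.mean() else ' '
    print '%s %.4f %s' % (flag, mut_info[idx], all_feats[idx])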

Manual feature selection

In [514]:
# # Manual feature selection (based on visualization) - Unused
# feats = ['ICACHE.MISSES', 'Instructions (INST_RETIRED.ANY_P)', 'ITLB_MISSES.MISS_CAUSES_A_WALK', 'L1D.REPLACEMENT', \
#         'L2_cache_accesses (L2_RQSTS.REFERENCES)', 'RESOURCE_STALLS.ANY', 'UOPS_EXECUTED.CORE', \
#          'UOPS_ISSUED.STALL_CYCLES', 'MACHINE_CLEARS.CYCLES']
# X_train = X_train[feats]
# X_test = X_test[feats]
# X_all = X_all[feats]
# print X_train.shape, X_test.shape, y_train.shape, y_test.shape, X_all.shape, y_all.shape
# n_feats = len(feats)
(101, 9) (50, 9) (101,) (50,) (151, 9) (151,)

Build models

In [340]:
np.random.seed(1)

rseed = 0

# Uncomment each model to test it individually
# clf = LogisticRegression(class_weight='balanced', C=0.1)             
clf = KNeighborsClassifier(n_neighbors=5) #, weights = 'distance')
# clf = LogisticRegression(random_state = rseed)
# clf = GaussianNB()
# clf = GradientBoostingClassifier(random_state = rseed)
# clf = RandomForestClassifier(random_state = rseed)
# clf = SVC(kernel='linear', C=100, random_state=rseed)
In [341]:
print clf
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
In [342]:
cross_val_scores = cross_val_score(clf, X_all, y_all, cv=20)#, scoring='accuracy')
print "Individual cross val scores =", cross_val_scores
print "Mean cross val score =", np.mean(cross_val_scores)
Individual cross val scores = [ 0.55555556  0.33333333  0.44444444  0.77777778  0.44444444  0.75
  0.42857143  0.57142857  0.71428571  0.71428571  0.71428571  0.85714286
  0.85714286  0.57142857  0.85714286  0.85714286  0.57142857  0.71428571
  0.71428571  0.57142857]
Mean cross val score = 0.650992063492
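GridSearchCV is imported at the top but never used; the fixed n_neighbors=5 above could instead be tuned. A sketch (the grid values here are illustrative, not from the original experiments):

# Sketch: tune k for KNN with the already-imported GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
search.fit(X_all, y_all)
print search.best_params_
print search.best_score_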
In [343]:
np.random.seed(1)
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print "Pred Accuracy on train = ", accuracy_score(y_train, y_pred_train)
print "Pred Accuracy on test = ", accuracy_score(y_test, y_pred_test)
print classification_report(y_test, y_pred_test)
Pred Accuracy on train =  0.80198019802
Pred Accuracy on test =  0.68
             precision    recall  f1-score   support

          0       0.69      0.69      0.69        26
          1       0.67      0.67      0.67        24

avg / total       0.68      0.68      0.68        50

In [14]:
# Distribution of the target class in the test data
y_test.value_counts()
Out[14]:
0    26
1    24
Name: Vectorizable, dtype: int64
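confusion_matrix is also imported but never called; it makes the 0.68 test accuracy above easier to read than the report alone:

# Rows = true class (0, 1), columns = predicted class
print confusion_matrix(y_test, y_pred_test)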

Feature importance

In [15]:
# View feature importances
# top_feat_idx = np.argsort(-np.abs(clf.coef_))      # For logistic regression/SVM
top_feat_idx = np.argsort(-clf.feature_importances_) # For tree based models (random forests, gradient boosting)
top_feats = [feats[idx] for idx in top_feat_idx]
print top_feats[:3]
[u'UOPS_EXECUTED.CORE', u'IPC', u'MISALIGN_MEM_REF.LOADS']
In [16]:
# Plot the feature importance
n_feats_plot = len(feats)+1
plt.figure(figsize=(9, 6))
# plt.bar(np.arange(1, n_feats_plot), np.abs(clf.coef_[top_feat_idx].squeeze()))      # SVM/logistic regression
plt.bar(np.arange(1, n_feats_plot), clf.feature_importances_[top_feat_idx])           # Tree based models
plt.xticks(np.arange(1, n_feats_plot), top_feats, rotation=90) # Labels must follow the sorted importance order
plt.title('Feature Importance plot')
plt.show()
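The two cells above assume a model exposing feature_importances_ (or coef_); the KNN used earlier has neither. A model-agnostic sketch of permutation importance (shuffle one test column, measure the accuracy drop), assuming clf has been fit as above:

# Permutation importance: accuracy drop when a single feature is shuffled
base_acc = accuracy_score(y_test, clf.predict(X_test))
rng = np.random.RandomState(0)
for j, name in enumerate(feats):
    X_perm = X_test.copy()
    X_perm.iloc[:, j] = rng.permutation(X_perm.iloc[:, j].values)
    print '%-45s %.3f' % (name, base_acc - accuracy_score(y_test, clf.predict(X_perm)))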

Data augmentation - Experiment 1: SMOTE

In [91]:
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.ensemble import EasyEnsemble
In [92]:
# seed = 0
# sm = SMOTE()
# X_train_new, y_train_new = sm.fit_sample(X_train, y_train)
# clf.fit(X_train_new, y_train_new)
# y_pred_train_new = clf.predict(X_train_new)
# y_pred_test = clf.predict(X_test)
# print "Pred Accuracy on train = ", accuracy_score(y_train_new, y_pred_train_new)
# print "Pred Accuracy on test = ", accuracy_score(y_test, y_pred_test)
# print classification_report(y_test, y_pred_test)
In [231]:
from sklearn.neighbors import NearestNeighbors

# Function to generate synthetic data samples using SMOTE
def SMOTE(T, N=100, k=3):
    """
    Modified from
    https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data

    Returns (N/100) * n_minority_samples synthetic minority samples.

    Parameters
    ----------
    T : pandas DataFrame, shape = [n_minority_samples, n_features]
        Holds the minority samples.
    N : int
        Percentage of new synthetic samples:
        n_synthetic_samples = (N/100) * n_minority_samples.
        Values below 100 are rounded up to 100 here.
    k : int
        Number of nearest neighbours.

    Returns
    -------
    S : DataFrame, shape = [(N/100) * n_minority_samples, n_features]
    """

    n_minority_samples, n_features = T.shape
    np.random.seed(0) # Seed once so neighbour choice and gaps are reproducible

    if N < 100:
        # The original algorithm samples a subset of T when N < 100;
        # here we simply round up to one synthetic sample per row
        N = 100

    if (N % 100) != 0:
        raise ValueError("N must be < 100 or a multiple of 100")

    N = N / 100 # Synthetic samples per minority sample (integer division)
    n_synthetic_samples = N * n_minority_samples
    S = np.zeros(shape=(n_synthetic_samples, n_features))

    # Learn nearest neighbours
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(T)

    # Calculate synthetic samples
    for i in xrange(n_minority_samples):
        nn = neigh.kneighbors(T.iloc[i].values.reshape([1, n_features]),
                              return_distance=False)
        for n in xrange(N):
            nn_index = np.random.choice(nn[0])
            # NOTE: nn includes T.iloc[i] itself; we don't want to select it
            while nn_index == i:
                nn_index = np.random.choice(nn[0])

            dif = T.iloc[nn_index] - T.iloc[i]
            gap = np.random.random() # Interpolation fraction in [0, 1)
            S[n + i * N, :] = T.iloc[i, :] + gap * dif[:]

    S = pd.DataFrame(S, columns=T.columns.values.tolist())
    return S
In [215]:
# # Combine X and y, remove the symbol name column
# data_all = data.iloc[:, 1:n_feats+1]
# print data_all.shape
In [244]:
# Build a training frame (features + label) and generate SMOTE samples from it
data_train = X_train.copy()
data_train['Vectorizable'] = y_train
data_train.shape
# new_samples = SMOTE(data_train)

# Load already generated samples
new_samples = pd.read_csv('Intermediate/smote_samples.csv')
new_samples = new_samples.drop('Unnamed: 0', axis=1)
print new_samples.shape
(101, 11)
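Note that SMOTE was run here on the full training frame (features plus the Vectorizable label), which is why the label column has to be re-thresholded below. The more conventional usage oversamples only the minority class, keeping the label out of the interpolation; a sketch with the function defined earlier:

# Conventional usage sketch: oversample only the minority class (label 1)
minority = data_train.loc[data_train['Vectorizable'] == 1].drop('Vectorizable', axis=1)
syn_minority = SMOTE(minority, N=100, k=3)
syn_minority['Vectorizable'] = 1
print syn_minority.shape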
In [245]:
# Compare the distributions of the training set and the new data samples
X_train.describe()
Out[245]:
ICACHE.MISSES IPC L1D.REPLACEMENT L2_cache_accesses (L2_RQSTS.REFERENCES) MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_RETIRED.ALL
count 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000
mean 0.198731 0.542834 0.129467 0.101296 0.003973 0.108505 0.197155 0.098562 0.250420 0.252092
std 0.222979 0.278163 0.153427 0.144353 0.033616 0.201508 0.200970 0.163822 0.229663 0.229784
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.055332 0.299124 0.043326 0.036665 0.000007 0.012568 0.076946 0.009116 0.099784 0.099498
50% 0.109453 0.624030 0.069363 0.058536 0.000023 0.035300 0.125305 0.043828 0.165757 0.180506
75% 0.264093 0.778473 0.177088 0.107248 0.000082 0.076365 0.224134 0.114548 0.336506 0.333328
max 1.000000 1.000000 1.000000 1.000000 0.332701 1.000000 1.000000 1.000000 1.000000 1.000000
In [246]:
new_samples.describe()
Out[246]:
ICACHE.MISSES IPC L1D.REPLACEMENT L2_cache_accesses (L2_RQSTS.REFERENCES) MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_RETIRED.ALL Vectorizable
count 101.000000 101.000000 101.000000 101.000000 1.010000e+02 101.000000 101.000000 101.000000 101.000000 101.000000 101.000000
mean 0.187686 0.555431 0.124546 0.097460 1.821658e-03 0.096608 0.193801 0.090426 0.253980 0.255782 0.415842
std 0.196227 0.257556 0.129093 0.120117 1.516466e-02 0.176488 0.194891 0.155684 0.226519 0.226445 0.495325
min 0.002239 0.068120 0.008829 0.005452 5.579318e-07 0.002294 0.006642 0.001075 0.012073 0.011678 0.000000
25% 0.060786 0.321453 0.039447 0.030807 8.983143e-06 0.013597 0.079609 0.010999 0.102415 0.101937 0.000000
50% 0.110368 0.635657 0.069623 0.059004 2.388686e-05 0.036248 0.128412 0.052468 0.182134 0.180632 0.000000
75% 0.230213 0.781151 0.192073 0.139565 5.841605e-05 0.079404 0.203942 0.083392 0.306887 0.313223 1.000000
max 0.915909 0.964420 0.688335 0.711596 1.501185e-01 0.885895 0.946640 0.895809 0.996812 0.996098 1.000000
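The two describe() tables above look similar; where scipy is available, a two-sample KS test per feature gives a quick quantitative check (a sketch):

from scipy.stats import ks_2samp
# Small KS statistics / large p-values suggest matching marginal distributions
for col in feats:
    stat, p = ks_2samp(X_train[col].values, new_samples[col].values)
    print '%-45s KS=%.3f  p=%.3f' % (col, stat, p)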
In [247]:
# SMOTE interpolated the target column too; round it back to binary labels and check the distribution
new_samples.loc[new_samples['Vectorizable'] > 0.5, 'Vectorizable'] = 1
new_samples.loc[new_samples['Vectorizable'] <= 0.5, 'Vectorizable'] = 0
new_samples['Vectorizable'].value_counts()
Out[247]:
0.0    59
1.0    42
Name: Vectorizable, dtype: int64
In [248]:
# Combine the data
data_combined = pd.concat((data_train, new_samples))
print data_combined.shape
data_combined.head()
(202, 11)
Out[248]:
ICACHE.MISSES IPC L1D.REPLACEMENT L2_cache_accesses (L2_RQSTS.REFERENCES) MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_RETIRED.ALL Vectorizable
69 0.143356 0.713392 0.069017 0.041058 0.000032 0.010332 0.224134 0.091379 0.235755 0.244177 0.0
135 0.027050 0.884355 0.034352 0.021884 0.000031 0.008078 0.094499 0.041149 0.104578 0.103942 0.0
56 0.142475 0.000000 0.350906 0.155269 0.000041 0.151149 0.125305 0.135148 0.088666 0.088776 0.0
80 0.216463 0.342428 0.238609 0.151527 0.000116 0.065604 0.119742 0.142080 0.120991 0.119966 1.0
123 0.046474 0.595244 0.044028 0.037999 0.000016 0.024703 0.110224 0.055848 0.112159 0.119667 1.0
In [276]:
# Test the predictive models 
# clf = RandomForestClassifier(random_state = rseed)
# clf = LogisticRegression(class_weight='balanced', C=0.1)
clf = KNeighborsClassifier(n_neighbors=5) #, weights = 'distance')
In [277]:
X_all_new, y_all_new = data_combined.iloc[:, :n_feats], data_combined.iloc[:, n_feats]
print X_all_new.shape, y_all_new.shape
cross_val_scores_new = cross_val_score(clf, X_all_new, y_all_new, cv=20)
print "Individual cross val scores =", cross_val_scores_new
print "Mean cross val score =", np.mean(cross_val_scores_new)
(202, 10) (202,)
Individual cross val scores = [ 0.54545455  0.81818182  0.81818182  0.81818182  0.5         0.6         0.7
  0.8         0.8         0.7         0.9         0.9         0.7         0.9
  0.8         0.8         0.8         1.          1.          1.        ]
Mean cross val score = 0.795
In [275]:
np.random.seed(1)
clf.fit(X_all_new, y_all_new)
# Note: X_all contains the 101 training rows, so the "original data" score below is optimistic;
# the held-out test score is the fairer number
y_pred = clf.predict(X_all)
y_pred_test = clf.predict(X_test)
print "Pred Accuracy on original data = ", accuracy_score(y_all, y_pred)
print "Pred Accuracy on test data = ", accuracy_score(y_test, y_pred_test)
print classification_report(y_all, y_pred)
Pred Accuracy on original data =  0.920529801325
Pred Accuracy on test data =  0.76
             precision    recall  f1-score   support

          0       0.95      0.91      0.93        85
          1       0.89      0.94      0.91        66

avg / total       0.92      0.92      0.92       151

In [243]:
# Store the generated data samples
# new_samples.to_csv('Intermediate/smote_samples.csv')

Data augmentation - Experiment 2: synthpop

In [852]:
# Store the training data (use it in the R code to generate data samples using synthpop)
data_train.to_csv('Intermediate/data_train.csv') 
In [321]:
# Load the synthetically generated data 
syn_data = pd.read_csv('Intermediate/syn_data2.csv')
syn_data = syn_data.drop('Unnamed: 0', axis=1)
print syn_data.shape
syn_data.head()
(101, 11)
Out[321]:
ICACHE.MISSES IPC L1D.REPLACEMENT L2_cache_accesses..L2_RQSTS.REFERENCES. MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_RETIRED.ALL Vectorizable
0 0.632083 0.043805 0.352163 0.281932 0.000144 0.576490 0.825337 0.773651 0.992934 0.744836 1
1 0.026483 0.638048 0.001949 0.009073 0.000009 0.023876 0.108513 0.029647 0.102108 0.099498 1
2 1.000000 0.333166 0.282088 0.130811 0.000633 0.855463 0.631597 0.769073 0.443357 0.446982 1
3 0.527502 0.044556 0.002188 0.009935 0.000116 0.391888 0.410215 0.253072 0.336506 0.375899 1
4 0.062224 0.741927 0.051210 0.044799 0.000041 0.014442 0.092373 0.041149 0.115743 0.142437 1
In [322]:
syn_data.columns = data_train.columns # R mangled special characters in the column names (see the header above)
In [323]:
data_combined = pd.concat((data_train, syn_data))
print data_combined.shape
data_combined.head()
(202, 11)
Out[323]:
ICACHE.MISSES IPC L1D.REPLACEMENT L2_cache_accesses (L2_RQSTS.REFERENCES) MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_RETIRED.ALL Vectorizable
69 0.143356 0.713392 0.069017 0.041058 0.000032 0.010332 0.224134 0.091379 0.235755 0.244177 0
135 0.027050 0.884355 0.034352 0.021884 0.000031 0.008078 0.094499 0.041149 0.104578 0.103942 0
56 0.142475 0.000000 0.350906 0.155269 0.000041 0.151149 0.125305 0.135148 0.088666 0.088776 0
80 0.216463 0.342428 0.238609 0.151527 0.000116 0.065604 0.119742 0.142080 0.120991 0.119966 1
123 0.046474 0.595244 0.044028 0.037999 0.000016 0.024703 0.110224 0.055848 0.112159 0.119667 1
In [337]:
# Test predictive models
# clf = RandomForestClassifier(random_state = rseed)
# clf = LogisticRegression(class_weight='balanced', C=0.1)
clf = KNeighborsClassifier(n_neighbors=5) #, weights = 'distance')
In [338]:
X_all_new, y_all_new = data_combined.iloc[:, :n_feats], data_combined.iloc[:, n_feats]
print X_all_new.shape, y_all_new.shape
cross_val_scores_new = cross_val_score(clf, X_all_new, y_all_new, cv=20)
print "Individual cross val scores =", cross_val_scores_new
print "Mean cross val score =", np.mean(cross_val_scores_new)
(202, 10) (202,)
Individual cross val scores = [ 0.54545455  0.63636364  0.54545455  0.72727273  0.90909091  0.63636364
  0.54545455  0.5         0.7         0.9         0.9         0.6         0.8
  0.5         0.7         0.55555556  0.77777778  0.88888889  0.77777778
  0.55555556]
Mean cross val score = 0.685050505051
In [339]:
np.random.seed(1)
clf.fit(X_all_new, y_all_new)
y_pred = clf.predict(X_all)
y_pred_test = clf.predict(X_test)
print "Pred Accuracy on original data = ", accuracy_score(y_all, y_pred)
print "Pred Accuracy on test data = ", accuracy_score(y_test, y_pred_test)
print classification_report(y_all, y_pred)
Pred Accuracy on original data =  0.761589403974
Pred Accuracy on test data =  0.72
             precision    recall  f1-score   support

          0       0.78      0.81      0.79        85
          1       0.74      0.70      0.72        66

avg / total       0.76      0.76      0.76       151