CS 243 Project - Hardware Counters and Data Augmentation for Predicting Vectorization

Data import and preprocessing

Import the hardware performance counter data collected with Apple's Instruments from all runs of the target 'runnovec', and combine them into a single dataset.

Data dictionary:
BR_INST_EXEC.ALL_BRANCHES: Speculative and retired branches
Cycles (CPU_CLK_UNHALTED.THREAD_P): Thread cycles when thread is not in halt state
ICACHE.MISSES: Number of instruction cache, victim cache, and streaming buffer misses, including uncacheable accesses
Instructions (INST_RETIRED.ANY_P): Number of instructions retired
IPC: Instructions/Cycles
ITLB_MISSES.MISS_CAUSES_A_WALK: Misses at all ITLB levels that cause a page walk
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles while L1 cache miss demand load is outstanding
L1D.REPLACEMENT: L1D data line replacements
L2_cache_misses (L2_RQSTS.MISS): All requests that miss L2 cache
L2_cache_accesses (L2_RQSTS.REFERENCES): All L2 requests
MACHINE_CLEARS.COUNT: Number of machine clears (nukes) of any type
MACHINE_CLEARS.CYCLES: Cycles where there was a nuke (thread-specific and all thread)
MEM_LOAD_UOPS_RETIRED.L1_MISS: Retired load uops misses in L1 cache as data sources
MISALIGN_MEM_REF.LOADS: Speculative cache line split load uops dispatched to L1 cache
RESOURCE_STALLS.ANY: Resource-related stall cycles
UOPS_EXECUTED.CORE: Number of uops executed on the core
DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK: Load misses in all DTLB levels that cause page walks
UOPS_EXECUTED.THREAD: Counts the number of uops to be executed per thread each cycle
UOPS_ISSUED.ANY: Uops that resource allocation table (RAT) issues to reservation station (RS)
UOPS_ISSUED.STALL_CYCLES: Cycles when RAT does not issue uops to RS for the thread
UOPS_RETIRED.ALL: Actually retired uops
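
As a quick check of the derived IPC entry, the ratio can be recomputed from the two raw counters. The sketch below uses the instruction and cycle counts reported for vdotr in the merged table later in this notebook; the helper name `ipc` is made up for illustration.

```python
def ipc(instructions, cycles):
    """Instructions per cycle: retired instructions divided by unhalted thread cycles."""
    return float(instructions) / cycles

# Counter values for vdotr, copied from the merged table in this notebook
vdotr_instructions = 284478949114
vdotr_cycles = 202634487665

vdotr_ipc = ipc(vdotr_instructions, vdotr_cycles)
print(round(vdotr_ipc, 3))  # agrees with the IPC column value 1.404
```

Because IPC is derived from two other columns, it adds no independent information to the dataset, which may matter when the features are later used for modelling.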

In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import minmax_scale

Import data from all runs

In [3]:
#data_path = '../Data/Instruments_counters/'
data_path = '../Data/Instruments_counters/Iteration 2/'
num_runs = 5
X = {}

for i in xrange(num_runs):
    fname = data_path + 'Run' + str(i+1) + '_four_ctrs.xlsx' 
    X[i+1] = pd.read_excel(fname)
In [4]:
for i in range(num_runs):
    print X[i+1].shape
(162, 9)
(162, 8)
(161, 8)
(163, 8)
(162, 8)
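
Before reading the spreadsheets, it can help to build the file paths in one place and check that they exist. This is a minimal sketch, assuming the `Run<N>_four_ctrs.xlsx` naming shown above; the helper `run_file_paths` is hypothetical and not part of the notebook.

```python
import os

def run_file_paths(data_path, num_runs, suffix='_four_ctrs.xlsx'):
    """Return {run_number: path} for Run1..RunN, mirroring the naming used above."""
    return {i: os.path.join(data_path, 'Run' + str(i) + suffix)
            for i in range(1, num_runs + 1)}

paths = run_file_paths('../Data/Instruments_counters/Iteration 2/', 5)

# Report missing files up front instead of failing midway through the load loop
missing = [p for p in paths.values() if not os.path.exists(p)]
print("%d paths, %d missing" % (len(paths), len(missing)))
```

Each path could then be passed to `pd.read_excel` exactly as in the loop above.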

Read function names

In [5]:
with open('func_names.txt', 'r') as f:
    func_names = []
    for line in f:
        func_names.append(line.strip('\n'))
In [6]:
print func_names[:3]
['s000', 's111', 's1111']

Remove data rows that do not match any entry in func_names

In [7]:
for i in range(num_runs):
    bool_idx = []
    for row in xrange(len(X[i+1])):
        bool_idx.append(X[i+1].loc[row, 'Symbol Name'].strip() in func_names)
    print "Pre removal shape:", X[i+1].shape,
    X[i+1] = X[i+1][bool_idx]
    print "; Post removal shape:", X[i+1].shape
Pre removal shape: (162, 9) ; Post removal shape: (151, 9)
Pre removal shape: (162, 8) ; Post removal shape: (151, 8)
Pre removal shape: (161, 8) ; Post removal shape: (151, 8)
Pre removal shape: (163, 8) ; Post removal shape: (151, 8)
Pre removal shape: (162, 8) ; Post removal shape: (151, 8)
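
The same row filter can be written without an explicit loop over row labels. The sketch below is one possible alternative, not the notebook's method; the small table and the symbols `main` and `dyld_stub` are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of one run's table; real tables come from the Excel exports
run = pd.DataFrame({
    'Symbol Name': [' s000', 's111 ', 'main', 'dyld_stub'],
    'IPC': [1.0, 0.5, 2.0, 0.1],
})
func_names = ['s000', 's111', 's1111']

# Strip padding, then keep rows whose symbol is one of the benchmark functions
keep = run['Symbol Name'].str.strip().isin(func_names)
filtered = run[keep]
print(filtered.shape)
```

Unlike indexing with `.loc[row, ...]` over `xrange(len(...))`, this form does not depend on the frame having a default 0..n-1 index.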
In [8]:
for i in range(num_runs):
    print X[i+1].columns
Index([u'Total Samples', u'Running Time', u'Self (ms)',
       u'BR_INST_EXEC.ALL_BRANCHES', u'Cycles (CPU_CLK_UNHALTED.THREAD_P)',
       u'ICACHE.MISSES', u'Instructions (INST_RETIRED.ANY_P)', u'IPC',
       u'Symbol Name'],
      dtype='object')
Index([u'Total Samples', u'Running Time', u'Self (ms)',
       u'ITLB_MISSES.MISS_CAUSES_A_WALK', u'CYCLE_ACTIVITY.CYCLES_L1D_PENDING',
       u'L1D.REPLACEMENT', u'L2_cache_misses (L2_RQSTS.MISS)', u'Symbol Name'],
      dtype='object')
Index([u'Total Samples', u'Running Time', u'Self (ms)',
       u'L2_cache_accesses (L2_RQSTS.REFERENCES)', u'MACHINE_CLEARS.COUNT',
       u'MACHINE_CLEARS.CYCLES', u'MEM_LOAD_UOPS_RETIRED.L1_MISS',
       u'Symbol Name'],
      dtype='object')
Index([u'Total Samples', u'Running Time', u'Self (ms)',
       u'MISALIGN_MEM_REF.LOADS', u'RESOURCE_STALLS.ANY',
       u'UOPS_EXECUTED.CORE', u'DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK',
       u'Symbol Name'],
      dtype='object')
Index([u'Total Samples', u'Running Time', u'Self (ms)',
       u'UOPS_EXECUTED.THREAD', u'UOPS_ISSUED.ANY',
       u'UOPS_ISSUED.STALL_CYCLES', u'UOPS_RETIRED.ALL', u'Symbol Name'],
      dtype='object')
In [9]:
# Drop unnecessary columns
cols_drop = ['Total Samples', 'Running Time', 'Self (ms)']
for i in range(num_runs):
    X[i+1] = X[i+1].drop(cols_drop, axis=1)
    print X[i+1].shape
(151, 6)
(151, 5)
(151, 5)
(151, 5)
(151, 5)

Join the datasets

In [11]:
# Strip white space in the symbol column and join
X[1]['Symbol Name'] = X[1]['Symbol Name'].str.strip()
X_final = X[1].copy()

for i in range(2, num_runs+1):
    X[i]['Symbol Name'] = X[i]['Symbol Name'].str.strip()
    X_final = X_final.merge(X[i], on="Symbol Name")
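
`merge` performs an inner join by default, so a function missing from any one run would silently disappear from the result. In this notebook every run has 151 rows before and after the join, so nothing is lost. The sketch below shows how the behaviour could be checked on made-up data; it assumes a pandas version that supports the `validate` argument.

```python
import pandas as pd

# Hypothetical two-run miniature; column names follow the notebook, values are made up
run_a = pd.DataFrame({'Symbol Name': ['s000', 's111', 's112'],
                      'ICACHE.MISSES': [10, 20, 30]})
run_b = pd.DataFrame({'Symbol Name': ['s000', 's111'],
                      'L1D.REPLACEMENT': [5, 6]})

# Inner join: s112 is missing from run_b, so it is dropped from the result.
# validate='one_to_one' raises if a symbol appears twice on either side.
merged = run_a.merge(run_b, on='Symbol Name', validate='one_to_one')
dropped = sorted(set(run_a['Symbol Name']) - set(merged['Symbol Name']))
print(merged.shape)
print(dropped)
```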
In [12]:
print X_final.shape
X_final.head()
(151, 22)
Out[12]:
BR_INST_EXEC.ALL_BRANCHES Cycles (CPU_CLK_UNHALTED.THREAD_P) ICACHE.MISSES Instructions (INST_RETIRED.ANY_P) IPC Symbol Name ITLB_MISSES.MISS_CAUSES_A_WALK CYCLE_ACTIVITY.CYCLES_L1D_PENDING L1D.REPLACEMENT L2_cache_misses (L2_RQSTS.MISS) ... MACHINE_CLEARS.CYCLES MEM_LOAD_UOPS_RETIRED.L1_MISS MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_ISSUED.ANY UOPS_ISSUED.STALL_CYCLES UOPS_RETIRED.ALL
0 32898053378 202634487665 165343323 284478949114 1.404 vdotr 1566271 1414688448 7987661844 1443520314 ... 37122480 46984068 4244258 103131657181 746796411743 51553927 313639529747 255207954422 283002409419 315093572163
1 15402237231 201510737415 195245596 111415719268 0.553 vsumr 1576298 1223385933 4016790085 163928654 ... 52913116 57043846 2429370 153035564259 527714486691 39649824 164961112108 106401790932 244734928689 166693554555
2 15291819486 200576980534 148066843 111741988682 0.557 s312 4355770 2572975087 3973317312 414545498 ... 118442102 71543658 1045728 165228902493 584245186976 38260814 160714999271 103304372760 341497114037 159352080019
3 14766436661 197487923610 114395584 109867874366 0.556 s311 6946805 4004280019 3927364137 681120881 ... 108278763 69047605 1129560 165101217008 571816069369 39885773 160886222961 105107649484 325399189741 159495608576
4 5902922158 201194189637 189402970 49883641665 0.248 s233 17093130 142283054238 22681220374 37398546300 ... 218227606 7521975179 288029 178891118054 108208635727 5086768 77842695865 59266540314 347535199663 76069585447

5 rows × 22 columns

In [14]:
# Move 'Symbol Name' to the first column (the column drop below is disabled)
cols = X_final.columns.values.tolist()
cols.remove("Symbol Name")
cols.insert(0, "Symbol Name")
X_final = X_final[cols]
#X_final = X_final.drop('TX_MEM.ABORT_CAPACITY_WRITE', axis=1)
print X_final.shape
X_final.head()
(151, 22)
Out[14]:
Symbol Name BR_INST_EXEC.ALL_BRANCHES Cycles (CPU_CLK_UNHALTED.THREAD_P) ICACHE.MISSES Instructions (INST_RETIRED.ANY_P) IPC ITLB_MISSES.MISS_CAUSES_A_WALK CYCLE_ACTIVITY.CYCLES_L1D_PENDING L1D.REPLACEMENT L2_cache_misses (L2_RQSTS.MISS) ... MACHINE_CLEARS.CYCLES MEM_LOAD_UOPS_RETIRED.L1_MISS MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_ISSUED.ANY UOPS_ISSUED.STALL_CYCLES UOPS_RETIRED.ALL
0 vdotr 32898053378 202634487665 165343323 284478949114 1.404 1566271 1414688448 7987661844 1443520314 ... 37122480 46984068 4244258 103131657181 746796411743 51553927 313639529747 255207954422 283002409419 315093572163
1 vsumr 15402237231 201510737415 195245596 111415719268 0.553 1576298 1223385933 4016790085 163928654 ... 52913116 57043846 2429370 153035564259 527714486691 39649824 164961112108 106401790932 244734928689 166693554555
2 s312 15291819486 200576980534 148066843 111741988682 0.557 4355770 2572975087 3973317312 414545498 ... 118442102 71543658 1045728 165228902493 584245186976 38260814 160714999271 103304372760 341497114037 159352080019
3 s311 14766436661 197487923610 114395584 109867874366 0.556 6946805 4004280019 3927364137 681120881 ... 108278763 69047605 1129560 165101217008 571816069369 39885773 160886222961 105107649484 325399189741 159495608576
4 s233 5902922158 201194189637 189402970 49883641665 0.248 17093130 142283054238 22681220374 37398546300 ... 218227606 7521975179 288029 178891118054 108208635727 5086768 77842695865 59266540314 347535199663 76069585447

5 rows × 22 columns

Data summary

In [15]:
X_final.describe()
Out[15]:
BR_INST_EXEC.ALL_BRANCHES Cycles (CPU_CLK_UNHALTED.THREAD_P) ICACHE.MISSES Instructions (INST_RETIRED.ANY_P) IPC ITLB_MISSES.MISS_CAUSES_A_WALK CYCLE_ACTIVITY.CYCLES_L1D_PENDING L1D.REPLACEMENT L2_cache_misses (L2_RQSTS.MISS) L2_cache_accesses (L2_RQSTS.REFERENCES) ... MACHINE_CLEARS.CYCLES MEM_LOAD_UOPS_RETIRED.L1_MISS MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_ISSUED.ANY UOPS_ISSUED.STALL_CYCLES UOPS_RETIRED.ALL
count 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 151.000000 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 ... 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02 1.510000e+02
mean 1.029730e+10 4.314274e+10 3.820871e+07 8.005649e+10 2.276384 1.758772e+06 6.275280e+09 2.852117e+09 2.476650e+09 5.748567e+09 ... 4.832081e+07 4.203577e+08 5.059883e+07 1.837174e+10 1.812483e+11 5.114688e+06 9.505438e+10 7.201072e+10 5.513212e+10 9.593726e+10
std 9.375488e+09 4.406093e+10 4.047781e+07 6.687566e+10 1.034400 2.605665e+06 1.765274e+10 3.358082e+09 5.141308e+09 7.941159e+09 ... 8.079379e+07 1.173619e+09 3.750493e+08 3.384069e+10 1.749958e+11 8.067838e+06 8.237843e+10 5.822866e+10 6.848248e+10 8.245904e+10
min 1.017222e+08 2.276098e+08 8.996800e+04 8.079634e+08 0.073000 6.534000e+03 3.621955e+06 2.608560e+05 4.893980e+05 1.383100e+06 ... 3.053900e+05 7.973300e+04 2.110000e+02 6.340420e+06 1.376529e+09 4.622000e+03 7.222727e+08 6.830927e+08 3.361293e+08 7.947695e+08
25% 3.245769e+09 1.527282e+10 1.125142e+07 3.395053e+10 1.475000 4.030940e+05 5.598525e+08 9.753718e+08 2.675647e+08 2.156453e+09 ... 1.206805e+07 1.829620e+07 2.838050e+04 2.321397e+09 7.048659e+10 4.666855e+05 3.927703e+10 3.101884e+10 1.620568e+10 3.991881e+10
50% 6.454546e+09 2.483256e+10 2.267135e+07 5.818220e+10 2.547000 8.862050e+05 1.344801e+09 1.593733e+09 1.142119e+09 3.484510e+09 ... 2.553813e+07 5.148628e+07 9.121700e+04 6.743583e+09 1.186366e+11 2.378649e+06 6.785619e+10 5.343507e+10 2.917066e+10 6.942281e+10
75% 1.306314e+10 5.556697e+10 5.177838e+07 1.166920e+11 3.135500 1.929257e+06 3.772427e+09 3.950341e+09 2.185620e+09 6.318272e+09 ... 5.408279e+07 1.475001e+08 3.057290e+05 1.447650e+10 2.150991e+11 5.885383e+06 1.377849e+11 1.042060e+11 5.968286e+10 1.387093e+11
max 4.214540e+10 2.026345e+11 1.952456e+08 2.984538e+11 4.068000 1.709313e+07 1.422831e+11 2.268122e+10 3.739855e+10 5.904742e+10 ... 7.717047e+08 7.521975e+09 3.835279e+09 1.788911e+11 9.045471e+11 5.155393e+07 3.711657e+11 2.579181e+11 3.475352e+11 3.719477e+11

8 rows × 21 columns

Normalize the data

In [16]:
for col in X_final:
    if X_final[col].dtype != 'object':
        X_final[col] = minmax_scale(X_final[col])
/Users/rahulsridhar/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py:444: DataConversionWarning: Data with input dtype int64 was converted to float64.
  warnings.warn(msg, DataConversionWarning)
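
With its default feature range, `minmax_scale` maps each column to [0, 1] as (x - min) / (max - min). As a worked check, the sketch below applies that formula to the IPC statistics from the summary table before normalization; the helper name `min_max` is made up. Because the mapping is linear and increasing, the rescaled mean and median should equal the mean and median of the normalized column.

```python
def min_max(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# IPC statistics from the pre-normalization summary table
ipc_min, ipc_max = 0.073, 4.068
ipc_mean, ipc_median = 2.276384, 2.547

scaled_mean = min_max(ipc_mean, ipc_min, ipc_max)
scaled_median = min_max(ipc_median, ipc_min, ipc_max)
print(round(scaled_mean, 6))
print(round(scaled_median, 6))
```

The two results agree with the IPC mean (0.551535) and median (0.619274) in the post-normalization summary.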
In [17]:
X_final.describe()
Out[17]:
BR_INST_EXEC.ALL_BRANCHES Cycles (CPU_CLK_UNHALTED.THREAD_P) ICACHE.MISSES Instructions (INST_RETIRED.ANY_P) IPC ITLB_MISSES.MISS_CAUSES_A_WALK CYCLE_ACTIVITY.CYCLES_L1D_PENDING L1D.REPLACEMENT L2_cache_misses (L2_RQSTS.MISS) L2_cache_accesses (L2_RQSTS.REFERENCES) ... MACHINE_CLEARS.CYCLES MEM_LOAD_UOPS_RETIRED.L1_MISS MISALIGN_MEM_REF.LOADS RESOURCE_STALLS.ANY UOPS_EXECUTED.CORE DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK UOPS_EXECUTED.THREAD UOPS_ISSUED.ANY UOPS_ISSUED.STALL_CYCLES UOPS_RETIRED.ALL
count 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 ... 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000
mean 0.242500 0.212024 0.195325 0.266251 0.551535 0.102550 0.044080 0.125738 0.066211 0.097334 ... 0.062245 0.055874 0.013193 0.102666 0.199156 0.099130 0.254646 0.277286 0.157823 0.256343
std 0.222994 0.217685 0.207413 0.224682 0.258924 0.152498 0.124071 0.148057 0.137475 0.134491 ... 0.104737 0.156027 0.097789 0.189176 0.193757 0.156507 0.222378 0.226364 0.197243 0.222170
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.074780 0.074332 0.057193 0.111349 0.350939 0.023209 0.003909 0.042992 0.007141 0.036498 ... 0.015248 0.002422 0.000007 0.012942 0.076519 0.008964 0.104077 0.117930 0.045707 0.105412
50% 0.151101 0.121562 0.115710 0.192760 0.619274 0.051483 0.009426 0.070256 0.030526 0.058990 ... 0.032710 0.006834 0.000024 0.037662 0.129832 0.046054 0.181226 0.205073 0.083049 0.184905
75% 0.308285 0.273407 0.264857 0.389335 0.766583 0.112528 0.026489 0.174158 0.058429 0.106982 ... 0.069714 0.019599 0.000080 0.080891 0.236636 0.114080 0.369996 0.402445 0.170930 0.371584
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 21 columns

In [18]:
# Save the data
pd.to_pickle(X_final, "Intermediate/X_all_runs")
In [19]:
# Look at the correlation matrix
plt.figure(figsize=(10, 8))
plt.matshow(X_final.corr(), fignum=1)
plt.colorbar()
plt.show()

Read labels (can the loop be vectorized or not?)

In [20]:
# Read the C program
with open("../Data/TSVC_force_vec/tsc.c", 'r') as f:
    c_prog = f.readlines()
print len(c_prog)
5589
In [21]:
# Read the vectorization report
# with open("../Data/TSVC_force_vec/reportgcc.lst.txt", 'r') as f:
with open("../Data/TSVC_force_vec_2/reportgcc.lst.txt", 'r') as f:
    vec_report = f.readlines()
print len(vec_report)
2011
In [22]:
# Get the first and last lines of the function definitions for each function
func_first_line = {}
func_last_line = {}

for func in func_names:
    # First line
    idx = [i for i, line in enumerate(c_prog) if " " + func + "(" in line][0]
    func_first_line[func] = idx
    
    # Last line
    while True:
        idx += 1
        if "return" in c_prog[idx]:
            func_last_line[func] = idx
            break
    print func, func_first_line[func], func_last_line[func]
s000 773 797
s111 801 824
s1111 827 848
s112 853 876
s1112 880 903
s113 908 930
s1113 933 955
s114 960 986
s115 991 1015
s1115 1018 1042
s116 1047 1072
s118 1077 1101
s119 1106 1132
s1119 1135 1161
s121 1166 1190
s122 1195 1222
s123 1227 1257
s124 1262 1292
s125 1296 1323
s126 1327 1355
s127 1360 1387
s128 1392 1420
s131 1425 1447
s132 1452 1475
s141 1480 1508
s151 1521 1539
s152 1550 1572
s161 1577 1608
s1161 1611 1642
s162 1647 1670
s171 1675 1696
s172 1701 1722
s173 1727 1749
s174 1754 1776
s175 1781 1803
s176 1808 1833
s211 1844 1867
s212 1872 1894
s1213 1897 1919
s221 1924 1947
s1221 1950 1971
s222 1976 2000
s231 2005 2028
s232 2033 2057
s1232 2060 2084
s233 2089 2116
s2233 2119 2146
s235 2150 2175
s241 2180 2203
s242 2209 2230
s243 2235 2259
s244 2264 2288
s1244 2291 2314
s2244 2317 2340
s251 2345 2369
s1251 2372 2397
s2251 2400 2425
s3251 2428 2451
s252 2457 2482
s253 2488 2515
s254 2520 2545
s255 2550 2577
s256 2582 2607
s257 2612 2637
s258 2640 2668
s261 2673 2699
s271 2702 2726
s272 2731 2756
s273 2761 2786
s274 2791 2818
s275 2823 2849
s2275 2852 2876
s276 2881 2908
s277 2912 2944
s278 2949 2979
s279 2984 3018
s1279 3021 3047
s2710 3052 3088
s2711 3093 3116
s2712 3121 3145
s281 3150 3176
s1281 3179 3205
s291 3210 3235
s292 3240 3268
s293 3273 3294
s2101 3299 3322
s2102 3327 3352
s2111 3357 3384
s311 3396 3420
s31111 3431 3460
s312 3465 3490
s313 3494 3519
s314 3524 3551
s315 3556 3589
s316 3594 3621
s317 3625 3652
s318 3657 3694
s319 3699 3727
s3110 3732 3768
s13110 3771 3804
s3111 3809 3836
s3112 3841 3867
s3113 3872 3899
s321 3904 3926
s322 3931 3953
s323 3958 3981
s331 3986 4015
s332 4027 4061
s341 4067 4095
s342 4100 4128
s343 4133 4163
s351 4168 4195
s1351 4198 4225
s352 4230 4256
s353 4261 4289
s421 4301 4328
s1421 4331 4357
s422 4362 4391
s423 4396 4423
s424 4428 4456
s431 4461 4486
s441 4491 4519
s442 4524 4564
s443 4569 4602
s451 4607 4629
s452 4634 4656
s453 4661 4685
s471 4690 4718
s481 4723 4748
s482 4754 4777
s491 4787 4810
s4112 4815 4837
s4113 4842 4865
s4114 4870 4896
s4115 4901 4927
s4116 4932 4960
s4117 4965 4987
s4121 4992 5014
va 5019 5040
vag 5045 5068
vas 5073 5096
vif 5101 5125
vpv 5130 5152
vtv 5157 5180
vpvtv 5185 5207
vpvts 5212 5234
vpvpv 5239 5261
vtvtv 5266 5288
vsumr 5293 5317
vdotr 5322 5347
vbor 5352 5393
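
The extent search above relies on two heuristics: a function starts at the first line containing its name preceded by a space and followed by `(`, and it ends at the next line containing `return`. The sketch below restates that logic on a made-up C fragment (not taken from tsc.c); the helper name `function_extent` is hypothetical.

```python
def function_extent(lines, func):
    """Return (first, last) 0-based indices: the line containing ' func(' and
    the next line after it that contains 'return'."""
    first = next(i for i, text in enumerate(lines) if " " + func + "(" in text)
    last = next(i for i in range(first + 1, len(lines)) if "return" in lines[i])
    return first, last

# Made-up C fragment for illustration only
toy_source = [
    "int s000()\n",
    "{\n",
    "\tfor (int i = 0; i < 8; i++)\n",
    "\t\ta[i] = b[i] + 1;\n",
    "\treturn 0;\n",
    "}\n",
    "int s111()\n",
    "{\n",
    "\treturn 0;\n",
    "}\n",
]
print(function_extent(toy_source, "s000"))
print(function_extent(toy_source, "s111"))
```

The heuristic assumes the first `return` after the signature closes the function, so a function with an early return would be cut short. The indices are 0-based, whereas compiler reports typically number lines from 1.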
In [23]:
# Compare the line numbers against vectorization report
loop_vectorized = {} # 1 for yes 0 for no

for func in func_names:
#for func in temp:
    print func,
    for num in range(func_first_line[func], func_last_line[func]+1):
        try:
            to_find = ":"+str(num)+":"
            
            # get first index of occurrence
            idx = [i for i, line in enumerate(vec_report) if to_find in line][0]
            #print num, "/", idx, ";" 
            if "vectorized loop" in vec_report[idx]:
                loop_vectorized[func] = 1
                print "Yes"
                break
            # Test each phrase separately; `"a" or "b" in s` is always true
            elif "loop not vectorized" in vec_report[idx] or "not beneficial" in vec_report[idx]:
                loop_vectorized[func] = 0
                print "No"
                break
        except IndexError:
            pass
s000 Yes
s111 No
s1111 Yes
s112 No
s1112 Yes
s113 No
s1113 No
s114 No
s115 Yes
s1115 Yes
s116 No
s118 No
s119 Yes
s1119 Yes
s121 Yes
s122 No
s123 No
s124 No
s125 Yes
s126 No
s127 No
s128 No
s131 Yes
s132 Yes
s141 No
s151 s152 Yes
s161 No
s1161 No
s162 Yes
s171 Yes
s172 No
s173 Yes
s174 Yes
s175 No
s176 Yes
s211 No
s212 No
s1213 No
s221 No
s1221 No
s222 No
s231 No
s232 No
s1232 Yes
s233 No
s2233 No
s235 No
s241 No
s242 No
s243 Yes
s244 No
s1244 No
s2244 Yes
s251 Yes
s1251 Yes
s2251 No
s3251 No
s252 No
s253 No
s254 No
s255 No
s256 No
s257 No
s258 No
s261 No
s271 No
s272 No
s273 No
s274 No
s275 No
s2275 Yes
s276 No
s277 No
s278 No
s279 No
s1279 No
s2710 No
s2711 No
s2712 No
s281 No
s1281 Yes
s291 No
s292 No
s293 No
s2101 No
s2102 Yes
s2111 No
s311 Yes
s31111 No
s312 Yes
s313 Yes
s314 Yes
s315 Yes
s316 Yes
s317 Yes
s318 No
s319 Yes
s3110 No
s13110 Yes
s3111 Yes
s3112 No
s3113 Yes
s321 No
s322 No
s323 No
s331 No
s332 No
s341 No
s342 No
s343 No
s351 No
s1351 Yes
s352 Yes
s353 No
s421 Yes
s1421 Yes
s422 Yes
s423 Yes
s424 Yes
s431 Yes
s441 No
s442 No
s443 No
s451 Yes
s452 Yes
s453 No
s471 Yes
s481 No
s482 No
s491 Yes
s4112 Yes
s4113 Yes
s4114 Yes
s4115 Yes
s4116 Yes
s4117 Yes
s4121 Yes
va No
vag Yes
vas Yes
vif No
vpv Yes
vtv Yes
vpvtv Yes
vpvts Yes
vpvpv Yes
vtvtv Yes
vsumr Yes
vdotr Yes
vbor Yes
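
When a report line is matched against several phrases, each phrase needs its own `in` test, because an expression of the form `"a" or "b" in s` evaluates to the non-empty string `"a"` and is therefore always truthy. The sketch below is a minimal illustration on made-up report lines; the helper `classify` is hypothetical and only reuses the phrases searched for above.

```python
def classify(report_line):
    """Return 1 (vectorized), 0 (not vectorized), or None (no verdict)."""
    if "vectorized loop" in report_line:
        return 1
    # One membership test per phrase
    if "loop not vectorized" in report_line or "not beneficial" in report_line:
        return 0
    return None

# Made-up lines; the wording of a real compiler report may differ
report_lines = [
    "tsc.c:10:3: note: vectorized loop",
    "tsc.c:20:3: note: loop not vectorized: not beneficial",
    "tsc.c:30:3: note: unrelated remark",
]
verdicts = [classify(line) for line in report_lines]
print(verdicts)

# A non-empty string literal is truthy, so this is True for any line
always_true = bool("loop not vectorized" or "not beneficial" in "unrelated remark")
print(always_true)
```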
In [24]:
print len(np.nonzero(loop_vectorized.values())[0])
print np.nonzero(loop_vectorized.values())[0]
66
[  0   1   3   4   7   9  13  14  15  16  17  18  20  21  22  23  30  32
  33  34  41  42  44  46  54  59  63  69  70  73  74  76  80  81  82  85
  88  90  91 102 103 104 107 108 110 112 116 119 120 121 123 124 125 126
 127 129 130 131 132 133 134 136 140 142 148 149]
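
The indices printed above follow the dictionary's internal ordering rather than the order of func_names: s111 is labelled No, yet index 1 appears in the list. To see which functions were vectorized, it may be clearer to list names. A minimal sketch on a four-function subset of the labels shown above:

```python
# Subset of the labels printed above (1 = vectorized, 0 = not)
loop_vectorized = {'s000': 1, 's111': 0, 's1111': 1, 's112': 0}
func_names = ['s000', 's111', 's1111', 's112']

# Walk func_names so the result follows the source order, not dict order
vectorized_names = [f for f in func_names if loop_vectorized.get(f) == 1]
print("%d vectorized: %s" % (len(vectorized_names), ", ".join(vectorized_names)))
```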
In [25]:
vec_labels = pd.DataFrame(loop_vectorized, index=[0]).T
vec_labels['Symbol Name'] = vec_labels.index
vec_labels.columns = ['Vectorizable', 'Symbol Name']
vec_labels.columns
Out[25]:
Index([u'Vectorizable', u'Symbol Name'], dtype='object')
In [26]:
vec_labels.head()
Out[26]:
Vectorizable Symbol Name
s000 1 s000
s111 0 s111
s1111 1 s1111
s1112 1 s1112
s1113 0 s1113
In [27]:
print vec_labels.shape  # s151 got no label above (no matching entry in the report), so add it manually as not vectorized
vec_labels = vec_labels.append({'Vectorizable': 0, 'Symbol Name': 's151'}, ignore_index=True)
print vec_labels.shape
(150, 2)
(151, 2)
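
Functions that received no label can also be found programmatically instead of by comparing shapes. A minimal sketch on a three-function subset, using the labels printed earlier for s141 and s152:

```python
# Subset of the notebook's data: s151 received no verdict in the labelling loop
func_names = ['s141', 's151', 's152']
loop_vectorized = {'s141': 0, 's152': 1}

# Any function absent from the dict was never classified by the report matching
unlabeled = [f for f in func_names if f not in loop_vectorized]
print(unlabeled)
```

Running the same comprehension on the full func_names list would show whether s151 is the only function that needs a manual label.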
In [28]:
Data_final = X_final.merge(vec_labels, on='Symbol Name')
print Data_final.shape
(151, 23)
In [39]:
Data_final.to_pickle('Intermediate/Data_final')