Data dictionary:
BR_INST_EXEC.ALL_BRANCHES: Speculative and retired branches
Cycles (CPU_CLK_UNHALTED.THREAD_P): Thread cycles when thread is not in halt state
ICACHE.MISSES: Number of instruction cache, victim cache, and streaming buffer misses, including uncacheable accesses
Instructions (INST_RETIRED.ANY_P): Number of instructions retired
IPC: Instructions/Cycles
ITLB_MISSES.MISS_CAUSES_A_WALK: Misses at all ITLB levels that cause a page walk
CYCLE_ACTIVITY.CYCLES_L1D_PENDING: Cycles while L1 cache miss demand load is outstanding
L1D.REPLACEMENT: L1D data line replacements
L2_cache_misses (L2_RQSTS.MISS): All requests that miss L2 cache
L2_cache_accesses (L2_RQSTS.REFERENCES): All L2 requests
MACHINE_CLEARS.COUNT: Number of machine clears (nukes) of any type
MACHINE_CLEARS.CYCLES: Cycles where there was a nuke (thread-specific and all thread)
MEM_LOAD_UOPS_RETIRED.L1_MISS: Retired load uops that missed the L1 cache as a data source
MISALIGN_MEM_REF.LOADS: Speculative cache line split load uops dispatched to L1 cache
RESOURCE_STALLS.ANY: Resource-related stall cycles
UOPS_EXECUTED.CORE: Number of uops executed on the core
DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK: Load misses in all DTLB levels that cause page walks
UOPS_EXECUTED.THREAD: Counts the number of uops to be executed per thread each cycle
UOPS_ISSUED.ANY: Uops that resource allocation table (RAT) issues to reservation station (RS)
UOPS_ISSUED.STALL_CYCLES: Cycles when RAT does not issue uops to RS for the thread
UOPS_RETIRED.ALL: Actually retired uops
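The derived IPC entry above is the ratio of two raw counters. A minimal sketch of that computation, assuming the counters sit in a pandas DataFrame; the column names `Instructions` and `Cycles` follow the labels in the dictionary but are assumptions about the actual dataset, and `add_ipc` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def add_ipc(df, instr_col='Instructions', cycles_col='Cycles'):
    """Return a copy of df with an 'IPC' column = instructions / cycles.

    Column names are assumptions; adjust them to the real dataset.
    """
    out = df.copy()
    # Treat zero cycles as missing so the ratio becomes NaN instead of inf
    cycles = out[cycles_col].replace(0, np.nan)
    out['IPC'] = out[instr_col] / cycles
    return out

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({'Instructions': [200, 150, 10], 'Cycles': [100, 300, 0]})
toy_ipc = add_ipc(toy)
print(toy_ipc)
```

Rows with zero cycles get a missing IPC rather than an infinite one, which keeps later summary statistics and histograms well defined.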
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from Utilities import Utilities as utils
plt.style.use('ggplot')
data = pd.read_pickle('Intermediate/Data_final')
print(data.shape)
data.head()
data['Vectorizable'].value_counts()
print(data.dtypes)
# Variable names
columns = data.columns.values.tolist()
data.describe()
# Get names of all continuous variables
continuous_vars = [columns[i] for i in np.where(data.dtypes != 'O')[0]]
continuous_vars.remove('Vectorizable') # Remove target variable
print(len(continuous_vars))
# Plot the distribution of each variable in vectorized and non-vectorized loops
for var in continuous_vars:
    plt.figure(figsize=(16, 5))
    # Left panel: non-vectorized loops
    plt.subplot(121)
    plt.hist(data.loc[data['Vectorizable'] == 0, var])
    #plt.title('Vectorizable = 0 vs. '+var)
    plt.ylabel('Number of non-vectorized loops')
    plt.xlabel(var)
    # Right panel: vectorized loops
    plt.subplot(122)
    plt.hist(data.loc[data['Vectorizable'] == 1, var])
    plt.ylabel('Number of vectorized loops')
    plt.xlabel(var)
    #plt.title('Vectorizable = 1; '+var)
    plt.show()
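Each panel above picks its own bins, so the two histograms of a variable may not share an x-axis scale. A minimal sketch of one way to make them comparable, assuming the same `Vectorizable` target column; `shared_bin_edges` is a hypothetical helper and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def shared_bin_edges(df, var, n_bins=10):
    """Bin edges spanning the full range of `var` across both classes."""
    values = df[var].dropna().to_numpy()
    return np.histogram_bin_edges(values, bins=n_bins)

# Toy data for illustration only
toy = pd.DataFrame({
    'Vectorizable': [0, 0, 0, 1, 1, 1],
    'IPC': [0.5, 1.0, 1.5, 2.0, 2.5, 3.0],
})
edges = shared_bin_edges(toy, 'IPC', n_bins=5)
# Count both classes against the same edges
counts_0, _ = np.histogram(toy.loc[toy['Vectorizable'] == 0, 'IPC'], bins=edges)
counts_1, _ = np.histogram(toy.loc[toy['Vectorizable'] == 1, 'IPC'], bins=edges)
print(edges, counts_0, counts_1)
```

Passing `bins=edges` to both `plt.hist` calls in the loop would give the two panels identical bin boundaries.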
Interesting variables:
ICACHE.MISSES, INST_RETIRED.ANY_P, ITLB_MISSES.MISS_CAUSES_A_WALK, L1D.REPLACEMENT, L2_cache_accesses, RESOURCE_STALLS.ANY,
UOPS_EXECUTED.CORE, UOPS_EXECUTED.THREAD, UOPS_ISSUED.ANY, UOPS_RETIRED.ALL
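The list above comes from inspecting the histograms by eye. As a rough numerical companion, one could compare per-class medians of each variable. This is a sketch under assumptions, not the method used in the notebook: `class_medians` is a hypothetical helper, and the column names and values in the toy frame are invented.

```python
import pandas as pd

def class_medians(df, variables, target='Vectorizable'):
    """Median of each variable per target class, plus the class 1 / class 0 ratio."""
    # Rows become variables, columns become the class labels 0 and 1
    med = df.groupby(target)[variables].median().T
    med['ratio_1_over_0'] = med[1] / med[0]
    return med

# Toy data for illustration only
toy = pd.DataFrame({
    'Vectorizable': [0, 0, 0, 1, 1, 1],
    'uops_retired_all': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'icache_misses': [4.0, 5.0, 6.0, 1.0, 2.0, 3.0],
})
summary = class_medians(toy, ['uops_retired_all', 'icache_misses'])
print(summary)
```

A ratio far from 1 suggests the variable behaves differently in the two classes, though it says nothing about the shape of the distributions, so the histograms remain the primary check.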
Reduce the data to two dimensions with t-SNE and look at how the target variable is distributed in the embedded space.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state = 2)
# Embed the columns at positions 1 to 21 into two dimensions
embs = tsne.fit_transform(data.iloc[:, 1:22])
# Color each loop by its target class
plt.scatter(embs[:, 0], embs[:, 1], c=data['Vectorizable'])
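t-SNE works on pairwise distances, so counters with very large magnitudes may dominate the embedding when the raw values are used. A minimal sketch of a variant that standardizes the features first; the `StandardScaler` step is an addition that does not appear in the original notebook, `scaled_tsne` is a hypothetical helper, and the toy data and the perplexity of 10 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def scaled_tsne(features, random_state=2, perplexity=30.0):
    """Standardize each column, then embed into 2D with t-SNE."""
    scaled = StandardScaler().fit_transform(features)
    tsne = TSNE(n_components=2, random_state=random_state, perplexity=perplexity)
    return tsne.fit_transform(scaled)

# Toy features with very different scales per column (illustration only)
rng = np.random.RandomState(0)
toy_features = rng.lognormal(size=(60, 5)) * np.array([1, 10, 100, 1e3, 1e4])
# Perplexity must be smaller than the number of samples
toy_embedding = scaled_tsne(toy_features, perplexity=10.0)
print(toy_embedding.shape)
```

Whether scaling changes the picture for the real data is an open question; comparing the scatter plots with and without it is the simplest way to find out.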