Friday 19 May 2017

Converting Excel data for machine learning with scikit-learn in Python

If you have supervised (labelled) data in Excel and want to run a machine learning example on it, such as the one at the link below,
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

you need to convert the Excel data into the format that example expects: the arrays X and y.
In the example, X is a two-dimensional NumPy float array (one row per sample, one column per feature) and y is a one-dimensional NumPy array of class labels.


For example, suppose you have the data shown below:

fishlength       fishwidth   classlabel
1.2                   1.4                 1
1.5                   4.3                 2

We need X=array([[1.2, 1.4], [1.5, 4.3]]) and y=array([1, 2]).
To convert the data into this format, do the following.

In the Excel sheet, remove the header row (fishlength, fishwidth, classlabel) and cut the entire classlabel column. Save what remains as .csv (comma delimited) from the Save As dialog box (here the filename is finaldata.csv).

Paste the classlabel column into a separate Excel workbook using Paste Special with the Transpose option, so that the labels end up in a single row.
Save this workbook as .csv (comma delimited) from the Save As dialog box as well (here the filename is label.csv).
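
The manual steps above can also be done in code. The following is only a minimal sketch, not the method used in this post: it assumes the sheet was saved as CSV with its header row intact and the class label in the last column, and it uses an in-memory string (the name raw_csv is made up for illustration) in place of a real file.

```python
import io
import numpy as np

# Stand-in for a CSV exported from Excel with the headers kept
# (hypothetical content matching the small table above)
raw_csv = io.StringIO(
    "fishlength,fishwidth,classlabel\n"
    "1.2,1.4,1\n"
    "1.5,4.3,2\n"
)

# skip_header=1 drops the header row; the remaining cells are parsed as floats
data = np.genfromtxt(raw_csv, delimiter=",", skip_header=1)

X = data[:, :-1]             # every column except the last: the features
y = data[:, -1].astype(int)  # last column: the class labels as integers

print(X)
print(y)
```

With a real file, the filename would be passed to np.genfromtxt in place of raw_csv.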

With finaldata.csv and label.csv saved, run the code below:

import numpy as np

# Features: one row per sample, one column per feature
X = np.genfromtxt("finaldata.csv", delimiter=",")

# Labels: the single transposed row in label.csv, cast to integers
labels = np.genfromtxt("label.csv", delimiter=",")
y = labels.astype(int)
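
Before going further, it can help to check that the two loaded arrays line up. The helper below is a hypothetical sketch (check_xy is not part of NumPy or scikit-learn). It assumes that a one-dimensional X means a single feature column, and it rejects NaN values, which np.genfromtxt produces for cells it cannot parse, such as a leftover header.

```python
import numpy as np

def check_xy(X, y):
    """Return (X, y) as a 2-D float array and a 1-D label array, or raise ValueError."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).ravel()      # labels as a flat 1-D array
    if X.ndim == 1:
        X = X.reshape(-1, 1)       # assumption: 1-D X is one feature column
    if X.ndim != 2:
        raise ValueError("X must be 2-dimensional")
    if X.shape[0] != y.shape[0]:
        raise ValueError("X has %d rows but y has %d labels"
                         % (X.shape[0], y.shape[0]))
    if np.isnan(X).any():
        raise ValueError("X contains NaN; check for headers or empty cells")
    return X, y

# Demo values taken from the small table in this post
X_demo, y_demo = check_xy([[1.2, 1.4], [1.5, 4.3]], [1, 2])
print(X_demo.shape, y_demo.shape)
```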

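# The code below is copied from the example in the link above, with the step that adds noisy features removed

import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

y = label_binarize(y, classes=[1, 2, 3])  # one column per class, e.g. array([[1, 0, 0], [1, 0, 0], ...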
n_classes = y.shape[1]  # y.shape[1] is the number of columns (3 here); y.shape[0] is the number of rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)  # half of the samples go to training (X_train, with labels y_train) and half to testing
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True))  # create the one-vs-rest classifier
y_score = classifier.fit(X_train, y_train).decision_function(X_test)  # e.g. y_score = array([[-2.49503189e+00, 4.13933465e-01, 9.99997811e-01], ...; the largest score in a row marks the predicted class, so this X_test sample belongs to class 3
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])  # y_test holds the actual labels, y_score the predicted scores; fpr = false positive rate, tpr = true positive rate
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
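
To make the comments in the code above concrete, here is a small sketch of what label_binarize returns and how one row of y_score maps back to a class. The label list is made up for illustration; the score row is the one quoted in the comment on the y_score line.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

classes = [1, 2, 3]

# Made-up labels for four samples
demo_labels = [1, 2, 3, 1]
Y = label_binarize(demo_labels, classes=classes)
print(Y)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]

# One row of decision scores, one score per class
score_row = np.array([-2.49503189e+00, 4.13933465e-01, 9.99997811e-01])
predicted = classes[int(np.argmax(score_row))]  # class with the largest score
print(predicted)  # 3
```

Column index 2 of Y corresponds to class 3, which is the curve drawn by plt.plot(fpr[2], tpr[2], ...) in the code above.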
