Supervised Learning Classification Checkpoint
In this checkpoint, we work on the Titanic data set to predict whether a passenger survived, using several supervised classification algorithms. We will start with logistic regression, then KNN and a decision tree, and finish with a random forest.
Preprocess the data using pandas.

In:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.preprocessing import LabelEncoder

dataset = pd.read_csv("titanic-passengers.csv")
dataset.head()

Out[4]: [first rows of the data set, with columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
In:
def preprocess_data(new_data):
    # Fill missing ages with the mean age
    new_data['Age'].fillna(new_data['Age'].mean(), inplace=True)
    # Encode Sex as 1 (male) / 0 (female)
    new_data.replace({'Sex': {'male': 1, 'female': 0}}, inplace=True)
    # Fill missing cabins with a default value
    new_data['Cabin'] = new_data.Cabin.fillna('G6')
    # Encode the target as 1 (Yes) / 0 (No)
    new_data.replace({'Survived': {'Yes': 1, 'No': 0}}, inplace=True)
    # Label-encode the port of embarkation
    label_encoder = LabelEncoder()
    new_data['Embarked'] = label_encoder.fit_transform(new_data['Embarked'])
    return new_data

data = preprocess_data(dataset)
data

Out[5]: [the preprocessed data set]
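As a quick sanity check, the same preprocessing steps can be tried on a tiny hand-made frame (the toy data below is illustrative, not taken from the Titanic file; the Cabin step is omitted since the toy frame has no such column):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature data set with the same column types
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, None, 30.0],
    'Survived': ['Yes', 'No', 'Yes'],
    'Embarked': ['S', 'C', 'S'],
})

# Same transformations as preprocess_data
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())   # mean of 22 and 30 -> 26
toy = toy.replace({'Sex': {'male': 1, 'female': 0},
                   'Survived': {'Yes': 1, 'No': 0}})
# LabelEncoder assigns integers in sorted order of the classes: 'C' -> 0, 'S' -> 1
toy['Embarked'] = LabelEncoder().fit_transform(toy['Embarked'])

print(toy)
```

Note that `replace` with a nested dict maps values column by column, and `LabelEncoder` imposes an arbitrary (alphabetical) ordering on the ports, which tree models tolerate better than linear ones.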
In:
X = data.drop(['Survived', 'Name', 'Ticket', 'PassengerId', 'Cabin'], axis=1)
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X

Out[8]: [features kept: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked — rows x 7 columns]

Part 1: Logistic Regression

Apply logistic regression.
Use a confusion matrix to validate your model. Another validation metric for classification is ROC/AUC; do your research on them, explain them, and apply them to our case.

In:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred

site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(

Out[9]: array([...], dtype=int64)

In:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
[precision / recall / f1-score / support for each class, plus accuracy, macro avg and weighted avg rows]

In:
import seaborn as sns
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True)

Out[11]: [heatmap of the confusion matrix]
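The ROC curve plots the true-positive rate against the false-positive rate as the decision threshold varies, and AUC summarizes the curve in one number (1.0 = perfect separation, 0.5 = random guessing). A minimal, self-contained sketch with made-up labels and scores — on the checkpoint data you would pass y_test and logreg.predict_proba(X_test)[:, 1] instead:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")  # AUC = 0.75 for this toy example
```

Use the probabilities from predict_proba rather than the hard 0/1 output of predict: with only two distinct score values the curve degenerates to a single point, and the AUC is far less informative.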