Exploratory Data Analysis Titanic Dataset

Updated: Jul 4, 2021

Exploratory data analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often using statistical graphics and other data visualization methods.

In this post we will see how to do EDA on the Titanic data set and then apply logistic regression to train a model.


Load Libraries

We need a few libraries: pandas to read the dataset, numpy for numerical work, and matplotlib and seaborn to visualize the data.
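A minimal sketch of the imports, assuming the libraries are already installed (for example with `pip install pandas numpy matplotlib seaborn`). The aliases `np`, `pd`, `plt` and `sns` are just the usual conventions.

```python
# Numerical arrays and tabular data
import numpy as np
import pandas as pd

# Plotting: matplotlib is the base library, seaborn builds on top of it
import matplotlib.pyplot as plt
import seaborn as sns
```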


Load Dataset

Pandas helps us read the dataset into a DataFrame.
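Reading the file could look like the sketch below. The file name `titanic_train.csv` and the helper `load_titanic` are assumptions for illustration; use whatever name you saved the Titanic training CSV under.

```python
import pandas as pd

# Assumption: the Titanic training data is saved locally under this name.
CSV_PATH = "titanic_train.csv"


def load_titanic(path=CSV_PATH):
    """Read the Titanic CSV file into a pandas DataFrame."""
    return pd.read_csv(path)
```

After loading with `train = load_titanic()`, calling `train.head()` shows the first five rows.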



Visualization

Now we look for null values in the dataset by creating a heatmap with seaborn.
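One way to draw such a heatmap is sketched here. The helper name `plot_missing` and the styling options (figure size, `viridis` colormap) are my own choices, not necessarily the ones used in the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_missing(df):
    """Heatmap of df.isnull(): every highlighted cell is a missing value."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # isnull() gives a True/False table with the same shape as df
    sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis", ax=ax)
    return ax
```

Call `plot_missing(train)` followed by `plt.show()` to display it.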


The heatmap shows that the features 'Age' and 'Cabin' contain null values.


Now we will select a specific feature and plot it using seaborn.
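A small sketch of a countplot for a single feature. The helper `plot_survival_counts` is a hypothetical name; the column 'Survived' comes from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_survival_counts(df):
    """Bar chart with one bar per value of 'Survived' (0 or 1)."""
    fig, ax = plt.subplots()
    sns.countplot(x="Survived", data=df, ax=ax)
    return ax
```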



Here we have taken the 'Survived' feature and drawn a countplot, which shows the number of people who survived and who did not.


To break the 'Survived' countplot down by another feature, we pass that feature as hue, which splits each bar by the categories of that feature.
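The sketch below adds hue to the countplot. Using 'Sex' as the default second feature is only an assumption for illustration; any categorical column such as 'Pclass' works the same way. The helper name is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_survival_by(df, hue="Sex"):
    """Countplot of 'Survived' with each bar split by the hue column."""
    fig, ax = plt.subplots()
    sns.countplot(x="Survived", hue=hue, data=df, ax=ax)
    return ax
```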



To check whether a feature is normally distributed, we can visualize its distribution with seaborn's distplot.

This curve is approximately normally distributed.

If we plot the feature 'Fare', we can see that it is not normally distributed.


This feature is right-skewed.
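Besides looking at the plot, the skew can be checked with a number. This is an extra step that is not in the original notebook: pandas' `Series.skew()` returns a positive value for a right-skewed column and a value near zero for a symmetric one. The helper name is hypothetical.

```python
import pandas as pd


def column_skewness(df, column="Fare"):
    """Sample skewness of a column; a positive value means a long right tail."""
    return df[column].skew()
```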

Next we look for outliers. For that we use a boxplot, which shows the outliers as well as the 25th percentile, the 75th percentile, and the median of a feature.
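A sketch of the boxplot, assuming we plot 'Age' for each 'Pclass' as the next step suggests. The helper name is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_age_by_pclass(df):
    """One box per passenger class; points beyond the whiskers are outliers."""
    fig, ax = plt.subplots()
    sns.boxplot(x="Pclass", y="Age", data=df, ax=ax)
    return ax
```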



The box plot shows how the typical age differs across Pclass, so we will write code that replaces every null value in the feature 'Age' with the mean age of that passenger's Pclass.
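One possible way to do this is sketched below. It computes the mean age per class with groupby and transform rather than hard-coding values read off the plot, so it may differ from the code in the original notebook. The helper name is hypothetical.

```python
import pandas as pd


def fill_age_by_pclass(df):
    """Return a copy of df where missing 'Age' values are replaced
    by the mean age of the passenger's 'Pclass'."""
    out = df.copy()
    # transform("mean") returns, for every row, the mean age of its class
    class_mean = out.groupby("Pclass")["Age"].transform("mean")
    out["Age"] = out["Age"].fillna(class_mean)
    return out
```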



Here we use the heatmap again to check whether the feature 'Age' still has null values. We also one-hot encode the features 'Sex' and 'Embarked', dropping one dummy column from each to avoid the dummy variable trap.
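A minimal sketch of the encoding with `pd.get_dummies`. Passing `drop_first=True` drops one category per feature, which is the usual way to avoid the dummy variable trap. The helper name `one_hot` is hypothetical.

```python
import pandas as pd


def one_hot(df):
    """One-hot encode 'Sex' and 'Embarked', dropping the first category of each."""
    sex = pd.get_dummies(df["Sex"], drop_first=True)
    embark = pd.get_dummies(df["Embarked"], drop_first=True)
    return sex, embark
```

With the categories female/male and C/Q/S, this leaves a single 'male' column and the two columns 'Q' and 'S'.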



Comparing this heatmap with the previous one, we can clearly see that the null values in the feature 'Age' are gone.

Now we drop the columns that are of no use and concatenate the one-hot encoded features ('Sex', 'Embarked') with the main data. This is the final data on which we will train the model.
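The sketch below shows the drop-and-concat step. The list of dropped columns is an assumption on my part (text-like or identifier columns plus the two columns that were just encoded); adjust it to the columns you consider of no use. The helper name is hypothetical.

```python
import pandas as pd


def build_final_frame(df):
    """Drop unused columns and append the one-hot encoded ones."""
    sex = pd.get_dummies(df["Sex"], drop_first=True)
    embark = pd.get_dummies(df["Embarked"], drop_first=True)
    # Assumption: these columns are not used for training
    unused = ["Sex", "Embarked", "Name", "Ticket", "Cabin", "PassengerId"]
    trimmed = df.drop(columns=unused, errors="ignore")
    # axis=1 places the new dummy columns next to the existing ones
    return pd.concat([trimmed, sex, embark], axis=1)
```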



Model Building

Now we have to choose the dependent variable (which is our output) and the independent variables.

In the notebook, In[17] holds the independent variables and In[18] holds the dependent variable.
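A short sketch of this split, assuming 'Survived' is the dependent variable and every other remaining column is an independent variable. The names `X`, `y` and `split_features_target` are conventions I picked for illustration.

```python
import pandas as pd


def split_features_target(df, target="Survived"):
    """Return (X, y): all columns except the target, and the target column."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y
```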


Now we do a train-test split, import logistic regression, initialize it as an object named logmodel, and fit the model. Then it is time to predict on the test set and check the accuracy.
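These steps can be sketched with scikit-learn as follows. The values `test_size=0.3`, `random_state=101` and `max_iter=1000` are assumptions chosen for illustration, and the wrapper function `train_logmodel` is hypothetical; only the object name logmodel comes from the text.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_logmodel(X, y, test_size=0.3, random_state=101):
    """Split the data, fit logistic regression, and predict on the test part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # A higher max_iter gives the solver more room to converge
    logmodel = LogisticRegression(max_iter=1000)
    logmodel.fit(X_train, y_train)
    predictions = logmodel.predict(X_test)
    return logmodel, y_test, predictions
```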



Here we have imported the confusion matrix, which gives us the type 1 errors (false positives) and type 2 errors (false negatives), and the accuracy score, which tells us what fraction of the predictions the model got right. You can see both in the diagram.
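As a rough sketch, the evaluation could look like this. For binary labels 0 and 1, scikit-learn's `confusion_matrix` puts the true classes in rows and the predicted classes in columns, so flattening it gives true negatives, false positives, false negatives and true positives in that order. The helper `evaluate` is a hypothetical name.

```python
from sklearn.metrics import accuracy_score, confusion_matrix


def evaluate(y_true, y_pred):
    """Summarize binary predictions: error counts and overall accuracy."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "tn": int(tn),
        "fp": int(fp),  # type 1 errors
        "fn": int(fn),  # type 2 errors
        "tp": int(tp),
        "accuracy": accuracy_score(y_true, y_pred),
    }
```

With the model from the previous step you would call `evaluate(y_test, predictions)`.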


The accuracy is quite low, so now it is your turn to get hands-on and improve it by trying different ML algorithms. You can mail me your accuracy score.


References and credit


Krish Naik - He is an amazing teacher for data science. You can visit his YouTube channel to explore all of these concepts.



Your feedback is appreciated!


Did you find this blog helpful? Any suggestions for improvement? Please let me know by filling in the contact us form or pinging me on LinkedIn.

Thanks!








