Data Analysis in Python With Pandas

Updated: Jul 4, 2021

If you are planning to move your career toward Machine Learning and Data Science, you need to learn how to work with data using the most important Python tools, such as Pandas, NumPy, scikit-learn, and many more. In this blog you will learn how to use the pandas library for data analysis.

Pandas is an open source library for data manipulation and analysis in Python.

Loading Data

The very first step is to load the dataset, so that we can later perform feature engineering, feature selection, and hyperparameter tuning. The dataset is stored as a spreadsheet-style file (a CSV file in this example), so we need to read it into a data frame before we can apply any operations.

Pandas allows us to load a spreadsheet and manipulate it programmatically in Python. The central concept in pandas is the type of object called a DataFrame – basically a table of values which has a label for each row and column.

Now let's load the dataset using pandas:

import pandas
df = pandas.read_csv('music.csv')

Here, df is a pandas DataFrame.
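If you don't have music.csv at hand, here is a minimal sketch of the same loading step. It assumes the file has the columns Artists, Genre, Listeners, and Plays; the rows are made-up sample values, and the text is fed through io.StringIO so the snippet runs without a file on disk. It also uses the common pd alias for pandas.

```python
import io

import pandas as pd

# Made-up stand-in for the contents of music.csv.
# Column names and values are assumptions for illustration only.
csv_text = """Artists,Genre,Listeners,Plays
Artist A,Jazz,1300000,27000000
Artist B,Rock,2700000,70000000
Artist C,Jazz,1500000,48000000
Artist D,Pop,2000000,74000000
"""

# read_csv accepts a file path or a file-like object.
# With a real file this would be pd.read_csv('music.csv').
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # the first few rows
print(df.dtypes)  # the type pandas inferred for each column
```

Checking shape, head(), and dtypes right after loading is a quick way to confirm the file was parsed the way you expected.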

Selection -

If we want to select a column from df, we just write df['Artists'].
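As a small sketch of column selection, the snippet below builds a stand-in DataFrame (the values are made up, and the column names are assumed to match the dataset) and selects one column and then two columns.

```python
import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Artists": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Genre": ["Jazz", "Rock", "Jazz", "Pop"],
    "Listeners": [1_300_000, 2_700_000, 1_500_000, 2_000_000],
    "Plays": [27_000_000, 70_000_000, 48_000_000, 74_000_000],
})

# A single column name returns a Series.
artists = df["Artists"]

# A list of column names returns a smaller DataFrame.
subset = df[["Artists", "Plays"]]

print(artists)
print(subset)
```

Note the double brackets in the second case: the inner pair is a Python list of column names.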

If we want to select one row or multiple rows from df, we use the loc indexer, which extracts rows by their index label. We give it the starting label and the ending label, for example df.loc[1:3]; with loc, both ends of the range are included.

If we want to select multiple rows and multiple columns from df by position, we use the iloc indexer. We give it the starting and ending positions for the rows and for the columns, for example df.iloc[0:3, 0:2]; with iloc, the ending position is excluded, just like in normal Python slicing.
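Here is a short sketch of loc and iloc side by side, again on made-up sample data with assumed column names. The DataFrame uses the default 0, 1, 2, ... index, so labels and positions happen to be the same numbers.

```python
import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Artists": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Genre": ["Jazz", "Rock", "Jazz", "Pop"],
    "Listeners": [1_300_000, 2_700_000, 1_500_000, 2_000_000],
    "Plays": [27_000_000, 70_000_000, 48_000_000, 74_000_000],
})

# loc slices by label and includes both ends: rows labelled 1 and 2.
rows_by_label = df.loc[1:2]

# iloc slices by position and excludes the end: positions 1 and 2.
rows_by_pos = df.iloc[1:3]

# Rows and columns together: first two rows, first two columns.
block = df.iloc[0:2, 0:2]

# The same block with loc, naming the columns instead of counting them.
labelled = df.loc[0:1, ["Artists", "Genre"]]

print(rows_by_label)
print(block)
```

The different end-point rules are the usual source of off-by-one surprises, so it is worth printing the result when switching between the two.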

Filtering -

If we want specific rows from the dataset, we just index df with a condition that names the value we are looking for.

Now it gets more interesting. We can easily filter rows using the values of a specific column. For example, we can keep only our jazz musicians, or only the artists who have more than 1,800,000 listeners.
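The two filters described here could look like the sketch below. The data is made up and the column names Genre and Listeners are assumptions; only the 1,800,000 threshold comes from the text.

```python
import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Artists": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Genre": ["Jazz", "Rock", "Jazz", "Pop"],
    "Listeners": [1_300_000, 2_700_000, 1_500_000, 2_000_000],
    "Plays": [27_000_000, 70_000_000, 48_000_000, 74_000_000],
})

# The comparison builds a True/False Series; indexing with it keeps the True rows.
jazz = df[df["Genre"] == "Jazz"]

# Artists with more than 1,800,000 listeners.
popular = df[df["Listeners"] > 1_800_000]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses.
popular_jazz = df[(df["Genre"] == "Jazz") & (df["Listeners"] > 1_800_000)]

print(jazz)
print(popular)
```

Filtering returns a new DataFrame and leaves df itself unchanged.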

Dealing with Missing Values -

There are many techniques for handling missing values, such as mean/median/mode imputation, random sample imputation, capturing NaN (missing) values with a new feature, and end-of-distribution imputation. I am not going to cover all of them here; I will only show how to drop the NaN values. Keep in mind that when you build a model, simply dropping the rows with NaN values throws away information and can hurt the model's performance.

Many datasets you’ll deal with in your data science journey will have missing values. Let’s say our data frame has a missing value.

Pandas provides multiple ways to deal with this. The easiest is to just drop the rows with missing values using df.dropna().

Another simple option is to replace the NaN values with 0 using df.fillna(0).
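A minimal sketch of these options, on made-up data where one Listeners value is deliberately missing. The last line also shows mean imputation, one of the techniques mentioned earlier, as one possible way to write it.

```python
import numpy as np
import pandas as pd

# Made-up sample data with one missing Listeners value; column names are assumptions.
df = pd.DataFrame({
    "Artists": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Genre": ["Jazz", "Rock", "Jazz", "Pop"],
    "Listeners": [1_300_000, np.nan, 1_500_000, 2_000_000],
    "Plays": [27_000_000, 70_000_000, 48_000_000, 74_000_000],
})

# Count the missing values in each column.
print(df.isna().sum())

# Drop every row that contains at least one NaN.
dropped = df.dropna()

# Replace every NaN with 0.
filled = df.fillna(0)

# Replace NaN in one column with that column's mean (mean imputation).
mean_filled = df.fillna({"Listeners": df["Listeners"].mean()})

print(dropped)
print(filled)
```

All three calls return new DataFrames, so df still contains the NaN afterwards unless you assign the result back to it.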

Grouping -

Things start to get really interesting when you start grouping rows with certain criteria and aggregating their data. For example, let’s group our dataset by genre and see how many listeners and plays each genre has:

Pandas grouped the two “Jazz” rows into one, and since we used sum() for aggregation, it added together the listeners and plays for the two Jazz artists and shows the sums in the combined Jazz row.
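The grouping step described above might be written like this. The data is made up (with two Jazz rows so there is something to combine), and the column names are assumptions.

```python
import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Artists": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Genre": ["Jazz", "Rock", "Jazz", "Pop"],
    "Listeners": [1_300_000, 2_700_000, 1_500_000, 2_000_000],
    "Plays": [27_000_000, 70_000_000, 48_000_000, 74_000_000],
})

# Group the rows by genre, then add up the numeric columns within each group.
# Picking the numeric columns explicitly keeps the Artists text column out of the sum.
totals = df.groupby("Genre")[["Listeners", "Plays"]].sum()

print(totals)
```

The result has one row per genre, and Genre becomes the index of the new DataFrame.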

This is not only nifty, but is an extremely powerful data analysis method. Now that you know groupby(), you wield immense power to fold datasets and uncover insights from them. Aggregation is the first pillar of statistical wisdom, and so is one of the foundational tools of statistics.

In addition to sum(), pandas provides multiple aggregation functions including mean() to compute the average value, min(), max(), and multiple other functions. More on groupby() in the Group By User Guide.
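As a quick sketch of the other aggregation functions, the snippet below computes the mean per genre and then several statistics at once with agg(). The data and column names are made up for illustration.

```python
import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Artists": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Genre": ["Jazz", "Rock", "Jazz", "Pop"],
    "Listeners": [1_300_000, 2_700_000, 1_500_000, 2_000_000],
    "Plays": [27_000_000, 70_000_000, 48_000_000, 74_000_000],
})

# Average number of listeners per genre.
mean_listeners = df.groupby("Genre")["Listeners"].mean()

# Several aggregations in one call: one output column per function name.
summary = df.groupby("Genre")["Listeners"].agg(["min", "max", "mean"])

print(mean_listeners)
print(summary)
```

Using agg() with a list of names is handy when you want a small summary table per group instead of a single number.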

If you use groupby() to its full potential, and use nothing else in pandas, then you’d be putting pandas to great use. But the library can still offer you much, much more.

Creating New Columns from Existing Columns -

Often in the data analysis process, we find ourselves needing to create new columns from existing ones. Pandas makes this very easy.

By telling Pandas to divide a column by another column, it realizes that what we want to do is divide the individual values respectively (i.e. each row’s “Plays” value by that row’s “Listeners” value).
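A minimal sketch of that division, on made-up data. The name of the new column, "Avg Plays", is just a label picked for this example.

```python
import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Artists": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "Genre": ["Jazz", "Rock", "Jazz", "Pop"],
    "Listeners": [1_300_000, 2_700_000, 1_500_000, 2_000_000],
    "Plays": [27_000_000, 70_000_000, 48_000_000, 74_000_000],
})

# Dividing two columns works row by row: each row's Plays by that row's Listeners.
# Assigning to a new column name adds the result to the DataFrame.
df["Avg Plays"] = df["Plays"] / df["Listeners"]

print(df)
```

No loop is needed; pandas lines the two columns up by index and does the arithmetic for every row at once.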

Get Hands On!

You can install Anaconda Navigator on your laptop and run Jupyter Notebook there to practice the code.

References -

1. Krish Naik - he is an amazing teacher for Data Science; you can visit his YouTube channel and explore all of these concepts.

2. Jay Alammar - he is also a YouTuber; you can find him on Twitter @JayAlammar

Your feedback is appreciated!

Did you find this blog helpful? Any suggestions for improvement? Please let me know by filling in the contact us form or pinging me on LinkedIn.

