Topics to be covered in series of articles:
- What is subsetting and types of subsetting in data preprocessing
- How to access top or bottom rows of the DataFrame
- How to set Column/Row Filters
- What Modifying Data in DataFrame means
- How to deal with Missing values
- Manipulate values either at Row level or Column level based on certain conditions
Introduction to data preprocessing
This is the continuation of the previous article where we covered – What is an index?; How to subset first N rows based on their position index?; Can we change the index?
We will be using the same Big Mart Sales Dataset to cover the below points –
Points to be covered:
- Will the index be always numeric?
- How to subset the data based on a label of the index?
- Can we reset the index?
- How to subset the data based on a value of a column?
So let’s start…
We will start with reading the data file by importing the pandas library and using the read_csv() function.
So now that we know how to change the pandas dataframe using a column. The first question that arises is – Do we always need to use a Numerical column to set the index? Well, the obvious answer is No. You can also have categorical columns like Item_Identifier, Item_fat_content, and so on as indexes.
Let us see how to do that
In this case, I will set the index of the dataframe as Item_Identifier. Now, you might be aware of why we have used the parameters drop=True and inplace= True. Let us see how the data frame looks like.
Now, the index of the dataframe is a categorical column. So, each of these rows has a label which is the Item_Identifier and from the original dataframe, we have removed this column.
Let us check what is data.index
This is very predictable
Now, how can you subset data when you have a label index i.e., categorical index. This is where .loc method comes into the picture. loc is the property of the dataframe using which you can subset the data using categorical values of the labels that we had given to our dataframe earlier.
So now, if we take FDA15 which is the label index of the first row, and for this label get me all the rows of the dataframe that have FDA15 as a label. Now, if I talk about it in technical or business terms, this is essentially all the rows in the data frame are sales records for the FDA15 item.
Since we have already done a good amount of experimentation on the index of the dataframe by changing it again and again.
Resetting the index
Coming to our next point, can we reset the index back to its original positional index? Well, you can do that in pandas and this is where the reset_index() function comes into the picture. Using this function, you can make sure that, whatever experimentation we have performed on the index for various reasons, at the end of it, the dataframe reverts to its original form which is having the positional index.
So, once you have executed the above line of code; you will notice that the index is the same i.e, the index has gone back to its initial state (you can check the data again using data.head method)
This was all about the label-based indexing and how you can subset such kinds of rows in pandas.
Our next point is, what if you want to subset your data based on the value of a column or a row.
Let us take an example; check the Item_type column, there are multiple item types like dairy, soft drinks, meat, fruits and vegetables, households, and so on.
Suppose I want to extract only those records from the dataset that has dairy as an item_type.
We will use below line of code to get all the records only of the specified value of the Item_type column:
data[data[‘Item_type’] == ‘Dairy’]
Now, those who are from an excel background will find this very nostalgic because this looks very similar to the filter feature in excel.
Let’s Conclude for today
We covered several topics that will help us subsetting the data using label and value index, and modifying data for data preprocessing. In the next article, we’ll start with subsetting data based on position in detail with code examples.