Topics to be covered in series of articles:
- What is subsetting and types of subsetting
- How to access top or bottom rows of the DataFrame
- How to set Column/Row Filters
- What Modifying Data in DataFrame means
- How to deal with Missing values
- Manipulate values either at Row level or Column level based on certain conditions
Introduction to Data preprocessing
In Data Science for Machine Learning using Python, there are four major tasks namely Reading Data ﬁles, Preprocessing (subsetting, Modifying, aggregating data), Visualizing trends and patterns, and Building ML models.
In this series of articles, we’ll cover the above all points, but as of now, we’ll mainly focus on the preprocessing part – How to do Subsetting using python.
So let’s start…
Subsetting and its types:
While importing data into DataFrame using python we usually work with small sections or portions of data. DataFrame is nothing but a two-dimensional data structure where data is arranged in tabular format in rows and columns. Subsetting data or DataFrame is the process of extracting or selecting a set of desired rows and columns from the DataFrame.
There are two types of Indexing scheme in pandas that are used for subsetting:
As the name suggests – It consists of an Integer denoting row or column position.
Here we’ve labels like ‘car’,’DAF01’,… for rows and columns.
Now let us have a couple of examples to understand both Indexing techniques.
In Fig: a, each row is assigned numeric index starting from 0; and in Fig: b each row is assigned label index. We’ll dive more deeply above both indexing techniques ahead in this and upcoming articles.
Overview of subsetting in Pandas
In this section, we’ll learn diﬀerent subsetting techniques for data prepossessing using pandas and by utilizing code examples we will cover below all topics.
Points to be covered:
- What is an index?
- How to subset ﬁrst N rows based on position index?
- Can we change the index?
- Will the index be always numeric ?
- Can we reset the Index?
By the end of this article and complete series of this topic related articles, we’ll have a complete understanding and hands-on how to work on subsetting with pandas.
So Starting with the ﬁrst step, reading the data ﬁle by importing the pandas library and using read_csv() function.
The dataset we are going to use for this article is the Big Mart Sales Dataset. Now we already know that there are two major Indexing schemes in pandas. The ﬁrst approach gives an integer index to each row and column based on its position. In python Indexing starts from 0 for the ﬁrst row, 1 for the second row, and so on; similarly, the column index also starts with 0 for the ﬁrst column, 1 for the second column, and so on.
Now, the second approach is where we deal with label-based indexes for both rows and columns instead of Numeric Index values.
Getting information for data preprocessing
As we already discussed both approaches at the start of this article. Let us look at what exactly an Index means. So let us see the below Dataframe and how the Index is stored under the hood.
If you will have a close look then the Index for the taken dataframe starts with 0 and ends at 8523 with the step size of 1.
Let us see the shape of the Dataset which gives you the total count of the records present in the Dataset along with the count of Number of Columns. So the shape of the Dataset (is obtained using shape method), is 8523 rows and 12 columns.
We can also retrieve the Column Names used or given in the Dataset by using the .columns method with the DataFrame variable. Here we can use Label based Indexing because each column has a name by default.
Pandas Dataframe by default takes Numeric Index for rows, also known as positional index; and for columns, by default, it takes a label-based index as each column has their own names.
Now, that we know which indexing is used when let’s see the above theory in practical by taking a small example of how to utilize the positional index to get the ﬁrst N rows of a DataFrame. This is where the head() function comes into the picture. Using it you can access the ﬁrst N number of rows.
Note: If you call head() without any parameter it will just display the ﬁrst 5 records starting position index with 0 to index 4.
If you pass parameter 10 to head() like head(10) it will display the ﬁrst 10 records.
Now coming to our next question, Can we change the index? The short answer is yes. So, let’s see how.
Here I’ve created a list of 8523 random numbers. The length of the list is similar to the length of the Dataset we have. We’ll be setting these random numbers as the new index for our DataFrame.
Let’s see what happens when we change this. So now, once again I’ll use head() to see the ﬁrst 5 rows of the Dataframe. Let’s see if something changes.
Have you noticed something even though the index got changed but the ordering of the data is still the same as above. So even though it manually changes the index of the dataFrame the order in which the rows appearing is still the same. Now, why is this happening? This is happening because whenever we manually update the index of the dataframe in pandas, this new index is treated as a label for each of the rows of the dataframe. So each of the numbers 4333, 7000, and so on are the labels for each of the rows of the dataframe. So this is a small example of label-based indexing.
And since our head() function only works with the position-based indexing, it doesn’t matter what label you give to each row, whenever you call the head() function it will always show you the ﬁrst 5 rows or ﬁrst N rows of the Dataframe.
There is also another way to change the index in pandas using the set_index() function. Let’s see how to do that. One of the advantages of the set_index() function is you can pass in the name of the column of the DataFrame and the set_index() function will create a new label-based index using this column value itself.
For Example, from the DataFrame I want to use the Outlet_Establishment_Year column as index to my entire DataFrame. So let’s do it through code.
As predicted the new index of this DataFrame is actually Outlet_Establishmenat_year column and let me also show you what is displayed when we execute data.index
So, this is the second way by which you can change the index of a DataFrame.
Before we proceed further let me explain a couple of few things. If you remember we were passing two parameters here in this data.set_index(‘Outlet_Establishment_Year’,drop=True,inplace=True) function, drop=True and inplace=True. Let’s understand what these two parameters mean. In order to explain these two parameters, I’ll have to read the dataset again so I can show you the diﬀerence and eﬀect of the parameter used.
Firstly, I will replace drop=True with drop=False. Basically, we are deactivating this parameter’s
If you can see the Outlet_Establishment_Year is set to the index but the same column is also present in our dataset too.
So, if we set the parameter drop=True it will drop the column from the dataset after being set as an index for our dataframe; inplace=True means if you set the value as True it will incorporate the changes in the original dataframe, whereas if you set it as False the column will not be dropped from the original dataframe.
Let’s Conclude for today
We covered many topics that will help us for subsetting using label index and modifying data. In the next article, we’ll start with subsetting data based on values, labels, and position in detail with code examples.