Brushing Up on Pandas Groupby
(Didn’t get me to where I was hoping to be though)
For the third week now, I’ve been playing around with a training data set from Machine Hacker. The first week the focus was on creating a new column “fReviews” and working through some hiccups I encountered on that endeavor. Last week the focus was on creating another new column “iRatings”. As a refresh, the initial training set had 6,237 rows with 8 features and the target variable of “Price”. At this point, the data set is still 6,237 rows, but has two additional features:
This week’s plan was to explore the values in ‘BookCategory’ and ‘Genre’, and see how the two columns differ in the information they provide. Let’s peak at ‘Genre’ first; there are 345 different values with a high percentage of the genres having 10 or fewer books in that genre:
Now for a glimpse at ‘BookCategory’; there are 11 different categories of books:
The questions that emerged in my mind were: how many genres are represented in each book category and which genre in each category is represented the most? How shall I go about figuring it out? I recall from my Bootcamp days the Pandas groupby function and decided to brush up on its use to see if it could provide answers to my two questions.
Let’s start with some background information about pandas.DataFrame.groupby from the documentation:
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Make note that creating a groupby object alone will not display a result — although you have split the dataset into groups — the power to display meaningful information comes from chaining an Aggregation Function onto the groupby object. There are 13 such functions; their role is fairly self-explanatory, but for more information, I suggest you investigate here:
Forging ahead to figure out the number of genres represented in each book category brought me to refining the groupby query further:
This line of code gave me what I was looking for — the number of books in each genre in each book category:
However, it would be more useful to get just the maximum value, not the whole list. Just for chuckles, I tried appending .max() to the end and got an answer for the book category with the maximum number of books in one genre as a result:
Okay, it’s a nice result, but not exactly what I was looking for. I’d was hoping for all 11 categories, not just the one with the highest value in the genre.
There is probably a way to figure out the greatest number of books in a genre for each book category, but I ended up going back to basics by creating a subset Data Frame for each book category, getting the value counts based on genres, and slicing the first value:
The next step would be to create a function to repeat the process for the other 10 categories of books. However, this exercise gave me a refresher on the use of the Pandas grouby() operation, and ultimately the answer to my question of finding the genre with the greatest number of books in each category.