Difficulty With Applying DataFrame.replace()
--
A lesson in “read the fine print”!
Last week I chronicled my frustrating journey to add a column to a DataFrame and fill the new column by extracting a float value from a string of text from another column. This week, using the same DataFrame about books from Machine Hack that is 6237 entries with a target variable of ‘Price’ and 8 features, I decided to turn my attention to the ‘Ratings’ column.
Here’s a df.sample(3) of the DataFrame with the added column, ‘fReviews’, from last week:
My plan this week with the ‘Ratings’ column was to extract the number and put it into a new column, ‘iRatings’. As is the case with most of my coding ideas — easier said than done. I encounter two small problems right off: changing data types and using complex regex statements. The ‘Ratings’ column is of type object. Even when I try to change it — and set it equal to itself (an important lesson learned from last week!)-it still doesn't change.
Before:
And after:
Moving on to my other problem…My attempt to extract the number out of the ‘Ratings’ column fails because my regex skills are shaky when there is a comma in the number! Therefore, I’ll remove the commas in the ‘Ratings’ column and make the regex extraction easier.
In order to confirm that commas are successfully replaced, I first create a list of the ‘Ratings’ to see how many entries in the column have a comma in them:
Check to see if there are commas with my comma checker function:
Yes, indeed there are 20 ‘Ratings’ with a comma.
Indulge me a slight sidetrack at this point — use of str.replace() with a text string. The syntax is:
str.replace(old, new[, count])
Onward…the syntax for df.replace() is slightly different with more arguments:
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method=’pad’)
My first, second, and third attempts to use df.replace() aren’t successful. (This is Deja Vu from last week again and Einstein’s Insanity description). Here are my three attempts:
After a bit of searching in Stackoverflow, in small print towards the bottom in pink highlight, is a very helpful clue:
Adding the “.str” is the key! After making the small addition, replacing the commas with no space, creating a new list, and checking for commas finally works:
Now, I can simply extract the number from the ‘Ratings’ column and create a new column, ‘iRatings’.
And create a rudimentary histogram of values:
But using the df.describe() may be more helpful to see the data:
The point of actually seeing the distribution of the number of ratings for each book is not the goal. The goal is to continually improve my skills — mission accomplished!