Difficulty With Applying DataFrame.replace()

4 min readDec 16, 2020

--

A lesson in “read the fine print”!

Last week I chronicled my frustrating journey to add a column to a DataFrame and fill the new column by extracting a float value from a string of text from another column. This week, using the same DataFrame about books from Machine Hack that is 6237 entries with a target variable of ‘Price’ and 8 features, I decided to turn my attention to the ‘Ratings’ column.

Here’s a df.sample(3) of the DataFrame with the added column, ‘fReviews’, from last week:

Sample of 3 Entries from the Books DataFrame

My plan this week with the ‘Ratings’ column was to extract the number and put it into a new column, ‘iRatings’. As is the case with most of my coding ideas — easier said than done. I encounter two small problems right off: changing data types and using complex regex statements. The ‘Ratings’ column is of type object. Even when I try to change it — and set it equal to itself (an important lesson learned from last week!)-it still doesn't change.

Before:

Data Type of “Ratings”

And after:

Moving on to my other problem…My attempt to extract the number out of the ‘Ratings’ column fails because my regex skills are shaky when there is a comma in the number! Therefore, I’ll remove the commas in the ‘Ratings’ column and make the regex extraction easier.

In order to confirm that commas are successfully replaced, I first create a list of the ‘Ratings’ to see how many entries in the column have a comma in them:

Create a List of ‘Ratings’

Check to see if there are commas with my comma checker function:

Yes, indeed there are 20 ‘Ratings’ with a comma.

Indulge me a slight sidetrack at this point — use of str.replace() with a text string. The syntax is:

str.replace(old, new[, count])

Onward…the syntax for df.replace() is slightly different with more arguments:

DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method=’pad’)

My first, second, and third attempts to use df.replace() aren’t successful. (This is Deja Vu from last week again and Einstein’s Insanity description). Here are my three attempts:

Second Attempt, Slight Change to Syntax, is Not Successful Either

Third Attempt, with an EXACT string, is Not Successful Either

After a bit of searching in Stackoverflow, in small print towards the bottom in pink highlight, is a very helpful clue:

Adding the “.str” is the key! After making the small addition, replacing the commas with no space, creating a new list, and checking for commas finally works:

Now, I can simply extract the number from the ‘Ratings’ column and create a new column, ‘iRatings’.

And create a rudimentary histogram of values:

But using the df.describe() may be more helpful to see the data:

The point of actually seeing the distribution of the number of ratings for each book is not the goal. The goal is to continually improve my skills — mission accomplished!

Difficulty With Applying DataFrame.replace()

Written by Annika Noren