Difficulty With Applying DataFrame.replace()

A lesson in “read the fine print”!

Last week I chronicled my frustrating journey to add a column to a DataFrame and fill the new column by extracting a float value from a string of text from another column. This week, using the same DataFrame about books from Machine Hack that is 6237 entries with a target variable of ‘Price’ and 8 features, I decided to turn my attention to the ‘Ratings’ column.

Here’s a df.sample(3) of the DataFrame with the added column, ‘fReviews’, from last week:

Sample of 3 Entries from the Books DataFrame

My plan this week with the ‘Ratings’ column was to extract the number and put it into a new column, ‘iRatings’. As is the case with most of my coding ideas — easier said than done. I encounter two small problems right off: changing data types and using complex regex statements. The ‘Ratings’ column is of type object. Even when I try to change it — and set it equal to itself (an important lesson learned from last week!)-it still doesn't change.

Before:

Data Type of “Ratings”

And after:

Data Type of “Ratings” Unchanged

Moving on to my other problem…My attempt to extract the number out of the ‘Ratings’ column fails because my regex skills are shaky when there is a comma in the number! Therefore, I’ll remove the commas in the ‘Ratings’ column and make the regex extraction easier.

In order to confirm that commas are successfully replaced, I first create a list of the ‘Ratings’ to see how many entries in the column have a comma in them:

Create a List of ‘Ratings’

Check to see if there are commas with my comma checker function:

Check for Commas in ‘Ratings’

Yes, indeed there are 20 ‘Ratings’ with a comma.

Indulge me a slight sidetrack at this point — use of str.replace() with a text string. The syntax is:

Examples of str.replace()

Onward…the syntax for df.replace() is slightly different with more arguments:

My first, second, and third attempts to use df.replace() aren’t successful. (This is Deja Vu from last week again and Einstein’s Insanity description). Here are my three attempts:

First Attempt is Unsuccessful
Second Attempt, Slight Change to Syntax, is Not Successful Either
Third Attempt, with an EXACT string, is Not Successful Either

After a bit of searching in Stackoverflow, in small print towards the bottom in pink highlight, is a very helpful clue:

Adding the “.str” is the key! After making the small addition, replacing the commas with no space, creating a new list, and checking for commas finally works:

Success! No Commas in ‘Ratings’ Column

Now, I can simply extract the number from the ‘Ratings’ column and create a new column, ‘iRatings’.

And create a rudimentary histogram of values:

But using the df.describe() may be more helpful to see the data:

Viewing the Values in ‘iRatings’

The point of actually seeing the distribution of the number of ratings for each book is not the goal. The goal is to continually improve my skills — mission accomplished!