The devil is in the details
My latest project has been a hackathon challenge from Machine Hack. The first problem in the beginner’s list is to predict the price of a book. There is a training data set with 6,237 rows of 8 book features; the challenge is to build a machine learning model that predicts the price of a book. There is also a testing data set to evaluate the model against. The 8 columns are: Title, Author, Edition, Reviews, Ratings, Synopsis, Genre, and BookCategory. Price is the target variable and the only column with a float data type; all the others are objects.
I started by doing some data exploration using df.describe() and df.info(), and by plotting the distribution of Price.
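A minimal sketch of that first look, using a tiny made-up DataFrame as a stand-in for the real training set (the actual file and its 6,237 rows aren’t reproduced here; the column names come from the post):

```python
import pandas as pd

# Hypothetical stand-in for the Machine Hack training data
df = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "Reviews": ["4.0 out of 5 stars", "3.5 out of 5 stars", "5.0 out of 5 stars"],
    "Price": [220.0, 4500.0, 315.0],
})

df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for numeric columns (here, only Price)
# df["Price"].plot(kind="hist")  # distribution of the target
```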
Not quite what I was expecting, but it’s just a starting point. Less than 1% of the books have a price greater than 4000.
I decided that my first task would be to create a new column, ‘Reviews_float’, and populate it with the number from the ‘Reviews’ column using Regex. Then I would be able to look at what values the reviews were and their distribution.
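The post doesn’t show the exact regex used, but one way to pull the number out of a string like “4.0 out of 5 stars” is pandas’ `Series.str.extract`; the pattern below is an assumption based on that format:

```python
import pandas as pd

# Stand-in for the 'Reviews' column
df = pd.DataFrame({"Reviews": ["4.0 out of 5 stars", "3.5 out of 5 stars"]})

# Capture the leading number; expand=False returns a Series, not a DataFrame
df["Reviews_float"] = df["Reviews"].str.extract(r"(\d+\.\d+)", expand=False)
print(df["Reviews_float"])
```

Note that `str.extract` returns strings, so the new column is still an object column — which is exactly where the trouble below begins.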
A quick check confirms that the new column is there with actual values!
In order to look at the values in the ‘Reviews_float’ column and to do any graphing, the data type needs to be changed from object to float. I’m a little red-faced to admit that I spent a lot of time on this tiny exercise. My many attempts to change the data type were futile. You know that Einstein quote, “Insanity: doing the same thing over and over and expecting different results”? That was exactly what I was doing — the same code over and over, hoping for a different result:
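The original attempt isn’t shown, but the symptom described is the classic trap with `astype()` — a sketch of what that dead end likely looked like:

```python
import pandas as pd

df = pd.DataFrame({"Reviews_float": ["4.0", "3.5"]})

# The trap: astype() returns a NEW Series; without an assignment,
# the converted result is thrown away and the column is untouched
df["Reviews_float"].astype(float)
print(df["Reviews_float"].dtype)  # still object
```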
So, I tried a different method:
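The “different method” isn’t shown either; a plausible candidate is `pd.to_numeric`, which fails the same way when its return value is discarded:

```python
import pandas as pd

df = pd.DataFrame({"Reviews_float": ["4.0", "3.5"]})

# pd.to_numeric also returns a new Series; unassigned, nothing changes
pd.to_numeric(df["Reviews_float"])
print(df["Reviews_float"].dtype)  # still object
```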
No luck, same problem — the data type of Reviews_float was still an object. I googled a lot of different phrases, restarted the notebook, had a snack, watched YouTube, and repeatedly tried the same code. It was not making sense. Until…finally, one more scour of StackOverflow:
I had been forgetting to assign the converted column back to itself; the conversion returns a new Series rather than changing the column in place. Such a small error that wreaked so much havoc.
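The fix, in a sketch: assign the result of the conversion back to the column.

```python
import pandas as pd

df = pd.DataFrame({"Reviews_float": ["4.0", "3.5"]})

# The fix: astype() returns a new Series, so assign it back to the column
df["Reviews_float"] = df["Reviews_float"].astype(float)
print(df["Reviews_float"].dtype)  # float64
```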
Back on track — time to check out what kind of values the 6,236 books had as reviews. Depending on what you’re looking for, there are several ways to look at the values in a column.
Using df.count without the parentheses doesn’t actually call the function; the notebook displays the bound method, whose repr happens to show the data as a series:
Calling df.count(), on the other hand, produces the total number of non-null reviews, but no values from the column:
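A small sketch of the difference, on a stand-in Series (the `None` is there to show that count() skips missing values):

```python
import pandas as pd

s = pd.Series([4.0, 3.5, 5.0, None], name="Reviews_float")

print(s.count)    # the bound method; its repr includes the Series' values
print(s.count())  # actually calling it returns the number of non-null entries
```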
To find out what the values are in the column, use df.unique() for an array of the different values:
While df.unique, again without the parentheses, just displays the bound method, with the whole long list shown as a series in its repr:
But, if you only want to know how many unique values there are, use df.nunique():
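A sketch of unique() versus nunique() on a stand-in Series with one repeated value:

```python
import pandas as pd

s = pd.Series([4.0, 3.5, 4.0, 5.0], name="Reviews_float")

print(s.unique())   # ndarray of the distinct values
print(s.nunique())  # just how many distinct values there are
```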
Now, if you would like every value along with how often it occurs, make use of df.value_counts():
The df.value_counts() function also takes arguments that refine the output, such as bins and ascending:
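A sketch of value_counts() and its bins and ascending arguments, on a stand-in Series:

```python
import pandas as pd

s = pd.Series([4.0, 3.5, 4.0, 5.0, 4.0], name="Reviews_float")

print(s.value_counts())                # each value with its frequency, most common first
print(s.value_counts(ascending=True))  # least common first
print(s.value_counts(bins=3))          # group the values into 3 equal-width intervals
```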
Finally, now that I knew what the different values were, I simply wanted to look at the distribution of the reviews to see if it was a normal one or not:
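The original plot isn’t reproduced; a minimal histogram sketch with pandas and matplotlib (the bin count and stand-in data are assumptions) would look like:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

s = pd.Series([4.0, 3.5, 4.0, 5.0, 2.0], name="Reviews_float")

# One bar per bin; a roughly bell-shaped result would suggest normality
ax = s.plot(kind="hist", bins=5, title="Distribution of Reviews")
ax.set_xlabel("Review score")
plt.savefig("reviews_hist.png")
```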
Once I figured out my rudimentary error, the remainder of the task became an interesting exercise in comparing the many different ways to inspect the values of a column.