Revisiting My First Data Science Project: King County House Prices

One year later, would I be able to clean the data better?

Just about a year ago to the day, I started on a formal data science path in the Flatiron School's online, part-time Data Science program. The first module of the curriculum covered the basics: learning to use GitHub and terminal commands, coding in Python and using its many libraries, cleaning data, designing beautiful visualizations and graphs, and creating linear and multiple linear regression models. It was a lot to digest; one instructor described it as drinking from a fire hose! At the end of those first four weeks, our first data science project was assigned. The data set was King County House Sales, which has 21 columns and over 20,000 rows. It was quite an overwhelming experience for a newbie.

My Python skills have come quite a long way, and I thought it would be a fun, or at least interesting, endeavor to revisit the data. Since so much of data science hinges on competent and thorough cleaning, that is where I started my redo. My personal challenge was to use different coding methods to measure how far I have progressed. This post will focus on the cleaning of two columns that caused me angst the first time around: square footage of the basement (sqft_basement) and the year a house was renovated (yr_renovated).

The data set has three columns that quantify the size of each house: sqft_living, sqft_above, and sqft_basement. The total living space of a house should equal the above-ground living space plus the basement living space. I verified this by creating a new column, calc_basement, and populating it with the result of subtracting sqft_above from sqft_living.
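As a minimal sketch of that step (assuming the data has been loaded into a pandas DataFrame named df from the usual kc_house_data.csv file; the file name and variable names are my assumptions, not the original notebook):

```python
import pandas as pd

# Load the King County data set (file name assumed)
df = pd.read_csv('kc_house_data.csv')

# Basement size implied by the other two size columns
df['calc_basement'] = df['sqft_living'] - df['sqft_above']

# Spot-check: wherever sqft_basement holds a real number, it should agree
known = df.loc[df['sqft_basement'] != '?']
mismatches = (known['sqft_basement'].astype(float) != known['calc_basement']).sum()
print(f'{mismatches} rows disagree with the recorded basement size')
```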

The reason I created the new column, calc_basement, was that 454 values in sqft_basement were recorded as “?”. There were no null values. By calculating the basement size from the other two columns, the 454 rows with a question mark did not have to be dropped.
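Counting those placeholders directly (same hypothetical df as above) is enough to see the scale of the problem:

```python
# Count the placeholder values instead of dropping those rows
print((df['sqft_basement'] == '?').sum())    # 454 in this data set
```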

Next, I wanted to try out two methods described in an article by DataQuest, using NumPy's where() and select() functions to create new columns that showcase the information.

Using the where() function, a new column, has_basement, was created to indicate whether or not a house has a basement. If the value of calc_basement is zero, False is assigned to has_basement; otherwise, it is True.
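A sketch of that condition, with the same assumed column names:

```python
import numpy as np

# False when the calculated basement size is zero, True otherwise
df['has_basement'] = np.where(df['calc_basement'] == 0, False, True)
```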

The second method, using select(), was to create a list of conditions and assign each basement to a level depending on its size.

If a basement had a calculated size of less than 300 square feet, it was assigned to level_1 in a new column called basement_size. A basement between 300 and 600 square feet was assigned to level_2, and so on for the other three levels.
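Something along these lines; only the first two boundaries appear in the write-up, so the remaining cut-offs below are illustrative assumptions:

```python
import numpy as np

# np.select() evaluates the conditions in order and returns the matching choice
conditions = [
    df['calc_basement'] < 300,
    df['calc_basement'] < 600,
    df['calc_basement'] < 900,
    df['calc_basement'] < 1200,
    df['calc_basement'] >= 1200,
]
choices = ['level_1', 'level_2', 'level_3', 'level_4', 'level_5']
df['basement_size'] = np.select(conditions, choices)
```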

In my data cleaning last year, I couldn’t figure out an elegant method to deal with unknown values, nor did I know about NumPy's built-in where() and select() functions. I didn’t know what I didn’t know.

In the ‘yr_renovated’ column there were 3,842 null values out of 21,597 total, which is nearly 18% of the column and a great deal of data to lose. Additionally, over 17,000 homes had not had a renovation, while 744 had. In my cleaning session last year, I dropped the entire column; this year would be different!
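A quick look at those counts (again assuming the same df):

```python
# How much of the column is missing, and how many homes were renovated?
print(df['yr_renovated'].isna().sum())    # 3,842 null values
print((df['yr_renovated'] > 0).sum())     # 744 homes with a recorded renovation year
```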

My plan was to create a new column called ‘renovation’, fill the null values (using “999” as a stand-in for a renovation) in the same renovation/no-renovation ratio found in the rest of the data set, and then convert the column to True and False. I made use of a NumPy built-in function again, this time random.choice(), to assign the null values in the correct proportions.
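A rough sketch of that idea, assuming the renovated/not-renovated ratio is estimated from the non-null rows (the variable names here are my own):

```python
import numpy as np

# Proportion of renovated homes among the rows where yr_renovated is known
known = df['yr_renovated'].dropna()
p_renovated = (known > 0).mean()

# Fill each null with 999 ("renovated") or 0 ("not renovated"),
# drawn at random in the same ratio as the known rows
mask = df['yr_renovated'].isna()
df.loc[mask, 'yr_renovated'] = np.random.choice(
    [999, 0], size=mask.sum(), p=[p_renovated, 1 - p_renovated]
)

# Convert to a Boolean column: any non-zero value counts as a renovation
df['renovation'] = np.where(df['yr_renovated'] > 0, True, False)
```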

The result was a Boolean ‘renovation’ column, with the formerly null rows split between True and False in the same proportion as the rest of the data.

These new strategies allowed me to create a cleaner, more accurate data set than last year.