Personal highs and lows of the 366 days
Like many people, I love doing a yearly review in late December: taking the time to look back to see what went really well and what was abysmal, and to make plans for the future. Since most of my time was spent finishing a bootcamp program or looking for a job, this post will (mainly) focus on that journey.
I spent the first half of 2020 completing the curriculum of an online Data Science bootcamp which included completing 2 projects and a Capstone. (Three other projects had been completed in the latter half of 2019.)
Project #4 — a market analysis to identify the 5 best zip codes in the United States and evaluate their potential future growth for investment purposes — was completed in late January. The interpretation of “best” was up to each student but had to be backed by data and research. The data set was the biggest one I had worked with so far. It contained 272 columns: 6 columns of identifying features of a zip code, a column for the zip code itself, and the rest were the average monthly house sales over 22 years. Here’s a very small portion of the data:
My methodology for identifying the top 5 zip codes was conducted in 3 steps:
- First, using the provided data set, I calculated the return on investment (ROI) of each zip code in the US from the time period following the 2008 market crash until mid-2018. The zip codes were then ranked according to the ROI in descending order.
- The second step was to conduct market research to identify the real estate markets forecast to have high growth potential. Using the ROI data and market research, I chose to concentrate on the top 11 zip codes in the San Jose, Phoenix, and Las Vegas markets for modeling in step 3.
- The third and final step used statistical analysis on the data to identify trends and then model the data with S/ARIMA to forecast for potential growth in 2019 and 2020.
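For illustration, step 1 can be sketched roughly as follows. This is not the exact notebook code: the column names and prices are made up, and it assumes a wide table with one row per zip code and one column per month.

```python
import pandas as pd

# Hypothetical wide-format data: one row per zip code, one column per month.
# All names and numbers here are invented for illustration.
prices = pd.DataFrame(
    {
        "zipcode": ["94043", "89104", "85001"],
        "2012-01": [600_000, 100_000, 120_000],
        "2018-04": [1_500_000, 220_000, 180_000],
    }
)


def rank_by_roi(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Return zip codes ranked by ROI between two monthly columns, best first."""
    out = df[["zipcode"]].copy()
    # ROI = (ending value - starting value) / starting value
    out["roi"] = (df[end] - df[start]) / df[start]
    return out.sort_values("roi", ascending=False).reset_index(drop=True)


ranked = rank_by_roi(prices, start="2012-01", end="2018-04")
print(ranked)
```

Ranking on a single start and end column keeps the calculation easy to check by hand before moving on to modeling.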
What were the takeaways from this project? I felt really good about the visualizations I created. Prior to modeling, I wanted to see the ROI for the 11 zip codes I had identified and created this graph:
The low point of this project was getting the pandas melt() function to work. It took many, many iterations to figure out and became a real source of pain. There were also several tense moments while trying to get the S/ARIMA model working correctly, but in the end, it all worked well. The result: based on my statistical analysis, data forecasts, and market research, the zip codes in the United States with the highest potential ROI in 2019–2020 would be:
- Mountain View, CA (94043) with 60%
- Sunnyvale, CA (94089) with 54%
- Palo Alto, CA (94301) with 33%
- Palo Alto, CA (94302) with 33%
- Las Vegas, NV (89104) with 31%
- Las Vegas, NV (89107) with 30%
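For anyone who hits the same melt() wall: the function reshapes a wide table (one column per month) into a long one (one row per zip code per month), which is the shape time series tools usually expect. A minimal sketch, assuming made-up column names and values rather than the real data set:

```python
import pandas as pd

# Hypothetical wide table: identifying columns plus one column per month.
wide = pd.DataFrame(
    {
        "zipcode": ["94043", "89104"],
        "city": ["Mountain View", "Las Vegas"],
        "2018-03": [1_480_000, 215_000],
        "2018-04": [1_500_000, 220_000],
    }
)

# id_vars stay as they are; every other column becomes a (month, value) row.
long_df = pd.melt(
    wide,
    id_vars=["zipcode", "city"],
    var_name="month",
    value_name="value",
)

# Turn the month labels into real dates so they can serve as a time index.
long_df["month"] = pd.to_datetime(long_df["month"], format="%Y-%m")
long_df = long_df.sort_values(["zipcode", "month"]).reset_index(drop=True)
print(long_df)
```

The part that is easy to get wrong is id_vars: any column left out of that list is treated as a month and melted along with the prices.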
Project #5 was a “find your own data set” project. I chose to work with a medical data set to predict cardiovascular heart disease in patients based on 11 features. Four features were objective, four were based on examinations, and the last three were subjective. The data was sourced from Kaggle and had 70,000 patient records:
I ended up engineering a 12th feature, Body Mass Index, from ‘height’ and ‘weight’. Cleaning was straightforward and I was pleased with the visualizations:
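The Body Mass Index feature mentioned above is simple to compute: weight in kilograms divided by height in meters squared. A minimal sketch, assuming ‘height’ is recorded in centimeters and ‘weight’ in kilograms (the patient rows below are made up):

```python
import pandas as pd

# Made-up patient rows; units are an assumption (height in cm, weight in kg).
patients = pd.DataFrame({"height": [168, 156, 180], "weight": [62.0, 85.0, 81.0]})


def add_bmi(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'bmi' column: weight (kg) / height (m) squared."""
    out = df.copy()
    height_m = out["height"] / 100  # centimeters to meters
    out["bmi"] = out["weight"] / height_m**2
    return out


patients_with_bmi = add_bmi(patients)
print(patients_with_bmi)
```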
The challenge was to use and compare different classification models to analyze the data in order to predict the presence or absence of cardiovascular heart disease using the twelve features. The models I used were:
- Logistic Regression
- Random Forest
- K-Nearest Neighbor (KNN)
- Support Vector Machine (SVM)
In the second stage of my project, I applied Principal Component Analysis (PCA). The optimal n_components turned out to be 10; for additional insight, I also tried n_components equal to 2. The models in the second stage of classification were:
1. SVM
2. SVM with n_components = 2 and n_components = 10
3. Logistic Regression
4. Logistic Regression with n_components = 2 and n_components = 10
5. KNN with n_components = 2 and n_components = 10
Based on the accuracy and precision metrics, the SVM model was the best at predicting whether a patient suffers from cardiovascular heart disease. Interestingly, though, the gap between SVM and the second-place model, logistic regression, was only a fraction of a percentage point. The SVM models (straight SVM and 2 SVM models using PCA) took 30 minutes, 18 minutes (n_components = 10), and 10 minutes (n_components = 2) to compute, while the logistic regression models took less than 1 second.
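The comparison above can be sketched with scikit-learn pipelines. This is an illustration on synthetic data, not the original notebook: the data set, sample size, and hyperparameters are all assumptions, so the accuracies and timings will not match the numbers quoted above.

```python
import time

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the patient data: 12 features, binary target.
X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Factories so that every run starts from a fresh, unfitted model.
models = {
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "knn": lambda: KNeighborsClassifier(n_neighbors=5),
    "svm": lambda: SVC(kernel="rbf"),
}


def evaluate(model, n_components=None):
    """Fit scaler -> optional PCA -> model; return (test accuracy, fit seconds)."""
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components))
    pipeline = make_pipeline(*steps, model)
    start = time.perf_counter()
    pipeline.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return pipeline.score(X_test, y_test), elapsed


results = {}
for name, make_model in models.items():
    for n_components in (None, 10, 2):
        results[(name, n_components)] = evaluate(make_model(), n_components)

for (name, n_components), (accuracy, seconds) in results.items():
    print(f"{name:20s} PCA={n_components}  accuracy={accuracy:.3f}  fit={seconds:.3f}s")
```

Putting the scaler and PCA inside the pipeline means they are fitted on the training split only, which keeps the test accuracy honest.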
The high on this project was the joy of working with a data set I had chosen and really, really liked. The low? Not much on this project besides not being able to figure out how to get XGBoost to model the data (more on this later). Every time I ran the notebook it crashed. A few months later I figured out the error, which gave me a belated feeling of accomplishment.
Right about the time the world started shutting down because of the pandemic, I started work on my Capstone project. The timing was perfect; I wasn’t able to create excuses for not working on it because there was nowhere to go and no one to visit.
The goal of the capstone was to develop a data analysis on a topic of my choosing. The data though had to be independently sourced — no curated data sets off of Kaggle. Sometimes too many options can be a burden and this was definitely the case with me. Finally, though, I settled on studying coffee and figuring out the features that make a perfect cup.
There were several highlights for me on the Capstone. First, I faced and conquered my dread of performing web scraping. On my very first attempt at scraping a coffee website for information using Beautiful Soup, I ran into a forbidden error code (error 403). Rough start! Luckily I quickly found a workaround and was on my way to collecting over 5,500 coffee URLs.
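A 403 on a first scraping attempt is often caused by the default user agent that Python sends, and a common workaround is to send a browser-like User-Agent header. That may or may not be the workaround used here. A minimal sketch, in which the header value, the "/review/" marker, and the sample HTML are all hypothetical:

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Hypothetical header value, for illustration only.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; coffee-research-script)"}


def fetch_html(url: str) -> str:
    """Download a page while sending an explicit User-Agent header."""
    request = Request(url, headers=HEADERS)
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_review_links(html: str, marker: str = "/review/") -> list:
    """Collect unique hrefs containing `marker`, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if marker in href and href not in links:
            links.append(href)
    return links


# Static sample so the parsing step can be tried without touching a real site.
sample_html = """
<html><body>
  <a href="/review/ethiopia-natural">Ethiopia Natural</a>
  <a href="/about">About</a>
  <a href="/review/kenya-aa">Kenya AA</a>
  <a href="/review/ethiopia-natural">Ethiopia Natural (again)</a>
</body></html>
"""
print(extract_review_links(sample_html))
```

Before pointing fetch_html() at a real site, it is worth checking its robots.txt and terms of use, and pausing between requests.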
Then I used the list of URLs to scrape the website for the data of each coffee review. The pre-cleaned data wasn’t pretty, but I had a unique data set!
The second highlight of the Capstone was my success with regex to clean the data since I didn’t have much experience using it. The easiest column to clean with regex was the ‘Agtron’ column. What is an agtron and what does the ‘Agtron’ number represent? An agtron is a machine that reflects light on a sample of coffee to objectively assign a number to the bean’s roast color. It is a precise measure of the degree of roast. The smaller the number, the darker the roast. Each coffee had two readings: the whole beans before grinding (the number preceding the slash) and the same beans after grinding (the number after the slash). For example, a reading of 58/76 (first row above, 4th column) would describe a coffee with a whole-bean reading of 58 and a ground reading of 76.
My regex code to extract the two integers from the text string:
Next I assigned the integers to two new columns (the far right):
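For illustration, the extraction and the two new columns can be sketched like this. It is not the exact notebook code: the new column names and the sample rows are made up, and it assumes the raw ‘Agtron’ strings contain two integers separated by a slash.

```python
import pandas as pd

# Made-up raw values, including rows with no usable reading.
reviews = pd.DataFrame({"Agtron": ["58/76", "Agtron: 45/63", "NA/NA", None]})

# Two named capture groups: whole-bean reading, slash, ground reading.
pattern = r"(?P<agtron_whole>\d+)\s*/\s*(?P<agtron_ground>\d+)"

# str.extract returns one column per named group; non-matching rows become NaN.
readings = reviews["Agtron"].str.extract(pattern)

# Convert the captured text to nullable integers so missing readings stay missing.
readings = readings.apply(pd.to_numeric).astype("Int64")

reviews = reviews.join(readings)
print(reviews)
```

The nullable "Int64" dtype keeps the readings as integers even when some rows have no match, instead of silently turning the whole column into floats.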
And finally, my third thrill from this project was using GeoPandas to create a beautiful world map indicating the number of coffee growers in each country.
I started with the basic, distorted world map:
Then I created a dictionary from my cleaned data set. The key was the country name and the value was the number of growers in the country.
Next, I added this dictionary to the world data:
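A minimal sketch of the dictionary step, using a plain pandas DataFrame as a stand-in for the GeoPandas world table (which would also carry a geometry column). The column names, the sample rows, and the idea of counting one grower per review row are all assumptions:

```python
import pandas as pd

# Made-up cleaned data: one row per coffee, with its country of origin.
coffees = pd.DataFrame(
    {"country": ["Ethiopia", "Kenya", "Ethiopia", "Colombia", "Ethiopia"]}
)

# Plain DataFrame standing in for the world table.
world = pd.DataFrame({"name": ["Ethiopia", "Kenya", "Colombia", "Canada"]})

# Key: country name. Value: number of rows (growers) for that country.
growers_by_country = coffees["country"].value_counts().to_dict()

# Look each country up in the dictionary; countries without growers get 0.
world["growers"] = world["name"].map(growers_by_country).fillna(0).astype(int)
print(world)
```

Country names have to match exactly between the two tables, so it helps to check which rows come back as 0 before plotting.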
The data to be plotted on the world map:
I played with the plot to fix the world distortion, changed the colors, and added a color key and two notes to the map (lower right-hand corner). Yes, I removed Arctica and Antarctica to make the map more visually pleasing; there aren’t any coffee growers in those regions to represent, so it felt like a safe decision. My code:
My final map!
The Capstone goal was to use classification modeling to determine what features make a great coffee. I used the following different models:
- Logistic Regression
- Random Forest
- Random Forest with SMOTE
- Random Forest with GridSearch
- XGBoost
The most accurate model was XGBoost. A classification report (including precision, recall, and F1 score), a confusion matrix, and a plot of feature importances were produced for each model. The XGBoost model for a GREAT cup of coffee had the highest training and testing accuracy: 99.76% and 98.18%, respectively. Based on my data, the important features to consider when purchasing a GREAT cup of coffee are (interestingly, price and country were not at the top of the list):
6. the price per ounce
I learned during this modeling that XGBoost does not require scaled testing and training data (which is why everything was crashing in project #5).
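The evaluation outputs described above (classification report, confusion matrix, feature importances) can be sketched as follows. To keep the dependencies small, this uses a scikit-learn random forest as a stand-in rather than XGBoost, and synthetic data rather than the coffee data set. Like XGBoost, a random forest splits on feature thresholds, so the sketch fits on unscaled features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
feature_names = [f"feature_{i}" for i in range(8)]
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=0
)
X = pd.DataFrame(X, columns=feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Tree-based model fitted directly on unscaled features.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Precision, recall, and F1 score per class.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

# Importances sum to 1; sorting puts the most influential features first.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```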
My (very) low point on this capstone? The dismal review of the project I received from a professional Data Scientist who administers mock technical interviews.
The second half of 2020 was spent looking for a job where I can use my shiny new data skills. I reached out to over 275 (and counting) people. As an introvert, this is a win for me. I also made 168 GitHub commits and wrote 26 blog posts, including this one. I haven’t found a job yet, which is the low of the year (thanks, coronavirus and worldwide shutdown!), but I’m very hopeful for next year.
And on a personal level, my family and I stayed free of the virus — a BIG highlight. Two other highlights were physical in nature. The first: a self-designed challenge to do 100 burpees each day in November. Three thousand burpees for fun. Why? Why not. Challenge accepted and completed.
The second: another self-designed challenge in honor of the Kelly Brush Foundation virtual ride. Starting at 6 am on a beautiful Saturday in September, I completed 18 consecutive classes on my Peloton bike, 220 minutes in total, with 18 different instructors.
It turns out that data from the Peloton app can be downloaded, which is just what I did a few days ago. The data is in a CSV file which I loaded into my Jupyter notebook and started cleaning and playing with:
And after some cleaning, I identified the 18 classes I rode on September 19 using Python and Pandas:
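A minimal sketch of that filtering step. The column names and rows below are assumptions made up for illustration, so they may not match the real export exactly:

```python
import io

import pandas as pd

# Tiny made-up stand-in for the downloaded CSV; column names are assumptions.
csv_text = """Workout Timestamp,Instructor Name,Length (minutes),Fitness Discipline
2020-09-18 07:00,Instructor A,30,Cycling
2020-09-19 06:00,Instructor B,10,Cycling
2020-09-19 06:12,Instructor C,15,Cycling
2020-09-19 09:30,Instructor D,20,Strength
2020-09-20 08:00,Instructor E,45,Cycling
"""

workouts = pd.read_csv(io.StringIO(csv_text), parse_dates=["Workout Timestamp"])

# normalize() drops the time of day so each timestamp can be compared to a date.
is_ride_day = workouts["Workout Timestamp"].dt.normalize() == pd.Timestamp("2020-09-19")
is_cycling = workouts["Fitness Discipline"] == "Cycling"

challenge_rides = workouts[is_ride_day & is_cycling]
print(challenge_rides)
print("Total minutes:", challenge_rides["Length (minutes)"].sum())
```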
To keep my learning going, I decided to use different methods to clean the data. To eliminate unwanted columns, there are two choices — the del keyword or the drop() function. Here’s how I used del to eliminate a column:
The del keyword operates on columns only, one at a time, and is an in-place operation. On the other hand, the drop() function operates on rows and columns, more than one at a time, and can operate in-place or return a copy. Here I used the drop() function with iloc to eliminate a column:
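A minimal sketch of both approaches on a small made-up DataFrame (the column names are hypothetical, not the ones in the Peloton export):

```python
import pandas as pd

rides = pd.DataFrame(
    {
        "instructor": ["A", "B"],
        "minutes": [10, 15],
        "notes": ["", ""],
        "unused": [None, None],
    }
)

# del removes exactly one column and changes the DataFrame in place.
del rides["unused"]

# iloc selects columns by position; drop() then removes them by label.
# By default drop() returns a new DataFrame and leaves the original untouched.
columns_to_drop = rides.iloc[:, [2]].columns
trimmed = rides.drop(columns=columns_to_drop)

print(list(rides.columns))
print(list(trimmed.columns))
```

Passing several positions to iloc (for example [1, 2]) would drop more than one column in a single call, which del cannot do.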
Finally — what are my future plans for 2021? Pretty short and sweet for now:
- Continue to be the gatekeeper to my happiness
- Find a meaningful job in Data
- Get vaccinated!