Decision tree models for investigating local economy in the COVID-19 pandemic (or, the development of Brookline, MA local business in the COVID-19 pandemic, part II)

Norfolk Project
Mar 6, 2022

In our previous article, we explored data collected from our survey of Brookline, MA local businesses, alongside a large body of existing literature, by investigating associations between the businesses’ responses. Although the survey data is roughly linear, it is categorical, so we cannot operate on it arithmetically. This rules out linear regression and other statistical methods that apply functions to the underlying values. However, we can split the data in a clever way.

This is a continuation of The development of Brookline, MA local business in the COVID-19 pandemic. The article is available here.

Authored by: Diego Luca Gonzalez Gauss, Benjamin Vyshedskiy, Robert Rogers, and Matvey Borodin.

All decision tree models were made and evaluated by Diego Luca Gonzalez Gauss.

One simple way to split the data is with decision tree classifiers. Before running any statistical tests, we impute missing (NaN) values with the mean when the feature is roughly linear, and drop the data otherwise. Our group is very interested in the relationship between local businesses and the federal government, and in how local businesses perceive government intervention during the COVID-19 pandemic. So, for our initial group of trees, we chose two prediction targets: whether a business received government loans, and how well the business performed.

An example of a decision tree predicting if Brookline businesses received government loans.
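The survey data and the original code are not reproduced in this article, so the following is only a minimal sketch of the preprocessing described above, assuming pandas and scikit-learn. The column names and values are hypothetical placeholders, and one-hot encoding the non-linear column is an assumption rather than a documented step.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the survey: names and values are invented
# for illustration and are NOT the real responses.
survey = pd.DataFrame({
    "demand": [1, 2, 3, np.nan, 5, 3, 2, 4],              # roughly linear 1-5 scale
    "avg_customer_spend": [2, np.nan, 3, 5, 4, 1, 2, 5],  # roughly linear 1-5 scale
    "business_type": ["food", "retail", None, "food",
                      "retail", "food", "retail", "food"],  # not linear
    "received_loan": [0, 1, 0, 1, 1, 0, 0, 1],
})

linear_cols = ["demand", "avg_customer_spend"]

# A mean is not meaningful for a non-linear answer, so drop those rows.
cleaned = survey.dropna(subset=["business_type"]).copy()

# Fill missing answers on the roughly linear scales with the column mean.
cleaned[linear_cols] = cleaned[linear_cols].fillna(cleaned[linear_cols].mean())

# One-hot encode the remaining categorical column (an assumption) so the
# tree can split on it.
X = pd.get_dummies(
    cleaned.drop(columns="received_loan"),
    columns=["business_type"],
    dtype=float,
)
y = cleaned["received_loan"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
```

With real data, the same two-step rule (impute where a mean makes sense, drop otherwise) would be applied column by column before any tree is fitted.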

With the preprocessed and cleaned dataset, we simulated 120 decision trees for each of the two targets. When predicting whether a business received government loans, our trees had an average accuracy of 84.3%, but they were limited in their ability to identify businesses that did not receive government loans. By comparison, when predicting whether or not employees thought their business had performed well in the pandemic, our trees had an average accuracy of 71.3%.

The cumulative confusion matrices for our simulated decision trees when predicting whether a business received government loans (left) and their performance (right). 0 denotes an actual value of no, and 1 an actual value of yes.
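One way to reproduce this kind of experiment is sketched below with scikit-learn. It runs on synthetic placeholder data, and the resampling scheme (a fresh random 75/25 train/test split per tree) is an assumption, since the article does not state how each tree was simulated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data; the feature count is arbitrary.
X, y = make_classification(
    n_samples=200, n_features=16, n_informative=5, random_state=0
)

N_TREES = 120
accuracies = []
cumulative_cm = np.zeros((2, 2), dtype=int)  # rows: actual, columns: predicted

for seed in range(N_TREES):
    # Assumed scheme: each tree sees its own random train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))
    cumulative_cm += confusion_matrix(y_test, y_pred, labels=[0, 1])

mean_accuracy = float(np.mean(accuracies))
# Share of actual "no" cases that the trees labelled correctly.
recall_no = cumulative_cm[0, 0] / cumulative_cm[0].sum()

print(f"mean accuracy over {N_TREES} trees: {mean_accuracy:.3f}")
print(f"recall for class 0: {recall_no:.3f}")
print(cumulative_cm)
```

Summing the per-tree matrices makes a weakness on one class visible even when the average accuracy looks high, which is the pattern described for businesses that did not receive loans.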

To avoid overfitting, a significant concern given the size of our dataset and the results of our decision trees, we can build a random forest model from the existing trees. Using the previously simulated decision trees that predict whether or not a business received government loans, the random forest yielded a much higher accuracy of 93.4%, outperforming classical methods. The forest of trees predicting business performance showed a similar improvement, with an average accuracy of 81.0%.

Our features’ importance in predicting whether a business received government loans (left) and their performance (right), measured using mean decrease in impurity (MDI) from our forest.

An important question posed in our earlier article asks how important our features were in investigating the nature of Brookline business. With our forest, we can estimate how important each feature is by quantifying how much it contributed to the progression of the decision trees and how frequently it appeared in tree nodes. Measuring the mean decrease in impurity (MDI) of the business features across our simulated decision trees, we learned that the demand of businesses (“feature 4”) and the attitude while in the workplace (“feature 15”) are the most informative (0.13 and 0.11 MDI, respectively) for predicting whether or not a business received government loans. The average amount spent by customers (“feature 3”) was the most informative feature (0.16 MDI) for predicting the performance of businesses, by a large margin.
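For readers who want to try an MDI ranking themselves, here is a hedged sketch using scikit-learn's `RandomForestClassifier`. Note the swap: this class trains its own bootstrapped trees rather than reusing previously fitted ones, so it approximates the procedure described above rather than replicating it. The data is synthetic and the feature indices carry no meaning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; the feature count is arbitrary.
X, y = make_classification(
    n_samples=200, n_features=16, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 120 trees mirrors the number of simulated trees, but these are fitted
# fresh on bootstrap samples rather than reused.
forest = RandomForestClassifier(n_estimators=120, random_state=0)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)

# feature_importances_ is the mean decrease in impurity (MDI), averaged
# over the trees and normalised to sum to 1.
mdi = forest.feature_importances_
ranking = np.argsort(mdi)[::-1]

print(f"forest test accuracy: {test_accuracy:.3f}")
for idx in ranking[:3]:
    print(f"feature {idx}: MDI = {mdi[idx]:.3f}")
```

MDI is computed from the training data, so it can overstate features with many distinct values; comparing it against a second importance measure is a reasonable sanity check.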

The vast majority (94%) of root nodes split on the demand of the business when predicting whether a business received government loans, while every root node of the trees predicting business performance split on either the average amount spent by customers or employee experience. This aligns with our findings in the previous article, which highlighted the positive relationships between average customer spending and business performance, and between business demand and workplace attitude, as some of the strongest pairwise associations in our data.
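Root-node statistics like these can be read directly from fitted scikit-learn trees, where `tree_.feature[0]` holds the index of the feature used at the root. The sketch below uses synthetic data and assumes, as a placeholder, that each tree is fitted on its own random training split.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data; the feature count is arbitrary.
X, y = make_classification(
    n_samples=200, n_features=16, n_informative=5, random_state=0
)

N_TREES = 120
root_counts = Counter()

for seed in range(N_TREES):
    # Assumed scheme: a different random training subset per tree.
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    # Node 0 is the root; the value is -2 if the root is a leaf (no split).
    root_counts[int(tree.tree_.feature[0])] += 1

# Fraction of trees whose first split uses each feature.
root_shares = {f: c / N_TREES for f, c in root_counts.most_common()}

for feature, share in root_shares.items():
    print(f"feature {feature}: root of {share:.0%} of trees")
```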

An example of a gradient-boosted tree predicting the performance of Brookline businesses, with a maximum depth of 2, after being given a few stumps.

To corroborate the feature importance estimated by impurity decrease, we also use a sufficient number of decision stumps as weak learners in a gradient-boosted model. Considering the size of our data, we can assume that the model will always reach perfect accuracy in predicting whether a business has received government loans and its overall performance. Still, the gradient-boosted model lets us determine feature importance when predicting factors important to the Brookline economy, such as business performance, with a more developed model and without relying on impurity decrease or other statistics. The gradient-boosted trees loosely follow the statistics of the trees simulated previously, with features such as demand, customer spending, and workplace attitude still relevant to splitting the data. Root nodes were less decisive than with simple decision trees: 77% of root nodes split on the demand of the business when predicting whether a business received government loans, and 82% of root nodes split on the average amount spent by customers (54%) or employee experience (28%) when predicting business performance.
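A gradient-boosted check along these lines could look like the sketch below, which uses scikit-learn's `GradientBoostingClassifier` on synthetic data. The hyperparameters are assumptions: `max_depth=2` follows the figure caption (a strict stump would be `max_depth=1`), and permutation importance is one possible importance measure that does not rely on impurity decrease, not necessarily the one used for the figures quoted above.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; the feature count is arbitrary.
X, y = make_classification(
    n_samples=200, n_features=16, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Shallow trees as weak learners; the values here are illustrative.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
gb.fit(X_train, y_train)

# For binary targets, estimators_ has shape (n_estimators, 1); each entry
# is a regression tree whose root feature index is tree_.feature[0].
root_counts = Counter(int(stage[0].tree_.feature[0]) for stage in gb.estimators_)
root_shares = {f: c / len(gb.estimators_) for f, c in root_counts.most_common()}

# Importance measured by shuffling one feature at a time on held-out data.
perm = permutation_importance(gb, X_test, y_test, n_repeats=10, random_state=0)

print("root-node shares:", root_shares)
print("most important feature by permutation:", int(perm.importances_mean.argmax()))
```

Comparing the root-node shares with the permutation ranking gives two partly independent views of which features drive the boosted model.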

As in our previous article, qualitative inferences about business operations are important to understanding how local business employees feel, think, and relate to the Brookline economic landscape. Using simple decision tree methods, we show that our qualitative inferences can also give insight into quantitative measurements of Brookline’s economy, and can do so more acutely than government statistics.

Norfolk Project

Norfolk Project is a student research group using data science & abridged fields to try and solve real-world problems. Contact us at: contact@norfolkproject.com