
Restaurant Feature and Location Optimization

This was part of a group project for Columbia's Business Analytics class, IEOR 4650.
The code can be accessed on GitHub.


Opening a restaurant is an expensive undertaking that requires a large investment and usually carries high operating costs, which can only be offset by even higher sales. With thousands of possible locations, it is often difficult to predict whether a restaurant will succeed before it opens. This project identifies the ideal location and features for a new restaurant to be successful.


Yelp publishes user reviews and recommendations for businesses, including restaurants and bars, in the US. It publicly provides this data for 209,393 businesses, including 8,021,122 reviews. The state with the most registered businesses was Arizona, with 60,803, of which 14,335 were restaurants, cafes and bars. The 1,442,385 corresponding reviews were identified and analyzed for sentiment through natural language processing. The review sentiments were weighted by each review's number of upvotes and averaged to give an average review sentiment per business. Background demographic data on Arizona cities was added to each business based on its location. The attributes of each business were extracted from the main dataset, and three models were created: one to determine whether the restaurant will remain open, one to identify the ideal city, and one to predict the star rating the restaurant will have and suggest improvements.

Data Retrieval and Cleaning

Yelp provided 5 datasets: businesses, reviews, customer tips, checkin times, users. The business dataset was the first to be processed, and then sentiment analysis was run on the reviews and customer tips datasets.

The business dataset provided each business’s name, location and various categories and attributes, including its star rating. By state, Arizona had the largest number of businesses (60,803), followed by Nevada (39,084) and Ohio (16,392); the average was 4,730 businesses per state. Only Arizona data was used, to avoid bias from differing user preferences between states.

The category and attribute data was presented as a list for each business. The categories and attributes were extracted, with each unique category or attribute becoming a dummy variable for the businesses. There were a total of 1,242 unique categories (such as “Food” or “Doctors”) and 94 attributes (such as “BestNight” or “NoiseLevel”). To filter the businesses, only those with the categories “Restaurant,” “Food,” “Bars,” or “Cafe” were kept, reducing the total from 60,803 to 14,335. Further, only categories or attributes present in at least 1% of the sample were kept, leaving 57 categories and 56 attributes. Each of the 14,335 businesses was now represented by a business_id and 118 other numerical and binary variables.
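The encoding and 1% filter described above can be sketched as follows. This is a minimal illustration in plain Python rather than the project's own code; the function name `encode_labels`, the `min_share` parameter and the sample records are assumptions made for the example.

```python
from collections import Counter


def encode_labels(businesses, min_share=0.01):
    """Turn each business's label list into 0/1 dummy variables.

    `businesses` maps a business_id to its list of categories or
    attributes. Labels present in fewer than `min_share` of the
    businesses are dropped, mirroring the 1% rule in the text.
    """
    n = len(businesses)
    # Count each label once per business, even if it is listed twice.
    counts = Counter(
        label for labels in businesses.values() for label in set(labels)
    )
    kept = sorted(label for label, c in counts.items() if c / n >= min_share)
    rows = {
        bid: {label: int(label in labels) for label in kept}
        for bid, labels in businesses.items()
    }
    return kept, rows


# Hypothetical sample records, for illustration only.
sample = {
    "b1": ["Restaurants", "Thai"],
    "b2": ["Restaurants", "Bars"],
    "b3": ["Restaurants", "Thai", "Desserts"],
    "b4": ["Restaurants"],
}
# A 50% cutoff is used here only because the sample is tiny.
kept, rows = encode_labels(sample, min_share=0.5)
```

With the real data, `min_share` would stay at 0.01 and the resulting columns would be joined to the other numerical variables by business_id.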

The business addresses had multiple spelling variations for similar cities. Using the postal code from each business’s address, the city name was then pulled using a Google API. Using the city name, demographic data for each location was added from a dataset compiled by students at Mexico State University, who also worked with the Yelp dataset. This included city-level data on education levels and population.

The checkin times dataset was not used, as it was incomplete. Of the 14,335 businesses, only 13,733 had checkin times. Using this data would have meant removing 602 businesses, a 4.2% reduction of the dataset.

Sentiment Analysis

The reviews and tips datasets presented a customer’s comment or review for each business. Only reviews of restaurants in Arizona were kept, using the business_id as the key to join the datasets. For each comment, only special characters were removed during preprocessing, as the sentiment analyzer used was VADER. VADER is optimized for sentiments expressed in social media and comments, and relies primarily on lexical features. It gives a value from -1 to 1, indicating how negative or positive the sentiment of a review is. Each review’s sentiment was then weighted by the number of upvotes it received, and the weighted sentiments were averaged for each business. The review star ratings were aggregated in the same way. This meant that reviews with more upvotes carried more weight than reviews with fewer. Each business then had an average review sentiment and an average number of stars received in reviews. The same process was applied to the tip data, which contained tips and advice from users for each business.
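The upvote-weighted averaging can be sketched as below. This is an illustration under stated assumptions, not the project's code: the function name and sample reviews are hypothetical, and the weight of `upvotes + 1` is an assumption so that reviews with no upvotes still count (the text does not say how these were handled). The sentiment values stand in for VADER output in the range -1 to 1.

```python
from collections import defaultdict


def weighted_business_scores(reviews):
    """Average review scores per business, weighted by upvotes.

    `reviews` is an iterable of (business_id, score, upvotes) tuples,
    where `score` is a sentiment value in [-1, 1] or a star rating.
    Assumption: each review gets weight `upvotes + 1`.
    """
    totals = defaultdict(float)   # sum of weight * score per business
    weights = defaultdict(float)  # sum of weights per business
    for business_id, score, upvotes in reviews:
        w = upvotes + 1
        totals[business_id] += w * score
        weights[business_id] += w
    return {bid: totals[bid] / weights[bid] for bid in totals}


# Hypothetical reviews: (business_id, sentiment, upvotes).
reviews = [
    ("b1", 0.8, 3),
    ("b1", -0.4, 0),
    ("b2", 0.5, 1),
]
scores = weighted_business_scores(reviews)
```

The same function can aggregate star ratings or tip sentiments by passing those values as `score`.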

Model 1 - Open or Closed Classification

The first consideration is whether a restaurant will be successful enough to remain open. This meant running a regression on all the category and attribute variables against a binary target variable of open or closed. The data was split 70-30 into training and validation sets, and the model chosen was a logistic regression classifier. A preliminary regression was run to identify variables that were significant at the 95% level, and best subset selection was then applied using only those variables. The final model had an accuracy of 80.526% on the training data and 79.144% on the validation split; the confusion matrix of predicted open vs actual open is presented below.
[Figure: Confusion matrix of predicted open vs actual open]
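For readers unfamiliar with the metric, the sketch below shows how a confusion matrix and accuracy are computed for a binary open/closed classifier. The labels are made up for illustration and are not the project's predictions; the function names are likewise hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes for a binary classifier (1 = open, 0 = closed)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1   # predicted open, actually open
        elif p == 1 and a == 0:
            counts["fp"] += 1   # predicted open, actually closed
        elif p == 0 and a == 1:
            counts["fn"] += 1   # predicted closed, actually open
        else:
            counts["tn"] += 1   # predicted closed, actually closed
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


# Made-up labels, for illustration only.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
matrix = confusion_matrix(actual, predicted)
score = accuracy(matrix)
```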

The model found that the features which increased the likelihood of remaining open were: fast food, classy ambience, bike parking, good for brunch, good for dinner, offering delivery and having a liquor license. Restaurants marked as having casual attire were more likely to close. The classification also indicated that restaurants serving Mediterranean or Chinese food were more likely to close (coefficients of -0.86548 and -0.33290, p-values of 1.58e-07 and 8.077e-03). Restaurants with more positive reviews were more likely to be open. Using this model, one can see that, for example, a restaurant is likely to remain open if it has a classy ambience, is good for dinner, offers delivery and has bike parking. It would fare better if it did not serve Mediterranean food.
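To make the coefficients easier to read, they can be converted to odds ratios. The sketch below assumes the quoted values are log-odds coefficients from a logistic regression in which "open" is the positive class, and that all other variables are held fixed; the function name is hypothetical.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(coefficient)


# Coefficients quoted in the text (assumed to be log-odds of staying open).
coefficients = {"Mediterranean": -0.86548, "Chinese": -0.33290}
ratios = {name: odds_ratio(b) for name, b in coefficients.items()}

for name, ratio in ratios.items():
    print(f"{name}: odds of remaining open multiplied by {ratio:.3f}")
```

A ratio below 1 means the feature lowers the odds of remaining open; the further below 1, the stronger the effect.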

Model 2 - Location Optimization Regression

Factors relating to lifestyle were highly correlated with the number of stars for a restaurant in each city. We decided to include the lifestyle indicators of all cities, using an approach drawn from a study conducted at Mexico State University. These indicators include education level (the percentage of the population that has completed high school, Bachelor's and graduate degrees), employment level, household affordability index (HAI) and cost of living index (CLI). A lifestyle score was also used, representing the average quality of life per person in each city, based on life expectancy, cost of living index, household affordability index and several other metrics.

Using dummy variables to represent each city, linear regression models can be created to determine how a city affects the expected number of stars for different restaurant types. The models include the characteristics of cities, such as population or quality of life score, as well as the restaurant features that affect a restaurant in each city. By varying the restaurant type and features, we can identify the best place to launch a restaurant (the place with the greatest predicted number of stars). For example, if one wanted to open a Chicken Wing restaurant, the model would indicate, with an adjusted R-squared score of 0.735, that the best city would be Glendale, AZ.

Additional features and search parameters can be included; increasing the number of pre-set restaurant variables, such as the restaurant price range, increases the predictive power of the model. For example, a Breakfast/Brunch restaurant with a Price Range Level of 1 (displayed as $ on Yelp) would be most successful in Gilbert, AZ, and would perform poorly in Surprise, AZ. The ranking of cities is determined first by their p-value, limited by a threshold set according to a desired tolerance (< 0.05), and then by their coefficient. In this example, the model had an adjusted R-squared score of 0.771.
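One reading of this ranking rule is to drop cities whose p-value is not below the threshold and then sort the rest by coefficient. The sketch below follows that reading; the function name, the `alpha` parameter and the city estimates are all made up for illustration and are not results from the project.

```python
def rank_cities(estimates, alpha=0.05):
    """Rank city dummy variables from a fitted regression.

    `estimates` maps a city name to (p_value, coefficient). Cities
    whose p-value is not below `alpha` are dropped; the rest are
    sorted by coefficient, highest first.
    """
    significant = {
        city: (p, b) for city, (p, b) in estimates.items() if p < alpha
    }
    return sorted(significant, key=lambda c: significant[c][1], reverse=True)


# Made-up estimates, for illustration only.
estimates = {
    "City A": (0.010, 0.30),
    "City B": (0.200, 0.90),   # large coefficient, but not significant
    "City C": (0.030, -0.25),
    "City D": (0.004, 0.45),
}
ranking = rank_cities(estimates)
```

Raising `alpha` admits more cities into the ranking at the cost of weaker statistical support for each.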

As another example, for American cuisine, the top places for the lowest price range (Yelp pricing of $) are Paradise Valley and Fountain Hills, with respective p-values of 0.050 and 0.139 and coefficients of 0.450 and 0.689. Opening this type of restaurant in Paradise Valley would increase the expected number of stars by 0.450. The worst locations would be Goodyear (p-value of 0.073, coefficient of -0.346), Chandler (0.047, -0.192) and Gilbert (0.009, -0.273). For the second price range ($$), the top three are Scottsdale (0.010, 0.114), Litchfield Park (0.176, 0.211) and Goodyear (0.053, 0.158). For the highest price range ($$$), the cities associated with an increase in the expected number of stars were Paradise Valley (0.057, 0.359) and Fountain Hills (0.298, 0.318), although neither p-value falls below the 0.05 threshold. It is advised not to open in Glendale, which had a p-value of 0.2 and a coefficient of -0.493. This means that a restaurant with American cuisine and a price level of 3 ($$$) would be expected to lose 0.493 stars if opened in Glendale.

Model 3 - Feature Optimization to maximize expected ranking

To increase a prospective restaurant’s star rating, the restaurant’s features as well as the location’s characteristics must be considered. Thus, the next step is to investigate which attributes make restaurants more successful overall, and which make them successful in a particular city relative to others. A logistic classification model was used to identify the features that are most important for increasing a restaurant’s predicted number of stars. This was done over the entire dataset, as well as for each specific city.

Over all restaurants, cafes and bars in Arizona, the model found the significance of each variable with an adjusted R-squared model score of 0.873. The restaurant features (categories and attributes) that increase the predicted number of stars include Thai (coefficient of 0.8458), lounges (0.7779) and desserts (0.4151). Select features that could decrease a restaurant’s star rating are fast food (-0.4896), or sports bars (-0.3503). The full list of features is available in Appendix A.

Models were then generated for each restaurant feature and location. For example, when opening a Chicken Wing restaurant in Mesa, the most important features indicating success are whether it delivers (0.4188), is good for dinner (0.2619) and serves beer or wine. In this example, a restaurant would expect fewer stars if it was good for kids (-0.3174), was open for many hours per day (-0.0353 per daily hour open) or served spirits as well as beer and wine. The model had a score of 0.847.


This project determines the optimal location and features a restaurant should have in order to maximize its success. The models developed can serve as guideposts for prospective restaurant owners on the top locations for their business and on the restaurant’s features. Success was scored as whether a restaurant was open or closed, multiplied by the number of stars (out of five) in the restaurant’s Yelp rating. For a prospective restaurant, Model 1 is first used to verify whether it is expected to remain open. Then, using Model 2 and any assumed features, the best location can be found. Once the location and features have been pre-selected, they can be used in Model 3 to determine which other features should be added to improve the overall expected star rating.
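The success score defined above can be written as a one-line function. This is a minimal sketch; the function name and the range check on stars are assumptions made for the example.

```python
def success_score(is_open, stars):
    """Success = open indicator (1 or 0) times the Yelp star rating."""
    if not 0 <= stars <= 5:
        raise ValueError("stars must be between 0 and 5")
    return (1 if is_open else 0) * stars


# An open 4.5-star restaurant scores 4.5; a closed one scores 0.
open_score = success_score(True, 4.5)
closed_score = success_score(False, 4.5)
```

Under this definition a closed restaurant always scores zero, regardless of how well it was rated while open.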

The results allow a restaurant owner to identify not only the best location for a type of restaurant, but also the most significant factors affecting the success of that type of restaurant and its price range, and to incorporate them into their business model.