Midterm peer review #7

changy12 · 2016-11-04T20:59:55Z

I read both your proposal and your mid-term report on predicting the rating a given customer would give a given restaurant, based on features of both the customer and the restaurant. (Feel free to correct me if I am wrong.)

First, I appreciate the value of this research problem. As you said, “The mode brings mutual benefits to both customers and business owners.”, and “In this way the customers could find the local businesses they are looking for more efficiently, and the business owners could improve their strategies by learning more about the target customers.”

For the mid-term report, I appreciate that you directly and clearly state your purpose in the first sentence, and your report is clear in content, language and structure.

What I admire more is your preprocessing. You were able to combine 5 datasets and match the corresponding users and businesses, which looks like hard work to me. Then, on such a large combined dataset, you did substantial cleaning and narrowed the focus to dining, which I think took considerable effort.

For the analysis, I appreciate that you have done extensive preliminary analyses with 3 models and used cross-validation to choose suitable parameters, which is standard practice.

Some points that may need improvement:
(1) As you said, linear regression is not appropriate for this problem. In fact, this problem is better treated as an ordinal regression or a classification problem, rather than as ordinary regression where the output is a continuous value.
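
To make the ordinal view concrete, here is a minimal sketch of one common way to turn an ordered 1-5 rating into binary targets ("is the rating greater than k?"). This is only an illustration under my own assumptions: the function names `ordinal_targets` and `decode` are hypothetical, and this is not necessarily the ordinal method you would end up choosing.

```python
# Hypothetical helpers for illustration; not taken from the report.
RATING_LEVELS = (1, 2, 3, 4, 5)

def ordinal_targets(rating, levels=RATING_LEVELS):
    """Encode a rating as binary answers to 'rating > k?' for every level k except the top one."""
    return [1 if rating > k else 0 for k in levels[:-1]]

def decode(targets, levels=RATING_LEVELS):
    """Recover the rating by counting how many thresholds were exceeded."""
    return levels[sum(targets)]

encoded_three = ordinal_targets(3)   # [1, 1, 0, 0]
decoded_three = decode(encoded_three)  # 3
```

Unlike a plain 5-way classification, this encoding keeps the information that a rating of 4 is closer to 5 than a rating of 1 is.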

(2) How do you deal with categorical features in logistic regression? Some related questions:
Since you used 45 features, why is the matrix W for logistic regression of size 46*5?
You said “We found the coefficients for restaurant categories were pretty low”. I did not quite understand this. Did you mean the variable called “restaurant categories”? If so, it is a categorical variable, right? In that case, if you encode it as 0-1 indicators, the variable “restaurant categories” will involve more than one coefficient, each corresponding to one kind of restaurant.
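
As a small sketch of what I mean by 0-1 encoding, the snippet below expands one categorical column into one indicator column per category. The category labels and the helper `one_hot` are made up for illustration and are not from your data. (One possible reason for a 46th row in W would be an intercept term, but that is only my guess.)

```python
def one_hot(values):
    """Map category labels to 0/1 indicator rows, one column per distinct label."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical restaurant categories, not taken from the report.
categories = ["pizza", "sushi", "pizza", "burgers"]
levels, encoded = one_hot(categories)

# With 5 rating classes, each indicator column gets one coefficient per class.
n_classes = 5
n_coefficients = len(levels) * n_classes
```

So a single "restaurant categories" variable with 3 kinds of restaurant would contribute 3 columns and, with 5 classes, 15 coefficients rather than one.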

(3) Your accuracy is defined as the proportion of correctly classified samples, right? If so, the mean accuracy on both the training and test datasets still needs improvement. In addition, if the 5 classes are imbalanced, it would be better to use a criterion other than plain classification accuracy, such as per-class precision and recall, macro-averaged F1, or a confusion matrix. For example, if rating=3 has only 30 training samples and 20 test samples, the accuracy may still be high even if the classifier never predicts rating=3 for any x.
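
Here is a minimal sketch of that effect, using made-up labels rather than your data: a test set with 20 samples of rating=3 and, as an assumption of mine, 980 samples of rating=4, scored against a classifier that never predicts 3. The helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its samples that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in sorted(totals)}

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Made-up, imbalanced test labels; the classifier always answers 4.
y_true = [3] * 20 + [4] * 980
y_pred = [4] * 1000

acc = accuracy(y_true, y_pred)           # high, despite ignoring rating=3
recalls = per_class_recall(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)  # exposes the missed class
```

In this toy case the accuracy is 0.98, while the recall for rating=3 is 0 and the balanced accuracy is only 0.5.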

(4) Some classification algorithms I know of so far: SVM, AdaBoost, Random Forest, RBF neural network, BP neural network, decision tree, Fisher linear discriminant analysis, kNN, etc. Among these, SVM and AdaBoost work well for many problems and are popular topics in machine learning.

(5) Finally, after successful prediction, how will you recommend local businesses to customers, and help business owners improve their strategies by learning more about their target customers?
You also said, “This python library also produces the output directly instead of the probabilities of each output”. In fact, I think you could use these probabilities to help you compare and recommend restaurants.
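
As a rough sketch of how the probabilities could be used, the snippet below ranks restaurants by expected rating instead of by the single most likely rating. The restaurant names and probability values are invented for illustration. If your library is scikit-learn (an assumption on my part), its classifiers such as `LogisticRegression` expose `predict_proba` for this purpose.

```python
RATINGS = (1, 2, 3, 4, 5)

def expected_rating(probs, ratings=RATINGS):
    """Probability-weighted average rating for one customer-restaurant pair."""
    return sum(r * p for r, p in zip(ratings, probs))

def most_likely_rating(probs, ratings=RATINGS):
    """The hard prediction: the rating with the highest probability."""
    return max(zip(probs, ratings))[1]

# Hypothetical class probabilities for one customer and two restaurants.
candidates = {
    "restaurant_a": [0.05, 0.05, 0.10, 0.60, 0.20],
    "restaurant_b": [0.30, 0.00, 0.00, 0.00, 0.70],
}

# Rank by expected rating, best first.
ranked = sorted(candidates, key=lambda name: expected_rating(candidates[name]), reverse=True)
```

Here restaurant_b has the higher hard prediction (5 versus 4), but restaurant_a has the higher expected rating because restaurant_b carries a sizable chance of a 1-star experience. The hard output alone would hide that difference.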

Your project looks promising to me. I wish you a fruitful and pleasant process.
Ziyi Chen (zc286)
