The Investigation of Multiple Product Rating Based on Data Mining Approaches

Ratings and product reviews can be considered one of the main indicators of product quality in online store systems, especially when deciding whether to add a product to an online store's inventory. Online vendors rely on product reviews and ratings to study potential products and to make related predictions. To this end, different machine learning algorithms, namely Support Vector Machines, Bayesian Networks, Random Forests and Logistic Regression, are investigated. The performance of each model is evaluated using accuracy, sensitivity and F1 score on data from the Amazon online store covering 1996 to 2014. The results of this paper can be used as an initial input to long-term product rating predictions.


1-Introduction
Ratings and product reviews are the main indicators of product quality in online store systems. Online vendors pay close attention to ratings when deciding which products to retain in their warehouses. They also believe that long-term product rating predictions would help them decide whether to introduce a given product on the store website.
With the expansion of electronic commerce platforms such as Amazon and eBay, online purchasing has been recognized as the most important trading method of the last decade. Although the main advantage of e-purchasing is that customers need not be physically present at the store, this also means that they cannot physically evaluate the product and must complete the purchase based on their own judgment. Hence, after considering price, online customers pay direct attention to product ratings and reviews when making a purchasing decision. For each product, the rating information includes two values: a) the average product rating and b) the number of voters. As shown in Figure 1-1, if an online product is rated by many users, the customer can be confident that the product information is reliable. On the other hand, if a product is rated by only a few users, the customer may not feel confident in the purchasing decision.
Figure 1-1, Product Rating Sample by Amazon
As the number of voters increases, the average product rating moves closer to the average rating of the whole population in the long run. Long-term product predictions are beneficial for both retailers and customers. Various studies have shown that user ratings have an undeniable effect on customer purchasing decisions, as well as on business profits. Besides their importance for online vendors, product ratings are also an integral part of the world's leading web services such as Amazon, TripAdvisor, Epinions, and Yelp, where users can express their opinions about a product, company or business by writing textual reviews. These rating systems usually contain a text field and a star rating evaluation. A user's rating entry for a product would usually look as follows:

Restaurant: A Restaurant, Shiraz, B. Street
Food Quality
Service Quality
Environment
Textual review: I have eaten at restaurant A many times. The food quality and variety are good. But it's a small place, so you can never go straight there and find a seat available. Another problem is the slow service, so you have to wait a long time.
As can be seen from the rating system above, the user performs two tasks. One is assigning rating scores to different characteristics of a product, and the other is writing a textual review in the text field. These textual reviews can affect the perceived quality of a product positively or negatively. Because multiple rating systems provide little product-related information, product quality has to be judged on the basis of long-term prediction. Gano et al. (2009) tried to improve the rating system by examining user experiences. Online comments are considered important to users in the purchasing process. However, most comments are written in free text, so computer systems cannot easily understand, analyze, and aggregate them. If the structure and sentiment of the reviews are taken into account, the user experience can be greatly improved. Consequently, they focused on identifying information in free text and using that knowledge to improve the user experience.

2-Literature Review
Li Hong and colleagues (2010) presented a method to improve numerical rating prediction. Unigram and n-gram representations were the most commonly used. Unigrams cannot capture important phrases such as "could have been better", which are essential for prediction models. N-grams, on the other hand, do capture such expressions, but they usually perform poorly because they are sparse in the training set, and are therefore unable to produce powerful predictions. Given the limitations of these two models, a new type of representation was introduced, consisting of a root word, a set of words that modify it, and negation words. They also provided a constrained Ridge regression algorithm for learning from the reviews. The experiments showed that the proposed opinion-based methodology performs much better than earlier state-of-the-art techniques for review rating prediction.
Fan and Khademi (2014) considered the text of a review together with its numerical score, and predicted the numerical score from the user's text alone. Online reviews are a valuable source of information for users, but because of the large number of these texts, it is almost impossible for users to find the information they seek by reading all the reviews. One way to summarize a business is to assign it a rating from 1 to 5. This rating can be subjective and biased by the individual user's judgment. They therefore predicted a business rating from the user-generated review texts alone, which not only provides an overview of the high-level ideas in the texts but also cancels out individual subjectivity.
Another study introduced online communities as an attractive source of ideas that are relevant for new product development and innovation. However, making sense of the 'big data' in these communities is a complex analytical task. The paper described how to tune the model and which text mining steps to perform. The results suggest that machine learning and text mining can be useful for detecting ideas in online communities.
Miller et al. (2018) mentioned that supervised methods are likely to provide better qualitative results, model selection procedures, and model performance measures. They illustrated that much of the expense of manual corpus labeling comes from common sampling practices, such as random sampling, that result in sparse coverage across classes and duplicated effort by the expert who labels the texts. Furthermore, they outlined several active learning methods for iterative text modeling and article sampling, which help researchers train high-performance text classification models.
Usai et al. (2018) raised awareness of the potential of text mining techniques to discover knowledge and to further promote research collaboration between the knowledge management and information technology communities.
Since its emergence, text mining has involved multidisciplinary studies, various database technologies, Web-based collaborative writing, text analysis, machine learning and knowledge discovery.

3-Research Goal
Investigation of multiple product rating based on data mining approaches: For the objective of rating various attributes of a certain product, text mining approaches and classification methods were used.

3-1-Text Binary Classification
The goal is to derive the classes attached by users from the corresponding ratings and textual reviews provided by Amazon. For this purpose, only two classes are considered as target outputs: 1) Low Rate or Class 0: rating scores less than 3. 2) High Rate or Class 1: rating scores equal to or greater than 3. The aim of the binary classification algorithm is to assign class 0 or class 1 to a new, unseen textual review. Generally, the high rate class has a positive impact on customer purchasing decisions, while the low rate class tends to discourage customers from purchasing products.
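The class definition above can be illustrated with a minimal Python sketch. The function name `to_binary_class` is a hypothetical helper introduced here for illustration; only the threshold of 3 comes from the text.

```python
def to_binary_class(rating: float) -> int:
    """Map a star rating to Class 0 (low rate, < 3) or Class 1 (high rate, >= 3)."""
    return 1 if rating >= 3 else 0


ratings = [1, 2, 3, 4, 5]
labels = [to_binary_class(r) for r in ratings]
print(labels)  # [0, 0, 1, 1, 1]
```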

3-2-Text Multi-class Classification
The strategy is to extend the binary classification mode in order to assign classes to textual reviews according to the exact star rating. Unlike binary classification, in which samples are assigned to one of two classes, multi-class classification assigns samples to more than two classes with nominal values. As can be seen from Figure 3-2, each rating score is mapped to a specific class in a one-to-one relationship:
$x_i \in \{1^*, 2^*, 3^*, 4^*, 5^*\}$ (Formula 3-1)
Figure 3-2, Multi-class Classification
Some classification algorithms, such as Bayesian networks and Logistic regression, are naturally able to handle multi-class classification. Other algorithms, such as Support Vector Machines, are designed for binary classification and require a further strategy, such as one-versus-one, to handle multi-class classification.
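The one-versus-one strategy mentioned above can be sketched as follows: one binary classifier is built for every pair of classes, and the class that collects the most pairwise votes wins. This is only an illustration of the voting scheme. The pairwise deciders produced by `make_toy_pairwise` are hypothetical stand-ins (a real system would place a trained binary Support Vector Machine in each slot), and the numeric input `x` is a made-up feature.

```python
from collections import Counter
from itertools import combinations

CLASSES = [1, 2, 3, 4, 5]  # the five star classes


def make_toy_pairwise(a, b):
    """Hypothetical stand-in for a trained binary classifier separating a from b.

    It picks whichever class is numerically closer to the input feature.
    """
    def decide(x):
        return a if abs(x - a) <= abs(x - b) else b
    return decide


# One binary classifier per unordered pair of classes: 5 * 4 / 2 = 10 in total.
pairwise = {(a, b): make_toy_pairwise(a, b) for a, b in combinations(CLASSES, 2)}


def predict_one_vs_one(x):
    """Let every pairwise classifier vote and return the class with most votes."""
    votes = Counter(clf(x) for clf in pairwise.values())
    return votes.most_common(1)[0][0]


print(len(pairwise))            # 10
print(predict_one_vs_one(4.2))  # 4
```

With five classes the scheme needs ten binary classifiers, which is part of why multi-class training with such algorithms takes longer than binary training.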

3-3-Logistic Regression
At first glance, Logistic regression and multi-class classification may look similar, but they are theoretically different. In Logistic regression, class values are numbers between 1 and 5, and their order is preserved; for example, class 4 is better than class 1. As a result, placing a true 5-Star product in class 4 counts as a smaller error than placing it in class 1, whereas multi-class classification treats both errors equally.
$x_i \in \{1, 2, 3, 4, 5\}$ (Formula 3-2)
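The effect of preserving class order can be illustrated with a small sketch. Mean absolute error is used here as one possible order-aware measure; it is an assumption of this example, not a measure reported in the paper, and the label lists are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches; ignores how far off a wrong prediction is."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance between true and predicted star values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [5, 5, 5, 5]
near_miss = [4, 4, 4, 4]  # every 5-Star review predicted as 4-Star
far_miss = [1, 1, 1, 1]   # every 5-Star review predicted as 1-Star

print(accuracy(y_true, near_miss), accuracy(y_true, far_miss))  # 0.0 0.0
print(mean_absolute_error(y_true, near_miss))                   # 1.0
print(mean_absolute_error(y_true, far_miss))                    # 4.0
```

Accuracy cannot tell the two sets of predictions apart, while an order-aware measure shows that the near miss is the smaller error.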

3-4-Text Classification Implementation
In order to implement a text classification algorithm, a few stages must be completed; they are described in full in the research methodology.

5-Research Methodology

5-1-Data Collection
For the objectives of this paper, different product datasets containing customer ratings were used. The extracted files, which are derived from Amazon, represent only a subset of the data in which every product has at least 5 ratings and every user has rated at least 5 times. Duplicate ratings, which are less than 1% of the total, were eliminated. Each rating includes the following labels: 1) Rating ID 2) Product ID 3) Voter Name 4) Textual Review 5) Rating Efficiency Score to other users 6) Rating Score. Among the above labels, we have only worked with the textual reviews plus the rating results and summaries. The remaining elements were considered unrelated in a context-sensitive framework. For extracting this information, the JSON data format, the R programming language and Excel were used. Furthermore, in this paper experienced products and searched products are considered as the two main categories of data. The quality of experienced products is difficult to evaluate, because a large number of users must buy the product in order to assess its quality. The sample used from this category is computer games. The quality of searched products, on the other hand, can be evaluated by examining their key features on the internet, so purchasing the product is not necessary. The sample used from this category is Mobile and Accessories.
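As a rough illustration of the extraction step, the sketch below reads review records in JSON form and keeps only the review text, summary and rating score. The paper used R and Excel for this step; Python is used here instead. The field names `reviewText`, `summary` and `overall`, and the two sample records, are assumptions made for this example and are not taken from the paper.

```python
import json

# Two made-up records, one JSON object per line, standing in for the real file.
SAMPLE_LINES = [
    '{"reviewerName": "A", "asin": "P1", "reviewText": "Great game, hours of fun.", '
    '"summary": "Great", "overall": 5.0}',
    '{"reviewerName": "B", "asin": "P2", "reviewText": "Stopped working after a week.", '
    '"summary": "Broke fast", "overall": 1.0}',
]


def load_reviews(lines):
    """Parse JSON lines and keep only the fields used for classification."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            "text": record.get("reviewText", ""),   # assumed field name
            "summary": record.get("summary", ""),   # assumed field name
            "rating": int(record["overall"]),       # assumed field name
        })
    return rows


reviews = load_reviews(SAMPLE_LINES)
print(reviews[0]["rating"], reviews[1]["summary"])  # 5 Broke fast
```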

5-2-Variable Selection
Dependent variables: The Amazon rating system, which is based on a 5-Star evaluation, is used as the dependent variable. The dependent variable takes different values depending on the classification mode:
1) For the text binary classification algorithm, the dependent variable takes the values low rate and high rate.
2) For the text multi-class classification algorithm, the dependent variable takes the values 1-Star, 2-Star, 3-Star, 4-Star and 5-Star.
3) For Logistic regression, the dependent variable takes the values 1, 2, 3, 4, 5.
Independent variables: Textual reviews, rating results plus summaries, and the combination of the two are used as independent variables. The goal is to find out which of these three variables provides better performance given the complex structure of the text.

5-3-Data Cleansing
Textual reviews naturally contain duplicate and non-essential words, so data preprocessing is used to clean them. The preprocessing steps are: 1) Punctuation deletion 2) Number deletion 3) Extra space deletion 4) Stop word deletion 5) Lowercase conversion. Data preprocessing simplifies the data and improves accuracy in the classification task. The data cleaning process is done using the R programming language and its related packages. This stage produces a cleaned training dataset that includes the rating score and the cleaned textual reviews.
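The five cleaning steps can be sketched as below. The paper performed this stage in R; Python is used here instead. The short `STOP_WORDS` set is a made-up placeholder for a full stop word list, and lowercase conversion is applied first so that stop word matching is case-insensitive.

```python
import re
import string

# Small illustrative stop word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "to", "of", "i", "was"}


def clean_review(text: str) -> str:
    """Apply the five preprocessing steps to one review."""
    text = text.lower()                                                # lowercase conversion
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation deletion
    text = re.sub(r"\d+", "", text)                                    # number deletion
    words = [w for w in text.split() if w not in STOP_WORDS]           # stop word deletion
    return " ".join(words)                                             # extra space deletion


print(clean_review("The  battery lasted 2 days, and it WAS great!"))
# battery lasted days great
```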

5-4-Data Resampling
Before the classification process, the data distribution is checked. The aim is to determine whether the data needs to be resampled.
In the mobile and accessories dataset, 13% of the ratings are low rate and 87% are high rate. By star level, 7% of the ratings are 1-Star, 6% are 2-Star, 11% are 3-Star, 12% are 4-Star and 55% are 5-Star.
Both datasets clearly have an unbalanced distribution; therefore, the data must be resampled in order to prevent bias and misleading results. In this paper, oversampling and undersampling methods are used to eliminate the data distribution problems.
Vol.10, No.5, 2019
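A minimal sketch of the two resampling options follows, assuming plain random resampling; the paper does not state which variant was used. The function `resample` and the toy data, built to mirror the 13% / 87% binary split reported above, are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def resample(samples, labels, mode="under", seed=0):
    """Balance classes by random undersampling ("under") or oversampling ("over")."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    sizes = [len(items) for items in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if mode == "under":
            chosen = rng.sample(items, target)  # drop surplus majority samples
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra              # duplicate minority samples
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy data: 13 low rate and 87 high rate reviews.
samples = [f"review {i}" for i in range(100)]
labels = [0] * 13 + [1] * 87

_, under_labels = resample(samples, labels, mode="under")
_, over_labels = resample(samples, labels, mode="over")
print(Counter(under_labels))  # 13 samples per class
print(Counter(over_labels))   # 87 samples per class
```

The balanced set shrinks from 100 to 26 samples in one case and grows to 174 in the other, which is why the two options differ so much in training time.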

5-5-Classification Algorithm Learning
In the learning stage, the classification algorithms are trained on labeled samples, each of which carries a known class. This stage enables the algorithms to learn from the samples and then correctly classify new ones. Different classification algorithms are trained in order to determine the best one according to the performance measurements. To implement this learning stage, the Python programming language and its related libraries were used. All implemented classification algorithms were trained on both datasets using the oversampling and undersampling methods.
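To make the learning stage concrete, the sketch below trains a very small text classifier on labeled reviews and then classifies unseen text. It is not the authors' implementation, and the paper does not name the libraries it used. A multinomial naive Bayes model over word counts is used here as a simple stand-in for the Bayesian classifier; the class `NaiveBayesText` and the four training reviews are made up for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Minimal multinomial naive Bayes over bag-of-words counts (Laplace smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior of the class
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.split():
                if word in self.vocab:       # skip words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_texts = [
    "great phone love it",
    "excellent battery great screen",
    "broke after week terrible",
    "terrible waste of money",
]
train_labels = [1, 1, 0, 0]  # 1 = high rate, 0 = low rate

model = NaiveBayesText().fit(train_texts, train_labels)
print(model.predict("great screen"))    # 1
print(model.predict("terrible money"))  # 0
```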

5-6-Evaluation
Evaluation enables us to measure the performance and effectiveness of the trained classification algorithms. In other words, we want to see whether a classification algorithm is able to correctly classify new, unseen instances. Performance evaluation of a classification algorithm includes the following tasks: 1) Creating a parallel environment to simplify the conversion, transformation, and classification steps of the earlier stages. 2) Implementing 10-fold cross-validation.
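The evaluation step can be sketched in two parts: splitting the data into 10 folds, and computing accuracy, sensitivity and F1 score on a fold's predictions. The helper names `k_fold_indices` and `binary_scores` and the toy labels are assumptions of this example. Contiguous folds are used for simplicity; in practice the data would normally be shuffled first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def binary_scores(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall) and F1 score for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity, "f1": f1}


folds = list(k_fold_indices(25, k=10))
scores = binary_scores([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(len(folds), scores["accuracy"])  # 10 0.6
```

Each sample appears in exactly one test fold, so averaging the per-fold scores gives an estimate based on predictions for data the model did not see during training.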

6-Numerical Results Obtained from Classification Algorithms
Performance measurements for the different classification algorithms are presented in the following tables:
1) Table 6-1, Bayesian Networks Algorithm
2) Table 6-2, Support Vector Machines Algorithm
3) Table 6-3, Random Forests Algorithm
4) Table 6-4, Logistic Regression Algorithm
The measurements were carried out on both the computer games and the mobile datasets using the oversampling and undersampling methods. The following points apply to the tables:
1) The results of the support vector machine algorithm using the undersampling method could not be displayed due to slow execution. These results are therefore marked with "-".
2) Colored sections represent the best performance among all the mentioned classification algorithms, and the best performance in predicting the precise score using logistic regression.

7-Numerical Analysis of Classification Algorithms Performance

7-1-Analyze the Results of Unbalanced Data
As can be seen from the numerical results for the computer games dataset:
1) Binary classification on the unbalanced dataset provides fairly good results across the different classification algorithms. The F1 score takes values between 0.8 and 0.88, and the support vector machine algorithm provides the best performance.
2) Logistic regression on the unbalanced dataset predicts the precise rating score, with an F1 value of 0.51 at its best.
As can be seen from the numerical results for the mobile and accessories dataset:
1) Binary classification on the unbalanced dataset provides fairly good results across the different classification algorithms. The F1 score takes values between 0.82 and 0.92, and the support vector machine algorithm provides the best performance. These results are largely similar to those for the computer games dataset.
2) Logistic regression on the unbalanced dataset predicts the precise rating score, with an F1 value of 0.61 at its best.

7-2-Analyze the Results of Resampled Data
Oversampling provides better results than undersampling in both binary and multi-class classification. With data resampling, multi-class classification gives better results in predicting the precise rating score. Although the results of oversampling are promising, with an F1 score reaching 0.90, the method has a notable weakness: it produces bigger datasets, and consequently execution time increases. For this reason, in this paper we used the results of undersampling in the subsequent analysis.
As can be seen from the numerical results for the computer games and the mobile and accessories datasets, the outcomes of the support vector machine and the Bayesian network are close to each other in terms of accuracy. Although the support vector machine algorithm works slightly better than Bayesian networks, it suffers from slower execution. It should also be noted that the classification algorithms provide better outcomes when using rating results and summaries.

8-Examine the Performance of Support Vector Machine algorithm
In the following, we investigate the performance of the support vector machine algorithm in binary and multi-class classification. Here the support vector machine algorithm was used with the undersampling method and with rating results and summaries as input. It obtained an F1 score of 0.84 for binary classification and 0.52 for multi-class classification.

8-1-Support Vector Machine Algorithm -Binary Classification
As can be seen from Table 8-1, the results for low rate classification have improved greatly. It is also clear that low rate classification is more accurate than high rate classification.

8-2-Support Vector Machine Algorithm -Multi class Classification
As can be seen from Table 8-2, classifying test data into the 1-Star and 5-Star classes is a simple task for the algorithm. The algorithm is clearly weak at correctly classifying ratings into the 3-Star class. Furthermore, in the worst case, the support vector machine algorithm may mistake 5-Star ratings for 4-Star ratings.
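Per-class behavior of this kind is usually read from a confusion matrix. The sketch below shows how such counts and per-class recall can be computed. The label lists are made up for illustration and do not reproduce the values of Table 8-2.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true, y_pred):
    """Share of each true class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    totals = Counter(y_true)
    return {c: counts[(c, c)] / totals[c] for c in sorted(totals)}


# Made-up labels: the middle class is confused with its neighbours.
y_true = [1, 1, 3, 3, 3, 5, 5, 5]
y_pred = [1, 1, 2, 3, 4, 5, 5, 4]

recall = per_class_recall(y_true, y_pred)
print(recall[1])                                # 1.0
print(confusion_counts(y_true, y_pred)[(5, 4)])  # 1
```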

9-Conclusion and Future Work
In this paper, we considered different models for predicting rating scores from text. Text classification algorithms automatically assign a text document to one of a fixed set of classes. The goal of binary classification is to classify data into high rate or low rate classes, whereas the goal of multi-class classification and logistic regression is to find the precise category or rating score. These methodologies were tested on two different datasets, covering experienced and searched products. Implementing a text classification algorithm on an unbalanced dataset is not an easy task, and resampling techniques are needed in order to balance the datasets. According to the performance evaluation, the most successful classification algorithms are the Support Vector Machine and Bayesian networks.
For future work, we recommend analyzing further datasets from Amazon, using resampling methods other than oversampling and undersampling, considering the K most frequent words in the context, improving the bag-of-words method, applying unigram, bigram and trigram features, and performing sentiment analysis.

References
The use of text-mining and machine learning algorithms in systematic reviews: reducing workload in preclinical biomedical sciences and reducing human screening error. bioRxiv, 255760.
Ming Fan and Maryam Khademi. Predicting a business star in yelp from its reviews text alone. arXiv preprint arXiv: