Do we rate as we write?

Introduction

Standing in front of us is a dataset of thousands of beer reviews taken from the BeerAdvocate website. Each reviewer has given a 1-5 rating for each of four beer aspects: Aroma, Palate, Appearance and Taste, along with an Overall grade. In addition, they provided a textual review further explaining their opinion of a specific beer. In a nutshell, our aim is to explore what it is that we write in these reviews that significantly influences the ratings. These conclusions can help breweries adjust their production according to people’s preferences and improve targeting. Do people often say that a beer is too sweet? That is an indicator to reduce sugar during the brewing process. On the other hand, are taste and appearance rated well? The breweries can then focus their commercials on slow-motion close-ups of a person drinking from a transparent glass, with their mouth in the forefront.

So, we will focus on answering two main questions:

~ Which beer aspects are the most important for users? Namely, what are the aspects that correlate positively with their overall rating?

~ What keywords do reviewers use about these aspects that are decisive factors when they give good grades?

One might ask, why do we even introduce textual analysis in our work? Okay, for our second question, the answer is quite straightforward – we need it to actually extract important words. But what about the first one? Aren't the grades users give enough to infer if an aspect stands out as influential? Well, there are two problems that arise.

First, people mostly give good grades. As can be seen from Figure 1, a vast majority of reviews (>90%) have numerical ratings all higher than 3. Therefore, if a certain aspect indeed stands out, its difference in rating compared to the other ones will not be so easy to notice.

Figure 1 : Users mostly give high ratings

Furthermore, the problem is hard due to a psychological phenomenon called the Halo Effect: the tendency for positive impressions of a person, company, brand, or product in one area to positively influence one’s opinion or feelings in other areas. For example, if a person is amazed by the beer bottle, they might transfer their pleasure to the taste. Or if the aroma reminds them of their mom’s cookies, they will probably rate the beer higher than it deserves. In fact, a glimpse at our dataset shows that around 56% of reviews have numerical aspect ratings that almost coincide (as can be seen in Figure 2), and it is not very likely that each beer aspect really follows the same pattern.

Figure 2 : Users similarly rate aspects
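The two shares quoted above can be computed along the following lines. This is only a minimal sketch: the column names and the reading of "almost coincide" as a spread of at most 0.5 between the highest and lowest aspect rating are our assumptions for illustration, and the toy table is made up.

```python
import pandas as pd

# Assumed column names for the four aspect ratings.
ASPECTS = ["aroma", "palate", "appearance", "taste"]


def rating_shares(reviews: pd.DataFrame, tolerance: float = 0.5):
    """Return (share of reviews with every aspect > 3,
    share of reviews whose aspect ratings nearly coincide)."""
    ratings = reviews[ASPECTS]
    all_high = (ratings > 3).all(axis=1)
    # Assumption: "almost coincide" means max - min <= tolerance.
    spread = ratings.max(axis=1) - ratings.min(axis=1)
    return all_high.mean(), (spread <= tolerance).mean()


# Made-up toy reviews, only to show the call.
toy = pd.DataFrame({
    "aroma": [4.0, 4.5, 2.0, 3.5],
    "palate": [4.0, 3.5, 2.5, 4.0],
    "appearance": [4.5, 4.0, 4.0, 3.5],
    "taste": [4.0, 5.0, 1.5, 4.0],
})
high_share, close_share = rating_shares(toy)
```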

So, to overcome these phenomena, we use our good old BERT-based aspect sentiment analysis model (or BEERT as we like to call it) to find out what people really had to say about the beer, aspect-wise.

Aspect based sentiment analysis

Taste, Aroma, Palate or Appearance?

To answer our first question regarding the beer aspects most important to consumers, we conduct aspect-based sentiment analysis or ABSA. We do this by conducting an observational study between "absolute winner" and "absolute loser" beers, which we'll explain in a bit. Before that, we have to introduce a preprocessing step.
Considering that numerical ratings don't convey proper information about users' attitude towards a beer, we give our BEERT model the context (one of the four beer aspects) and it provides us with positive/neutral/negative sentiment scores for that aspect from the textual review. The scores are in the [0,1] interval and sum up to 1. Since we want to determine whether a certain sentiment about an aspect is present, we convert each sentiment score to a binary indicator by comparing it to a threshold. What's more, as we look for aspects with a strong sentiment, our goal is to have only one of the three aspect-specific indicators set (which means that this sentiment is dominant) or to have all of them equal to zero (we are not certain that the sentiment is polar enough). Because the three scores sum up to 1, any threshold above 0.5 guarantees this. To be very confident in our results, we set the threshold to 0.9.
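The thresholding step can be sketched as follows. The scores below are made up, and whether the comparison is strict or not is our assumption (the sketch uses >=); only the 0.9 threshold comes from the text.

```python
import numpy as np

THRESHOLD = 0.9  # value quoted in the text


def to_indicators(scores, threshold: float = THRESHOLD) -> np.ndarray:
    """Map rows of (positive, neutral, negative) scores to 0/1 indicators."""
    scores = np.asarray(scores, dtype=float)
    # Assumption: a score counts as present when it reaches the threshold.
    return (scores >= threshold).astype(int)


# Made-up scores for one aspect of three reviews; each row sums to 1.
aroma_scores = np.array([
    [0.95, 0.03, 0.02],  # confidently positive
    [0.50, 0.30, 0.20],  # no dominant sentiment
    [0.01, 0.04, 0.95],  # confidently negative
])
aroma_indicators = to_indicators(aroma_scores)
```

Since each row sums to 1, at most one indicator per row can be set with a threshold of 0.9, and a row without a dominant sentiment becomes all zeros.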

Note that, because BEERT is computationally expensive, we're working with a subset of our BeerAdvocate data: we keep only the reviews with fewer than 2000 words. Even with this approach, BEERT needed around 60h to extract the sentiments.

Our observational study compares two groups of beers: absolute winners and absolute losers. Within a matched pair, the absolute winner has every numerical aspect rating greater than that of the absolute loser. We want to investigate how each aspect sentiment derived from text influences the overall rating. Therefore, we randomly match pairs of better and worse beers, but we have to be careful in this process and eliminate as many confounders as possible that might endanger our analysis. Matched beer reviews:

are written by the same user

are related to beers with the same style and alcohol by volume (ABV)

do not differ greatly in length, because BEERT tends to infer higher sentiment scores for longer reviews

are given in the same season, to cancel out any possible underlying seasonal effects

In total, we are working with around 62k matched reviews.
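A matching procedure along these lines could be sketched as below. This is an illustration under our own assumptions, not the exact pipeline: the column names, the length tolerance `max_length_ratio`, and the greedy one-to-one pairing after a random shuffle are all hypothetical choices.

```python
import pandas as pd

ASPECTS = ["aroma", "palate", "appearance", "taste"]  # assumed column names
MATCH_KEYS = ["user", "style", "abv", "season"]       # confounders held equal


def match_pairs(reviews: pd.DataFrame, max_length_ratio: float = 1.5,
                seed: int = 0) -> pd.DataFrame:
    """Randomly pair winner and loser reviews that agree on MATCH_KEYS."""
    # All candidate pairs of reviews sharing user, style, ABV and season.
    pairs = reviews.merge(reviews, on=MATCH_KEYS, suffixes=("_win", "_lose"))
    # The winner must beat the loser on every aspect rating.
    dominates = pd.concat(
        [pairs[f"{a}_win"] > pairs[f"{a}_lose"] for a in ASPECTS], axis=1
    ).all(axis=1)
    # Review lengths must be comparable (hypothetical tolerance).
    lengths = pairs[["n_words_win", "n_words_lose"]]
    similar_length = lengths.max(axis=1) <= max_length_ratio * lengths.min(axis=1)
    candidates = pairs[dominates & similar_length].sample(frac=1, random_state=seed)
    # Greedy pass over the shuffled candidates: each review is used at most once.
    used, kept = set(), []
    for row in candidates.itertuples():
        if row.review_id_win not in used and row.review_id_lose not in used:
            used.update((row.review_id_win, row.review_id_lose))
            kept.append(row.Index)
    return candidates.loc[kept]


# Made-up toy reviews, only to show the call.
toy = pd.DataFrame({
    "review_id": [1, 2, 3, 4],
    "user": ["u1", "u1", "u1", "u2"],
    "style": ["American IPA"] * 4,
    "abv": [6.5] * 4,
    "season": ["summer"] * 4,
    "aroma": [4.5, 3.0, 3.5, 2.0],
    "palate": [4.5, 3.0, 3.5, 2.0],
    "appearance": [4.5, 3.0, 3.5, 2.0],
    "taste": [4.5, 3.0, 3.5, 2.0],
    "n_words": [100, 120, 400, 100],
})
matched = match_pairs(toy)
```

In the toy data only reviews 1 and 2 form a valid pair: review 3 is much longer than the others and review 4 comes from a different user.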

Let’s take a look at the following plot (Figure 3), depicting the number of positive/neutral/negative indicators and the total number of sentiment indicators for each aspect of winner and loser beers. Simply by looking at bar heights, we can immediately notice that people are more expressive about beers they rated higher. But a more important result is that Aroma and Taste clearly stand out: people have more polarized opinions when writing about these two aspects.

Figure 3 : Number of indicators per aspect per sentiment

This could not be concluded merely by looking at the numerical ratings in the eCDF plot (Figure 4) for each aspect of winner beers, where it seems that people grade all aspects similarly (the halo effect seen in real life :))

Figure 4 : eCDF of aspect ratings for absolute winner

It would also be interesting to see if some aspect sentiments are prevalent in winner beers. To investigate this, we estimate the average difference and 95% confidence intervals for the difference in sentiment indicators between winner and loser beers for each aspect. We then visualize the results in Figure 5. For positive sentiments, all C.I.s are above zero, and for negative, all are below zero. But, this result isn’t that interesting – it’s expected that the positive sentiment is prevalent in winners, while the negative is prevalent in losers. What’s interesting is the intensity of the influence of aspect sentiments on respective beer groups. Let’s look at positive indicators – clearly, Aroma and Taste stand out compared to the other two. But what does this mean? Imagine an edge case when all winners have a positive indicator for an aspect “1” and all losers have “0”. This would mean we can only look at that specific aspect and tell if a beer is a winner or a loser, regardless of the numerical ratings. Well, Aroma and Taste are the closest to that edge case, so we conclude that they have the strongest influence on better-rated beers. Similarly, negative Taste sentiment is the best indicator of a beer being rated lower.

Figure 5 : Confidence intervals for differences of aspects
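For readers who want to reproduce this kind of interval, here is a minimal sketch for one aspect and one sentiment. The text does not state how the intervals were obtained, so the percentile bootstrap over matched pairs is our assumption, and the indicator data is simulated.

```python
import numpy as np


def paired_diff_ci(winner_ind, loser_ind, n_boot: int = 2000,
                   alpha: float = 0.05, seed: int = 0):
    """Mean winner-minus-loser indicator difference with a bootstrap CI."""
    diffs = np.asarray(winner_ind, dtype=float) - np.asarray(loser_ind, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample matched pairs with replacement and recompute the mean each time.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), low, high


# Simulated positive-Taste indicators for 500 matched pairs (made-up rates).
rng = np.random.default_rng(1)
winner_taste_pos = rng.random(500) < 0.6
loser_taste_pos = rng.random(500) < 0.2
mean_diff, ci_low, ci_high = paired_diff_ci(winner_taste_pos, loser_taste_pos)
```

An interval lying entirely above zero means the positive indicator is more common among winners; the closer the mean difference gets to 1, the closer we are to the edge case described above.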

The two analyses we conducted above both confirm that Aroma and Taste are prevalent in winner reviews compared to Appearance and Palate. We therefore deduce that these two are the most influential on the ratings, and even though people often give similar grades to all aspects, they mostly care about Aroma and Taste.

Keywords

What do you like about the beer?

Keywords and beer styles

ABSA showed that Aroma and Taste are key when people write a review. These aspects influence the reviews overall. Now we want to go deeper and find out which aspect characteristics map to a high rating. We cannot conduct this analysis globally since different beers may have different traits. For example, we expect one to mention citrus notes for a Grape beer and not for an IPA, which is characterized as bitter. That’s why we conduct our analysis for each beer style individually, but only for those with at least 40k reviews, so that we have enough data points for reliable results. There are 15 such beer styles in the dataset.

Figure 6 : Number of reviews for the most rated styles

To answer our question, we start by investigating which style-specific keywords extracted from review text correlate with a high rating. We call these overall keywords. For each style, we take the 100 most frequent words across all reviews of that style, excluding stopwords. Now, similarly to ABSA, we construct a binary indicator for each keyword, denoting its presence in each review of that style. To see how each keyword influences the grades, we fit a linear regression with Rating (an aggregate of numerical aspect grades) as the target and the indicators as covariates. Finally, we keep as overall keywords those whose coefficients are statistically significant (p-value < 0.05) and whose presence changes the Rating by more than 0.1.
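The procedure for one style can be sketched as follows. This is a minimal sketch under several assumptions of ours: a tiny illustrative stopword list, a simple lowercase tokenizer, p-values from a normal approximation (reasonable with tens of thousands of reviews per style), and "change greater than 0.1" read as an absolute coefficient above 0.1. The data is simulated.

```python
import math
import re
from collections import Counter

import numpy as np

# Tiny illustrative stopword list; a real run would use a full one.
STOPWORDS = {"the", "a", "and", "is", "of", "with", "it", "this"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(texts, k=100):
    """The k most frequent non-stopword tokens across all reviews."""
    counts = Counter(w for t in texts for w in tokenize(t) if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]


def overall_keywords(texts, ratings, k=100, p_max=0.05, min_effect=0.1):
    """Keywords with a significant and sizeable regression coefficient."""
    keywords = top_keywords(texts, k)
    token_sets = [set(tokenize(t)) for t in texts]
    X = np.array([[float(w in s) for w in keywords] for s in token_sets])
    # Drop keywords present in every review: they duplicate the intercept.
    varying = X.std(axis=0) > 0
    keywords = [w for w, keep in zip(keywords, varying) if keep]
    X = np.column_stack([np.ones(len(texts)), X[:, varying]])
    y = np.asarray(ratings, dtype=float)
    # Ordinary least squares fit and coefficient standard errors.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.pinv(X.T @ X)))
    result = {}
    for word, b, se in zip(keywords, beta[1:], std_err[1:]):
        # Two-sided p-value, normal approximation to the t distribution.
        p_value = math.erfc(abs(b / se) / math.sqrt(2))
        if p_value < p_max and abs(b) > min_effect:
            result[word] = float(b)
    return result


# Simulated reviews: "citrus" raises and "thin" lowers the rating by 0.3.
rng = np.random.default_rng(0)
n = 400
citrus = rng.random(n) < 0.5
thin = rng.random(n) < 0.5
ratings = 3.5 + 0.3 * citrus - 0.3 * thin + rng.normal(0, 0.2, n)
texts = ["pours golden" + (" citrus" if c else "") + (" thin" if t else "")
         for c, t in zip(citrus, thin)]
found = overall_keywords(texts, ratings)
```

On the simulated data the sketch recovers citrus with a positive and thin with a negative coefficient, while words that appear in every review are discarded before fitting.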

Let’s visualize the results we got:

The size of a keyword corresponds to the number of styles it is associated with. It’s not shocking that generic words like great and nice appear in reviews most often; the real information comes from overall keywords that relate to a specific aspect:

Taste: grapefruit, citrus, cherry, chocolate, hops, hoppy, vanilla, tart, pine, funk, bourbon, tropical

Aroma: grapefruit, citrus, cherry, chocolate, hops, hoppy, vanilla, pine, funk, bourbon, tropical

Palate: smooth, refreshing, rich, tropical, thin

Appearance: red, tropical

where green-colored words contribute positively to the Rating, and red-colored words negatively.

In the interactive plot below (Figure 7), you can click on any of the 15 beer styles observed and visualize the contribution of each overall keyword to the style’s Rating. Note that the p-values for regression coefficients are all less than 10^-39.

Figure 7 : Keywords and coefficients per style

Based on this analysis, we derive the following conclusions:

Most overall keywords relate to Aroma and Taste, which aligns with the ABSA result that people talk about these aspects the most. What’s more, all regression coefficients related to these keywords increase the Rating, supporting the view that Aroma and Taste are crucial when giving a high grade.

There were only two keywords negatively influencing the grade that stood out – like and thin – so we conclude that people more often write about beer characteristics they liked.

Seasonal keywords

Overall keywords provided us with a global insight into how the characteristics of a beer style influence its rating. However, looking at Figure 8, which depicts the monthly trend in Ratings for each of our 15 styles, we notice that the grades vary depending on the period. For example, “American Amber/Red Ale” is rated 0.1 higher on average in February than in August. This inspired us to look for style-specific seasonal keywords to find out which beer characteristics are important to consumers at which time of year.

Figure 8 : Monthly trends of Rating per style
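The monthly trend and the best- and worst-rated month of each style can be derived with a simple aggregation. The sketch below assumes a table with hypothetical columns `style`, `month` (1-12) and `rating`; the toy values are made up.

```python
import pandas as pd


def best_and_worst_month(reviews: pd.DataFrame) -> pd.DataFrame:
    """Per style, the months with the lowest and highest mean Rating."""
    # Rows are styles, columns are months, values are mean ratings.
    monthly = reviews.groupby(["style", "month"])["rating"].mean().unstack("month")
    return pd.DataFrame({
        "worst_month": monthly.idxmin(axis=1),
        "best_month": monthly.idxmax(axis=1),
    })


# Made-up toy data for a single hypothetical style.
toy = pd.DataFrame({
    "style": ["Style A"] * 4,
    "month": [2, 2, 8, 8],
    "rating": [4.0, 4.2, 3.9, 4.1],
})
extremes = best_and_worst_month(toy)
```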

We introduce seasonality by looking at the best- and worst-rated month for each beer style. As in the previous analysis, we extract the 100 most frequent words and discard stopwords, but now only from reviews given in these two months. Our candidates for seasonal keywords are the ones that appear 10% more often in one of these months than in the other, since only they can make a difference in ratings. Finally, as before, we convert them to binary indicators, fit a linear regression on all the reviews for a specific style, and decide which become seasonal keywords by looking at their p-values and contributions to the Rating. All kept regression coefficients have p-values less than 10^-28.
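The candidate selection can be sketched as below. We read "10% more" as the share of reviews mentioning the word in one month being at least 1.1 times the share in the other month; this reading, the tokenizer, and the toy reviews are assumptions made for illustration.

```python
import re
from collections import Counter


def doc_frequency(texts):
    """Share of reviews in which each word appears at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    n = max(len(texts), 1)
    return {word: c / n for word, c in counts.items()}


def seasonal_candidates(best_texts, worst_texts, vocabulary, min_ratio=1.1):
    """Words whose usage differs by the given ratio between the two months."""
    best = doc_frequency(best_texts)
    worst = doc_frequency(worst_texts)
    candidates = []
    for word in vocabulary:
        high = max(best.get(word, 0.0), worst.get(word, 0.0))
        low = min(best.get(word, 0.0), worst.get(word, 0.0))
        # Assumption: "10% more" means a relative frequency ratio of at least 1.1.
        if high > 0 and high >= min_ratio * low:
            candidates.append(word)
    return candidates


# Made-up reviews from the best- and worst-rated month of one style.
best_month_reviews = ["vanilla and oak", "vanilla sweetness", "roasty malt"]
worst_month_reviews = ["roasty malt", "malt and vanilla", "thin malt"]
candidates = seasonal_candidates(
    best_month_reviews, worst_month_reviews, ["vanilla", "malt", "roasty"]
)
```

Here vanilla is more common in the best month and malt in the worst month, so both become candidates, while roasty is used equally often in both and is dropped.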

Our method yielded 13 seasonal keywords (presented in the word cloud below). We already see the impact of the seasonal analysis: 7 new style-specific keywords, which were not present in the set of overall keywords, emerged. Out of these 13, only the word caramel has a negative influence, and only on one style, “American Double / Imperial IPA”. This style is best rated on average in February and worst in July, which can be explained by the word caramel appearing more frequently in reviews given in July. Let’s now take a look at the words vanilla and fresh/refreshing, all influencing multiple beer styles positively. Our results show that vanilla-noted beers are more appreciated during changing seasons and the least during summer. On the other hand, people like a refreshing beer when it's warmer outside, as most of the worst-rated months of the influenced styles are during winter.

Table of coefficients for seasonal keywords:

vanilla:

Style	Coefficient	Worst-rated month	Best-rated month
American Double / Imperial Stout	0.116201	9	4
Russian Imperial Stout	0.135693	8	5
American Strong Ale	0.145952	9	11

fresh/refreshing:

Style	Coefficient	Worst-rated month	Best-rated month
American IPA	0.135115	7	10
Witbier	0.168595	12	9
American Pale Ale (APA)	0.186084	1	10

caramel:

Style	Coefficient	Worst-rated month	Best-rated month
American Double / Imperial IPA	-0.118599	7	2

To conclude, let’s ask ourselves why seasonal keywords shed a “different” light on beer aspects when compared to overall keywords. Well, overall keywords represent a general trend of mentioned beer characteristics, ones that are likely to increase or decrease the rating given at any point in time. Reviewers will probably be satisfied or dissatisfied with a particular style characteristic whenever they drink it. However, there are some keywords that are not mentioned very frequently but whose relative usage varies noticeably through time. That's exactly what we observed in the previous analysis.