In the first part of my articles on linear regression, “Linear Regression in Tableau, Part 1. Time Series”, I looked at time trends. Most often, however, regression is used to study the relationship between two quantities, for example, budget and profit. In this case, both values ​​have the same units of measurement (euros or dollars, for example). It is clear that budget are primary measure, and profit is secondary one, that is, it depends on bugget. In this article, I’ll talk about finding group trends using movie data as an example. We all know about movie budgets and the box office that makes films – this is often reported in the media. In addition, there are various rating systems, for example, IMDB or TMDB, in which movie ratings are determined on a ten-point scale and are based on the audience’s vote. The TMDB rating is quite an authoritative indicator, and on its basis the viewer can choose films so as not to waste time on bad films. Now let’s ask ourselves questions:

  1. Will the film’s profit grow linearly as the film’s budget increases?
  2. Is there a difference in the correlation of profit with the budget of films of different genres (action, fantasy, etc.)?
  3. Will the trend be upward linear when comparing the TMDB rankings to the box office?

We asked ourselves some interesting questions or, in terms of statistics, built hypotheses. Now we must either confirm or deny them. I found The Movies Dataset at kaggle.com. This data contains information on 45,000 films by the end of 2017 in different languages, 26 million movie ratings from 270,000 viewers on the MovieLens resource. In addition, the data includes average TMDB movie ratings, box-office gross, movie budgets, and other metrics.

In this article, I will show you how to create a tool for finding relationships between metrics and breaking data into groups. The final visualization that we will be doing looks like this:

Linear regression in Tableau

The screenshot shows a scatter plot that characterizes two movie rating systems: TMDB (ten-point scale) and MovieLens (five-point scale). The coefficiend of correlation between the two rating metrics is high in most groups, that is, viewers vote roughly the same for most films. But there are films that deviate a lot from trends.

1. Creating a Scatter Plot with Groups

The correlation between two values is almost always visualized using a scatter plot. To build it in Tableau, you need to connect it to data files. The data contains several files, we only need the files movies_metadata.csv, ratings.csv (this file stores reviews and ratings from the MovieLens resource) and the links.csv file to link the first two files:

Note: the ratings.csv file contains more than 20 million rows with each review – that’s a lot, so I reduced it to 5000 lines by aggregating the ratings and the number of voters. This was done in Tableau Prep.

The data contains the fields Budget and Revenue. Let’s make the Profit field:

[Revenue][Budget].

Now let’s make the Profit vs Budget scatter plot in Tableau:

Linear regression in Tableau

By adding a trend line to the chart, we will see a strong positive trend, that is, as the budget increases, the profit increases. This is how we confirm our first hypothesis. But everything is not so simple, and we do not see details in the general trend. Let’s see the trends of films in different languages by adding the Original Language field to the granulation. We see different trends:

Linear regression

So we split all the films into groups. Let’s see the trend of a group of films in English:

Linear regression

This trend is positive, and in the general picture of English-language films, profits increase with the growth of the budget, that is, most films are profitable. Now let’s see the trend of films in Russian:

Linear regression

Here the trend is downward, movies become unprofitable when the budget grows. In this example, we made sure that the overall trend does not always correspond to the group trends.

Note: the initial data source contains complete information not for all movies, so I removed from the data movies with empty fields in the budget, genre, etc. As a result, 5230 movies remained.

2. Complex Groups

In the example above, one movie was in only one group (original language). In this section, consider an example where a movie is included in several groups. Let’s say we want to see trends by genre of films. One film can belong to several genres at once. For example, the movie “The Hobbit: An Unexpected Journey”. It’s fantasy, adventure and action at the same time. In the Genres field, it looks like this:

[{‘id’: 12, ‘name’: ‘Adventure’}, {‘id’: 14, ‘name’: ‘Fantasy’}, {‘id’: 28, ‘name’: ‘Action’}]

There can be up to six such combinations of genres in the data. Accordingly, when dividing into genres, you need to include one film several times in genre groups. But you cannot just add the Genres field to granulation – this way splitting into genres will not work correctly. Therefore, we modify the data. In the dataset, you need to put one movie and genre in each row, that is, one movie row with six genre values will turn into 6 rows. This can be done in Tableau Prep using Pivot operation. Let’s split the Genre field into columns according to the combination of symbols }, {.

Let’s save the flow to output.csv and replace the original file in Tableau.
 

Now make a string Aggregation parameter in Tableau with the values: By Genre, By Original Language, By Production Company. This parameter will change the granulation levels in the scatter plot and switch the types of groups.

Let’s create an Aggregation Field:

CASE [Aggregation]
WHEN ‘By Genre’ THEN [Genre New]
WHEN ‘By Original Language’ THEN [Original Language]
WHEN ‘By Production Company’ THEN [Prod]
END
Linear regression
Filter Filter Number of Movies is: 
 
{ FIXED [Aggregation Field]: COUNTD([Movie Id])}
 
This filter discards movies by quantity in each aggregation. On the screenshot, the groups with the number of movies less than 10 are discarded. With this filter, you can leave, for instance, only production comppanies with a large number of films (Universal Pictures, Walt Disney Pictures, etc.).
 

We made an interactive scatter plot with the ability to change aggregation groups and compare trends between groups.

3. Measure Comparison

To be able to visualize trends between any two measures, we will make it possible to switch measures both on the X axis and on the Y axis. To do this, create string parameters Measure X Parameter, Measure Y Parameter with a list of the measures we need and calculate X and Y:

X

CASE [Measure X Parameter]
WHEN ‘Budget’ THEN [Budget]
WHEN ‘Revenue’ THEN [Revenue]
WHEN ‘Profit’ THEN [Profit]
WHEN ‘MovieLens Rating’ THEN [Rating]
WHEN ‘Vote Count’ THEN [Vote Count]
WHEN ‘Runtime’ THEN [Runtime]
END

Y

CASE [Measure Y Parameter]
WHEN ‘Budget’ THEN [Budget]
WHEN ‘Revenue’ THEN [Revenue]
WHEN ‘Profit’ THEN [Profit]
WHEN ‘MovieLens Rating’ THEN [Rating]
WHEN ‘Vote Count’ THEN [Vote Count]
WHEN ‘Runtime’ THEN [Runtime]
END

Now you can choose a couple of any measures and watch linear trends:

Linear regression
 

4. Calculation Variances

In the article “Linear Regression in Tableau, Part 1. Time Series” I showed how to find the equation for the regression line and calculate the deviations of points from trend lines. Here we will use the same approach – first we will find the equation of the trend.

In the changed dataset, each genre of one movie corresponds to one row, that is, the movie can be, for example, six rows, and in order to correctly transfer the results of the Profit, Revenue, etc. metrics. in Tableau, you need to calculate the average of these 6 rows. Let’s do it with FIXED:

X fixed

{ FIXED [Movie Id]: AVG([X])}

Y fixed

{ FIXED [Movie Id]: AVG([Y])}

 
Delta
 
SUM([X fixed]^2)*COUNT([X fixed]) – SUM([X fixed])*SUM([X fixed])
 
Delta A
 
SUM([X fixed]*[Y fixed])* COUNT([X fixed]) – SUM([X fixed])*SUM([Y fixed])
 
Delta B
 
SUM([X fixed]^2)*SUM([Y fixed]) – SUM([X fixed]*[Y fixed])*SUM([X fixed])
 
Coefficients:
 
A
 
[Delta A] / [Delta]
 
B
 
[Delta A] / [Delta]
 
As a result, the regression line equation will look like this:
 
Y Regression
 
{FIXED [Aggregation Field]: [A]}*[X] + {FIXED [Aggregation Field]: [B]}
 

In this equation, the Aggregation Field aggregates the group data based on the selected Aggregation parameter.

Define the deviation from the trend line by calculating Variance:

ABS([Y Regression][Y])
 
This calculation can be used in tooltips if you just need deviation values.
 

5. Visualization of Deviations from the Trend Line

To visualize the points, we need to draw lines from the data points to the trend line. To do this, let’s make a UNION of the output.csv table
 
And using the Dual Axis option, add the calculation to the Rows shelf:
 
AVG(IF [Table Name] = ‘Output.csv’ THEN [Y] ELSE [Y Regression] END)
Linear regression
Now you can see the deviations of the points from the direct trend, as well as compare the trends of the groups:
Linear regression

Note: there are many dots on the diagram, they overlap each other

You can reduce the dot size:

Linear regression
Or simply leave only trend lines. Since we have found the equation of the Y Regression line, it is easy to do this by selecting the line in Marks:
Linear regression
This is how the lines of all groups look by genre. You can clearly see that the Revenue / Budget ratio for the Fantasy category is higher than for all other categories:
Linear regression
In addition, on this chart, trend lines are drawn on existing points, and are limited to the maximum values of X and Y (Budget and Revenue in this case), which allows you to quickly see the boundaries of budgets and revenue (this does not work with native lines that run through the entire scatter plot).
 

6. Correlation Score

In the article “Linear Regression in Tableau, Part 1. Time Series” I told how to calculate the correlation parameters, so in this article we will not dwell on calculating all the parameters and take only the Pearson correlation coefficient, and also build the Chaddock scale.

The Pearson correlation coefficient is calculated using the Pearson Coef measure:

{ FIXED [Aggregation Field]: CORR([X fixed],[Y fixed])}

Now let’s add a calculation to the Marks shelf so that the correlation coefficient is shown on the diagram for each point, but for all points of the same group, it is naturally the same:

Linear regression

The R-Squared value on the native trendline is the coefficient of determination, and taking the square root from it, we get the Pearson coefficient. So you can check your calculations.

I talked about Pearson’s correlation coefficient in the first part of articles on linear regression. This coefficient takes values ​​from -1 to 1 and characterizes the relationship between the variables. The closer the coefficient value is to +1 or -1, the stronger the relationship. The sign characterizes the type of correlation (positive or negative).

The screenshot above shows a chart showing the scatter of movie runtime versus movie budget. Comedies have a Pearson coefficient of 0.0471, which indicates that there is no relationship between the budget and the runtime comedies. The action category has a coefficient of 0.296, which characterizes a noticeable relationship between the budget and the duration of the film, that is, the larger the budget, the longer the film lasts.

Since it is not easy to operate with the numbers of coefficients, we introduce a qualitative scale that reflects the relationship between the metrics. This is the Chaddock scale. I wrote about the creation of such a scale in the article “Linear Regression in Tableau, Part 1. Time Series”. Here we will do the same, position the scale horizontally so that it takes up more space:

Chaddock scale

We will make indicators as arrows on a separate sheet – they will show the value of the Pearson coefficient on the scale. The X axis is pinned from -1 to 1 range:

7. Final Visualization

In order to create a tool for insights searching, we will add filters and parameters to the visualization, as well as make actions on the dashboard to be able to filter clicking on a legend and highlight groups. For clicks on the legend to filter the data, let’s make the legend a separate sheet and insert it into the visualization. We will add filters, parameters and a legend to one container and provide the ability to collapse this container by clicking the ‘Hide Filters’ button. The ability to hide tiled containers is new, and appeared in Tableau 2021.2. When you hide the filter panel, the scatter plot expands to the full width of the dashboard – this is very convenient for a detailed study of the data.

Let’s play a little with colors and fonts, so we will get the final viz:

Linear regression
Click to open the interactive viz

Naturally, filters play an important role in selecting the groups of interest, but you can watch general pictures, for example, see the correlation between the TMDB rating and the box office receipts of films in different languages as shown above.

Conclusion

In this article, I showed how you can visualize the relationship between two measures based on a linear regression model, as well as break data into groups using parameters and compare the behavior of trends in different groups.

In the article, a tool was created that allows you to look at the data in different ways and find relationships and insights in this data, confirm or discard hypotheses.

It should be noted that the search for relationships in data is the daily work of data analysts, and such tools are in demand in business tasks. From my experience I can say that dashboards with a lot of filters and parameters that allow the end user to look at the data in different ways work much better than non-interactive printed dashboards.