Interview with Developer Thomas Otte
As part of the release of our sales & revenue estimates implementation, we sat down with Thomas Otte, the Steam Data Suite developer behind the estimates.
So Thomas, thanks for taking the time. Let’s jump right to the solution. Could you explain to our readers how it works?
Of course. I set up a neural network, a kind of AI that essentially does really fast math. The network holds numerous weights and multiplies them with whatever values you feed it. It then calculates the derivative of the error: for example, it takes the average difference between the estimates for a batch of 16 games and their actual sales & revenue numbers, and adjusts the weights to minimize that deviation.
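To give a feel for what Thomas describes, here is a minimal sketch in Python of weights being multiplied with inputs and nudged by the derivative of the error over a batch of 16 games. Everything here (the data, the single linear layer, the loss) is an illustrative assumption, not SDS's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: a batch of 16 "games" with 3 made-up
# numeric features each; nothing here is real SDS data.
X = rng.normal(size=(16, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w  # pretend "actual sales" derived from the features

# One weight per feature; the network multiplies inputs by these.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    pred = X @ w                # multiply the input values by the weights
    err = pred - y              # deviation from the actual numbers
    grad = X.T @ err / len(X)   # gradient of the squared-error loss over the batch of 16
    w -= lr * grad              # step downhill to minimize the deviation

print(np.round(w, 3))
```

After enough steps the learned weights recover the values that generated the data, which is exactly the "minimize the estimate's deviation" loop described above.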
How do you know that you have arrived at the most accurate estimate?
Technically you never know what the optimal estimate is. What we do is split the data: the model is trained on one set, and a second set is held back that the model never sees during training. If you run the model over that held-out set, you can determine how accurate the estimates learned from the training set really are. But you can never truly say that you have arrived at the most accurate estimate. That would be statistically incorrect.
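The train/held-out split Thomas mentions can be sketched like this. The dataset, the single feature, and the least-squares stand-in model are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset: 100 "games", one numeric feature, noisy target.
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Train on the first 80 games; hold back 20 the model never sees.
train_X, train_y = X[:80], y[:80]
held_X, held_y = X[80:], y[80:]

# Stand-in "model": ordinary least squares fitted on the training split only.
w, *_ = np.linalg.lstsq(train_X, train_y, rcond=None)

# Accuracy is judged on the held-out games, not on the training data.
held_error = np.mean(np.abs(held_X @ w - held_y))
print(held_error)
```

The error on the held-out games is the honest measure: the model could memorize its training set, but it cannot memorize data it never saw.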
What variables are considered to arrive at the estimates?
It's quite a few variables actually. They include tags, types, features, and genre, as well as followers, hub members, price, the number of positive reviews, the total number of reviews, the number of days since release, and the average ranking since release. If any of these values are missing, the model assumes the worst value: for the number of followers that would be 0, and for ranking that would be the bottom of the ranking.
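A worst-case fallback for missing values could look like the following sketch. The feature names and the sentinel for "bottom of the ranking" are assumptions for illustration; the real SDS schema is not public:

```python
# Hypothetical feature names and "worst" fallback values; the actual
# SDS feature set and sentinels are assumptions here.
WORST_VALUES = {
    "followers": 0,
    "positive_reviews": 0,
    "total_reviews": 0,
    "avg_ranking": 100_000,  # assumed sentinel for "bottom of the ranking"
}

def fill_missing(game: dict) -> dict:
    """Replace any absent feature with its worst-case value."""
    return {feat: game.get(feat, worst) for feat, worst in WORST_VALUES.items()}

print(fill_missing({"followers": 1200, "positive_reviews": 45}))
```

Features the game does have pass through unchanged; only the gaps are filled with the pessimistic default.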
How does the model work with tags, since they aren’t numeric values?
For each tag we have a primary key, which is essentially a key in a relational database that is unique for each record. However, primary keys suggest there is an order among the tags, since one number is greater than another. If you used them in calculations as is, one tag would be weighted higher than another, which is not what you want. To solve this, each tag is assigned a binary value, either 0 or 1, depending on whether it is present in a game's tag composition. The neural network then figures out whether certain tags should be weighted more heavily than others when performing the estimate calculations.
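This 0-or-1 encoding is commonly called one-hot (or, for multiple simultaneous tags, multi-hot) encoding. A small sketch, with an illustrative tag vocabulary rather than Steam's real tag list:

```python
# Illustrative tag vocabulary; the real list comes from Steam's tags.
TAG_VOCAB = ["Action", "Indie", "Roguelike", "Strategy"]

def encode_tags(game_tags: set[str]) -> list[int]:
    """Multi-hot encoding: 1 if the tag is present on the game, else 0.
    Unlike raw primary keys, no tag is numerically 'greater' than another."""
    return [1 if tag in game_tags else 0 for tag in TAG_VOCAB]

print(encode_tags({"Indie", "Roguelike"}))  # → [0, 1, 1, 0]
```

Each position in the vector belongs to one tag, so the network can learn a separate weight per tag instead of misreading key values as magnitudes.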
Did you encounter any issues with the development of the model?
I did run into some issues in the data preprocessing and outlier removal process. The hard part was that SDS has a selection of comparatively successful games, so an AI trained on it would assume games on Steam are more successful than they generally tend to be. We had to account for that so that less successful games are also represented in the calculation. Part of this was data selection: we deliberately excluded some games from the neural network training to simulate the ratios you would expect across Steam in our own dataset. Also, some games don't fit the generalization the AI makes, so we remove those from the dataset too, since they are not representative of the other games on Steam.
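The outlier removal and rebalancing Thomas describes could be sketched along these lines. The revenue figures, both thresholds, and the target hit ratio are pure assumptions for illustration, not SDS's actual cutoffs:

```python
import random

random.seed(7)

# Made-up (game_id, revenue) records; the distribution is arbitrary.
games = [(i, random.lognormvariate(10, 2)) for i in range(1000)]

# Outlier removal: drop games that don't fit the general pattern
# (cutoff is an assumption for this sketch).
OUTLIER_CUTOFF = 5_000_000
kept = [g for g in games if g[1] <= OUTLIER_CUTOFF]

# Data selection: downsample successful games so the training set
# mirrors the success ratio expected across Steam as a whole.
HIT_THRESHOLD = 100_000
hits = [g for g in kept if g[1] > HIT_THRESHOLD]
rest = [g for g in kept if g[1] <= HIT_THRESHOLD]
target_hits = len(rest) // 10  # assume ~1 hit per 10 ordinary games
training_set = rest + random.sample(hits, min(target_hits, len(hits)))
print(len(training_set))
```

The point is the shape of the procedure: first remove the games the model cannot generalize over, then resample so the training data reflects Steam at large rather than SDS's more successful user base.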
Another thing worth mentioning is that the model works with percentages rather than absolute values when calculating the estimates. If a game earned, say, 20 million in revenue but the model starts with an estimate of 1 million, it will be far off in absolute terms. For a similar game that only earned around 20,000 in revenue while the model estimated 400,000, the absolute difference is much smaller, even though the estimate is proportionally much worse. To account for that we calculate the percentage by which a game is off rather than using the absolute value. That way we can greatly reduce the error of the estimate and arrive at an accuracy we are very happy with.
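Using the numbers from Thomas's example, a percentage-based error could be computed like this (a simple relative-error sketch, not necessarily the exact formula SDS uses):

```python
def percentage_error(estimate: float, actual: float) -> float:
    """Error as a fraction of the actual value, not an absolute amount."""
    return abs(estimate - actual) / actual

# 19 million off in absolute terms, but only 95% below the real number:
big_game = percentage_error(1_000_000, 20_000_000)

# Only 380,000 off in absolute terms, yet 1900% above the real number:
small_game = percentage_error(400_000, 20_000)

print(big_game, small_game)
```

Measured absolutely, the big game looks like the worse estimate; measured as a percentage, the small game clearly is. That is why training on relative error keeps small and large games on an equal footing.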