In the last 10 years alone, Webby entries have more than tripled, from approximately 4K in 2005 to nearly 13K in 2015, and our Academy of jurors has grown from 500 to 2,000. In that time, we’ve made countless refinements and learned a lot about judging the Internet. As we approached our 20th Anniversary, there were some key lessons we wanted to integrate into the process so that it could continue to grow. To help us fine-tune the initial round of Webby judging, we approached The Center for Election Science, a world-class group of mathematicians, engineers, and logic nerds devoted to the science of voting and elections.
As we looked to build our judging system for the future, there were three key areas of focus we believed were critical to growing both the capacity and quality of the process. The first two were tactical: to increase the mathematical sophistication used to address variation in judges’ scoring, and to increase the impact of each juror’s time and expertise on the results. The third was more strategic: to let more jurors participate, and have their participation improve the results, without adding undue complexity.
Better Math to solve the “Tough Judges” problem
If you’ve ever judged anything on a point system, one thing you realize right away is that you need a bit of calibration to keep your scoring consistent – that your 7s are always better than your 5s and your 8s always inferior to your 9s. But what happens when hundreds of judges are giving scores? Some judges are extremely difficult to please (2s for everyone!), while others rarely give a low score at all. Since we developed our early judging phase nearly 20 years ago, we have always used math to normalize scores and address this issue. But this year, working with The Center for Election Science, we introduced more sophisticated logic to make the system more scalable for the future.
“A simple way to normalize scores is to go to each judge’s individual scores and subtract that judge’s average score. But this can shave off too much of a score. To correct for this, we only subtracted the part of the average that was likely real and not the part likely due to luck. And by luck, we mean things like score variability between works and our lowered confidence in judges that hadn’t scored as many works.”
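The shrinkage the Center describes can be sketched in a few lines of Python. This is an illustrative model, not the Webbys’ actual formula: the variance parameters and the simple normal-shrinkage factor are our assumptions. The key idea is that a judge who has scored only a few works gets less of their observed bias subtracted, because more of it is plausibly luck.

```python
# Sketch of shrinkage-based score normalization (illustrative; not the
# Webbys' exact method). A judge's estimated bias is shrunk toward zero
# in proportion to how few works they scored, so that we only subtract
# "the part of the average that was likely real."

def normalized_scores(judge_scores, overall_mean, score_variance, bias_variance):
    """judge_scores: raw scores from one judge.
    overall_mean: mean score across all judges and works.
    score_variance: score variability between works ("luck").
    bias_variance: assumed prior variance of judge toughness."""
    n = len(judge_scores)
    judge_mean = sum(judge_scores) / n
    raw_bias = judge_mean - overall_mean
    # With few scores, luck dominates the observed bias, so we trust it less.
    shrink = bias_variance / (bias_variance + score_variance / n)
    estimated_bias = shrink * raw_bias
    return [s - estimated_bias for s in judge_scores]
```

A tough judge who has scored twenty works gets nearly their full bias corrected; the same raw bias from a judge who scored one work is only partially corrected.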
Increase the impact of the Academy’s work
One of the benefits of Academy membership is the opportunity to judge The Webby Awards and review some of the best work on the Internet from the past year.
Each year, among the 13K+ entries we receive, there are entries that are unquestionably Webby Winners (like Twitter in 2009 or The Wilderness Downtown in 2011) and others that, well, fall near the very bottom. One thing we’ve learned in the past decade is that the hardest decisions, and the ones where it is most beneficial to have many judges and experts participate, are the close calls: which of these few exceptional pieces makes the cut?
Working with The Center for Election Science, we were able to use data from the judging process itself to identify which entries were closest to the cutoff for the short-list, and then deploy more judges to evaluate those specific works.
“We prioritized which work would be scored next by figuring the probability that one more judge scoring a work would push it either above or below the cutoff. Whichever work had the highest probability got scored next. The probability itself factored in both (1) global qualities—like judge toughness and variability between works, and (2) individual qualities—like how close a work was to the cutoff and how many judges had already scored a particular work.”
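The prioritization rule in the quote can be sketched as follows. Purely for illustration, we assume the next judge’s score is normally distributed around the work’s current mean; the function names and the single `score_sd` parameter are ours, not the Center’s, and the real model also accounts for per-judge toughness.

```python
from math import erf, sqrt

def crossing_probability(mean, n_judges, cutoff, score_sd):
    """Probability that one more score flips this work's mean across the
    cutoff, assuming next score ~ Normal(mean, score_sd) (an illustrative
    assumption, not the Center's exact model)."""
    # The new mean (n*mean + x) / (n + 1) crosses the cutoff exactly when
    # the next score x passes this threshold:
    t = (n_judges + 1) * cutoff - n_judges * mean
    below = 0.5 * (1 + erf((t - mean) / (score_sd * sqrt(2))))  # P(x <= t)
    # A work above the cutoff flips if x falls below t; a work below
    # the cutoff flips if x rises above t.
    return below if mean >= cutoff else 1 - below

def next_work_to_score(works, cutoff, score_sd=1.0):
    """works: list of (work_id, mean_score, n_judges).
    Returns the id with the highest flip probability."""
    return max(works, key=lambda w: crossing_probability(w[1], w[2], cutoff, score_sd))[0]
```

Note how both factors from the quote appear: distance from the cutoff shifts the threshold `t`, and the number of judges already on a work makes its mean harder to move.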
Future-proof the system for growth
Our third goal was more strategic and forward-looking: we wanted to ensure that our judging system was future-proofed for more entries, and that the effect of adding more jurors would be to increase the quality of our results rather than make our system more complex or, worse yet, diminish the results. With the help of Electology, we were able to fine-tune our process so that each additional vote and juror is purely additive to the system, improving the precision of our score normalization, the distribution of work among members and, most importantly, the quality and fairness of the entire Webby judging process.
The Resulting Initial Judging System
Through the Center’s consulting and analysis, we refined our statistical model using a method known as Score Voting, explained at length on the Electology website.
Taking into account three factors that affected outcomes, we were able to normalize scores to be more effective at determining the “true quality” of a given site. These three factors were (1) a site’s intrinsic quality, (2) the toughness of the judge who gave that rating, and (3) random error representing how much more or less that judge liked the site compared to an average judge.
“Using these parameters, we could estimate the toughness of each judge, use that estimated toughness to correct the ratings and estimate the true quality for each site, and figure out the expected error in each of our quality estimates.”
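One way to read this is as an alternating estimation of judge toughness and site quality under the three-factor model. The sketch below is our own illustration of that idea, not the Center’s actual estimator; here `bias` is leniency, so a tough judge ends up with a negative bias, and correcting a rating means subtracting that judge’s bias.

```python
# Illustrative alternating estimation for the three-factor model:
# rating = site quality + judge bias + random error. This is a sketch
# of the idea, not the Center's actual estimator.

def estimate_bias_and_quality(ratings, iterations=50):
    """ratings: dict mapping (judge, site) -> raw score.
    Returns (bias per judge, estimated quality per site)."""
    judges = {j for j, _ in ratings}
    sites = {s for _, s in ratings}
    bias = {j: 0.0 for j in judges}
    quality = {s: 0.0 for s in sites}
    for _ in range(iterations):
        # Judge bias: how far a judge's scores sit from current quality estimates.
        for j in judges:
            scored = [(s, r) for (jj, s), r in ratings.items() if jj == j]
            bias[j] = sum(r - quality[s] for s, r in scored) / len(scored)
        # Site quality: average rating after removing each judge's bias.
        for s in sites:
            scored = [(jj, r) for (jj, ss), r in ratings.items() if ss == s]
            quality[s] = sum(r - bias[jj] for jj, r in scored) / len(scored)
    return bias, quality
```

Only differences in bias and quality are identified by this sketch (the overall mean is absorbed into the bias terms), which is enough to rank sites and compare judge toughness.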
In conclusion, through the contributions of electology.org, The Webby Awards has an even more sophisticated first-round judging system. The end result is more reliable results, a scalable model for growth, and better use of jurors’ time.
We are proud to have partnered with The Center for Election Science and their team, headed by Executive Director Aaron Hamlin and Director Jameson Quinn, and we are now slightly less terrified about how we will judge all those thousands of entries to come in the future.
For more information on The Center for Election Science, please visit their website at http://www.electology.org/
Interested in judging The Webby Awards? Apply to become an IADAS Member at http://associat