Martin Gauch, Frederik Kratzert, Oren Gilon, Hoshin Gupta, Juliane Mai, Grey Nearing, Bryan Tolson, Sepp Hochreiter, and Daniel Klotz
Everyone wants their hydrologic models to be as good as possible. But how do we know whether a model is accurate? In the spirit of rigorous and reproducible science, the answer should be: we calculate metrics. Yet, as humans, we sometimes follow a scheme of “I know a good model when I see it” and manually inspect hydrographs to assess their quality. This is certainly a valid method for sanity checks, but it is unclear whether these subjective visual ratings agree with metric-based rankings. Moreover, it is unclear how consistent such inspections are, as different observers might come to different conclusions about the same hydrographs.
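The abstract does not specify which metrics were used, but a common choice for quantifying hydrograph quality is the Nash–Sutcliffe efficiency (NSE), which compares a simulation's squared errors against a mean-flow baseline. A minimal sketch (the function name and arrays below are illustrative, not from the study):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the simulation's
    squared errors to the squared deviations of observations from their mean.
    NSE = 1 is a perfect fit; NSE = 0 matches the mean-flow baseline."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Illustrative streamflow values (not real data)
obs = np.array([1.0, 2.0, 3.0, 4.0])
print(nse(obs, obs))                            # perfect simulation -> 1.0
print(nse(obs, np.full_like(obs, obs.mean())))  # mean-flow baseline -> 0.0
```

Metric-based rankings such as these are the objective counterpart to the visual hydrograph inspections discussed above.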
In this presentation, we report a large-scale study in which we collected responses from 622 experts, who compared and judged more than 14,000 pairs of hydrographs from 13 different models. Our results show that, overall, human ratings broadly agree with quantitative metrics in a clear preference for a machine learning model. At the level of individuals, however, there is considerable inconsistency between ratings from different participants. Still, in cases where experts agree, we can predict their most likely rating purely from quantitative metrics. This indicates that we can encode intersubjective human preferences with a small set of objective, quantitative metrics. To us, these results make a compelling case for the community to place more trust in existing metrics—for example, by conducting more rigorous benchmarking efforts.