Progress in scene understanding requires reasoning
about the rich and diverse visual environments that make
up our daily experience. To this end, we propose the Scene
Understanding (SUN) database, a nearly exhaustive collection
of scenes categorized at the same level of specificity as
human discourse. The database contains 908 distinct scene
categories and over 100,000 images. To better understand
this large-scale taxonomy of scene categories, we first perform
three human experiments: we quantify human scene
recognition accuracy, we measure how “typical” each image
is of its assigned scene category, and we estimate high-level
“spatial envelope” properties for each scene category. Next,
we perform computational experiments: scene recognition
with recent global image features, indoor vs. outdoor classification,
and “scene detection” in which we relax the assumption
that one image depicts only one scene category. Finally,
we relate the human experiments to machine performance,
exploring the relationship between human and machine recognition
errors as well as the relationship between image “typicality”
and machine recognition accuracy.