A recent paper argues that alignment evaluation cannot rely solely on model-level assessments.