Internally, the model resizes down the input for stage 1. Then, it refines at high-resolution for stage 2.
Set `downsample_ratio` so that the downsampled resolution is between 256 and 512. For example, for `1920x1080` input with `downsample_ratio=0.25`, the resized resolution `480x270` is between 256 and 512.