Enhanced speaker handling

This is a feature request from @yhofmann. The first idea is to detect whether more than one person is speaking and, if so, prompt the user that there are too many speakers at once. Alternatively, we could separate the speakers using a neural network, either always or only after detecting multiple people. Separation might degrade the audio samples and reduce accuracy, but it is still worth testing.
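
To make the two options easier to compare, here is a rough sketch of how the control flow could look. Everything in it is an assumption for illustration: `handle_speakers`, `SpeakerCheck`, and the `count_speakers` and `separate` hooks are hypothetical names, not existing code in this project, and the hooks stand in for whatever detection or separation model ends up being tested.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Audio = Sequence[float]  # placeholder type: mono samples (assumption)


@dataclass
class SpeakerCheck:
    accepted: bool           # True if the audio can be passed on
    audio: Optional[Audio]   # audio to use downstream, or None if rejected
    message: str             # text that could be shown to the user


def handle_speakers(
    audio: Audio,
    count_speakers: Callable[[Audio], int],
    separate: Optional[Callable[[Audio], Audio]] = None,
) -> SpeakerCheck:
    """Prompt on, or separate, audio that contains more than one speaker.

    count_speakers and separate are hypothetical hooks supplied by the
    caller; this function only decides what to do with their results.
    """
    n = count_speakers(audio)
    if n <= 1:
        # Single speaker (or silence): pass the audio through unchanged.
        return SpeakerCheck(True, audio, "ok")
    if separate is None:
        # Option 1: detect only, then prompt the user.
        return SpeakerCheck(False, None, f"Too many speakers at once ({n} detected).")
    # Option 2: separate only after multiple speakers were detected.
    return SpeakerCheck(True, separate(audio), f"Separated audio from {n} detected speakers.")
```

Passing no `separate` hook gives the "detect and prompt" behaviour; passing one gives "separate after detection". The "always separate" variant would skip the count entirely, which is the case where the possible loss in sample quality matters most.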