```
In the end, you need a folder, referred to as `$CV_DL_DIR`, which contains subfolders for every language that you want to classify.
It may look something like this:
```
common-voice/
├── en/
│   ├── clips/
│   │   └── *.mp3
│   ├── train.tsv
│   ├── dev.tsv
│   └── test.tsv
├── de/
│   └── ...
├── fa/
│   └── ...
└── kab/
    └── ...
```
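A small sanity check for this layout can be sketched as follows. The language list and file names mirror the example tree above; the function name is ours, not part of the repository:

```python
# Sketch: check that $CV_DL_DIR contains the expected per-language layout.
# Adjust the language list to match the languages you actually downloaded.
import os

EXPECTED = ["clips", "train.tsv", "dev.tsv", "test.tsv"]

def missing_paths(cv_dl_dir, languages=("en", "de", "fa", "kab")):
    """Return every expected path that does not exist under cv_dl_dir."""
    missing = []
    for lang in languages:
        for rel in EXPECTED:
            path = os.path.join(cv_dl_dir, lang, rel)
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

An empty result means the folder is ready to be used as `$CV_DL_DIR`.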
##### AudioSet (Optional)
In this section we first download the AudioSet metadata file from [this website](https://research.google.com/audioset/download.html) and then search it for specific labels using the provided `data/audioset/download_youtube_noise.py` script. This Python script defines both the labels that are relevant for our use case and those that are not allowed (human voice); with these restrictions we extracted around 18,000 samples from AudioSet. The downloads themselves are started via the provided shell script.
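The label filtering the Python script performs can be sketched roughly as below. The four CSV columns follow the AudioSet segments files; the `WANTED` ids here are placeholders (check `download_youtube_noise.py` for the actual lists), while `/m/09x0r` is AudioSet's label id for "Speech":

```python
# Rough sketch of the metadata filtering idea behind download_youtube_noise.py.
WANTED = {"/m/012345"}     # placeholder ids for the desired noise classes
FORBIDDEN = {"/m/09x0r"}   # human speech must not be present in a segment

def select_segments(rows):
    """Keep (ytid, start, end) for rows with a wanted label and no forbidden one."""
    keep = []
    for ytid, start, end, labels in rows:
        label_set = set(labels.split(","))
        if label_set & WANTED and not label_set & FORBIDDEN:
            keep.append((ytid, float(start), float(end)))
    return keep
```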
It will attempt to download all files to the folder that you specified in `$YOUTUBE_DATA_DIR`. Please note that some of the resulting files may still be flawed, as some videos are not available in every country.
__Note:__ Even with parallelization this process will likely take hours, as the script downloads each whole media file and then cuts out the relevant part.
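Since some downloads fail, it helps to flag obviously broken files before further processing. A minimal sketch, under the assumption that zero-byte files indicate failed downloads (pass your `$YOUTUBE_DATA_DIR` as the root):

```python
# Sketch: yield zero-byte files under the download folder; these are almost
# certainly failed downloads and should be removed or re-fetched.
import os

def empty_files(root):
    """Yield paths of zero-byte files under root."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                yield path
```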
### Audio Extraction
We use several processing steps to form our data set from the Common Voice downloads. We recommend using the config file to define and document the processing steps. Please take a look at the CLI arguments in the script for more information on the options.
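A hypothetical config fragment is shown below; apart from `cv_input_dir`, which the repository's notes mention, the key names are assumptions for illustration, and the real options are listed in the script's CLI arguments:

```yaml
cv_input_dir: /data/common-voice   # your $CV_DL_DIR from above
output_dir: /data/processed        # assumed key name
languages: [en, de, fa, kab]       # assumed key name
```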
...
...