Commit b667b402 authored by Paul Bethge

edit noise process

parent 392e450a
......@@ -83,30 +83,39 @@ This repository extracts language examples from Mozilla's open speech corpus
[AudioSet](https://research.google.com/audioset/dataset/index.html).
### Download
##### Common Voice
Start by downloading the language sets you are interested in. We recommend using languages with at least 1,000 speakers and 100 hours of validated audio samples. Check [this site](https://commonvoice.mozilla.org/de/languages) for details.
__Note:__ In order to use our provided download script, you have to generate and copy machine-specific download links into it, as the download requires your consent.
```shell
./data/common-voice/download_common_voice.sh
```
In the end, you need a folder, referred to as `$CV_DL_DIR`, which contains subfolders for every language that you want to classify.
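For orientation, the expected layout of `$CV_DL_DIR` can be sketched as follows (the language codes and the `clips` subfolder are illustrative; Common Voice archives unpack their audio into a `clips` folder):

```python
# Illustration only: build the layout expected by the extraction step,
# i.e. one subfolder per language inside $CV_DL_DIR.
import os
import tempfile

cv_dl_dir = tempfile.mkdtemp()  # stands in for $CV_DL_DIR
for lang in ["de", "en", "fr"]:  # one subfolder per language to classify
    os.makedirs(os.path.join(cv_dl_dir, lang, "clips"), exist_ok=True)

print(sorted(os.listdir(cv_dl_dir)))  # → ['de', 'en', 'fr']
```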
##### AudioSet
In this section we will first download the AudioSet metadata file from [this website](https://research.google.com/audioset/download.html). Next, we will search it for specific labels using the provided `data/audioset/download_youtube_noise.py` script. This Python script defines the labels that are relevant and those that are not allowed (human voice). With these restrictions for our use case, we extracted around 18,000 samples from AudioSet. You can call the shell script like this:
```shell
./data/audioset/download_yt_noise.sh
```
It will create a folder `yt-downloads` containing the raw audio files (some downloads may still be flawed).
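Failed downloads typically leave empty files behind; a minimal sketch of how such files could be spotted before further processing (the file names are hypothetical, and the proper validity check is done by the processing script in the next section):

```python
# Hypothetical sketch: flag obviously failed downloads (zero-byte files)
# in a download folder such as yt-downloads.
import os
import tempfile

dl_dir = tempfile.mkdtemp()  # stands in for yt-downloads
with open(os.path.join(dl_dir, "ok.wav"), "wb") as f:
    f.write(b"\x00" * 1024)  # a non-empty (valid-looking) download
with open(os.path.join(dl_dir, "failed.wav"), "wb"):
    pass  # a zero-byte (failed) download

flawed = [name for name in os.listdir(dl_dir)
          if os.path.getsize(os.path.join(dl_dir, name)) == 0]
print(flawed)  # → ['failed.wav']
```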
### Audio Extraction
We use several processing steps to form our data set from the Common Voice downloads. We recommend using the config file to define and document the processing steps. Please take a look at the CLI arguments in the script for more information on the options.
```shell
python data/common-voice/cv_to_wav.py --help
```
__Note:__ Modify the config file accordingly, e.g. replace `cv_input_dir` with `$CV_DL_DIR` and `cv_output_dir` with `$DATA_DIR` (the final dataset directory). Don't forget to name the languages in the table at the bottom.
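A minimal config sketch is shown below. Only `cv_input_dir` and `cv_output_dir` are named above; the remaining key names and the format of the language table are assumptions, so check `data/common-voice/config_moz.yaml` for the authoritative keys:

```yaml
cv_input_dir: /path/to/cv_dl_dir     # $CV_DL_DIR with one subfolder per language
cv_output_dir: /path/to/data_dir     # $DATA_DIR, the final dataset directory
languages:                           # the table at the bottom: one entry per language
  - de
  - en
```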
```shell
python data/common-voice/cv_to_wav.py --config data/common-voice/config_moz.yaml
```
##### Add the Noise
Afterwards, we check that the noise data is valid, cut it into chunks, and split it into the previously created `$DATA_DIR`.
Please use the provided shell script and pass it the `$DATA_DIR` path:
```shell
./data/audioset/process_and_split_noise.sh $DATA_DIR
```
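Internally, the splitting step partitions the processed noise clips into train/dev/test portions. A minimal sketch of such a partition (the file names and the 80/10/10 proportions are illustrative assumptions, not the script's fixed behavior):

```python
# Sketch: partition a list of noise clips into train/dev/test portions.
files = [f"noise_{i:03d}.wav" for i in range(100)]  # hypothetical noise clips

n = len(files)
train = files[: int(0.8 * n)]               # first 80%
dev = files[int(0.8 * n): int(0.9 * n)]     # next 10%
test = files[int(0.9 * n):]                 # final 10%

print(len(train), len(dev), len(test))  # → 80 10 10
```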
### Preprocessing
In this version, we use [kapre](https://kapre.readthedocs.io/en/latest/) to extract the features (such as FFT or Mel-filterbanks) within the TensorFlow graph. This is especially useful in terms of portability, as we only need to pass the normalized audio to the model.
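Since feature extraction happens inside the graph, the model input is just normalized raw audio. One simple way to picture this, assuming peak normalization of 16-bit PCM samples (the sample values are made up for illustration):

```python
# Sketch: peak-normalize 16-bit PCM samples to [-1, 1] before feeding
# them to the model; FFT / Mel-filterbank extraction then happens inside
# the TensorFlow graph via kapre layers.
samples = [0, 16384, -32768, 8192]  # hypothetical 16-bit PCM values

peak = max(abs(s) for s in samples) or 1  # avoid division by zero on silence
normalized = [s / peak for s in samples]

print(normalized)  # → [0.0, 0.5, -1.0, 0.25]
```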
......
......@@ -24,8 +24,11 @@ from queue import Queue
def downloadEnclosures(i, q):
while True:
try:
yt_url, start_s, length_s, output_dir = q.get()
download(yt_url, start_s, length_s, output_dir)
except Exception as e:
print("Download error: ", e)
q.task_done()
......@@ -44,7 +47,7 @@ if __name__ == '__main__':
default=os.path.join(os.getcwd(), "yt-noise"),
help="path to the output directory")
parser.add_argument('--num_threads', type=int,
default=8,
help="amount of worker threads")
parser.add_argument('--downloads', type=int,
default=100000,
......@@ -137,16 +140,15 @@ if __name__ == '__main__':
# skip the first three lines
print(f.readline())
print(f.readline())
f.readline()
# as long as we didn't reach the maximum number of files
while file_count < num_files:
# get a line
line = f.readline()[:-1].split(',')
# if the line is not empty
if line[0] != "":
# get the URL and start and end points
......@@ -165,28 +167,20 @@ if __name__ == '__main__':
label = label[:-1]
labels.append(label)
# apply label restrictions
if any(label in labels for label in restrictions):
# print("Found restricted label in {}".format(labels))
continue
if not any(label in labels for label in positives):
# print("Label not in positives!")
continue
# print("Something in {} is important and nothing restricted". format(labels))
# get the data and save it
function_args = (URL, start, audio_length, args.output_dir)
queue.put(function_args)
file_count += 1
......@@ -198,4 +192,5 @@ if __name__ == '__main__':
print("End of file!")
# wait until the workers are ready
print("Waiting for threads to finish... this may take hours")
queue.join()
AUDIOSET_URL="http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/unbalanced_train_segments.csv"
AUDIOSET_CSV=$PWD/"unbalanced.csv"
YTDL_DIR=$PWD/"yt-downloads"
echo "Downloading AudioSet meta info... this may take a while..."
wget -O $AUDIOSET_CSV $AUDIOSET_URL
echo "Downloading selected AudioSet samples... this may take hours..."
python3 data/audioset/download_youtube_noise.py --input_file $AUDIOSET_CSV --output_dir $YTDL_DIR
YTDL_DIR=$PWD/"yt-downloads"
TEMP_DIR=$PWD/"yt-noise-processed"
CV_DIR=$1
echo "Processing downloaded samples..."
python3 data/other/cut_audio.py --input_dir $YTDL_DIR --output_dir $TEMP_DIR
echo "Splitting and moving the processed samples to the dataset directory..."
python3 data/other/split_to_common_voice.py --input_dir $TEMP_DIR --output_dir $CV_DIR
rm -r $TEMP_DIR
\ No newline at end of file