Commit e442ea16 authored by Paul Bethge's avatar Paul Bethge

update mozilla extraction

parent 0a84af1f
# Language Identification System
A playground for testing and experimenting with spoken language identification.
This repository extracts language examples from Mozilla's open speech corpus
[Common Voice](https://commonvoice.mozilla.org/).
Feel free to contribute your voice and expertise to the corpus. Furthermore, Google's audio scene dataset
[AudioSet](https://research.google.com/audioset/dataset/index.html)
can be used to extract noise data to enhance the robustness of the model.
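For illustration, noise extracted from such a corpus can be mixed into a speech clip at a chosen signal-to-noise ratio. The sketch below is a minimal NumPy example, not code from this repository; the SNR convention (power ratio in dB) and function name are assumptions:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise signal into a speech signal at the given SNR in dB.

    Both inputs are 1-D float arrays; the noise is tiled or truncated
    to match the speech length before scaling. (Illustrative helper,
    not part of this repository.)
    """
    # tile or truncate the noise to the speech length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    # scale the noise so the power ratio matches the requested SNR
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```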
This code base has been developed by [ZKM | Hertz-Lab](https://zkm.de/en/about-the-zkm/organization/hertz-lab) as part of the project [»The Intelligent Museum«](#the-intelligent-museum).
Please raise issues, ask questions, throw in ideas or submit code, as this repository is intended to be an open platform for collaboratively improving spoken language identification (LID).
##### Target Platform
* Ubuntu 18.04 Desktop, tested with Python 3.7 and TensorFlow 2.3
* macOS 10.15 (installation may differ)
##### Features
* Input from WAV files or a microphone
* Parameterizable acoustic activity detection
* Neural network can be enabled to identify the spoken language
* Neural network can be disabled for dataset creation
* Pretrained model that detects French, Spanish, English, German and Russian
* Scripts for dataset creation and augmentation
* Possible features: MFCC, Mel-scaled filter banks, spectrogram
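As a sketch of one such front end, Mel-scaled filter bank energies can be computed from a raw waveform with NumPy alone. The frame size, hop, and filter count below are illustrative defaults, not the values used by this repository:

```python
import numpy as np

def mel_filterbank_energies(signal, sample_rate=16000, frame_len=400,
                            hop=160, n_fft=512, n_mels=40):
    """Log Mel filter bank energies, one row per 25 ms frame (10 ms hop).

    Illustrative implementation; parameter values are assumptions.
    """
    # split the signal into overlapping frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # power spectrum of each Hamming-windowed frame
    spec = np.abs(np.fft.rfft(frames * np.hamming(frame_len), n_fft)) ** 2
    # triangular filters spaced evenly on the Mel scale
    mel_points = np.linspace(0, 2595 * np.log10(1 + sample_rate / 2 / 700),
                             n_mels + 2)
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # log energies, floored to avoid log(0)
    return np.log(np.maximum(spec @ fbank.T, 1e-10))
```

Taking the discrete cosine transform of these log energies would yield MFCCs; keeping the raw log energies gives the Mel filter bank feature listed above.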
##### Structure
* lid_client/: source code for the LID application
* lid_network/: training process and model definitions
* data/: a collection of scripts to download and process datasets
## Installation
### Ubuntu
Download and install [Anaconda](https://www.anaconda.com/products/individual). Afterwards, create a virtual environment:
```
$ conda create -n "name" python=3.7
$ conda activate "name"
$ pip3 install -r requirements.txt
```
##### Additional Software
- ffmpeg, sox and PortAudio
```
$ sudo apt install ffmpeg sox libasound-dev portaudio19-dev
```
- youtube-dl (version > 2020)
```
$ sudo curl -L https://yt-dl.org/downloads/latest/youtube-dl -o /usr/local/bin/youtube-dl
$ sudo chmod a+rx /usr/local/bin/youtube-dl
```
## Usage
##### show help
```
$ python lid.py --help
```
##### microphone input
```
$ python lid.py
```
##### WAV-file input
```
$ python lid.py --file_name test/paul_deutsch.wav
```
## Further Reading
* [Speech Features](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html)
* [CRNN-LID](https://github.com/HPI-DeepLearning/crnn-lid)
## License
GPLv3, see `LICENSE` for more information.
## Contribute
Contributions are very welcome!
Please send an email to bethge@zkm.de
## The Intelligent Museum
An artistic-curatorial field of experimentation for deep learning and visitor participation
The [ZKM | Center for Art and Media](https://zkm.de/en) and the Deutsches Museum …
As part of the project, digital curating will be critically examined using various approaches of digital art. Experimenting with new digital aesthetics and forms of expression enables new museum experiences and thus new ways of museum communication and visitor participation. The museum is transformed to a place of experience and critical exchange.
![Logo](media/Logo_ZKM_DMN_KSB.png)
absl-py==0.10.0
appdirs==1.4.4
astunparse==1.6.3
audioread==2.1.8
auditok==0.1.5
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.1
chardet==3.0.4
cycler==0.10.0
decorator==4.4.2
gast==0.3.3
google-auth==1.20.1
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.31.0
h5py==2.10.0
idna==2.10
imageio==2.9.0
importlib-metadata==1.7.0
joblib==0.16.0
Keras-Preprocessing==1.1.2
kiwisolver==1.2.0
librosa==0.8.0
llvmlite==0.34.0
Markdown==3.2.2
matplotlib==3.3.0
nlpaug==0.0.20
numba==0.51.2
numpy==1.19.1
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.4
pandas==1.1.0
Pillow==7.2.0
pooch==1.2.0
protobuf==3.13.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyAudio==0.2.11
pycparser==2.20
pydub==0.24.1
pyparsing==2.4.7
python-dateutil==2.8.1
python-speech-features==0.6
pytz==2020.1
PyYAML==5.3.1
requests==2.24.0
requests-oauthlib==1.3.0
resampy==0.2.2
rsa==4.6
scikit-learn==0.23.2
scipy==1.4.1
six==1.15.0
sklearn==0.0
SoundFile==0.10.3.post1
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.0
tensorflow-estimator==2.3.0
termcolor==1.1.0
threadpoolctl==2.1.0
urllib3==1.25.10
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0
auditok
pydub
tensorflow
nlpaug
matplotlib
scipy
imageio
python_speech_features
import os
import numpy as np
import scipy.io.wavfile as wav
from wav.generators import AudioGenerator
def augment_noise(args):
......
import argparse
import pydub
import os
def sentence_is_too_short(sentence_len, language):
    if language == "chinese":
        return sentence_len < 5
    else:
        return sentence_len < 10
def traverse_csv(language, input_dir, output_dir, lid_client_path, number, allowed_downvotes):
    lang = language["lang"]
    lang_abb = language["dir"]
    input_sub_dir = os.path.join(input_dir, lang_abb)
    input_sub_dir_clips = os.path.join(input_sub_dir, "clips")
    input_clips_file = os.path.join(input_sub_dir, "validated.tsv")
    output_dir_raw = os.path.join(output_dir, "raw", lang)
    output_dir_pro = os.path.join(output_dir, "pro", lang)
    output_clips_file = os.path.join(output_dir_raw, "clips.csv")

    # create subdirectories in the output directory
    if not os.path.exists(output_dir_raw):
        os.makedirs(output_dir_raw)
    if not os.path.exists(output_dir_pro):
        os.makedirs(output_dir_pro)

    # open the csv file to write to
    out = open(output_clips_file, "w+")

    # open Mozilla's dataset file
    with open(input_clips_file) as f:
        # extract up to `number` files
        i = 0
        try:
            # skip the header line
            line = f.readline()
            while True:
                # get a line
                line = f.readline().split('\t')
                # if the line is not empty
                if line[0] != "":
                    # skip samples that are too short or have been down-voted,
                    # unless the language is labelled "unknown"
                    sentence = line[2]
                    too_short = sentence_is_too_short(len(sentence), language["lang"])
                    messy = int(line[4]) > allowed_downvotes
                    if (too_short or messy) and language["lang"] != "unknown":
                        continue
                    # get mp3 filename
                    mp3_filename = line[1]
                    wav_filename = mp3_filename[:-4] + ".wav"
                    mp3_path = os.path.join(input_sub_dir_clips, mp3_filename)
                    wav_path = os.path.join(output_dir_raw, wav_filename)
                    print("\n Processing: ", mp3_path)
                    # convert mp3 to wav
                    pydub.AudioSegment.from_mp3(mp3_path)\
                        .set_frame_rate(16000)\
                        .set_channels(1)\
                        .export(wav_path, format="wav")
                    # process file through lid_client
                    command = "python " + lid_client_path + \
                              " --file_name " + wav_path + \
                              " --output_dir " + output_dir_pro + \
                              " --num_iters 600" + " --padding Data" + \
                              " --min_length 300" + \
                              " --max_silence 200" + " --nn_disable"
                    os.system(command)
                    # save filename to csv
                    line_to_write = wav_filename + '\n'
                    out.write(line_to_write)
                    if number != -1 and i >= number:
                        break
                    i = i + 1
                    if i % 1000 == 0:
                        print("Processed %d files" % i)
                else:
                    print("Nothing left!")
                    break
        except EOFError:
            print("End of file!")
    out.close()
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", type=str,
                        help="directory containing all languages")
    parser.add_argument("--output_dir", type=str,
                        help="directory to receive raw and processed clips of all languages")
    parser.add_argument("--lid_client_path", type=str,
                        default="lid.py",
                        help="path to the python script to process the data")
    parser.add_argument("--number", type=int, default=40000,
                        help="amount of files to be processed (!= amount of processed files)")
    parser.add_argument("--allowed_downvotes", type=int, default=0,
                        help="amount of downvotes allowed")
    args = parser.parse_args()

    languages = [
        # {"lang": "english", "dir": "en"},
        # {"lang": "german", "dir": "de"},
        # {"lang": "french", "dir": "fr"},
        # {"lang": "spanish", "dir": "es"},
        # {"lang": "mandarin", "dir": "zh-CN"},
        # {"lang": "russian", "dir": "ru"},
        # {"lang": "unknown", "dir": "ja"},
        # {"lang": "unknown", "dir": "ar"},
        # {"lang": "unknown", "dir": "ta"},
        # {"lang": "unknown", "dir": "pt"},
        # {"lang": "unknown", "dir": "tr"},
        # {"lang": "unknown", "dir": "it"},
        # {"lang": "unknown", "dir": "uk"},
        # {"lang": "unknown", "dir": "el"},
        # {"lang": "unknown", "dir": "id"},
        # {"lang": "unknown", "dir": "fy-NL"},
    ]

    # count the number of unknown languages
    unknown = 0
    for language in languages:
        if language["lang"] == "unknown":
            unknown += 1
    if unknown > 0:
        number_unknown = args.number // unknown

    for language in languages:
        clips_per_language = args.number
        if language["lang"] == "unknown":
            clips_per_language = number_unknown
        traverse_csv(language, args.input_dir, args.output_dir, args.lid_client_path,
                     clips_per_language, args.allowed_downvotes)