Commit f0da1dbb authored by Paul Bethge

documentation and renaming

parent 9f117fd7
High-Level Audio Processing in Python
============
This project contains code for live audio recording and processing in Python.
Besides basic usage of PyAudio for simple recording, you will find asynchronous
processing methods and the use of artificial intelligence for high-level applications.
This code has been developed by [ZKM | Hertz-Lab](https://zkm.de/en/about-the-zkm/organization/hertz-lab) as part of the project [»The Intelligent Museum«](#the-intelligent-museum).
We may use class abstraction at some point in time.
You will find a lot of borrowed code and trained neural nets from other repositories, rather than submodules, as the purpose of this project is to maintain working code rather than bleeding-edge technology.
We want to thank the following repositories for providing open-source solutions:
- [silero-vad](https://github.com/snakers4/silero-vad)
- [speech-commands](https://github.com/douglas125/SpeechCmdRecognition)

Usage
--------------
We will be using PyAudio to access the microphone, which in turn depends on PortAudio. Please take a look at the [installation guide](http://files.portaudio.com/docs/v19-doxydocs/tutorial_start.html); for some platforms, however, prebuilt binaries are available:
##### Linux (APT)
```shell
sudo apt install libasound-dev portaudio19-dev
```
##### macOS (Homebrew)
```shell
brew install portaudio
```
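
To quickly verify that PyAudio can reach PortAudio and sees your microphone, a short check along these lines (not part of this repository) may help:
```python
import pyaudio

# list all devices that offer input channels
audio = pyaudio.PyAudio()
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    if info["maxInputChannels"] > 0:
        print(i, info["name"], "-", int(info["maxInputChannels"]), "input channel(s)")
audio.terminate()
```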
Please find a README in each of the example subfolders clarifying additional requirements and usage, as each example differs in that regard.
The Intelligent Museum
----------------------
...
import numpy as np
import pyaudio
import scipy.io.wavfile as wav
import threading
from queue import Queue

import torch
torch.set_num_threads(1)
import torchaudio
torchaudio.set_audio_backend("soundfile")
from utils import *  # provides validate() from the silero-vad helpers

import tensorflow as tf
from tensorflow.keras.models import load_model
# Configure
vad_threshold = 0.8     # minimum confidence of the VAD to trigger the KWS
kws_threshold = 0.95    # minimum confidence of the KWS to detect a word
kws_required_size = 4   # number of chunks (plus one pre-trigger chunk) fed to the KWS
frame_duration_ms = 250 # chunk size for the VAD in milliseconds (250 ms is the minimum)
#=== Silero VAD ===#
model = torch.jit.load('vad_model.jit')

def normalize(sound):
    # scale the waveform to a peak amplitude of 1
    abs_max = np.abs(sound).max()
    if abs_max > 0:
        sound *= 1 / abs_max
    sound = sound.squeeze()  # depends on the use case
    return sound
#=== Keyword Spotting ===#
classes = ['unknown', 'nine', 'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go',
           'zero', 'one', 'two', 'three', 'four', 'five', 'six',
           'seven', 'eight', 'backward', 'bed', 'bird', 'cat', 'dog',
           'follow', 'forward', 'happy', 'house', 'learn', 'marvin', 'sheila', 'tree',
           'visual', 'wow']
# KWS process thread
def processAudio(config, q):
    kws_model = load_model('kws_model')
    while True:
        try:
            # flatten the list of chunks into one long sample list
            data = q.get()
            data = [item for sublist in data for item in sublist]
            # build a batch of one float tensor scaled to [-1, 1]
            data_tensor = tf.convert_to_tensor(data)
            data_tensor = tf.expand_dims(data_tensor, axis=0)
            data_tensor = tf.cast(data_tensor, tf.float32)
            data_tensor = tf.math.divide(data_tensor, 32768)
            # run the KWS model and report confident detections
            out = kws_model(data_tensor)[0].numpy()
            index = tf.math.argmax(out).numpy()
            if out[index] >= kws_threshold:
                print(classes[index])
            # if you want to inspect the data, please uncomment
            # wav.write('results/' + classes[index] + '.wav', SAMPLE_RATE, np.asarray(data))
        except Exception as e:
            print("Oops:", e)
        q.task_done()
# Threading
queue = Queue()
some_config = ''  # placeholder, currently unused by the worker
worker = threading.Thread(target=processAudio, args=(some_config, queue), daemon=True)
worker.start()
# PyAudio
FORMAT = pyaudio.paInt16
CHANNELS = 1
SAMPLE_RATE = 16000
CHUNK = int(SAMPLE_RATE / 10)

audio = pyaudio.PyAudio()
stream = audio.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=SAMPLE_RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
chunk_size = int(SAMPLE_RATE * frame_duration_ms / 1000.0)  # samples per VAD chunk
data = []
audio_int16 = []
nu_voice_chunks = 0
got_voice = False
print("Started Recording")
while True:
    # keep the last chunk so nothing gets lost
    last_chunk = audio_int16
    # sample a chunk, convert it to float and normalize
    audio_chunk = stream.read(chunk_size)
    audio_int16 = np.frombuffer(audio_chunk, np.int16)
    audio_float32 = audio_int16.astype('float32')
    audio_float32_norm = normalize(audio_float32)
    # get the speech confidence from the VAD
    vad_outs = validate(model, torch.from_numpy(audio_float32_norm))[:, 1]
    # trigger if voice is detected
    if vad_outs >= vad_threshold and not got_voice:
        print("I found something")
        got_voice = True
        data = []
        data.append(last_chunk)
    # collect data and analyze
    if got_voice:
        if nu_voice_chunks < kws_required_size:
            data.append(audio_int16)
            nu_voice_chunks += 1
        else:
            # enough chunks collected: hand them to the KWS worker
            got_voice = False
            nu_voice_chunks = 0
            queue.put(data)
            queue.join()
# Voice Activity Detection for Keyword Spotting

In this example we use two artificial neural networks. After gathering a chunk of audio, we check whether it contains human speech; this process is called Voice Activity Detection (VAD). If a voice is detected, we start accumulating audio in order to feed the Keyword Spotting System (KWS). The KWS may be exchanged for any other AI that feeds on speech.
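
Reduced to its essence, the accumulation logic works as in the following toy sketch (`fake_vad` and `fake_kws` are illustrative stand-ins, not part of the actual script):
```python
vad_threshold = 0.8
kws_required_size = 4

def fake_vad(chunk):
    # stand-in for the Silero VAD confidence
    return 0.9 if max(chunk) > 0.5 else 0.1

def fake_kws(chunks):
    # stand-in for the keyword spotting model
    print("KWS fed", len(chunks), "chunks")

collecting = False
n_chunks = 0
last_chunk, buffer = [0.0], []

for chunk in ([0.0] * 10, [0.9] * 10, [0.7] * 10, [0.6] * 10, [0.8] * 10, [0.2] * 10):
    if fake_vad(chunk) >= vad_threshold and not collecting:
        collecting = True
        buffer = [last_chunk]  # keep the chunk recorded just before the trigger
    if collecting:
        if n_chunks < kws_required_size:
            buffer.append(chunk)
            n_chunks += 1
        else:
            collecting = False
            n_chunks = 0
            fake_kws(buffer)  # prints: KWS fed 5 chunks
    last_chunk = chunk
```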
Parts of this code are heavily borrowed from:
- [silero-vad](https://github.com/snakers4/silero-vad)
- [speech-commands](https://github.com/douglas125/SpeechCmdRecognition)
### Installing Python Requirements
__Note__: We suggest using virtual environments for dealing with Python code.
```shell
pip install -r requirements.txt
```
### Configuration
The following parameters may be useful to look at:
```python
vad_threshold = 0.8     # minimum confidence of the VAD to trigger the KWS
kws_threshold = 0.95    # minimum confidence of the KWS to detect a word
kws_required_size = 4   # number of chunks (plus one pre-trigger chunk) fed to the KWS
frame_duration_ms = 250 # chunk size for the VAD in milliseconds (250 ms is the minimum)
```
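With the defaults above, one VAD chunk holds 16000 × 0.25 = 4000 samples, so the KWS receives 5 × 4000 = 20000 samples per detection, i.e. 1.25 seconds of audio.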
The sample rate should be kept at 16 kHz for both neural networks. If your recordings require a higher sample rate, consider downsampling before feeding the models, as sketched below.
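
A minimal sketch of such a downsampling step, using torchaudio (already a dependency of this example); the 48 kHz input rate is a hypothetical example:
```python
import numpy as np
import torch
import torchaudio

RECORD_RATE = 48000  # hypothetical hardware sample rate
MODEL_RATE = 16000   # sample rate expected by VAD and KWS

resample = torchaudio.transforms.Resample(orig_freq=RECORD_RATE, new_freq=MODEL_RATE)

def to_model_rate(audio_int16: np.ndarray) -> torch.Tensor:
    # convert a raw int16 buffer to a float waveform at 16 kHz
    waveform = torch.from_numpy(audio_int16.astype('float32'))
    return resample(waveform)
```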
...
import tensorflow as tf
from tensorflow.keras.models import load_model

# Configure
vad_threshold = 0.8     # minimum confidence of the VAD to trigger the KWS
kws_threshold = 0.95    # minimum confidence of the KWS to detect a word
kws_required_size = 4   # number of chunks (plus one pre-trigger chunk) fed to the KWS
frame_duration_ms = 250 # chunk size for the VAD in milliseconds (250 ms is the minimum)

#=== Silero VAD ===#
model = torch.jit.load('vad_model.jit')
...
def processAudio(config, q):
    ...
            out = kws_model(data_tensor)[0].numpy()
            index = tf.math.argmax(out).numpy()
            if out[index] >= kws_threshold:
                print(classes[index])
            # if you want to inspect the data, please uncomment
...
while True:
    ...
    vad_outs = validate(model, torch.from_numpy(audio_float32_norm))[:, 1]
    # trigger if voice is detected
    if vad_outs >= vad_threshold and not got_voice:
        print("I found something")
        got_voice = True
        data = []
...