From ec5378ff580d6b67cebfb96bdc4cce69270e5671 Mon Sep 17 00:00:00 2001
From: paul bethge
Date: Wed, 27 Apr 2022 11:30:47 +0200
Subject: [PATCH 1/4] add requirements.txt

---
 requirements.txt | 4 ++++
 1 file changed, 4 insertions(+)
 create mode 100644 requirements.txt

diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..3930cea
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,4 @@
+python-osc==1.7.7
+numpy==1.20.3
+pickle-mixin
+matplotlib==3.4.2
\ No newline at end of file
--
GitLab

From a4ac4cf2b56f261aeca7af4983a1287dfdc9bb1e Mon Sep 17 00:00:00 2001
From: paul bethge
Date: Wed, 27 Apr 2022 11:31:15 +0200
Subject: [PATCH 2/4] edit wording

---
 README.md | 45 ++++++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index f8ad108..984667d 100644
--- a/README.md
+++ b/README.md
@@ -49,7 +49,7 @@ Description
 
 ### Reinforcement Learning
 
-Basically a RL algorithm assumes that there is an environment which can have different states, that are observed by the agent acting in that environment. The agent takes actions which change the states of the environment. In the beginning the agent chooses his action randomly while being in a certain state. These state-action pairs get rewarded or punished (punishment is negative reward) in our case by a human "expert" by pressing a button. The agent changes from picking randomly (exploration) to use an action based on the RL Qtable (exploitation). Over time or with the amount of rewards received this can change from exploration to exploitation.
+Basically, an RL algorithm assumes that there is an environment that can be in different states, which are observed by an agent acting in that environment. The agent takes actions that change the state of the environment. In the beginning, the agent chooses its actions randomly while in a given state. These state-action pairs get rewarded or punished (punishment is a negative reward), in our case by a human "expert" pressing a button. For its next action the agent randomly picks to either _explore_ new strategies or _exploit_ strategies that have already been written to the Qtable. Over time, or with the amount of rewards received, the chance to _exploit_ learned strategies increases.
 
 The code from the excellent hands-on [Tutorial on Reinforcement Learning / Q-Learning with Python](https://pythonprogramming.net/q-learning-reinforcement-learning-python-tutorial/) by [sentdex](https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ) serves as the basis for this project's development. The tutorial first uses [OpenAI's gym](https://gym.openai.com/), specifically the "MountainCar-v0" environment, which predefines the possible actions and generates the states. In [part 4 of the tutorial](https://pythonprogramming.net/own-environment-q-learning-reinforcement-learning-python-tutorial/), a custom environment inspired by the Predator-Prey example is designed.
 
@@ -80,7 +80,7 @@ For the action space a range of continuous pixel values for the radius could be
 
 With the described dimensionality reduction, we reach 15 states x 5 actions = 75 state-action pairs that need to be rewarded. This seems to be a learnable goal. It could be further reduced or extended by different parameters.
 
-In the Empathy Swarm installation it could be included if a human is standing or sitting regarding the observations. For the actions the RGB colors of the robots could be changed or whether the robot is rotating or not. 
+In the Empathy Swarm installation, the observations could additionally include whether a human is standing or sitting. To display the actions, the robots' RGB colors could be changed, or the robots could rotate or stand still.
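To make the bookkeeping concrete, here is a minimal sketch of such a Q-table; the 15 states and 5 actions match the numbers above, and the dictionary layout mirrors the snippets quoted later in this README. `N_STATES`, `N_ACTIONS`, and the flat integer state IDs are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch only: a Q-table covering
# 15 states x 5 actions = 75 state-action values.
import numpy as np

N_STATES = 15    # discretized observation segments (assumed)
N_ACTIONS = 5    # discrete radius choices (assumed)

# One list of 5 action values per state, randomly initialized.
q_table = {s: [np.random.uniform(-1, 0) for _ in range(N_ACTIONS)]
           for s in range(N_STATES)}

# Exploitation: pick the highest-valued action for a given state.
best_action = int(np.argmax(q_table[0]))
```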
 
 ### Structure
 
@@ -99,6 +99,10 @@ A third Processing program can be used to send rewards through clicking buttons
 Requirements
 ------------
 
+Operating systems/platforms:
+
+_Windows 10 has been used for development & testing, but other platforms will most likely work too._
+
 ### ES_RL_Flocking (Processing)
 
 Software:
@@ -113,27 +117,22 @@ _To add libraries in Processing: click on Sketch>Import Library...>Add Library..
 
 ### ES_reward_punish (Python)
 
-(Software) dependencies:
+Software dependencies:
 
 * python-osc (tested with version 1.7.7)
 * numpy (tested with version 1.20.3)
 * pickle (tested with version 4.0)
 * matplotlib (tested with version 3.4.2)
 
-Operating systems/platforms:
-
-_Windows 10 has been used for development & testing. But most likely other platforms should work too._
-
 ##### Installation & Build:
 
-Download and Install [Anaconda](https://www.anaconda.com/products/individual). Afterwards create a virtual environment:
+We have used the [Anaconda](https://www.anaconda.com/) Python distribution for the development of this project. Any other Python distribution should work as well. In any case, we highly recommend the use of virtual environments.
+
+Download and install [Anaconda](https://www.anaconda.com/products/individual). Afterwards, create a virtual environment and install the dependencies:
 
 ```shell
-$ conda create -n "ESrl" python=3.8
-$ conda activate "ESrl"
-$ pip install python-osc==1.7.7
-$ pip install numpy==1.20.3
-$ pip install pickle-mixin
-$ pip install matplotlib==3.4.2
+conda create -n "ESrl" python=3.8
+conda activate "ESrl"
+pip install -r requirements.txt
 ```
 
 Usage
 -----
 
 ### ES_reward_punish (Python)
 
-##### To start teaching from scratch:
-
+##### Teaching from scratch:
+Choose one of the following options:
 
 * start with a qtable that is initialized with zeros: This makes it easy to read any rewards you added afterwards. It makes sense to choose a neutral action with id 0, since the argmax function returns the first index when all values in a list are equal.
 e.g. `self.q_table[(segmentid)] = [0 for i in range(self.sizeActions)]`
 
 _or_
 
-start with a qtable that is initialized with random values: This means that there is already a versatile / random behaviour of the agent.
+* start with a qtable that is initialized with random values: This means that the agent already shows a versatile / random behaviour.
 e.g. `self.q_table[(segmentid)] = [np.random.uniform(-1, 0) for i in range(self.sizeActions)]`
 
-In both cases use: `self.start_q_table = None`
+In both cases use: `self.start_q_table = None`. (A combined sketch of both options follows below.)
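For reference, here are the two options combined into one runnable sketch. Only the two initialization expressions are taken from the snippets above; the function name, its arguments, and the final example call are illustrative assumptions:

```python
# Sketch of both Q-table initialization options from the list above.
import numpy as np

def init_q_table(segment_ids, size_actions, with_zeros=True):
    """Build a fresh Q-table; the two list expressions mirror the README snippets."""
    q_table = {}
    for segmentid in segment_ids:
        if with_zeros:
            # Option 1: neutral start; argmax picks action 0 on ties.
            q_table[segmentid] = [0 for _ in range(size_actions)]
        else:
            # Option 2: random values for versatile initial behaviour.
            q_table[segmentid] = [np.random.uniform(-1, 0)
                                  for _ in range(size_actions)]
    return q_table

q_table = init_q_table(range(15), 5, with_zeros=False)
```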
 
-* start with a `epsilon` value of 1, this means that the whole range of state-action pairs will be explored by choosing randomly which action to take, while a value of 0 means that the qtable will be exploited, meaning that the values will be read and also will always be the same for the same state-action pair. During the teaching process epsilon will decay through the variable `EPS_DECAY`. This has to be 1 for epsilon not to decay and less than 1 in order for epsilon to decay every episode. The smaller the value, the faster the decay as its used as a factor.
+Start with an `epsilon` value of 1; this means that the whole range of state-action pairs will be explored by randomly choosing which action to take. On the other hand, a value of 0 means that the Qtable will be fully exploited, which will always lead to the same action for the same state. During the teaching process epsilon decays through the variable `EPS_DECAY`. Set this to 1 for epsilon not to decay, and to less than 1 for epsilon to decay every episode. The smaller the value, the faster the decay, as it is used as a factor.
 
 ##### To explore if your teaching worked or continue teaching:
 
-* start with a qtable from a filename *.pickle. In you work directory all qtables are saved in a folder called q_tables. Choose the last one you trained.
+1. Load a qtable: In your working directory all qtables are saved in a folder called `q_tables/`. Choose the last one you trained.
 e.g. `self.start_q_table = os.path.join(self.q_tables_dir, "qtable-1649708241.4816985.pickle")`
 
-* for seeing if your teaching worked: choose `epsilon = 0`, this will only take the values from the qtable and is repeatable. The action will always be the same based on the same state.
+2. Check if your teaching worked: Set `epsilon = 0`; this ensures full _exploitation_ of the strategies written to the Qtable, hence no random behavior. The action will always be the same for a given state.
 
-* try a tradeoff between exploration with 1 and exploitation with 0 by choosing e.g. an `epsilon` of 0.5. This basically helps to avoid to be stuck in local maxima and avoids the stagnation of the system. Sometimes stagnation might be wanted, in our case not as it would mean that first visitors of the installation could teach the swarm while later the visitors can only experience and not change the selection of actions anymore.
+3. Find your setting: Try a tradeoff between _exploration_ and _exploitation_ by choosing e.g. an `epsilon` of 0.5. This helps to avoid getting stuck in local maxima and prevents stagnation of the system. Sometimes stagnation might be desired, but not in our case: it would mean that the first visitors of the installation could teach the swarm, while later visitors could only experience the behaviour and no longer change the selection of actions.
 
-* you can also try to change the `LEARNING_RATE` and the `DISCOUNT` which together influence how fast the algorithm learns and adapts to changes. A higher LEARNING_RATE means adapting fast but would also mean to possibly quickly learn something wrong and to forget faster.
+4. Play around with the hyperparameters: You can also try changing the `LEARNING_RATE` and the `DISCOUNT`, which together influence how fast the algorithm learns and adapts to changes. A higher `LEARNING_RATE` means adapting fast, but also the risk of quickly learning something wrong and forgetting faster. (A sketch of how these parameters enter the update rule follows below.)
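To make the interplay of `epsilon`, `EPS_DECAY`, `LEARNING_RATE`, and `DISCOUNT` concrete, here is a minimal sketch of the textbook Q-learning pieces they control, following the update rule used in the tutorial linked above. The concrete values and helper names are illustrative assumptions, not the project's configuration:

```python
# Sketch of epsilon-greedy action selection and the standard Q-learning update.
import numpy as np

LEARNING_RATE = 0.1   # how strongly new information overwrites old values
DISCOUNT = 0.95       # weight of expected future reward vs. immediate reward
epsilon = 1.0         # 1.0 = pure exploration, 0.0 = pure exploitation
EPS_DECAY = 0.9998    # factor applied every episode; 1 disables the decay

def choose_action(q_table, state, n_actions):
    # Explore with probability epsilon, otherwise exploit the Q-table.
    if np.random.random() < epsilon:
        return np.random.randint(0, n_actions)
    return int(np.argmax(q_table[state]))

def update_q(q_table, state, action, reward, new_state):
    # Blend the old value with (reward + discounted best future value),
    # weighted by the learning rate.
    max_future_q = max(q_table[new_state])
    current_q = q_table[state][action]
    q_table[state][action] = ((1 - LEARNING_RATE) * current_q
                              + LEARNING_RATE * (reward + DISCOUNT * max_future_q))

epsilon *= EPS_DECAY  # applied once per episode during teaching
```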
 
 ### ES_RL_Flocking (Processing)
--
GitLab

From 5a0491810d39c79ddb788ed68b1d9638a632e014 Mon Sep 17 00:00:00 2001
From: paul bethge
Date: Thu, 28 Apr 2022 12:41:08 +0200
Subject: [PATCH 3/4] remove cv2 dependency

---
 software/python/ES_RL_reward_punish/ESrl.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/software/python/ES_RL_reward_punish/ESrl.py b/software/python/ES_RL_reward_punish/ESrl.py
index 511fb5e..e5add26 100644
--- a/software/python/ES_RL_reward_punish/ESrl.py
+++ b/software/python/ES_RL_reward_punish/ESrl.py
@@ -3,7 +3,6 @@ ESRL
 """
 import numpy as np
 from PIL import Image
-import cv2
 import matplotlib.pyplot as plt
 import pickle
 from matplotlib import style
--
GitLab

From fc9209d6505264a17006f020159f29bcb1b8b8b7 Mon Sep 17 00:00:00 2001
From: paul bethge
Date: Thu, 28 Apr 2022 12:41:29 +0200
Subject: [PATCH 4/4] add simple run explanation

---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 984667d..3522684 100644
--- a/README.md
+++ b/README.md
@@ -139,6 +139,10 @@ Usage
 -----
 
 ### ES_reward_punish (Python)
+```shell
+cd software/python/ES_RL_reward_punish/
+python main.py
+```
 
 ##### Teaching from scratch:
 Choose one of the following options:
--
GitLab
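As background for the reward path these patches touch (the Processing reward program sends button presses to the Python agent via OSC), here is a minimal receiver sketch using the pinned `python-osc==1.7.7` dependency. The address `/reward`, the port `8000`, and the handler are illustrative assumptions, not the project's actual OSC configuration:

```python
# Illustrative only: receiving reward/punish messages over OSC with python-osc.
from pythonosc import dispatcher, osc_server

def on_reward(address, value):
    # value could be e.g. +1 (reward) or -1 (punish) sent from Processing.
    print(f"{address}: {value}")

disp = dispatcher.Dispatcher()
disp.map("/reward", on_reward)          # route messages at the assumed address

server = osc_server.BlockingOSCUDPServer(("127.0.0.1", 8000), disp)
server.serve_forever()                  # blocks; stop with Ctrl+C
```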