diff --git a/README.md b/README.md
index f8ad1083172792bd42c8b57b5e278d6c0973e847..352268492c063f7030bb76904c90ecf6957af3e9 100644
--- a/README.md
+++ b/README.md
@@ -49,7 +49,7 @@ Description
 
 ### Reinforcement Learning
 
-Basically a RL algorithm assumes that there is an environment which can have different states, that are observed by the agent acting in that environment. The agent takes actions which change the states of the environment. In the beginning the agent chooses his action randomly while being in a certain state. These state-action pairs get rewarded or punished (punishment is negative reward) in our case by a human "expert" by pressing a button. The agent changes from picking randomly (exploration) to use an action based on the RL Qtable (exploitation). Over time or with the amount of rewards received this can change from exploration to exploitation.
+Basically, an RL algorithm assumes that there is an environment which can have different states that are observed by the agent acting in that environment. The agent takes actions which change the states of the environment. In the beginning the agent chooses its action randomly while being in a certain state. These state-action pairs are rewarded or punished (punishment is a negative reward), in our case by a human "expert" pressing a button. For its next action the agent randomly picks whether to _explore_ new strategies or to _exploit_ strategies that have already been written to the Qtable. Over time, or as rewards accumulate, the chance to _exploit_ learned strategies increases.
 
 For the development of this project the code from this very good hands-on [Tutorial on Reinforcement Learning / Q-Learning with Python](https://pythonprogramming.net/q-learning-reinforcement-learning-python-tutorial/) by [sentdex](https://www.youtube.com/channel/UCfzlCWGWYyIQ0aLC5w48gBQ) is the basis. The tutorial is first using the [OpenAI's gym](https://gym.openai.com/), specifically with the "MountainCar-v0" environment which is predefining the possible actions and generates the states. In [part 4 of the tutorial](https://pythonprogramming.net/own-environment-q-learning-reinforcement-learning-python-tutorial/) a custom environment is designed which is inspired by the Predator-Prey example.
 
@@ -80,7 +80,7 @@ For the action space a range of continuous pixel values for the radius could be
 
 With the described dimensionality reduction, we reach 15 states x 5 actions = 75 state-action pairs that need to be rewarded. This seems to be a learnable goal. It could be further reduced or extended by different parameters.
 
-In the Empathy Swarm installation it could be included if a human is standing or sitting regarding the observations. For the actions the RGB colors of the robots could be changed or whether the robot is rotating or not.
+In the Empathy Swarm installation, the observations could additionally include whether a human is standing or sitting. To display the actions, the RGB colors of the robots could be changed, or the robots could start or stop rotating.
 
 ### Structure
@@ -99,6 +99,10 @@ A third Processing program can be used to send rewards through clicking buttons
 Requirements
 ------------
 
+Operating systems/platforms:
+
+_Windows 10 has been used for development & testing, but most likely other platforms will work too._
+
 ### ES_RL_Flocking (Processing)
 
 Software:
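The hunks above describe the exploration/exploitation behaviour and the reduction to 15 states x 5 actions = 75 state-action pairs. As a reading aid, here is a minimal, self-contained sketch of what such an epsilon-greedy choice over a 15 x 5 Qtable looks like. It is not the project's `ESrl.py` code; the names `N_STATES`, `N_ACTIONS` and `choose_action` are illustrative assumptions.

```python
import numpy as np

# Illustrative sizes taken from the README: 15 observed states, 5 possible actions.
N_STATES = 15
N_ACTIONS = 5

# A Qtable initialized with zeros: one row per state, one value per action.
q_table = np.zeros((N_STATES, N_ACTIONS))

rng = np.random.default_rng()

def choose_action(state, epsilon):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Exploration: pick a random action, ignoring what has been learned so far.
        return int(rng.integers(N_ACTIONS))
    # Exploitation: pick the action with the highest learned value for this state.
    # In an all-zero row, argmax returns index 0, i.e. the "neutral" action mentioned below.
    return int(np.argmax(q_table[state]))

# epsilon = 1.0 means pure exploration, epsilon = 0.0 means pure exploitation.
print(choose_action(state=3, epsilon=1.0))
print(choose_action(state=3, epsilon=0.0))
```

With this picture in mind, the Qtable initialization options and the `epsilon` settings described in the Usage section below are simply different starting points for the same table.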
@@ -113,58 +117,57 @@ _To add libraries in Processing: click on Sketch>Import Library...>Add Library..._
 
 ### ES_reward_punish (Python)
 
-(Software) dependencies:
+Software dependencies:
 
 * python-osc (tested with version: 1.7.7)
 * numpy (tested with version 1.20.3)
 * pickle (tested with version 4.0)
 * matplotlib (tested with version 3.4.2)
 
-Operating systems/platforms:
-
-_Windows 10 has been used for development & testing. But most likely other platforms should work too._
-
 ##### Installation & Build:
 
-Download and Install [Anaconda](https://www.anaconda.com/products/individual). Afterwards create a virtual environment:
+We have used the [Anaconda](https://www.anaconda.com/) Python distribution for the development of this project. Any other Python distribution should work as well. In any case, we highly recommend using a virtual environment.
+
+Download and install [Anaconda](https://www.anaconda.com/products/individual). Afterwards, create a virtual environment and install the requirements:
 
 ```shell
-$ conda create -n "ESrl" python=3.8
-$ conda activate "ESrl"
-$ pip install python-osc==1.7.7
-$ pip install numpy==1.20.3
-$ pip install pickle-mixin
-$ pip install matplotlib==3.4.2
+conda create -n "ESrl" python=3.8
+conda activate "ESrl"
+pip install -r requirements.txt
 ```
 
 Usage
 -----
 
 ### ES_reward_punish (Python)
+```shell
+cd software/python/ES_RL_reward_punish/
+python main.py
+```
 
-##### To start teaching from scratch:
-
+##### Teaching from scratch:
+Choose one of the following options:
 * start with a qtable that is initialized with zeros: This makes it easy to read any rewards you added afterwards. It makes sense to choose a neutral action with id 0 as the argmax function gives back the first index in case all values in a list are the same.
 e.g. `self.q_table[(segmentid)] = [0 for i in range(self.sizeActions)]`
 
 _or_
 
-start with a qtable that is initialized with random values: This means that there is already a versatile / random behaviour of the agent.
+* start with a qtable that is initialized with random values: This means that the agent already shows a versatile / random behaviour.
 e.g. `self.q_table[(segmentid)] = [np.random.uniform(-1, 0) for i in range(self.sizeActions)]`
 
-In both cases use: `self.start_q_table = None`
+In both cases use: `self.start_q_table = None`.
 
-* start with a `epsilon` value of 1, this means that the whole range of state-action pairs will be explored by choosing randomly which action to take, while a value of 0 means that the qtable will be exploited, meaning that the values will be read and also will always be the same for the same state-action pair. During the teaching process epsilon will decay through the variable `EPS_DECAY`. This has to be 1 for epsilon not to decay and less than 1 in order for epsilon to decay every episode. The smaller the value, the faster the decay as its used as a factor.
+Start with an `epsilon` value of 1: this means that the whole range of state-action pairs will be explored by randomly choosing which action to take. A value of 0, on the other hand, means that the Qtable will be fully exploited, which always leads to the same action for a given state. During the teaching process epsilon decays through the variable `EPS_DECAY`. Set it to 1 for epsilon not to decay, and to a value smaller than 1 for epsilon to decay every episode. The smaller the value, the faster the decay, as it is used as a multiplicative factor.
 
 ##### To explore if your teaching worked or continue teaching:
 
-* start with a qtable from a filename *.pickle. In you work directory all qtables are saved in a folder called q_tables. Choose the last one you trained.
+1. Load a qtable: In your working directory all qtables are saved in a folder called `q_tables/`. Choose the last one you trained.
 e.g. `self.start_q_table = os.path.join(self.q_tables_dir, "qtable-1649708241.4816985.pickle")`
 
-* for seeing if your teaching worked: choose `epsilon = 0`, this will only take the values from the qtable and is repeatable. The action will always be the same based on the same state.
+2. Check if your teaching worked: Set `epsilon = 0`. This ensures full _exploitation_ of the strategies written to the Qtable, hence no random behaviour. The action will always be the same for the same state.
 
-* try a tradeoff between exploration with 1 and exploitation with 0 by choosing e.g. an `epsilon` of 0.5. This basically helps to avoid to be stuck in local maxima and avoids the stagnation of the system. Sometimes stagnation might be wanted, in our case not as it would mean that first visitors of the installation could teach the swarm while later the visitors can only experience and not change the selection of actions anymore.
+3. Find your setting: Try a tradeoff between _exploration_ and _exploitation_ by choosing e.g. an `epsilon` of 0.5. This helps to avoid getting stuck in local maxima and prevents the system from stagnating. Sometimes stagnation might be wanted; in our case it is not, as it would mean that the first visitors of the installation could teach the swarm, while later visitors could only experience it and no longer change the selection of actions.
 
-* you can also try to change the `LEARNING_RATE` and the `DISCOUNT` which together influence how fast the algorithm learns and adapts to changes. A higher LEARNING_RATE means adapting fast but would also mean to possibly quickly learn something wrong and to forget faster.
+4. Play around with the hyperparameters: You can also change the `LEARNING_RATE` and the `DISCOUNT`, which together influence how fast the algorithm learns and adapts to changes. A higher `LEARNING_RATE` means adapting quickly, but it can also mean quickly learning something wrong and forgetting faster.
 
 ### ES_RL_Flocking (Processing)
 
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..3930cea1f07296c0d88bba59a1059bed09e4d781
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,4 @@
+python-osc==1.7.7
+numpy==1.20.3
+pickle-mixin
+matplotlib==3.4.2
\ No newline at end of file
diff --git a/software/python/ES_RL_reward_punish/ESrl.py b/software/python/ES_RL_reward_punish/ESrl.py
index 511fb5e36d9b4188446f51051e841b0221181d34..e5add26a929219c9833517af6e830f85ace97e94 100644
--- a/software/python/ES_RL_reward_punish/ESrl.py
+++ b/software/python/ES_RL_reward_punish/ESrl.py
@@ -3,7 +3,6 @@ ESRL
 """
 import numpy as np
 from PIL import Image
-import cv2
 import matplotlib.pyplot as plt
 import pickle
 from matplotlib import style
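For readers who want to see how the `LEARNING_RATE`, `DISCOUNT` and `EPS_DECAY` variables from the Usage section fit together, here is a minimal sketch of a standard Q-learning update of the kind used in the sentdex tutorial the README links to. It is not the actual `ESrl.py` implementation; the hyperparameter values and the `update` helper are illustrative assumptions.

```python
import numpy as np

# Illustrative hyperparameters; the real values live in the project's Python code.
LEARNING_RATE = 0.1   # how strongly new feedback overwrites the old estimate
DISCOUNT = 0.95       # how much expected future reward counts compared to immediate reward
EPS_DECAY = 0.9998    # multiplicative decay applied to epsilon once per episode

# Random initialization, analogous to the np.random.uniform(-1, 0) example above.
q_table = np.random.uniform(-1, 0, size=(15, 5))
epsilon = 1.0  # start with pure exploration

def update(state, action, reward, new_state):
    """Standard Q-learning update for one rewarded state-action pair."""
    max_future_q = np.max(q_table[new_state])   # best value reachable from the next state
    current_q = q_table[state, action]          # current estimate for this state-action pair
    # Blend the old estimate with the observed reward plus the discounted future value.
    q_table[state, action] = (1 - LEARNING_RATE) * current_q \
        + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

# A press of the reward button could translate into e.g. reward = +1, punishment into -1.
update(state=3, action=2, reward=1.0, new_state=4)

# Decay epsilon at the end of each episode; EPS_DECAY = 1 would keep it constant.
epsilon *= EPS_DECAY
```

A `LEARNING_RATE` close to 1 makes the table follow the most recent button presses almost directly, which is why the README warns that it can also quickly learn something wrong and forget faster.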