1A value function tells a machine the value of the current state of affairs - simple as that. In the everyday world, where the machine is the brain, a value function does the same thing--it assigns a number to an outcome, whether or not the outcome is due to external experience or internal experience.
55
2Once a goal is selected by your nervous system, a kind of natural information cycle is set up to guide behavioral choices. There are four basic steps in this cycle. The nervous system must (1) hold the current goal in mind, (2) produce a critic signal for this goal, (3) use the critic signal to guide choices and improve the brain's model of the goal, and (4) select the next goal (or keep the current one active).
91
3Reinforcement learning is an approach to trial and error learning where a creature's actions are guided by a class of signals called rewards. These signals have been used in engineered systems and computer programs to equip these systems with goals, and to use the goals to guide learning.
92
4Most practical real-world problems are so hard that a program without a lot of flexibility is doomed to be special-purpose. Ultimately, the goal of such systems is to produce autonomous, self-programming systems that achieve their goals flexibly or creatively.
92
5Guidance beats prescription. Guidance is great when flexibility is required. Prescription is great when the solution to a problem is known once and for all. Creatures need guidance, not prescription. Why? Because the most constant feature of a mobile creature's environment is its inconstancy, its raw variability.
97
6But in biology, there are no blank slates
98
7Despite their differences, all goals have one thing in common: They can all be used by our brains to direct decisions that lead to the satisfaction of the goal.
99
8All reinforcement learning systems have three major parts: (1) an immediate reinforcement signal that assigns a number to each state of the creature, (2) a stored value function that represents a judgment about the long-term value of each state, and (3) a policy that maps the agent's states to its actions.
103
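The three parts listed in this quote can be laid out as plain data structures. This is only a minimal sketch: the state names, action names, and numbers below are invented for illustration and do not come from the book.

```python
# Hypothetical three-state creature; every name and number here is an assumption.
STATES = ["hungry", "foraging", "fed"]

# 1. Immediate reinforcement signal: assigns a number to each state.
reinforcement = {"hungry": -1.0, "foraging": 0.0, "fed": 1.0}

# 2. Stored value function: a judgment about each state's long-term value.
#    It starts at zero and would be adjusted by learning.
value = {state: 0.0 for state in STATES}

# 3. Policy: maps the agent's states to its actions.
policy = {"hungry": "search", "foraging": "search", "fed": "rest"}


def act(state):
    """Look up the action the policy prescribes for a state."""
    return policy[state]
```

Keeping the three parts separate makes it clear which one learning changes: the reinforcement signal is given, while the value function and the policy are what experience adjusts.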
9The critic signal combines 1) immediate reinforcement information with 2) changes in value to produce what is sometimes called a reward-prediction error signal.
103
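One common way to write such a critic signal is the temporal-difference error from the reinforcement learning literature. The sketch below assumes that form; the function name and the discount parameter are illustrative choices, not taken from the book.

```python
def prediction_error(reward, value_now, value_next, discount=0.9):
    """Reward-prediction error in temporal-difference form.

    Combines (1) the immediate reinforcement with (2) the change in value
    between the current state and the next one. The discount factor is an
    assumed parameter that weights future value.
    """
    return reward + discount * value_next - value_now
```

A positive result reads as "things are better than expected", a negative one as "worse than expected", and zero as "exactly as predicted". For example, `prediction_error(1.0, 0.0, 0.0)` is 1.0 (an unpredicted reward), while `prediction_error(1.0, 1.0, 0.0)` is 0.0 (a fully predicted one).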
10The model showed that Schultz had discovered one of the central critic systems in the mammalian brain, and one that encoded its criticism in the delivery of dopamine.
109
11Ideas act as reward signals from the point of view of the prediction error systems.
110
12Ideas gain the power of rewards and become instantly meaningful to the rest of the brain, especially the learning and decision making algorithms there...(rest of page)
111
13First, there must be some kind of filtering process that circumscribes the kinds of thoughts that can act as a reward signal. Second, the control must be self-limiting, that is, it must be relatively short-lived.
112
14If the light-juice pairs are repeatedly delivered to the subject two remarkable changes occur. The initial response associated with the juice, the "things are better than expected" response goes away. It literally disappears and the neurons no longer change their activity when the juice arrives following the light.
113
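The disappearance of the juice response can be reproduced with a small temporal-difference simulation. This is a sketch under stated assumptions, not the model from the book: the number of steps between light and juice, the learning rate, and the discount are all made-up values.

```python
def simulate_light_juice(trials=200, steps=3, alpha=0.2, gamma=1.0):
    """Return the prediction error at juice delivery on each trial.

    Step 0 is the light and the last step is juice delivery. The step
    count, learning rate (alpha), and discount (gamma) are assumptions.
    """
    value = [0.0] * (steps + 1)  # value[steps] marks the end of the trial, fixed at 0
    juice_errors = []
    for _ in range(trials):
        for t in range(steps):
            reward = 1.0 if t == steps - 1 else 0.0  # juice arrives on the last step
            delta = reward + gamma * value[t + 1] - value[t]
            value[t] += alpha * delta  # move the estimate toward what happened
            if t == steps - 1:
                juice_errors.append(delta)
    return juice_errors
```

On the first trial the error at juice time is 1.0, a full "better than expected" signal. With repeated pairings the stored value catches up with the reward, so the error at juice time shrinks toward zero, matching the response that "goes away" in the quote.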
15So the brain is solving a statistics problem here--it's constantly looking for stimuli that predict future value and by doing so the brain can use these "value proxies" to make better decisions about the future.
114
16If the brain starts out with a good model of the learning problem, that is, what it should generally do in a particular situation requiring learning, this form of reinforcement learning is quite rapid and efficient.
115
17A common currency scheme only works if there are checks and balances built into the brains of the traders. The most important check is trust, the ability to rely on one's partners to obey the rules, whatever they may be.
117
18The guiding metaphor will be the analogy between foraging for rewards in a field and foraging for ideas. The domains differ, but the issue is the same--rummage through a space of "something valuable," picking up those items most likely to return the most value to the rummager. The system already knew how to rummage for "good stuff," so it simply redefined what qualified as "good stuff."
124
19To pursue a goal, it must first be stable through time, that is, it must be held in mind.
126
20It is now thought that goals are represented in the prefrontal cortex by stable patterns of neural activity.
127
21The idea simply sounds dangerous, doesn't it? Arbitrary goals plugged into reward sockets, like plugging the output into the input, almost always a bad idea.
134
22In this way, ideas generated by your prefrontal cortex act directly as high-priority reward signals from the point of view of the prediction error systems. These ideas then act directly as the reward signal to the dopamine neurons, which then try to combine information from other regions of the brain to predict this new "idea reward."
137
23Just like "a friend of my friend is also my friend," the brain has its own version--a predictor of a predictor is also a predictor. Where the predictors predict future rewards.
213
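The "predictor of a predictor" idea can be sketched with a two-cue chain: a tone is followed by a light, and the light is followed by juice. The cue names, learning rate, and discount are assumptions made for illustration.

```python
def learn_chain(trials=200, alpha=0.2, gamma=0.9):
    """Learn values for a tone -> light -> juice chain with TD-style updates.

    The tone is never followed directly by juice; it can only gain value
    by predicting the light, which in turn predicts the juice.
    """
    value = {"tone": 0.0, "light": 0.0}
    for _ in range(trials):
        # tone -> light: no juice yet, so the only teaching signal is the light's value
        value["tone"] += alpha * (0.0 + gamma * value["light"] - value["tone"])
        # light -> juice: juice (reward 1.0) arrives and the trial ends
        value["light"] += alpha * (1.0 + gamma * 0.0 - value["light"])
    return value
```

After a single trial the tone still has no value, because the light has not yet become a predictor. Over many trials the light's value approaches 1.0 and the tone's value approaches the discounted value of the light (0.9 with the assumed discount), even though the tone was only ever paired with another predictor.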
24The brain must possess dynamic distribution schemes if it is to be efficient overall. Some problems may require more speed and others more precision--and such needs will change with the problem at hand.
244
27Choice means loss
vii
28In that paper, Turing proposed and demonstrated that any step-by-step procedure (or algorithm) could be represented as a sequence of elementary computations.
6
29The gripe here is that computer programs, exercising great power over our momentary moods, are sort of accidentally evil. Why? They have not been given the capacity to care--they don't have any goals.
3
35The central idea is computation. All things "thoughtlike" are patterns of information stored, processed, and transformed by physical mechanisms in your brain. This is how something immaterial like an abstract thought can be grounded in the physical operation of the nervous system
11
36In the West, we conceptualize our existence in terms of two basic, but separate entities, body and mind. This idea, usually attributed to René Descartes, is an old one and thought by many to be obvious.
9
37Turing conjectured that even our thoughts were equivalent to computational steps, only running on a very specific, biologically evolved device: our brains. This idea is now called the Computational Theory of Mind (CTOM). It's easy to state but still profound. Your mind is not equal to your brain and the interaction of its parts, but your mind is equivalent to the information processing, the computations, supported by your brain.
8
38Desperation is indeed the mother of invention. Plato called it necessity, but he really meant desperation.
17
39Efficiency = the best long-term returns from the least immediate investment.
18
40The surprising answer is that efficient computations care--or more precisely, they have a way to care. I know this sounds strange. And what does an efficient computation care about? Goals.
19
41Four savings Principles: 1. Drain Batteries Slowly. 2. Save Space. 3. Save Bandwidth. 4. Have Goals
32
42The reverse perspective shows that, in addition to artistry, origami contains the concept of a compression algorithm.
39
43All compression is not created equal.
40
44The mandate to save space prescribes two structural features for machines that use wires to communicate: (1) build as few wires as possible and (2) build more short wires than long wires (conserve wire).
41
45...every rule is seen better as a statistical trend and not as a law.
42
46Specifying complex goals directly is not an efficient way to build goals into a system.
49
47Topic: Fishing Guide
49
48The guide will guide you, and the guidance will lead to a lot of little corrections in your current behavior and your behavior in the near future.
50
49This is one way that the nervous system implements goals--it uses collections of corrective guidance systems (error signals) to navigate an individual's behavior. An explicit abstract account of the goal is not really necessary as long as the system produces and follows the guidance system accurately. This means that many of the goals produced by our nervous system will not be explicitly available as a conscious experience.
50
50Value is the central concept here,...pursued under the name "reinforcement learning."
Connection between goals, values, and guidance links
51
51The field of reinforcement learning lies at the nexus of ... computer science, control theory, and statistical learning theory.
52
52Guidance signals are the error signals that tell the system how to adjust when deviations from the goal state occur. These deviations from the goal state provide a terrific model for the concept of desires, which inform the system how to adjust in order to move closer to achieving the goal state.
54
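A guidance signal of this kind can be sketched as a simple error-correcting loop, where the system never needs an explicit description of the goal beyond the deviation from it. The function names and the gain value are hypothetical.

```python
def guidance_signal(goal, current):
    """Error signal: how far, and in which direction, the system is from the goal."""
    return goal - current


def follow_guidance(start, goal, gain=0.5, steps=20):
    """Repeatedly make a small correction in the direction of the error.

    The gain (fraction of the error corrected per step) is an assumed value.
    """
    current = start
    for _ in range(steps):
        current += gain * guidance_signal(goal, current)
    return current
```

Each step removes a fixed fraction of the remaining deviation, so the system drifts toward the goal state through many small corrections rather than one prescribed move.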