ACM Computing Surveys
28A(4), December 1996,
http://www.cs.orst.edu/~tgd/ml-strategic-directions.html. Copyright ©
1996 by the Association for Computing Machinery, Inc. See the permissions statement below.
Machine Learning
Thomas G. Dietterich
Oregon State University,
Department of Computer Science
303 Dearborn Hall, Corvallis, OR 97331-3202 USA
tgd@cs.orst.edu,
http://www.cs.orst.edu/~tgd
Abstract:
Machine Learning studies methods for developing computer systems that
can learn from experience. These methods are appropriate in situations
where it is impossible to anticipate at design time exactly how a
computer system should behave. The ultimate behavior of the system is
determined both by the framework provided by the programmer and the
subsequent input/output experiences of the program. This position
paper briefly reviews the major methods studied in machine learning
and then describes important open problems and research directions.
Categories and Subject Descriptors: I.2.6 [Learning]; I.5
[Pattern Recognition]
General Terms: Supervised learning, reinforcement learning,
temporal-difference learning, cognitive architectures.
1 Introduction
The fundamental goal of machine learning is to make computer systems
that can adapt and learn from their experience. As such, machine
learning is an alternative to traditional software engineering that
replaces human-intensive hand-engineering with data-intensive software
construction and testing. Machine learning methods are appropriate
whenever hand-engineering of software is difficult and yet data is
available for analysis by learning algorithms. This includes
situations where humans are unable to introspect (e.g., computer
vision, speech recognition, motor control), situations where human
expertise is lacking (e.g., human genome program, drug design),
situations where the task changes frequently (e.g., airline seat sales
strategies), and situations where a program must be customized for
individual users (e.g., information filtering, user interfaces).
Because machine learning methods offer statistical rather than absolute
guarantees of correctness, they may not be appropriate for tasks where
completely error-free performance must be guaranteed (e.g.,
life-support control in intensive care).
Learning tasks can generally be divided into one-shot decision tasks
(classification, prediction) and sequential decision tasks (control,
optimization, planning). One-shot decision tasks are usually
formulated as supervised learning tasks, where the learning algorithm
is given a set of input-output pairs. The input describes the
information available for making the decision and the output describes
the correct decision. For example, in handwritten digit recognition,
the input would be an image of the handwritten digit and the output
would be the identity of the digit.
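The input-output formulation can be sketched in a few lines of code. The 1-nearest-neighbor rule below is one of the simplest instances of learning from input-output pairs: it "learns" by memorizing the training pairs and classifies a new input by the label of its closest stored input. The two-dimensional toy data stands in for something like digit images; all names and values here are illustrative, not from the original text.

```python
import math

def nearest_neighbor(train, x):
    """Classify x by the label of the closest training input (1-NN)."""
    # train is a list of (input_vector, label) pairs -- the supervised data.
    best_label, best_dist = None, math.inf
    for xi, yi in train:
        d = math.dist(xi, x)          # Euclidean distance to a stored input
        if d < best_dist:
            best_label, best_dist = yi, d
    return best_label

# Toy task: 2-D feature vectors labeled 0 or 1.
train = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((1.0, 0.9), 1), ((0.8, 1.1), 1)]
print(nearest_neighbor(train, (0.1, 0.2)))  # → 0
print(nearest_neighbor(train, (0.9, 1.0)))  # → 1
```

Decision trees, rule sets, and neural networks differ from this rule in how they generalize from the pairs, but all start from the same supervised data.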
Sequential decision tasks are usually formulated as reinforcement
learning tasks, where the learning algorithm is part of an agent that
interacts with an "external environment". At each point, the agent
observes the current state of the environment (or some aspects of the
environment). It then selects and executes some action, which usually
causes the environment to change state. The environment then provides
some feedback (e.g., immediate reward) to the agent. The goal of the
learning process is to learn an action-selection policy that will
maximize the long-term rewards received by the agent. An example of a
sequential task is the control of a manufacturing facility, where the
current state includes a list of orders to be fulfilled and the
current status of the facility, and the actions change the settings of
various manufacturing machines to maximize the profit produced by the
facility. The immediate reward includes the cost of each action and
the revenue received when an order is completed.
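The observe-act-reward loop described above can be made concrete with a minimal tabular Q-learning agent. The two-state environment, reward values, and parameters below are invented for illustration; the point is only the shape of the interaction: the agent observes a state, selects an action, receives a reward and successor state, and updates its estimate of long-term reward.

```python
import random

random.seed(0)  # for reproducibility of this illustration

def q_learning(n_states, actions, step, episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning: estimate long-term action values from interaction alone."""
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0
        while s is not None:                  # None marks a terminal state
            # epsilon-greedy action selection (explore vs. exploit)
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: Q[(s, a)])
            s2, r = step(s, a)                # the environment responds
            target = r if s2 is None else r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

# Hypothetical two-state chain: action 1 ("advance") reaches the goal, action 0 loops.
def step(s, a):
    if a == 1:
        return (None, 1.0) if s == 1 else (1, 0.0)
    return (s, 0.0)

Q = q_learning(2, [0, 1], step)
policy = {s: max([0, 1], key=lambda a: Q[(s, a)]) for s in (0, 1)}
print(policy)  # the learned policy advances in both states: {0: 1, 1: 1}
```

Note that the agent is never told which action is correct; it must discover the advancing policy from delayed reward alone, which is what distinguishes the reinforcement setting from supervised learning.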
2 Fundamental Research Problems
Machine learning algorithms can generally be arranged along a spectrum
from general ("off-the-shelf") methods, which attempt to perform well
for a wide variety of problems, to custom methods designed for
particular tasks. For example, methods for learning decision trees,
rule sets, probabilistic networks, and feed-forward neural networks
are very general, and have been applied to many different problems.
Methods for speech recognition based on hidden Markov models have a
more limited range of applicability. And methods for inferring
binding site geometry in drug design are extremely
application-specific.
Many fundamental research problems concern this general-to-specific
spectrum.
- What is the space of general methods? The algorithms for trees,
rules, Bayes nets, and neural nets have all been developed in the last
10-15 years. There is no reason to believe that these four general
methods exhaust the possibilities. What other convenient
representational formalisms need to be explored?
- What are the limits of general methods? Are our current
algorithms for each of these general methods essentially optimal? Or
is there substantial room for improvement?
- What theoretical and engineering principles can guide the
development of application-specific algorithms? The development of
such algorithms is currently very expensive and time-consuming.
- Can we find convenient ways to acquire, represent, and
incorporate application-specific knowledge into general learning
methods? Rather than constructing application-specific algorithms
manually, it would be much better if we could construct them
automatically by instantiating some general learning methods. This
process is still not well understood, although there are several
promising initial efforts.
In sequential decision-making tasks, there are several additional
challenging fundamental questions.
- In what situations is it better to represent and learn a policy
directly, rather than first learning a value function? A
policy specifies what action should be chosen in each state;
a value function gives the expected long-term reward of each
state. The best action can be selected by choosing the action that
will lead to the state with the highest value according to the value
function. Many methods---including genetic programming, genetic
algorithms, and parameterized policies---search directly in policy space.
Other methods---temporal difference learning, Q learning, and real-time
dynamic programming---construct value functions instead.
- What properties should an ideal value function approximator have?
- What is the best way to resolve the tradeoff between exploration
(i.e., search for better policies) and exploitation (i.e., execution
of the current policy)?
- Can we find ways to solve large partially observable Markov
decision problems?
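The first and third questions above can be illustrated together. Given a learned value function and a model of state transitions, the greedy policy simply chooses the action leading to the highest-valued successor state, and epsilon-greedy selection is one common (though not the only) resolution of the exploration-exploitation tradeoff. The three-state world and value function below are hypothetical:

```python
import random

def greedy_action(state, actions, next_state, V):
    """Exploit: pick the action whose successor state has the highest value."""
    return max(actions, key=lambda a: V[next_state(state, a)])

def eps_greedy(state, actions, next_state, V, eps=0.1):
    """With probability eps, explore; otherwise exploit the current value function."""
    if random.random() < eps:
        return random.choice(actions)     # explore: try a random action
    return greedy_action(state, actions, next_state, V)

# Hypothetical 3-state deterministic world with a learned value function V.
V = {0: 0.0, 1: 0.5, 2: 1.0}
next_state = lambda s, a: min(2, max(0, s + a))   # actions move left (-1) or right (+1)
print(greedy_action(1, [-1, +1], next_state, V))  # → 1 (move toward the high-value state)
```

A method that searches policy space directly would instead parameterize and adjust the action-selection rule itself, never representing V at all.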
3 Interdisciplinary Connections
A strength of research in machine learning is that in the past decade,
a strong link has been forged between theoretical and experimental
research, with overlapping attendance at conferences, publication in
the same journals, and even individual papers that combine theoretical
analysis with experiment. All of the fundamental problems listed above
require a concerted attack using both theoretical and experimental
methods.
Similar links are beginning to be established across disciplines, with
connections to statistics, operations research, control, robotics,
signal processing, and many other fields. These links need to be
strengthened and expanded to include stronger links with practitioners
in industry and in application fields in science and engineering.
Graduate students need to have training in at least one application
field if they are going to apply machine learning methods effectively
in that field.
There is currently a serious lack of communication between academic
research and industrial applications. Academic research is rarely
packaged and delivered in a way that is accessible to industrial
researchers (e.g., the fields of statistics and databases have been
much more successful in this regard). Furthermore, many fundamental
and practical problems arising in applications are not being addressed
by academic research. One symptom of this is that few applications
papers are being accepted for publication at conferences. Several
people are attempting to rectify this problem. But other mechanisms
must also be found for widening the communication from industry to
academe.
Surprisingly, the links between machine learning research and the rest
of artificial intelligence and computer science are not very strong.
This is due in large part to the very different background and
training required. Machine learning requires a strong background in
continuous mathematics, numerical computing, probability, and
statistics, whereas most of computer science has focused on
discrete mathematics, and many departments of computer science do not
require any background in probability theory!
Data-intensive methods are beginning to appear in parts of AI
including computer vision and corpus-based natural language
processing. However, for the most part, machine learning has focused
on one-shot decision problems, which constitute a rather small portion
of general intelligent behavior. It is time to consider how machine
learning should be integrated into the architectures of intelligent
agents. Should there be a single, general weak learning component
(e.g., chunking in SOAR), or a large number of specialized learners?
Should learning for perception and motor control be separate from more
"cognitive" tasks? How can knowledge be transferred from one task to
another within the lifetime of an agent? Concrete experimental
settings are required to push this research forward.
4 Concluding Remarks
In conclusion, let me list some of the application areas where there
are opportunities for major improvements through the application of
machine learning methods:
- Machine perception. Approaches that learn to perform feature
extraction, segmentation, and classification show promise of higher
performance and greater ease of implementation than traditional
systems built out of separate hand-built components.
- Large-scale data analysis and information management. Businesses
and scientists are gathering data faster than ever before. Scaled-up
machine learning methods are critical for finding patterns and
relationships in these data sets.
- Information filtering and retrieval. Machine learning methods
can form models of the information needs of users and of the various
available sources of information. By combining these models, the
explosion in online information can be managed to help users find only
the information they need.
- Control and optimization. Adaptive control methods will permit
larger scale optimization (e.g., over entire enterprises) of
manufacturing, planning, scheduling, inventory management, and
logistics. They will also enable new kinds of applications such as
smart buildings and smart vehicles.
- Software engineering. Existing applications packages (such as
statistical packages, image processing workbenches, user interface
toolkits) provide modules that can be assembled to solve particular
problems. Can this assembly be automated so that users do not need to
search through the space of possible combinations?
- Model building. Many disciplines in science and engineering
construct complex models in order to support monitoring, diagnosis,
repair, prediction, extrapolation, and so on. This is a mixture of
programming (to construct the model) and adaptation (to modify the
model and calibrate it to data). Modelers need better tools for
constructing, calibrating, and managing their models.
This list is not exhaustive; there are many other fruitful areas
for research (e.g., studying the dynamics of interactions among
multiple learning agents, applications in cognitive psychology, etc.).
Indeed, I believe that the next decade will see a revolution in the
way we develop software. I believe we must move away from software
that is pre-engineered, pre-compiled, and pre-optimized toward
software that is adaptive and learns from its experiences. Unless
software becomes self-assembling and adaptive, every computer user
will need to have his or her own personal software engineer, and the
cost of providing such services will severely limit the growth of the
computer industry.
Permission to make digital
or hard copies of part or all of this work for personal or classroom
use is granted without fee provided that copies are not made or
distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for
components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, to
republish, to post on servers, or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from
Publications Dept, ACM Inc., fax +1 (212) 869-0481, or
permissions@acm.org.