Neural architectures for database query processing, syntax analysis, knowledge representation, and inference

Chun-Hsien Chen
Iowa State University

Follow this and additional works at: http://lib.dr.iastate.edu/rtd
Part of the Artificial Intelligence and Robotics Commons

Recommended Citation
http://lib.dr.iastate.edu/rtd/11833

This Dissertation is brought to you for free and open access by Iowa State University Digital Repository. It has been accepted for inclusion in Retrospective Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact digirep@iastate.edu.
INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.
Neural architectures for database query processing, syntax analysis, knowledge representation, and inference

by

Chun-Hsien Chen

A dissertation submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major: Computer Science
Major Professor: Vasant Honavar

Iowa State University
Ames, Iowa
1997
Copyright © Chun-Hsien Chen, 1997. All rights reserved.
Graduate College
Iowa State University

This is to certify that the Doctoral dissertation of

Chun-Hsien Chen

has met the dissertation requirements of Iowa State University

Signature was redacted for privacy.

Committee Member

Signature was redacted for privacy.

Committee Member

Signature was redacted for privacy.

Committee Member

Signature was redacted for privacy.

Committee Member

Signature was redacted for privacy.

Major Professor

Signature was redacted for privacy.

For the Major Program

Signature was redacted for privacy.

For the Graduate College
# TABLE OF CONTENTS

**ACKNOWLEDGEMENTS** .............................................................. x

**ABSTRACT** ................................................................................. xi

**1 INTRODUCTION** ........................................................................ 1

1.1 Artificial Neural Networks ..................................................... 3

1.1.1 Artificial neural units ....................................................... 3

1.1.2 Activation functions ....................................................... 4

1.1.3 Types of artificial neural networks and their computational capabilities ....................................................... 5

1.1.4 Implementation of artificial neural networks ....................... 6

1.2 A Brief Review of Artificial Neural Networks ............................. 6

1.3 An Overview of the Dissertation ........................................... 9

**2 A NEURAL MEMORY FOR CONTENT AS WELL AS ADDRESS-BASED STORAGE AND RECALL** .............................................. 13

2.1 Introduction ............................................................................. 13

2.1.1 Information retrieval and binary mapping ............................ 14

2.1.2 Associative memory (Content-addressed memory) ................. 15

2.1.3 Address-based memory .................................................... 18

2.1.4 Perceptrons ....................................................................... 19

2.2 Multi-layer Perceptrons as Neural Memories ........................... 19

2.2.1 The application of linear separability of binary vertices in pattern classification ....................................................... 20

2.2.2 Best match: pattern classification with precision control ........ 23

2.2.3 Storage capacity ............................................................... 24
2.2.4 Synthesis of associative and address-based memories .................. 24
2.2.5 Exact match: binary mapping Perceptron (BMP) module ............... 27
2.2.6 Conversion between memory models using bipolar and binary inputs .. 28

2.3 Properties of the Proposed Neural Associative Memory ................. 32
2.3.1 Partial match: associative recall from a partially specified input ..... 33
2.3.2 Multiple associative recalls .................................. 35
2.3.3 Fault tolerance .............................................. 38

2.4 Summary and Discussion ........................................... 40

3 NEURAL ARCHITECTURES FOR INFORMATION RETREIVAL AND
DATABASE QUERY PROCESSING ........................................ 42
3.1 Introduction ...................................................... 42
3.1.1 Information retrieval in neural associative memories ............... 44

3.2 Query Processing Using Neural Associative Memories .................... 45
3.2.1 Realization of lexical access for a machine-readable lexicon using a neural
associative memory .............................................. 45
3.2.2 Realization of a library query system using a neural associative memory 50
3.2.3 The implementation of case insensitive pattern matching ............ 52

3.3 Comparison with Other Database Query Processing Techniques ............ 53
3.3.1 Performance of current electronic realization for neural networks .... 53
3.3.2 Analysis of query processing in conventional computer systems ....... 54

3.4 Summary and Discussion ........................................... 58

4 NEURAL ARCHITECTURES FOR ELEMENTARY LOGICAL INFERENCE ........... 59
4.1 Introduction ...................................................... 59
4.2 Neural Assemblies for the Recognition of Partial Patterns ............... 60
4.2.1 A neural assembly for inclusive pattern recognition ............... 61
4.2.2 A neural assembly for exclusive pattern recognition ............... 62

4.3 A Neural Assembly for Executing a Logical AND (AND Neural Assembly) .... 64
<table>
<thead>
<tr>
<th>Section</th>
<th>Title</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>4.4</td>
<td>Neural Assemblies for Executing Logic ORs (OR Neural Assemblies)</td>
<td>67</td>
</tr>
<tr>
<td>4.4.1</td>
<td>A general OR neural assembly</td>
<td>67</td>
</tr>
<tr>
<td>4.4.2</td>
<td>A monotone OR neural assembly</td>
<td>69</td>
</tr>
<tr>
<td>4.5</td>
<td>A Neural Architecture for Realizing DNF Boolean Functions</td>
<td>70</td>
</tr>
<tr>
<td>4.6</td>
<td>Summary and Discussion</td>
<td>71</td>
</tr>
<tr>
<td>5</td>
<td>NEURAL ARCHITECTURES FOR SEQUENCE PROCESSING</td>
<td>73</td>
</tr>
<tr>
<td>5.1</td>
<td>Introduction</td>
<td>73</td>
</tr>
<tr>
<td>5.1.1</td>
<td>Symbolic functions and binary mappings</td>
<td>74</td>
</tr>
<tr>
<td>5.2</td>
<td>Neural Network Design for Deterministic Finite Automata (NN DFA)</td>
<td>76</td>
</tr>
<tr>
<td>5.2.1</td>
<td>Deterministic finite automata (DFA)</td>
<td>76</td>
</tr>
<tr>
<td>5.2.2</td>
<td>Architecture of NN DFA</td>
<td>77</td>
</tr>
<tr>
<td>5.3</td>
<td>Neural Network Design for Deterministic Pushdown Automata (NN DPDA)</td>
<td>80</td>
</tr>
<tr>
<td>5.3.1</td>
<td>Deterministic pushdown automata (DPDA)</td>
<td>80</td>
</tr>
<tr>
<td>5.3.2</td>
<td>Architecture of NN DPDA</td>
<td>81</td>
</tr>
<tr>
<td>5.4</td>
<td>Neural Network Design for Stack (NN Stack)</td>
<td>84</td>
</tr>
<tr>
<td>5.4.1</td>
<td>Symbolic representation of stack</td>
<td>84</td>
</tr>
<tr>
<td>5.4.2</td>
<td>Architecture of NN Stack</td>
<td>85</td>
</tr>
<tr>
<td>5.4.3</td>
<td>NN Stack in action</td>
<td>89</td>
</tr>
<tr>
<td>5.5</td>
<td>Neural Network Design for Nondeterministic Finite Automata (NN NFA)</td>
<td>90</td>
</tr>
<tr>
<td>5.5.1</td>
<td>Nondeterministic finite automata (NFA)</td>
<td>91</td>
</tr>
<tr>
<td>5.5.2</td>
<td>Model for concurrently tracking all the possible nondeterministic moves in the operation of an NFA using RNN</td>
<td>92</td>
</tr>
<tr>
<td>5.5.3</td>
<td>Architecture of NN NFA</td>
<td>95</td>
</tr>
<tr>
<td>5.5.4</td>
<td>Proof of correctness</td>
<td>103</td>
</tr>
<tr>
<td>5.5.5</td>
<td>NN NFA in Action</td>
<td>105</td>
</tr>
<tr>
<td>5.6</td>
<td>Summary and Discussion</td>
<td>106</td>
</tr>
<tr>
<td>6</td>
<td>NEURAL ARCHITECTURES FOR SYNTAX ANALYSIS</td>
<td>108</td>
</tr>
<tr>
<td>6.1</td>
<td>Introduction</td>
<td>108</td>
</tr>
</tbody>
</table>
LIST OF FIGURES

Figure 1.1 A typical computing unit of an ANN ................................. 4

Figure 2.1 Examples of memory pattern, noisy patterns, and partial pattern ... 17

Figure 2.2 The spatial distribution of a 3-dimensional and an n-dimensional binary hypercubes ................................................................. 22

Figure 2.3 The setting of connection weights and hidden node threshold in the proposed neural memory (a 2-layer Perceptron with binary input) for a given associated memory pair ................................................. 26

Figure 2.4 The settings of connection weights and hidden node threshold in the proposed BMP module for an associated binary mapping ordered pair 28

Figure 2.5 The setting of connection weights and hidden node threshold in the proposed neural memory (a 2-layer Perceptron with bipolar input) for a given associated memory pair ................................................. 32

Figure 3.1 A modular design of the proposed ANN memory for easy expansion. This 1-dimensional array structure can be easily extended to 2 or 3-dimensional array structures. ................................................. 50

Figure 4.1 A 1-layer Perceptron which recognizes all the 5-dimensional binary patterns that contain the partial pattern \(<1,?,0,?,1>\), where ? denotes don’t care ......................................................... 63

Figure 4.2 A 1-layer Perceptron which recognizes all the 5-dimensional binary patterns that don’t contain the partial pattern \(<1,?,0,?,1>\), where ? denotes don’t care ......................................................... 65
<table>
<thead>
<tr>
<th>Figure</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>4.3</td>
<td>An AND neural assembly which realizes the logical AND function $C(v)$</td>
<td>66</td>
</tr>
<tr>
<td>4.4</td>
<td>An OR neural assembly which realizes the logical OR function $D(v)$</td>
<td>69</td>
</tr>
<tr>
<td>4.5</td>
<td>An neural architecture which realizes the DNF Boolean function $E(v)$</td>
<td>72</td>
</tr>
<tr>
<td>5.1</td>
<td>The proposed modular neural network architecture for DFA</td>
<td>79</td>
</tr>
<tr>
<td>5.2</td>
<td>The proposed modular neural network architecture for DPDA</td>
<td>83</td>
</tr>
<tr>
<td>5.3</td>
<td>The proposed neural network architecture for stack mechanism</td>
<td>86</td>
</tr>
<tr>
<td>5.4</td>
<td>The state diagram of an NFA that accepts any input string containing the sub-string abaa</td>
<td>93</td>
</tr>
<tr>
<td>5.5</td>
<td>The state diagram of a DFA that accepts any input string containing the sub-string abaa</td>
<td>93</td>
</tr>
<tr>
<td>5.6</td>
<td>The proposed recurrent neural network architecture for concurrently tracking all the nondeterministic computations of a given NFA</td>
<td>96</td>
</tr>
<tr>
<td>6.1</td>
<td>The simplified state diagram of a DFA which recognizes keywords: begin, end, if, then, and else</td>
<td>113</td>
</tr>
<tr>
<td>6.2</td>
<td>The state diagram of a DFA which simulates a simple word segmenter carving continuous input stream of characters into words including integer constants, keywords and identifiers</td>
<td>114</td>
</tr>
<tr>
<td>6.3</td>
<td>The proposed neural network architecture for LR(1) parser</td>
<td>119</td>
</tr>
<tr>
<td>6.4</td>
<td>The state diagram of the DFA $M_{L_1}$ for the lexical analyzer $L_1$</td>
<td>127</td>
</tr>
</tbody>
</table>
# LIST OF TABLES

| Table 2.1 | The corresponding maximal storage capacity of a 1-layer Perceptron with 100 input neurons for classifying binary patterns for a range of allowable noise levels | 25 |
| Table 3.1 | A comparison of the estimated performance of the proposed neural associative memory with that of other techniques commonly used in conventional computer systems for locating a record pointer in key-based organizations | 56 |
| Table 3.2 | A comparison of the capabilities of the proposed neural associative memory with those of other techniques commonly used in conventional computer systems for exact match and partial match | 56 |
| Table 6.1 | The parse table of the LR(1) parser for grammar $G_1$ | 125 |
| Table 6.2 | The transition function $\delta_{L_1}$ of the DFA $M_{L_1}$ | 127 |
| Table 6.3 | Moves of the LR(1) parser for grammar $G_1$ on input string $I\times I+I$ | 129 |
| Table 6.4 | A comparison of the estimated performance of the proposed NNLR Parser with that of conventional computer systems for syntax analysis | 133 |
ACKNOWLEDGEMENTS

I would like to express my sincere appreciation to Dr. Vasant Honavar for his support, assistance, and guidance, as well as challenging and inspiring criticism in the preparation of this dissertation and in my studies at Iowa State University. I would also like to thank my Graduate Committee: Dr. Tom Barta, Dr. Julie Dickerson, Dr. Les Miller, and Dr. Johnny Wong for their helpful advice, comments, and suggestions.

I am deeply indebted to the Extended and Continuing Education Office, Iowa State University, for the warm friendship of the staff, the valuable work experience, and the timely financial support without which my study for a Ph.D. would not be possible. I am grateful to Russell Meier and the Artificial Intelligence Research Group: Karthik Balakrishnan, Armin Mikler, Rajesh Parekh and Jihoon Yang for their friendship and inspiring discussion. I am also grateful to my friends in Ames, Iowa, for keeping me company at my leisure time and for nourishing me with friendship.

Also, I would like to express my deepest gratitude to my parents for their everlasting support and love.
ABSTRACT

Artificial neural networks (ANN), due to their inherent parallelism, potential for fault tolerance, and adaptation through learning, offer an attractive computational paradigm for a variety of applications in computer science and engineering, artificial intelligence, robotics, and cognitive modeling. Despite the success in the application of ANN to a broad range of numeric tasks in pattern classification, control, function approximation, and system identification, the integration of ANN and symbolic computing is only beginning to be explored. This dissertation explores to integrate ANN and some essential symbolic computations for content-based associative symbolic processing. This offers an opportunity to explore the potential benefits of ANN's inherent parallelism in the design of high performance computing systems for real time content-based symbolic processing applications. We develop methods to systematically design massively parallel architectures for pattern-directed symbol processing using neural associative memories as key components. In particular, we propose neural architectures for content-based as well as address-based data storage and recall, information retrieval and database query processing, elementary logical inference, sequence processing, and syntax analysis. Their potential advantages over conventional serial computer implementations of the same functions are examined in the dissertation.
1 INTRODUCTION

The goal of artificial intelligence (AI), broadly interpreted, is to understand and engineer intelligent systems. It is often suggested that traditionally serial symbol processing systems of AI and inherently massively parallel artificial neural networks (ANN) offer two radically, perhaps even irreconcilably different paradigms for modelling minds and brains — both artificial as well as natural [130, 160]. AI has been successful in applications such as theorem proving, knowledge-based expert systems, mathematical reasoning, syntax analysis, and related applications which mainly involve systematic symbol manipulation. On the other hand, ANN have been particularly successful in applications such as pattern recognition, function approximation, and nonlinear control [60, 150] which involve primarily numeric computation. Meyerowitz has suggested that the design of neural architectures capable of supporting dynamic representations for symbol manipulation is one of the grand challenges of neural network research [113].

As shown by Church, Kleene, McCulloch, Post, Turing, and others through their work on the theory of Computation [117, 100], both AI and ANN represent particular realizations of a universal (Turing-equivalent) model of computation [185]. Thus, despite assertions by some to the contrary, any task that can be realized by one can, in principle, be accomplished by the other. However, most AI systems have been traditionally programmed in languages that were influenced by Von Neumann's design of a serial stored program computer. ANN systems on the other hand, have been inspired by (albeit overly simplified) models of biological neural networks. They represent different commitments regarding the architecture and the primitive building blocks used to implement the necessary computations. Thus they occupy different regions characterized by possibly different cost-performance tradeoffs in a much larger space of potentially interesting designs for intelligent systems. Recently, several researchers have begun
to explore previously unexplored parts of this design space.

Given the reliance of both traditional AI and ANN on essentially equivalent formal models of computation, a central issue in design and analysis of intelligent systems has to do with the identification and implementation, under a variety of design, cost, and performance constraints, of a suitable subset of Turing-computable functions that adequately model the desired behaviors. Today's AI and ANN systems each demonstrate at least one way of performing a certain task (e.g., logical inference, pattern recognition, syntax analysis) naturally and thus pose the interesting problem for the other of doing the same task, perhaps more elegantly, efficiently, robustly, or cost-effectively than the other. In this context, it is beneficial to critically examine the often implicit and unstated assumptions on which current AI and ANN systems are based and to identify alternative (and potentially better) approaches to designing such systems. Massively parallel symbol processing architectures for AI systems or highly structured (as opposed to homogeneous, fully connected) ANN are just two examples of a wide range of approaches to designing intelligent systems [185, 72, 73]. Of particular interest are alternative designs (including synergistic hybrids of ANN and AI designs) for intelligent systems [47, 65, 70, 72, 73, 99, 136, 172, 179, 185]. Examples of such systems include: neural architectures for information retrieval and database query processing [23, 24], generation of context-free languages [187], rule-based inference [5, 31, 141, 167, 176], computer vision [11, 119], natural language processing [14, 32], learning [46, 69, 168], and knowledge-based systems [94, 145]. We strongly believe that a judicious and systematic exploration of the design space of such systems is essential for understanding the nature of key cost-performance tradeoffs in the synthesis of intelligent systems.

This dissertation explores to integrate ANN and some essential symbolic computations for content-based associative symbolic processing. This offers an opportunity to explore the potential benefits of ANN's inherent parallelism in the design of high performance computing systems for real time content-based symbolic processing applications.
1.1 Artificial Neural Networks

ANN are biologically inspired by the neural systems of human brain which are massively parallel interconnected networks of hierarchically organized nerve cells (neurons). ANN are extremely simplified models of biological neural systems in many aspects such as the structure of basic computational units, the mechanism for information processing, network architecture, etc. Compared to most current digital computer systems, ANN are particularly well-suited for pattern-directed problems - pattern completion, pattern classification and pattern association [29] which arise frequently in applications such as language processing, speech recognition, and pattern recognition.

It is worth mentioning that the primary goal of ANN research (unlike neural modelling or computational neuroscience research) is not to discover a computational model for the detailed processes of human brain but to technologically pursue a computing paradigm which can effectively realize and efficiently perform high-level intelligent processes.

1.1.1 Artificial neural units

A typical computing unit (node) in an ANN has $n$ input and $m$ output connections, each of which has an associated weight. The node computes the weighted sum on the inputs, compares the sum to its node threshold, and produces its output based on an activation function. A commonly used activation function is threshold function. The resulting output is sent along the output connections to other nodes. The output of such a node used in this dissertation is defined by

$$y = f(s) \text{ and } s = \sum_{i=1}^{n} w_i x_i - \theta$$

where $x_i$ is the value of input $i$, $w_i$ is the associated weight on input connection $i$, $\theta$ is the node threshold, $y$ is the output value, and $f$ is the activation function. Figure 1.1 shows such a node.
1.1.2 Activation functions

The types of activation functions used by an ANN affect its expressiveness, computational capabilities, and performance. Several typical activation functions are linear, binary sigmoidal, bipolar sigmoidal, binary hardlimiter, bipolar hardlimiter, gaussian, and ramp [56, 102] defined as follows:

linear: \[ f_L(s) = cs, \text{ where } s = \sum_{i=1}^{n} w_i x_i - \theta \text{ and } c > 0 \]

binary sigmoidal: \[ f_S(s) = \frac{1 - e^{-cs}}{1 + e^{-cs}}, \text{ where } s = \sum_{i=1}^{n} w_i x_i - \theta \text{ and } c > 0 \]

bipolar sigmoidal: \[ f_S(s) = \frac{1 - e^{-cs}}{1 + e^{-cs}}, \text{ where } s = \sum_{i=1}^{n} w_i x_i - \theta \text{ and } c > 0 \]

binary hardlimiter: \[ f_H(s) = \begin{cases} 1 & \text{if } s \geq 0 \\ 0 & \text{otherwise} \end{cases}, \text{ where } s = \sum_{i=1}^{n} w_i x_i - \theta \]

bipolar hardlimiter: \[ f_H(s) = \begin{cases} +1 & \text{if } s \geq 0 \\ -1 & \text{otherwise} \end{cases}, \text{ where } s = \sum_{i=1}^{n} w_i x_i - \theta \]
gaussian: \[ f_G(s) = e^{\frac{-s^2}{2\sigma^2}}, \text{ where } s = \sum_{i=0}^{n}(w_i - x_i)^2 \]

ramp: \[ f_R(s) = \begin{cases} +1 & \text{if } cs > +1 \\ cs & \text{if } |cs| \leq 1 \\ -1 & \text{if } cs < -1 \end{cases} \text{, where } s = \sum_{i=1}^{n} w_i x_i - \theta \text{ and } c > 0 \]

In above equations, \( c \) is activation gain. Note that most of the activation functions produce output in the range of \([0, 1]\) for binary signals, and \([-1, 1]\) for bipolar signals.

This dissertation uses two types of threshold functions: binary hardlimiter and bipolar hardlimiter. Their simplicity allows simple and efficient hardware implementation of such threshold functions.

1.1.3 Types of artificial neural networks and their computational capabilities

ANN can be mainly classified into three basic categories: feedforward networks, feedback networks and recurrent networks [28, 56, 131] according to their architectures, functionalities, and signal propagation direction of their connections. The output of a feedforward network is a function of current input, and its connections are unidirectional. The output of a feedback network is a function of current input (and past inputs in some cases), and its connections are not necessarily unidirectional. The output of a recurrent network is a function of current and past inputs, and its connections are unidirectional. Architecturally, a recurrent network is a feedforward networks with recurrent connections, but it is a feedback network functionally. Since the output of both feedback networks and recurrent networks can be a function of past inputs, and thus they are suitable for sequence processing. Mathematically, the computing of feedforward networks approximates a function mapping, and that of feedback networks approximates finite state machines, pushdown automata, or Turing machines.

Typically, a feedforward network and a recurrent network has a layer of input neurons to receive input, a layer of output neurons to produce output, and often layers of hidden neurons to extend the computing capability of the network. Usually, the neurons of a feedback net-
work are classified into input, hidden, and output neurons functionally but not architecturally. *Perceptrons* [155] and *multi-layer Perceptron* [156] are two examples of feedforward networks. *Elman network* [33] and *Jordan network* [81] are two examples of recurrent networks. *Hopfield networks* [75] and *BAM* [90] two examples of feedback networks.

1.1.4 Implementation of artificial neural networks

Due to the computations required by enormous neural nodes to calculate their thresholded activation and weighted sum on the inputs from their associated input connections in an ANN, ANN systems generally require more intensive computational power but simpler types of computations than current computer systems do. There are many technologies available for implementing ANN, mainly including software simulation which is the most widely used due to the fact that digital computer systems are highly available for writing and testing the simulation programs, electronic hardware (digital VLSI, analog VLSI, hybrid of digital and analog VLSI, etc.) realization which potentially owns both the benefits of high performance and cost-effectiveness currently due to the fact that VLSI provides relatively high performance and is extensively used in current computer systems, optical computing which potentially has the highest performance because it computes at the speed of light, and biological implementation which is biologically closer to biological nervous systems.

1.2 A Brief Review of Artificial Neural Networks

Since the resurgence of research on ANN in 1980s, ANN have attracted much interest of many researchers from various science and engineering disciplines, which is shown by the explosive amount of applications and published technical papers on ANN in 1980s and 1990s. It is beyond the scope of this dissertation to review in detail the rich literature in every research area of ANN. Instead, this section will only briefly review up to late 1980s several representative concepts and landmarks on the common research ground of ANN. Reviewed in much more detail in the related chapters is the literature specific to the research topics covered by this
dissertation which explores methods for systematically designing neural architectures for associative memories, database query processing, elementary logical inference, sequence processing, and syntax analysis. The reference book [189] which provides more than 4000 references is a good source of research material for facilitating a general and in-depth understanding of ANN research.

Two of ANN research problem domains for which few conventional computing solutions exist are

- **associative memories** which are anticipated to provide the same advantageous capability as human memory does and are currently mainly used in the applications of pattern classification based on their capability for best match and partial match, and

- **learning** which is anticipated to be used as an efficient and cost-effective alternative to knowledge engineering for automated knowledge acquisition without intensive programming.

Following brief review on ANN literature mainly proceeds along these two intermingled themes which have driven the development of new ANN architectures, models, and algorithms for information processing. The development of formal mathematical models for ANN can be traced back to the early 1940s in the work by McCulloch and Pitts [108], which showed that any logical proposition can be represented by a network of interconnected neurons of two states if enough neurons are provided. The computational capability of McCulloch-Pitts neural networks was proved to be equivalent to Turing machines [183] which are the essential model of symbolic computation and can perform any computation that can be described by a finite program in any general purpose language [26].

In 1949, Hebb proposed the first learning rule for neurons [63]. In the late 1950s and early 1960s, Rosenblatt introduced a class of neural networks [155], called perceptrons, which can learn to classify patterns through supervised learning. Rosenblatt's work helped produce a large amount of research activities in this early ANN research era. In 1969, Minsky and Papert showed in their landmark book Perceptrons that the computational power of perceptron's single-layer learning algorithm is only able to solve linearly separable problem but not a large
class of other problems [118]. With the misinterpretation of such a result, research funding and interest in ANN drastically dropped in the following 1970s. In the dark ages of the 1970s, the dedicated and everlasting efforts of Amari [7, 8, 9], Anderson [10], Fukushima [39], Grossberg [51, 52, 53], Kohonen [84, 85], and many other researchers ultimately brought in the renascence of ANN in 1980s.

The tremendous resurgence of ANN research interest in 1980s was mainly due to the invention of Hopfield networks [75] which can serve as content-addressable memory or solve combinatorial optimization problems [76], and the introduction of Backpropagation learning algorithm [156] which overcomes the limitation of perceptron's single-layer learning algorithm in linearly separable problems and can be exploited to train multi-layer perceptron to solve nonlinearly separable problems. Since then, Backpropagation multi-layer perceptron has been successfully applied in a variety of applications and has became the most widely used neural network paradigm. The Backpropagation learning algorithm were independently derived by Werbos [193], Parker [138, 139], and LeCun [98], but its popularity was mainly due to the effort of Rumelhart, McClelland, and the PDP Group. Other representative ANN models in the bright 1980s, to name a few, include Hinton, Sejnowski, and Ackley's Boltzmann machine model [1, 67] which can be used to find the global optimum solution for a given problem; Kohonen's Self-Organizing Feature Map [86] which can be trained without supervision to find the organization of relationships among training patterns; Kosko's BAM [89, 90] which can serve as hetero-associative memory and temporal associative memory; Carpenter and Grossberg's ART networks [16, 17, 18] which can be typically used to cluster training patterns via unsupervised training; Radial Basis Function method [114, 146, 147] which was originally used for function interpolation and was also applied to other applications [128, 149]; Hecht-Nielsen's Counterpropagation network [64] which has both supervised as well as unsupervised training stages and can be trained to perform pattern mapping, data compression and associative recall; Fukushima, Miyake, and Ito's Neocognitron [40, 41] which can be trained with supervision to recognize handwritten characters; and recurrent neural networks [33, 81, 144] which allow recursive processing on input string of variable length. A more detailed taxonomy of most
neural network architectures and learning algorithms can be found in [56, 102, 112].

1.3 An Overview of the Dissertation

Artificial neural networks, due to their inherent parallelism, potential for fault tolerance, and adaptation through learning, offer an attractive computational paradigm for a variety of applications in computer science and engineering, artificial intelligence, robotics, and cognitive modeling. Despite the success in the application of ANN to a broad range of numeric tasks in pattern classification, control, function approximation, and system identification, the integration of ANN and symbolic computing is only beginning to be explored [22, 23, 47, 70, 72, 73, 99, 145, 172, 179, 185] and is currently viewed as one of important research goals in massively parallel computing and artificial intelligence [65].

Pattern-directed associative processing relies on associative pattern matching and retrieval, is central to many problem solving paradigms in AI (e.g., knowledge based expert systems, case based reasoning) as well as computer science (e.g., database query processing, information retrieval) [54, 97, 181], and dominates the computational requirements of many applications in AI and computer science [55, 97, 127]. This dissertation proposes methods to systematically design massively parallel architectures for pattern-directed symbol processing using neural associative memories as key components. In particular, we propose neural architectures for content-based as well as address-based data storage and recall, database query processing, elementary logical inference, sequence processing, and syntax analysis. Their potential advantages over conventional serial computer implementations of the same functions are examined in the dissertation.

Chapter 2 proposes an approach for the design of a neural memory which supports both content-based (associative) and address-based data storage and retrieval. The proposed neural associative memory allows efficient access of stored data by way of massively parallel best match, partial match and exact match. When used as a content-addressed memory, the proposed neural memory supports recall from partial input patterns, (sequential) multiple recalls, fault-tolerance, precision control and sorted extraction of all stored memory patterns. When
used as an address-based memory, the memory module can provide working space for dynamic representations for symbol processing and shared message-passing among neural network modules within an integrated neural network system. It also provides for real-time update of memory contents by one-shot learning without interference with other stored patterns.

The pattern matching and retrieval process in the proposed neural associative memory which provides massive communication bandwidth and processing units respectively via its massive connections and nodes to match a given pattern with all stored patterns in parallel within one step can be far more efficient (in terms of computation time) than that in a key-based organization of the sort used in conventional computer systems. Chapter 3 takes advantage of this fact to explore the potential benefits of the proposed neural associative memory in the implementation of efficient, noise-tolerant information retrieval and query module in large database systems.

Since most of current digital computer systems store data using address-based memories which are accessed via shared buses, the retrieval of a desired data item satisfying certain criteria (patterns) from a set of candidate data items stored in the memories is inherently sequential and requires certain data organization, which is manipulated and interpreted by a relatively complex program(s), to provide appropriate performance. Although parallel pattern matching can be achieved by current digital computer systems when the systems are provided with multiple processors and memory buses, it would not be cost effective to dedicate such systems to applications which mainly involve intensive pattern matching. The proposed neural associative memory is a cost effective SIMD computer dedicated to pattern association. Therefore, such SIMD capability of the proposed neural associative memory is further explored for relational database queries. The potential merits of ANN's inherent parallelism and noise-tolerance for database query processing are demonstrated by comparing the estimated performance of the proposed neural architecture with that of other techniques commonly used in conventional computer systems for database query processing.

Chapter 4 explores how neural architectures for binary pattern recognition can be extended for elementary logical inference. The proposed neural assemblies for propositional logic are
based on geometrical/mathematical analysis. Logical operations such as AND and OR are realized by neural assemblies for the recognition of binary subpatterns. It is known that any proposition (or equivalently a Boolean function) can be represented in DNF, and hence can be realized by a 2-layer neural architecture assembled using the proposed AND and OR neural assemblies. Since logical AND, logical OR, as well as DNF representation are essential to logical inference and Boolean functions are basic to many applications in science and engineering, we expect the proposed neural assemblies would find use in the construction of modular neural networks for a variety of applications. For instance, Chapter 5 illustrates their use in an neural architecture for sequence processing.

Chapter 5 proposes methods for systematic design of neural architectures for sequence processing, which are used as building blocks to systematically assemble neural architectures for syntax analysis in Chapter 6. Basically, memories and sequence processing mechanisms (with flow control capability) compose current digital computer systems which are driven by sequences of binary codes which are translated from sequences of symbolic program representation that humans can efficiently and effectively read, write, and reason on. Therefore, a computing system integrated from the proposed neural architectures for memories and sequence processing is expected to possess computation capability corresponding to that of current digital computer systems.

Chapter 6 explores the advantages of ANN's inherent parallelism and associative processing capability in the design of modular neural architectures for syntax analysis using a pre-specified grammar — a prototypical symbol processing task. A more general goal of this chapter is to explore the systematical design of massively parallel architectures for symbol processing using the neural associative memory proposed in Chapter 2 and the neural architectures for sequence processing proposed in Chapter 5 as key components.

Since each component in the proposed neural architectures for syntax analysis computes a well-defined symbolic function, it facilitates the systematic synthesis as well as analysis of the resulting symbolic computation at a fairly abstract (symbolic) level. This facilitates rapid design and test of other provably correct prototypes of modular neural architectures for
complex symbolic processing using simpler building blocks by way of recursion, composition of elementary symbolic functions, and data representation manipulated by them. The elementary symbolic functions are represented in terms of binary mappings which are realized provably correctly by basic neural modules using one-shot learning.

Chapter 6 concludes with a summary of the key contributions of this dissertation.
2 A NEURAL MEMORY FOR CONTENT AS WELL AS ADDRESS-BASED STORAGE AND RECALL

2.1 Introduction

This chapter presents an approach to design of a neural architecture for both associative (content-addressed) and address-based memories. Several interesting properties of the memories are mathematically analyzed in detail such that it is known that by systematically adjusting the node thresholds and connection weights, the same proposed neural architecture can serve as memories with precision control to perform best match, exact match and partial match which are main knowledge retrieval techniques extensively used in numerous artificial intelligence systems [191]. When used as an associative memory, the proposed neural architecture supports recall from partial input patterns, (sequential) multiple recalls and fault tolerance. When used as an address-based memory, the memory can provide working space for dynamic representations for symbol processing and shared message-passing among neural network modules within an integrated neural network system. It also provides for real-time update of memory contents by one-shot learning without interference with other stored patterns.

It is generally agreed that artificial neural networks (ANN) have demonstrated success in low-level perceptual tasks (e.g., signal processing, pattern recognition) [62, 93, 111, 113]. However, despite their generality (as computational models) and despite the potential advantages of using them as components in general-purpose artificial intelligence systems which usually involve content-based or memory-based knowledge storing and retrieving [47, 70, 72, 73, 99, 173, 179], detailed design and performance tradeoffs in integrated systems of this sort are yet to be fully understood and working prototypes of such systems are only beginning to be developed. Towards this end, an innovative design and careful analysis of neural associative memories with
emphasis on problems and prospects of integrating them into larger systems that combine the advantages of both traditional symbol processing and neural network approaches to artificial intelligence is needed.

A particular class of neural memories built from threshold logic units (Perceptrons or McCulloch-Pitts neurons) is explored from a geometrical/mathematical perspective in this chapter. This analysis provides mathematical foundations for understanding several interesting properties of such memories including: auto as well as hetero-associative recall from partially specified patterns, (sequential) sorted recall of multiple stored patterns with different degrees of match with an input pattern, incremental learning, fault tolerance, and address-based storage and recall (mimicking the behavior of memories used in conventional digital computers). The mathematical analysis also suggests efficient hardware realizations of such memories. This chapter is organized as follows:

• Section 2.1 reviews associative memory, address-based memory, and key properties of multi-layer Perceptrons which form the basis of the proposed neural memories.

• Section 2.2 develops the theoretical foundations and examines the storage capacity of the proposed binary/bipolar neural memories through an investigation of the spatial distribution and linear separability of vertices in binary/bipolar hypercubes from a geometric perspective.

• Section 2.3 explores several interesting properties of the proposed memory modules including: recall from partially specified input patterns, (sequential) multiple recalls, and fault tolerance by examining and extending the physical meanings of the settings of connection weights and neuron thresholds in the proposed neural memories.

• Section 2.4 concludes with a summary of the chapter and a brief discussion of related researches.

2.1.1 Information retrieval and binary mapping

In general, most classification and information retrieval problems using discrete input/output values can be viewed in terms of a binary random mapping \( f_t \), where \( f_t \) is rigidly defined from
a set $U$ of $k$ distinct binary input vectors $u_1, ..., u_k$ of dimension $n$ to a set $V$ of $k$ binary output vectors $v_1, ..., v_k$ of dimension $m$ such that $f_I : U \rightarrow V$ and

$$f_I(u_i) = v_i \text{ for } 1 \leq i \leq k \quad (2.1)$$

Note that $f_I$ is a partial function.

2.1.2 Associative memory (Content-addressed memory)

Since the resurgence of ANN in 1980s, ANN have been applied in many science and engineering disciplines. This is shown by the explosive growth in the number of published technical papers on ANN in 1980s and 1990s. In particular, neural architectures for associative memories have been the subject of considerable research, because of their potential applications in several areas of artificial intelligence, computer science, and cognitive modelling.

The term associated memory (AM) or content-addressed memory refers to a memory system where recall of a stored pattern is accomplished by providing a noisy or partially specified input pattern. Examples of such memory models include Hopfield networks [75], correlation matrix memories [84], bidirectional associative memories [90], among others [9, 59, 91, 125]. A precise definition of binary/bipolar associative memories follows:

Let $D_H(u, u')$ denote the Hamming distance between binary (bipolar) vectors $u$ and $u'$. Hamming distance is the number of bits that differ between two binary (bipolar) vectors. Suppose we are given a set $U$ of $k$ binary input vectors $u_1, ..., u_k$ of dimension $n$ and a set $V$ of $k$ desired binary output vectors $v_1, ..., v_k$ of dimension $m$. Then the task is to design an associative memory that can store each of the input-output pattern pairs.

In many applications, it is useful to be able to control the degree of mismatch that is tolerated during information retrieval. This is accomplished by introducing the concept of precision control in associative memory as follows: Define $U^n_i(p_i) = \{ u | u \in \mathbb{B}^n \& D_H(u, u_i) \leq p_i \}, 1 \leq i \leq k$, i.e., $U^n_i(p_i)$ is the set of $n$-dimensional binary vectors which have Hamming distance less than or equal to $p_i$ away from the given $n$-dimensional binary vector $u_i$, where $\mathbb{B}^n$ is the universe of $n$-dimensional binary vectors, and $p_i$ is called allowable precision level and is an adjustable integer parameter.
Information retrieval in a binary associative memory can be specified in terms of a binary associative mapping \( f_A : U^n \rightarrow V \) as follows:

\[
f_A(x) = v_i \text{ if } x \in U^n_i(p_i), \quad 1 \leq i \leq k \tag{2.2}
\]

where \( U^n = \bigcup_{i=1}^{k} U^n_i(p_i) = U^n_1(p_1) \cup U^n_2(p_2) \ldots \cup U^n_k(p_k) \) and conventionally \( U^n_i(p_i) \cap U^n_j(p_j) = \emptyset \) for \( i \neq j, \quad 1 \leq i, j \leq k \). For example, if such a memory is used to store and recall uppercase English characters, then \( U = V \) and \( u_i = v_i, \quad 1 \leq i \leq 26 \). Suppose the allowable precision levels (i.e., all of the \( p_i \)'s) are set equal to (Hamming distance) 4. Then in Figure 2.1, the noisy input patterns 1 and 2 would result in the recall of the stored memory pattern \( T \). Multiple recalls are possible in the proposed neural memory when \( \exists i \neq j \) such that \( U^n_i(p_i) \cap U^n_j(p_j) \neq \emptyset \) in which case \( f_A \) is a one-to-many mapping. Most conventional associative memory models seldom tackle the problem of multiple recalls.

Note that \( f_I \subseteq f_A \) if functions \( f_I \) (expression 2.1) and \( f_A \) are viewed as sets of input-output ordered pairs of the functions \( f_I \) and \( f_A \) respectively. That is,

\[
f_I = \{(x, f_I(x)) | x \in U\}
\]

\[
f_A = \{(x, f_A(x)) | x \in U^n\}
\]

The partial function \( f_A \) may be extended to a full function \( \hat{f}_A : B^n \rightarrow (V \cup \{<0^m>\}) \) for binary associative (information retrieval) memory as follows:

\[
\hat{f}_A(x) = \begin{cases} 
  f_A(x) & \text{if } x \in U^n \\
  <0^m> & \text{if } x \in (B^n - U^n) 
\end{cases} \tag{2.3}
\]

where \( <0^m> \) is the \( m \)-dimensional binary vector of all zeros and denotes a value which is undefined.

Content-addressed memories can be divided into two categories: auto-associative memories (used primarily for reconstructing a pattern from a noisy or partially specified pattern) and hetero-associative memories which can be used to store associated pattern pairs so that when an input pattern is provided, the associated pattern is retrieved. The types of pattern associations that can be stored in neural associative memories depend on various factors such as: the choice of neural network architecture, the choice of activation functions computed by the neurons,
and the algorithm used to set up the parameters (thresholds and weights) associated with the neurons and connections. Thus, a linear associative memory with \( n \) input neurons can store and recall perfectly at most \( n \) pattern associations. Similar storage capacity results are known for several content-addressed memory models such as the Hopfield network [75, 109], bi-directional associative memories [90], correlation matrix memories [84], etc. A variety of associative memory models are discussed in [62, 93]. As already pointed out, many simple content-addressed memory models studied in the literature are incapable of stable storage and recall of associations between arbitrary pairs of patterns (except under certain restricted circumstances). In such models, whether a pattern can be associated with another critically depends on how the two patterns are coded as bit vectors as well as on all the other pattern associations that have already been stored in memory. The ability to reliably store and recall associations between arbitrary patterns is regarded by many to be a prerequisite for higher level cognitive activity (e.g., logical inference) [35]. The associative memory model proposed in this chapter is designed to reliably store and recall associations between arbitrary pairs of patterns.
2.1.3 Address-based memory

Address-based memory is extensively used for storing both data as well as programs in current computer systems. In cognitive models and artificial intelligence programs based on Von Neumann model of computation, i.e., models within the so-called symbolic paradigm [126], address-based memory often serves as the working memory (or scratch-pad) for storing intermediate results during the execution of a program. On the surface, storage and recall of patterns using addresses appear to be very different in spirit from the recall of patterns based on their content (as judged by its similarity to a stored pattern). Indeed, many authors have suggested this to be a primary difference between neural networks (or connectionist models) and traditional artificial intelligence systems. However, this perceived difference is rather superficial given the demonstrable Turing-equivalence of sufficiently powerful neural network models [69, 70]. Therefore, it is rather straightforward to design neural memories capable of address-based storage and recall of patterns as the following discussion illustrates.

A mathematical model for information retrieval in address-based memory can be formulated in terms of a binary random mapping $f_I$ (expression 2.1) by extending the partial function $f_I$ to a full function $\hat{f}_I : \mathbb{B}^n \to (V \cup \{<0^m>\})$ for address-based (information retrieval) memory as follows:

$$\hat{f}_I(x) = \begin{cases} f_I(x) & \text{if } x \in U \\ <0^m> & \text{if } x \in (\mathbb{B}^n - U) \end{cases} \quad (2.4)$$

$\hat{f}_I$ maps from the set of $n$-bit binary addresses to the set of $m$-bit binary values. The retrieved value (or content of a memory address) is undefined if no pattern has been stored at the corresponding address.

It is well known (in the literature on the design of memory systems for digital computers) that this approach to address-based memory design is not necessarily the most efficient for large address spaces. In this case, hierarchical memory organization using multiple levels of address decoding and multiple memory modules of the type specified above is a more practical alternative [171].
2.1.4 Perceptrons

A 1-layer Perceptron has \( n \) input neurons, \( m \) output neurons and one layer of connection weights. The output \( y_i \) of output neuron \( i \) is given by

\[
y_i = f_H(\sum_{j=1}^{n} w_{ij} x_j - \theta_i)\]

where \( w_{ij} \) denotes the weight on the link from input neuron \( j \) to output neuron \( i \), \( \theta_i \) is the threshold of output neuron \( i \), \( x_j \) is input value at input neuron \( j \), and \( f_H \) is binary hardlimiter function, where

\[
f_H(x) = \begin{cases} 
1 & \text{if } x \geq 0 \\
0 & \text{otherwise}
\end{cases}
\] (2.5)

It is well known that such a 1-layer Perceptron can implement only linearly separable functions from \( \mathbb{R}^n \) to \( \{0,1\}^m \) \([118]\). We can see the connection weight vector \( w_i = <w_{i1}, \ldots, w_{in}>^T \) and the node threshold \( \theta_i \) as defining a linear hyperplane \( H_i \) which partitions the \( n \)-dimensional pattern space into two half-spaces, where \( [\cdot]^T \) denotes the transpose of a vector or a matrix.

A 2-layer Perceptron has one layer of \( k \) hidden neurons (and hence two layers of connection weights with each hidden neuron being connected to every input neuron as well as every output neuron). In this chapter, we use 2-layer Perceptron in which each hidden neuron uses binary hardlimiter function \( f_H \) as activation function. The output of output neuron \( i \) is given by

\[
y_i = f(\sum_{l=1}^{k} w_{il} z_l - 1); \quad z_l \text{ is the output of hidden neuron } l, \ f \text{ is binary hardlimiter function } f_H \text{ in the model using binary output, and } f \text{ is bipolar hardlimiter function } \tilde{f}_H \text{ in the model using bipolar output. (The thresholds of all output neurons are set to 1). The bipolar hardlimiter function } \tilde{f}_H \text{ is defined as}
\]

\[
\tilde{f}_H(x) = \begin{cases} 
1 & \text{if } x \geq 0 \\
-1 & \text{otherwise}
\end{cases}
\] (2.6)

2.2 Multi-layer Perceptrons as Neural Memories

This section describes the synthesis of a binary address-based memory or a binary associative memory using a 2-layer Perceptron. The binary address-based memory has a storage capacity of \( N = 2^n \) while the binary associative memory has a storage capacity \( N = \)
\[ \frac{2^n}{\sum_{i=0}^{p} C(n, i)} \], where \( n \) is the number of input neurons and \( p \) is the adjustable precision level (allowable noise level) measured in terms of Hamming distance. A hidden neuron is used for each stored associative pair of input and output patterns. The numbers of input and output neurons are fixed in these models. In the case of associative memory, this amounts to fixing the dimensionality of input and output patterns; while in the address based memory, it is tantamount to fixing the maximum size of the address space and the dimensionality of the patterns stored in memory.

2.2.1 The application of linear separability of binary vertices in pattern classification

Note that every \( n \)-dimensional binary vector is a binary vertex of an \( n \)-dimensional hypercube. Hereafter, we will use the terms binary vertex and binary vector interchangeably. The following theorem and its proof facilitate the systematic synthesis of the proposed neural memories.

**Theorem 2.1:** Let \( u \) be a binary vector of dimension \( n \), i.e., \( u = < u_1, ..., u_n >^T \) where \( u_i \in \{0, 1\} \) for \( 1 \leq i \leq n \). Let \( \overline{u} = < \overline{u}_1, ..., \overline{u}_n >^T \) be the complement of the binary vector \( u \). That is, \( u_i + \overline{u}_i = 1 \) for \( 1 \leq i \leq n \). Let \( u - \overline{u} = u^{ref}_u = < u^{ref}_1, ..., u^{ref}_n >^T \). Note that \( u^{ref}_i \in \{1, -1\} \) for \( 1 \leq i \leq n \). Let us call \( u^{ref}_u \) the reference vector. Let \( S^u_p \) be the set of \( n \)-dimensional binary vertices which are at a Hamming distance \( p \) away from vertex \( u \), \( 0 \leq p \leq n \).

Then every binary vertex \( x \in S^u_p \) falls on an \( n \)-dimensional linear hyperplane \( H^{n,u}_p \) which is perpendicular to the reference vector \( u^{ref}_u \). Furthermore, if \( H^{n,u} = \{ H^{n,u}_p ; 0 \leq p \leq n \} \), the \( n \)-dimensional linear hyperplanes in \( H^{n,u} \) are mutually parallel.

**Proof:** Let \( x \) be a binary vertex in \( S^u_p \), \( x - \overline{u} = x^{ref}_u = < x^{ref}_1, ..., x^{ref}_n >^T \) and \( l^u_x \) be the length of the projection of \( x^{ref}_u \) onto the reference vector \( u^{ref}_u \). Note that \( x^{ref}_i = 0 \) or 1 if \( u^{ref}_i = 1 \), and \( x^{ref}_i = 0 \) or \( -1 \) if \( u^{ref}_i = -1 \) for \( 1 \leq i \leq n \). Note also that there are \( p \) components \( x^{ref}_i \) of \( x^{ref}_u \) such that \( x^{ref}_i = 0 \) and \( (n - p) \) components \( x^{ref}_j \) of \( x^{ref}_u \) such that \( x^{ref}_j = 1 \) or \( -1 \), where \( 1 \leq i, j \leq n \). Let \( \| \cdot \| \) denote the length of a vector. Then

\[
\frac{l^u_x}{\|u^{ref}_u\|} = \frac{1}{\|u^{ref}_u\|}(u^{ref}_u)^T x^{ref}_u
\]

(2.7)
Thus, \( V \in S_p \), the length of the projection of \( x^{ref_u} \) onto the reference vector \( u^{ref_u} \) is 
\[(n - p)/\sqrt{n} \]. That is, all binary vertices in \( S_p \) lie on the same \( n \)-dimensional linear hyperplane \( H_p^{n,u} \) which is perpendicular to the reference vector \( u^{ref_u} \) and located at a distance of \((n - p)/\sqrt{n}\) from the vertex \( u \), that is, a distance \( p/\sqrt{n} \) to vertex \( u \). Hence, every hyperplane \( H_p^{n,u} \in H^{n,u} \)

\[0 \leq p \leq n\], is parallel to every other hyperplane in \( H^{n,u} \). There are \( n+1 \) such mutually parallel linear hyperplanes \( H_p^{n,u} \)'s (Figure 2.2). Among them, \( u \) is on \( H_0^{n,u} \) and \( \bar{u} \) is on \( H_n^{n,u} \). The hyperplanes have same normal vector \((u - \bar{u})/\sqrt{n} \).

The expression defining the \( n \)-dimensional linear hyperplane \( H_p^{n,u} \), \(0 \leq p \leq n\), is

\[
H_p^{n,u} \equiv \left( \sum_{i=1}^{n} (2u_i - 1)x_i \right) - \left( \sum_{i=1}^{n} u_i - p \right) = 0
\]  
(2.12)

which can be derived as follows. Let \( x \) be a binary vertex on hyperplane \( H_p^{n,u} \) where \(0 \leq p \leq n\). From expressions 2.7 and 2.11, we have:

\[
I_x = \frac{1}{\|u^{ref_u}\|}(u^{ref_u})^T x^{ref_u} = \frac{1}{\sqrt{n}}(u - \bar{u})^T (x - \bar{u}) = \frac{1}{\sqrt{n}}(n - p)
\]  
(2.13)

Thus,

\[(u - \bar{u})^T (x - \bar{u}) = (n - p)
\]  
(2.14)

So the defining expression of the hyperplane \( H_p^{n,u}, 0 \leq p \leq n \), is given by:

\[
H_p^{n,u} \equiv (u - \bar{u})^T (x - \bar{u}) = (n - p)
\]  
(2.15)

\[
\equiv (u - \bar{u})^T x - (u - \bar{u})^T \bar{u} - (n - p) = 0, \quad \text{note } u^T \bar{u} = 0
\]  
(2.16)

\[
\equiv (u - \bar{u})^T x - (n - ||\bar{u}||^2 - p) = 0, \quad \text{note } ||u||^2 + ||\bar{u}||^2 = n
\]  
(2.17)

\[
\equiv (u - \bar{u})^T x - (||u||^2 - p) = 0
\]  
(2.18)
Figure 2.2 The spatial distribution of a 3-dimensional and an n-dimensional binary hypercubes

\[
\begin{align*}
\sum_{i=1}^{n} (2u_i - 1)x_i - (\|u\|^2 - p) &= 0 \\
\sum_{i=1}^{n} (2u_i - 1)x_i - (\|u\|^2 - p) &= 0 \\
\sum_{i=1}^{n} (2u_i - 1)x_i - (\sum_{i=1}^{n} u_i - p) &= 0
\end{align*}
\]

Note that \(\|u\|^2 = \sum_{i=1}^{n} u_i\) since \(u\) is a binary vector. From above, it is known that the expressions defining the \(n+1\) \(n\)-dimensional mutually parallel linear hyperplanes \(H_p^{n,u}\)’s, \(0 \leq p \leq n\), have same coefficients but different constant terms. Every hyperplane \(H_p^{n,u}\), where \(0 \leq p \leq n\), can serve as a linear separating hyperplane to partition all \(n\)-dimensional binary vertices into two sets. Such a linear separating hyperplane can be efficiently implemented for 2-class pattern classification by a 1-layer Perceptron with one output neuron. The output neuron has a threshold of \(\sum_{i=1}^{n} u_i - p\) and the connection weight on the link from input neuron \(i\) is given by \(2u_i - 1(= u_i - \overline{u_i})\) for \(1 \leq i \leq n\), where \(n\) is the number of input neurons.

When the separating hyperplane \(H_p^{n,u}\) is realized by a 1-layer, 1-output Perceptron, the value of \(2u_i - 1\) is either 1 or \(-1\), \(x_i\) is either 1 or 0, and \((2u_i - 1)x_i\) can therefore be 1, 0 or \(-1\); and \(\sum_{i=1}^{n} u_i\) is integer. Also note that the maximum activation of the 1-layer Perceptron
is $p$ and minimum value is $-(n - p)$; since the separating hyperplane $H_p^{n,u}$ is defined as $(u - \bar{u})(x - \bar{u})^T - (n - p) = 0$, the maximum value of $(u - \bar{u})(x - \bar{u})^T$ is $n$ when $x = u$, and the minimum value of $(u - \bar{u})(x - \bar{u})^T$ is $0$ when $x = \bar{u}$.

### 2.2.2 Best match: pattern classification with precision control

Since each binary vertex of dimension $n$ on hyperplane $H_p^{n,u}$ is Hamming distance $p$ away from vertex $u$, there are $N_p = C(n, p) = \frac{n!}{(n-p)!p!}$ such binary vertices of dimension $n$ on hyperplane $H_p^{n,u}$, where $0 \leq p \leq n$. The **separating hyperplane** $H_p^{n,u}$ partitions all the binary vertices of a binary $n$-hypercube into two sets. One set contains $N_A = \sum_{i=0}^{p} N_i = \sum_{i=0}^{p} C(n, i)$ binary vertices that are at a Hamming distance less than or equal to $p$ away from vertex $u$, and the other contains $N_B = \sum_{i=p+1}^{n} N_i = \sum_{i=p+1}^{n} C(n, i)$ binary vertices that are at a Hamming distance more than $p$ away from vertex $u$. Let us call the former partition the **associative partition** (denoted by $\alpha_p^u$) of $u$, the vertex $u$ the **center** of that associative partition, and $p$ the **radius** of the associative partition. Note that both $H_p^{n,u}$ and $\alpha_p^u$ are defined by the given binary exemplar pattern $u$ and its precision level $p$.

In theory, an $n$-hypercube can be almost equally partitioned by $N = \lceil 2^n / \sum_{i=0}^{n} C(n, i) \rceil$ such associative partitions as $\alpha_p^u$ with each associative partition containing $\sum_{i=0}^{p} C(n, i)$ $n$-dimensional binary vertices which are all at a Hamming distance less than or equal to $p$ away from their corresponding partition center. The partition centers correspond to the given binary exemplar patterns.

We say that an associative partition $\alpha_{p_i}^{\nu_i}$ is not **isolated** from another associative partition $\alpha_{p_j}^{\nu_j}$ ($i \neq j$) if $\alpha_{p_i}^{\nu_i} \cap \alpha_{p_j}^{\nu_j} \neq \emptyset$. Thus, if two associative partitions are not isolated from each other, they overlap and as a result, there is at least one binary vector that is a member of both partitions. The separating hyperplanes (or equivalently, associative partitions) can be implemented in a 1-layer Perceptron with $N$ output neurons to recognize $N$ patterns with precision level (allowable noise level) up to Hamming distance $p_i$ for exemplar pattern $\nu_i$, $1 \leq i \leq N$, provided the precision levels $(p_i, s)$ are chosen to ensure that each associative partition is **isolated** from every other. When $x \in \alpha_{p_i}^{\nu_i}$ is fed into the 1-layer Perceptron, the
output neuron that corresponds to the separating hyperplane $H_{p_i}^n$ is activated to produce an output of 1. In order to ensure that the associative partitions corresponding to two exemplar patterns $\nu_i$ and $\nu_j$ are isolated from each other, $D_H(\nu_i, \nu_j)$ has to be greater than $(p_i + p_j)$ where $p_i$ and $p_j$ are the allowable precision levels. Otherwise, the associative partitions of $\nu_i$ and $\nu_j$ would overlap with each other, and when an input pattern $x$, where $D_H(x, \nu_i) \leq p_i$ and $D_H(x, \nu_j) \leq p_j$, is fed into the 1-layer Perceptron, the output neurons for the two exemplar patterns $\nu_i$ and $\nu_j$ will produce 1 as their outputs. In this case, the input pattern $x$ cannot be unambiguously classified as it falls in the region of overlap between the associative partitions $Q_{p_i}$ and $Q_{p_j}$.

2.2.3 Storage capacity

Suppose input patterns are 10x10 arrays of binary pixels (see Figure 2.1). Then 100 input neurons are required to implement such a 1-layer Perceptron for pattern classification. The number of possible input patterns is $2^{100} \approx 10^{30}$. An output neuron is needed for each distinct exemplar pattern. Table 2.1 shows the corresponding maximal storage capacity of the 1-layer Perceptrons designed for a range of different allowable noise levels. Table 2.1 also suggests that a 1-layer Perceptron with $n$ input neurons has very high storage capacity for classifying binary patterns and that the allowable precision (noise) levels of less than 30% are desirable for reliable classification.

2.2.4 Synthesis of associative and address-based memories

Given a set $U$ of $k$ distinct binary input vectors $u_1, ..., u_k$ of dimension $n$, where $u_i = \langle u_{i1}, ..., u_{in} \rangle^T$ and $u_{ig} \in \{0, 1\}$ for $1 \leq i \leq k$ & $1 \leq g \leq n$; and a set $V$ of $k$ desired binary (bipolar) output vectors $v_1, ..., v_k$ of dimension $m$, where $v_i = \langle v_{i1}, ..., v_{im} \rangle^T$ and $v_{ih} \in \{0, 1\}$ (or $\{-1, 1\}$) for $1 \leq i \leq k$ & $1 \leq h \leq m$. Assume the Hamming distance between any two binary vectors in $U$ is at least $2p + 1$, where $p \in N$. This ensures that all associative partitions would be isolated with the precision level being set at $p$.

We can now design a neural architecture for information retrieval using address-based
Table 2.1  The corresponding maximal storage capacity of a 1-layer Perception with 100 input neurons for classifying binary patterns for a range of allowable noise levels

<table>
<thead>
<tr>
<th>allowable noise</th>
<th>maximal capacity</th>
</tr>
</thead>
<tbody>
<tr>
<td>0%</td>
<td>$N = \frac{2^{100}}{\sum_{i=0}^{100} C(100, i)} \approx 1.0 \times 10^{30}$</td>
</tr>
<tr>
<td>10%</td>
<td>$N = \frac{2^{100}}{\sum_{i=0}^{100 \times 0.1} C(100, i)} \approx 5.0 \times 10^{16}$</td>
</tr>
<tr>
<td>20%</td>
<td>$N = \frac{2^{100}}{\sum_{i=0}^{100 \times 0.2} C(100, i)} \approx 1.4 \times 10^9$</td>
</tr>
<tr>
<td>30%</td>
<td>$N = \frac{2^{100}}{\sum_{i=0}^{100 \times 0.3} C(100, i)} \approx 2.4 \times 10^4$</td>
</tr>
<tr>
<td>40%</td>
<td>$N = \frac{2^{100}}{\sum_{i=0}^{100 \times 0.4} C(100, i)} \approx 38$</td>
</tr>
<tr>
<td>50%</td>
<td>$N = 2$</td>
</tr>
</tbody>
</table>

memory, denoted by function $\tilde{f}_I$ (defined by expression 2.4), or associative memory, denoted by function $\tilde{f}_A$ (defined by expression 2.3). For this purpose, a memory module of a 2-layer Perceptron can be synthesized using the 1-layer Perceptrons, proposed for pattern classification in Section 2.2.1., as follows:

The memory module (using binary input) has $n$ input, $k$ hidden, and $m$ output neurons. For each associative ordered pair $(u_i, u_h)$, where $1 \leq i \leq k$, we create a hidden neuron $i$ with threshold $\sum_{j=1}^{n} u_{ij} - p_i$ (see Figure 2.3), where $p_i \in \mathbb{N}$ and $p_i \leq p$ is the adjustable precision level for that associative pair. The connection weight from input neuron $g$ to hidden neuron $i$ is $2u_{ig} - 1 = (u_{ig} - \overline{u}_{ig})$ and that from hidden neuron $i$ to output neuron $h$ is $u_{ih}$. The threshold for each of the output neurons is set to 1. The activation functions at hidden neurons are binary hardlimiter function $f_H$. The activation functions at output neurons are binary hardlimiter function $f_H$ (expression 2.5) if the desired output of output neurons is binary. The activation functions at output neurons are bipolar hardlimiter function $\tilde{f}_H$ (expression 2.6) if the desired output of output neurons is binary.

Since input is binary, the weights in the 1st-layer connections of the memory module are either 1 or -1. A bit of an input pattern that is wrongly on (with respect to a stored pattern), contributes -1 to the activation of the corresponding hidden neuron and a bit of an input pattern that is rightly on (with respect to a stored pattern) contributes +1 to the activation of the corresponding hidden neuron. A bit of an input pattern that is (rightly or wrongly) off (with respect to a stored pattern) contributes 0 to the activation of the corresponding
Figure 2.3  The setting of connection weights and hidden node threshold in the proposed neural memory (a 2-layer Perceptron with binary input) for a given associated memory pair.

hidden neuron. Each hidden neuron sums up the contributions to its activation from its 1st-layer connections, compares the result with its threshold (which equals the number of 1 in the stored memory pattern minus its desired precision level), and produces output value 1 if its activation exceeds or equals its threshold. If one of the hidden neurons is turned on, one of the stored memory patterns will be recalled by that hidden neuron. Note that an input pattern is matched against all the stored memory patterns in parallel. If the time delay for computing the activation at a neuron is fixed, the time complexity for such a pattern matching process is $O(1)$. Note that this is attained at the cost of a hidden neuron (and its connections) for each stored association.

Since all the associative partitions are isolated from each other, when the memory module is presented with a binary input vector $x \in \alpha_p^{m}$, only the hidden neuron $i$ produces an output of 1 and the output values from all other hidden neurons are 0. So the value at output neuron $j$ is $v_{ij}$, and hence the output binary vector will be $< v_{i1}, ..., v_{im} >^T = v_i$. Since for each memory association pair a hidden neuron is created and its creation or deletion is independent
of other stored associative pairs, this particular design of associative memory lends itself to rapid *one-shot incremental learning* with no interference with previously stored associations.

It is also worth pointing out that exactly the same network architecture can be used to realize both associative as well as address-based memory. If $p_i$ is set as 0, $|\alpha_{p_i}^u| = 1$ and the memory module functions as an address-based memory when $2^n$ hidden neurons are used to resolve all possible addresses; and if $1 \leq p_i \leq p$, $|\alpha_{p_i}^u| > 1$ and it can be used as an associative memory with adjustable precision control. ($|A|$ denotes the cardinality of a set $A$). Address-based memory, extensively used in current computer system, can serve as working space of dynamic representations for symbol processing and shared message-passing space among neural network modules in an integrated neural network system. As working space for symbol manipulation, neural memories have to allow run-time update without learning and do not degrade when the number of stored memory patterns increases. Note that the proposed neural address-based memory has these two properties.

### 2.2.5 Exact match: binary mapping Perceptron (BMP) module

Let $U$ be a set of $k$ distinct binary input vectors $u_1, ..., u_k$ of dimension $n$, where $u_i = \langle u_{i1}, ..., u_{in} \rangle^T$ and $u_{ig} \in \{0, 1\}$ for $1 \leq i \leq k$ and $1 \leq g \leq n$; and $V$ be a set of $k$ desired binary output vectors $v_1, ..., v_k$ of dimension $m$, where $v_i = \langle v_{i1}, ..., v_{im} \rangle^T$ and $v_{ih} \in \{0, 1\}$ for $1 \leq i \leq k$ and $1 \leq h \leq m$. Consider a binary mapping function $f_{BMP} : B^n \rightarrow (V \cup \{<0^m>\})$ defined as follows:

$$f_{BMP}(x) = \begin{cases} v_i & \text{if } x = u_i, 1 \leq i \leq n \\ <0^m> & \text{if } x \in (B^n - U) \end{cases}$$

(2.23)

where $B^n$ is the $n$-dimensional binary space. A BMP module for the binary mapping function $f_{BMP}$ can be synthesized using a 2-layer Perceptron as follows: The BMP module (see Figure 2.4) has $n$ input, $k$ hidden and $m$ output neurons. For each binary mapping ordered pair $(u_i, v_i)$, where $1 \leq i \leq k$, we create a hidden neuron $i$ with threshold $\sum_{j=1}^n u_{ij}$. The connection weight from input neuron $g$ to hidden neuron $i$ is $2u_{ig} - 1 (= u_{ig} - u_{ig})$ and that from hidden neuron $i$ to output neuron $h$ is $v_{ih}$. The threshold for each of the output neurons is set to 1. The activation functions at hidden and output neurons are binary hardlimiter function $f_H$. 
Figure 2.4  The settings of connection weights and hidden node threshold in the proposed BMP module for an associated binary mapping ordered pair.

Note that for the binary input vector \( u_i \), only the hidden neuron \( i \) outputs a 1, and the rest of the hidden neurons output 0. Thus the output of the \( j \)th output neuron is \( v_{ij} \), and so the binary output vector is \(< u_{i1},..., u_{im} >= v_i \). While for an input vector \( x \notin U \), no hidden neuron is activated and the output is \(<0^m> \).

2.2.6 Conversion between memory models using bipolar and binary inputs

Much of the analysis in previous subsections assumed binary input patterns. It turns out that the use of bipolar instead of binary inputs simplifies the implementation of the proposed associative memory design especially when recall from partially specified input patterns is desired (see Section 2 for details). This subsection explores the relationship between memory models using binary and bipolar inputs and the conversion between the two.

The spatial distribution and geometrical characteristics of bipolar vertices in a bipolar hypercube is very similar to those of binary vertices in a binary hypercube, except that the distance between any two bipolar vertices is 2 times of that between their corresponding binary.
vertices. Given a bipolar vertex \( u \), there also exist \( n+1 \) mutually parallel linear hyperplanes which have similar features described in Theorem 2.1. The expression defining \( H_p^{n,u} \) in an \( n \)-dimensional bipolar hypercube is

\[
H_p^{n,u} \equiv \left( \sum_{i=1}^{n} u_i x_i \right) - (n - 2p) = 0
\]  

(2.24)

which will be derived in the following. All notations here are same as those in Section 2.2.4, with one exception: bipolar vector (vertex) is used in place of binary vector (vertex). In this case \( u_i \in \{-1, 1\} \); \( u_i + \overline{u}_i = 0 \); \( u_i^{refu} \in \{2, -2\} \); \( x_i^{refu} = 0 \) or 2 if \( u_i^{refu} = 2 \), and \( x_i^{refu} = 0 \) or -2 if \( u_i^{refu} = -2 \); there are \( p \) components \( x_i^{refu} \) of \( x^{refu} \) such that \( x_i^{refu} = 0 \), and \( (n - p) \) components \( x_j^{refu} \) of \( x^{refu} \) such that \( x_j^{refu} = 2 \) or -2. Then

\[
I_x^u = \frac{1}{\|u^{refu}\|} (u^{refu})^T x^{refu}
\]  

(2.25)

\[
= \frac{1}{\|u^{refu}\|} \sum_{i=1}^{n} u_i^{refu} x_i^{refu}
\]  

(2.26)

\[
= \frac{1}{\|u^{refu}\|} \left( \sum_{x_j^{refu} = 2 \text{ or } -2}^{n-p} u_j^{refu} x_j^{refu} + \sum_{x_i^{refu} = 0}^{p} u_i^{refu} x_i^{refu} \right)
\]  

(2.27)

\[
= \frac{1}{\|u^{refu}\|} \times 4(n - p)
\]  

(2.28)

\[
= \frac{1}{2 \sqrt{n}} \times 4(n - p)
\]  

(2.29)

\[
= \frac{2}{\sqrt{n}} (n - p)
\]  

(2.30)

From expressions 2.25 and 2.30 we have:

\[
l_x^u = \frac{1}{\|u^{refu}\|} (u^{refu})^T x^{refu} = \frac{1}{2\sqrt{n}} (u - \overline{u})^T (x - \overline{u}) = \frac{2}{\sqrt{n}} (n - p)
\]  

(2.31)

Thus,

\[
(u - \overline{u})^T (x - \overline{u}) = 4(n - p)
\]  

(2.32)

So the hyperplane \( H_p^{n,u} \) is given by

\[
H_p^{n,u} \equiv (u - \overline{u})^T (x - \overline{u}) = 4(n - p)
\]  

(2.33)

\[
\equiv (u - \overline{u})^T x - (u - \overline{u})^T \overline{u} - 4(n - p) = 0
\]  

(2.34)

\[
\equiv (u - \overline{u})^T x - (u^T \overline{u} - \|u\|^2 + 4n - 4p) = 0, \text{ note } u^T \overline{u} = -n
\]  

(2.35)
(u - \overline{u})^T x - (-n - n + 4n - 4p) = 0 \quad (2.36)

\equiv (u - \overline{u})^T x - (2n - 4p) = 0 \quad (2.37)

\equiv (u_1 - \overline{u}_1)x_1 + \ldots + (u_n - \overline{u}_n)x_n - (2n - 4p) = 0 \quad (2.38)

\equiv 2u_1x_1 + \ldots + 2u_nx_n - (2n - 4p) = 0 \quad (2.39)

\equiv u_1x_1 + \ldots + u_nx_n - (n - 2p) = 0 \quad (2.40)

\equiv \left(\sum_{i=1}^{n} u_i x_i\right) - (n - 2p) = 0 \quad (2.41)

Every hyperplane $H_p^{n,n}$, where $0 < p < n$, can serve as a linear separating hyperplane to partition all $n$-dimensional bipolar vertices into two sets. Such a linear separating hyperplane can be efficiently implemented for pattern classification by a 1-layer Perceptron with one output neuron. The output neuron has a threshold of $n - 2p$ and the connection weight on the link from input neuron $i$ is given by $u_i$ for $1 \leq i \leq n$, where $n$ is the number of input neurons.

Since input is bipolar, the connection weight in the 1-layer Perceptron is either 1 or $-1$. The connection weight of 1 matches the corresponding bit of an input pattern if it is on while a connection weight of $-1$ matches the corresponding bit of an input pattern if it is off. A match contributes 1 unit to the activation of the corresponding hidden neuron while a mismatch contributes $-1$ unit. Each hidden neuron sums up the contributions to its activation from each of its input links, compares it with its threshold and activates the corresponding output neuron if the degree of match (similarity measurement) for the entire pattern exceeds or equals the threshold.

Note that the value passed from each connection is either 1 or $-1$, compared to the three values $\{1, 0, -1\}$ in the binary model. This property can further simplify the hardware implementation requirement for a 1-layer Perceptron using bipolar (as opposed to binary) input.

Based on this 1-layer Perceptron and the method described in previous subsections for setting the weights of the second layer connections, the synthesis of a memory module (using bipolar input, see Figure 2.5) of 2-layer Perceptron is rather straightforward given a set of desired pattern associations.
The memory module (using bipolar input) has \( n \) input, \( k \) hidden, and \( m \) output neurons. For each associative ordered pair \((u_i, v_i)\), where \( 1 \leq i \leq k \), we create a hidden neuron \( i \) with threshold \( n - 2p_i \), where \( p_i \in \mathbb{N} \) and \( p_i \leq n \) is the adjustable precision level for that associative pair. The connection weight from input neuron \( g \) to hidden neuron \( i \) is \( w_{ig} \) and that from hidden neuron \( i \) to output neuron \( h \) is \( v_{ih} \). The threshold for each of the output neurons is set to 1. The activation functions at hidden neurons are binary hardlimiter function \( f_H \). The activation functions at output neurons are binary hardlimiter function \( f_H \) if the desired output of output neurons is binary. The activation functions at output neurons are bipolar hardlimiter function \( f_H \) if the desired output of output neurons is binary.

It is worth pointing out that the bipolar associative neural memory model derived here turns out to be exactly equivalent to the memory model proposed by [59] which uses real-value neuron thresholds and proves the effectiveness of the bipolar memory model by algebra based on using Hamming distance as difference measurement between input pattern and memory patterns. In this subsection, the spatial distribution and linear separability of bipolar vertices in a bipolar hypercube is examined from a geometrical perspective to locate a set of mutually parallel linear hyperplanes which respectively separate nicely all the bipolar vertices into two sets to facilitate the design of bipolar neural memories. The linear separating hyperplanes can be efficiently implemented in a 1-layer Perceptron with connection weights and neuron thresholds of integer values.

Some notable differences between the binary and bipolar associative memory models developed above are:

- The binary model uses binary hardlimiter as the activation function at both hidden and output neurons, so does the bipolar model if the associated output is binary. If the associated output is bipolar, binary and bipolar hardlimiters (respectively) are used as activation functions at hidden and output neurons.

- Threshold setting for a hidden neuron in the binary model equals the number of on bits of the corresponding memory pattern minus the desired precision level (measured in Hamming distance), which is not independent of the corresponding memory pattern;
whereas threshold setting for a hidden neuron in the bipolar model equals the number of input neurons minus twice the value of the desired precision level, which is independent of all the memory patterns. This has a special advantage when the associative memory is used to recall a pattern based on a partially specified input (see Section 2 for details).

2.3 Properties of the Proposed Neural Associative Memory

The following three subsections explore and develop mathematical models for several interesting properties of the proposed bipolar neural associative memory including: recall from partially specified input patterns, (sequential) multiple recalls, and fault tolerance. A set $U'$ of $k$ bipolar input vectors $u_1, ..., u_k$ of dimension $n$ and a set $V$ of $k$ desired binary/bipolar output vectors $v_1, ..., v_k$ of dimension $m$ are given. In the discussion that follows, we assume that the $m$-dimensional null pattern (a vector of all 0s in the binary case or a vector of all $-1$s in the bipolar case) is excluded from $V$. 
2.3.1 Partial match: associative recall from a partially specified input

This subsection examines the problem of recall from a partially specified bipolar input pattern. The analysis that follows assumes that the unavailable components of a bipolar partial input pattern have a default value of 0. Thus, a bipolar partial input pattern is completed by filling in a 0 for each of the unavailable components (the 0s as a whole also can serve as noise mask or filter). This makes it possible to handle the problem of associative recall from partially specified input pattern in a manner that is analogous to that of recall from completely specified input pattern. The 1st-layer connections of the neural memory module perform similarity measurements on the available components of a partial input pattern, ignore the similarity measurements on the unavailable components, and pass the similarity measurements to the corresponding hidden neurons to decide whether to activate a corresponding hidden neuron.

Let \( \hat{u} \) be a partially specified \( n \)-dimensional bipolar pattern with the values of some of its components being unknown. Define

- \( \text{bits}(\hat{u}) \): a function which counts the number of components with known values (+1 or -1) of bipolar partial pattern \( \hat{u} \)
- \( \text{pad}0(\hat{u}) \): a function which pads the unavailable bits of bipolar partial pattern \( \hat{u} \) with 0s
- \( \text{u} \odot v \): a binary predicate which tests whether \( \text{"u is a partial pattern of v"} \), where \( \text{"u is a partial pattern of v"} \) means that the values of available components of \( \hat{u} \) are same as those of their corresponding components in \( v \)

For example, let \( \hat{u} = <?, -1, 1, 1, ?> \) be a 5-dimensional partial pattern whose first and fifth components have unspecified values. Then \( \text{bits}(\hat{u}) = 3 \), \( \text{pad}0(\hat{u}) = <0, -1, 1, 1, 0> \) and \( \hat{u} \odot <1, -1, 1, 1, 1> \) is true. (Note that this definition of a partial pattern respects the positions of the components and does not accommodate shifts or translation).

Let \( \hat{D}_H(\hat{u}, \hat{v}) \) denote the Hamming distance between two bipolar partial patterns \( \hat{u} \) and \( \hat{v} \) which have same corresponding unavailable components. If \( \text{bits}(\hat{u}) = j \), \( \text{pad}0(\hat{u}) \) is a padded \( j \)-bit partial pattern derived from partially specified pattern \( \hat{u} \). Define \( \hat{U}_{ji}^{\hat{P}_j} = \{ \hat{u} \mid \text{bits}(\hat{u}) = j \& \hat{u} \odot u_i \} \), \( 1 \leq j \leq n \& 1 \leq i \leq k \). i.e., \( \hat{U}_{ji}^{\hat{P}_j} \) is the set of partial patterns, with \( j \) available bits.
of a bipolar pattern $u_i$. Define $\bar{U}_{p_i}(p) = \{\text{pad}0(\bar{u}) \mid \exists \bar{u}, \bar{u} \in \bar{U}_{p_i} \land D_H(\bar{u}, \bar{u}) \leq \lceil j/n \rceil \times p_i\}$, $1 \leq j \leq n$ & $1 \leq i \leq k$, i.e., $\bar{U}_{p_i}(p_i)$ is the set of padded $j$-bit partial patterns which are at Hamming distance less than or equal to $\lceil j/n \rceil \times p_i$ to any one of the $j$-bit partial patterns of full pattern $u_i$.

Practical applications may place limits on the range of usable settings of $p_i$ (allowable noise level) (see Section 2 for details). It may also be necessary to limit recall from partial input pattern to cases in which a sufficiently large number of bits in the input pattern have available values. For instance, when dealing with patterns of 10x10 pixels, we may set $p_i = 0.3 \times 100 = 30$ and require that at least 40% of the pixels be available in the input pattern.

To simplify matters in what follows, we use the same precision level $p$ for each stored pattern. That is, $p_i = p, 1 \leq i \leq k$. However, note that particular applications may require the use of different values of $p_i$ under different circumstances. For example, punctuation symbols and letters of the alphabet may need different values of $p_i$ for successful recognition in recognizing printed English characters.

Let $\bar{U}_{p_i}^{\infty-n}(p) = \bigcup_{j=c}^{n} \bar{U}_{p_i}(p)$; and $\bar{U}_{p}^{\infty-n}(p) = \bigcup_{i=1}^{k} \bar{U}_{p_i}(p)$. Let $f_p : \bar{U}_{p}^{\infty-n}(p) \to V$ denote the function of recall from padded bipolar partial pattern. Then $f_p$ is defined as follows:

\[ f_p(x) = v_i; \text{ if } x \in \bar{U}_{p_i}^{\infty-n}(p), 1 \leq i \leq k \]  

(2.42)

$f_p$ is a partial function and is extended to a function $\hat{f}_p : \bar{B}^{\infty-n} \to (V \cup \{-1\}^m)$ for recall from padded bipolar partial pattern using associative memory as follows:

\[ \hat{f}_p(x) = \begin{cases} f_p(x) & \text{if } x \in \bar{U}_{p}^{\infty-n}(p) \\ (-1)^m & \text{if } x \in (\bar{B}^{\infty-n} - \bar{U}_{p}^{\infty-n}(p)) \end{cases} \]  

(2.43)

where $\bar{B}^{\infty-n}$ is the universe of $n$-dimensional vectors each of whose components is 1, 0, or $-1$ and which have at least $c$ non-zero components (corresponding to the available bits) and thus at most $(n - c)$ zeros for the unavailable bits (as a result of padding).

It is easy to see that if the 1st-layer connection weights in the bipolar neural memory (described in Section 2) were set up using only a part of a complete memory pattern, the connection weights set for the available components of its partial pattern would be same as those
that would have been obtained if the complete memory pattern were used in establishing the weights. The threshold setting for a hidden neuron in the bipolar memory equals the number of input neurons minus twice the value of the desired precision level, which is independent of all stored memory patterns but depends on the dimensionality of input pattern. Hence the bipolar neural memory module designed for recall from a fully specified input pattern can be used for associative recall from a partially specified input pattern by only adjusting the thresholds of the hidden neurons as follows: multiply the threshold of each hidden neuron by the ratio of the number of available components of a partial input pattern to that of a complete pattern. That is, reduce the threshold of each hidden neuron $i$ from $(n - 2p_i)$ to $(n - 2p_i) \times \frac{n_a}{n}$, where $n_a \leq n$ is the number of the available bits of a partial input pattern.

Note that $p_i$ is the precision level for memory pattern $i$ in the problem of recall from full pattern and $(n - 2p_i) \times \frac{n_a}{n} = n_a - 2(p_i \times \frac{n_a}{n})$ is the new threshold for recall from a partial pattern. The expression for the new threshold is similar to that for old threshold. In the new threshold $n_a$ equals the number of available bits of a partial input pattern and $p_i \times \frac{n_a}{n}$ is the new precision level. In the interest of efficiency of a hardware realization, it is desirable to use $[p_i \times \frac{n_a}{n}]$ or $\lfloor p_i \times \frac{n_a}{n} \rfloor$ as the new precision level, where $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ respectively denote the integer ceiling and floor of a real value.

### 2.3.2 Multiple associative recalls

The memory retrieval process in the neural memory described in Section 2 can be viewed as a two-stage process: identification and recall. During identification of an input pattern, the 1st-layer connections perform similarity measurements and sufficiently activate zero or more hidden neurons so that they produce outputs of 1. The actual choice of hidden neurons to be turned on is a function of the 1st-layer weights, the input pattern, and the threshold settings of the hidden neurons. During recall, if only one hidden neuron is turned on, one of the stored memory patterns will be recalled by that hidden neuron along with its associated 2nd-layer connections. Without any additional control, if multiple hidden neurons are enabled, the corresponding output pattern will be a superposition of the output patterns associated with
each of the activated hidden neurons. With the addition of appropriate control circuitry, this
behavior can be modified to yield sequential recall of more than one stored pattern. This has
a number of practical applications such as information retrieval, database query processing
[23, 24] (see Chapter 3), knowledge-based diagnosis systems, etc. This has the effect of
searching through memory for patterns that are sufficiently close to a given input pattern and
then recall them one after another.

Multiple recalls are possible if some of the associative partitions realized in the memory
module are not isolated (see Section 2 for details). An input pattern (a bipolar vertex) located
in a region of overlap among several partitions is close enough to the corresponding partition
centers (stored memory patterns) at the same time and hence can turn on more than one
hidden neuron. The following explores this phenomenon in more detail.

Define $\hat{U}_i^n(p_i) = \{ u \mid u \in \hat{B}^n & D_H(u, u_i) \leq p_i \}$, $1 \leq i \leq k$, where $\hat{B}^n$ is the universe of
$n$-dimensional bipolar vectors; i.e., $\hat{U}_i^n(p_i)$ to be the set of $n$-dimensional bipolar vectors which
have Hamming distance less than or equal to $p_i$ away from the given $n$-dimensional bipolar
vector $u_i$, where $p_i$ is a specified precision level. Let $p_i = p, 1 \leq i \leq k$.

Define $f_M : \hat{U}^n(p) \rightarrow (2^V - \emptyset)$ as follows:

\[
    f_M(x) = \{ v_i \mid x \in \hat{U}_i^n(p), 1 \leq i \leq k \}
\]

where $\hat{U}^n(p) = \hat{U}_1^n(p) \cup \hat{U}_2^n(p) \ldots \cup \hat{U}_k^n(p)$, $\hat{U}_i(p) \cap \hat{U}_j(p) \neq \emptyset$ for some $i \neq j$, and $2^V$ is the
power set of $V$ (i.e., the set of all subsets of $V$). The output of $f_M$ is a set of bipolar vectors
that correspond to the set of patterns that should be recalled given the bipolar input vector $x$.

$f_M$ is a partial function and is extended to a full function $\hat{f}_M : \hat{B}^n \rightarrow (2^V \cup \{ (-1)^m > \} - \emptyset)$
to describe multiple recall in the neural associative memory as follows:

\[
    \hat{f}_M(x) = \begin{cases} 
    f_M(x) & \text{if } x \in \hat{U}^n(p) \\
    \{ (-1)^m > \} & \text{if } x \in (\hat{B}^n - \hat{U}^n(p)) 
    \end{cases}
\]

Recall of multiple patterns is likely to be all the more useful when the input pattern is only
partially specified. The following extends the mathematical model for multiple recall outlined
above to deal with recall from a partially specified bipolar input pattern.
Let $f_{MP} : \bar{\mathbb{B}}^\infty(p) \rightarrow (2^V - \emptyset)$ be a function defined as follows:

$$f_{MP}(x) = \{v_i \mid x \in \bar{\mathbb{B}}^\infty_i(p), \ 1 \leq i \leq k\}$$  \hfill (2.46)

where $\bar{\mathbb{B}}^\infty_i(p) \cap \bar{\mathbb{B}}^\infty_j(p) \neq \emptyset$ for some $i$'s and $j$'s, $c \leq h \leq n$ and $1 \leq i \leq j \leq k$. $f_{MP}$ is a partial function and is extended to a function $\hat{f}_{MP} : \mathbb{B}^\infty \rightarrow (2^V \cup \{(\emptyset)^m\} - \emptyset)$ for multiple recalls from padded bipolar partial patterns in associative memory as follows:

$$\hat{f}_{MP}(x) = \begin{cases} f_{MP}(x) & \text{if } x \in \bar{\mathbb{B}}^\infty(p) \\ \langle (-1)^m \rangle & \text{if } x \in (\mathbb{B}^\infty(p) - \bar{\mathbb{B}}^\infty(p)) \end{cases}$$  \hfill (2.47)

Another interesting property of the proposed ANN memory is that it allows sorted multiple recall described as follows: If the input pattern is held constant and the thresholds of all hidden neurons are decremented at each time step, then gradually more and more hidden neurons will be turned on. Decrementing the threshold of a hidden neuron results in an enlargement of the corresponding associative partition in a geometrical sense, and hence more and more partitions will overlap at the input vector (vertex) from iteration to iteration. In the absence of any other control circuits, the recalled pattern will be a superposition of the outputs resulting from all of the hidden neurons that are enabled at any time step. However, in many applications, we need different patterns to be recalled individually. This can be accomplished by adding a habituation mechanism that forces a hidden neuron to turn itself off automatically after it has been on for one time step unless a new input pattern is presented. This results in a serialized or sequential recall of patterns in increasing order of dissimilarity (as measured by Hamming distance) from the input pattern. Alternatively, one can perform a set difference operation on the hidden neuron outputs from every pair of consecutive time steps before allowing the hidden neurons to influence the 2nd-layer connections and the output neurons. It is rather straightforward to implement such set-theoretic operations using neural networks [25].

As already pointed out, the ability to perform multiple recalls is more likely to be useful when dealing with partially specified input patterns. Such information retrieval applications of practical interest include: database lookup using keywords, diagnosis of diseases or faults from a partially specified set of symptoms or test results, and DNA sequence recognition from available DNA segments.
2.3.3 Fault tolerance

This section discusses the performance of the proposed neural memory in the presence of two basic types of faults — connection fault and neuron fault. In the discussion that follows, it is assumed that:

- When a connection fails and stops passing a value, it is assumed that default value 0 is passed from that faulty connection.

- When an input or hidden neuron fails and stops functioning, it is assumed that default value 0 is passed along each of its outgoing connections.

- When an output neuron fails, it is assumed that default value $-1$ (or 0 for binary output) is produced by that output neuron.

2.3.3.1 Connection fault

First we note that a single fault in a 1st-layer connection has less deleterious effect on the performance of the memory than that caused by a noisy bit in an input pattern. This is because a faulty 1st-layer connection will adversely affect only one of the similarity measurements between the input pattern and the stored memory patterns whereas a noisy bit of an input pattern affects all the similarity measurements.

For example, assume $u_i = \langle 1, -1, 1, -1, 1 \rangle^T$ and $u_j = \langle 1, 1, 1, 1, 1 \rangle^T$ are two of the memory patterns stored by hidden neurons $i$, $j$ and their respective connections in an auto-associative memory. Note that $D_H(u_i, u_j) = 2$. Let $w_i$ denote the weight vector of the 1st-layer connections connected to hidden neuron $i$. Suppose the precision levels $p_i = p_j = 1$. Then $w_i = \langle 1, -1, 1, -1, 1 \rangle^T$, $w_j = \langle 1, 1, 1, 1, 1 \rangle^T$, and the thresholds at hidden neurons $i$ and $j$ are $\theta_i = \theta_j = 3$ in the neural memory according to expression 2.41. Suppose a noisy input pattern $x = \langle 1, -1, 1, 1, 1 \rangle^T$ is fed into the neural memory module. Then both hidden neurons $i$ and $j$ are activated and as a result, the output is a superposition of memory patterns $u_i$ and $u_j$. Suppose the connection from the second input neuron to hidden neuron $i$ is faulty, which causes $w_i = \langle 1, 0, 1, -1, 1 \rangle^T$ equivalently. Assumes that $w_j$ is unaffected by the connection.
fault. When the noisy input pattern $x$ is fed into the neural memory module, the summation values at hidden neuron $i$ and $j$ are $-1$ and $0$ respectively. Only hidden neuron $j$ is activated and memory pattern $u_j$ is recalled.

Since each of the 2nd-layer connections emanating from a hidden unit stores 1 bit of the corresponding stored memory pattern, a faulty 2nd-layer connection corrupts at most 1 bit of the recalled memory pattern. Suppose the connection from hidden neuron $j$ to the third output neuron is faulty. When only hidden neuron $j$ is activated to recall memory pattern $u_j$, a default value 0 is passed from that faulty connection to the third output neuron under the assumption of connection fault. The recalled output is $<1, 1, -1, 1, 1>^T$ which is one Hamming distance from memory pattern $u_j$. When only hidden neuron $i$ is activated to recall memory pattern $u_i$ with a connection fault from hidden neuron $i$ to the second output neuron, the recalled output is $<1, -1, 1, -1, 1>^T$ which equals $u_i$.

### 2.3.3.2 Neuron fault

A fault in one of the input neurons has less of an adverse effect on the performance of the memory than 1 bit of noise in the input pattern. However, it is easy to see that an input neuron fault is more serious than a fault in a single 1st-layer connection. This is because a fault in one of the input neuron adversely affects each of the stored memory patterns. For example, suppose $u_i$, $u_j$ and $x$ are as before. Let us consider following three cases:

- **Case 1**: one of the first, third, or fifth input neurons is faulty. When $x$ is fed into the neural memory module, the summation values at hidden neurons $i$ and $j$ are both $-1$. No memory pattern is recalled.

- **Case 2**: the second input neuron is faulty. Then the summation values at hidden neuron $i$ and $j$ are $-1$ and $1$ respectively. The hidden neuron $j$ is activated to recall memory pattern $u_j$.

- **Case 3**: the fourth input neuron is faulty. Then the summation values at hidden neuron $i$ and $j$ are $1$ and $-1$ respectively. The hidden neuron $i$ is activated to recall memory pattern $u_i$. 
If a hidden neuron is faulty, the memory pattern associated with that hidden neuron can not be recalled. If an output neuron is faulty, the corresponding bit of all recalled memory patterns will have value $-1$. So the recalled memory patterns having value $1$ at that corresponding bit are corrupted by one bit of noise.

2.4 Summary and Discussion

This chapter has discussed the analysis and synthesis of a neural memory for both address-based as well as associative (content-based) storage and recall of patterns. When used as content-addressed memory, the proposed ANN memory allows adjustable precision and sorted extraction of all stored memory patterns, has high potential storage capacity, and exhibits several interesting properties: recall from partial pattern, multiple recall and fault tolerance. It also lends itself to one-shot incremental learning without interference with previously memorized patterns. A detailed mathematical analysis of the properties of the proposed neural memory architecture is presented. Address-based memory can serve as working space of dynamic representations for symbol processing and shared message-passing space among neural network modules in an integrated neural network system. It provides for reliable content modification in real time, a necessary feature for symbol processing applications.

The pattern matching process of the proposed content-addressed memory in which data parallelism is achieved and all memory patterns are compared with input pattern in parallel within one step can be far more efficient (in terms of computation time) than searching for data in a key-based organization of the sort commonly used in conventional computer systems [23]. With the need for real-time response in language translation and with the increased number of users as well as increased use of large networked databases over the Internet, efficient architectures for high-speed table lookup, message routing and database query processing have assumed great practical significance. Extensions of the proposed ANN memory architecture for efficiently handling database queries and syntax analysis are proposed in Chapters 3 and 6 (also see [22, 23]) respectively.
It is worth mentioning that the proposed neural memory supports realization of many-to-one binary random mappings which is extensively used in the design of digital logic devices such as logic circuitries of AND/OR (or NAND/NOR) gates. The design and optimization of such circuitry has always been one of the key research problems in the VLSI research community and industry. The hardware realization of the proposed neural architecture provides same flexibility as PLA (programmable logic array) and thus higher abstraction level than AND/OR gates for logic functions (many-to-one binary mappings). The anticipated performance of hardware realizations of the proposed memory architecture is evaluated in Section 3.2.1.1.

If the hardware realization provides for run-time loading of connection weights and neuron thresholds under software control, it provides for an efficient, time-saving, and error-preventing alternative to the implementation of PLA and combinational circuitry of AND/OR gates for logic circuitry.
3 NEURAL ARCHITECTURES FOR INFORMATION RETRIEVAL AND DATABASE QUERY PROCESSING

3.1 Introduction

This chapter explores the application of neural associative memory to efficient implementation of noise-tolerant information retrieval and query module in large database systems. Based on the neural associative memory proposed in Chapter 2, a library query system and a query system for text-based machine-readable lexicon are explored respectively by exploiting the capability of neural associative memory for massively parallel associative pattern matching and retrieval. The performance of the ANN-based database query module is analyzed and compared with other techniques commonly used in current computer systems. The results of this analysis suggest that the proposed ANN design offers an attractive approach for the realization of query modules in large database and knowledge base systems, especially for information retrieval based on partial matches.

Artificial neural networks offer an attractive computational model for a variety of applications in pattern classification, language processing, complex systems modelling, control, optimization, prediction and automated reasoning for a variety of reasons including: potential for massively parallel, high-speed processing, resilience in the presence of faults (failure of components) and noise. Despite a large number of successful applications of ANN in aforementioned areas, their use in complex symbolic computing tasks (including storage and retrieval of records in large databases, and inference in deductive knowledge bases) is only beginning to be explored [21, 22, 23, 24, 47, 72, 99, 179].

Database query entails a process of content-based table lookup (associative search and retrieval) which is used in a wide variety of computing applications. Examples of such lookup
tables include: *routing tables* used in routing of messages in communication networks, *symbol tables* used in compiling computer programs written in high level languages, knowledge bases which store facts and rules in relational form, fact and rule tables used in unification process of logic programming systems, keyword tables (inverted and signature files) used in information retrieval applications [37], and machine-readable lexicons, dictionary as well as varieties of tables used in *memory-based parsing* [82] for natural language processing. In such tables, every table entry is an associated input-output ordered pair. As the number of table entries and the occurrence of partially specified inputs increase, the delay of locating an associative table entry can become a severe bottleneck in large-scale information processing tasks which involve extensive associative table lookup. Therefore, many researchers have explored to augment conventional database systems with subsystems which effectively exploit associative processing to enhance the performance of the systems [30, 101, 121, 135, 157, 162, 196]. Many applications require associative table lookup mechanism or query processing system to be capable of retrieving items based on *partial matches* (some features of the input are noisy or missing) or retrieval of *multiple* records matching the specified query criteria. This capability is computationally rather expensive in many current computer systems. The ANN-based approach to database query processing that is proposed in this chapter exploits the fact that an associative table lookup task can be viewed at an abstract level in terms of *associative pattern matching and retrieval* which can be efficiently realized using neural associative memories. The rest of the chapter is organized as follows:

- The rest of Section 3.1 briefly discusses how to represent symbolic information in terms of binary codings to facilitate symbolic information manipulation on the proposed neural associative memory which operates on bipolar/binary values.

- Section 3.2 explores information retrieval and query processing using neural associative memory. ANN designs are developed respectively for a library query system and a query system for text-based machine-readable lexicon by taking advantage of the capability of the proposed neural associative memory for massively parallel pattern matching and retrieval.
• Section 3.3 compares the performance of the proposed ANN-based query processing system with that of several commonly used techniques.

• Section 3.4 concludes with a summary.

3.1.1 Information retrieval in neural associative memories

Most database systems store (symbolic) data in the form of structured records. When a query is made, the database system searches and retrieves records that match the user’s query criteria which typically only partially specify the contents of records to be retrieved. Also, there are usually multiple records that match a query (e.g., books written by a particular author in a library or the lexical specifications of the words matching the partially specified input pattern ma?e in a machine-readable lexicon, where the symbol ? means the English letter at that position is unavailable). Thus, query processing in a database can be viewed as an instance of the task of recall of multiple stored patterns given a partial specification of the patterns to be recalled. The proposed neural associative memory which is capable of massively parallel best match, exact match, and partial match and recall of binary (bipolar) patterns can serve to efficiently handle information retrieval and query processing in large database systems.

The proposed neural associative memory operates on binary (bipolar) values. Since humans find it difficult to work with binary codes, we use symbolic representations when the neural associative memory is used for information storage and retrieval. The translation from symbolic representations to binary codings can be done automatically and is not discussed here.

In general, symbolic information retrieval (lookup) from a table can be viewed in terms of a binary random mapping \( f_t : U \rightarrow V \), defined in expression 2.1. A binary vector \( u_i \in U \) can be used to represent an ordered set of \( r \) binary-coded symbols from symbol sets \( \Gamma_1, \Gamma_2, \cdots, \Gamma_r \) respectively (i.e., \( \exists \alpha_1 \in \Gamma_1, \ldots, \alpha_r \in \Gamma_r \) s.t. \( u_i = \alpha_1 \cdot \alpha_2 \cdots \cdot \alpha_r \), where \( \cdot \) denotes the concatenation of two binary codes), and a binary vector \( v_i \in V \) can be used to represent an ordered set of \( t \) symbols from symbol sets \( \Delta_1, \Delta_2, \cdots, \Delta_t \) respectively, where \( 1 \leq i \leq k \). In the context of Section 3, every \( \Gamma_i \) denotes the set of ASCII-coded English letters, \( r \) is the length (in
number of English letters) of the input, \( t = 1 \), and \( \Delta_1 \) is the set of \( M \)-bit record pointers, where \( 1 \leq i \leq r \). Let \( |U| = |\Gamma_1| |\Gamma_2| \cdots |\Gamma_r| \). Then, \( f_I \) defines a symbolic mapping function \( f_S : \Gamma_1 \times \Gamma_2 \cdots \times \Gamma_r \to \Delta_1 \times \Delta_2 \cdots \times \Delta_t \). In this case, the I/O mapping of symbolic function \( f_S \) (information retrieval from a symbolic table, given a query criterion) can be viewed in terms of the binary (bipolar) mapping operations of \( f_I \) which is realized by the proposed neural associative memory.

### 3.2 Query Processing Using Neural Associative Memories

This section describes the use of the neural associative memory described in Chapter 2 to implement high-speed database query systems. An ANN-based library query system and an ANN-based query system for a text-based machine-readable lexicon for natural language processing (NLP) are presented respectively to illustrate the key concepts. As the quantity of entries (records) of a database increases, the cost of locating an entry can become a significant cost for real-time, large-scale machine processing of text and for a library system with huge stored volumes and large users. For example, the library at Iowa State University has over 2 million volumes, and the number of words a native English speaker knows is estimated to be between 50,000 and 250,000 [4]. In the proposed ANN-based query systems, such a cost can be reduced significantly by taking advantage of the capability of the proposed neural associative memory for massively parallel associative pattern matching and retrieval.

#### 3.2.1 Realization of lexical access for a machine-readable lexicon using a neural associative memory

In the analysis, interpretation, and generation of natural languages, the lexicon is one of the central components of many NLP applications. Basically, the lexical specification for a word in a lexicon contains phonological, morpho-syntactic, syntactic, semantic, and other fields [58]. Each field may contain several sub-fields. In a lexical database which realizes a machine-readable lexicon for real-time NLP, the lengths of the fields and sub-fields are usually fixed to allow efficient random access to them. This is where a computational lexicon is distinguished
from a dictionary in which the format of lexical entries is mostly irregular and hence the access of the lexical fields for a word (lexeme) is sequential. Typically, a dictionary contains much free text including definitions, examples, cross-reference words, and others.

Generally, there are two basic conceptions about the form of the items which serve as access keys in a lexicon. One is *minimal listing hypothesis* [15] which only lists lexemes and results in a root lexicon. A lexeme may have several variants, e.g., in English, the words: *produces*, *produced*, *producing*, *producer*, *productive* and *production* are variants of the lexeme *produce*, and the words: *shorter*, *shortest* and *shortly* are variants of the lexeme *short*. The other is *full listing hypothesis* which lists all possible words of a language and results in a full-form lexicon. A root lexicon is more compact and requires a rule system to process the variants of lexemes, while a full-form lexicon is more computationally efficient in terms of lexical access and more user-friendly in terms of lexicon editing and extension [58]. Therefore, a hybrid of the two conceptions is often adopted in many computational lexicon applications. In the following, the term *access key* is used to stand for either word or lexeme in a computational lexicon no matter whether it is a root or full-form lexicon.

There are several models of lexical access in a computational lexicon. Our ANN-based query system for NLP lexicon is based on the search model of lexical access (indirect access) [36, 58]. In such a model, a text-based computational lexicon which associates every access key with its lexical specification contains two organizations: one is called *master file* which stores entries of lexical specifications, and the other is called *access file* which consists of pairs of (<access key>, <lexical pointer>). The access keys are organized to allow location of desired access keys and their associated lexical pointers efficiently. The lexical pointers point to the lexical specifications of their corresponding entries in the master file. The process of lexical access in the search model is similar to that of locating a book in a library. To locate a book from a collection of shelves (the master file) in a library, the book catalog (the access file) is searched using author name(s) and/or book title to find the call number (a pointer indicating the location) of a desired book.
A noise-tolerant neural associative memory which can efficiently support the process of search and retrieval of desired lexical pointers for a text-based machine-readable English lexicon is designed as follows. Suppose English letters of the lexicon are represented using 8-bit ASCII codes (extended to 8 bits by padding each 7-bit ASCII code with a leading 0). Assume the maximal length of an English word is \( L \) letters. Since each letter is represented by an 8-bit ASCII code, \( 8L \) input neurons are used in the ANN memory. Each binary bit \( x_b \) of the ASCII input is converted into a bipolar bit \( x_p \) by expression \( x_p = 2x_b - 1 \) before it is fed into the ANN memory to execute a query. (This is motivated by the relative efficiency of the hardware implementations of binary and bipolar neural associative memories – see Chapter 2 for details).

Let the output (the lexical pointer) be represented as an \( M \)-bit binary vector which can access at most \( 2^M \) lexical specifications in the lexical database. So, the ANN memory uses \( M \) output neurons.

For every associative ordered pair of an access key and a lexical pointer, a hidden neuron is used in the ANN memory. Suppose there are \( k \) such pairs. Then, the ANN memory uses \( k \) hidden neurons. Every access key is represented by padding its corresponding English word with trailing spaces and each binary bit \( x_b \) of every access key is converted into a bipolar bit \( x_p \) by expression \( x_p = 2x_b - 1 \) to be stored in the ANN memory. For example, if an English word has \( j \) letters (\( j \leq L \)), then the first \( j \) letters of its corresponding access key are from the English word and the last \( L - j \) letters of the access key are spaces. The reason for such padding will become obvious in the coming examples. The ASCII code for the special symbol space is \( 20_{16} = 0010\ 0000_2 \). During storage of an associated pair, the connection weights are set as explained in Section 2.2.6. Note that the input and output of an associated pair are represented in bipolar and binary values respectively. During recall, the thresholds of hidden neurons are adjusted for each query as outlined in Section 2.3.1 (where for each query, the value of \( n_a \) can be set either by centralized check on the number of letters of the input access key, or distributed circuitry embedded in input neurons). The precision level \( p \) is set at 0 for this associative ANN memory.
3.2.1.1 Examples of query processing in the neural lexicon

The following examples illustrate how the proposed ANN memory for NLP lexicon retrieves desired lexical pointers by processing a query which may contain a partially specified input (target access key).

- **Example 1** (exact match): Suppose the lexical pointer of the word product is to be retrieved from the ANN memory. Then, the first 7 letters of the target access key to be searched are p, r, o, d, u, c and t, and the last \( L - 7 \) letters are spaces. In this case, no letter of the target access key is unavailable. Therefore, in the ANN memory, the threshold set at all hidden neurons is \( L \times 8 = 8L \). Suppose a hidden neuron \( i \) is used for the association of this access key and its associated lexical pointer. When the target access key is presented to the ANN memory, only hidden neuron \( i \) has net input of 0 and other hidden neurons have net input less than 0 (see Section 2 for details). So, hidden neuron \( i \) is activated to recall the desired lexical pointer using the weights on the 2nd-layer connections associated with hidden neuron \( i \).

- **Example 2** (prefix match): Suppose the lexical pointer(s) of the word(s) matching the pattern product* is to be retrieved from the ANN memory, where the symbol * means the trailing English letters starting from that position are unavailable. In this case, the last \( L - 7 \) letters of the target access key are viewed as unavailable, only the first 7 letters are available, and its first 7 letters are p, r, o, d, u, c and t. Therefore, in the ANN memory, only the first \( 7 \times 8 = 56 \) input neurons have input value either 1 or -1, the other input neurons are fed with 0, and the threshold set at all hidden neurons is \( 7 \times 8 = 56 \). Suppose, in the lexicon, product, production, productive, productively, productiveness and productivity are the words the first 7 letters of which match the pattern product*. In this case, six hidden neurons are used for the associations of these six access keys and their lexical pointers respectively in the ANN memory. When the partially specified target access key is presented to the ANN memory, only these six hidden neurons have net input of 0 and other hidden neurons have net input.
less than 0. So, these six hidden neurons get activated one at a time to sequentially recall the associated lexical pointers using the weights on the 2nd-layer connections respectively associated with these six hidden neurons.

- **Example 3** (partial match): Suppose the lexical pointer(s) of a noisy 7-letter word pro??ct is to be retrieved from the ANN memory, where the symbol ? means the English letter at that position is unavailable. In this case, 2 of the letters (the 4th and 5th letters) of the target access key are viewed as unavailable, its first 3, 6th and 7th letters are p, r, o, c and t respectively, and the last \( L - 7 \) letters are spaces. Therefore, in the ANN memory, the input neurons representing the 4th and 5th input letters are fed with 0, other input neurons have input value either 1 or -1, and the threshold set at all hidden neurons is \( (L - 2) \times 8 = 8(L - 2) \). Suppose, in the lexicon, product, project, and protect are the only 7-letter words which match the pattern pro??ct. Therefore, three hidden neurons are used for the associations of these three access keys and their lexical pointers respectively in the ANN memory. When the partially specified target access key is presented to the ANN memory, only these three hidden neurons have net input of 0 and other hidden neurons have net input less than 0. So, these three hidden neurons get activated one at a time to sequentially recall the associated lexical pointers using the weights on the 2nd-layer connections respectively associated with them.

The large number of hidden neurons in such an ANN module poses a problem for hardware realization because of the large fan-out for input neurons and large fan-in for output neurons. One solution to this problem is to divide the whole module into several sub-modules which contain same number of input, hidden, and output neurons. These sub-modules are linked together by shared input and output bus (see Figure 3.1). Such a bus topology also makes it possible to easily expand the size of the ANN memory. The 1-dimensional array structure shown in Figure 3.1 can be easily extended to 2 or 3-dimensional array structures.
3.2.2 Realization of a library query system using a neural associative memory

A neural associative memory that can be used to support a library system queried by name can be designed as follows: Suppose the input is a name (provided in a format with last name followed by first name) of an author, and characters that appear in the name are represented using 8-bit ASCII codes. Assume the length of both last and first name are truncated to at most $L$ characters each. Since each ASCII code consists of 8 binary bits, $16L$ input neurons are used in the ANN memory. The first $8L$ input neurons are for last name and the last $8L$ neurons for first name. Each binary bit $x_b$ of the ASCII input is converted into a bipolar bit $x_p$ by expression $x_p = 2x_b - 1$ before it is fed into the ANN memory module for database queries.

Let output be a $M$-dimensional binary vector pointing to a record in the library database that contains information about a volume (or the binary vector can encode information about a volume directly). The output binary vector in turn can therefore be used to locate the title, author, call number and other relevant information about a volume. Using $M$ output neurons, we can access at most $2^M$ records from the library database. Each hidden neuron in
the associative memory module is used to realize an ordered pair associating an author's name with an $M$-bit pointer that points to a record which contains information about a corresponding volume. The last and first names of an author of an associated pair are represented by padding the names with trailing spaces and each binary bit $x_b$ of the padded names is converted into a bipolar bit $x_p$ by expression $x_p = 2x_b - 1$ to be stored in the ANN memory. For example, if Smith John is the name part of an associated pair, the first 5 letters for the last name part of the associated pair are Smith and the other $L - 5$ letters are spaces, and the first 4 letters for the first name part of the associated pair are John and the other $L - 4$ letters are spaces. During storage of an associated pair, the connection weights are set as explained in Section 2.2.6. Note that the input and output of an associated pair are represented in bipolar and binary values respectively. During recall, the thresholds of hidden neurons are adjusted for each query as outlined in Section 2.3.1. The precision level $p$ is set at 0 for this associative ANN memory module.

The following cases illustrate how the ANN-based library query system retrieves desired record pointers by processing a query which may contain a partially specified input.

- **Case 1**: Suppose a user enters "Smith *" to search for the books written by authors with last name Smith. In this case, that part of input for first name is viewed as unavailable, the first 5 letters for the part of input for last name are S, m, i, t, and h, and the other $L - 5$ letters are spaces. Therefore, in the ANN memory, the first $8 \times L = 8L$ input neurons have input value either 1 or -1, the last $8 \times L = 8L$ input neurons which together represent the part of input for first name are fed with 0, and the threshold set at all hidden neurons is $8 \times L = 8L$. Suppose the library database contains $k$ volumes written by authors with last name Smith. In this case, the ANN memory module contains $k$ hidden neurons for these $k$ volumes (one for each volume written by an author whose last name is Smith). During the recall process all these hidden neurons will have net input of 0 and other hidden neurons have net input less than 0 (see Chapter 2 for details). The neurons with non-negative net input get activated one at a time to sequentially recall the desired $M$-bit pointers pointing to the books written by authors with the specified
last name.

- **Case 2**: suppose a user enters "John" to search for the books written by authors with first name John. In this case, that part of input for last name is viewed as unavailable, the first 4 letters for the part of input for first name are J, o, h, and n, and the other \( L - 4 \) letters are spaces. Therefore, in the ANN memory, the last \( 8 \times L = 8L \) input neurons have input value either 1 or -1, the first \( 8 \times L = 8L \) input neurons which together represent the part of input for last name are fed with 0, and the threshold set at all hidden neurons is \( 8 \times L = 8L \). The recall of the associated pointers proceeds as in Case 1.

- **Case 3**: Suppose a user enters "Smith J*" to search for the books written by authors with last name called Smith and first name beginning with a J. In this case, the rest of the letters of first name is viewed as unavailable, the first 5 letters for the part of input for last name are S, m, i, t, and h, and the other \( L - 5 \) letters are spaces. Therefore, in the ANN memory, the first \( 8 \times (L + 1) = 8(L + 1) \) input neurons have input value either 1 or -1, the last \( 8 \times (L - 1) = 8(L - 1) \) input neurons are fed with 0, and the threshold set at all hidden neurons is \( 8 \times (L + 1) = 8(L + 1) \). The recall of the associated pointers proceeds as in Case 1.

### 3.2.3 The implementation of case insensitive pattern matching

It is rather straightforward to modify the proposed ANN-based query system to make it case-insensitive. The following shows ASCII codes of English letters, which are denoted in hexadecimal and binary codes.

\[
\begin{align*}
A &= 41_{16} = 0100\ 0001_2, \ldots, \ Z &= 5A_{16} = 0101\ 1010_2 \\
\text{a} &= 61_{16} = 0110\ 0001_2, \ldots, \ z &= 7A_{16} = 0111\ 1010_2
\end{align*}
\]

The binary codes for the capital case and small case of every same English letter only differ at the 3rd bit counted from left hand side. If that bit is viewed as "don't care" (or unavailable), this query system will be case insensitive. This effect can be achieved by treating the corresponding input value as though it was unavailable.
3.3 Comparison with Other Database Query Processing Techniques

This section compares the anticipated performance of the proposed neural architecture for database query processing with other approaches that are widely used in current computer systems. Such a comparison takes into account the performance of hardware used in these systems and the process used for locating data items. It is assumed that the systems have comparable I/O characteristics which are not discussed here. First, the performance of the proposed neural network is estimated, based on current CMOS technology for realizing neural networks. Next, the operation of conventional database systems is examined, and their performance is estimated and compared to that of the proposed neural architecture.

3.3.1 Performance of current electronic realization for neural networks

Electronic hardware realizations of ANN have been explored by several authors [49, 50, 57, 103, 106, 107, 120, 152, 184, 190]. Such implementations typically employ CMOS analog, digital, or hybrid (analog/digital) electronic circuits. Analog circuits typically consist of processing elements for multiplication, summation and thresholding. Analog CMOS technology is attractive for realization of ANN because it can yield compact circuits that are capable of high-speed asynchronous operation [48]. [184] reports a measured propagation delay of 104ns in a digital circuit with each synapse containing an 8-bit memory, an 8-bit subtractor and an 8-bit adder. [50] reports throughput at the rate of 10MHz (or equivalently, delay of 100ns) in a Hamming Net pattern classifier using analog circuits. [106] describes a hybrid analog-digital design with 5-bit (4 bits + sign) binary synapse weight values and current-summing circuits that is used to realize a 2-layer feed-forward ANN with a network computation delay of less than 20ns.

The 1st-layer and 2nd-layer subnetworks of the proposed neural architecture for database query processing are very similar to the 1st-layer subnetwork of a Hamming Net respectively, and the neural architecture with 2 connection layers in the proposed ANN is exactly same as that implemented by [106] except [106] uses discretized inputs, 5-bit synaptic weights, and sigmoid-like activation function. The proposed ANN uses bipolar inputs, weights in \(-1, 0, 1\)
and binary hardlimiter as activation function. Hence the computation delay of the proposed ANN can be expected to be at worst of the order of 100 ns and at best 20 ns given the current CMOS technology for realizing ANN.

The development of specialized hardware for implementation of ANN is still in its early stages. Conventional CMOS technology that is currently the main technology for VLSI implementation of ANN is known to be slow [92, 104]. Other technologies, such as BiCMOS, NCMOS [92], pseudo-NMOS logic, standard N-P domino logic, and quasi N-P domino logic [104], may provide better performance for the realization of ANN. Thus, the performance of the hardware implementation of ANN is likely to improve with technological advances in VLSI.

3.3.2 Analysis of query processing in conventional computer systems

Accessing information based on a key is central to information retrieval systems [37, 157, 158] and database systems [186]. In relational database systems implemented on conventional computer systems, given the value for a key, a record is located efficiently by using key-based organizations including hashing, index-sequential access files and B-trees [186]. Such a key-based organization usually contains two data structures: index files(s) and master file. In an index file, every key is organized and usually associated with a record pointer which points to a corresponding record in the master file which is typically stored in secondary storage devices like hard disks for large databases. Conventionally, estimated cost of locating a record is based on the number of physical block accesses of secondary storage devices [186] since the access latency with current cost-effective disk systems is around 5~10 ms (millisecond) and every one of the repetitive search steps which together facilitate locating a desired record pointer from index files (loaded into the main memory) takes only several CPU clock cycles. The clock cycle of current cost-effective CPUs is around 2~10 ns. With the large number of entries in index files of large databases and with the low price of current memory chips, master files of databases for real-time applications tend to be loaded into the main memory to avoid accessing records from low-speed secondary storage devices (compared to memory chips) and thus the cost of locating a desired record pointer can become a dominant cost for record retrieval in
large databases.

The following compares the anticipated performance of the proposed neural associative memory with other approaches that are widely used in current computer systems for locating a record pointer associated with a given key. In the following analysis, it is assumed that all program and index files for processing queries using current computer systems are pre-loaded into the main memory. The effect of data dependency among instructions which offsets pipeline and superscalar effects and thus much reduces the average performance of current computer systems is not considered here.

To simplify the comparison, it is assumed that each instruction on a conventional computer takes $\tau$ ns on an average. For instance, on a relatively cost-effective 100 MIPS processor, a typical instruction would take 10 ns (The MIPS measure for speed combines clock speed, effect of caching, pipelining and superscalar design into a single figure for speed of a microprocessor). Similarly, we will assume that a single identification and recall operation cycle by a neural associative memory takes $\alpha$ ns. Assuming hardware implementation based on current CMOS VLSI technology, $\alpha$ is around 20~100 ns. Table 3.1 summarizes from following analysis the estimated performance of the proposed neural associative memory and other techniques commonly used in conventional computer systems for locating a desired record pointer. The summary assumes that the value of the key is given, the data structures and programs are loaded into the main memory of the computer systems used, index search occurs in a balanced binary tree of $(2^M - 1)$ records, and partial match occurs in a $k$-d-tree of $N$ records. $L$ is the total number of bytes of a key, $n$ is the data bus width of the computer systems used, $h$ is the average number of executed instructions in a hashing cycle, $\tau$ is the average time delay for executing an instruction, $\delta$ is the average number of executed instructions in a comparison cycle for every $n$ bits in a binary search cycle, $\alpha$ is the time delay of the proposed neural memory, $K$ is the number of index fields used in the $k$-d-tree, and $J$ is the number of index fields specified in a query criterion. Table 3.2 summarizes the capabilities of the proposed neural associative memory and other techniques commonly used in conventional computer
Table 3.1 A comparison of the estimated performance of the proposed neural associative memory with that of other techniques commonly used in conventional computer systems for locating a record pointer in key-based organizations

<table>
<thead>
<tr>
<th>Method</th>
<th>Estimated time (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>hashing</td>
<td>$[8L/n] h \tau$</td>
</tr>
<tr>
<td>index search</td>
<td>$(M - 1)[4L/n] b \tau$</td>
</tr>
<tr>
<td>ANN memory</td>
<td>$\alpha$</td>
</tr>
<tr>
<td>k-d-tree (partial match)</td>
<td>$O(N^{(K-J)/K})$</td>
</tr>
</tbody>
</table>

Table 3.2 A comparison of the capabilities of the proposed neural associative memory with those of other techniques commonly used in conventional computer systems for exact match and partial match

<table>
<thead>
<tr>
<th>Method</th>
<th>Exact match</th>
<th>Prefix match</th>
<th>Partial match</th>
</tr>
</thead>
<tbody>
<tr>
<td>hashing</td>
<td>efficient</td>
<td>unable</td>
<td>unable</td>
</tr>
<tr>
<td>index search</td>
<td>efficient</td>
<td>efficient</td>
<td>inefficient</td>
</tr>
<tr>
<td>ANN memory</td>
<td>efficient</td>
<td>efficient</td>
<td>efficient</td>
</tr>
<tr>
<td>k-d-tree</td>
<td>satisfactory</td>
<td>satisfactory</td>
<td>inefficient</td>
</tr>
</tbody>
</table>

systems for exact match, prefix match and partial match mentioned in Section 3.2.1.

3.3.2.1 Analysis of locating a record pointer using hashing functions

Hashing structure is the fastest of all key-based searching techniques for locating a record pointer for a single record. However, although it is effective in locating a single record by exact match (e.g., example 1 of Section 3.2.1), it is inefficient at or incapable of locating related records in response to a partially specified input (e.g., examples 2 and 3 of Section 3.2.1). Let us consider the time needed for locating a record pointer using a hash function in current computer systems. Commonly used hash functions are based on multiplication, division and addition operations [87, 163]. In hardware implementation addition is faster than multiplication which in turn is far faster than division. Assume that computing a hashing function on a key with a length of $L$ bytes (characters) takes $[8L/n]$ cycles using a processor with an $n$-bit data bus and every cycle takes $h$ instructions. Then, the estimated computation time for locating a record pointer is $[8L/n] h \tau$. Other overheads in computing a hashing
function in such systems include the time for handling the potential problem of collisions in hash functions. If a single-CPU 100 MIPS processor with a 32-bit data bus is used, it is expected that the total computation time for locating a record pointer will typically be in excess of 100 ns (If \( L = 15 \) and \( h = 5 \), the total computation time is \( \left[8 \times 15/32\right] \times 5 \times 10 \text{ ns} = \left[120/32\right] \times 50 \text{ ns} = 200 \text{ ns} \)).

3.3.2.2 Analysis of locating a record pointer using index search

A perfectly balanced binary search tree is another popular, efficient data structure used in conventional database systems to locate a single record by exact match (e.g., example 1 of Section 3.2.1) or several related records by partial match (e.g., example 2 but not example 3 of Section 3.2.1). Assume every non-terminal node in a perfectly balanced binary search tree links two child subtrees and there are \( 2^M - 1 \) nodes in the tree. Assume the length of the index key is \( L \) bytes (characters). The average number of nodes visited for locating a desired key would be \( \frac{2^M(M-1)+1}{2^M-1} \approx M - 1 \). On an average, every visit takes \( \left[\frac{3L}{n}\right] = \left[\frac{4L}{n}\right] \) comparison cycles for a processor with an \( n \)-bit data bus. Suppose every comparison cycle takes \( b \) instructions. Then, the estimated computation time for locating a desired record pointer is \( (M - 1)\left[\frac{4L}{n}\right] \times b \times \tau \). If \( L = 15 \), and a 100 MIPS processor with a 32-bit data bus is used, the comparison cycle for every 32 bits takes 5 instructions on average, and there are \( 2^{16} - 1 = 65,535 \) records (the number of words a native English speaker knows is estimated to be between 50,000 and 250,000 [4]), then the overhead for locating a desired record pointer is about \( (16 - 1) \times \left[4 \times 15/32\right] \times 5 \times 10 \text{ ns} = 1500 \text{ ns} \) which compares unfavorably with 100 ns. Note that this is only the cost of locating a record pointer for a single record. The cost of locating several record pointers of related records using user-entered data in an index file containing multiple index fields is examined in next section.

3.3.2.3 The cost of partial-match queries

One of the most commonly used data structures for processing partial-match queries on multiple index fields is \( k-d \)-tree [12]. It can provide approximately satisfactory performance
for locating a single record by exact match or several related records by partial match. In the worst case, the number of visited nodes in an ideal k-d-tree of \( N \) nodes (one for each record stored) for locating the desired record pointers for a partial-match query is

\[
\frac{(J + 2)2^{K-J-1} - 1}{2^{K-J} - 1} \approx O\left(\frac{(N+1)^{(K-J)/K} - 1}{N^{(K-J)/K}}\right)
\]

where \( K \) is the number of index fields used to construct the k-d-tree, and \( J \) out of \( K \) index fields are explicitly specified by a user query. For typical values of \( N \), \( K \), and \( J \), the performance of such systems is far worse than that of the proposed ANN based model according to expression 3.1.

### 3.4 Summary and Discussion

Artificial neural networks, due to their inherent parallelism and potential for noise tolerance, offer an attractive paradigm for efficient implementations of a broad range of information processing tasks. In this chapter, we have explored the use of artificial neural networks for pattern-based (key-based) query processing in large databases. The use of the proposed approach was demonstrated using the examples of a library query system and a query system for text-based machine-readable lexicon used in natural language processing. The performance of a CMOS hardware realization of the proposed neural associative memory for database query processing system was estimated and compared with that of other approaches which are widely used in conventional databases implemented on current computer systems. The comparison shows that ANN architectures for query processing offer an attractive alternative to conventional approaches, especially for dealing with partial-match queries in large databases. With the need for real-time response in language translation and with the explosive growth of the Internet as well as increased use of large networked databases over the Internet, efficient architectures for high-speed information retrieval, associative table lookup, message routing and database query processing have assumed great practical significance.
4 NEURAL ARCHITECTURES FOR ELEMENTARY LOGICAL INFERENCE

4.1 Introduction

Inference often involves tasks which look for interesting patterns in the input or memory to solve questions such as "What is the most likely answer?", "Is there sufficient evidence to adopt a conclusion or is more evidence needed?" [42, 61, 191], etc. Such tasks are important for inference from partial information, and they generally involve a process of pattern recognition by way of best, partial, and/or exact matches. Artificial neural networks, due to their inherent massive parallelism, potential for fault tolerance and adaptation capability through learning, have attracted extensive interest for robust and efficient implementations of logical inference systems. Many of the systems proposed in the literature are motivated by the need for massively parallel architecture for AI applications, and some of them are proposed to model human cognitive processes robustly. In particular, they explore neural mechanisms for variable binding to facilitate complex reasoning based on predicate logic [5, 31, 95, 175, 176]; and connectionist realizations of production system [182], expert systems [42, 43], hybrid knowledge processing systems [136], semantic networks [68, 167], frame representation [79], planning [145, 188], non-monotonic reasoning [142], legal reasoning [148], commonsense reasoning [177, 178], and logical theorem proving [141]. This chapter explores how neural architectures for binary partial pattern recognition can be extended for elementary logical inference based on propositional logic. The proposed neural architectures, like the ones proposed in Chapters 2 and 3 for associative memory and query processing, exploit the massively parallel computational capability of artificial neural networks.
Propositional logic, which typically operates on propositions and logical connectives: AND, OR, as well as negation, is basic to logical inference. For this reason, it is customary to use propositional logic for demonstrating the feasibility of new tools for logical inference. This chapter proposes a method based on geometrical/mathematical analysis to systematically design neural architectures for realizing logical ANDs, logical ORs, and DNF (Disjunctive Normal Form) propositions (sum of products). A DNF proposition is a disjunction of conjunctions. The evaluation of a conjunction corresponds to that of a logical AND function, and the evaluation of a disjunction corresponds to that of a logical OR function. The evaluation of logical AND and OR functions can be respectively realized by the AND and OR neural assemblies proposed in this chapter through a process of pattern recognition. It is known that any proposition can be represented in DNF. Therefore, any proposition can be realized by a 2-layer neural architecture assembled from an OR neural assembly and a fixed number of AND neural assemblies. The rest of the chapter is organized as follows:

- Section 4.2 develops two types of neural assemblies for the recognition of binary partial patterns.
- Section 4.3 develops a general AND neural assembly which can be used to realize any arbitrary logical AND function of a finite number of Boolean variables.
- Section 4.4 develops a general OR neural assembly which can be used to realize any arbitrary logical OR function of a finite number of Boolean variables. Then, a monotone OR neural assembly is derived.
- Section 4.5 discusses how to use AND and OR neural assemblies to realize arbitrary Boolean functions.
- Section 4.6 concludes with a summary of the chapter and a brief discussion.

### 4.2 Neural Assemblies for the Recognition of Partial Patterns

This section develops two types of neural assemblies for the recognition of binary partial patterns. One of them is used for the recognition of patterns which contain a specific sub-
pattern, and the other is used for the recognition of patterns which don't contain a specific 
sub-pattern. Let us call the former the neural assembly for inclusive pattern recognition, and 
the latter, the neural assembly for exclusive pattern recognition. The two assemblies are used 
to build the AND neural assembly proposed in Section 4 and the OR neural assembly proposed 
in Section 4 respectively.

4.2.1 A neural assembly for inclusive pattern recognition

Let \( u = < u_1, ..., u_n > \) be a binary vertex (vector) of dimension \( n \), where \( u_i \in \{0, 1\} \) 
for \( 1 \leq i \leq n \). Let \( H_{S}^{n,u} \) be an \( n \)-dimensional separating hyperplane which can be used 
to implement a 1-layer Perceptron to distinguish the vertex \( u \) from all other \( n \)-dimensional 
vertices. According to expression 2.12, among a set of possible expressions for \( H_{S}^{n,u} \), we 
choose:

\[
H_{S}^{n,u} = \sum_{i=1}^{n} (2u_i - 1)x_i - \sum_{i=1}^{n} u_i = 0 \tag{4.1}
\]

Therefore, in the \( n \)-input, 1-output Perceptron implemented to recognize the vertex \( u \)
- the threshold of the output neuron is set as \( \sum_{i=1}^{n} u_i \), and
- the connection weight from the \( i \)-th input neuron to the output neuron is set as \( 2u_i - 1 \).

Now consider a binary vector \( \tilde{u} \) of dimension \( m \), where \( m \geq n \). Suppose, in a system of 
\( m \) variables, only the values of \( n \) of \( m \) components of vector \( \tilde{u} \) are of interest. For two given 
binary vectors \( u \) and \( \tilde{u} \) of dimensions \( n \) and \( m \) respectively, only whether \( \tilde{u}_{j_1} = u_1, ..., \tilde{u}_{j_n} = u_n \) 
are concerned, where \( 1 \leq n \leq m \) and \( 1 \leq j_1 < j_2 < \cdots < j_n \leq m \). Let us call 
\( J^n = \{j_1, j_2, ..., j_n\} \) the interest set \( J^n \), and \( \tilde{u}(J^n) = < \tilde{u}_{j_1}, \tilde{u}_{j_2}, ..., \tilde{u}_{j_n} > \) the \( J^n \)-set partial vector 
of the binary vector \( \tilde{u} \). Note that several interest sets could be defined concurrently for a given 
problem in an \( m \)-dimensional binary space. In the following, the expression for the separating 
hyperplane \( H_{S}^{n,u} \) (expression 4.1) in an \( n \)-dimensional binary space is re-defined as a separating 
hyperplane \( H_{S}^{m,u,J^n} \) in an \( m \)-dimensional binary space to implement a 1-layer Perceptron to 
recognize all the \( 2^{m-n} \) \( m \)-dimensional binary vectors whose \( J^n \)-set partial vectors equal the
given n-dimensional binary vector \( u = \vec{u}(J^n) = \langle \vec{u}_{j_1}, \vec{u}_{j_2}, ..., \vec{u}_{j_n} \rangle = \langle u_1, u_2, ..., u_n \rangle \):

\[
H_{3}^{m, u, J^n} \equiv \sum_{j_i \in J^n} (2u_i - 1)x_{j_i} + \sum_{i \notin J^n} 0 \cdot x_i - \sum_{k} u_k = 0 \tag{4.2}
\]

Let \( w_i \) be the connection weight from the \( i \)th input neuron to the output neuron and \( \theta \) be the threshold of the output neuron in a 1-layer, 1-output Perceptron. Then, in the \( m \)-input, 1-output Perceptron,

- \( \forall j_i \in J^n \ & 1 \leq i \leq n, w_{j_i} = 2u_i - 1, \)
- \( \forall i \notin J^n \ & 1 \leq i \leq m, w_i = 0, \text{ and} \)
- \( \theta = \sum_{k=1}^{n} u_k. \)

The values from those input neurons which are not in the interest set \( J^n \) will not affect the net input of the output neuron since the weights on the connections from those input neurons are set as 0. These connections together act as a \textit{don't-care} filter.

For example, suppose one wants to use a 1-layer Perceptron to recognize all the 5-dimensional binary vertices whose 1st, 3rd, and 5th components are 1, 0, and 1 respectively. Then, the corresponding interest set would be \( J^5 = \{1, 3, 5\} \), and the implemented Perceptron is shown in Figure 4.1.

4.2.2 A neural assembly for exclusive pattern recognition

For a given \( n \)-dimensional binary vertex \( u = \langle u_1, ..., u_n \rangle \), all \( n \)-dimensional binary vertices can be partitioned into \( n + 1 \) parallel layers according to their Hamming distance \( p \) to the given binary vertex \( u \) (Theorem 2.1). Those \( n + 1 \) layers are respectively on \( n + 1 \) mutually parallel \( n \)-dimensional hyperplanes \( H_p^{n, u} \)'s (expression 2.12), \( 0 \leq p \leq n \), where

\[
H_p^{n, u} \equiv \sum_{i=1}^{n} (2u_i - 1)x_i - (\sum_{i=1}^{n} u_i - p) = 0
\]

and \( u \) is the only vertex of the first layer which is on \( H_0^{n, u} \), where

\[
H_0^{n, u} \equiv \sum_{i=1}^{n} (2u_i - 1)x_i - \sum_{i=1}^{n} u_i = 0 \tag{4.3}
\]
Figure 4.1 A 1-layer Perceptron which recognizes all the 5-dimensional binary patterns that contain the partial pattern \(<1, ?, 0, ?, 1>\). where \(\ ?\) denotes don't care

Let \(\overline{u} = \langle \overline{u}_1, ..., \overline{u}_n \rangle\) be the complement vertex of binary vertex \(u\), i.e., \(u_i + \overline{u}_i = 1\) for \(1 \leq i \leq n\). Then \(\overline{u}\) is the only vertex of the \((n+1)\)th layer which is on \(H_{n,n}^{u}\), where

\[
H_{n,n}^{u} \equiv \sum_{i=1}^{n} (2u_i - 1)x_i - (\sum_{i=1}^{n} u_i - n) = 0
\]  

(4.4)

The hyperplane \(H_{n-1}^{u}\) which is defined as

\[
H_{n-1}^{u} \equiv \sum_{i=1}^{n} (2u_i - 1)x_i - (\sum_{i=1}^{n} u_i - n + 1) = 0
\]  

(4.5)

can be used to implement a 1-layer Perceptron to distinguish the binary vertex \(\overline{u}\) from all other \(n\)-dimensional binary vertices (i.e. the Perceptron recognizes all the \(n\)-dimensional binary vertices which are not \(\overline{u}\)) by setting

- the threshold of the output neuron as \(\sum_{i=1}^{n} u_i - n + 1\), and

- the connection weight from the \(i\)th input neuron to the output neuron as \(2u_i - 1\)

in the \(n\)-input, 1-output Perceptron.

Now consider a binary vector \(\hat{u}\) of dimension \(m\), where \(m \geq n\). Suppose, only the values of \(n\) of \(m\) components of vector \(\hat{u}\) are of interest and an interest set \(J_n = \{j_1, j_2, ..., j_n\}\) is defined. The expression for the hyperplane \(H_{n-1}^{u}\) in an \(n\)-dimensional binary space is re-defined as a hyperplane \(H_{n-1}^{m,u,J^n}\) in an \(m\)-dimensional binary space to implement a 1-layer Perceptron to recognize all the \(2^{m-n}\) \(m\)-dimensional binary vectors whose \(J^n\)-set partial vectors don't equal
the $n$-dimensional binary vector $\overline{u}$. Then,

$$H_{n-1}^{m,u,J_n} \equiv \sum_{j_i \in J^n} (2u_i - 1)x_{j_i} + \sum_{i \notin J^n} 0 \cdot x_i - \left(\sum_{k=1}^{n} u_k - n + 1\right) = 0 \quad (4.6)$$

and, in the 1-layer, $m$-input, 1-output Perceptron,

- $\forall j_i \in J^n \& 1 \leq i \leq n, w_{j_i} = 2u_i - 1,$
- $\forall i \notin J^n \& 1 \leq i \leq m, w_i = 0,$ and
- $\theta = \sum_{k=1}^{n} u_k - n + 1.$

For example, suppose one wants to use a 1-layer Perceptron to recognize all the 5-dimensional binary vertices whose 1st, 3rd, and 5th components are not 1, 0, and 1 respectively. Then, the corresponding interest set would be $J^5 = \{1, 3, 5\}$, and the implemented Perceptron is shown in Figure 4.2.

### 4.3 A Neural Assembly for Executing a Logical AND (AND Neural Assembly)

This section develops an AND neural assembly which can realize any arbitrary logical AND function of a finite number of Boolean variables. First, we develop notations to represent Boolean variables (atomic propositionaJ variables) and logical AND expressions to facilitate such a realization. Let $\neg v_i$ be the negation of the Boolean variable $v_i$. Further, let $\neg v_i$ be denoted by $v_i^0$, and $v_i$ by $v_i^1$. Then a logical expression $v_1 \land \neg v_2 \land v_3$ (a conjunction of three Boolean variables) can be denoted as $v_1^1 \land v_2^0 \land v_3^1$. Let $v = <v_1, \ldots, v_n>$ and $z = <z_1, \ldots, z_n>$, where $v_i, z_i \in \{0,1\}$ for $1 \leq i \leq n$. Then, for a logical AND function denoted by $C^{n-x}(v) = v_1^{z_1} \land v_2^{z_2} \cdots \land v_n^{z_n}$, we have

$$C^{n-x}(v) = \begin{cases} 1 & \text{if } v = z \\ 0 & \text{if } v \text{ is any other } n \text{-dimensional binary vertex} \end{cases} \quad (4.7)$$

The evaluation of the logical AND function $C^{n-x}(v)$ can be viewed as a process of binary pattern recognition. Thus, it can be realized by a 1-layer Perceptron that implements the
Figure 4.2 A 1-layer Perceptron which recognizes all the 5-dimensional binary patterns that don’t contain the partial pattern 
\(<1,?,0,?,1>\), where ? denotes don’t care

hyperplane \(H^z_S\) to recognize the binary vertex \(z\). Let \(H^{n,z}_{\text{AND}}\) be used for \(H^z_S\). Then, according to expression 4.1 and its associated Perceptron implementation,

\[
H^{n,z}_{\text{AND}} \equiv H^{n,z}_S \equiv \sum_{i=1}^{n} (2z_i - 1) x_i - \sum_{i=1}^{n} z_i = 0
\] (4.8)

and the logical AND function \(C^{n,z}(v)\) can be realized by a 1-layer Perceptron with \(n\) input neurons and one output neuron. The corresponding Perceptron has

- the threshold of the output neuron set to \(\sum_{i=1}^{n} z_i\), and
- the connection weight from the \(i\)th input neuron to the output neuron set to \(2z_i - 1\).

For example, suppose \(v = <v_1, v_2, v_3>\) and \(C(v) = v_1 \land \neg v_2 \land v_3 = v_1 \wedge v_2^0 \wedge v_3\). Then, we have

\[
C(v) = \begin{cases} 
1 & \text{if } v = <1,0,1> \\
0 & \text{if } v \text{ is any other } n\text{-dimensional binary vertex}
\end{cases}
\] (4.9)

and the corresponding Perceptron which realizes the logical AND function \(C(v)\) is shown in Figure 4.3.

In order to be able to realize all possible logical AND functions in a system of \(m\) Boolean variables using their corresponding 1-layer Perceptrons, the expression for the separating hyperplane \(H^{n,z}_{\text{AND}}\) is extended from an \(n\)-dimensional binary space to an \(m\)-dimensional binary space to recognize all the \(m\)-dimensional binary patterns whose partial patterns equal the
Figure 4.3 An AND neural assembly which realizes the logical AND function $C(v)$

$n$-dimensional binary vector $z$ for certain interest set $J^n$s, where $m \geq n$. Suppose $\tilde{v} = <\tilde{v}_1, \tilde{v}_2, ..., \tilde{v}_m>$ is a binary vertex of dimension $m$. We define an interest set $J^n = \{j_1, j_2, ..., j_n\}$, $1 \leq j_1 < j_2 < \cdots < j_n \leq m$. Let $C^{m,z,J^n}(\tilde{v}) = \tilde{v}_{j_1}^{z_1} \land \tilde{v}_{j_2}^{z_2} \cdots \land \tilde{v}_{j_n}^{z_n}$. Then

$$
C^{m,z,J^n}(\tilde{v}) = \begin{cases} 
1 & \text{if } \tilde{v}(J^n) = z \\
0 & \text{if } \tilde{v}(J^n) \text{ is any other } n\text{-dimensional binary vector}
\end{cases} 
(4.10)
$$

The logical AND function $C^{m,z,J^n}(v)$ can be realized by a 1-layer Perceptron that implements the hyperplane $H^{m,z,J^n}_S$ to recognize all the $m$-dimensional binary vectors whose $J^n$-set partial vectors equal to the $n$-dimensional binary vector $z$. Let $H^{m,z,J^n}_{AND}$ be used for $H^{m,z,J^n}_S$. Then, according to expression 4.2 and its associated Perceptron implementation,

$$
H^{m,z,J^n}_{AND} \equiv \sum_{j_i \in J^n} (2z_i - 1)x_{j_i} + \sum_{i \notin J^n} 0 \cdot x_i - \sum_{k} z_k = 0 
(4.11)
$$

and the logical AND function $C^{m,z,J^n}(\tilde{v}) = \tilde{v}_{j_1}^{z_1} \land \tilde{v}_{j_2}^{z_2} \cdots \land \tilde{v}_{j_n}^{z_n}$ can be realized by a 1-layer Perceptron with $m$ input neurons and one output neuron. In the Perceptron,

- $\forall j_i \in J^n \& 1 \leq i \leq n, w_{j_i} = 2z_i - 1$.

- $\forall i \notin J^n \& 1 \leq i \leq m, w_i = 0$, and

- $\theta = \sum_{k=1}^{n} z_k$. 

Such an AND neural assembly will be used as a building block to assemble the neural architectures for realizing Boolean functions represented in DNF representation (see Section 4). Examples of such a neural assembly will be shown in Section 4 to assemble a neural architecture which realizes a given DNF Boolean function.

4.4 Neural Assemblies for Executing Logic ORs (OR Neural Assemblies)

This section develops OR neural assemblies which can realize any arbitrary logical OR functions of a finite number of Boolean variables. First, a general OR neural assembly is described. The assembly will be used a building block to assemble the neural architecture proposed in Section 4 for realizing DNF Boolean functions. Then, a monotone OR neural assembly is derived from the general OR neural assembly. The monotone OR neural assembly will be used as a building block to assemble the neural architecture proposed in Section 5.4.2.4 for realizing NFA.

4.4.1 A general OR neural assembly

This subsection investigates how a 1-layer Perceptron can realize a general logical OR function which contains negated Boolean variables. The notations used here follows that in Section 4. Let $D_G^{n,z}(v) = v_1^z_1 \lor v_2^z_2 \cdots \lor v_n^z_n$, where $\lor$ is a logical connective OR, $v_i$'s are Boolean variables, $v = <v_1, \ldots, v_n>$, $z = <z_1, \ldots, z_n>$, and $v_i, z_i \in \{0,1\}$ for $1 \leq i \leq n$. Then, we have

$$D_G^{n,z}(v) = \begin{cases} 0 & \text{if } v = <z_1, z_2, \ldots, z_n> \\ 1 & \text{if } v \text{ is any other } n\text{-dimensional binary vertex} \end{cases} (4.12)$$

The logical OR function $D_G^{n,z}(v)$ can be realized by a 1-layer Perceptron that implements the hyperplane $H_{n-1}^{n,z}$ to recognize all $n$-dimensional binary vertices which are not $\bar{z}$. Let $H_{OR}^{n,z}$ be used for $H_{n-1}^{n,z}$. Then, according to expression 4.5 and its associated Perceptron implementation,

$$H_{OR}^{n,z} \equiv H_{n-1}^{n,z} \equiv \sum_{i=1}^{n} (2z_i - 1)x_i - (\sum_{i=1}^{n} z_i - n + 1) = 0 (4.13)$$

and, in the $n$-input, 1-output Perceptron which realizes $D_G^{n,z}(v)$,
• the threshold of the output neuron is set to $\sum_{i=1}^{n} z_i - n + 1$, and

• the connection weight from the $i$th input neuron to the output neuron is set to $2z_i - 1$.

For example, suppose $v = < v_1, v_2, v_3 >$ and $D(v) = v_1 \lor \neg v_2 \lor v_3 = v_1 \lor v_2^0 \lor v_3^1$. Then, we have

$$D(v) = \begin{cases} 0 & \text{if } v = < \overline{1}, \overline{0}, \overline{1} > \\ 1 & \text{if } v \text{ is any other } n\text{-dimensional binary vector} \end{cases}$$

and the corresponding Perceptron which realizes the logical OR function $D(v)$ is shown in Figure 4.4.

In order to be able to realize all the possible logical OR functions in a system of $m$ Boolean variables using their corresponding 1-layer Perceptron implementations, the expression for the separating hyperplane $H_{OR}^{n,z}$ is extended from an $n$-dimensional binary space to an $m$-dimensional binary space to recognize all the $m$-dimensional binary patterns whose partial patterns don't equal the $n$-dimensional binary vector $\overline{z}$ for certain interest set $J^n_m$, where $m \geq n$. Suppose $\bar{v} = < \bar{v}_1, \bar{v}_2, \ldots, \bar{v}_m >$ is a binary (Boolean) vertex of dimension $m$. We define an interest set $J^n = \{j_1, j_2, \ldots, j_n\}$, $1 \leq j_1 < j_2 < \cdots < j_n \leq m$. Let $D_G^{m,z,J^n}(\bar{v}) = \bar{v}_{j_1}^{z_1} \lor \bar{v}_{j_2}^{z_2} \cdots \lor \bar{v}_{j_n}^{z_n}$. Then

$$D_G^{m,z,J^n}(\bar{v}) = \begin{cases} 0 & \text{if } \bar{v}(J^n) = \overline{z} \\ 1 & \text{if } \bar{v}(J^n) \text{ is any other } n\text{-dimensional binary vector} \end{cases}$$

The logical OR function $D_G^{m,z,J^n}(\bar{v})$ can be realized by a 1-layer Perceptron that implements the hyperplane $H_{OR}^{m,z,J^n}$ to recognize the $m$-dimensional binary vectors whose $J^n$-set partial vectors don't equal to the $n$-dimensional binary vector $\overline{z}$. Let $H_{OR}^{m,z,J^n}$ be used for $H_{OR}^{m,z,J^n}$. Then, according to expression 4.6 and its associated Perceptron implementation.

$$H_{OR}^{m,z,J^n} = \sum_{j_i \in J^n} (2z_i - 1)x_{j_i} + \sum_{i \notin J^n} 0 \cdot x_i - (\sum_{k} z_k - n + 1) = 0$$

and, in the $m$-input, 1-output Perceptron which realizes $D_G^{m,z,J^n}(\bar{v})$,

• $\forall j_i \in J^n \& 1 \leq i \leq n, w_{j_i} = 2z_i - 1$,

• $\forall i \notin J^n \& 1 \leq i \leq m, w_i = 0$, and
Figure 4.4 An OR neural assembly which realizes the logical OR function $D(v)$

\[ \theta = \sum_{k=1}^{n} z_k - n + 1. \]

Figure 4.2 is a corresponding Perceptron implementation which realize the logical OR function $y = D(x) = \neg x_1 \lor x_3 \lor \neg x_5 = x_1^0 \lor x_3^1 \lor x_5^0$ in a system which contains Boolean variables $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$, where $x = \langle x_1, ..., x_5 \rangle$.

4.4.2 A monotone OR neural assembly

Consider an $n$-variable Boolean expression represented as a monotone disjunction:

\[ v_1 \lor v_2 \cdots \lor v_n \quad (4.17) \]

where $v_i$'s are Boolean variables. Monotone disjunctions are simply disjunctions which don't contain negated Boolean variables. For example, $v_1 \lor v_2$ is a monotone disjunction, but $\neg v_1 \lor v_2$ is not a monotone disjunction. Let $v = \langle v_1, ..., v_n \rangle$ and $D_M^n(v) = v_1^1 \lor v_2^1 \cdots \lor v_n^1$. Then, we have

\[ D_M^n(v) = \begin{cases} 0 & \text{if } v = \langle 1^n \rangle \\ 1 & \text{if } v \text{ is any other } n\text{-dimensional binary vertex} \end{cases} \quad (4.18) \]

The logical monotone OR function $D_M^n(v)$ is a special case of a general logical OR function $D_G^n(v)$ with $z = \langle 1^n \rangle$. Then, according to the Perceptron implementation for $D_G^n(v)$ (which corresponds to expression 4.13), the logical monotone OR function $D_M^n(v)$ can be
realized by an n-input, 1-output Perceptron with the threshold of the output neuron being set as 1, and the connection weight from every input neuron to the output neuron being set as 1.

In order to be able to realize all the possible logical monotone OR functions in a system of m Boolean variables using their corresponding 1-layer Perceptron implementations, where \( m \geq n \). The logical monotone OR function \( D^n_M(v) \) is extended from an n-dimensional Boolean space to an m-dimensional Boolean space, where \( m \geq n \). Suppose \( \tilde{v} = < \tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_m > \) is a Boolean vector of dimension m. Assume an interest set \( J^n = \{ j_1, j_2, \ldots, j_n \} \), \( 1 \leq j_1 < j_2 < \cdots < j_n \leq m \), is defined. Define \( D^{m,J^n}_M(\tilde{v}) = \tilde{v}_{j_1}^1 \lor \tilde{v}_{j_2}^1 \cdots \lor \tilde{v}_{j_n}^1 \). Then

\[
D^{m,J^n}_M(\tilde{v}) = \begin{cases} 
0 & \text{if } \tilde{v}(J^n) = < \top^n > \\
1 & \text{if } \tilde{v}(J^n) \text{ is any other } n\text{-dimensional binary vector}
\end{cases}
\]  

(4.19)

The logical monotone OR function \( D^{m,J^n}_M(\tilde{v}) \) is a special case of a general logical OR function \( D^{m,z,J^n}_G(\tilde{v}) \) with \( z = < 1^n > \). Then, according to the Perceptron implementation for \( D^{m,z,J^n}_G(\tilde{v}) \) (which corresponds to expression 4.16), the logical monotone OR function \( D^{m,J^n}_M(v) \) can be realized by an m-input, 1-output Perceptron with

- \( \forall j_i \in J^n \& 1 \leq i \leq n, w_{j_i} = 1 \),
- \( \forall i \notin J^n \& 1 \leq i \leq m, w_i = 0 \), and
- \( \theta = 1 \).

Such a general OR neural assembly will be used as a building block to assemble the neural architectures for realizing NFA in Section 5.4.2.4.

### 4.5 A Neural Architecture for Realizing DNF Boolean Functions

Let \( \neg C_i \) be the negation of the conjunction \( C_i \). Further, let \( \neg C_i \) be denoted by \( (C_i)^0 \), and \( C_i \) by \( (C_i)^1 \). Let \( v = < v_1, \ldots, v_m > \) and \( C_i \) be defined on \( v \) for \( 1 \leq i \leq n \). Then, a DNF Boolean function \( D^{0}_{\text{DNF}}(v) = C_1^{z_1} \lor C_2^{z_2} \cdots \lor C_n^{z_n} \) can be realized by a 2-layer Perceptron. The first layer of the Perceptron consists of \( n \) m-input AND neural assemblies defined by expression 4.11, and the second layer is an n-input OR neural assembly defined by expression 4.13. Each of the \( n \) AND neural assemblies is used to realize a conjunction \( C_i \), where \( 1 \leq i \leq n \).
For example, let \( v = < v_1, v_2, v_3, v_4, v_5 > \), \( J_1^3 = \{1, 2, 3\} \), and \( J_2^5 = \{3, 4, 5\} \). Then,

\[
E(v) = (v_1 \land \neg v_2 \land v_3) \lor \neg(v_3 \land v_4 \land v_5)
\]

\[
= (v_1 \land v_2^0 \land v_3^1) \lor (v_3^1 \land v_4 \land v_5^0)
\]

\[
= (C_{5,\langle 1,0,1\rangle}^1 J_1^3(v))^1 \lor (C_{5,\langle 1,1,1\rangle}^2 J_2^5(v))^0
\]

where \( C_{5,\langle 1,0,1\rangle}^1 J_1^3(v) = v_1 \land \neg v_2 \land v_3 \) and \( C_{5,\langle 1,1,1\rangle}^2 J_2^5(v) = v_3 \land v_4 \land v_5 \). The corresponding 2-layer Perceptron which realize the DNF Boolean function \( E(v) \) is shown in Figure 4.5.

4.6 Summary and Discussion

Artificial neural networks, due to their inherent massive parallelism, potential fault tolerance and adaptation capability through learning, offer an alternative paradigm for robust and efficient implementations of logical inference systems. In this chapter, a method based on geometrical/mathematical analysis has been proposed for systematically designing neural architectures for elementary logical inference. Particularly, neural architectures for realizing logical ANDs, logical ORs, and DNF propositions have been synthesized by way of binary pattern recognition.

The input to a Boolean function can be represented as a binary (bipolar) code. Therefore, the evaluation of a Boolean function can be viewed as a process of binary (bipolar) pattern recognition. It is known that every Boolean function can be represented as a DNF expression [43]. A DNF expression is a disjunction of conjunctions. The evaluations of conjunction and disjunction can be realized by the proposed AND and OR neural assemblies respectively. Hence, any Boolean function (except the constant 0) can be realized by a 2-layer neural architecture (Perceptron) assembled from a fixed number of AND and OR neural assemblies. Besides, Perceptrons have space and speed advantages over DNF representations for representing and evaluating Boolean functions (see [43] for details). Since logical AND, logical OR, as well as DNF representation are essential to logical inference and Boolean functions are basic to many applications in science and engineering, we expect the proposed neural assemblies would find use in the construction of modular neural networks for a variety of applications. But, in or-
In order to apply neural networks to applications involving more complex logical inference, neural networks would need to be able to do variable binding, logical proof, unification, resolution, etc.

It is worth pointing out that the derivation of AND and OR neural assemblies which operate on bipolar values is straightforward given the methods proposed in this chapter and the method proposed in Section 2.2.6 for the conversion between models using bipolar and binary inputs. We expect that the resulting bipolar AND and OR neural assemblies will be exactly equivalent to those proposed in [43]. Since an input value 0 can be used to stand for unknown in bipolar model which denotes true by 1 and false by -1, bipolar model is more flexible than binary model which denotes true by 1 and false by 0.
5 NEURAL ARCHITECTURES FOR SEQUENCE PROCESSING

5.1 Introduction

Artificial neural networks (ANN), due to their inherent parallelism, offer an attractive paradigm for efficient implementations of functional modules for symbol processing. This chapter focuses on systematic designs for neural network architectures for sequence processing which is essential to many practical applications involving symbol processing in computer science, linguistics, systems modeling and control, artificial intelligence, and structural pattern recognition.

The capabilities of neural network models (in particular, recurrent networks of threshold logic units or McCulloch-Pitts neurons) in processing and generating sequences (strings defined over some finite alphabet) and hence their formal equivalence with finite state automata or regular language generators/recognizers have been known for several decades [83, 108, 117]. More recently, recurrent neural network realizations of finite state automata for recognition and learning of finite state (regular) languages have been explored by numerous authors [6, 20, 33, 38, 45, 44, 77, 122, 129, 132, 133, 134, 159, 166, 192]. There has been considerable work on extending the computational capabilities of recurrent neural network models by providing some form of external memory in the form of a tape [194] or a stack [13, 27, 66, 116, 123, 144, 161, 169, 174, 197]. To the best of our knowledge, to date, most of the research on neural architectures for sequence processing has focused on the investigation of neural networks that are designed to learn to handle sequence processing.

This chapter presents designs of several modular ANN modules for basic sequence processing. The ANN modules which are used as building blocks for the neural architectures proposed in Chapter 6 for syntax analysis include neural network architectures for realizing determin-
istic finite automata, stacks, and deterministic pushdown automata. These ANN modules are
systematically synthesized from the BMP modules proposed in Section 2. Besides, neural
network architecture for realizing nondeterministic finite automata is proposed to explore the
potential benefits of ANN in the design of high performance systems for parallel symbolic
computing applications. The rest of the chapter is organized as follows:

- The rest of Section 5.1 briefly discusses how to represent symbolic functions in terms of
  binary mappings to facilitate symbolic information manipulation via the proposed BMP
  module which operates on binary values.

- Sections 5.2, 5.3, 5.4 and 5.5 respectively explore the systematic synthesis of neural
  network architectures for realizing deterministic finite automata, deterministic pushdown
  automata, stack and nondeterministic finite automata.

- Section 5.6 concludes with a summary and a brief discussion.

5.1.1 Symbolic functions and binary mappings

In general, most of simple, non-recursive symbolic functions and table lookup functions can
be viewed in terms of a binary random mapping \( f_I : U \rightarrow V \) (expression 2.1). For example, \( f_I \)
may define a symbolic mapping function \( f_S : \Gamma_1 \times \Gamma_2 \times \cdots \times \Gamma_r \rightarrow \Delta_1 \times \Delta_2 \times \cdots \times \Delta_t \) as described
in Section 3.1.1. In this case, the operations of \( f_S \) on its associated symbols can be viewed in
terms of the binary mapping operations of \( f_I \) which in turn can be realized by a BMP module
proposed in Section 2.2.5.

Therefore, modular neural network modules for complex symbol processing can be syn-
thesized through a composition of appropriate primitive symbolic functions which are directly
realized by suitable BMP modules. Two of basic ways of recursively composing composite
symbolic functions from component symbolic functions (which may themselves be composite
functions or primitive functions) are discussed here. Let \( f \) and \( g \) be two symbolic functions
declared as follows:

\[
  f : \Gamma_1 \times \Gamma_2 \times \cdots \times \Gamma_r \rightarrow \Delta_1 \times \Delta_2 \times \cdots \times \Delta_s
\]
The composition of \( f \) and \( g \) is denoted by \( g \circ f \) such that

\[
g \circ f : \Gamma_1 \times \cdots \times \Gamma_r \to \Lambda_1 \times \cdots \times \Lambda_t
\]

and for every \( (\alpha_1, \ldots, \alpha_r) \) in \( \Gamma_1 \times \cdots \times \Gamma_r \)

\[
g \circ f(\alpha_1, \ldots, \alpha_r) = g(f(\alpha_1, \ldots, \alpha_r))
\]

Suppose \( f_i \) is a symbolic function such that

\[
f_i : \Gamma_1 \times \cdots \times \Gamma_r \to \Delta_i \text{ for } 1 \leq i \leq s
\]

The composition \( c \) of symbolic functions \( g, f_1, \ldots, f_s \) is defined as:

\[
c : \Gamma_1 \times \cdots \times \Gamma_r \to \Lambda_1 \times \cdots \times \Lambda_t
\]

and for every \( (\alpha_1, \ldots, \alpha_r) \) in \( \Gamma_1 \times \cdots \times \Gamma_r \)

\[
c(\alpha_1, \ldots, \alpha_r) = g(f_1(\alpha_1, \ldots, \alpha_r), \ldots, f_s(\alpha_1, \ldots, \alpha_r))
\]

The recursive processing of input strings of variable length (of the sort needed in lexical analysis and parsing) can be handled by composite functions \( \hat{f} : \Gamma^* \to \Delta^* \), \( \hat{g} : \Lambda \times \Gamma^* \to \Lambda \), and \( \hat{c} : \Lambda \times \Gamma^* \to \Lambda \times \Delta^* \) which are respectively realized by the modular recurrent neural architectures proposed in this chapter and Chapter 6, where \( \Gamma^* (\Delta^*) \) denotes the set of all strings over the alphabet \( \Gamma (\Delta) \). Here, function \( \hat{f} \) denotes the recursive processing of input strings of variable length by a parser or a lexical analyzer (see Chapter 6); function \( \hat{g} \) denotes the recursive evaluation of input strings of variable length by the extended transition function of a DFA (see Section 5); and function \( \hat{c} \) denotes the recursive parsing of syntactically tagged input tokens by the extended transition function of an LR(1) parser (see Chapter 6). The functions \( \hat{f} \), \( \hat{g} \), and \( \hat{c} \) that process input strings of variable length can be composed using symbolic functions \( f \), \( g \), \( c \), output selector function, and string concatenation function by recursion on the length of the input string (See Section 5 for an example). Other recursive symbolic functions can also be composed using composition and recursion [140, 154, 195].
The operation of a desired composite function on its symbolic input (string) can be fully characterized analytically in terms of its component symbolic functions on their respective symbolic inputs and outputs. The component symbolic functions are either composite functions of other symbolic functions or primitive symbolic functions which are realized directly by appropriate BMP modules. This makes it possible to systematically (and provably correctly) synthesize any desired symbolic function using BMP modules. (Such designs often require recurrent links for realizing recursive functions such as the extended transition function \( \delta \) of a DFA or a more complex recursive function as we shall see later and in Chapter 6).

5.2 Neural Network Design for Deterministic Finite Automata (NN DFA)

Deterministic finite automata (finite state machines) are a basic computing model which is essential to many science and engineering applications involving sequence processing. This section first briefly reviews the symbolic computing model for deterministic finite automata and then presents a method to systematically design neural network architectures for realizing deterministic finite automata [20].

5.2.1 Deterministic finite automata (DFA)

A deterministic finite automaton is a 5-tuple \( M_{DFA} = (Q, \Gamma, \delta, q_0, F) \) [74], where \( Q \) is a finite non-empty set of states, \( \Gamma \) is a finite non-empty input alphabet, \( q_0 \in Q \) is the initial state, \( F \subseteq Q \) is the set of final or accepting states, and \( \delta : Q \times \Gamma \rightarrow Q \) is the transition function. A finite automaton is deterministic if there is at most one transition that is applicable for each pair of state and input symbol.

The extended transition function \( \hat{\delta} \) of a DFA with transition function \( \delta \) is a mapping from \( Q \times \Gamma^* \) to \( Q \) defined by recursion on the length of the input string as follows:

- **Basis:** \( \hat{\delta}(q_i, \epsilon) = q_i \), where \( \epsilon \) is empty string.

- **Recursive step:** \( \hat{\delta}(q_i, ua) = \delta(\hat{\delta}(q_i, u), a) \) for all input symbols \( a \in \Gamma \) and strings \( u \in \Gamma^* \).

The computation of the machine \( M_{DFA} \) in state \( q_i \) with string \( w \) halts in state \( \hat{\delta}(q_i, w) \). The evaluation of the function \( \hat{\delta}(q_0, w) \) simulates the repeated application of the transition
function $\delta$ required to process the string $w$ from initial state $q_0$. A string $w$ is accepted by $M_{DFA}$ if $\delta(q_0, w) \in F$; otherwise it is rejected. The set of strings accepted by $M_{DFA}$ is denoted as $L(M_{DFA}) = \{ w | \delta(q_0, w) \in F \}$, called the language of $M_{DFA}$.

A Mealy machine is a DFA augmented with an output function. It is defined by a 6-tuple $M_{Mealy} = (Q, \Gamma, \Delta, \delta, \lambda, q_0)$ [74], where $Q$, $\Gamma$, $\Delta$, and $q_0$ are as in the DFA $M_{DFA}$, $\Delta$ is a finite non-empty output alphabet, and $\lambda$ is output function mapping from $Q \times \Gamma$ to $\Delta$. $\lambda(q, a)$ is the output associated with the transition from state $q$ on input symbol $a$. The output of $M_{Mealy}$ responding to input string $a_1a_2\cdots a_n$ is output string $\lambda(q_0, a_1)\lambda(q_1, a_2)\cdots\lambda(q_{n-1}, a_n)$, where $q_0, q_1, \ldots, q_n$ is the sequence of states such that $a_i = q_i$ for $1 \leq i \leq n$.

5.2.2 Architecture of NN DFA

A partially recurrent neural network architecture can be used to realize a DFA as shown in [20]. Its central concept is to use a BMP module to realize the transition function of a DFA. The neural representation in the BMP module is described as follows.

- The input neurons are divided into two groups. One group of input neurons has no recurrent connections and receives the binary coded current input symbol. There are $n = \lceil \log_2 |\Gamma| \rceil$ such input neurons. The second group has $m = \lceil \log_2 (|Q| + 1) \rceil$ input neurons and holds the current state (coded in binary). Each input neuron in this group has a recurrent connection from the corresponding output neuron.

- The output neurons together hold the next state (coded in binary). There are $m = \lceil \log_2 (|Q| + 1) \rceil$ output neurons.

- Every transition is represented as an ordered pair of binary codes. For each such ordered pair, a hidden neuron and its associated connections are used to realize the ordered pair in terms of binary mapping. Thus the number of required hidden neurons equals the number of valid transitions in the transition function. For example, suppose $p, q \in Q$, $a \in \Gamma$, $\delta(p, a) = q$ is a valid transition, and $p$, $q$ as well as $a$ are encoded as binary codes such that $p = < p_1, \ldots, p_m >$, $q = < q_1, \ldots, q_m >$ and $a = < a_1, \ldots, a_n >$ where $p_i, q_i, a_j \in \{0, 1\}$ for $1 \leq i \leq m$ and $1 \leq j \leq n$. Then the transition $\delta(p, a) = q$ is represented as a binary
mapping ordered pair \((p_1, ..., p_m, a_1, ..., a_n, q_1, ..., q_m)\) implemented by a BMP module (See Section 2).

- An explicit synchronization mechanism is used to support the repetitive evaluation of the transition function \(\delta\) on input string of variable length.

The transition function of a DFA can be represented as a 2-dimensional table with current state and current input symbol as indices. The operation of such a DFA involves repetitive lookup of the value for next state from the table using current state and current input symbol at each move until an error state or an accepting state is reached. Such a repetitive table lookup process involves content-based pattern matching and retrieval wherein the indices of the table are used as input patterns to retrieve the next state. This process can exploit the massively parallel associative processing capabilities of the neural associative memory proposed in Chapter 2.

Figure 5.1 shows the neural network architecture for realizing a DFA. Let \(0, 1, 2, ..., t\) denote a succession of points along the discrete time line. The current and next states are denoted by \(state(t)\) and \(state(t + 1)\) respectively. The current input symbol is denoted by \(input(t)\). This NN DFA module consists of two BMP modules, one accepting state trapping module (AST module) and three buffers. One buffer stores current state \(state(t)\), another stores input symbol \(input(t)\), and the other stores next state \(state(t + 1)\) which exists only logically but not physically. The first two buffers operate under synchronization control which enforces discrete time \(0, 1, ..., t\). The reset link resets the NN DFA to initial state.

BMP module 1, called \(NN\ DFA\ transition\ module\), realizes the transition function of a DFA. BMP module 2 is optional, and it allows the output of the NN DFA to be remapped from the output of BMP module 1. The AST module is optional and can be implemented by a BMP module. It enables BMP module 2 to produce an output only when the NN DFA goes into an accepting state. A connection from the AST module to upper-layer control would be needed to alert it when the AST module traps a rejecting state, i.e., when this NN DFA goes into a rejecting state. Let \(<0^m>\) denote the encoded binary value of dead state (garbage state), a state which is not a final state and has transitions to itself on all input symbols. Note that any
Figure 5.1 The proposed modular neural network architecture for DFA
unspecified transition will automatically have the next state coded as $<0^m>$ as a consequence of our design of a BMP module (see Section 2.2.5). This simplifies the implementation of a DFA, since any transition to rejecting state does not need to be implemented using a hidden neuron in the NN DFA transition module.

5.3 Neural Network Design for Deterministic Pushdown Automata (NN DPDA)

The capability of DFA is limited to recognition and production of the set of regular languages, the simplest class of languages in Chomsky hierarchy [74]. The capability of DFA can be extended by adding a stack. The resulting automata can recognize the set of deterministic context-free languages (DCFL), a more complex and widely used class of languages in Chomsky hierarchy [74]. This section describes a method to systematically synthesize neural network architectures for deterministic pushdown finite automata [20].

5.3.1 Deterministic pushdown automata (DPDA)

A pushdown automaton $M_{PDA}$ is a 7-tuple $(Q, \Gamma, \Delta, \delta, q_0, \bot, F)$ [74], where $Q$ is a finite set of states, $\Gamma$ is a finite input alphabet, $\Delta$ is a finite stack alphabet, $q_0 \in Q$ is the initial state, $\bot \in \Delta$ is a particular stack symbol called stack start symbol, $F \subseteq Q$ is the set of final states, and $\delta$ is the transition function mapping from $Q \times (\Gamma \cup \{\varepsilon\}) \times \Delta$ to $Q \times \Delta^*$. A pushdown automaton is deterministic if there is at most one transition that is applicable for each combination of state, input symbol and stack top symbol. We denote a DPDA by $M_{DPDA}$. An input string is accepted if the automaton processes the entire string and ends in an accepting state with an empty stack.

For the need of implementing a DPDA in a neural network, we let $\delta$ map from $Q \times (\Gamma \cup \{\varepsilon\}) \times \Delta$ to $Q \times \{\text{pop, push, noop}\} \times (\Delta \cup \{\ast\})$ to allow stack operation being expressed explicitly during the computation of a DPDA, where $\ast$ denotes a don’t care value, $\{\text{pop, push, noop}\}$ is the set of possible stack operations, and noop denotes no operation.
5.3.2 Architecture of NN DPDA

A partially recurrent neural network architecture can be used to realize a DPDA as shown in [20]. Its central concept is to use a BMP module to realize the transition function of a DPDA. The neural representation in the BMP module is described as follows.

- The input neurons are divided into three groups. The first group has \( m = \lceil \log_2(|Q|+1) \rceil \) neurons and holds the binary-coded current state. Each input neuron in this set has a recurrent connection from the corresponding output neuron. The second group receives the binary coded current input symbol and has \( n = \lceil \log_2 |\Gamma| + 1 \rceil \) neurons. The third group receives the binary coded stack top symbol and has \( k = \lceil \log_2 |A| + 1 \rceil \) neurons. The last two groups have no recurrent connections.

- The output neurons are divided into three groups. The first group represents the binary-coded next state and has \( m = \lceil \log_2(|Q|+1) \rceil \) neurons. The second group has two neurons and represents the binary-coded stack operation. The third group has \( k = \lceil \log_2 |A| + 1 \rceil \) neurons and represents the binary-coded stack symbol to be pushed into the stack or a don’t care (denoted as *) when the stack action to be performed is a pop.

- Every transition is represented as an ordered pair of binary codes. For each such ordered pair, a hidden neuron and its associated connections are used to realize the ordered pair in terms of binary mapping. Thus the number of required hidden neurons equals the number of valid transitions in the transition function. For example, suppose \( p, q \in Q, a \in (\Gamma \cup \{\epsilon\}), \alpha, \beta \in A, s \in \{\text{pop, push, noop}\}, \delta(p, a, \alpha) = (q, s, \beta) \) is a valid transition, and \( p, q, a, \alpha, \beta \) and \( s \) are encoded into binary vectors such that \( p = < p_1, \ldots, p_m >, q = < q_1, \ldots, q_m >, a = < a_1, \ldots, a_n >, \alpha = < \alpha_1, \ldots, \alpha_k >, \beta = < \beta_1, \ldots, \beta_k > \) and \( s = < s_1, s_2 > \), where \( p_i, q_i, a_j, \alpha_l, \beta_l, s_1, s_2 \in \{0, 1\} \) for \( 1 \leq i \leq m, 1 \leq j \leq n, \) and \( 1 \leq l \leq k \). Note that our representation of a transition of a DPDA is different from the conventional representation in that we express stack pop/push action explicitly. Stack push, pop, and noop actions are denoted by \( s = < 0, 1 >, s = < 1, 0 >, \) and \( s = < 0, 0 > \) respectively. Then the transition \( \delta(p, a, \alpha) = (q, s, \beta) \) is represented as the binary mapping ordered pair.
to be implemented by a BMP module (See Section 2.2.5).

- An explicit synchronization mechanism is used to support the repetitive evaluation of the transition function on input string of variable length.

The transition function of a DPDA can be represented as a 3-dimensional table with current state, current input symbol, and stack top symbol as indices. The operation of such a DPDA involves repetitive lookup of the value for next state from the table using current state, current input symbol and stack top symbol at each move until an error state or an accepting state is reached. Such a repetitive table lookup process involves content-based pattern matching and retrieval wherein the indices of the table are used as input patterns to retrieve the next state. This process can exploit the massively parallel associative processing capabilities of the neural associative memory proposed in Chapter 2.

Figure 5.2 shows the proposed modular neural network architecture for realizing a DPDA. The current and next states are denoted by \( state(t) \) and \( state(t + 1) \) respectively. This NN DPDA module consists of three BMP modules, one AST module, one stack mechanism module and four buffers. One buffer stores current state \( state(t) \), one stores current input symbol \( input(t) \), another stores stack top symbol \( stack_{top} \) and the other stores next state \( state(t + 1) \) which exists only logically but not physically. The first three buffers operate under synchronization control which enforces discrete time \( 0, 1, ..., t \). The reset link resets the NN DPDA to initial state.

BMP module 1, called \( NN\ DPDA\ transition\ module \), realizes the transition function of a DPDA. Each state transition is coded as an ordered pair of binary mapping codes. There are two push/pop connections from the NN DPDA transition module to the stack mechanism module. These links inform the stack mechanism module whether to pop or push. The AST module is optional and can be implemented by a BMP module. It enables BMP module 2 to produce output only when the NN DPDA goes into an accepting state. A connection from the AST module to upper-layer control would be needed to alert it when the AST module traps a rejecting state, i.e., when this NN DPDA goes into a rejecting state. BMP module 2
Figure 5.2 The proposed modular neural network architecture for DPDA
is optional, and it allows the output from the NN DPDA to be remapped from the output of the NN DPDA transition module. BMP module 3 is optional and provides remapping of stack symbol produced from the NN DPDA transition module. Note that any unspecified transition will have the next state $<0^m>$ given our implementation of a BMP module.

### 5.4 Neural Network Design for Stack (NN Stack)

This section first briefly discuss the symbolic computing model for stack and then presents a method to systematically design neural network architectures for realizing stacks [22].

#### 5.4.1 Symbolic representation of stack

A stack can be coded as a string over a stack alphabet, with its top element at one end of the string and its bottom element at the other end. Pop and push are the main actions of a stack. In the implementation of a stack, these actions can be performed by a DFA which is augmented with memory to store stack symbols which are accessed sequentially using a stack top pointer (SP) which points to the top symbol of the stack. The stack top pointer is maintained by the current state of the DFA, and the current action of the stack by the input to the DFA. Let $A = \{ \text{pop}, \text{push}, \text{noop} \}$ be the set of possible stack actions, $C$ the set of possible stack configurations (contents), $S$ the set of stack symbols, $P = \{0, 1, 2, \ldots, n\}$ the set of possible positions of stack top pointer, and $n$ the maximal depth (capacity) of a given stack. Let $\bot$ be stack bottom symbol and $c \cdot s$ denote the stack configuration after a stack symbol $s$ is pushed onto the stack configuration $c$. Note that $C = \{ \alpha \mid \alpha \in \bot \cdot S^* \text{ and } |\alpha| \leq n \}$, where $|\alpha|$ denotes the number of stack symbols in the stack configuration $\alpha$. Assume that the value of stack top pointer doesn't change on a noop action, and it is incremented on a push action and decremented on a pop action. The operation of a stack and the retrieval of stack top symbol from a stack can be characterized by the symbolic functions $f_{Stack} : A \times S \times C \times P \rightarrow C \times P$ and $f_{Top} : C \times P \rightarrow S \cup \{\bot\}$ respectively. They are defined as follows.
\begin{align*}
\text{\(f_{\text{Stack}}(\text{push}, s, c, p) = \begin{cases}
(c \cdot s, p+1) & \text{if } s \in S, c \in C, \quad p \in P, \text{ and } p \leq n - 1 \\
\text{error} & \text{otherwise}
\end{cases}\)} \\
\text{\(f_{\text{Stack}}(\text{pop}, \ast, c, p) = \begin{cases}
(c', p-1) & \text{if } c \in C \text{ and } c = c' \cdot s \text{ for some } s \in S \text{ and some } c' \in C; \\
\text{error} & \text{otherwise}
\end{cases}\)} \\
\text{\(f_{\text{Stack}}(\text{noop}, \ast, c, p) = \begin{cases}
(c, p) & \text{if } c \in C \text{ and } p \in P \\
\text{error} & \text{otherwise}
\end{cases}\)}
\end{align*}

\begin{align*}
\text{\(f_{\text{Top}}(c, p) = \begin{cases}
\bot & \text{if } c = \bot \text{ and } p = 0 \\
s & \text{if } c \in C \text{ and } c = c' \cdot s \text{ for some } s \in S \\
\text{error} & \text{otherwise}
\end{cases}\)} \\
\text{\(\quad \text{and some } c' \in C; \text{ and } p \in P \text{ and } |c| = p\)}
\end{align*}

where \(\ast\) stands for a don't care.

\subsection{Architecture of NN Stack}

This subsection discusses the neural network realization of a stack in terms of symbolic functions \(f_{\text{Stack}}\) and \(f_{\text{Top}}\). A design for NN Stack obtained by adding a \textit{write control module} to an NN DFA is shown in Figure 5.3. (The use of such a circuit might be considered by some to be somewhat unconventional given the implicit assumption of lack of explicit control in many neural network models. However, the operation of most existing neural networks implicitly assume at least some form of control. Given the rich panoply of controls found in biological neural networks, there is no reason not to build in a variety of control and coordination structures into neural networks whenever it is beneficial to do so [71]). NN Stack has an \(n\)-bit binary output corresponding to the element popped from the stack, and four sets of binary inputs:
Figure 5.3 The proposed neural network architecture for stack mechanism
• **Reset** which is a 1-bit signal which resets \( \text{pointer}(t) \) (current SP) to point to the bottom of the stack at the beginning.

• **Synchronization control** which is a 1-bit signal that synchronizes NN Stack with the discrete time line, denoted by \( 0, 1, \ldots, t, t+1, \ldots \).

• **Action** which is a 2-bit binary code so that
  - 01 denotes push.
  - 10 denotes pop.
  - 00 denotes no action.

• **Input stack symbol** which is an \( n \)-bit binary code for the symbol to be pushed onto the stack during a stack operation.

An NN Stack consists of a **pointer control module**, a **stack memory module**, a **write control module** and two buffers. The first buffer stores current SP value (\( \text{pointer}(t) \)) and the second stores the current stack action (push/pop). In Figure 5.3, the dotted box labeled with \( \text{pointer}(t+1) \) exists only logically but not physically, and \( \text{pointer}(t) \) and \( \text{pointer}(t+1) \) respectively denote SP before and after a stack action. SP is coded into an \( m \)-bit binary number.

### 5.4.2.1 Pointer control module

The pointer control module (BMP module 1) realizes a symbolic function \( f_{PControl} : A \times P \rightarrow P \) and controls the movement of SP which is incremented on a push and decremented on a pop. The pointer control module uses \( m + 2 \) input, \( 3 \times 2^m \) hidden, and \( m \) output neurons. \( m \) of the input neurons represent \( \text{pointer}(t) \) (current SP value), and the remaining 2 input neurons encodes the stack action. There are \( 2^m \) possible SP values. The \( m \) output neurons represent \( \text{pointer}(t+1) \) (the SP value after a stack action). Each change in SP value can be realized by a binary mapping (with one hidden neuron per change). Since **noop** (no action) is one of legal stack actions, \( 3 \times 2^m \) hidden neurons are used in the pointer control module.
5.4.2.2 Stack memory module

The stack memory module (BMP module 2) realizes the symbolic function $f_{Top}$. It uses $m$ input neurons, $n$ output neurons, and $2^m$ hidden neurons which together allow storage of $2^m$ stack symbols at $2^m$ SP positions. The stack symbols stored in stack memory module are accessed through $pointer(t+1)$ (the output of the pointer control module). Note that the BMP module 2 uses its 2nd-layer connections associated with a hidden neuron to store a symbol (see Chapter 2).

5.4.2.3 Write control module

The write control module (plus stack memory module) realizes a symbolic function $f_{SWrite} : A \times S \times C \times P \rightarrow C$. Physically, it receives $m$ binary inputs from the buffer labeled with $pointer(t)$ (denoting current SP), 1 binary input from the second output line of the buffer labeled with push/pop (denoting current stack action), and $n$ binary inputs (denoting the stack symbol to be pushed onto the stack) from environment. Stack memory module is used to store current stack configuration. The module does nothing when a pop is performed. The $n$ dotted output lines from the write control module write the $n$-bit binary-coded stack symbol into $n$ of the 2nd-layer connections associated with a corresponding hidden neuron in the stack memory module when a push is performed. The hidden neuron and its $n$ associated connections are located by using current SP value ($pointer(t)$). (The processing of stack overflow and underflow is not discussed here. It has to be taken care of by appropriate error handling mechanisms).

5.4.2.4 Timing considerations

The proposed design for NN Stack shown in Figure 5.3 is based on the assumption that the write control module finishes updating the 2nd-layer connection weights associated with a hidden neuron of stack memory module before the signals from pointer control module are passed to stack memory module during a push stack action. If this assumption fails to hold, the original design needs to be modified by adding: $n$ links from input stack symbol (buffer)
to output stack symbol (buffer); an inhibition latch, which is activated by the leftmost output line of the push/pop buffer, on the links to inhibit signal passing from input stack symbol (buffer) to output stack symbol (buffer) at a pop operation; a second inhibition latch, which is activated by the rightmost output line of the push/pop buffer between pointer control module and stack memory module to inhibit signal transmission between these two modules at a push operation.

5.4.3 NN Stack in action

This subsection symbolically illustrates how the modules of NN Stack together realize a stack by considering several successive stack actions. Symbolic function \( f_{Stack} \) is a composition of symbolic functions \( f_{PControl} \) and \( f_{SWrite} \) such that \( \forall (a, s, c, p) \in A \times S \times C \times P, \\
\begin{align*}
    f_{Stack}(a, s, c, p) &= (f_{SWrite}(a, s, c, p), f_{PControl}(a, p)).
\end{align*}
Consider the following sequence of stack operations:

1. At time = \( t_1 \), suppose the value of stack top pointer (current SP value) is 4 and the stack action to be performed is a push on a stack symbol \( a \). Let \( c_t \) be current stack configuration. At this time step, NN Stack computes \( f_{Stack}(push, a, c_t, 4) = (c_t \cdot a, 5) \) and \( f_{Top}(c_t \cdot a, 5) = a \). i.e.,

\begin{itemize}
  \item the pointer control module computes \( f_{PControl}(push, 4) = 5 \),
  \item the write control module (plus stack memory module) computes \( f_{SWrite}(push, a, c_t, 4) = c_t \cdot a \), and
  \item the stack memory module computes \( f_{Top}(c_t \cdot a, 5) = a \).
\end{itemize}

2. At time = \( t_1 + 1 \), suppose the stack action to be performed is a push on a stack symbol \( b \). At this time step, NN Stack computes \( f_{Stack}(push, b, c_t \cdot a, 5) = (c_t \cdot a \cdot b, 6) \) and \( f_{Top}(c_t \cdot a \cdot b, 6) = b \), i.e.,

\begin{itemize}
  \item the pointer control module computes \( f_{PControl}(push, 5) = 6 \),
  \item the write control module (plus stack memory module) computes \( f_{SWrite}(push, b, c_t \cdot a, 5) = c_t \cdot a \cdot b \), and
\end{itemize}
90

• the stack memory module computes \( f_{\text{Top}}(c_{t_1}, \cdot a \cdot b, 6) = b \).

3. At time \( t_1 + 2 \), suppose the stack action to be performed is a pop. At this time step, 
NN Stack computes \( f_{\text{Stack}}(\text{pop}, \cdot a, c_{t_1} \cdot a \cdot b, 6) = (c_{t_1} \cdot a, 5) \) and \( f_{\text{Top}}(c_{t_1} \cdot a, 5) = a \), i.e.,
  - the pointer control module computes \( f_{\text{Control}}(\text{pop}, 6) = 5 \),
  - the write control module does nothing, and
  - the stack memory module computes \( f_{\text{Top}}(c_{t_1} \cdot a, 5) = a \).

5.5 Neural Network Design for Nondeterministic Finite Automata (NN NFA)

This section explores how to exploit the inherent parallelism and versatile representation 
in ANN to reduce the operational and implementational time overhead of nondeterministic 
finite automata (NFA) which are a basic model of symbolic computing in computer science 
and provide a typical model suitable for the exploration of parallel symbolic computing via 
ANN. A recurrent neural network (RNN) is systematically synthesized to concurrently track 
all the possible nondeterministic computations of a given NFA. Such a concurrent breadth-first 
tracking is facilitated by two types of parallel symbolic computations executed by the proposed 
RNN. One of the types is parallel content-based pattern matching, and the other is parallel 
union operations of sets. The RNN acts like a cost-effective SIMD computer system dedicated 
to the two types of parallel symbolic computations. The proposed RNN is provably correctly 
assembled from two kinds of neural assemblies. One of the neural assemblies computes a logical 
AND, and the other computes a logical OR.

Although the concept of nondeterministicism embedded in NFA provides an elegantly simple 
and intuitive description for sequence processing, it results in much computational and 
implementational overhead in single-CPU computer systems. Thus the concept of nondeter-
ministicism in NFA, which plays a central role in both the theory of languages and the theory 
of computation [74], provides a typical model suitable for the exploration of parallel symbolic
computing via neural networks. The reduced operation time complexity of NFA realized by the proposed RNN is due to the parallel operations of the neural assemblies in the RNN.

It is well known that DFA and NFA are equivalent, and every NFA can be converted into its equivalent DFA [74]. NFA seem to be of no practical interest in direct application implementations since they are embedded with nondeterminism and don't correspond naturally to deterministic algorithms. But, NFA have a variety of practical applications in computer science, linguistics, systems modeling and control, and artificial intelligence; and NFA are simpler and more intuitive to design than their equivalent DFA for a given task due to the powerful concept of nondeterminism embedded in NFA, especially for pattern matching [195]. NFA are rarely directly implemented in conventional computer systems because the nondeterminism in NFA causes operational and implementational overhead. Usually, they are converted into their equivalent DFA for implementation. So, for syntax analysis on regular languages, an NFA could be constructed for a given language first, and then its equivalent DFA is implemented to recognize the language. The direct construction of an NFA is as simple as that of a DFA using the proposed RNN in which the power of nondeterminism in NFA is retained, and there is no need to convert an NFA into its equivalent DFA before its construction. Note also that every DFA is an NFA. Therefore, the proposed RNN can be used as a general neural architecture for realizing finite automata including DFA and NFA.

5.5.1 Nondeterministic finite automata (NFA)

A nondeterministic finite automaton \( M_{NFA} \) is a 5-tuple \((Q, \Gamma, \delta', q_0, F)\) [74], where \( Q, \Gamma, q_0, \) and \( F \) have same meaning as for a DFA, but \( \delta' \) is a mapping from \( Q \times \Gamma \) to \( 2^Q \). Note that \( 2^Q \) is the power set of \( Q \), and \( \delta'(q, a) \) is the set of all states \( p \) such that there is a transition, denoted as \((q, a, p)\), from \( q \) to \( p \) on an input symbol \( a \). Also note that there could be more than one transition which is applicable for each combination of state and input symbol in an NFA, and \(|\delta'(q, a)|\) is bounded by \(|Q|\), where \(|A|\) denotes the cardinality of set \( A \). An input string is accepted by \( M_{NFA} \) if there is a computation on the input string by \( M_{NFA} \) which processes the entire input string and halts in an accepting state; otherwise it is rejected. The set of strings
accepted by $M_{NFA}$ in $\Gamma^*$ is denoted as $L(M_{NFA})$, called the language accepted by $M_{NFA}$.

5.5.1.1 Advantages of NFA for applications

It is well known that NFA and DFA are equivalent [74]. Two automata are said equivalent if they accept the same language. Any language accepted by an NFA can also be accepted by a DFA, and every NFA can be converted into an equivalent DFA [74]. However, an NFA is usually simpler and more intuitive to design than its equivalent DFA for a given language due to the powerful concept of nondeterminism inherent in NFA. Figures 5.4 and 5.5 respectively show the state diagrams of an NFA and its equivalent DFA. Both of them accept input strings that contain the sub-string abaa [195]. These two automata are equivalent, but apparently the language the NFA accepts is much easier to understand. The state diagram of an NFA or a DFA is a labeled directed graph in which the nodes denote the states of the NFA or DFA, and the arcs are obtained from their transition functions. An arc from node $q_i$ to $q_j$ is labeled $a$ if $\delta(q_i, a) = q_j$ for a DFA or $q_j \in \delta(q_i, a)$ for an NFA, and the transition $(q_i, a, q_j)$ is a fan-in transition for state $q_j$ on input symbol $a$. Note that if an NFA has $Q$ states, then the number of possible states of its equivalent DFA could be as large as $2^{Q!}$ and the number of possible transitions in the DFA could also be the same order.

5.5.2 Model for concurrently tracking all the possible nondeterministic moves in the operation of an NFA using RNN

The deterministic and linear-time operation of a given NFA which is realized by the proposed RNN can be modeled conventionally by its equivalent DFA which is constructed according to Subset Construction algorithm [195]. The main idea of Subset Construction is to concurrently track all the possible states that can be reached at each step of an NFA. In the computation of an NFA, tracking all the reachable states at each step induces much overhead in single-CPU computer systems, whereas the proposed RNN efficiently computes in parallel, all the reachable states at each step by exploiting the parallelism of ANN. In Subset Construction, for a given NFA $M_{NFA} = (Q, \Gamma, \delta', q_0, F)$, a DFA $M''_{DFA} = (2^Q, \Gamma, \delta'', Q_0, F'')$ is defined from
Figure 5.4 The state diagram of an NFA that accepts any input string containing the sub-string abaa

Figure 5.5 The state diagram of a DFA that accepts any input string containing the sub-string abaa

$M_{NFA}$ such that $L(M_{NFA}) = L(M''_{DFA})$, where $Q_0 = \{q_0\}$, $F'' = \{K \mid K \subseteq Q \& K \cap F \neq \emptyset\}$, and $\delta'' : 2^Q \times \Gamma \rightarrow 2^Q$ is defined by

$$Q_j = \delta''(Q_i, a), \text{ if } Q_j = \bigcup_{q \in Q} \delta'(q, a) \text{ for all } Q_i \subseteq Q \& a \in \Gamma$$  \hspace{1cm} (5.12)

One main problem with Subset Construction, which views every $Q_i$ as an individual state in implementation, is the exponential increase in the number of states ($O(2^{\left|Q\right|})$) and the number of possible transitions defined in transition function $\delta'' (O(2^{2\left|Q\right|} \times |\Gamma|))$. This situation can often be somewhat alleviated by Iterative Subset Construction [195] which only includes the states that can be reached from initial state $q_0$. Let $M''_{DFA} = (Q'', \Gamma, \delta'', Q_0, F'')$ be defined from $M_{NFA}$ according to Iterative Subset Construction such that $L(M_{NFA}) = L(M''_{DFA})$. Since Iterative Subset Construction eliminates the states which can not be reached from initial state $q_0$, $M''_{DFA}$ is smaller than $M''_{DFA}$ in terms of the number of defined states and transitions.

The major drawback of both subset algorithms is the bookkeeping overhead associated with maintaining $\delta''$, $F''$, $Q''$, $\delta''$, and $F''$, which are derived from $M_{NFA}$. The direct realization of a
given NFA by the proposed RNN avoids this problem since the transition function module of
the proposed NN NFA captures the regularity of \( \delta'' \) in expression 5.12 without actually
knowing in advance the legal transitions defined by \( \delta'' \). Such simplification is partly facilitated
by representationally viewing every \( Q_i \) as an individual set of states denoted by localist neural
representation. The transition module of the proposed NN NFA realizes not only \( \delta'' \) but also
\( \delta^\ast \). Since the proposed RNN always starts from initial state \( q_0 \) for any input string, the states
which can not be reached from \( q_0 \) will not appear in the transition module of the proposed NN
NFA during input processing, i.e., only the states in \( Q^\ast \) will appear in the proposed transition
module during input processing. Hereafter, we only discuss \( M_{DFA}^D \) instead of \( M_{DFA}^N \).

Let 0, 1, ..., \( t \), \( t+1 \), ... denote a succession of points along the discrete time line. Then,
for an NFA, let us call \( Q_{act}(0) = Q_0 \) the initial set of active states, \( Q_{act}(t) \) the current set of
active states which corresponds to the set of reachable states from \( q_0 \) by \( M_{NFA} (M_{DFA}^D) \) at
time \( t \) (current time), and \( Q_{act}(t + 1) \) the next set of active states which corresponds to the set
of reachable states from \( q_0 \) at time \( t + 1 \). \( Q_{act}(t) \) and \( Q_{act}(t + 1) \) are derived recursively from
\( Q_{act}(0) \) by expression \( Q_{act}(t + 1) = \bigcup_{q \in Q_{act}(t)} \delta'(q, a) \) during the processing of the input string,
where \( a \) is input symbol at time \( t \). \( Q_{act}(t) \) is bounded in a way that \( Q_{act}(t) \subseteq Q \) for \( t \geq 0 \),
i.e., all the sets of reachable states from initial state during the processing of the input string
are bounded by a same set of states, the number of reachable states at each step does not
proliferate indefinitely or exponentially during the processing of the input string, and thus the
nondeterministicism shown during the processing of the input string is globally bounded. The
proposed RNN directly constructs a given NFA \( M_{NFA} \) without the need to convert the given
NFA into its equivalent DFA \( M_{DFA}^N \). It concurrently tracks all the possible moves during the
processing of the input string in the NFA by simulating the deterministic move of the \( M_{DFA}^N \).
According to Iterative Subset Construction and expression 5.12, the transition function \( \delta^\ast \)
and every move of \( M_{DFA}^N \) can be characterized by

\[
\forall a \in \Gamma \forall t \geq 0 \ [Q_{act}(t + 1) = \delta^\ast(Q_{act}(t), a)]
\]  

(5.13)

where \( Q_{act}(0) = \{q_0\} \) and \( Q_{act}(t + 1) = \bigcup_{q \in Q_{act}(t)} \delta'(q, a) \)
For an input string, the recursive evaluation of $Q_{act}(t + 1)$ along the moves of $M_{DFA}^{NFA}$ involves two kinds of repetitive symbolic computations, one of which computes the sets of reachable states from every state in $Q_{act}(t)$ and the other of which computes the union of the sets of the reachable states. In the proposed NN NFA, the former is computed by the first layer of the transition module of the proposed NN NFA by parallel content-based pattern matching and the later by the second layer by parallel logical OR operations. In applications of realizing an NFA by the proposed RNN, a special symbol $\$ \notin \Gamma$ might need to be appended at the end of the input string to acknowledge the end of input. When the $\$ $ is encountered, the RNN terminates input processing and tests the acceptance of the input string.

5.5.3 Architecture of NN NFA

This subsection describes the symbolic and neural representations in the proposed NN NFA, and presents a method for assembling the proposed NN NFA using the neural assemblies proposed in Chapter 4.

Figure 5.6 shows the partially recurrent neural network architecture for concurrently tracking all the nondeterministic computations of a given NFA. The entire architecture essentially consists of one NN NFA transition module, one acceptance testing module, one end-of-input testing module which is not shown in the figure, three buffers, and recurrent links from the output neurons of the NN NFA transition module to the buffer storing $Q_{act}(t)$ (which could be part of the input neurons of the NN NFA transition module, depending on the applications implemented). One buffer stores current set of active states $Q_{act}(t)$, another buffer (which could also be part of the input neurons of the NN NFA transition module, depending on applications implemented) stores current input symbol $a(t)$, and the other buffer (which exists only logically but not physically) represents the next set of active states $Q_{act}(t + 1)$. The first two buffers are under centralized synchronization control which enforces discrete time 0, 1, ..., t, t+1, .... The link "reset" resets the NN NFA to its initial set of active states.
Figure 5.6 The proposed recurrent neural network architecture for concurrently tracking all the nondeterministic computations of a given NFA
5.5.3.1 Symbolic representation in the NN NFA transition module

The realization of the symbolic function $\delta^*$ by the NN NFA transition module is central to the construction of the proposed NN NFA. The notations used here follow those described in previous subsections. The symbolic representations of the NN NFA transition module which is a 2-layer Perceptron are described as follows.

- The output from every neuron and the input to every input neuron are binary value.

- Every transition defined by the transition function $\delta^*$ (expression 5.13) is represented as an ordered binary mapping pair $< Q_{act}(t) \times a(t), Q_{act}(t+1) >$ stored in the NN NFA transition module.

- The input neurons together denote $Q_{act}(t) \times a(t)$ and are divided into two groups. One group uses a distributed representation and the other uses a local representation. The former group has no recurrent connection and denotes the binary-coded current input symbol $a(t)$. There are $[\log(|F| + 1)]$ such input neurons. The latter group has recurrent connections and denotes the current set of active states $Q_{act}(t)$. There are $|Q|$ such input neurons, the $i$th neuron of which denotes whether state $q_{i-1}$ is in $Q_{act}(t)$. If the value at the $i$th neuron of this group is 1, then state $q_{i-1}$ is in $Q_{act}(t)$. Otherwise state $q_{i-1}$ is not in $Q_{act}(t)$. In this group, the $i$th input neuron has a recurrent link from the $i$th output neuron.

- The hidden neurons along with their associated 1st-layer connections are used to concurrently recognize all the applicable transitions for the states in the current set of active states on current input symbol. The hidden layer uses a local representation, and one hidden neuron is used for one uniquely defined transition. The number of activated hidden neurons at each step of the proposed NN NFA equals $\sum_{q \in Q_{act}(t)} | \delta'(q, a(t)) |$. The activated hidden neurons in turn activate some of the output neurons which together denote the next set of active states.
• The output layer uses a local representation, and the output neurons together denote the next set of active states $Q_{act}(t + 1)$. There are $|Q|$ output neurons, the $i$th neuron of which denotes whether state $q_{i-1}$ is in $Q_{act}(t + 1)$. If the value at the $i$th neuron is 1, then state $q_{i-1}$ is in $Q_{act}(t + 1)$. Otherwise state $q_{i-1}$ is not in $Q_{act}(t + 1)$. The output neurons along with their associated 2nd-layer connections (which get their input from the hidden neurons) operate together to compute the next set of active states according to expression $Q_{act}(t + 1) = \bigcup_{q \in Q_{act}(t)} \delta'(q, a)$ (the union of the sets of reachable states reached by the states in the current set of active states on current input symbol).

• The recurrent connections from the output neurons to part of the input neurons facilitate the continuous execution of the proposed NN NFA.

5.5.3.2 Neural representation in the NN NFA transition module

Let $n_T = \sum_{q \in Q} \sum_{a \in \Gamma} | \delta'(q, a) |$, $n_I = \lfloor \log(|\Gamma| + 1) \rfloor$, and $n_A = |Q|$ be respectively the total number of defined transitions of a given NFA, the number of input neurons used for denoting current input symbol, and the number of input neurons used for denoting the current set of active states in the NN NFA transition module. Then the NN NFA transition module has $(n_A + n_I)$ input neurons, $n_T$ hidden neurons, and $n_A$ output neurons. The hidden neurons along with their associated 1st-layer connections are used to identify all the transitions applicable for the states in the current set of active states on current input symbol. One hidden neuron is used for one uniquely defined transition in the given NFA. The NN NFA transition module is constructed directly from the transition function $\delta'$ of the given NFA $M_{NFA}$.

Let binary vectors $u = \langle u_1, ..., u_{n_A+n_I} \rangle$ and $v = \langle v_1, ..., v_{n_A} \rangle$ respectively denote the ordered values at input neurons and output neurons in the NN NFA transition module. The first $n_A$ components of vector $u$, being $\langle u_1, ..., u_{n_A} \rangle$, together represent $Q_{act}(t)$; and the last $n_I$ components of vector $u$, being $\langle u_{n_A+1}, ..., u_{n_A+n_I} \rangle$, together represent current input symbol $a$. The vectors $u$ and $v$ respectively represent $Q_{act}(t) \times a$ and $Q_{act}(t + 1)$ for the given NFA. Let $J^{n_I+1}_i = \{i + 1, n_A + 1, n_A + 2, ..., n_A + n_I\}$ be an interest set for $0 \leq i \leq n_A - 1$. Totally $n_A$ interest sets $J^{n_I+1}_0, J^{n_I+1}_1, ..., J^{n_I+1}_{n_A-1}$ are defined. Let current input symbol $a$ be
encoded as a binary vector $< a_1, ..., a_{n_f} >$, where $a_k \in \{0, 1\}$ for $1 \leq k \leq n_f$. If $u_{i+1} = 1$, then $q_i$ is in $Q_{act}(t)$ and $u(J_{i}^{n_f+1}) = u_{i+1}, u_{n_A+1}, ..., u_{n_A+n_f} = 1$. $a_1, ..., a_{n_f}$ denotes $\{q_i\} \times a$, where $0 \leq i \leq n_A - 1$.

5.5.3.3 Parallel symbolic computations in the NN NFA transition module

The realization of every move (the move from $Q_{act}(t)$ to $Q_{act}(t + 1)$) of $M_{DPF}$ in the NN NFA transition module can be reasoned in two steps. The first step is computed by the hidden neurons which serve as parallel recognizers of multiple input sub-patterns, and the second step is computed by the output layer which serves as a parallel union operator of sets.

1. The hidden layer consists of a fixed number of neural assemblies which operate in parallel and independently of each other. Thus the hidden neurons serve as parallel recognizers of multiple sub-patterns contained in the input. Such a neural assembly (equivalent to an AND neural assembly) for partial pattern recognition is proposed in Section 4. Each hidden neuron $h$ and its associated 1st-layer connections serve as a neural assembly for recognizing a certain $J_{i}^{n_f+1}$-set partial input pattern. Each such neural assembly checks for a certain transition $(q_i, a, q_j)$ ($0 \leq i, j \leq n_A - 1$) whether current input vector $u$ contains the $J_{i}^{n_f+1}$-set partial pattern $u(J_{i}^{n_f+1}) = 1, a_1, ..., a_{n_f}$ (denoting $\{q_i\} \times a$. and $q_i \in Q_{act}(t)$) according to expression 4.2. If it is, the hidden neuron $h$ is activated by the partial input pattern $\{q_i\} \times a$ at time $t$ and the transition $(q_i, a, q_j)$ is applied. Totally there are $n_f$ such neural assemblies operating in parallel to identify all the possible transitions which are applicable for the states in $Q_{act}(t)$ on current input symbol, and thus they operate together like a simplified, cost-effective SIMD computer system dedicated to parallel partial pattern matching.

2. The output layer consists of a fixed number of the monotone OR neural assemblies proposed in 4. Each of the assemblies computes a logical OR operation in parallel with each other on shared inputs. Each output neuron and its associated 2nd-layer connections compose such a neural assembly. Each such neural assembly is used to check for a certain state $q_j$ whether any of its fan-in transitions is applicable for the states in $Q_{act}(t)$
on current input symbol $a$. If it is, then output neuron $j + 1$ is activated and $q_j$ is in $Q_{act}(t + 1)$. Totally there are $n_A$ such neural assemblies (output neurons) which share their input, and operate in parallel to compute the next set of active states $Q_{act}(t + 1)$ according to expression $Q_{act}(t + 1) = \cup_{q \in Q_{act}(t)} \delta'(q, a)$. Such neural assemblies operate together to compute $Q_{act}(t + 1)$ like a simplified, cost-effective SIMD computer system dedicated to parallel union computations of sets.

When the representations of $Q_{act}(t)$ and $Q_{act}(t + 1)$ are viewed locally (i.e., each of them is viewed as a set of states), the NN NFA transition module realizes the transition function $\delta'$ of the given NFA $M_{NFA}$ if it is restricted that $|Q_{act}(t)| = 1$. Note that $\delta'$ maps from $Q \times \Gamma$ to $2^Q$. Such local representations facilitate the parallel recognition of all the transitions applicable for the states in $Q_{act}(t)$ on current input symbol even if $|Q_{act}(t)| \geq 1$. Thus the representations facilitate the concurrent tracking of all the possible nondeterministic paths at each move of an NFA. When the representations of $Q_{act}(t)$ and $Q_{act}(t + 1)$ are viewed distributedly (i.e., each of them is viewed as a state), the NN NFA transition module realizes the transition function $\delta^*$ of $M_{DFA}$. It means that the NN NFA transition module concurrently realizes the transition functions $\delta'$ and $\delta^*$. Such a concurrent realization facilitates not only the direct construction of the NN NFA transition module from the transition function $\delta'$ but also the linear operation time complexity of the proposed NN NFA for the processing of input strings.

5.5.3.4 Settings of connection weights and thresholds in the NN NFA transition module

Note that every transition defined by expression 5.13 is represented as an ordered binary mapping pair $<Q_{act}(t) \times a(t), Q_{act}(t + 1)>$ at the input and output layers of the NN NFA transition module, and such mappings are achieved by capturing the regularity in expression $Q_{act}(t + 1) = \cup_{q \in Q_{act}(t)} \delta'(q, a(t))$ using a 2-layer Perceptron.

Suppose $q_i, q_j \in Q, a \in \Gamma$, and $q_j \in \delta'(q_i, a)$, where $0 \leq i, j \leq n_A - 1$. In order to identify the transition $(q_i, a, q_j)$ which is applicable on an input containing the partial input pattern $\{q_i\} \times a$ in the NN NFA transition module, the interest set $J_{i, t}^{i+1} = \{i + 1, n_A + 1, n_A + 2, ... , n_A + n_J\}$
is used for the identification of the \( J_i^{n_i+1} \)-set partial input vector \( u(J_i^{n_i+1}) = < 1, a_1, ..., a_{n_i} > \) which denotes \( \{q_i\} \times a \).

According to expressions 4.2, 4.16 and their corresponding Perceptron implementation, a hidden neuron \( h \) is created, and its associated connection weights as well as threshold are set for every transition \((q_i, a, q_j)\) of the given NFA \( M_{NFA} \) in the NN NFA transition module as follows:

1. In the 1st-layer connections, according to expression 4.2,
   - the connection weight from the \((i + 1)\)th input neuron to the hidden neuron \( h \) is set to 1,
   - the connection weight from the \((n_A + k)\)th input neuron to the hidden neuron is set to \( 2a_k - 1 \) for \( 1 \leq k \leq n_i \), and
   - the connection weights from other input neurons (which are not in \( J_i^{n_i+1} \)) to the hidden neuron are set to 0.

2. The threshold of the hidden neuron is set to \( \sum_{k=1}^{n_i} a_k + 1 \).

3. In the 2nd-layer connections,
   - the connection weight from the hidden neuron to the \((j + 1)\)th output neuron is set to 1, and
   - the connection weights from the hidden neuron to other output neurons are set to 0.

4. The thresholds of all output neurons are set to 1 in the NN NFA transition module.

Therefore, if \( q_i \in Q_{act}(t) \) and \( a \) is current input symbol; then \( u_{i+1} = 1, u_{n_A+k} = a_k \) for \( 1 \leq k \leq n_i \) at time \( t \), an input containing sub-pattern \( u(J_i^{n_i+1}) = < 1, a_1, ..., a_{n_i} > \) is identified by hidden neuron \( h \) and its associated 1st-layer connections, the hidden neuron \( h \) is activated, and in turn the \((j + 1)\)th output neuron is activated (i.e., \( v_{j+1} = 1 \) at time \( t + 1 \), and thus \( q_j \in Q_{act}(t + 1) \)) according to above settings. So, the transition \((q_i, a, q_j)\) is applied in the NN NFA transition module.
5.5.3.5 Settings of connection weights and thresholds in the acceptance testing module

The acceptance testing module of the proposed NN NFA tests whether an input string is accepted by the NN NFA at the end of input processing. It is a 1-layer Perceptron which has $n_A$ input neurons and an output neuron.

The output neuron tests whether $Q_{act}(t + 1) \in F$ by checking whether any state of $F$ is in $Q_{act}(t + 1)$ at the end of input processing. Such a test can be characterized by a monotone logical OR operation (expression 4.17) on the values of the neurons denoting accepting states, and hence it can be realized by a Perceptron according to expression 4.16 with $v_i$ denoting whether state $q_{i-1}$ is in $F$. The connection weights and threshold of the accepting neuron are set as follows:

- If $q_i \in F$, then the connection weight from the $(i + 1)$th input neuron to the output neuron is set to 1 for $0 < i < n_A - 1$. Otherwise it is set to 0.
- The threshold of the output neuron is set to 1.

5.5.3.6 The end-of-input testing module

In the proposed NN NFA, an end-of-input testing module is used to test the end of input string. The end-of-input testing module is a neural assembly (a 1-layer/1-output Perceptron) which recognizes the end-of-input symbol $\$" that is encoded as binary vector $<1^n\$>. By expression 4.1 and its corresponding Perceptron implementation, all connection weights are set to 1 and the threshold at output neuron is set to $n_1$ in the 1-layer/1-output Perceptron to recognize $\$. The end-of-input testing module is not shown in Figure 5.6.

5.5.3.7 Operation time complexity of the proposed NN NFA

The time complexity of processing an input string of length $n$ by an NFA directly implemented in single-processor computer systems is $O(m^2n)$ [163], where $m$ is the number of states in the NFA. The proposed NN NFA concurrently tracks all the possible nondeterministic
transitions during the processing of an input string for a given NFA by exploiting the inherent parallelism in ANN. In such a computation, the proposed NN NFA retains the powerful concept of nondeterministicism of NFA, and it also has the advantage of DFA which run in linear time proportional to the length of the input string. Since the NN NFA transition module realizes both the transition functions $\delta'$ and $\delta^*$, the time complexity of processing an input string by such a parallel and deterministic computation in the proposed NN NFA is linearly proportional to the length of the input string, i.e., for an input string of length $n$ the processing time complexity in the proposed NN NFA is $O(n)$. Therefore the computational overhead of input processing due to the nondeterministicism in NFA can be eliminated by taking advantage of the inherent parallelism of ANN as shown by the proposed NN NFA.

5.5.4 Proof of correctness

This subsection proves the correctness in the construction of the proposed NN NFA for a given NFA.

Theorem 5.1: The proposed NN NFA can correctly realize a given NFA $M_{NFA} = (Q, \Gamma, \delta', q_0, F)$.

Proof: The theorem is proved by showing that the NN NFA transition module of the proposed NN NFA correctly realizes the transition function $\delta^*$ of $M_{DFA}$ which is re-defined from the given NFA $M_{NFA}$ according to Iterative Subset Construction (please refer to [195] for the proof of equivalence between a given NFA $M_{NFA}$ and its equivalent DFA $M_{DFA}$). Note that every transition defined by the transition function $\delta^*$ is represented as an ordered binary mapping pair $< Q_{act}(t) \times a(t), Q_{act}(t + 1) >$ at the input and output layers of the NN NFA transition module. The mappings are implemented in the NN NFA transition module without knowing in advance every possible mapping pair (legal transition) in transition function $\delta^*$. Instead, they are realized by capturing the regularity in expression $Q_{act}(t + 1) = \cup_{q \in Q_{act}(t)} \delta'(q, a(t))$ using a 2-layer Perceptron, the first layer of which consists of AND neural assemblies and the second layer of which consists of OR neural assemblies.
All the notations and representations here follow previous subsections. The transition function $\delta^*$ is defined by expression 5.13 as follows:

$$\forall a \in \Gamma \forall t \geq 0 \left[ Q_{act}(t+1) = \delta^*(Q_{act}(t), a) \right]$$

where $Q_{act}(0) = \{q_0\}$ and $Q_{act}(t+1) = \cup_{q \in Q_{act}(t)} \delta'(q, a)$. Since $Q_{act}(t+1)$ is computed from $Q_{act}(t)$ and $a$, the above expression can be denoted as following:

$$\forall a \in \Gamma \forall t \geq 0 \left[ Q_{act}(t+1) = \cup_{q \in Q_{act}(t)} \delta'(q, a) \right]$$

(5.14)

where $Q_{act}(0) = \{q_0\}$. Expression 5.14 is equivalent to

$$\forall a \in \Gamma \forall t \geq 0 \left[ \forall q_i \in Q_{act}(t) \& \forall q_j \in \delta'(q_i, a) \Rightarrow q_j \in Q_{act}(t+1) \right]$$

(5.15)

where $Q_{act}(0) = \{q_0\}$. We want to show that the NN NFA transition module realizes the transition function $\delta^*$ of $M_{DFA}^*$ by proving that expression 5.15 is preserved by the NN NFA transition module.

The NN NFA transition module is represented in such a way (see Section 5.5.3) that the conditions $a \in \Gamma$ and $0 \leq i, j \leq n_A - 1$ always hold in expression 5.15. Let $u = < u_1, ..., u_{n_A+n_f} >$ and $v = < v_1, ..., v_{n_A} >$ be respectively the binary input value and output value of the NN NFA transition module, and $a \in \Gamma$ be current input symbol encoded as a binary value $< a_1, ..., a_{n_f} >$, where $a_k \in \{0,1\}$ for $1 \leq k \leq n_f$. Let $w_{k,i}^1$, $w_{j,k}^2$, $\theta_k^1$, and $\theta_j^2$ respectively denote the 1st-layer connection weight from input neuron $i$ to hidden neuron $k$, the 2nd-layer connection weight from hidden neuron $k$ to output neuron $j$, the threshold of hidden neuron $k$, and the threshold of output neuron $j$ in the NN NFA transition module, where $1 \leq i \leq n_A + n_f$, $1 \leq k \leq n_T$, and $1 \leq j \leq n_A$.

For all $i$ and $j$ ($0 \leq i, j \leq n_A - 1$), if $q_j \in \delta'(q_i, a)$, a hidden neuron $h$ with an interest set $J_i^{n_f+1} = \{i+1, n_A+1, n_A+2, ..., n_A+n_f\}$ should have been created by the proposed method presented in Section 5.5.3 for the transition $(q_i, a, q_j)$ for recognizing the $J_i^{n_f+1}$-set partial input value $u(J_i^{n_f+1}) = < 1, a_1, ..., a_{n_f} >$, denoting $\{q_i\} \times a$, by following settings:

- $\theta_h^1 = \sum_{k=1}^{n_f} a_k + 1$,
- $w_{h,i+1}^1 = 1$
• $w^1_{h,k} = 0$ for $1 \leq k \leq n_A$ & $k \neq (i + 1)$ (where $k \notin J^{n_I + 1}_i$),

• $w^1_{h,k} = 2a_j - 1$ for $n_A + 1 \leq k \leq n_A + n_I$ (where $k \in J^{n_I + 1}_i$),

• $\theta^2_j = 1$,

• $w^2_{j+1,h} = 1$, and

• $w^2_{m,A} = 0$ for $1 \leq m \leq n_A$ & $m \neq (j + 1)$.

For any moment $t \geq 0$, if $q_i \in Q_{act}(t)$ (i.e. $u_{i+1} = 1$), the $J^{n_I + 1}_i$-set partial input vector $u(J^{n_I + 1}_i) = < a_1, ..., a_n, r >$, denoting $\{ q_i \} \times a$, is recognized and hidden neuron $h$ is activated when the input $u$ denoting $Q_{act}(t) \times a$ is fed into the NN NFA transition module. The hidden neuron $h$ in turn activates output neuron $j + 1$. So $v_{j+1} = 1$ at time $t + 1$, and thus $q_j \in Q_{act}(t + 1)$. Therefore expression 5.15, i.e., expression 5.13, is preserved by the NN NFA transition module.

5.5.5 NN NFA in Action

This subsection constructs an NN NFA transition module of the proposed NN NFA for the NFA defined in Figure 5.4 (see Section 5.5.1). In the NFA, $Q = \{ q_0, q_1, q_2, q_3, q_4 \}$, $q_0$ is initial state, $\Gamma = \{ a, b \}$, and $F = \{ q_4 \}$. The transitions defined in the NFA are $(q_0, a, q_0)$, $(q_0, b, q_0)$, $(q_0, a, q_1)$, $(q_1, b, q_2)$, $(q_2, a, q_3)$, $(q_3, a, q_4)$, $(q_4, a, q_4)$, and $(q_4, b, q_4)$. Then $n_T = 8$, $n_I = 2$, and $n_A = 5$. The NN NFA transition module has 7 input, 8 hidden, and 5 output neurons. Let $w^1_{i,j}$, $w^2_{j,k}$, $\theta^1_k$, and $\theta^2_j$ respectively denote the 1st-layer connection weight from input neuron $i$ to hidden neuron $k$, the 2nd-layer connection weight from hidden neuron $k$ to output neuron $j$, the threshold of hidden neuron $k$, and the threshold of output neuron $j$ in the NN NFA transition module, where $1 \leq i \leq 7$, $1 \leq k \leq 8$ and $1 \leq j \leq 5$. Let input symbol $a$ be encoded as $< 0, 0 >$ and $b$ as $< 0, 1 >$. Then the connection weights and neuron thresholds of the NN NFA transition module are set as follows:

$\theta^1_1 = 1$, $w^1_{1,1} = 1$, $w^1_{1,2} = 0$, $w^1_{1,3} = 0$, $w^1_{1,4} = 0$, $w^1_{1,5} = 0$, $w^1_{1,6} = -1$, $w^1_{1,7} = -1$,

$\theta^2_1 = 2$, $w^1_{2,1} = 1$, $w^1_{2,2} = 0$, $w^1_{2,3} = 0$, $w^1_{2,4} = 0$, $w^1_{2,5} = 0$, $w^1_{2,6} = -1$, $w^1_{2,7} = 1$, ...
\[ \theta_1 = 1, w_{1,1}^1 = 1, w_{1,2}^1 = 0, w_{1,3}^1 = 0, w_{1,4}^1 = 0, w_{1,5}^1 = 0, w_{1,6}^1 = -1, w_{1,7}^1 = -1, \]
\[ \theta_2 = 2, w_{2,1}^2 = 0, w_{2,2}^2 = 1, w_{2,3}^2 = 0, w_{2,4}^2 = 0, w_{2,5}^2 = 0, w_{2,6}^2 = -1, w_{2,7}^2 = 1, \]
\[ \theta_3 = 1, w_{3,1}^3 = 0, w_{3,2}^3 = 0, w_{3,3}^3 = 1, w_{3,4}^3 = 0, w_{3,5}^3 = 0, w_{3,6}^3 = -1, w_{3,7}^3 = -1, \]
\[ \theta_4 = 1, w_{4,1}^4 = 0, w_{4,2}^4 = 0, w_{4,3}^4 = 0, w_{4,4}^4 = 1, w_{4,5}^4 = 0, w_{4,6}^4 = -1, w_{4,7}^4 = 1, \]
\[ \theta_5 = 1, w_{5,1}^5 = 0, w_{5,2}^5 = 0, w_{5,3}^5 = 0, w_{5,4}^5 = 0, w_{5,5}^5 = 0, w_{5,6}^5 = -1, w_{5,7}^5 = -1, \]
\[ \theta_6 = 1, w_{6,1}^6 = 0, w_{6,2}^6 = 0, w_{6,3}^6 = 0, w_{6,4}^6 = 1, w_{6,5}^6 = 0, w_{6,6}^6 = -1, w_{6,7}^6 = -1, \]
\[ \theta_7 = 1, w_{7,1}^7 = 0, w_{7,2}^7 = 0, w_{7,3}^7 = 0, w_{7,4}^7 = 0, w_{7,5}^7 = 1, w_{7,6}^7 = -1, w_{7,7}^7 = -1, \]
\[ \theta_8 = 2, w_{8,1}^8 = 0, w_{8,2}^8 = 0, w_{8,3}^8 = 0, w_{8,4}^8 = 0, w_{8,5}^8 = 1, w_{8,6}^8 = -1, w_{8,7}^8 = 1, \]
\[ \theta_j^j = 1 \text{ for } 1 \leq j \leq 5, \]
\[ w_{1,1}^1 = 1, w_{1,2}^1 = 1, w_{1,3}^1 = 0, w_{1,4}^1 = 0, w_{1,5}^1 = 0, w_{1,6}^1 = 0, w_{1,7}^1 = 0, w_{1,8}^1 = 0, \]
\[ w_{2,1}^2 = 0, w_{2,2}^2 = 0, w_{2,3}^2 = 1, w_{2,4}^2 = 0, w_{2,5}^2 = 0, w_{2,6}^2 = 0, w_{2,7}^2 = 0, w_{2,8}^2 = 0, \]
\[ w_{3,1}^3 = 0, w_{3,2}^3 = 0, w_{3,3}^3 = 0, w_{3,4}^3 = 1, w_{3,5}^3 = 0, w_{3,6}^3 = 0, w_{3,7}^3 = 0, w_{3,8}^3 = 0, \]
\[ w_{4,1}^4 = 0, w_{4,2}^4 = 0, w_{4,3}^4 = 0, w_{4,4}^4 = 0, w_{4,5}^4 = 1, w_{4,6}^4 = 0, w_{4,7}^4 = 0, w_{4,8}^4 = 0, \]
\[ w_{5,1}^5 = 0, w_{5,2}^5 = 0, w_{5,3}^5 = 0, w_{5,4}^5 = 0, w_{5,5}^5 = 1, w_{5,6}^5 = 0, w_{5,7}^5 = 1, w_{5,8}^5 = 1. \]

Note that this NN NFA starts from initial state \( Q_{act}(0) = \{ q_0 \} \), which is encoded as \(< 1.0,0,0,0,0>\).

### 5.6 Summary and Discussion

In conventional computer systems, computer programs for real world applications are usually large and complex. They are typically built from a set of pre-defined modules (or objects/classes in object-oriented paradigms). Such modules allow code reuse and rapid implementation with fewer errors. We advocate a similar approach to the construction of complex neural networks. In this chapter, we have constructed a given DFA using an RNN which in turn is assembled from BMP modules and recurrent links; a stack using an RNN which is assembled from BMP modules, recurrent links and a write control module; a given DPDA using an RNN which is assembled from BMP modules, an NN Stack and recurrent links; and a given NFA using an RNN which is assembled from basic neural assemblies that realize logical AND and OR operations on Boolean variables. The proposed NN NFA demonstrates the potential benefits of ANN in the design of high performance systems for parallel symbolic computing applications. Our other attempts for more complex symbolic processing includes neural net-
works designed respectively for simple database query processing (see Chapter 3) and syntax analysis (see Chapter 6). We expect a similar approach to be applicable in the construction of neural networks for a variety of important applications.
6 NEURAL ARCHITECTURES FOR SYNTAX ANALYSIS

6.1 Introduction

This chapter explores the synthesis of neural architectures for syntax analysis using pre-specified grammars — a prototypical symbol processing task with applications in interactive programming environments (using interpreted languages such as LISP and JAVA), analysis of symbolic expressions (e.g., in real-time knowledge based systems and database query processing), and high-performance compilers. This chapter does not address machine learning of unknown grammars (which finds applications in tasks such as natural language acquisition).

A more general goal of this chapter is to explore the design of massively parallel architectures for symbol processing using the neural associative memories proposed in Chapter 2 as key components. Pattern-directed associative inference is an essential part of most AI systems [54, 97, 181] and dominates the computational requirements of many AI applications [55, 97, 127].

The proposed high performance neural architectures for syntax analysis are systematically (and provably correctly) synthesized through composition of the necessary symbolic functions using a set of component symbolic functions each of which is realized using a neural associative processor (memory). It takes advantage of massively parallel pattern matching and retrieval capabilities of neural associative processors (memories) to speed up syntax analysis for real-time applications. The rest of the chapter is organized as follows:

• The remainder of Section 6.1 reviews related research on neural architectures for syntax analysis.
• Sections 6.2 and 6.3 respectively develop modular neural network architectures for lexical analysis and parsing.

• Section 6.4 compares the estimated performance of the proposed neural architectures for syntax analysis (based on current CMOS VLSI technology) with that of commonly used approaches to syntax analysis in conventional computer systems that rely on inherently sequential index or matrix structure for pattern matching.

• Section 6.5 concludes with a summary and discussion.

6.1.1 Review of related research on neural architectures for syntax analysis

To the best of our knowledge, to date, most of the research on neural architectures for syntax analysis has focused on the investigation of neural networks that are designed to learn to parse particular classes of syntactic structures (e.g., strings from deterministic context-free languages (DCFL) or natural language sentences constructed using limited vocabulary). Notable exceptions are: connectionist realizations of Turing Machines (wherein a stack is simulated using binary representation of a fractional number) [143, 169]; a few neural architectures designed for parsing based on a known grammar [34, 164]; and neural network realizations of finite state automata [20, 134]. Nevertheless, it is informative to examine the various proposals for neural architectures for syntax analysis (regardless of whether the grammar is preprogrammed or learned). The remainder of this subsection explores some of the proposed architectures in the literature for syntax analysis in terms of how each of them addresses the key subtasks of syntax analysis.

[34] proposes a neural network to parse input strings of fixed maximum length for known context-free grammars (CFG). The whole input string is presented at one time to the neural parser which is a layered network of logical AND and OR nodes with connections set by an algorithm based on CYK algorithm [74].

PARSEC [80] is a modular neural parser consisting of six neural network modules. It transforms a semantically rich and therefore fairly complex English sentence into three output representations produced by its respective output modules. The three output modules are
role labeler which associates case-role labels with each phrase block in each clause, interclause labeler which indicates subordinate and relative clause relationships, and mood labeler which indicates the overall sentence mood (declarative or interrogative). Each neural module is trained individually by a variation of Backpropagation algorithm. The input is a sequence of syntactically as well as semantically tagged words in the form of binary vectors and is sequentially presented to PARSEC, one word at a time. PARSEC exploits generalization as well as noise tolerance capabilities of neural networks to reportedly attain 78% correct labeling on a test set of 117 sentences when trained with a training set of 240 sentences. Both the test and training sets were based on conference registration dialogs from a vocabulary of about 400 words.

SPEC [116] is a modular neural parser which parses variable-length sentences with embedded clauses and produces case-role representations as output. SPEC consists of a parser which is a simple recurrent network, a stack which is realized using a recursive auto-associative memory (RAAM) [144], and a segmenter which controls the push/pop operations of the stack using a 2-layer Perceptron.

RAAM has been used by several researchers to implement stacks in connectionist designs for parsers [13, 66, 116]. A RAAM is a 2-layer Perceptron with recurrent links from hidden neurons to part of input neurons and from part of output neurons to hidden neurons. The performance of a RAAM stack is known to degrade substantially with increase in depth of the stack, and the number of hidden neurons needed for encoding a stack of a given depth has to be determined through a process of trial and error [116]. A RAAM stack has to be trained for each application. Other drawbacks associated with the use of RAAM as a stack are discussed in [174].

Each module of SPEC is trained individually using Backpropagation algorithm to approximate a mapping function as follows: Let $Q$ be a finite non-empty set of states, $\Gamma$ a finite non-empty input alphabet, $\mathcal{V}_{CRV}$ a finite non-empty set of case-role vectors, $A = \{\text{output, push, pop}\}$ the set of stack actions, and $V_{\text{stack}}$ the set of compressed stack representations at the hidden layer of a RAAM. Then the first and second connection layers of the parser ap-
proximate the transition function of a DFA (see Section 5) \( f_{P_1} : \Gamma \times Q \rightarrow Q \) and a symbolic mapping function \( f_{P_2} : Q \rightarrow V_{CRV} \) respectively; the segmenter approximates a symbolic function \( f_S : \Gamma \times Q \rightarrow Q \times A \); and the first and second connection layers of the RAAM approximate the push function \( f_{push} : V_{stack} \times Q \rightarrow V_{stack} \) and the pop function \( f_{pop} : V_{stack} \rightarrow V_{stack} \times Q \) of a RAAM stack respectively. The input string is sequentially presented to SPEC and is a sequence of syntactically untagged English words represented as fixed-length distributive vectors of gray-scale values between 0 and 1. The emphasis of SPEC was on exploring the generalization as well as noise tolerance capabilities of a neural parser. SPEC uses implicit central control to integrate its different modules and reportedly achieves 100% generalization performance on a whole test set of 98100 English relative clause sentences with up to 4 clauses. Since the words (terminals) in the CFG which generates the test sentences are not pre-translated by a lexical analyzer into syntactically tagged tokens, the number of production rules and terminals tend to increase linearly with the size of the vocabulary in the CFG. Augmenting SPEC with a lexical analyzer offers a way around this problem.

[27, 174, 197] propose higher-order recurrent neural network equipped with an external stack to learn to recognize deterministic CFG, i.e., to learn to simulate a deterministic pushdown automata (DPDA). [27, 174] use an analog network coupled with a continuous stack and use a variant of a real time recurrent network learning algorithm to train the network. [197] uses a discrete network coupled with a discrete stack and employs a pseudo-gradient learning method to train the network. The input to the network is a sequentially presented, unary-coded string of variable length. Let \( Q \) be a finite non-empty set of states, \( \Gamma \) a finite non-empty input alphabet, \( \Lambda \) a finite non-empty stack alphabet, \( A = \{\text{push}, \text{pop}, \text{no-operation}\} \) the set of stack actions, and Boolean the set \{false, true\}. These recurrent neural networks approximate the transition function of a DPDA, i.e., \( f_{DPDA} : Q \times \Gamma \times \Lambda \rightarrow Q \times \Lambda \times A \). The networks are trained to approximate a language recognizer function \( f_L : \Gamma^* \rightarrow \text{Boolean} \). Strings generated from CFG including balanced parenthesis grammar, \( a^n b^n, a^m b^m c^n, a^n b^n c b^m a^m \), postfix grammar, and/or palindrome grammar were used to evaluate the generalization performance of the proposed networks.
The proposed neural architecture for syntax analysis is composed of neural network modules for stack, lexical analysis, parsing, and parse tree construction. It differs from most of the neural network realizations of parsers in that it is systematically assembled using neural associative processors (memories) as primary building blocks. It is able to exploit massively parallel content-based pattern matching and retrieval capabilities of neural associative processors (memories). This offers an opportunity to explore the potential benefits of massively parallel pattern matching in the design of high performance computing systems for real time symbol processing applications.

6.2 Neural Network Design for a Lexical Analyzer (NNLexAn)

A lexical analyzer is defined by a recursive symbolic function $f_{LexAn} : \Gamma^* \$ \rightarrow \Delta^* \$. $\Gamma$ is the input alphabet, $\$ $ is a special symbol denoting "end of input", and $\Delta$ is the set of lexical tokens. $\Gamma^* \$ $ (or $\Delta^* \$) denotes the set of strings obtained by adding the suffix $\$ $ to each of the strings over the alphabet $\Gamma$ (or $\Delta$). The conventional approach to implementing a lexical analyzer using a DFA (in particular, a Mealy machine) can be realized quite simply using an NN DFA [20]. However, a major drawback of this approach is that all legal transitions have to be exhaustively specified in the DFA. For example, Figure 6.1 shows a simplified state diagram without all legal transitions specified for a lexical analyzer which recognizes keywords of a programming language: begin, end, if, then, and else.

Suppose the lexical analyzer is in a state that corresponds to the end of a keyword. Then its current state would be state 7, 11, 15, 18, or 23. If the next input character is $b$, there should be legal transitions defined from those states to state 2. That is the same for states 8, 16, and 19 in order to handle the next input characters $e$, $i$, and $t$. Thus, this extremely simple lexical analyzer with 22 explicitly defined legal transitions has 20 unspecified transitions. The realization of such a simple 5-word (23-state) lexical analyzer by an NN DFA requires $20 + 22 = 42$ hidden neurons. Additional transitions have to be defined in order to allow multiple blanks between two consecutive words in the input stream, and for error handling. These drawbacks are further exacerbated in applications involving languages with large vocabularies.
A better alternative is to use a Dictionary (or a database) to serve as a lexicon. The proposed design for NNLexAn consists of a word segmenter for carving an input stream of characters into a stream of words, and a word lookup table for translating the carved words of variable length into syntactically tagged tokens of fixed length. The syntactically tagged tokens of fixed length are to be used as single logical units in parsing. Such a translation can be realized by a simple query to a database using a key. Such database query processing can be efficiently implemented using neural associative memories (see Chapter 2).

6.2.1 Neural network design for a word segmenter (NNSeg)

In program translation, the primary function of a word segmenter is to identify illegal words and to group input stream into legal words including keywords, identifiers, constants, operators, and punctuation symbols. A word segmenter can be defined by a recursive symbolic function $\tilde{f}_{\text{WordSeg}} : \Gamma^*\$ \rightarrow \Lambda^*\$, where $\Gamma$ is the input alphabet, $\$ is a special symbol denoting "end of input", and $\Lambda$ is the set of legal words. $\Gamma^*\$ (or $\Lambda^*\$) denotes the set of strings obtained by adding the suffix $\$ to each of the strings over the alphabet $\Gamma$ (or $\Lambda$).

Figure 6.2 shows the state diagram of a DFA simulating a simple word segmenter which carves continuous input stream of characters into integer constants, keywords, and identifiers.
Both the keywords and identifiers are defined as strings of English characters. For simplicity, the handling of end-of-input is not shown in the figure. The word segmenter terminates processing upon encountering the end-of-input symbol $. Each time when the word segmenter goes into an accepting state, it instructs the word lookup table to look up a word that has been extracted from the input stream and stored in a buffer.

Since syntax error handling is not discussed here, it may be assumed that any illegal word is discarded by the word segmenter and is also discarded from the buffer which temporarily stores the illegal word being extracted from the input stream. Such a word segmenter can also be realized by an NN DFA. Since any undefined (un-implemented) transition moves into a binary-coded state of all zeros automatically in an NN DFA, it would be expedient to encode the garbage state (state G in Figure 6.2) using a string of all zeros. Although the most straightforward implementation of NN DFA (see Section 5) uses one hidden neuron per transition, one can do better. In Figure 6.2 the 10 transitions from state 4 on ASCII-coded input symbols 0,1,...,9 can be realized by only two hidden neurons in an NN DFA using partial pat-
tem recognition (see Chapters 2 and 3). Other transitions on input symbols 0,1,...,9, a,b,...,z, and A,B,...,Z can be handled in a similar fashion.

6.2.2 Neural network design for a word lookup table (NNLTab)

During lexical analysis in program compilation or similar applications, each word of variable length (extracted by the word segmenter) is translated into a token of fixed length. Each such token is treated as a single logical entity: an identifier, a keyword, a constant, an operator or a punctuation symbol. Such a translation can be defined by a simple symbolic function \( f_{\text{WordTrans}} : \Lambda \cup \{\$\} \rightarrow \Delta \cup \{\$\} \). Here, \( \Lambda \), $, and \( \Delta \) denote the same entities as in the definitions of \( f_{\text{WordSeg}} \) and \( f_{\text{LexAn}} \) above. Note that \( f_{\text{WordTrans}} \) can be realized by a BMP module. In other lexical analysis applications, a word may be translated into a token having two sub-parts: category code denoting the syntactic category of a word, and feature code denoting the syntactic features of a word.

Conventional approach to doing such translation (dictionary lookup) is to perform a simple query on a suitably organized database (with the segmented word being used as the key). This content-based pattern matching and retrieval process can be efficiently and effectively realized by neural associative memories. Database query processing using neural associative memories is discussed in detail in Chapter 3 and is summarized briefly in what follows. Each word and its corresponding token are stored as an association pair in a neural associative memory. Each such association is implemented by a hidden neuron and its associated connections. A query is processed in two steps: *identification* and *recall*. During the identification step, a given word is compared to all stored words in parallel by the hidden neurons and their associated 1st-layer connections in the memory. Once a match is found, one of the hidden neurons is activated to recall the corresponding token using the 2nd-layer connections associated with the activated hidden neuron. The time required for processing such a query is of the order of 20 ns (at best) to 100 ns (at worst) given the current CMOS technology for implementation of artificial neural networks (see Section 3.2.1.1).
6.3 A Modular Neural Architecture for LR Parser (NNLR Parser)

LR(k) grammars generate the so-called deterministic context-free languages which can be accepted by deterministic push-down automata [74]. Such grammars find extensive applications in programming languages and compilers. LR parsing is a linear time table-driven algorithm which is widely used for syntax analysis of computer programs [2, 19, 170]. This algorithm involves extensive pattern matching which suggests the consideration of a neural network implementation using associative memories. This section proposes a modular neural network architecture for parsing LR(1) grammars. LR(k) parsers scan input from left to right and produce a rightmost derivation tree by using lookahead of \( k \) unscanned input symbols. Since any LR(k) grammar for \( k \geq 1 \) can be transformed into an LR(1) grammar [170], LR(1) parsers are sufficient for practical applications [74].

An LR(1) grammar can be defined as \( G_{LR(1)} = (V, T, \Sigma, \Theta) \) [74], where \( V \) and \( T \) are finite sets of variables (nonterminals) and terminals respectively, \( \Sigma \) is a finite set of production rules, and \( \Theta \in V \) is a special variable called the start symbol. \( V \) and \( T \) are disjoint. Each production rule is of the form \( A \rightarrow \alpha \), where \( A \in V \) and \( \alpha \in (V \cup T)^* \). An LR(1) parser can be defined by a recursive symbolic function \( \hat{f}_{LRParser} : \Delta^*S \rightarrow \Sigma^* \), where \( \Delta \) (\( \Delta = T \) in the context), \( S \), and \( \Delta^* \) are as in \( \hat{f}_{LexAn} \), and \( \Sigma^* \) denotes the set of all sequences of production rules over the rule alphabet \( \Sigma \). Although \( \hat{f}_{LRParser} \) corresponds in form to the recursive symbolic function \( \hat{f}_{LexAn} \) in Section 6, it can not be realized simply by a Mealy machine which implements \( \hat{f}_{LexAn} \). This is due to the fact that the one-to-one mapping relationship between every input symbol of the input string and the output symbol of the output string at corresponding position in a Mealy machine does not hold for \( \hat{f}_{LRParser} \). A stack is required to store intermediate results of the parsing process in order to realize an LR(1) parser which is characterized by \( \hat{f}_{LRParser} \).

6.3.1 Representation of parse table

Logically, an LR parser consists of two parts: a driver routine which is the same for all LR parsers and a parse table which is grammar-dependent [2]. LR parsing algorithm pre-compiles an LR grammar into a parse table which is referred by the driver routine for deterministically
parsing input string of lexical tokens by \textit{shift/reduce} moves \cite{2,19}. Such a parsing mechanism can be simulated by a DPDA (deterministic pushdown automata) with \(\epsilon\)-moves \cite{74}. An \(\epsilon\)-move does not consume the input symbol, and the input head is not advanced after the move. This enables a DPDA to manipulate a stack without reading input symbols. The neural network architecture for DPDA (NN DPDA) proposed in Section 5, augmented with an NN Stack (see Section 5), is able to parse DCFL. However, the proposed NN DPDA architecture cannot efficiently handle \(\epsilon\)-moves because of the need to check for the possibility of an \(\epsilon\)-move at every state. Therefore, a modified design for LR(1) parsing is discussed below.

Parse table and stack are two main components of an LR(1) parser. The access of parse table can be defined by the symbolic function \(f_{\text{ParseTable}} : Q \times (\Delta \cup V \cup \{\$\}) \to A \times Q \cup \{*\} \times T \cup \{*\} \times N \cup \{*\} \times V \cup \{*\} \times Z\) in terms of binary mapping. Here, \(Q\) is the finite set of states; \(\Delta, V, \$\), and \(T\) have the same meaning as in the definition of \(G_{LR(1)}\) and \(\hat{J}_{LRParser}\) given above; \(A = \{\text{shift, reduce}\}\) is the set of parsing actions; \(*\) denotes a don't care; \(N\) is the set of natural numbers; and \(Z = \{\text{error, in-progress, accept}\}\) is the set of possible parsing status values.

A parse table can be realized using a BMP module as described in Section 2.2.5 in terms of binary mapping. The next move of the parsing automaton is determined by current input symbol \(a\) and the state \(q\) that is stored at the top of the stack. It is given by the parse table entry corresponding to \([q,a]\). Each such 2-dimensional parse table entry \(\text{action}[q,a]\) is implemented as a 6-tuple binary code \(<\text{action, state, rule, length, lhs, status}>\) in the BMP module for parse table where

- \textit{action} is a 2-bit binary code denoting one of two possible actions, 01 (\textit{shift}) or 10 (\textit{reduce});
- \textit{state} is an \(S\)-bit binary number denoting "\textit{the next state}";
- \textit{rule} is an \(R\)-bit binary number denoting the grammar production rule \(r\) to be applied if the consulted \textit{action} is a \textit{reduce};
- **length** is an $L$-bit binary number denoting the length of the right hand side of the grammar production rule $r$ to be applied if the consulted *action* is a *reduce*;

- **lhs** is an $H$-bit binary code encoding the grammar nonterminal symbol at the left hand side of the grammar production rule $r$ to be applied if the consulted *action* is a *reduce* and

- **status** is a 2-bit binary code denoting one of three possible *parsing status* codes, 00: *error*, 01: *in progress*, or 10: *accept* (used by higher-level control to acknowledge the success or failure of a parsing).

Note that the order of the tuple's elements arranged in Figure 6.3 is different from above. A canonical LR(1) parse table is relatively large and would typically have several thousand states for a programming language like C. SLR(1) and LALR(1) tables, which are far smaller than LR(1) table, typically have several hundreds of states for the same size of language, and they always have the same number of states for a given grammar [3] (The differences among LR, SLR, and LALR parsers are discussed in [19]). The number of states in the parse table of LALR(1) parsers for most programming languages is between about 200 and 450, and the number of symbols (lexical tokens) is around 50 [19], i.e., the number of table entries is between about 10000 and 22500.

Typically a parse table is realized as a 2-dimensional array in current computer systems. Memory is allocated for every entry of the parse table, and the access of an entry is via its offset in the memory, which is computed efficiently by the size of the fixed memory space for each entry and the indices of an entry in the array. However, it is much more natural to retrieve an entry in a table using content-based pattern matching on the indices of the entry. As described in Section 2.2.5, a BMP module can effectively and efficiently realize such content-based table lookup.

LR grammars used in practical applications typically produce parse tables with between 80% and 95% undefined error entries [19]. The size of the table is reduced by using lists which can result in a significant performance penalty. The use of a BMP for such table lookup help
Figure 6.3 The proposed neural network architecture for LR(1) parser
overcome this problem since undefined mappings are naturally realized by a BMP module without the need for extra space and without incurring any performance penalties. Thus, LALR(1) parsing (which is generally the technique of choice for parsing computer programs) table can be realized using at most about $22500 \times 20\% = 4500$ hidden neurons.

6.3.2 Representation of parsing moves and parse trees

A configuration of an LR parser is an ordered pair whose first component corresponds to the stack contents and whose second component is the part of the input that remains to be parsed. A configuration can be denoted by $(q_0 q_1 \cdots q_i, a_j a_{j+1} \cdots a_n \$, where $q_i$ is the state on top of the stack, $q_0$ is the stack bottom symbol, $a_j$ is current input symbol, and \$ is a special symbol denoting "end of input". The initial configuration is $(q_0, a_1 a_2 \cdots a_n \$). Let $0^k$ be a $k$-bit binary number (code) of all zeros denoting a value of don't care for $k \geq 1$. In the proposed NNLR Parser, the configurations resulting from one of four types of moves on parsing an input lexical token are as follows:

- If $\text{action}[q_i, a_j] = 01, q, 0^R, 0^L, 0^H, 0^1$, the parser performs a shift move and enters the configuration $(q_0 q_1 \cdots q_i q, a_j a_{j+1} \cdots a_n \$). Such a shift move is realized in one operation cycle in the proposed NNLR Parser.

- If $\text{action}[q_i, a_j] = 10, 0^S, r, l, h, 01$, the parser performs a reduce by producing a binary number $r$ (which denotes a grammar production rule $A \rightarrow \beta$ being applied, where the grammar nonterminal $A$ is denoted by the binary code $h$, and $l$ is the number of non-empty grammar symbols in $\beta$) as part of the parse tree, popping $l$ symbols off the stack, consulting parse table entry $[q_{i-l}, h]$ and entering the configuration $(q_0 q_1 \cdots q_{i-l} q, a_j \cdots a_n \$) where $\text{action}[q_{i-l}, h] = 01, q, 0^R, 0^L, 0^H, 01$. Such a reduce move is realized in two operation cycles in the proposed NNLR Parser since the parse table is consulted twice for simulating the move.

- If $\text{action}[q_i, a_j] = 0^2, 0^S, 0^R, 0^L, 0^H, 10$, parsing is completed.
• If \( \text{action}[q_i, a_j] = < 0^2, 0^5, 0^R, 0^L, 0^H, 00 > \), an error is discovered and the parser stops. Note that such an entry is a binary code of all zeros. (We do not discuss error handling any further in this chapter).

An LR parser scans input string from left to right and performs bottom-up parsing which results in a rightmost derivation tree in reverse. Thus, a stack can be used to store the parse tree (derivation tree) which is a sequence of grammar production rules (in reverse order) applied in the derivation of the scanned input string. The rule on top of the final stack which stores a successfully parsed derivation tree is a grammar production rule with the start symbol of an LR grammar at its left hand side. Note that each rule is represented by an \( R \)-bit binary number and the mapping from a binary-coded rule to the rule itself can be realized by a BMP module.

6.3.3 Architecture of NNLR parser

Figure 6.3 shows the architecture of a modular neural network design for an LR(1) parser which takes advantage of the efficient shift/reduce technique. The NNLR Parser uses an optional queue handler module and an NN stack which stores the parse tree (derivation tree). The queue handler stores lexical tokens extracted by the NN lexical analyzer described in Section 6 and facilitates the operation of lexical analyzer and parser in parallel. To extract the binary-coded grammar production rules in derivation order sequentially out of the NN stack which stores parse tree, the next processing unit connected to the NN stack sends binary-coded stack pop actions to the stack in an appropriate order.

6.3.3.1 Modules of NNLR Parser

The proposed NNLR Parser consists of a BMP module implementing the parse table, an NN shift/reduce stack storing states during shift/reduce simulation, a buffer (\text{state}(t)) storing the current state (from the top of the NN shift/reduce stack), and a buffer (\text{input}(t)) storing either current input lexical token or a grammar nonterminal symbol produced by last consulted parsing action which is a reduce. When the last consulted parsing action is a reduce
encoded as 10; the grammar production rule to be reduced is pushed onto the stack for parse
tree, the transmission of $\text{input}(t)$ is from the latched buffer $\text{lhs}$, and the input from the queue
mechanism is inhibited by the leftmost bit of the binary-coded reduce action. When the last
consulted parsing action is a shift encoded as 01, the transmission of $\text{input}(t)$ is from the
queue mechanism and the input from the latched buffer $\text{lhs}$ is inhibited by the rightmost bit
of the binary-coded shift action.

Parsing is initiated by reset signals to the NN shift/reduce stack and the NN stack storing
parse tree. The signals reset the SPs of these two stacks to stack bottom and hence $\text{state}(t)$
is reset to initial state. To avoid clutter, the reset signal lines are not shown in Figure 6.3. The
current state buffer $\text{state}(t)$ and the current input buffer $\text{input}(t)$ need to be synchronized
but the necessary synchronization circuit is omitted from Figure 6.3.

The operations of an LR parser can be viewed in terms of a sequence of transitions from an
initial configuration to a final configuration. The transition from one configuration to another
can be divided into two steps: the first involves consulting the parse table for next action using
current input symbol and current state on top of the stack; the second step involves execution
of the action — either a shift or a reduce — as specified by the parse table. In the NNLR
Parser, the first step is realized by a BMP module which implements the parse table lookup;
and the second step is executed by a combination of an NN shift/reduce stack which stores
states, and an NN stack which stores the parse tree (and the BMP module when the next
action is a reduce).

6.3.3.2 Complexity of the BMP module for parse table

Let $M$ be the number of defined action entries in the parse table. All grammar symbols are
encoded into $H$-bit binary codes. The BMP module for parse table uses $S + H$ input neurons,
$M$ hidden neurons, and $4 + S + R + L + H$ output neurons. Note that the BMP module
produces a binary output of all zeros, denoting a parsing error (see previous description of
parsing status code in an action entry of the parse table), for any undefined action entry in
the parse table. The $R$-bit binary-coded grammar production rule is used as the stack symbol
for the NN stack which stores the parse tree.

6.3.3.3 Complexity of the NN Stack for parse tree

Assume the pointer control module of the NN stack for parse tree use $m_p$ bits to encode its SP values. Then the pointer control module of the NN stack for parse tree uses $m_p + 2$ input neurons, $3 \times 2^{m_p}$ hidden neurons, and $m_p$ output neurons. The stack memory module uses $m_p$ input neurons, $2^{m_p}$ hidden neurons, and $R$ output neurons. The write control module receives $m_p + 1$ binary inputs (the stack pointer + push/pop signal) and $R$ binary inputs (the grammar production rule).

6.3.3.4 Complexity of the Shift/Reduce NN Stack

To efficiently implement the reduce action in LR parsing, the NN shift/reduce stack can be slightly modified from the NN stack described in Section 5 to allow multiple stack pops in one operation cycle of the NNLR Parser. The number of pops is coded as an $L$-bit binary number and equals the number of non-empty grammar symbols at the right hand side of the grammar production rule being reduced. It is used as input to the pointer control module and write control module in the NN shift/reduce stack. Thus, each of the modules use $L$ additional input neurons in the NN shift/reduce stack as compared to the NN stack proposed in Section 5. The output from the NN parse table, namely, the $S$-bit binary code for state, is used as the stack symbol to the NN shift/reduce stack. Let the maximum number of non-empty grammar symbols that appear in the right hand side of a production rule in the LR grammar being parsed be $L_m$. Then $k$ multiple pops are implemented in the NN shift/reduce stack in a manner similar to a single pop in the NN stack proposed in Section 5 except that the SP value is decreased by $k$ instead of 1, $1 \leq k \leq L_m$. Hence for each SP value, $L_m - 1$ additional hidden neurons are required to allow multiple pops in the pointer control module.
6.3.4 NNLR Parser in action

This subsection illustrates the operation of the proposed NNLR Parser for a given LR(1) grammar. The example of LR(1) grammar ($G_1$) used here is taken from [3]. The BNF (Backus-Naur Form) description of the grammar $G_1$ is as follows:

$$
\text{expression} \rightarrow \text{expression} + \text{term} \mid \text{term} \\
\text{term} \rightarrow \text{term} \times \text{factor} \mid \text{factor} \\
\text{factor} \rightarrow ( \text{expression} ) \mid \text{identifier}
$$

Using $E$, $T$, $F$, and $I$ to denote expression, term, factor, and identifier (respectively), these rules can be rewritten in the form of production rules $p_1$ through $p_6$:

- Production rule $p_1 : E \rightarrow E + T$
- Production rule $p_2 : E \rightarrow T$
- Production rule $p_3 : T \rightarrow T \times F$
- Production rule $p_4 : T \rightarrow F$
- Production rule $p_5 : F \rightarrow (E)$
- Production rule $p_6 : F \rightarrow I$

Then $\{I, +, \times, (, )\}$ is the set of terminals (i.e. the set of possible lexical tokens from the lexical analyzer), $\{E, T, F\}$ is the set of nonterminals, $\{p_1, p_2, p_3, p_4, p_5, p_6\}$ is the set of production rules, and $E$ is the start symbol of the grammar $G_1$.

The operation of the parser is shown in terms of symbolic codes (instead of the binary codes used by the NN implementation) to make it easy to understand. Note however that the transformation of symbolic codes into binary form used by NNLR Parser is rather straightforward and has been explained in the preceding sections.

Let $s$ and $r$ denote the parsing actions shift and reduce and $a$, $e$, and $i$ the parsing statuses accept, error, and in-progress respectively. The parse table of the LR(1) parser (more specifically, SLR(1) parser) for grammar $G_1$ is shown in Table 6.1.

The implementation and operations of the NN shift/reduce stack and the NN stack for parse tree follow the discussion and examples in Section 5 and they are not discussed here.
Table 6.1 The parse table of the LR(1) parser for grammar $G_1$

<table>
<thead>
<tr>
<th>State</th>
<th>$+$</th>
<th>$-$</th>
<th>$( )$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$q_0$</td>
<td>$(s,q_5,<em>,</em>,*,i)$</td>
<td>$(s,q_6,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_1$</td>
<td>$(r,*,p_2,1,E,i)$</td>
<td>$(s,q_7,<em>,</em>,*,i)$</td>
<td>$(r,*,p_2,1,E,i)$</td>
</tr>
<tr>
<td>$q_2$</td>
<td>$(r,*,p_4,1,T,i)$</td>
<td>$(r,*,p_4,1,T,i)$</td>
<td>$(r,<em>,p_4,1,</em>,i)$</td>
</tr>
<tr>
<td>$q_3$</td>
<td>$(s,q_5,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_4$</td>
<td>$(r,*,p_3,1,F,i)$</td>
<td>$(r,*,p_3,1,F,i)$</td>
<td>$(r,*,p_6,1,F,i)$</td>
</tr>
<tr>
<td>$q_5$</td>
<td>$(s,q_9,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_6$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
</tr>
<tr>
<td>$q_7$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
</tr>
<tr>
<td>$q_8$</td>
<td>$(s,q_6,<em>,</em>,*,i)$</td>
<td>$(s,q_1,<em>,</em>,*,i)$</td>
<td>$(s,q_1,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_9$</td>
<td>$(r,*,p_1,3,E,i)$</td>
<td>$(s,q_7,<em>,</em>,*,i)$</td>
<td>$(r,*,p_1,3,E,i)$</td>
</tr>
<tr>
<td>$q_{10}$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
</tr>
<tr>
<td>$q_{11}$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>State</th>
<th>$$</th>
<th>$E$</th>
<th>$T$</th>
<th>$F$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$q_0$</td>
<td>$(s,q_1,<em>,</em>,*,i)$</td>
<td>$(s,q_2,<em>,</em>,*,i)$</td>
<td>$(s,q_3,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_1$</td>
<td>$(s,q_6,<em>,</em>,*,i)$</td>
<td>$(s,q_2,<em>,</em>,*,i)$</td>
<td>$(s,q_3,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_2$</td>
<td>$(r,*,p_2,1,E,i)$</td>
<td>$(s,q_7,<em>,</em>,*,i)$</td>
<td>$(s,q_3,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_3$</td>
<td>$(r,*,p_4,1,T,i)$</td>
<td>$(s,q_8,<em>,</em>,*,i)$</td>
<td>$(s,q_3,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_4$</td>
<td>$(r,*,p_3,1,F,i)$</td>
<td>$(s,q_9,<em>,</em>,*,i)$</td>
<td>$(s,q_3,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_5$</td>
<td>$(r,*,p_6,1,F,i)$</td>
<td>$(s,q_{10},<em>,</em>,*,i)$</td>
<td>$(s,q_3,<em>,</em>,*,i)$</td>
<td>$(s,q_4,<em>,</em>,*,i)$</td>
</tr>
<tr>
<td>$q_6$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
</tr>
<tr>
<td>$q_7$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
</tr>
<tr>
<td>$q_8$</td>
<td>$(r,*,p_1,3,E,i)$</td>
<td>$(r,*,p_1,3,E,i)$</td>
<td>$(r,*,p_1,3,E,i)$</td>
<td>$(r,*,p_1,3,E,i)$</td>
</tr>
<tr>
<td>$q_9$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
<td>$(r,*,p_3,3,T,i)$</td>
</tr>
<tr>
<td>$q_{10}$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
<td>$(r,*,p_5,3,F,i)$</td>
</tr>
</tbody>
</table>
The parse table can be represented by a binary mapping which in turn can be easily realized by a BMP module (see Section 2.2.5 for details). Following the notation introduced in Section 6 for the NN realization of the parse table, we have: $M = 45$ since there are 45 defined entries in the parse table; $S = \lceil \log_2 12 \rceil = 4$ since there are 12 states; $H = \lceil \log_2 (8 + 2) \rceil = 4$ since there are 8 grammar symbols plus a null symbol $\epsilon$ and an additional end-of-string symbol $\$$; $R = \lceil \log_2 6 \rceil = 3$ since there are 6 production rules; and $L = \lceil \log_2 3 \rceil = 2$ since the maximum number of non-empty grammar symbols that appear in the right hand side of a production rule in the LR grammar $G_1$ is 3. Therefore, the BMP for parse table of $G_1$ has $S + H = 8$ input, 45 hidden, and $4 + S + R + L + H = 17$ output neurons.

Assume every identifier $I$ is translated from a string of lower case English characters. The lexical analyzer $L_1$ which translates input strings of $+, \times, (, ), $$, \text{blank}$, and lower case English characters into strings of lexical tokens can be realized by an NN DFA. Figure 6.4 shows the state diagram of the DFA $M_{L_1}$ for $L_1$. Note that additional machinery needed for error handling is not included in the DFA; and when the DFA sees a $\$$, it stops the processing of the input string and appends a $\$$ at the end of the output string.

The transition function $\delta_{L_1}$ of the DFA $M_{L_1}$ is shown in Table 6.2. This function can be expressed as a binary mapping which in turn can be easily realized by a BMP module (see Section 2.2.5 for details). In the NN DFA, BMP module 1 realizes the transition function $\delta_{L_1}: Q \times \Gamma \to Q$ and BMP module 2 realizes a translation function $\lambda' : Q \to \Delta$ s.t. $\lambda'(q_2) = I$, $\lambda'(q_4) = +$, $\lambda'(q_6) = \times$, $\lambda'(q_8) = (, \lambda'(q_{10}) = )$, and $\lambda'(q) = \epsilon$ (null symbol, which is discarded) for other $q \in Q$, where $Q = \{q_0, q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8, q_9, q_{10}\}$ is the set of states, $\Gamma = \{a, b, \ldots, z, +, \times, (, ) ,\$$, \text{blank}\}$ is the input alphabet, and $\Delta = \{I, +, \times, (, )\}$ is the output alphabet (i.e., the set of lexical tokens). The symbolic functions $\delta_{L_1}$ and $\lambda'$ can be expressed as binary mappings which in turn can be realized by BMP modules (see Section 2.2.5 for details).

Let us now consider the operation of the LR(1) parser when it is presented with the input string $aa \times bb + cc$. This string is first translated by the lexical analyzer $L_1$ into a string of lexical tokens $1x1+1$ which is then provided to the LR(1) parser. This translation is quite straightforward, given the state diagram and transition function $\delta_{L_1}$ (Table 6.2) of $M_{L_1}$ and...
Figure 6.4  The state diagram of the DFA $M_{L_1}$ for the lexical analyzer $L_1$

Table 6.2  The transition function $\delta_{L_1}$ of the DFA $M_{L_1}$

<table>
<thead>
<tr>
<th>State</th>
<th>$a, b, ..., z$</th>
<th>$+$</th>
<th>$\times$</th>
<th>( )</th>
<th>blank</th>
<th>$$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$q_0$</td>
<td>$q_1$</td>
<td>$q_3$</td>
<td>$q_5$</td>
<td>$q_7$</td>
<td>$q_9$</td>
<td>$q_0$</td>
</tr>
<tr>
<td>$q_1$</td>
<td>$q_1$</td>
<td>$q_3$</td>
<td>$q_5$</td>
<td>$q_7$</td>
<td>$q_9$</td>
<td>$q_0$</td>
</tr>
<tr>
<td>$q_2$</td>
<td>$q_1$</td>
<td>$q_3$</td>
<td>$q_5$</td>
<td>$q_7$</td>
<td>$q_9$</td>
<td>$q_0$</td>
</tr>
<tr>
<td>$q_3$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>$q_4$</td>
</tr>
<tr>
<td>$q_4$</td>
<td>$q_1$</td>
<td>$q_3$</td>
<td>$q_5$</td>
<td>$q_7$</td>
<td>$q_9$</td>
<td>$q_0$</td>
</tr>
<tr>
<td>$q_5$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>$q_6$</td>
</tr>
<tr>
<td>$q_6$</td>
<td>$q_1$</td>
<td>$q_3$</td>
<td>$q_5$</td>
<td>$q_7$</td>
<td>$q_9$</td>
<td>$q_0$</td>
</tr>
<tr>
<td>$q_7$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>$q_8$</td>
</tr>
<tr>
<td>$q_8$</td>
<td>$q_1$</td>
<td>$q_3$</td>
<td>$q_5$</td>
<td>$q_7$</td>
<td>$q_9$</td>
<td>$q_0$</td>
</tr>
<tr>
<td>$q_9$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>$q_{10}$</td>
</tr>
<tr>
<td>$q_{10}$</td>
<td>$q_1$</td>
<td>$q_3$</td>
<td>$q_5$</td>
<td>$q_7$</td>
<td>$q_9$</td>
<td>$q_0$</td>
</tr>
</tbody>
</table>
its translation function $\lambda'$. Note that there is a space between each pair of consecutive words in the input character string, and there is no space token between each pair of consecutive lexical tokens in the string of lexical tokens.

The string of lexical tokens is parsed by the LR(1) parser whose moves are shown in Table 6.3. At step 1, the parse table entry corresponding to $(q_0, I)$ is consulted. Its value is $(s, q_0, *, *, *, i)$. This results in shifting $I$ and pushing state $q_3$ onto the shift/reduce stack. At step 2, the table entry corresponding to $(q_3, x)$ is consulted first. Its value is $(r, *, p_6, 1, F, i)$ which indicates a reduce on production rule $p_6$. Therefore, state $q_3$ is popped off the stack, and table entry corresponding to $(q_0, F)$ is consulted next. The entry is $(s, q_3, *, *, *, i)$ which means shifting $F$ and pushing state $q_2$ onto the stack. The remaining steps are executed in a similar fashion. At the end of the moves (step 14), the sequence of production rules stored in the stack for parse tree can be applied in reverse order to derive the string $1x1+1$ from grammar start symbol $E$.

### 6.4 Performance Analysis

This section explores potential performance advantages of the proposed neural network architecture for syntax analysis in comparison with that of current computer systems that employ inherently sequential index or matrix structure for information matching and retrieval. The performance estimates for the NNLR Parser assume hardware realization based on current CMOS VLSI technology. In the analysis that follows, it is assumed that the two systems have comparable I/O performance and error handling capabilities.

To simplify the comparison, it is assumed that each instruction on a conventional computer takes $\tau$ ns (nanoseconds) on an average. For instance, on a relatively cost-effective 100 MIPS processor, a typical instruction would take 10 ns to complete. (The MIPS measure for speed combines clock speed, effect of caching, pipelining and superscalar design into a single figure for speed of a microprocessor). Similarly, we will assume that a single identification and recall operation by a neural associative memory takes $\alpha$ ns. Assuming hardware implementation based on current CMOS VLSI technology, $\alpha = 20$ ns (see Section 3.2.1.1).
Syntax analysis in a conventional computer typically involves: lexical analysis, grammar parsing, parse tree construction and error handling. These four processes are generally coded into two modules [2]. Error handling is usually embedded in grammar parsing and lexical analysis respectively, and parse tree construction is often embedded in grammar parsing. The procedure for grammar parsing is the main module. In single-CPU computer systems, even assuming negligible overhead for parameter passing, a procedure call entails, at the very minimum, (1) saving the context of the caller procedure and activation of the callee procedure which typically requires 6 instructions [105]; and (2) context restoration and resumption of caller procedure upon the return (exit) of the callee procedure, which typically requires at least 3 instructions [105]). Thus, a procedure call entails a penalty of 9 instructions or about 9\,\mu s.

6.4.1 Performance analysis of lexical analyzer

Lexical analysis can be performed by a DFA whose transition function can be represented as a 2-dimensional table with current state and current input symbol as indices. The continuous transition moves of such a DFA involve repetitive lookup of its next state from the table using

<table>
<thead>
<tr>
<th>Step</th>
<th>Content of shift/reduce stack</th>
<th>Remaining input</th>
<th>Referred entries of parse table</th>
<th>Content of parse tree stack</th>
</tr>
</thead>
<tbody>
<tr>
<td>(1)</td>
<td>$g_0$</td>
<td>$I \times I + I$</td>
<td>$(g_0, I)$</td>
<td>$\perp$</td>
</tr>
<tr>
<td>(2)</td>
<td>$q_0 q_5$</td>
<td>$I + I$</td>
<td>$(q_5, X)$, $(q_0, F)$</td>
<td>$\perp$</td>
</tr>
<tr>
<td>(3)</td>
<td>$q_0 q_3$</td>
<td>$I + I$</td>
<td>$(q_3, X)$, $(q_0, T)$</td>
<td>$\perp p_6$</td>
</tr>
<tr>
<td>(4)</td>
<td>$q_0 q_2$</td>
<td>$I + I$</td>
<td>$(q_2, X)$</td>
<td>$\perp p_6 p_4$</td>
</tr>
<tr>
<td>(5)</td>
<td>$q_0 q_2 q_7$</td>
<td>$I + I$</td>
<td>$(q_7, I)$</td>
<td>$\perp p_6 p_4$</td>
</tr>
<tr>
<td>(6)</td>
<td>$q_0 q_2 q_7 q_5$</td>
<td>$I + I$</td>
<td>$(q_5, +)$, $(q_7, F)$</td>
<td>$\perp p_6 p_4$</td>
</tr>
<tr>
<td>(7)</td>
<td>$q_0 q_2 q_7 q_10$</td>
<td>$I + I$</td>
<td>$(q_{10}, +)$, $(q_0, T)$</td>
<td>$\perp p_6 p_4 p_6$</td>
</tr>
<tr>
<td>(8)</td>
<td>$q_0 q_2$</td>
<td>$I + I$</td>
<td>$(q_{2}, +)$, $(q_0, E)$</td>
<td>$\perp p_6 p_4 p_6 p_3$</td>
</tr>
<tr>
<td>(9)</td>
<td>$q_0 q_1$</td>
<td>$I + I$</td>
<td>$(q_{1}, +)$</td>
<td>$\perp p_6 p_4 p_6 p_3 p_2$</td>
</tr>
<tr>
<td>(10)</td>
<td>$q_0 q_1 q_6$</td>
<td>$I + I$</td>
<td>$(q_6, I)$</td>
<td>$\perp p_6 p_4 p_6 p_3 p_2$</td>
</tr>
<tr>
<td>(11)</td>
<td>$q_0 q_1 q_6 q_5$</td>
<td>$I + I$</td>
<td>$(q_5, S)$, $(q_6, F)$</td>
<td>$\perp p_6 p_4 p_6 p_3 p_2$</td>
</tr>
<tr>
<td>(12)</td>
<td>$q_0 q_1 q_6 q_3$</td>
<td>$I + I$</td>
<td>$(q_3, S)$, $(q_6, T)$</td>
<td>$\perp p_6 p_4 p_6 p_3 p_2$</td>
</tr>
<tr>
<td>(13)</td>
<td>$q_0 q_1 q_6 q_9$</td>
<td>$I + I$</td>
<td>$(q_9, S)$, $(q_0, E)$</td>
<td>$\perp p_6 p_4 p_6 p_3 p_2 p_6 p_4$</td>
</tr>
<tr>
<td>(14)</td>
<td>$q_0 q_1$</td>
<td>$I + I$</td>
<td>$(q_1, S)$</td>
<td>$\perp p_6 p_4 p_6 p_3 p_2 p_6 p_4 p_1$</td>
</tr>
</tbody>
</table>
current state and current input symbol at each move until an error state or an accepting state is reached. Such a repetitive table lookup involves content-based pattern matching and retrieval which can be performed potentially more efficiently by neural associative memories.

Each entry of the DFA transition table implemented on conventional computers usually contains three parts: the next state; a code for whether the next state is an accepting state, an error state, or neither; and the lexical token to use if the next state is an accepting state. Implementing such a repetitive table lookup on conventional computers requires, at a minimum, six instructions: one (or two) multiplication and one addition to compute the offset in the transition table (to access the location where the next state is stored), one memory access to fetch the next state from the table, one addition to compute the offset of the second part in the transition table (based on the known offset of the first part), one memory access to fetch the second part from the table, and one branch-on-comparison instruction to jump back to the first instruction of the loop if the next state is neither an error state nor an accepting state. (Note that this analysis ignores I/O processing requirements). Thus, each state transition takes 6 instructions or 6\( \tau \) ns.

In contrast, the proposed NN architecture for lexical analyzer computes the next state using associative (content-addressed) pattern matching-and-retrieval in a single identification-and-recall cycle of a BMP module. In the 2-dimensional table, the values of the two indices for an entry provide a unique pattern – the index pattern, for accessing the table entry. In the BMP module, each index pattern and the corresponding entry are stored as an association pair by a hidden neuron and its associated connections. The BMP performs a table lookup in two steps: identification and recall. In the identification step, a given index pattern is compared to all stored index patterns in parallel by the hidden neurons and their associated 1st-layer connections. Once a match is found, one of the hidden neurons is activated to recall the associated entry value using the 2nd-layer connections associated with the activated hidden neuron.

In program compilation, a segmented word is translated into a syntactically tagged token when the DFA for lexical analysis enters an accepting state. On conventional computers, this
translation step costs, at the very minimum, three instructions (or $3\tau$ ns): one addition to compute the offset of the third part in the transition table (based on the known offset of the first part), one memory access to fetch the lexical token from the table, and one branch instruction to jump back to the first instruction of the loop for carving next word.

In other syntax analysis applications that involve large vocabularies, a database lookup is typically used to translate a word into a syntactically tagged token. In this case, depending on the size of the vocabulary and the organization of the database, it would generally take more than 10 instructions to perform this translation. (See Chapter 3 for a comparison of database query processing using neural associative memories as opposed to conventional computers).

A BMP module is capable of translating a carved word into a token as described in Section 2.2.5 in a single cycle of identification-and-recall with a time delay of $\alpha$ ns. Note that this step can be pipelined (see the NNLR Parser in action in Section 6).

In summary, if we assume the average length of words in input string being $W$ symbols and we ignore I/O, error handling and the overhead associated with procedure calls, it would take $(6W + 3)\tau$ ns on average to perform lexical analysis of a word on a conventional computer. In contrast, it would take $(W + 1)\alpha$ ns using the proposed NN lexical analyzer. This analysis ignores I/O and error handling. For example, assuming a 100 MIPS conventional computer ($\tau = 10$ ns), and current CMOS VLSI implementation of neural associative memories ($\alpha = 20$ ns), with $W = 5$, then the former takes 330 ns and the latter 120 ns.

6.4.2 Performance analysis of LR parser

LR parsing also involves repetitive table lookup which can be performed efficiently by neural associative memories. LR parser is driven by a 2-dimensional table (parse table) with current state and current input symbol as indices. Once a next state is retrieved, it is stored on a stack and is used as the current state for the next move. Parsing involves repetitive application of a sequence of shift and reduce moves. A shift move would take at least 6 instructions, or equivalently $6\tau$ ns on a conventional computer. This includes 3 instructions to consult the parse table, 1 instruction to push the next state onto the stack, 1 instruction to increment the
stack pointer, and 1 instruction to go back to the first instruction of the repetitive loop for next move. A typical reduce move involves a parse table lookup, a pop of the state stack, a push to store a rule into the stack for parse tree, and a shift move. Thus, a typical reduce would take at least $3 + 1 + 2 + 6 = 12$ instructions, or equivalently $12\tau$ ns, on a conventional computer.

In the proposed NNLR Parser, the computation delay consists of the delays contributed by the operation of the two NN Stacks and the BMP which stores the parse table. An NN Stack consists of two BMP modules, one of which is augmented with a write control module. Assuming that the computation delay of an NN stack is roughly equal to that of two sequentially linked BMPs ($2\alpha$ ns), a shift move (which takes one operation cycle of the NNLR Parser) and a typical reduce move (which takes two operation cycles of the NNLR Parser) would consume $3\alpha$ ns and $6\alpha$ ns respectively. (This analysis ignores the effect of queuing between the NNLR Parser and the NN lexical analyzer).

Assuming that the average length of words in input string be $W$ symbols, and ignoring I/O, error handling and the overhead associated with procedure calls, parsing a word (a word has to be translated into a lexical token by lexical analysis first) by shift and reduce moves would take $(6W + 9)\tau$ ns and $(6W + 15)\tau$ ns respectively on a conventional computer.

In contrast, because the NNLR Parser and NN lexical analyzer can operate in parallel, shift and reduce moves take $3\alpha$ ns or $(W + 1)\alpha$ ns (whichever is larger) and $6\alpha$ ns or $(W + 1)\alpha$ ns (whichever is larger) respectively on the NNLR Parser.

Thus, as shown in Table 6.4, for typical values of $\alpha$, $\tau$, and $W$, the proposed NNLR Parser offers a potentially attractive alternative to conventional computers for syntax analysis.

It should be noted that the preceding performance comparison has not considered alternative hardware realizations of syntax analyzers. These include hardware implementations of parsers using conventional building blocks used for building today’s serial computers. We are not aware of any such implementations although clearly, they can be built. In this context it is worth noting that the neural architecture for syntax analysis proposed in this chapter makes extensive use of massively parallel processing capabilities of neural associative proces-
Table 6.4 A comparison of the estimated performance of the proposed NNLR Parser with that of conventional computer systems for syntax analysis

<table>
<thead>
<tr>
<th>Type of overhead</th>
<th>NNLR Parser</th>
<th>Conventional computers</th>
</tr>
</thead>
<tbody>
<tr>
<td>time for lexical analysis of a word</td>
<td>$(W + 1)\alpha$</td>
<td>$(6W + 3)\tau$</td>
</tr>
<tr>
<td>time for a shift move of parsing</td>
<td>$\max [3\alpha, (W + 1)\alpha]$</td>
<td>$(6W + 9)\tau$</td>
</tr>
<tr>
<td>time for a reduce move of parsing</td>
<td>$\max [6\alpha, (W + 1)\alpha]$</td>
<td>$(6W + 15)\tau$</td>
</tr>
</tbody>
</table>

sors (memories). It is quite possible that other parallel (possibly non neural network) hardware realizations of syntax analyzers offer performance that compares favorably with that of the proposed neural network realization. We can only speculate as to why there appears to have been little research on parallel architectures for syntax analysis. Historically, research in high performance computing has focused primarily on speeding up the execution of numeric computations, typically performed by programs written in compiled languages such as C and FORTRAN. In such applications, syntax analysis is done during program compilation which is relatively infrequently compared to program execution. The situation is quite different in symbol processing (e.g., knowledge based systems of AI, analysis of mathematical expressions in software designed for symbolic integration, algebraic simplification, theorem proving) and interactive programming environments based on interpreted programming languages (e.g., LISP, JAVA). Massively parallel architectures for such tasks are only beginning to be explored.

6.5 Summary and Discussion

This chapter has explored the design of a neural architecture for syntax analysis of languages with known (a-priori specified) grammars. Syntax analysis is a prototypical symbol processing task with a diverse range of applications in artificial intelligence, cognitive modelling, and computer science. Examples of such applications include: language interpreters for interactive programming environments using interpreted languages (e.g., LISP, JAVA), parsing of symbolic expressions (e.g., in real-time knowledge based systems, database query processing, and mathematical problem solving environments), syntactic or structural analysis of large collections of data (e.g., molecular structures, engineering drawings, etc.), and high-performance
compilers for program compilation and behavior-based robotics. Indeed, one would be hard-pressed to find a computing application that does not rely on syntax analysis at some level. The need for syntax analysis in real time calls for novel solutions that can deliver the desired performance at an affordable cost. Artificial neural networks, due to their potential advantages for real-time applications on account of their inherent parallelism, offer an attractive approach to the design of high performance syntax analyzers.

The proposed neural architecture for syntax analysis is obtained through systematic and provably correct composition of a suitable set of component symbolic functions which are ultimately realized using neural associative processor (memory) modules. The neural associative processor (memory) is essentially a 2-layer perceptron which can store and retrieve arbitrary binary pattern associations [21]. It is a cost-effective SIMD (single instruction, multiple data) computer system for massively parallel pattern matching and retrieval. Since each component in the proposed neural architecture computes a well-defined symbolic function, it facilitates the systematic synthesis as well as analysis of the desired computation at a fairly abstract (symbolic) level. Realization of the component symbolic functions using neural associative processors (memories) allows one to exploit massive parallelism to support applications that require syntax analysis to be performed in real time.

The proposed neural network for syntax analysis is capable of handling sequentially presented character strings of variable length, and it is assembled from neural network modules for lexical analysis, stack processing, parsing, and parse tree construction. The neural network stack can realize stacks of arbitrary depths (limited only by the number of neurons available). The parser outputs the parse tree resulting from syntax analysis of strings from widely used subsets of deterministic context-free languages (i.e., those generated by LR grammars). Since logically an LR parser consists of two parts, a driver routine which is the same for all LR parsers, and a parse table which varies from one application to the next [3], the proposed NNLR Parser can be used as a general-purpose neural architecture for LR parsing.

It is relatively straightforward to estimate the cost and performance of the proposed neural architecture for syntax analysis based on the known computation delays associated with
the component modules (using known facts or a suitable set of assumptions regarding current
VLSI technology for implementing the component modules). Our estimates suggest that the
proposed system offers a systematic and provably correct approach to designing cost-effective
high-performance syntax analyzers for real-time syntax analysis using known (a-priori speci-
fied) grammars.

The choice of the neural associative processors (memories) as the primary building blocks
for the synthesis of the proposed neural architecture for syntax analysis was influenced, among
other things, by the fact that they find use in a wide range of systems in computer science,
artificial intelligence, and cognitive modelling. This is because associative pattern matching
and recall is central to pattern-directed processing which is at the heart of many problem
solving paradigms in AI (e.g., knowledge based expert systems, case based reasoning) as well
as computer science (e.g., database query processing, information retrieval). As a result, design,
VLSI implementation, and applications of associative processors have been studied extensively
in the literature [21, 23, 68, 78, 88, 97, 110, 124, 127, 151, 153]. The neural network architecture
proposed in this chapter for syntax analysis demonstrates the versatility of neural associative
processors (memories) as generic building blocks for systematic synthesis of modular massively
parallel architectures for symbol processing applications.

It should be noted that the primary focus of this chapter was on taking advantage of massive
parallelism and associative pattern storage, matching, and recall properties of a particular class
of neural associative memories in designing high performance syntax analyzers for a-priori
specified grammars. Consequently, it has not addressed several other potential advantages
of neural network architectures for intelligent systems. Notable among these are inductive
learning and fault tolerance.

Machine learning of grammars or grammar inference is a major research topic which has
been, and continues to be, the subject of investigation by a large number of researchers in
artificial intelligence, machine learning, syntactic pattern recognition, neural networks, com-
putational learning theory, natural language processing, and related areas. The surveys of
grammar inference in general can be found in [69, 96, 115, 137], and the recent results on
grammar inference using neural networks can be found in [6, 13, 27, 38, 45, 44, 66, 77, 80, 116, 122, 123, 129, 159, 161, 166, 174, 192, 194, 197].

Fault tolerance capabilities of neural architectures under different fault models (neuron faults, connection faults, etc) has been the topic of considerable research [21, 165, 180] and is beyond the scope of this chapter. However, it is worth noting that the proposed neural network design for syntax analysis inherits some of the fault tolerance capabilities of its primary building block, the neural associative processor (memory) (see Section 2.3.3 for details).
7 CONCLUSION

Traditional symbol processing models of AI and ANN have been viewed by many as radically (and perhaps even irreconcilably) different paradigms for the design of intelligent systems. But given the fact that they are essentially equivalent in terms of their computing capabilities, a more reasonable view is that they each represent different architectural commitments and hence different cost-performance tradeoffs within the space of possible designs for intelligent systems. This latter viewpoint argues for a somewhat systematic exploration of this design space in search of novel and efficient computational architectures for such systems. This dissertation takes a few small steps in this direction and adds to the growing body of literature [47, 72, 99, 179] that demonstrates the potential benefits of integrated neural-symbolic architectures that overcome some of the limitations of today's ANN and AI systems.

More specifically, this dissertation develops the theory and implementation of a neural architecture for associative memory which is capable of massively parallel best match, exact match, and partial match. It also demonstrates systematic, provably correct synthesis of efficient neural architectures respectively for information retrieval and database query processing, elementary logical inference, sequence processing, and syntax analysis using neural associative memories as the primary building blocks for massively parallel pattern-directed symbol processing. This facilitates the systematic analysis of the resulting computation performed by the resulting neural systems at a fairly abstract (symbolic) level.
APPENDIX. ACRONYMS

AI: Artificial Intelligence
AM: Associative Memory
ANN: Artificial Neural Networks
BiCMOS: Bipolar Complementary Metal Oxide Semiconductor
BMP: Binary Mapping Perceptron
BNF: Backus-Naur Form
CFG: Context-Free Grammars
CFL: Context-Free Languages
CMOS: Complementary Metal Oxide Semiconductor
DCFL: Deterministic Context-Free Languages
DFA: Deterministic Finite Automata
DNF: Disjunctive Normal Form
DPDA: Deterministic Pushdown Automata
MIPS: Million Instructions Per Second
NFA: Nondeterministic Finite Automata
NLP: Natural Language Processing
NN DFA: Neural Network for Deterministic Finite Automata
NN DPDA: Neural Network for Deterministic Pushdown Automata
NN NFA: Neural Network for Nondeterministic Finite Automata
NN Stack: Neural Network for Stack
NNLR Parser: Neural Network for LR(1) Parser
PLA: Programmable Logic Array
RAAM: Recursive Auto-Associative Memory
RNN: Recurrent Neural Networks
SIMD: Single Instruction Multiple Data
VLSI: Very Large Scale Integration
BIBLIOGRAPHY


