## Abstract

Neural decoding may be formulated as dynamic state estimation (filtering) based on point-process observations, a generally intractable problem. Numerical sampling techniques are often practically useful for the decoding of real neural data. However, they are less useful as theoretical tools for modeling and understanding sensory neural systems, since they lead to limited conceptual insight into optimal encoding and decoding strategies. We consider sensory neural populations characterized by a distribution over neuron parameters. We develop an analytically tractable Bayesian approximation to optimal filtering based on the observation of spiking activity that greatly facilitates the analysis of optimal encoding in situations deviating from common assumptions of uniform coding. Continuous distributions are used to approximate large populations with few parameters, resulting in a filter whose complexity does not grow with population size and allowing optimization of population parameters rather than individual tuning functions. Numerical comparison with particle filtering demonstrates the quality of the approximation. The analytic framework leads to insights that are difficult to obtain from numerical algorithms and is consistent with biological observations about the distribution of sensory cells' preferred stimuli.

## 1 Introduction

Populations of sensory neurons encode information about the external world through their spiking activity. To understand this encoding, it is natural to model it as an optimal or near-optimal code in the context of some task performed by higher brain regions, using performance criteria such as decoding error or motor performance. A Bayesian theory of neural decoding is useful to characterize optimal encoding, as the computation of performance criteria typically involves the posterior distribution of the world state conditioned on spiking activity.

We model the external world state as a random process, observed through a set of sensory neuron-like elements characterized by multidimensional tuning functions, representing the elements' average firing rate (see Figure 1). The actual firing of each cell is random and is given by a point process (PP) with rate determined by the external state and the cell's tuning function (Dayan & Abbott, 2005). Under this model, decoding of sensory spike trains may be formulated as a filtering problem based on PP observations, thus falling within the purview of nonlinear filtering theory (Snyder & Miller, 1991; Brémaud, 1981). Inferring the hidden state under such circumstances has been widely studied within the computational neuroscience literature (Dayan & Abbott, 2005; Macke, Buesing, & Sahani, 2015). Beyond neuroscience, PP-based filtering has been used for position sensing and tracking in optical communication (Snyder, Rhodes, & Hoversten, 1977), control of computer communication networks (Segall, 1978), queuing (Brémaud, 1981), and econometrics (Frey & Runggaldier, 2001).

A significant amount of work has been devoted in recent years to the development of algorithms for fast approximation of the posterior distribution, leading to an extensive literature (see Macke et al., 2015, for a recent review). Much of this work is devoted to the development of effective sampling techniques, leading to high-performing finite-dimensional filters that can be applied profitably to real neural data. These approaches are usually formulated in discrete time, as befits implementation on digital computers, and lead to complex mathematical expressions for the posterior distributions, which are difficult to interpret qualitatively. In this work, we are less concerned with algorithmic issues and more with establishing closed-form analytic expressions for approximately optimal continuous-time filters and using these to characterize the nature of near-optimal encoders, namely, determining the structure and distribution of tuning functions for optimal state inference. A significant advantage of the closed-form expressions over purely numerical techniques is the insight and intuition that are gained from them about qualitative aspects of the system. Moreover, the leverage gained by the analytic computation contributes to reducing the variance inherent in Monte Carlo approaches. Thus, in this work, we do not compare our results to algorithmically oriented discrete-time filters but, rather, to other continuous-time analytically expressible filters for dynamically varying signals, with the aim of gaining insight into optimal decoding and encoding within an analytic framework.

The problem of filtering a continuous-time diffusion process through PP observations is solved formally under general conditions in Snyder (1972; see also Segall, 1976, and Solo, 2000), where a stochastic partial differential equation (PDE) for the infinite-dimensional posterior state distribution is derived. However, this PDE is intractable in general and not easily amenable to qualitative or even numerical analysis. Some previous work has derived exact or approximate finite-dimensional filters under various simplifying assumptions. In many of these works (e.g., Komaee, 2010; Rhodes & Snyder, 1977; Susemihl, Meir, & Opper, 2013; Twum-Danso & Brockett, 2001; Yaeli & Meir, 2010), the tuning functions are chosen so that the total firing rate (i.e., the sum of firing rates of all neurons) is independent of the state, an assumption we refer to as *uniform coding* (see Figure 2). Rhodes and Snyder (1977) derived an exact finite-dimensional filter for the case of linear dynamics with gaussian noise observed through uniform coding with gaussian tuning functions.^{1} Komaee (2010) considered the more general setting of uniform coding with arbitrary tuning functions, where an approximate filter is obtained.

Other work derives the posterior distribution for non-Markovian state dynamics, modeled as gaussian processes. Huys, Zemel, Natarajan, and Dayan (2007) derive the posterior exactly, but its computation is not recursive, requiring memory of the entire spike history. A recursive version for gaussian processes with a Matérn kernel autocorrelation is derived in Susemihl, Meir, and Opper (2011). Both of these works assume uniform coding with gaussian tuning functions.

For reasons of mathematical tractability, few previous analytically oriented works studied neural decoding without the uniform coding assumption in spite of the experimental importance and relevance of nonuniform coding. We discuss such work in comparison to our work in section 6.

The problem of optimal encoding by neural populations has been studied mostly in the static case. A natural optimality criterion is the estimation mean square error (MSE). Some work (e.g., Harper & McAlpine, 2004; Ganguli & Simoncelli, 2014) optimizes Fisher information, which serves as a proxy to the MSE of unbiased estimators through the Cramér-Rao bound (Radhakrishna Rao, 1945) or, in the Bayesian setting, the Van Trees inequality (Gill & Levit, 1995). Fisher information of neural spiking activity is easy to compute analytically, at least in the static case (Dayan & Abbott, 2005), and it can be used without solving the decoding problem. This approach has been used to study nonuniform coding of static stimuli by heterogeneous populations in many works, including Chelaru and Dragoi (2008), Ecker, Berens, Tolias, and Bethge (2011), and Ganguli and Simoncelli (2014). However, optimizing Fisher information may yield misleading qualitative results regarding the MSE-optimal encoding (Bethge, Rotermund, & Pawelzik, 2002; Pilarski & Pokora, 2015; Yaeli & Meir, 2010). Although, under appropriate conditions, the inverse of Fisher information approaches the minimum attainable MSE in the limit of infinite decoding time, it may be a poor proxy for the MSE for finite decoding times, which are of particular importance in natural settings and control problems. Exact computation of the estimation MSE is possible in some restricted settings; some work along those lines is discussed in section 6.2.

A possible alternative is the computation of estimation MSE for a given filter through Monte Carlo simulations. This approach is complicated by high variability between trials, which means many trials are necessary for each value of the parameters. Consequently, optimization becomes very time-consuming, and possibly impractical, when using numerical filters such as particle filtering or large neural populations with many parameters.

In this work, we derive an approximate online filter for the neural decoding problem in continuous time and demonstrate its use in investigating optimal neural encoding. We consider neural populations characterized by a distribution over neuron parameters. Continuous distributions are used to approximate large populations with few parameters, resulting in a filter whose complexity does not grow with the population size and allowing optimization of population parameters rather than individual tuning functions. We suggest further reducing computational complexity for the encoding problem by using the estimated posterior variance as an approximation to estimation MSE, as discussed in appendix C.

Technically, given the intractable infinite-dimensional nature of the posterior distribution, we use a projection method replacing the full posterior at each point in time by a projection onto a simple family of distributions (gaussian in our case). This approach, originally developed in the filtering literature (Brigo, Hanzon, & Le Gland, 1999; Maybeck, 1979) and termed *assumed density filtering* (ADF), has been successfully used more recently in machine learning (Opper, 1998; Minka, 2001). We derive approximate filters for gaussian tuning functions and for several distributions over tuning function centers, including the case of a finite population. These filters may be combined to obtain a filter for heterogeneous mixtures of homogeneous subpopulations. We are not aware of any previous work providing an effective closed-form filter for heterogeneous populations of sensory neurons characterized by a small number of parameters.

The main contributions of this letter are as follows:

- Derivation of closed-form recursive expressions for the continuous-time posterior mean and variance within the ADF approximation, in the context of large nonuniform populations characterized by a small number of parameters

- Demonstration of the quality of the ADF approximation by comparison to state-of-the-art particle filtering methods

- Characterization of optimal adaptation (encoding) for sensory cells in a more general setting than hitherto considered (nonuniform coding, dynamic signals)

- Demonstration of the interesting interplay between prior information and neuronal firing, showing how in certain situations, the absence of spikes can be highly informative (this phenomenon is absent under uniform coding)

Preliminary results discussed in this letter were presented at a conference (Harel, Meir, & Opper, 2015). These included only the special case of gaussian distribution of preferred stimuli. This letter provides a more general and rigorous formulation of the mathematical framework. By separately considering different terms in the approximate filter, we find that updates at spike time depend only on the tuning function of the spiking neuron and apply generally to any population distribution. For the updates between spikes, we provide closed-form expressions for cases that were not discussed in Harel et al. (2015)—namely, uniform populations on an interval and finite heterogeneous mixtures—and non-closed-form expressions for the general case, given as integrals involving the distribution of neuron parameters. We further supplement our previously published results with numerical evaluation of the filter's accuracy, an additional example application, and a detailed comparison with previous work.

## 2 Problem Overview

Consider a dynamical system with state $X_t \in \mathbb{R}^n$, observed through the firing patterns of $M$ sensory neurons, as illustrated in Figure 1. Each neuron fires stochastically and independently, with the $i$th neuron having firing rate $\lambda_i(X_t)$. (More detailed assumptions about the dynamics of the state and observation processes are described in later sections.) In this context, we are interested in the question of optimal encoding and decoding. By *decoding*, we mean computing (exactly or approximately) the full posterior distribution of $X_t$ given $N_t$, the history of neural spikes up to time $t$. The problem of optimal encoding is then the problem of optimal sensory cell configuration: finding the optimal rate functions $\{\lambda_i(\cdot)\}_{i=1}^{M}$ so as to minimize some performance criterion. We assume the set $\{\lambda_i\}_{i=1}^{M}$ belongs to some parameterized family with parameter $\varphi$.
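As a concrete illustration of this observation model, the following sketch simulates spiking from a small population in discrete time. The gaussian tuning curves, Ornstein-Uhlenbeck state dynamics, and all parameter values here are illustrative choices, not taken from the letter:

```python
import numpy as np

# Minimal simulation of the observation model: M neurons fire as
# state-modulated point processes, approximated in discrete time by
# Bernoulli spiking with probability lambda_i(X_t) * dt per bin.
rng = np.random.default_rng(0)
M, dt, T = 8, 1e-3, 2.0
theta = np.linspace(-2, 2, M)        # preferred stimuli (illustrative)
h, alpha2 = 20.0, 0.3                # tuning height and squared width

def rates(x):
    """Firing rates lambda_i(x) of all M neurons for state value x."""
    return h * np.exp(-(x - theta) ** 2 / (2 * alpha2))

x, spikes = 0.0, []
for k in range(int(T / dt)):
    x += -x * dt + 0.5 * np.sqrt(dt) * rng.standard_normal()  # OU state step
    fired = rng.random(M) < rates(x) * dt
    spikes.extend((k * dt, i) for i in np.nonzero(fired)[0])
```

Neurons whose preferred stimuli track the trajectory of $X_t$ fire most, so the spike record carries information about the state.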

To quantify the performance of the encoding-decoding system, we summarize the result of decoding using a single estimator $\hat{X}_t = \hat{X}_t(N_t)$ and define the mean square error (MSE) as $\epsilon_t \triangleq \mathrm{trace}[(X_t-\hat{X}_t)(X_t-\hat{X}_t)^{T}]$. We seek $\hat{X}_t$ and $\varphi$ that solve $\min_\varphi \min_{\hat{X}_t} E[\epsilon_t] = \min_\varphi E\big[\min_{\hat{X}_t} E[\epsilon_t \mid N_t]\big]$. The inner minimization problem in this equation is solved by the MSE-optimal decoder, which is the posterior mean $\hat{X}_t = \mu_t \triangleq E[X_t \mid N_t]$. The posterior mean may be computed from the full posterior obtained by decoding. The outer minimization problem is solved by the optimal encoder. If decoding is exact, the problem of optimal encoding becomes that of minimizing the expected posterior variance. Note that although we assume a fixed parameter $\varphi$ that does not depend on time, the optimal value of $\varphi$ generally depends on the time $t$ at which the error is to be minimized. In principle, the encoding/decoding problem can be solved for any value of $t$. In order to assess performance, it is convenient to consider the steady-state limit $t \to \infty$ for the encoding problem.
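The optimality of the posterior mean for the inner minimization can be checked numerically in a toy discrete setting. The posterior probabilities below are hypothetical, used only for illustration:

```python
import numpy as np

# Toy check that the posterior mean solves the inner minimization:
# for hypothetical posterior probabilities P(X = x | N_t), the conditional
# MSE E[(X - c)^2 | N_t] is minimized at c = E[X | N_t].
x_vals = np.array([-1.0, 0.0, 2.0])
post = np.array([0.2, 0.5, 0.3])      # hypothetical posterior weights

def conditional_mse(c):
    """E[(X - c)^2 | N_t] for a candidate point estimate c."""
    return float(np.sum(post * (x_vals - c) ** 2))

post_mean = float(np.sum(post * x_vals))
grid = np.linspace(-2.0, 3.0, 1001)
best = grid[np.argmin([conditional_mse(c) for c in grid])]
assert abs(best - post_mean) < 1e-2   # minimizer is the posterior mean
```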

Below, we approximately solve the decoding problem for any $t$. We then explore the problem of choosing the optimal encoding parameters $\varphi $ using Monte Carlo simulations in examples motivated by experimental results.

Having an efficient (closed-form) approximate filter allows performing the Monte Carlo simulation at a significantly reduced computational cost relative to numerical methods such as particle filtering. The computational cost is further reduced by averaging the computed posterior variance across trials, rather than the squared error, thereby requiring fewer trials. The mean of the posterior variance equals the MSE of the posterior mean but is less noisy than the squared error itself, since it is by definition the conditional mean of the squared error given $N_t$.
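This variance-reduction argument can be illustrated in a simple conjugate-gaussian setting (not the spiking model; all values are illustrative): the trial-averaged squared error and the trial-averaged posterior variance estimate the same MSE, but the latter, being the conditional mean of the former, fluctuates less across trials.

```python
import numpy as np

# Conjugate-gaussian illustration: X ~ N(0, 1), Y = X + noise of variance
# s2, so E[X | Y] = Y / (1 + s2) and Var[X | Y] = s2 / (1 + s2).
rng = np.random.default_rng(0)
n_trials, s2 = 2000, 0.5
x = rng.standard_normal(n_trials)
y = x + np.sqrt(s2) * rng.standard_normal(n_trials)

post_mean = y / (1 + s2)
post_var = s2 / (1 + s2)              # deterministic in this gaussian model

sq_err = (x - post_mean) ** 2
assert abs(sq_err.mean() - post_var) < 0.05   # same mean (the MSE) ...
assert sq_err.std() > 0.1                     # ... but sq_err fluctuates
```

In this gaussian example the posterior variance is constant across trials, making the effect extreme; in the spiking model it is random but still varies less than the squared error.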

## 3 Decoding

### 3.1 A Finite Population of Gaussian Neurons

For ease of exposition, we first formulate the problem for a finite population of neurons. We address a more general setting in subsequent sections.

#### 3.1.1 State and Observation Model


The state is modeled as a diffusion process, where $W_t$ is a standard Wiener process whose increments are independent of the history of all other random processes. The integral with respect to $dW_t$ is to be interpreted in the Ito sense. The initial condition $X_0$ is assumed to have a continuous distribution with a known density.


#### 3.1.2 Model Limitations

The model outlined above involves several simplifications to achieve tractability. Namely, tuning functions are assumed to be gaussian, and firing rates are assumed to be independent of state history and spike history given the current state (see equation 3.2), yielding a DSPP model (see note 4).

Gaussian tuning functions are a reasonable model for some neural systems but inadequate for others, for example, where the tuning is sigmoidal or there is a baseline firing rate regardless of stimulus value. For simplicity, we focus on the gaussian case in this work. It is straightforward to extend the derivation presented here to piecewise linear tuning, which may be used to represent sigmoidal tuning functions, but the resulting expressions are more cumbersome. We have also developed closed-form results for tuning functions given by sums of gaussians; however, these require further approximations in order to obtain analytic results and are not discussed in this letter.

The assumption of history-independent rates may also limit the model's applicability. Real sensory neurons exhibit firing history dependence in the form of refractory periods and rate adaptation (Dayan & Abbott, 2005), state history dependence such as input integration (Dayan & Abbott, 2005), or correlations between the firing of different neurons conditioned on the state (Pillow et al., 2008). These phenomena are captured by some encoding models, such as simple integrate-and-fire models, as well as more complex physiological models like the Hodgkin-Huxley model. However, the simplifying independence assumptions above are common to all work presenting closed-form, continuous-time filters for point-process observations that we are aware of.

Note that characterization of the point processes in terms of their history-conditioned firing rate, as opposed to finite-dimensional distributions, does not in itself restrict the model's generality in any substantial way (see Segall & Kailath, 1975, theorem 1). Rather, the independence assumptions are expressed rigorously by the fact that the right-hand side of equation 3.2 depends on neither previous values of $X$ nor previous spike times of any neuron. Some of our analysis applies without modification when rates are allowed to depend on spiking history (specifically, equations 3.6), so it may be possible to extend these techniques to some history-dependent models. However, when rates may depend on the state history, exact filtering may involve the posterior distribution of the entire state history rather than the current state, so that a different approach is probably required.

#### 3.1.3 Exact Filtering Equations



The stochastic PDE, equation 3.5, is nonlinear and nonlocal (due to the dependence of $\hat{\lambda}_t^i$ on $p_t^N$) and therefore usually intractable. Rhodes and Snyder (1977) and Susemihl et al. (2014) consider linear dynamics with a gaussian prior and gaussian sensors with centers distributed uniformly over the state space. In this case, the posterior is gaussian, and equation 3.5 leads to closed-form ordinary differential equations (ODEs) for its mean and variance. In our more general setting, we can obtain exact equations for the posterior mean and variance, as follows.

In contrast with the more familiar case of linear dynamics with gaussian white noise and the corresponding Kalman-Bucy filter (Maybeck, 1979), here the posterior variance is random and is generally not monotonically decreasing even when estimating a constant state. However, noting that $E[dN_t^i - \hat{\lambda}_t^i\,dt] = 0$, we may observe from equation 3.6b that for a constant state ($A=D=0$), the expected posterior variance $E[\Sigma_t]$ is decreasing, since the first two terms in equation 3.6b vanish.

Note that the gaussian tuning assumption, equation 3.3, has not been used in this section, and equations 3.7 are valid for any form of $\lambda_i$.

#### 3.1.4 ADF Approximation

While equations 3.7 are exact, they are not practical, since they require computation of posterior expectations $E_t^N[\cdot]$. To bring them to a closed form, we use ADF with an assumed gaussian density (see Opper, 1998, for details). Informally, this may be envisioned as integrating equation 3.7 while replacing the distribution $p_t^N$ by its approximating gaussian “at each time step.” The approximating gaussian is obtained by matching the first two moments of $p_t^N$ (Opper, 1998). Note that the solution of the resulting equations does not in general match the first two moments of the exact solution, though it may approximate it. Practically, the ADF approximation amounts to substituting the normal distribution $\mathcal{N}(\mu_t, \Sigma_t)$ for $p_t^N$ to compute the expectations in equation 3.7. This heuristic may be justified by its relation to a projection method, where the right-hand side of the density PDE is projected onto the tangent space of the approximating family of densities: the two approaches are equivalent when the approximating family is exponential (Brigo et al., 1999).
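The core ADF computation, taking posterior expectations of a gaussian tuning function under the moment-matched gaussian, is available in closed form via a standard gaussian-product identity. A minimal 1D sketch (the symbols `h`, `theta`, `alpha2` and all values are illustrative stand-ins, not the letter's notation):

```python
import numpy as np

# ADF expectation of a gaussian tuning function under the moment-matched
# gaussian posterior N(mu, sigma2). Gaussian-product identity:
# E[lambda(X)] = h * sqrt(alpha2 / (alpha2 + sigma2))
#                  * exp(-(mu - theta)^2 / (2 * (alpha2 + sigma2)))
h, theta, alpha2 = 10.0, 1.0, 0.25    # tuning height, center, squared width
mu, sigma2 = 0.3, 0.4                 # current posterior moments

rate_adf = h * np.sqrt(alpha2 / (alpha2 + sigma2)) * \
    np.exp(-(mu - theta) ** 2 / (2 * (alpha2 + sigma2)))

# Check the closed form against direct quadrature over the gaussian density.
x = np.linspace(mu - 8, mu + 8, 160001)
dx = x[1] - x[0]
tuning = h * np.exp(-(x - theta) ** 2 / (2 * alpha2))
density = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
rate_quad = float(np.sum(tuning * density) * dx)
assert abs(rate_adf - rate_quad) < 1e-6
```

It is this availability of gaussian expectations in closed form that lets the ADF updates be written out explicitly for gaussian tuning.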

We therefore turn to the approximation of the nonprior updates, equations 3.7c to 3.7f, in the case of gaussian tuning, equation 3.3.

#### 3.1.5 Interpretation

To gain some insight into the filtering equations, we consider the discontinuous updates, equations 3.9c and 3.9d, and continuous updates, equations 3.9a and 3.9b, in some special cases, in reference to the example presented in Figure 3.

*Discontinuous updates.* Consider the case $H^i = I$. As seen from the discontinuous update equations 3.9c and 3.9d, when the $i$th neuron spikes, the posterior mean moves toward its preferred location $\theta^i$, and the posterior variance decreases (in the sense that $\Sigma_{t^+} - \Sigma_{t^-}$ is negative definite), as seen in Figure 3. Neither update depends on $h^i$.
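In one dimension with $H^i = I$, this spike-time update reduces to the familiar gaussian-product (precision-weighted) rule; a minimal sketch with illustrative values:

```python
# 1D spike-time update for gaussian tuning with unit shape matrix:
# multiply the gaussian posterior N(mu, sigma2) by a gaussian likelihood
# centered at the spiking neuron's preferred stimulus theta_i, renormalize.
mu, sigma2 = 0.0, 1.0        # posterior moments just before the spike
theta_i, alpha2 = 2.0, 0.5   # spiking neuron's center and squared width

sigma2_post = 1.0 / (1.0 / sigma2 + 1.0 / alpha2)    # precisions add
mu_post = sigma2_post * (mu / sigma2 + theta_i / alpha2)

# The mean moves toward theta_i and the variance decreases;
# neither update involves the height h^i.
assert abs(mu_post - theta_i) < abs(mu - theta_i)
assert sigma2_post < sigma2
```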

*Continuous updates.* The continuous mean update, equation 3.9a, contributing between spiking events, also admits an intuitive interpretation in the case where all neurons share the same shape matrices $H^i = H$, $R^i = R$. In this case, the equation reads

This behavior may be observed in Figure 3. When the posterior mean $\mu_t$ is near a neuron's preferred stimulus, it moves away from it between spikes, since a spike is expected from that neuron but none is observed. Similarly, despite the symmetry of the two neurons' preferred stimuli relative to the starting estimate $\mu_0$, the posterior mean shifts at the start of the trial toward the preferred stimulus of the second neuron due to its lower firing rate.

The continuous variance update, equation 3.9b, consists of the difference of two positive semidefinite terms; accordingly, the posterior variance may increase or decrease between spikes along various directions. In Figure 3, the posterior variance decreases before the first spike and increases between spikes afterward.

### 3.2 Continuous Population Approximation

#### 3.2.1 Motivation

The filtering equations 3.9a to 3.9f implement sensory decoding for nonuniform populations. However, their applicability to studying neural encoding in large heterogeneous populations is limited for two closely related reasons. First, the computational cost of the filter is linear in the number of neurons. Second, the size of the parameter space describing the population is also linear in the number of neurons, making optimization of large populations computationally costly. These traits are shared with other filters designed for heterogeneous populations, namely Bobrowski et al. (2009), Eden and Brown (2008), and Twum-Danso and Brockett (2001).

To reduce the parameter space and simplify the filter, we approximate large neural populations by an infinite continuous population, characterized by a distribution over neuron parameters. These distributions are described by few parameters, resulting in a filter whose complexity does not grow with the population size and allowing optimization of population parameters rather than individual tuning functions.

For example, consider a homogeneous population of neurons with preferred stimuli equally spaced on an interval, as depicted in Figure 4a. If the population is large, its firing pattern statistics may be modeled as an infinite population of neurons, with preferred stimuli uniformly distributed on the same interval, as in Figure 4c. In this continuous population model, each spike is characterized by the preferred stimulus of the firing neuron, which is a continuous variable, rather than by the neuron's index. Such a population is parameterized by only two variables representing the end points of the interval, in addition to the tuning function height and width parameters. There is no need for a parameter representing the density of neurons on the interval, as scaling the density of neurons is equivalent to identical scaling of each neuron's maximum firing rate.
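The density-to-height equivalence noted above can be checked numerically: the summed rate of a large, equally spaced population matches the integral over a uniform continuous population once the neuron density is absorbed into the height parameter (illustrative 1D values):

```python
import numpy as np

# Summed rate of M equally spaced gaussian neurons on [a, b] vs. the
# integral over a uniform continuous population, with the neuron density
# absorbed into the height parameter.
a, b, M = -1.0, 1.0, 200
h, alpha2 = 1.0, 0.1                  # per-neuron height, squared width
theta = np.linspace(a, b, M)          # equally spaced preferred stimuli
x = 0.3                               # current stimulus value

rate_finite = float(np.sum(h * np.exp(-(x - theta) ** 2 / (2 * alpha2))))

grid = np.linspace(a, b, 20001)
h_cont = h * M / (b - a)              # height scaled by neuron density
integrand = h_cont * np.exp(-(x - grid) ** 2 / (2 * alpha2))
rate_cont = float(np.sum(integrand) * (grid[1] - grid[0]))

# The two total rates agree to within a fraction of a percent.
assert abs(rate_finite - rate_cont) / rate_finite < 0.01
```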

#### 3.2.2 Marked Point Processes as Continuous Population Models

In the continuous population model, each neuron is identified by a parameter vector $y$, and we write the tuning function as $\lambda(x; y)$, in lieu of the previous notation $\lambda_i(x)$. For example, the gaussian case, equation 3.3, takes the form

Figure 4 illustrates how the activity of a neural population may be represented as a marked point process and how the firing statistics are approximated by a continuous population. In this case, a homogeneous population of neurons with equally spaced preferred stimuli is approximated by a continuous population with uniformly distributed preferred stimuli.

This function is the *intensity kernel* of the marked point process $N$ with respect to the history $(N_t, X_{0,t})$ (e.g., Brémaud, 1981). Heuristically, the dynamics of $N$ may be described by means of the intensity kernel: a point with mark in $dy$ occurs during $[t, t+dt)$ with probability $\lambda(X_t; y)\, f(dy)\, dt$.
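A heuristic discrete-time simulation of such a marked point process, with marks drawn from a discretized uniform population on an interval (an illustrative sketch, not the letter's algorithm; all parameter values are arbitrary):

```python
import numpy as np

# Discrete-time simulation of the marked point process for a fixed state x:
# in each bin of width dt, a spike with mark theta occurs with probability
# lambda(x; theta) f(theta) dtheta dt, here with theta uniform on [a, b].
rng = np.random.default_rng(1)
h, alpha2 = 50.0, 0.1                 # tuning height and squared width
a, b = -1.0, 1.0                      # support of preferred stimuli
x, T, dt = 0.2, 20.0, 1e-3            # state value, duration, time step

marks = np.linspace(a, b, 401)        # discretized mark space
dtheta = marks[1] - marks[0]
rate_per_mark = h * np.exp(-(x - marks) ** 2 / (2 * alpha2)) * dtheta

spikes = []                           # (time, mark) pairs
for k in range(int(T / dt)):
    fired = rng.random(marks.size) < rate_per_mark * dt
    spikes.extend((k * dt, m) for m in marks[fired])

# Spike marks concentrate around the state value x.
mark_values = np.array([m for _, m in spikes])
assert abs(mark_values.mean() - x) < 0.1
```

Each spike is characterized by the preferred stimulus of the firing "neuron" (its mark) rather than a neuron index, as in the continuous population model above.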

#### 3.2.3 Filtering

Using equation 3.16g, we may evaluate the continuous update equations 3.16a and 3.16b in closed form for some specific forms of the population distribution $f$. Note that the discontinuous update equations 3.16c and 3.16d do not depend on $f$ and are already in closed form. We now consider several population distributions where the continuous updates may be brought to closed form.

*Single neuron.* The result for a single neuron with parameters $h, \theta, H, R$ is trivial to obtain from equations 3.16a and 3.16b, yielding

*Uniform population.* Here all neurons share the same height $h$ and shape matrices $H, R$, whereas the location parameter $\theta$ covers $\mathbb{R}^m$ uniformly, $f(dh', d\theta, dH', dR') = \delta_h(dh')\, \delta_H(dH')\, \delta_R(dR')\, d\theta$, where $\delta_x$ is a Dirac measure at $x$, $\delta_x(A) = 1_{\{x \in A\}}$, and $d\theta$ indicates the Lebesgue measure in the parameter $\theta$. A straightforward calculation from equations 3.16a, 3.16b, and 3.16g yields

*Gaussian population.* As in the uniform population case, we assume all neurons share the same height $h$ and shape matrices $H, R$ and differ only in the location parameter $\theta$. Abusing notation slightly, we write $f(dh', d\theta, dH', dR') = \delta_h(dh')\, \delta_H(dH')\, \delta_R(dR')\, f(d\theta)$, where the preferred stimuli are normally distributed,

We take $f$ to be normalized, since any scaling of $f$ may be included in the coefficient $h$ in equation 3.13, resulting in the same point process. Thus, when used to approximate a large population, the coefficient $h$ would be proportional to the number of neurons.

Figure 5a demonstrates the continuous update terms, equation 3.21, as a function of the current mean estimate $\mu_t$, for various values of the population variance $\sigma_{\mathrm{pop}}^2$, including the case of a single neuron, $\sigma_{\mathrm{pop}}^2 = 0$. The continuous update term $d\mu_t^c$ pushes the posterior mean $\mu_t$ away from the population center $c$ in the absence of spikes. This effect weakens as $|\mu_t - c|$ grows due to the factor $\hat{\lambda}_t^f$, consistent with the idea that far from $c$, the lack of events is less surprising, hence less informative. The continuous variance update term $d\sigma_t^{2,c}$ increases the variance when $\mu_t$ is near $\theta$ and otherwise decreases it. This stands in contrast with the Kalman-Bucy filter, where the posterior variance cannot increase when estimating a static state.

#### 3.2.4 Uniform Population on an Interval

Figure 5b demonstrates the continuous update terms, equation 3.23, as a function of the current mean estimate $\mu_t$. When the mean estimate is around an end point of the interval, the mean update $d\mu_t^c$ pushes the posterior mean outside the interval in the absence of spikes. The posterior variance $\sigma_t^2$ decreases outside the interval, where the absence of spikes is expected, and increases inside the interval, where it is unexpected.^{7} When the posterior mean is not near the interval end points, the updates are near zero, consistent with the uniform population case, equation 3.18.

*Finite mixtures.* Note that the continuous updates, equations 3.16a and 3.16b, are linear in $f$. Accordingly, if $f(dy) = \sum_i \alpha_i f_i(dy)$, where each $f_i$ is of one of the above forms, the updates are obtained by the appropriate weighted sums of the filters derived above for the various special forms of $f_i$. This form is quite general: it includes populations where $\theta$ is distributed according to a gaussian mixture, as well as heterogeneous populations with finitely many different values of the shape matrices $H, R$. The resulting filter includes a term for each component of the mixture.
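The linearity in $f$ underlying this mixture rule can be verified numerically: the population intensity under a mixture of preferred-stimulus distributions equals the weighted sum of the per-component intensities (illustrative 1D parameters):

```python
import numpy as np

# Linearity check: the population intensity (integral of lambda(x; theta)
# against f(dtheta)) for a gaussian-mixture f equals the weighted sum of
# per-component intensities, so the continuous updates combine additively.
h, alpha2, x = 5.0, 0.2, 0.3
alphas = [0.4, 0.6]                       # mixture weights
centers, widths = [-1.0, 1.5], [0.1, 0.4] # component means and variances

theta = np.linspace(-8.0, 8.0, 32001)
dtheta = theta[1] - theta[0]
lam = h * np.exp(-(x - theta) ** 2 / (2 * alpha2))

def rate_under(f_theta):
    """Population rate at x for mark density f_theta over theta."""
    return float(np.sum(lam * f_theta) * dtheta)

comps = [np.exp(-(theta - c) ** 2 / (2 * w)) / np.sqrt(2 * np.pi * w)
         for c, w in zip(centers, widths)]
f_mix = alphas[0] * comps[0] + alphas[1] * comps[1]

weighted = sum(a * rate_under(f) for a, f in zip(alphas, comps))
assert abs(rate_under(f_mix) - weighted) < 1e-9
```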

## 4 Numerical Evaluation

Since the filter, equation 3.16, is based on an assumed density approximation, its results may be inexact. We tested the accuracy of the filter in the gaussian population case (see equation 3.20) by numerical comparison with particle filtering (PF; Doucet & Johansen, 2009).

Figure 6 shows two examples of filtering a one-dimensional process observed through a gaussian population (see equation 3.19) of gaussian neurons (see equation 3.3), using both the ADF approximation (see equation 3.20) and a PF for comparison. (See the figure caption for details.) Figure 7 shows the distribution of approximation errors and the deviation of the posterior from gaussian. The approximation errors plotted are the relative error in the mean estimate, $\epsilon_\mu \triangleq (\mu_{\mathrm{ADF}} - \mu_{\mathrm{PF}})/\sigma_{\mathrm{PF}}$, and the relative error in the posterior standard deviation estimate, $\epsilon_\sigma \triangleq (\sigma_{\mathrm{ADF}} - \sigma_{\mathrm{PF}})/\sigma_{\mathrm{PF}}$, where $\mu_{\mathrm{ADF}}, \mu_{\mathrm{PF}}$ are the posterior means obtained from ADF and PF, respectively, and $\sigma_{\mathrm{ADF}}, \sigma_{\mathrm{PF}}$ the corresponding posterior standard deviations. The deviation of the posterior distribution from gaussian is quantified using the Kolmogorov-Smirnov (KS) statistic $\sup_x |F(x) - G(x)|$, where $F$ is the particle distribution cdf and $G$ is the cdf of a gaussian matching $F$'s first two moments. For comparison, the orange lines in Figure 7 show the distribution of this KS statistic under the hypothesis that the particles are drawn independently from a gaussian, which is known as the Lilliefors distribution (see Lilliefors, 1967). As seen in the figure, the gaussian posterior assumption underlying the ADF approximation is quite accurate despite the fact that the population is nonuniform. Accordingly, approximation errors are typically of a few percent (see Table 1).
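The deviation-from-gaussian diagnostic can be sketched as follows: the KS statistic between a sample's empirical cdf and a moment-matched gaussian cdf. The stand-in "particle" sample below is itself gaussian, so the statistic comes out small, of order $1/\sqrt{n}$:

```python
import numpy as np
from math import erf, sqrt

# KS statistic between a sample's empirical cdf and the cdf of a gaussian
# matched to the sample's first two moments.
rng = np.random.default_rng(2)
particles = 0.7 * rng.standard_normal(5000) + 1.2   # stand-in PF sample

def ks_vs_matched_gaussian(sample):
    s = np.sort(sample)
    mu, sd = s.mean(), s.std()
    g = np.array([0.5 * (1 + erf((v - mu) / (sd * sqrt(2)))) for v in s])
    n = s.size
    ecdf_hi = np.arange(1, n + 1) / n   # empirical cdf just above each point
    ecdf_lo = np.arange(0, n) / n       # ... and just below
    return float(max(np.max(np.abs(ecdf_hi - g)),
                     np.max(np.abs(ecdf_lo - g))))

stat = ks_vs_matched_gaussian(particles)
assert stat < 0.05      # small for a genuinely gaussian sample of this size
```

Because the gaussian is fitted to the sample's own moments, the null distribution of this statistic is the Lilliefors distribution rather than the standard KS distribution.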

Statistics of the estimation error distribution for these examples are provided in Table 1.

**(a) 1D Example (Figure 6)**

| | $h=1000$: $\epsilon_\mu$ | $h=1000$: $\epsilon_\sigma$ | $h=2$: $\epsilon_\mu$ | $h=2$: $\epsilon_\sigma$ |
| --- | --- | --- | --- | --- |
| Median | $-0.00272$ | $1.29\times10^{-4}$ | $-2.84\times10^{-4}$ | $2.96\times10^{-4}$ |
| 5th percentile | $-0.0601$ | $-0.0185$ | $-0.0184$ | $-0.0245$ |
| 95th percentile | $0.0482$ | $0.0192$ | $0.0186$ | $0.0178$ |
| Mean | $-0.00415$ | $1.41\times10^{-4}$ | $3.34\times10^{-4}$ | $-9.35\times10^{-4}$ |
| SD | $0.0345$ | $0.0126$ | $0.0119$ | $0.0122$ |
| Median absolute value | $0.0188$ | $0.00722$ | $0.00662$ | $0.00766$ |
| Mean absolute value | $0.0251$ | $0.00919$ | $0.0086$ | $0.00942$ |

**(b) 2D Example (Figure 8)**

| | Position: $\epsilon_\mu$ | Position: $\epsilon_\sigma$ | Velocity: $\epsilon_\mu$ | Velocity: $\epsilon_\sigma$ |
| --- | --- | --- | --- | --- |
| Median | $-1.32\times10^{-4}$ | $2.28\times10^{-4}$ | $-3.37\times10^{-4}$ | $-1.53\times10^{-4}$ |
| 5th percentile | $-0.0337$ | $-0.0253$ | $-0.0234$ | $-0.0148$ |
| 95th percentile | $0.0361$ | $0.0257$ | $0.0258$ | $0.0154$ |
| Mean | $-0.00101$ | $2.95\times10^{-5}$ | $1.51\times10^{-5}$ | $2.95\times10^{-5}$ |
| SD | $0.0236$ | $0.0157$ | $0.0169$ | $0.00922$ |
| Median absolute value | $0.0115$ | $0.00920$ | $0.00908$ | $0.00564$ |
| Mean absolute value | $0.0163$ | $0.0118$ | $0.0121$ | $0.00711$ |


## 5 Encoding

We demonstrate the use of the assumed density filter in determining optimal encoding strategies: selecting the optimal population parameters $\varphi $ (see section 2). To illustrate the use of ADF for the encoding problem, we consider two simple examples. We also use the first example as a test for the filter's robustness. We will study optimal encoding issues in more detail in a subsequent paper.

### 5.1 Optimal Encoding Depends on Prior Variance

Previous work using a finite neuron population and a Fisher information-based criterion (Harper & McAlpine, 2004) has suggested that the optimal distribution of preferred stimuli depends on the prior variance. When it is small relative to the tuning width, optimal encoding is achieved by placing all preferred stimuli at a fixed distance from the prior mean. When the prior variance is large relative to the tuning width, optimal encoding is uniform (see Figure 2 in Harper & McAlpine, 2004). These results are consistent with biological observations reported in Brand, Behrend, Marquardt, McAlpine, & Grothe (2002) concerning the encoding of aural stimuli.

Note that we are optimizing only the two parameters $c, \sigma_{\mathrm{pop}}^2$ rather than each preferred stimulus as in Harper and McAlpine (2004). This mitigates the considerable additional computational cost due to simulating the decoding process rather than computing Fisher information. This direct simulation, though more computationally expensive, offers two advantages over the Fisher information-based method used in Harper and McAlpine (2004) and prevalent in computational neuroscience. First, the simple computation of Fisher information from tuning functions, commonly used in the neuroscience literature, is based on the assumption of a static state, whereas our method can be applied in a fully dynamic context, including the presence of observation-dependent feedback. Second, simulations of the decoding process allow for the minimization of arbitrary criteria, including the direct minimization of posterior variance or mean square error (MSE). As discussed in section 1, Fisher information may be a poor proxy for the MSE for finite decoding times.
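A direct-simulation search over the two population parameters might look as follows. This is a sketch only: `simulate_mse`, which runs one simulated encode/decode trial and returns its squared error, is a hypothetical interface standing in for the decoding simulation described above.

```python
import numpy as np

def optimize_population(simulate_mse, c_grid, sigma2_grid, n_trials=100):
    """Grid search over the population parameters (c, sigma_pop^2),
    estimating the decoding MSE of each setting by averaging simulated
    trials, and returning the best setting found."""
    best_c, best_s2, best_mse = None, None, np.inf
    for c in c_grid:
        for s2 in sigma2_grid:
            # Monte Carlo estimate of the decoding error for this setting
            mse = np.mean([simulate_mse(c, s2) for _ in range(n_trials)])
            if mse < best_mse:
                best_c, best_s2, best_mse = c, s2, mse
    return best_c, best_s2, best_mse
```

Unlike a Fisher-information criterion, any simulated loss can be plugged in here, including posterior variance or MSE at a finite decoding time.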

We also test the robustness of the filter to inaccuracies in model parameters or observation/encoding parameters in this problem. We use inaccurate values for the model parameter $a$ in equation 5.1 and the observation/encoding parameter $h$ in equation 3.13 in the filter. Specifically, we multiply or divide each of the two parameters by a factor of $5/4$ in the filter, while the dynamics and the observing neural population remain unaltered. The results remain qualitatively similar, as seen in Figure 11.

### 5.2 Adaptation of Homogeneous Population to Stimulus Statistics

Benucci, Saleem, and Carandini (2013) measured the tuning of cat primary visual cortex (V1) neurons to oriented gratings after adaptation to either a uniform distribution of orientations or a distribution in which one orientation is more common. The population's preferred stimuli are roughly uniformly distributed, and the population adapts to the prior statistics through changes in amplitude: the adapted population shows a decreased firing rate near the common orientation, so that the mean firing rate is constant across the population.

## 6 Comparison to Previous Work

### 6.1 Neural Decoding

*(Dynamics, Rates, and Population describe the neural code; Exact and Complexity describe the decoding.)*

| Reference | Dynamics | Rates | Population | Exact | Complexity |
|---|---|---|---|---|---|
| Snyder (1972); Segall, Davis, and Kailath (1975) | Diffusion | Any | Finite$^a$ | ✓ | $\infty$ |
| Snyder (1972)$^b$ | Diffusion | Any | Finite | – | $n$ |
| Rhodes and Snyder (1977) | lin. G. diff. | Gaussian | Uniform | ✓ | 1 |
| Komaee (2010) | lin. G. diff. | Any | Uniform | – | 1 |
| Huys et al. (2007) | 1d G. process | Gaussian | Uniform | ✓ | $N_t^3$ $^c$ |
| Eden and Brown (2008) | lin. G. diff. | Any | Finite | – | $n$ |
| Susemihl et al. (2011, 2013) | 1d G. Mat. | Gaussian | Uniform | – | 1 |
| Bobrowski et al. (2009); Twum-Danso and Brockett (2001) | CTMC | Any | Finite | ✓ | $n$ |
| This work | Diffusion | Gaussian | Nonuniform | – | 1$^d$ |


Notes: The complexity column lists the asymptotic computational complexity of each time step in a discrete-time implementation of the filter, as a function of population size $n$ and number of spikes $N_t$. A complexity of $\infty$ denotes an infinite-dimensional filter. Lin. G. diff.: linear gaussian diffusion; 1d G. Mat.: one-dimensional gaussian process with Matérn kernel autocorrelation; CTMC: continuous-time Markov chain.

$^a$The setting of Snyder (1972) and Segall et al. (1975) is a single observation point process, but the result is readily extended to a finite population by summation.

$^b$Snyder (1972) includes both an exact PDE for the posterior statistics and an approximate solution.

$^c$This filter is nonrecursive, and its complexity grows with the history of spikes. The exponent 3 is related to the inversion of an $N_t \times N_t$ matrix, which in principle can be performed with lower complexity.

$^d$The complexity is constant for distributions over preferred stimuli as described in section 3.2.3 (including the uniform coding case). For heterogeneous mixture populations, the complexity is linear in the number of mixture components.

Few previous analytically oriented works studied neural decoding for dynamic stimuli or without the uniform coding assumption. The filtering problem for a dynamic state with a general (nonuniform) finite population is tractable when the state space is finite; in this case, the posterior is finite-dimensional and obeys SDEs derived in Brémaud (1981). These results have been presented in a neuroscience context in Twum-Danso and Brockett (2001) and Bobrowski et al. (2009).

We are aware of a single previous work (Eden & Brown, 2008) that derived a closed-form filter for nonuniform coding of a diffusion process in continuous time. The setting of Eden and Brown (2008) is a finite population with arbitrary tuning functions, and the derivation uses an approximation similar to the one used in our letter. Our work differs from Eden and Brown (2008) most notably by the use of a parameterized "infinite" population. In this sense, the setting of Eden and Brown (2008) is less general, corresponding to the finite-population case described in section 3.1. On the other hand, Eden and Brown (2008) allow a more general form for the tuning functions. Although they also approximate the posterior using a gaussian distribution, their filter is not equivalent to ours. The two filters are compared in detail in appendix D.

### 6.2 Neural Encoding

As the work we report in this letter is concerned with efficient neural decoding as a tool for studying neural encoding in nonuniform populations, we briefly mention some work that studies neural encoding analytically. As we have noted, we will study encoding in more detail in a subsequent paper. Much previous work used Fisher information to study nonuniform coding of static stimuli. As we noted in section 1, Fisher information does not necessarily provide a good estimate of achievable decoding performance, and optimizing it may yield qualitatively misleading results (see Bethge et al., 2002; Pilarski & Pokora, 2015; Yaeli & Meir, 2010). However, we note one such work, Ganguli and Simoncelli (2014), that is particularly relevant in comparison to our work, as it similarly studies large parameterized populations. The neural populations Ganguli and Simoncelli (2014) studied are characterized by two functions of the stimulus: a density function, which describes the local density of neurons, and a "gain" function, which modulates each neuron's gain according to its preferred location. The density function also distorts tuning functions so that tuning width is approximately inversely proportional to neuron density, resulting in approximately uniform coding when the gain function is constant; a nonconstant gain function, however, produces a violation of uniform coding. This population is optimized for Fisher information and several related measures. For each of these measures, the optimal population density is shown to be monotonic in the prior: more neurons should be assigned to more probable states. This is in contrast to the results we present in section 5 (e.g., see Figure 10, left), where the optimal population distribution is shifted relative to the prior distribution when the prior is narrow.
This discrepancy may be attributed to the limitations of Fisher information in predicting actual decoding error or to the coupling of neuron density and tuning width used in Ganguli and Simoncelli (2014) to facilitate the derivation of closed-form solutions.

Some previous work attempted direct minimization of decoding MSE rather than Fisher information. Yaeli and Meir (2010) derived an explicit expression for the MSE in the static case with uniform coding and used it to characterize optimal tuning function width and its relation to coding time. More recently, Wang, Stocker, and Lee (2016) studied $L_p$-based loss measures and presented exact results for optimal tuning functions in the case of univariate static signals, for single neurons and homogeneous populations. Susemihl et al. (2011, 2013) suggested a mean-field approximation to allow efficient evaluation of the MSE in a dynamic setting.

## Notes

^{1}

Although we describe this work using neuroscience terminology, the motivation and formulation in Rhodes and Snyder (1977) is not related to neuroscience.

^{2}

^{3}

In the sense that any two solutions $X_t^1, X_t^2$ defined over $[0,T]$ with $X_0^1 = X_0^2$ satisfy $P[X_t^1 = X_t^2\ \forall t \in [0,T]] = 1$. See, e.g., Øksendal (2003, theorem 5.2.1) for sufficient conditions.

^{4}

If equation 3.1 includes an additional feedback (control) term, the $N^i$ are not DSPPs. Our results apply with little modification to this case, as described in appendix A.

^{5}

See Segall and Kailath (1975, theorem 2).

^{7}

This holds only approximately, when the tuning width is not too large relative to the size of the interval. For wider tuning functions, the behavior becomes similar to the single sensor case.

^{8}

A more detailed discussion of this definition and of innovation processes is available in Segall and Kailath (1975).

^{9}

That is, it is strictly a function of the history $\mathcal{F}_t$.

## Appendix A: Derivation of Filtering Equations

### A.1 Setting and Notation

We denote by $p_t^N(\cdot)$ the posterior density—that is, the density of $X_t$ given the observation history $\mathcal{N}_t$—and by $E_t^N[\cdot]$ the posterior expectation—that is, the expectation conditioned on $\mathcal{N}_t$.

### A.2 The Innovation Measure

We derive the filtering equations in terms of the innovation measure of the marked point process, which is a random measure closely related to the notion of the innovation process associated with unmarked point processes or diffusion processes. The role of the innovation measure in filtering marked point processes is analogous to the role of the innovation process in Kalman filtering.

^{8}Consider an unmarked point process $N_t$, with $\mathcal{N}_t$ denoting its history up to time $t$. Given some history $\mathcal{F}_t$ containing $\mathcal{N}_t$ (e.g., $\mathcal{F}_t$ might include the observation process $N$ and its driving state process $X$), the process $\lambda_t^F$ is called the intensity of $N$ relative to $F$ when $\lambda_t$ is $\mathcal{F}_t$-measurable^{9} and $M_t \triangleq N_t - \int_0^t \lambda_s^F\,ds$ is a martingale relative to $\mathcal{F}_t$; the process $M_t$ is then called the *innovation process*. We may write $dN_t = \lambda_t^F\,dt + dM_t$, decomposing each observation increment into its predictable part and an unpredictable (martingale) part.

### A.3 Exact Filtering Equations

^{10}In this case, the posterior density $p_t^N$ obeys the equation

### A.4 ADF Approximation for Gaussian Tuning

We now proceed to apply the gaussian ADF approximation $p_t^N(x) \approx \mathcal{N}(x; \mu_t, \Sigma_t)$ to equations A.12 to A.15 in the case of gaussian neurons (see equation 3.3), deriving approximate filtering equations written in terms of the population density $f(dy)$. From here on, we use $\mu_t$, $\Sigma_t$, and $p_t^N$ to refer to the ADF approximation rather than to the exact values.

We use the following algebraic results. The first is a slightly generalized form of a well-known result about the sum of quadratic forms, which is useful for multiplying gaussians with possibly degenerate precision matrices:

**Claim 1.** Let $x, a, b \in \mathbb{R}^n$ and let $A, B \in \mathbb{R}^{n \times n}$ be symmetric matrices such that $A+B$ is nonsingular. Then

The proof is accomplished by straightforward expansion of each side. $\square$
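For reference, the standard form of this sum-of-quadratics identity (up to the letter's exact sign and notation conventions) reads:

$$
(x-a)^\top A\,(x-a) + (x-b)^\top B\,(x-b)
= (x-c)^\top (A+B)\,(x-c) + (a-b)^\top A\,(A+B)^{-1} B\,(a-b),
\qquad c = (A+B)^{-1}(A a + B b).
$$

Note that only $A+B$ is inverted, so the identity remains usable when $A$ or $B$ individually is singular—precisely the degenerate-precision case mentioned above.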

Next, we note two matrix inversion lemmas, the first of which is known as the Woodbury identity. These are useful for transferring variance matrices between state and perceptual coordinates in our model. Derivations may be found in Henderson and Searle (1981).

**Claim 2.** For $U, V^\top \in \mathbb{R}^{m \times n}$ and nonsingular $A \in \mathbb{R}^{m \times m}$, $C \in \mathbb{R}^{n \times n}$, the following equalities hold,
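In their standard form (presumably the two equalities intended here; see Henderson & Searle, 1981), the Woodbury identity and the related push-through identity read:

$$
(A + U C V)^{-1} = A^{-1} - A^{-1} U\,(C^{-1} + V A^{-1} U)^{-1} V A^{-1},
$$

$$
(A + U C V)^{-1} U C = A^{-1} U\,(C^{-1} + V A^{-1} U)^{-1}.
$$

The first swaps an $m \times m$ inversion for an $n \times n$ one; the second moves a factor across the inverse, which is what allows variance matrices to be transferred between state and perceptual coordinates.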

### A.5 Approximation of Continuous Terms for Specific Population Distributions

#### A.5.1 Gaussian Population

#### A.5.2 Uniform Population on an Interval

## Appendix B: Implementation Details

### B.1 State Dynamics

### B.2 Continuous Neural Population

The simulation of marked point processes, used to model continuous neural populations (see section 3.2.2), involves the generation of random times and corresponding random marks. In the case of a finite population, there is a finite number of marks, and the point process corresponding to each mark may be simulated separately. For an infinite population, a different approach is required.

Accordingly, the simulation of the marked process $N$ proceeds as follows:

Using the generated trajectory $x_k \approx X_{k\Delta t}$, simulate a Poisson process with rate $r(X_t)$, yielding the random times $(t_1, \ldots, t_{N_T})$. This may be accomplished either via direct generation of a point for each time step with probability $r(x_k)\Delta t$ or more efficiently via time rescaling (see, e.g., Brown, Barbieri, Ventura, Kass, & Frank, 2002).

Generate random marks $(y_1, \ldots, y_{N_T})$ by sampling independently from the distribution $\kappa(X_{t_i}, dy)$.
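These two steps can be sketched in code, assuming for concreteness a one-dimensional uniform population with gaussian tuning, for which the total rate $r(x) = h\sqrt{2\pi/R}$ is constant and the mark distribution is $\mathcal{N}(\theta; Hx, R^{-1})$ with $H=1$ (the function names and interfaces are illustrative, not from the letter's code):

```python
import numpy as np

def simulate_marked_pp(x, dt, rate_fn, mark_fn, rng=None):
    """Direct per-time-step simulation of a marked point process driven by
    a discretized state trajectory x[k] ~ X_{k*dt}. `rate_fn(x)` gives the
    total rate r(x); `mark_fn(x, rng)` samples a mark from kappa(x, dy)."""
    rng = rng if rng is not None else np.random.default_rng()
    times, marks = [], []
    for k, xk in enumerate(x):
        # A point occurs in [k*dt, (k+1)*dt) with probability r(x_k)*dt
        if rng.random() < rate_fn(xk) * dt:
            times.append(k * dt)
            marks.append(mark_fn(xk, rng))
    return np.array(times), np.array(marks)

# Example: 1d uniform population of gaussian neurons with H = 1.
h, R = 10.0, 4.0
rate = lambda x: h * np.sqrt(2 * np.pi / R)          # constant total rate
mark = lambda x, rng: rng.normal(x, 1 / np.sqrt(R))  # theta ~ N(x, 1/R)
```

Time rescaling would replace the per-step Bernoulli draw with exponential inter-arrival times, which is more efficient when $r(x)\Delta t$ is small.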

As in section 3.2.3, when $h, H, R$ are fixed across the population, we abuse notation and write $f(dh', d\theta, dH', dR') = \delta_h(dh')\,\delta_H(dH')\,\delta_R(dR')\,f(d\theta)$, and similarly for $\kappa(x; d\theta)$. The function $r(x)$ and the distribution $\kappa(x, d\theta)$ for each of the distributions of preferred stimuli in section 3.2.3 are given in closed form in Table 3. For a finite heterogeneous mixture population, $r$ and $\kappa$ may be obtained through the appropriate weighted summation; however, it is easier to simulate each component separately.

| $f(d\theta)$ | $r(x)$ | $\kappa(x; d\theta)$ |
|---|---|---|
| $\delta_{\theta_0}(d\theta)$ | $\lambda(x; \theta_0)$ | $\delta_{\theta_0}(d\theta)$ |
| $d\theta$ | $h\sqrt{(2\pi)^m/\det R}$ | $\mathcal{N}(\theta; Hx, R^{-1})\,d\theta$ |
| $\mathcal{N}(\theta; c, G)\,d\theta$ | $h\sqrt{(2\pi)^m/\det R}\;\mathcal{N}(c; Hx, R^{-1}+G)$ | $\mathcal{N}(\theta;\ G R_G H x + R^{-1} R_G c,\ (R+G^{-1})^{-1})\,d\theta$, where $R_G = (R^{-1}+G)^{-1}$ |
| $1_{a \le \theta \le b}\,d\theta$ | $h\sqrt{2\pi/R}\,[\Phi(z_b(x)) - \Phi(z_a(x))]$, where $z_s(x) = \sqrt{R}\,(s - Hx)$ | $\mathcal{N}_{[a,b]}(\theta; Hx, R^{-1})\,d\theta$ (truncated normal distribution) |


Note: The derivations of these closed forms are straightforward for the Dirac and uniform distributions; the derivation for the gaussian distribution proceeds by multiplication of gaussians, completely paralleling the computations in section A.4.

### B.3 Filter

For the linear dynamics (see equation B.1) used in simulations throughout this letter, the prior terms $d\mu_t^\pi, d\Sigma_t^\pi$ are given by equation 3.8. The continuous update terms $d\mu_t^c, d\Sigma_t^c$ depend on the population as described in section 3.2.3. The discontinuous updates $d\mu_t^N, d\Sigma_t^N$ are given by equations 3.16c and 3.16d and are nonzero only at time steps containing a spike.
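A single discrete time step therefore combines the three updates. The sketch below uses hypothetical callables standing in for the prior terms (equation 3.8), the population-dependent continuous terms (section 3.2.3), and the per-spike discontinuous update (equations 3.16c and 3.16d); it is an illustration of the step structure, not the letter's implementation:

```python
def filter_step(mu, Sigma, dt, prior_terms, cont_terms, spikes, spike_update):
    """One discrete-time ADF step: apply the prior and continuous increments
    over dt, then the discontinuous update for each spike mark in this step
    (the `spikes` list is empty at steps with no spike)."""
    d_mu_p, d_Sig_p = prior_terms(mu, Sigma, dt)
    d_mu_c, d_Sig_c = cont_terms(mu, Sigma, dt)
    mu, Sigma = mu + d_mu_p + d_mu_c, Sigma + d_Sig_p + d_Sig_c
    for y in spikes:
        mu, Sigma = spike_update(mu, Sigma, y)
    return mu, Sigma
```

For scalar linear dynamics, `prior_terms` would implement an Ornstein-Uhlenbeck-style drift and noise-injection increment, and `spike_update` a gaussian product update.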

## Appendix C: Variance as Proxy for MSE

Figure 13 shows the variance and MSE in estimating the same process as in Figure 10 after averaging across 10,000 trials. The results indicate that the filter has good accuracy with these parameters, so that the variance is a reasonable approximation for the MSE.

## Appendix D: Comparison to Previous Work: Additional Details

As noted in section 6.1, we are aware of a single previous work (Eden & Brown, 2008) deriving a closed-form filter for nonuniform coding of a diffusion process in continuous time. We now present a detailed comparison of our work to Eden and Brown (2008).

In contrast, the starting point in our derivation is the exact update equations for the first two moments expressed in terms of posterior expectations. A gaussian posterior is substituted into these expectations, resulting in tractable integrals in the case of gaussian tuning functions. The tractability of these integrals depends on the gaussian form of the tuning functions.

We compared the performance of our filter and equations D.1 in simulations of a simple one-dimensional setup. Figure 15 shows an example of filtering a static one-dimensional state observed through two gaussian neurons (see equation 3.3) using both the ADF approximation, equations 3.9a and 3.9b, and the EB filter, equation D.1, for comparison. Since the mean square error is highly sensitive to outliers where the approximation fails and the filtering error becomes large, we compare the filters by observing the distribution of absolute errors (AE) in estimating the state. Figure 16 compares the AE in estimating the state for the two filters. As expected from our analysis, the ADF filter has an advantage, particularly in earlier times when the posterior variance is large. This is seen most clearly in the 95th percentile of AEs in Figure 16a and in the tail histograms in Figure 16c, but a small advantage may also be observed for the median in Figure 16d. However, in some trials, the error in the EB filter remains large throughout.

### D.1 Derivation of Equations D.1

Note that this is a corrected version of the definition used in Eden and Brown (2008), which is unusable when the Hessian is singular (this occurs in our model whenever $m<n$). The definition of $StEB$ given here extends it to the singular case.

## Acknowledgments

The work of Y.H. and R.M. is partially supported by grant 451/17 from the Israel Science Foundation and by the Ollendorff Center of the Viterbi Faculty of Electrical Engineering at the Technion.

## References

Dayan, P., & Abbott, L. F. (2005). *Theoretical neuroscience: Computational and mathematical modeling of neural systems*. Cambridge, MA: MIT Press.