I'm a staff research scientist at Google DeepMind. I'm also a visiting professor at Imperial College London in the Department of Physics, where I supervise work on applications of deep learning to computational quantum mechanics. My own research interests span artificial intelligence, machine learning and scientific computing.
Prior to joining DeepMind, I was a PhD student at the Center for Theoretical Neuroscience at Columbia, where I worked on algorithms for analyzing and understanding high-dimensional data from neural recordings with Liam Paninski and nonparametric Bayesian methods for predicting time series data with Frank Wood. I also had a stint as a research assistant in the group of Mike DeWeese at UC Berkeley, jointly between Physics and the Redwood Center for Theoretical Neuroscience.
Current research interests include applications of machine learning to computational physics and connections between differential geometry and unsupervised learning.
Gamma rays produced by positron annihilation are used as a sensitive probe of matter at atomic length scales. Technologies for manipulating positrons and studying their interactions with ordinary matter are rapidly progressing. This motivates the development of accurate ab initio methods for modelling positronic interactions with molecular matter. Here, we apply the recently developed Fermionic neural network (FermiNet) wavefunction ansatz to the problem of finding the ground-state properties of mixed positron-electron systems. We find that FermiNet produces highly accurate, in some cases state-of-the-art, ground-state energies across a range of atoms and small molecules with a wide range of qualitatively distinct positron binding characteristics. We highlight the capabilities of our method by calculating the positron binding energy of the challenging non-polar benzene molecule.
Neural networks like the FermiNet are great for computing the quantum behavior of electrons, from which most of chemistry can be derived, but there are many other kinds of particles out there. Positrons are the antiparticle of electrons - they behave like electrons with a positive charge. Most computational quantum chemistry approaches struggle with positrons, but we show that it only takes a small, simple change to the FermiNet to enable accurate positronic calculations. It is particularly challenging to get QMC methods to work on non-polar molecules, but we show that the method works well on benzene.
@article{cassella2023neural,
title={Neural Network Variational Monte Carlo for Positronic Chemistry},
author={Cassella, Gino and Foulkes, W. M. C. and Pfau, David and Spencer, James S},
journal={arXiv preprint arXiv:2310.05607},
year={2023}
}
We present a variational Monte Carlo algorithm for estimating the lowest excited states of a quantum system which is a natural generalization of the estimation of ground states. The method has no free parameters and requires no explicit orthogonalization of the different states, instead transforming the problem of finding excited states of a given system into that of finding the ground state of an expanded system. Expected values of arbitrary observables can be calculated, including off-diagonal expectations between different states such as the transition dipole moment. Although the method is entirely general, it works particularly well in conjunction with recent work on using neural networks as variational Ansatze for many-electron systems, and we show that by combining this method with the FermiNet and Psiformer Ansatze we can accurately recover vertical excitation energies and oscillator strengths on molecules as large as benzene. Beyond the examples on molecules presented here, we expect this technique will be of great interest for applications of variational quantum Monte Carlo to atomic, nuclear and condensed matter physics.
We have a track record of papers showing how deep learning can be used in conjunction with a technique called "variational Monte Carlo" (VMC) to accurately compute the ground state of quantum systems. However, many of the most interesting things in quantum physics require going beyond the ground state, to so-called "excited states". It is well-established how to use VMC to compute the ground state - just minimize the energy. But computing excited states by VMC is more of an open topic - there are many competing approaches, but none of them feel "natural" - they often make special assumptions, or require adding extra penalty terms to the loss, or have lots of free variables. We've developed a new method for computing excited states by VMC which has none of these drawbacks, and works very well in conjunction with deep learning methods. We think this is the natural way to compute excited states by VMC.
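To make "just minimize the energy" concrete, here is a minimal sketch of ground-state VMC on a toy problem. It is an illustration only, not the method from the paper. It assumes a 1D harmonic oscillator (units where hbar = m = omega = 1) and a one-parameter Gaussian trial wavefunction psi_alpha(x) = exp(-alpha x^2). All function names are hypothetical, and samples are drawn directly from |psi_alpha|^2 instead of by MCMC, which is only possible here because that distribution happens to be Gaussian.

```python
import numpy as np


def local_energy(x, alpha):
    """Local energy (H psi)(x) / psi(x) for psi(x) = exp(-alpha * x**2).

    With H = -1/2 d^2/dx^2 + 1/2 x^2 this works out analytically to
    alpha + x^2 * (1/2 - 2 alpha^2).
    """
    return alpha + x**2 * (0.5 - 2.0 * alpha**2)


def vmc_energy(alpha, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the variational energy at a given alpha."""
    rng = np.random.default_rng(seed)
    # |psi|^2 is proportional to exp(-2 alpha x^2): a Gaussian with
    # variance 1 / (4 alpha), so we can sample it directly (toy shortcut).
    x = rng.normal(0.0, np.sqrt(1.0 / (4.0 * alpha)), size=n_samples)
    return float(np.mean(local_energy(x, alpha)))


def best_alpha(alphas):
    """Crude 'optimizer': pick the alpha with the lowest estimated energy."""
    energies = [vmc_energy(a) for a in alphas]
    i = int(np.argmin(energies))
    return alphas[i], energies[i]


if __name__ == "__main__":
    alpha, energy = best_alpha([0.2, 0.35, 0.5, 0.75, 1.0])
    print(alpha, energy)
```

In this toy problem the energy estimate is lowest at alpha = 0.5, where the trial wavefunction is the exact ground state and the local energy is the same at every sample. Real calculations replace the grid scan with gradient-based optimization of a much richer Ansatz.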
@article{pfau2023natural,
title={Natural Quantum Monte Carlo Computation of Excited States},
author={Pfau, David and Axelrod, Simon and Sutterud, Halvard and von Glehn, Ingrid and Spencer, James S},
journal={arXiv preprint arXiv:2308.16848},
year={2023}
}
Deep learning methods outperform human capabilities in pattern recognition and data processing problems and now have an increasingly important role in scientific discovery. A key application of machine learning in molecular science is to learn potential energy surfaces or force fields from ab initio solutions of the electronic Schrödinger equation using data sets obtained with density functional theory, coupled cluster or other quantum chemistry (QC) methods. In this Review, we discuss a complementary approach using machine learning to aid the direct solution of QC problems from first principles. Specifically, we focus on quantum Monte Carlo methods that use neural-network ansatzes to solve the electronic Schrödinger equation, in first and second quantization, computing ground and excited states and generalizing over multiple nuclear configurations. Although still in their infancy, these methods can already generate virtually exact solutions of the electronic Schrödinger equation for small systems and rival advanced conventional QC methods for systems with up to a few dozen electrons.
In the last 3 or so years, several methods have appeared that use neural networks to approximate wavefunctions for quantum chemical calculations, as well as other problems in electronic structure theory. This review article presents an accessible introduction to the field, aimed at both a chemistry and machine learning audience, and summarizes much of the work in the last few years, both the foundational papers and recent extensions building on this work.
@article{hermann2023ab,
title={Ab-initio Quantum Chemistry with Neural-Network Wavefunctions},
author={Hermann, Jan and Spencer, James S. and Choo, Kenny and Mezzacapo, Antonio and Foulkes, W. M. C. and Pfau, David and Carleo, Giuseppe and Noé, Frank},
journal={Nature Reviews Chemistry},
volume={7},
number={8},
year={2023}
}
Understanding superfluidity remains a major goal of condensed matter physics. Here we tackle this challenge utilizing the recently developed Fermionic neural network (FermiNet) wave function Ansatz for variational Monte Carlo calculations. We study the unitary Fermi gas, a system with strong, short-range, two-body interactions known to possess a superfluid ground state but difficult to describe quantitatively. We demonstrate key limitations of the FermiNet Ansatz in studying the unitary Fermi gas and propose a simple modification that outperforms the original FermiNet significantly, giving highly accurate results. We prove mathematically that the new Ansatz, which only differs from the original Ansatz by the method of antisymmetrization, is a strict generalization of the original FermiNet architecture, despite the use of fewer parameters. Our approach shares several advantages with the FermiNet: the use of a neural network removes the need for an underlying basis set; and the flexibility of the network yields extremely accurate results within a variational quantum Monte Carlo framework that provides access to unbiased estimates of arbitrary ground-state expectation values. We discuss how the method can be extended to study other superfluids.
In previous work, we had shown that neural networks can be used as powerful and accurate approximations (or "Ansatzes") to solutions of the Schrödinger equation, which is to quantum physics what Newton's laws are to classical physics. We had previously focused on predicting the energies of molecules or model systems like the electron gas, for which these models worked well. Here we look at a model system used to study superconductors and superfluids, where electrons will pair up in surprising ways. We find that the neural networks that worked so well for molecules fail to capture this pairing behavior, but with only a small change we can capture this pairing behavior without sacrificing accuracy on other systems. This opens the door to using these neural network Ansatzes for studying the exotic behavior of materials at low temperatures.
@article{lou2023neural,
title={Neural Wave Functions for Superfluids},
author={Lou, Wan Tong and Sutterud, Halvard and Cassella, Gino and Foulkes, WMC and Knolle, Johannes and Pfau, David and Spencer, James S},
journal={arXiv preprint arXiv:2305.06989},
year={2023}
}
A fast and accurate turbulence transport model based on quasilinear gyrokinetics is developed. The model consists of a set of neural networks trained on a bespoke quasilinear GENE dataset, with a saturation rule calibrated to dedicated nonlinear simulations. The resultant neural network is approximately eight orders of magnitude faster than the original GENE quasilinear calculations. ITER predictions with the new model project a fusion gain in line with ITER targets. While the dataset is currently limited to the ITER baseline regime, this approach illustrates a pathway to develop reduced-order turbulence models both faster and more accurate than the current state-of-the-art.
If you want to predict how well nuclear fusion will happen inside a plasma, you need to understand how fast heat leaks out. To understand how fast heat leaks out, you need to be able to simulate turbulence. Unfortunately, this is one of the hardest things to simulate about a plasma. Previously, some of the authors had built a fast approximate neural network model of turbulent heat transport trained on data from "quasilinear" calculations. Unfortunately these quasilinear methods themselves were not that accurate. We generated a large dataset of linear gyrokinetic calculations, which are more accurate than quasilinear calculations, and used that to train a fast neural network model for ITER that we hope is even more accurate.
@article{citrin2023fast,
title={Fast transport simulations with higher-fidelity surrogate models for ITER},
author={Citrin, J and Trochim, P and Goerler, T and Pfau, D and van de Plassche, KL and Jenko, F},
journal={Physics of Plasmas},
volume={30},
number={6},
year={2023},
publisher={AIP Publishing}
}
We present a novel neural network architecture using self-attention, the Wavefunction Transformer (PsiFormer), which can be used as an approximation (or "Ansatz") for solving the many-electron Schrödinger equation, the fundamental equation for quantum chemistry and materials science. This equation can be solved from first principles, requiring no external training data. In recent years, deep neural networks like the FermiNet and PauliNet have been used to significantly improve the accuracy of these first-principles calculations, but they lack an attention-like mechanism for gating interactions between electrons. Here we show that the PsiFormer can be used as a drop-in replacement for these other neural networks, often dramatically improving the accuracy of the calculations. On larger molecules especially, the ground state energy can be improved by dozens of kcal/mol, a qualitative leap over previous methods. This demonstrates that self-attention networks can learn complex quantum mechanical correlations between electrons, and are a promising route to reaching unprecedented accuracy in chemical calculations on larger systems.
In the last 3 or so years, several methods have appeared that use neural networks to approximate wavefunctions for quantum chemical calculations, as well as other problems in electronic structure theory. We show that by dropping in a standard self-attention layer as in a Transformer encoder, we can greatly increase the accuracy of these neural network ansatzes.
@article{von2022self,
title={{A Self-Attention Ansatz for Ab-initio Quantum Chemistry}},
author={von Glehn, Ingrid and Spencer, James S and Pfau, David},
journal={11th International Conference on Learning Representations (ICLR)},
year={2023}
}
Deep neural networks have been extremely successful as highly accurate wave function ansätze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state. The network is given no a priori knowledge that a phase transition exists, but converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density.
In a previous paper, we had shown that a type of neural network called the FermiNet can be used as an extremely accurate method for solving quantum mechanical calculations from first principles. But because of the way we did things, we were restricted to looking at molecules. For most of the really exotic things in condensed matter physics, you need periodic boundary conditions, because materials repeat the same microscopic structure over and over, so we took the FermiNet and extended it to periodic boundary conditions. We then trained this network to solve the “homogeneous electron gas” and found a clear signature of the phase transition from a Fermi liquid to a Wigner crystal. This is really exciting, because almost all other ways of calculating this phase transition have to be told what phase to look for. If you expect a crystal phase, you have to choose a method that only works for a crystal phase. The FermiNet has no idea what phase to expect! Now that we’ve proven the FermiNet can work with periodic boundary conditions, we can apply it to all sorts of other condensed matter systems, including those where we don’t even know what phase transitions might occur. We think this demonstrates a path forward for neural networks being used not just as a tool for scientific calculation, but for scientific discovery.
@article{cassella2023discovering,
title={{Discovering Quantum Phase Transitions with Fermionic Neural Networks}},
author={Cassella, Gino and Sutterud, Halvard and Azadi, Sam and Drummond, N D and Pfau, David and Spencer, James S and Foulkes, W Matthew C},
journal={Physical Review Letters},
volume={130},
number={3},
pages={036401},
year={2023},
publisher={APS}
}
Nuclear fusion using magnetic confinement, in particular in the tokamak configuration, is a promising path towards sustainable energy. A core challenge is to shape and maintain a high-temperature plasma within the tokamak vessel. This requires high-dimensional, high-frequency, closed-loop control using magnetic actuator coils, further complicated by the diverse requirements across a wide range of plasma configurations. In this work, we introduce a previously undescribed architecture for tokamak magnetic controller design that autonomously learns to command the full set of control coils. This architecture meets control objectives specified at a high level, at the same time satisfying physical and operational constraints. This approach has unprecedented flexibility and generality in problem specification and yields a notable reduction in design effort to produce new plasma configurations. We successfully produce and control a diverse set of plasma configurations on the Tokamak à Configuration Variable, including elongated, conventional shapes, as well as advanced configurations, such as negative triangularity and ‘snowflake’ configurations. Our approach achieves accurate tracking of the location, current and shape for these configurations. We also demonstrate sustained ‘droplets’ on TCV, in which two separate plasmas are maintained simultaneously within the vessel. This represents a notable advance for tokamak feedback control, showing the potential of reinforcement learning to accelerate research in the fusion domain, and is one of the most challenging real-world systems to which reinforcement learning has been applied.
Nuclear fusion is the holy grail of clean energy - abundant fuel, small footprint, runs 24/7, zero meltdown risk or long-lasting waste. But despite 70 years of work, it has yet to become a reality. The most mature approach, magnetic confinement fusion, works by compressing a plasma in a donut-shaped magnetic bottle called a tokamak. Keeping this plasma stable is incredibly complex - plasma is a 3D self-organizing fluid that must be contained at incredibly high temperatures. If you don’t have active control on these machines, the plasma can suffer disruption - in the worst case, causing serious damage to the machine. Historically, these control algorithms were based on classical control and had to be carefully designed by experts.
We set out to replace these classical control algorithms with deep neural networks trained by reinforcement learning. Because these machines can only be run a few dozen times a day, we trained these policies in simulation and transferred them to a real machine. With our fantastic partners at EPFL, we ran these deep RL policies on the Tokamak à Configuration Variable (TCV). This is an excellent machine for testing out exotic control, because it is capable of forming exotic plasma shapes that no other machine can run. While we don’t set any records for fusion power (TCV is a relatively small machine) we might set a record for the weirdest plasma stabilized - a configuration called the “droplet” which is really two plasmas on top of each other, which we keep stable essentially indefinitely. With our deep RL approach validated on a real machine, we are hopeful that these methods can be used to push the envelope of what magnetic confinement fusion is capable of, taming plasma configurations that had previously been considered too challenging to pursue.
@article{degrave2022magnetic,
title={Magnetic Control of Tokamak Plasmas Through Deep Reinforcement Learning},
author={Degrave, Jonas and Felici, Federico and Buchli, Jonas and Neunert, Michael and Tracey, Brendan and Carpanese, Francesco and Ewalds, Timo and Hafner, Roland and Abdolmaleki, Abbas and de las Casas, Diego and Donner, Craig and Fritz, Leslie and Galperti, Cristian and Huber, Andrea and Keeling, James and Tsimpoukelli, Maria and Kay, Jackie and Merle, Antoine and Moret, Jean-Marc and Noury, Seb and Pesamosca, Federico and Pfau, David and Sauter, Olivier and Sommariva, Cristian and Coda, Stefano and Duval, Basil and Fasoli, Ambrogio and Kohli, Pushmeet and Kavukcuoglu, Koray and Hassabis, Demis and Riedmiller, Martin},
journal={Nature},
volume={602},
pages={414--419},
year={2022}
}
Density functional theory describes matter at the quantum level, but all popular approximations suffer from systematic errors that arise from the violation of mathematical properties of the exact functional. We overcame this fundamental limitation by training a neural network on molecular data and on fictitious systems with fractional charge and spin. The resulting functional, DM21 (DeepMind 21), correctly describes typical examples of artificial charge delocalization and strong correlation and performs better than traditional functionals on thorough benchmarks for main-group atoms and molecules. DM21 accurately models complex systems such as hydrogen chains, charged DNA base pairs, and diradical transition states. More crucially for the field, because our methodology relies on data and constraints, which are continually improving, it represents a viable pathway toward the exact universal functional.
This is a paper on making quantum chemistry far more accurate and scalable by combining machine learning and density functional theory (DFT). Density functional theory is the workhorse for computational chemistry. It treats electrons quantum mechanically, but it is very, very scalable. That's because it only deals with the average density of electrons, instead of the exact configuration of all electrons. That means no matter how big your system is, all you have to compute is a 3-dimensional electron density function, and then plug it into an "exchange-correlation functional" to figure out the energy of that system. Because it's so scalable, it's also very popular. The B3LYP functional has been cited by nearly 100,000 chemistry papers.
You would imagine that because it only deals with electron densities, it's doomed to be inaccurate. But you can prove that, in theory, it can be made arbitrarily accurate. However, actually achieving this accuracy requires knowledge of the true "exchange-correlation functional". And there is no formula that allows you to get the exact form of this magic functional. What we did was to generate a large dataset of very accurate all-electron calculations, including some with so-called "fractional charge", and use this to train a new neural-network functional we call DM21. The DM21 functional avoids many of the pitfalls of common functionals, and reaches far higher accuracy than any other DFT functional on challenging examples in organic chemistry.
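To illustrate what "plugging a density into a functional" means, here is a small sketch using the textbook local density approximation (LDA) for the exchange energy, E_x[rho] = -(3/4) (3/pi)^(1/3) * integral of rho^(4/3). This is a much older and simpler approximation than DM21 and is used here only as an example. The density is the exact hydrogen-atom ground-state density in atomic units, and the function names and radial grid settings are assumptions chosen for illustration.

```python
import numpy as np


def hydrogen_density(r):
    """Ground-state electron density of the hydrogen atom (atomic units)."""
    return np.exp(-2.0 * r) / np.pi


def radial_integral(f_values, r):
    """Integrate a spherically symmetric function over 3D space.

    Uses the volume element 4 pi r^2 dr and a hand-written trapezoid rule.
    """
    integrand = 4.0 * np.pi * r**2 * f_values
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r)))


def lda_exchange_energy(rho, r):
    """Spin-unpolarized LDA exchange energy of a radial density, in Hartree."""
    c_x = 0.75 * (3.0 / np.pi) ** (1.0 / 3.0)
    return -c_x * radial_integral(rho ** (4.0 / 3.0), r)


if __name__ == "__main__":
    r = np.linspace(0.0, 30.0, 30001)  # hypothetical radial grid
    rho = hydrogen_density(r)
    print("electrons:", radial_integral(rho, r))
    print("LDA exchange energy (Ha):", lda_exchange_energy(rho, r))
```

The functional sees only the density, never the wavefunction, which is where the scalability comes from. The price is that a simple formula like this one is only approximate, and that gap is what learned functionals aim to close.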
@article{kirkpatrick2021pushing,
title={Pushing the frontiers of density functionals by solving the fractional electron problem},
author={Kirkpatrick, James and McMorrow, Brendan and Turban, David HP and Gaunt, Alexander L and Spencer, James S and Matthews, Alexander GDG and Obika, Annette and Thiry, Louis and Fortunato, Meire and Pfau, David and Castellanos, Lara Roman and Petersen, Stig and Nelson, Alexander WR and Kohli, Pushmeet and Mori-Sánchez, Paula and Hassabis, Demis and Cohen, Aron J},
journal={Science},
volume={374},
number={6573},
pages={1385--1389},
year={2021},
publisher={American Association for the Advancement of Science}
}
How should a machine intelligence perform unsupervised structure discovery over streams of sensory input? One approach to this problem is to cast it as an apperception task. Here, the task is to construct an explicit interpretable theory that both explains the sensory sequence and also satisfies a set of unity conditions, designed to ensure that the constituents of the theory are connected in a relational structure.
However, the original formulation of the apperception task had one fundamental limitation: it assumed the raw sensory input had already been parsed using a set of discrete categories, so that all the system had to do was receive this already-digested symbolic input, and make sense of it. But what if we don't have access to pre-parsed input? What if our sensory sequence is raw unprocessed information?
The central contribution of this paper is a neuro-symbolic framework for distilling interpretable theories out of streams of raw, unprocessed sensory experience. First, we extend the definition of the apperception task to include ambiguous (but still symbolic) input: sequences of sets of disjunctions. Next, we use a neural network to map raw sensory input to disjunctive input. Our binary neural network is encoded as a logic program, so the weights of the network and the rules of the theory can be solved jointly as a single SAT problem. This way, we are able to jointly learn how to perceive (mapping raw sensory information to concepts) and apperceive (combining concepts into declarative rules).
The world around us consists of both objects and rules governing how those objects behave. For many simple model worlds, like games, the rules describing how objects interact are quite simple and can be easily written down in formal logic. But in many settings in artificial intelligence, we do not have access to either the rules or knowledge of the individual objects - all we have is an undifferentiated stream of sensory input. This work combines deep learning methods popular for perception with tools from logic programming to create a system, called the Apperception Engine, which can simultaneously learn to recognize simple objects and learn the rules that govern interactions between those objects, for simple domains like the game Sokoban.
@article{evans2021making,
title={Making Sense of Raw Input},
author={Evans, Richard and Bošnjak, Matko and Buesing, Lars and Ellis, Kevin and Pfau, David and Kohli, Pushmeet and Sergot, Marek},
journal={Artificial Intelligence},
year={2021},
volume={299},
pages={103521},
doi={10.1016/j.artint.2021.103521},
url={https://www.sciencedirect.com/science/article/pii/S0004370221000722}
}
We introduce a method for reconstructing an infinitesimal normalizing flow given only an infinitesimal change to a (possibly unnormalized) probability distribution. This reverses the conventional task of normalizing flows -- rather than being given samples from an unknown target distribution and learning a flow that approximates the distribution, we are given a perturbation to an initial distribution and aim to reconstruct a flow that would generate samples from the known perturbed distribution. While this is an underdetermined problem, we find that choosing the flow to be an integrable vector field yields a solution closely related to electrostatics, and a solution can be computed by the method of Green's functions. Unlike conventional normalizing flows, this flow can be represented in an entirely nonparametric manner. We validate this derivation on low-dimensional problems, and discuss potential applications to problems in quantum Monte Carlo and machine learning.
For many problems in machine learning and computational physics, an optimization problem and sampling problem are coupled together. The optimization depends on the sampler reaching equilibrium, and the sampler has to re-run every iteration as the optimization changes the target equilibrium. It would be very convenient if it were possible to update the sampler based on previous iterations, instead of restarting an MCMC algorithm with no knowledge of past steps. We derive a deterministic update for samples from a distribution when given the change to the log probability of the (unnormalized) distribution. Essentially, we calculate the "electric field" created by a set of samples, where the charge is the change in the probability distribution. Mathematically, this has close connections to Neural ODEs, Stein Variational Gradient Descent, and the Fokker-Planck equation. Unfortunately, the update seems to suffer quite badly from the curse of dimensionality, meaning its application to real problems is uncertain.
@article{pfau2020integrable,
title={Integrable Nonparametric Flows},
author={Pfau, David and Rezende, Danilo},
journal={arXiv preprint arXiv:2012.02035},
year={2020}
}
The Fermionic Neural Network (FermiNet) is a recently-developed neural network architecture that can be used as a wavefunction Ansatz for many-electron systems, and has already demonstrated high accuracy on small systems. Here we present several improvements to the FermiNet that allow us to set new records for speed and accuracy on challenging systems. We find that increasing the size of the network is sufficient to reach chemical accuracy on atoms as large as argon. Through a combination of implementing FermiNet in JAX and simplifying several parts of the network, we are able to reduce the number of GPU hours needed to train the FermiNet on large systems by an order of magnitude. This enables us to run the FermiNet on the challenging transition of bicyclobutane to butadiene and compare against the PauliNet on the automerization of cyclobutadiene, and we achieve results near the state of the art for both.
The FermiNet, which we introduced in a paper earlier in the same year, is a neural network that can represent wavefunctions of many-electron systems. This makes it possible to solve for the energies of chemical systems from first principles to very high accuracy. But scaling this method is very difficult. We show that by switching from TensorFlow to JAX and stripping out a few features from the network that weren't really doing anything, we can run calculations much faster. This allows us to train bigger networks that reach higher accuracy on systems like larger atoms, and run many calculations in parallel, like different possible transition states from the same chemical reaction.
@article{spencer2020better,
title={Better, Faster Fermionic Neural Networks},
author={Spencer, James S. and Pfau, David and Botev, Aleksandar and Foulkes, W. M. C.},
journal={arXiv preprint arXiv:2011.07125},
year={2020}
}
We present a novel nonparametric algorithm for symmetry-based disentangling of data manifolds, the Geometric Manifold Component Estimator (GEOMANCER). GEOMANCER provides a partial answer to the question posed by Higgins et al. (2018): is it possible to learn how to factorize a Lie group solely from observations of the orbit of an object it acts on? We show that fully unsupervised factorization of a data manifold is possible if the true metric of the manifold is known and each factor manifold has nontrivial holonomy -- for example, rotation in 3D. Our algorithm works by estimating the subspaces that are invariant under random walk diffusion, giving an approximation to the de Rham decomposition from differential geometry. We demonstrate the efficacy of GEOMANCER on several complex synthetic manifolds. Our work reduces the question of whether unsupervised disentangling is possible to the question of whether unsupervised metric learning is possible, providing a unifying insight into the geometric nature of representation learning.
"Disentangling" is a somewhat nebulous term in ML, but it is broadly about building models that can separate out different latent factors of variation - for instance, in vision, separating translation, rotation, and changes in lighting or color that leave objects invariant. There are many definitions of disentangling - this paper is focused on the "symmetry-based" definition, which formalizes different possible invariances in the world as a product of continuous transformations, also known as Lie groups. We formalized symmetry-based disentangling in a previous paper - in short, a representation is disentangled if it matches the product structure of the group transformations that act on objects in the world. While this helped clarify terms used in the field, it did not provide any recipe for how to learn this product structure for Lie groups. That's where GEOMANCER comes in.
GEOMANCER was inspired by an observation about analogical reasoning. When working with vector representations, you can make analogies just by adding vectors together. For instance, hand + leg - arm = foot. But this model of analogies breaks down when you move from vector representations to Lie groups. Suddenly things don't commute any more! This is especially a problem when dealing with 3D rotations, which are ubiquitous in computer vision. Some image analogies we can complete without a problem, while others are more ambiguous. The idea behind GEOMANCER is to use this ambiguity as a learning signal itself. Directions that are disentangled from one another will be those such that analogies made in those directions can be completed unambiguously, even over long distances. Formalizing this idea mathematically leads to a branch of differential geometry known as holonomy theory, that specifically deals with how much vectors deviate from their behavior in flat spaces when moved around a curved manifold. Working through the math, we arrive at an algorithm based around the idea of subspaces undergoing random walk diffusion on a data manifold.
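The claim that "things don't commute any more" can be checked numerically. The sketch below is only an illustration of that observation, not part of the GEOMANCER algorithm, and its helper names are hypothetical. It compares vector addition, which gives the same result in either order, with 3D rotations about different axes, which do not.

```python
import numpy as np


def rot_x(theta):
    """3x3 rotation matrix about the x axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])


def rot_z(theta):
    """3x3 rotation matrix about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def commutes(a, b):
    """True if applying a then b equals applying b then a."""
    return bool(np.allclose(a @ b, b @ a))


if __name__ == "__main__":
    u, v = np.array([1.0, 2.0, 3.0]), np.array([-0.5, 0.0, 4.0])
    print("vector addition commutes:", bool(np.allclose(u + v, v + u)))
    print("same-axis rotations commute:", commutes(rot_z(0.3), rot_z(1.1)))
    print("x and z rotations commute:",
          commutes(rot_x(np.pi / 2), rot_z(np.pi / 2)))
```

Rotations about a single axis still commute with each other. The order only starts to matter once different axes are mixed, which is the kind of ambiguity described above.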
On synthetic manifolds, we are able to automatically discover the correct number of submanifolds and their dimensions, and (up to sampling noise) learn the disentangled directions almost exactly. This works on the product of as many as 5 manifolds, far more than other methods. However, our method assumes that the data is already in a space where distances are correct and disentangled directions are at right angles. That usually isn't the case for raw data - so the problem of symmetry-based disentangling is only half-solved! Because we start from the symmetry-based definition and work backwards from first principles, we believe that GEOMANCER is a promising first step in a research direction that will lead to more general and robust disentangling algorithms.
@article{pfau2020disentangling,
title={Disentangling by Subspace Diffusion},
author={Pfau, David and Higgins, Irina and Botev, Aleksandar and Racani{\`e}re, S{\'e}bastien},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
year={2020}
}
Given access to accurate solutions of the many-electron Schrödinger equation, nearly all chemistry could be derived from first principles. Exact wavefunctions of interesting chemical systems are out of reach because they are NP-hard to compute in general, but approximations can be found using polynomially-scaling algorithms. The key challenge for many of these algorithms is the choice of wavefunction approximation, or Ansatz, which must trade off between efficiency and accuracy. Neural networks have shown impressive power as accurate practical function approximators and promise as a compact wavefunction Ansatz for spin systems, but problems in electronic structure require wavefunctions that obey Fermi-Dirac statistics. Here we introduce a novel deep learning architecture, the Fermionic Neural Network, as a powerful wavefunction Ansatz for many-electron systems. The Fermionic Neural Network is able to achieve accuracy beyond other variational Monte Carlo Ansätze on a variety of atoms and small molecules. Using no data other than atomic positions and charges, we predict the dissociation curves of the nitrogen molecule and hydrogen chain, two challenging strongly-correlated systems, to significantly higher accuracy than the coupled cluster method, widely considered the gold standard for quantum chemistry. This demonstrates that deep neural networks can outperform existing ab-initio quantum chemistry methods, opening the possibility of accurate direct optimisation of wavefunctions for previously intractable molecules and solids.
The Schrödinger equation - basically Newton's laws at the atomic scale - has been known for almost 100 years. But the equation is impossible to solve in closed form for anything more complicated than a hydrogen atom. To quote Paul Dirac, "Physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble." People have been solving these equations computationally almost as long as there have been computers - but an incredibly high level of accuracy is needed for these computations to be relevant to chemistry - something like 99.999% accuracy or higher. We've developed a new neural network architecture that can represent wavefunctions for systems of fermions - the kind of particles that make up most matter - and show that it is much more accurate than conventional approximate wavefunctions.
@article{pfau2020abinitio,
title={Ab-initio Solution of the Many-Electron Schr{\"o}dinger Equation with Deep Neural Networks},
author={Pfau, David and Spencer, James S. and Matthews, Alexander G. de G. and Foulkes, W. M. C.},
journal={Phys. Rev. Research},
year={2020},
volume={2},
issue = {3},
pages={033429},
doi = {10.1103/PhysRevResearch.2.033429},
url = {https://link.aps.org/doi/10.1103/PhysRevResearch.2.033429}
}
We present Spectral Inference Networks, a framework for learning eigenfunctions of linear operators by stochastic optimization. Spectral Inference Networks generalize Slow Feature Analysis to generic symmetric operators, and are closely related to Variational Monte Carlo methods from computational physics. As such, they can be a powerful tool for unsupervised representation learning from video or pairs of data. We derive a training algorithm for Spectral Inference Networks that addresses the bias in the gradients due to finite batch size and allows for online learning of multiple eigenfunctions. We show results of training Spectral Inference Networks on problems in quantum mechanics and feature learning for videos on synthetic datasets as well as the Arcade Learning Environment. Our results demonstrate that Spectral Inference Networks accurately recover eigenfunctions of linear operators, can discover interpretable representations from video and find meaningful subgoals in reinforcement learning environments.
Computing the eigendecomposition of a matrix is a ubiquitous problem in computational sciences. Often the matrix is an approximation of a linear operator built from a finite set of points, and its eigenvectors approximate the eigenfunctions of that operator. We show how to use generic function approximators like neural networks trained by gradient descent to approximately solve this problem. This has the advantage that generalization to new points is extremely fast and simple compared to alternative approaches like the Nyström method. Slow feature analysis, a classic unsupervised learning algorithm, is a special case of the framework we outline here.
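To make "eigenproblem as optimization" concrete, here is a minimal sketch in plain Python. It is not the Spectral Inference Networks algorithm: it uses an ordinary vector instead of a neural network, exact rather than stochastic gradients, and a hypothetical 2x2 symmetric matrix `A`. It simply runs projected gradient ascent on the Rayleigh quotient to find the top eigenpair.

```python
import math

def matvec(a, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def top_eigenpair(a, v0, lr=0.1, steps=500):
    """Gradient ascent on the Rayleigh quotient v^T A v over unit vectors."""
    v = normalize(v0)
    for _ in range(steps):
        av = matvec(a, v)
        lam = dot(v, av)  # current Rayleigh quotient
        # On the unit sphere the gradient is proportional to A v - lam * v.
        v = normalize([x + lr * (g - lam * x) for x, g in zip(v, av)])
    return dot(v, matvec(a, v)), v

# Hypothetical symmetric matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = top_eigenpair(A, [1.0, 0.0])
print(eigenvalue, eigenvector)
```

Replacing the vector with a parametric function evaluated at sample points, and the exact gradient with a minibatch estimate, is where the bias issues discussed in the abstract arise.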
@InProceedings{pfau2019spectral,
title={Spectral Inference Networks: Unifying Deep and Spectral Learning},
author={Pfau, David and Petersen, Stig and Agarwal, Ashish and Barrett, David and Stachenfeld, Kimberly L.},
booktitle={7th International Conference on Learning Representations},
year={2019}
}
How can intelligent agents solve a diverse set of tasks in a data-efficient manner? The disentangled representation learning approach posits that such an agent would benefit from separating out (disentangling) the underlying structure of the world into disjoint parts of its representation. However, there is no generally agreed-upon definition of disentangling, not least because it is unclear how to formalise the notion of world structure beyond toy datasets with a known ground truth generative process. Here we propose that a principled solution to characterising disentangled representations can be found by focusing on the transformation properties of the world. In particular, we suggest that those transformations that change only some properties of the underlying world state, while leaving all other properties invariant, are what gives exploitable structure to any kind of data. Similar ideas have already been successfully applied in physics, where the study of symmetry transformations has revolutionised the understanding of the world structure. By connecting symmetry transformations to vector representations using the formalism of group and representation theory we arrive at the first formal definition of disentangled representations. Our new definition is in agreement with many of the current intuitions about disentangling, while also providing principled resolutions to a number of previous points of contention. While this work focuses on formally defining disentangling - as opposed to solving the learning problem - we believe that the shift in perspective to studying data transformations can stimulate the development of better representation learning algorithms.
Learning to automatically disentangle different factors of variation in data (for instance, object pose, illumination, color and identity) is a major recent topic of interest in unsupervised learning. However, no one can really agree on what it means for a representation to be "disentangled". This paper is an attempt to make our intuitive notion of "disentangled representation" mathematically precise, using machinery from group representation theory.
@article{higgins2018towards,
title={Towards a Definition of Disentangled Representations},
author={Higgins, Irina and Amos, David and Pfau, David and Racani{\`e}re, S{\'e}bastien and Matthey, Loic and Rezende, Danilo and Lerchner, Alexander},
journal={arXiv preprint arXiv:1812.02230},
year={2018}
}
Spectral algorithms for learning low-dimensional data manifolds have largely been supplanted by deep learning methods in recent years. One reason is that classic spectral manifold learning methods often learn collapsed embeddings that do not fill the embedding space. We show that this is a natural consequence of data where different latent dimensions have dramatically different scaling in observation space. We present a simple extension of Laplacian Eigenmaps to fix this problem based on choosing embedding vectors which are both orthogonal and \textit{minimally redundant} to other dimensions of the embedding. In experiments on NORB and similarity-transformed faces we show that Minimally Redundant Laplacian Eigenmap (MR-LEM) significantly improves the quality of embedding vectors over Laplacian Eigenmaps, accurately recovers the latent topology of the data, and discovers many disentangled factors of variation of comparable quality to state-of-the-art deep learning methods.
In the early 2000s, algorithms like LLE, IsoMap and Laplacian Eigenmaps became popular tools for dimensionality reduction, often under the rubric "manifold learning". For a number of reasons, these methods largely fell by the wayside, except for a few like t-SNE that remain popular for visualization. We address the reason behind one of the failure modes of a certain type of manifold learning method. We show that once fixed, these classic algorithms can learn to disentangle complex data as well as modern deep learning methods - particularly data with complex topology - without the need for a generative model and with limited data.
@InProceedings{pfau2018minimally,
title={Minimally Redundant Laplacian Eigenmaps},
author={Pfau, David and Burgess, Christopher P.},
booktitle={6th International Conference on Learning Representations, Workshop Track},
year={2018}
}
We introduce a method to stabilize Generative Adversarial Networks (GANs) by defining the generator objective with respect to an unrolled optimization of the discriminator. This allows training to be adjusted between using the optimal discriminator in the generator’s objective, which is ideal but infeasible in practice, and using the current value of the discriminator, which is often unstable and leads to poor solutions. We show how this technique solves the common problem of mode collapse, stabilizes training of GANs with complex recurrent generators, and increases diversity and coverage of the data distribution by the generator.
Generative adversarial networks (GANs) have become popular in the world of deep unsupervised learning recently, but are notorious for being hard to optimize. This may be because the model consists of two neural networks, each of which is being optimized relative to the current state of the other one, meaning each network is trying to hit a moving target. We describe a practical method for optimizing one of these networks with respect to where the other will be in the future instead of where it is now, hopefully preventing some of these pathologies common to training GANs.
@InProceedings{metz2017unrolled,
title={Unrolled Generative Adversarial Networks},
author={Metz, Luke and Poole, Ben and Pfau, David and Sohl-Dickstein, Jascha},
booktitle={5th International Conference on Learning Representations},
year={2017}
}
Both generative adversarial networks (GAN) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being difficult to optimize. Practitioners in both fields have amassed a large number of strategies to mitigate these instabilities and improve training. Here we show that GANs can be viewed as actor-critic methods in an environment where the actor cannot affect the reward. We review the strategies for stabilizing training for each class of models, both those that generalize between the two and those that are particular to that model. We also review a number of extensions to GANs and RL algorithms with even more complicated information flow. We hope that by highlighting this formal connection we will encourage both GAN and RL communities to develop general, scalable, and stable algorithms for multilevel optimization with deep networks, and to draw inspiration across communities.
Generative adversarial networks have become popular in the world of deep unsupervised learning recently, but are notorious for being hard to optimize. Actor-critic methods in reinforcement learning have much the same reputation. We show that the two methods are actually very closely related, and review strategies used in both communities to improve the stability of training and diversity of samples, in the hopes of encouraging cross-pollination between the fields.
@InProceedings{pfau2016connecting,
title={Connecting Generative Adversarial Networks and Actor-Critic Methods},
author={Pfau, David and Vinyals, Oriol},
booktitle={NIPS Workshop on Adversarial Training},
year={2016}
}
The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
A lot of the success of deep learning has been in showing that features in domains like computer vision that had been hand-designed could be learned instead. Learning itself is mostly done with hand-designed optimization algorithms, however. This paper attempts to apply the successes of deep learning at the meta-level, to the optimization algorithms used to train the deep networks themselves. In other words: "Yo dawg, I heard you like optimizers, so I put a deep network in your deep network so you can learn while you learn."
@InProceedings{andrychowicz2016learning,
title = {Learning to Learn by Gradient Descent by Gradient Descent},
author = {Andrychowicz, Marcin and Denil, Misha and Gomez, Sergio and Hoffman, Matthew W and Pfau, David and Schaul, Tom and de Freitas, Nando},
booktitle = {Advances in Neural Information Processing Systems},
year = {2016}
}
In this work we introduce a differentiable version of the Compositional Pattern Producing Network, called the DPPN. Unlike a standard CPPN, the topology of a DPPN is evolved but the weights are learned. A Lamarckian algorithm, that combines evolution and learning, produces DPPNs to reconstruct an image. Our main result is that DPPNs can be evolved/trained to compress the weights of a denoising autoencoder from 157684 to roughly 200 parameters, while achieving a reconstruction accuracy comparable to a fully connected network with more than two orders of magnitude more parameters. The regularization ability of the DPPN allows it to rediscover (approximate) convolutional network architectures embedded within a fully connected architecture. Such convolutional architectures are the current state of the art for many computer vision applications, so it is satisfying that DPPNs are capable of discovering this structure rather than having to build it in by design. DPPNs exhibit better generalization when tested on the Omniglot dataset after being trained on MNIST, than directly encoded fully connected autoencoders. DPPNs are therefore a new framework for integrating learning and evolution.
Evolutionary computing is a type of stochastic search that randomly changes parameters in a model and keeps around the models that score the highest. While crude, a type of model called "compositional pattern producing networks" can generate interesting images, and can even fool state-of-the-art computer vision algorithms. We show that a mix of gradient descent and stochastic search works better for training compositional pattern producing networks to produce the parameters of a neural network than stochastic search alone. The neural network parameters that are learned have a structure somewhat similar to convolutions, which are a type of invariance normally built into neural networks by hand. This suggests that important invariances could possibly be discovered instead of being hand-encoded into models.
@InProceedings{fernando2016convolution,
title = {Convolution by Evolution: Differentiable Pattern Producing Networks},
author = {Fernando, Chrisantha and Banarse, Dylan and Reynolds, Malcolm and Besse, Frederic and Pfau, David and Jaderberg, Max and Lanctot, Marc and Wierstra, Daan},
booktitle = {The Genetic and Evolutionary Computation Conference},
year = {2016}
}
We present a modular approach for analyzing calcium imaging recordings of large neuronal ensembles. Our goal is to simultaneously identify the locations of the neurons, demix spatially overlapping components, and denoise and deconvolve the spiking activity from the slow dynamics of the calcium indicator. Our approach relies on a constrained nonnegative matrix factorization that expresses the spatiotemporal fluorescence activity as the product of a spatial matrix that encodes the spatial footprint of each neuron in the optical field and a temporal matrix that characterizes the calcium concentration of each neuron over time. This framework is combined with a novel constrained deconvolution approach that extracts estimates of neural activity from fluorescence traces, to create a spatiotemporal processing algorithm that requires minimal parameter tuning. We demonstrate the general applicability of our method by applying it to in vitro and in vivo multi-neuronal imaging data, whole-brain light-sheet imaging data, and dendritic imaging data.
Calcium imaging is a powerful class of experimental techniques that allow us to image from hundreds to thousands of neurons simultaneously in living animals. However, the information we really care about - which neuron is spiking when - is mixed together in complex ways in the raw video data. This paper presents a new statistical method that can simultaneously identify where neurons are, unmix the signals from overlapping neurons, and infer when a spike is occurring from noisy data, potentially saving experimenters a lot of time and energy.
@Article{pnevmatikakis2016simultaneous,
title={Simultaneous denoising, deconvolution, and demixing of calcium imaging data},
author={Pnevmatikakis, Eftychios A and Soudry, Daniel and Gao, Yuanjun and Machado, Timothy A and Merel, Josh and Pfau, David and Reardon, Thomas and Mu, Yu and Lacefield, Clay and Yang, Weijian and others},
journal={Neuron},
volume={89},
number={2},
pages={285--299},
year={2016},
publisher={Elsevier}
}
Making intelligent decisions from incomplete information is critical in many applications: for example, robots must choose actions based on imperfect sensors, and speech-based interfaces must infer a user’s needs from noisy microphone inputs. What makes these tasks hard is that often we do not have a natural representation with which to model the domain and use for choosing actions; we must learn about the domain’s properties while simultaneously performing the task. Learning a representation also involves trade-offs between modeling the data that we have seen previously and being able to make predictions about new data. This article explores learning representations of stochastic systems using Bayesian nonparametric statistics. Bayesian nonparametric methods allow the sophistication of a representation to scale gracefully with the complexity in the data. Our main contribution is a careful empirical evaluation of how representations learned using Bayesian nonparametric methods compare to other standard learning approaches, especially in support of planning and control. We show that the Bayesian aspects of the methods result in achieving state-of-the-art performance in decision making with relatively few samples, while the nonparametric aspects often result in fewer computations. These results hold across a variety of different techniques for choosing actions given a representation.
Is it possible for an agent to learn the structure of the world while learning how to act optimally in the world if it isn't able to see everything about the world all at once? We certainly hope so, or artificial intelligence may not be possible. We use a number of techniques for learning structure from time series in a Bayesian nonparametric way, including the Probabilistic Deterministic Infinite Automata (PDIA), to try to address this question. On small problems some of the methods tried do in fact recover the true structure of the world. Not the PDIA, sadly.
@article{doshi2015bayesian,
title={Bayesian nonparametric methods for partially-observable reinforcement learning},
author={Doshi-Velez, Finale and Pfau, David and Wood, Frank and Roy, Nicholas},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={37},
number={2},
pages={394--407},
year={2015},
publisher={IEEE}
}
Advances in neuroscience are producing data at an astounding rate - data which are fiendishly complex both to process and to interpret. Biological neural networks are high-dimensional, nonlinear, noisy, heterogeneous, and in nearly every way defy the simplifying assumptions of standard statistical methods. In this dissertation we address a number of issues with understanding the structure of neural populations, from the abstract level of how to uncover structure in generic time series, to the practical matter of finding relevant biological structure in state-of-the-art experimental techniques. To learn the structure of generic time series, we develop a new statistical model, which we dub the probabilistic deterministic infinite automata (PDIA), which uses tools from nonparametric Bayesian inference to learn a very general class of sequence models. We show that the models learned by the PDIA often offer better predictive performance and faster inference than Hidden Markov Models, while being significantly more compact than models that simply memorize contexts. For large populations of neurons, models like the PDIA become unwieldy, and we instead investigate ways to robustly reduce the dimensionality of the data. In particular, we adapt the generalized linear model (GLM) framework for regression to the case of matrix completion, which we call the low-dimensional GLM. We show that subspaces and dynamics of neural activity can be accurately recovered from model data, and with only minimal assumptions about the structure of the dynamics can still lead to good predictive performance on real data. Finally, to bridge the gap between recording technology and analysis, particularly as recordings from ever-larger populations of neurons become the norm, automated methods for extracting activity from raw recordings become a necessity. We present a number of methods for automatically segmenting biological units from optical imaging data, with applications to light sheet recording of genetically encoded calcium indicator fluorescence in the larval zebrafish, and optical electrophysiology using genetically encoded voltage indicators in culture. Together, these methods are a powerful set of tools for addressing the diverse challenges of modern neuroscience.
Six years of my life compressed into 150-odd pages. Most of Chapters 2 and 3 had been published at NIPS already, but some of the material included has not been published elsewhere. The first chapter gives a good summary of the role of information theory in neuroscience and provides some of the motivation for the work in Chapter 2 on time series models. Chapter 2 includes experiments with the PDIA on neuroscience data that have not been published elsewhere, showing that we can learn long-range dependencies in data better than a GLM (at least for data where the observations are binary). Chapter 4 provides a number of experiments in processing calcium imaging data that eventually led to work published in Neuron, but in quite a different form from what's presented here.
@phdthesis{pfau2015learning,
author = {Pfau, David},
title = {Learning Structure in Time Series for Neuroscience and Beyond},
school = {Columbia University},
year = 2015,
month = 2,
}
We present a structured matrix factorization approach to analyzing calcium imaging recordings of large neuronal ensembles. Our goal is to simultaneously identify the locations of the neurons, demix spatially overlapping components, and denoise and deconvolve the spiking activity of each neuron from the slow dynamics of the calcium indicator. The matrix factorization approach relies on the observation that the spatiotemporal fluorescence activity can be expressed as a product of two matrices: a spatial matrix that encodes the location of each neuron in the optical field and a temporal matrix that characterizes the calcium concentration of each neuron over time. We present a simple approach for estimating the dynamics of the calcium indicator as well as the observation noise statistics from the observed data. These parameters are then used to set up the matrix factorization problem in a constrained form that requires no further parameter tuning. We discuss initialization and post-processing techniques that enhance the performance of our method, along with efficient and largely parallelizable algorithms. We apply our method to in vivo large scale multi-neuronal imaging data and also demonstrate how similar methods can be used for the analysis of in vivo dendritic imaging data.
A preliminary version of our work on processing calcium imaging data, later published in Neuron.
@Article{pnevmatikakis2014structured,
title={A structured matrix factorization framework for large scale calcium imaging data analysis},
author={Pnevmatikakis, Eftychios A and Gao, Yuanjun and Soudry, Daniel and Pfau, David and Lacefield, Clay and Poskanzer, Kira and Bruno, Randy and Yuste, Rafael and Paninski, Liam},
journal={arXiv preprint arXiv:1409.2903},
year={2014}
}
Recordings from large populations of neurons make it possible to search for hypothesized low-dimensional dynamics. Finding these dynamics requires models that take into account biophysical constraints and can be fit efficiently and robustly. Here, we present an approach to dimensionality reduction for neural data that is convex, does not make strong assumptions about dynamics, does not require averaging over many trials and is extensible to more complex statistical models that combine local and global influences. The results can be combined with spectral methods to learn dynamical systems models. The basic method extends PCA to the exponential family using nuclear norm minimization. We evaluate the effectiveness of this method using an exact decomposition of the Bregman divergence that is analogous to variance explained for PCA. We show on model data that the parameters of latent linear dynamical systems can be recovered, and that even if the dynamics are not stationary we can still recover the true latent subspace. We also demonstrate an extension of nuclear norm minimization that can separate sparse local connections from global latent dynamics. Finally, we demonstrate improved prediction on real neural data from monkey motor cortex compared to fitting linear dynamical models without nuclear norm smoothing.
New technologies make it possible to record from massive populations of neurons, but making that data interpretable is challenging. Dimensionality reduction is one approach, which looks for a few factors in the data which account for most of the variability. State space models extend dimensionality reduction by modeling dynamics in the low-dimensional space of factors, as well as allowing for more complex models of noise that are more appropriate for neural data. In the machine learning community, state space models are typically fit with methods like expectation-maximization, which may be sensitive to the choice of initialization. We show here that state space models that are of interest to neuroscientists can also be fit using techniques from convex optimization - in particular techniques from the matrix completion and system identification community.
@InProceedings{pfau2013robust,
title={Robust learning of low-dimensional dynamics from large neural ensembles},
author={Pfau, David and Pnevmatikakis, Eftychios A and Paninski, Liam},
booktitle={Advances in neural information processing systems},
pages={2391--2399},
year={2013}
}
The opacity of typical objects in the world results in occlusion, an important property of natural scenes that makes inference of the full three-dimensional structure of the world challenging. The relationship between occlusion and low-level image statistics has been hotly debated in the literature, and extensive simulations have been used to determine whether occlusion is responsible for the ubiquitously observed power-law power spectra of natural images. To deepen our understanding of this problem, we have analytically computed the two- and four-point functions of a generalized “dead leaves” model of natural images with parameterized object transparency. Surprisingly, transparency alters these functions only by a multiplicative constant, so long as object diameters follow a power-law distribution. For other object size distributions, transparency more substantially affects the low-level image statistics. We propose that the universality of power-law power spectra for both natural scenes and radiological medical images, formed by the transmission of x-rays through partially transparent tissue, stems from power-law object size distributions, independent of object opacity.
If you compute the correlation between pixels in an image as a function of distance between pixels, a common statistical distribution emerges across nearly all images. The reason for this distribution was not clear - one camp held that it was due to the presence of objects of many different sizes in an image, while another held that it was caused by sharp edges. We show conclusively that the former camp is correct by analytically calculating the correlations in a model of natural images that factors in transparency. We show that changing the transparency of objects in the model does not change the correlation structure, but changing the distribution of sizes in the model does. Thus object sizes, not edges, lead to the complex correlations in nearly all natural images.
@Article{zylberberg2012dead,
title={Dead leaves and the dirty ground: Low-level image statistics in transmissive and occlusive imaging environments},
author={Zylberberg, Joel and Pfau, David and DeWeese, Michael Robert},
journal={Physical Review E},
volume={86},
number={6},
pages={066112},
year={2012},
publisher={APS}
}
A major goal for brain-machine interfaces is to allow patients to control prosthetic devices with high degrees of independent movements. Devices such as robotic arms and hands require this high dimensionality of control to restore the full range of actions exhibited in natural movement. Current BMI strategies fall well short of this goal, allowing the control of only a few degrees of freedom at a time. In this paper we present work towards the decoding of 27 joint angles from the shoulder, arm and hand as subjects perform reach and grasp movements. We also extend previous work in examining and optimizing the recording depth of electrodes to maximize the movement information that can be extracted from recorded neural signals.
One of the great potential applications of neural decoding is in neural prosthetics - potentially granting locked in patients the ability to move again. Neural decoding has demonstrated the ability to decode monkey reaching in 2 or 3 dimensions, but natural motion is far more complex than that. We showed that baseline algorithms from the neural prosthetics community could scale to controlling a virtual limb with dozens of degrees of freedom, opening the way to more realistic and rich movements from brain-machine interfaces.
@InProceedings{wong2012decoding,
title={Decoding arm and hand movements across layers of the macaque frontal cortices},
author={Wong, Yan T and Vigeral, Mariana and Putrino, David and Pfau, David and Merel, Josh and Paninski, Liam and Pesaran, Bijan},
booktitle={2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society},
pages={1757--1760},
year={2012},
organization={IEEE}
}
We propose a novel Bayesian nonparametric approach to learning with probabilistic deterministic finite automata (PDFA). We define and develop a sampler for a PDFA with an infinite number of states which we call the probabilistic deterministic infinite automata (PDIA). Posterior predictive inference in this model, given a finite training sequence, can be interpreted as averaging over multiple PDFAs of varying structure, where each PDFA is biased towards having few states. We suggest that our method for averaging over PDFAs is a novel approach to predictive distribution smoothing. We test PDIA inference on PDFA structure learning and on natural language and DNA data prediction tasks. The results suggest that the PDIA presents an attractive compromise between the computational cost of hidden Markov models and the storage requirements of hierarchically smoothed Markov models.
Probabilistic deterministic finite automata (PDFA) are a class of probabilistic models for sequences - they assign a probability to every possible sequence, like a string of text. They fall in between hidden Markov models and n-gram models in complexity. Like n-gram models, inference is fast and cheap as there is no uncertainty about what the context is. Like hidden Markov models, complex dependencies in how states transition can be learned - transitions which cannot be learned by an n-gram model no matter how long the context is. We develop a nonparametric Bayesian way of learning PDFA and show it can recover the true structure of artificial grammars that psychologists used to study human sequence learning, as well as learning very compact models of text and DNA.
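To make the description above concrete, here is a minimal sketch of a fixed, hand-written PDFA scoring a sequence. It is not the PDIA sampler from the paper and does no learning. The class name `PDFA`, its fields, and the two-state "even process" example (runs of 1s between 0s always have even length) are illustrative assumptions. The example is a standard toy case of a dependency that no fixed-length context captures exactly.

```python
import math


class PDFA:
    """Toy probabilistic deterministic finite automaton (illustrative only)."""

    def __init__(self, start, emit, delta):
        self.start = start  # initial state
        self.emit = emit    # emit[state][symbol] -> probability of emitting symbol
        self.delta = delta  # delta[(state, symbol)] -> unique next state

    def log_prob(self, seq):
        """Log-probability of seq; the state path is fully determined by seq."""
        state, total = self.start, 0.0
        for sym in seq:
            p = self.emit[state].get(sym, 0.0)
            if p == 0.0:
                return float("-inf")  # symbol cannot be emitted from this state
            total += math.log(p)
            state = self.delta[(state, sym)]  # deterministic transition
        return total


# Hypothetical example: the "even process" over the alphabet {"0", "1"}.
# State A emits "0" or "1" with equal probability; after a "1" the automaton
# moves to state B, which must emit a second "1" before returning to A.
even_process = PDFA(
    start="A",
    emit={"A": {"0": 0.5, "1": 0.5}, "B": {"1": 1.0}},
    delta={("A", "0"): "A", ("A", "1"): "B", ("B", "1"): "A"},
)

print(even_process.log_prob("0110"))  # three fair choices: 3 * log(0.5)
print(even_process.log_prob("010"))   # odd run of 1s is impossible: -inf
```

Because each (state, symbol) pair has exactly one successor, scoring a sequence is a single pass with no sum over hidden paths. This is the sense in which inference is as cheap as in an n-gram model.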
@InProceedings{pfau2010probabilistic,
title={Probabilistic deterministic infinite automata},
author={Pfau, David and Bartlett, Nicholas and Wood, Frank},
booktitle={Advances in neural information processing systems},
pages={1930--1938},
year={2010}
}
We propose a novel dependent hierarchical Pitman-Yor process model for discrete data. An incremental Monte Carlo inference procedure for this model is developed. We show that inference in this model can be performed in constant space and linear time. The model is demonstrated in a discrete sequence prediction task where it is shown to achieve state-of-the-art sequence prediction performance while using significantly less memory.
The sequence memoizer is a powerful probabilistic model of sequential data like text. One downside of the sequence memoizer is that it grows linearly in memory with the amount of data. We show that by forgetting intelligently, a constant-memory sequence memoizer performs comparably to the original linear-memory algorithm.
@InProceedings{bartlett2010forgetting,
title={Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process},
author={Bartlett, Nicholas and Pfau, David and Wood, Frank},
booktitle={Proceedings of the 27th International Conference on Machine Learning (ICML-10)},
pages={63--70},
year={2010}
}
Under natural viewing conditions, our eyes alternate between saccadic movement and fixation. However, even during fixation there are constant small movements, which can be decomposed into miniature saccades and diffusion-like random eye movements. Some diffusion helps prevent adaptation to a particular stimulus, but diffusion also blurs the image of the world across the retina. Despite this, humans can resolve fine spatial detail very well, and this diffusion may even enhance the ability to distinguish high-frequency components of an image [1]. This suggests that the brain compensates for fixational eye diffusion and may even extract useful information from it. To investigate the effect of eye diffusion on image reconstruction, we extended a generalized linear model (GLM) of retinal encoding/decoding to incorporate random-walk drift of the image falling on the retina. GLMs have been successfully applied to modeling a range of neural systems, including retinal ganglion cells [2]. Previously developed GLMs of the retina, directly estimated from spiking data, generate simulated network spike trains with the correct spatiotemporal filtering and correlation structure. Finally, given this network spiking encoding model and a statistical model of the spatiotemporal visual inputs, there is a natural Bayesian method for decoding the response [3]. For our model incorporating fixational eye diffusion, the decoding model would assign a probability to all possible random walks the image could take. However, the number of possible paths grows exponentially with time, making this method computationally intractable. Instead, we approximate the posterior distribution of images given the observed spikes as a mixture of Gaussians, and track the diffusive movements of the mixture components by a particle filtering approximation. This method is both computationally tractable and effective at reconstructing the encoded image. Preliminary results show that the image reconstruction is poor at both very low and very high diffusion rates, while reconstruction works reasonably well at intermediate diffusion rates. Thus, a well-defined optimal diffusion rate exists, and in general depends on statistical properties of both the stimulus and the retinal spatiotemporal receptive fields, such as the strength of the sustained response component and whether the transient component lasts longer than the persistence time of the eye movements. We are currently pursuing quantitative comparisons to the real diffusion coefficient during head-fixed viewing.
References
[1] Miniature eye movements enhance fine spatial detail. M. Rucci, R. Iovin, M. Poletti, and F. Santini, Nature 447(7146):851-854, 2007.
[2] Spatio-temporal correlations and visual signalling in a complete neuronal population. J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky and E. P. Simoncelli, Nature 454(7202):995-999, 2008.
[3] Model-based decoding, information estimation, and change-point detection in multi-neuron spike trains. J. W. Pillow and L. Paninski, 2008.
Even when staring fixed at an object, our eyes are moving in subtle ways, yet the world appears fixed to us. Somehow, our brain must be compensating for these random movements of our eyes to create a coherent and stable perception of the world. We developed a Bayesian method to decode both the content of a scene and the motion of the eye simultaneously from a model of the signal the optic nerve sends to the brain. We also derived the optimal amount of eye movement given this model, but found it was ten times smaller than the actual amount of movement, suggesting other factors are in play.
@InProceedings{pfau2009bayesian,
title={A Bayesian method to predict the optimal diffusion coefficient in random fixational eye movements},
author={Pfau, David and Pitkow, Xaq and Paninski, Liam},
booktitle={Conference abstract: Computational and systems neuroscience},
year={2009}
}
Data from neuroscience is fiendishly complex. Neurons exhibit correlations on very long timescales and across large populations, and the activity of individual neurons is difficult to extract from noisy experimental data. I will present work on several projects to address these issues, both abstract and applied. First I will discuss the Probabilistic Deterministic Infinite Automata (PDIA), a nonparametric model of discrete sequences such as natural language or neural spiking. The PDIA explicitly enumerates latent states that are predictive of the future, and by using a Hierarchical Dirichlet Process prior can learn arbitrary transitions between those states. The model class learned by the PDIA is smaller than hidden Markov models but yields superior predictive performance on data with strong history dependence, like text. One weakness of the PDIA is that it is hard to scale when the space of possible observations is very large, as is the case with large populations of neurons. In this limit we are instead interested in reducing the dimensionality of data, and I will present work on unifying the generalized linear model (GLM) framework in neuroscience with dimensionality reduction. The resulting models can be efficiently learned using convex techniques from the matrix completion literature, and can be combined with spectral methods to learn surprisingly accurate models of the dynamics of real neural data. To apply these models to the kinds of high-dimensional neural data now becoming available, we have to bridge the gap between raw data and units of neural activity. I will present joint work with Misha Ahrens and Jeremy Freeman on extracting neural activity from whole-brain recordings in larval zebrafish, as a step towards the long-term goal of making dynamics modeling a daily part of the data analysis routine in neuroscience.
I try to contribute to open source as much as I can from within a private corporation, and some examples include the code from our Spectral Inference Networks paper, as well as various useful linear algebra operators and gradients in TensorFlow and JAX. In particular, the matrix exponential operator in TensorFlow was used to make a novel discovery in the theory of supergravity.
Though it hasn't been updated much since I joined DeepMind, you can find my personal GitHub here. Notable projects include a collection of methods for learning state space models for neuroscience data, some of which have been integrated into the pop_spik_dyn package, a Matlab implementation of Learning Recurrent Neural Networks with Hessian-Free Optimization, and the Java implementation of the Probabilistic Deterministic Infinite Automata used in our paper. For those interested in probabilistic programming, I have also provided a PDIA implementation in WebChurch.
I also contributed a C++ implementation of Beam Sampling for the Infinite Hidden Markov Model to the Data Microscopes project. At a factor of 40 faster than existing Matlab code, it's likely the fastest beam sampler for the iHMM in the world.
Not everything makes it into a paper, but that doesn't mean it's not important. You can find short notes and other writings that don't have a home elsewhere here.
A simple result that I hadn't seen published elsewhere. Other research on generalized bias-variance decompositions has historically focused on 0-1 loss and is relevant to classification and boosting. In probabilistic modeling, error is measured through log probabilities instead of classification accuracy, often with distributions in the exponential family. Exponential family likelihoods and Bregman divergences are closely related, and it turns out it's straightforward to generalize the bias-variance decomposition for squared error to all Bregman divergences.
Several years after writing this note, Frank Nielsen pointed me to "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications" by Buja, Stuetzle and Shen. Right there in Section 21 is essentially the same derivation. However, it's still a fairly niche result, and I haven't seen a clean, standalone derivation before, so I hope these notes are helpful.
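For readers who want the statement at a glance, the decomposition can be sketched as follows. The notation here is an assumption and may differ from the notes themselves, including the argument order of the divergence and the symbols $\bar{y}$ and $\tilde{y}$. $F$ is taken to be strictly convex and differentiable, and the target $Y$ and prediction $\hat{Y}$ are assumed independent.

```latex
% Bregman divergence generated by a strictly convex, differentiable F
D_F(p, q) = F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle

% Mean target and "dual mean" prediction (assumed notation)
\bar{y} = \mathbb{E}[Y], \qquad
\tilde{y} = (\nabla F)^{-1}\!\left( \mathbb{E}\big[\nabla F(\hat{Y})\big] \right)

% Noise + bias + variance
\mathbb{E}\big[ D_F(Y, \hat{Y}) \big]
  = \underbrace{\mathbb{E}\big[ D_F(Y, \bar{y}) \big]}_{\text{noise}}
  + \underbrace{D_F(\bar{y}, \tilde{y})}_{\text{bias}}
  + \underbrace{\mathbb{E}\big[ D_F(\tilde{y}, \hat{Y}) \big]}_{\text{variance}}
```

With $F(x) = \tfrac{1}{2}\lVert x \rVert^2$, the dual mean reduces to $\tilde{y} = \mathbb{E}[\hat{Y}]$ and the identity becomes the familiar squared-error decomposition, up to the factor of $\tfrac{1}{2}$.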
A short essay about the process behind writing the paper "Disentangling by Subspace Diffusion", containing my own thoughts on the research process and giving some insight into just how long and arduous the process of going from idea to paper can be.