Large Deviations Theory and Applications
SMALL RANDOM PERTURBATIONS OF DYNAMIC SYSTEMS
DIMENSION ESTIMATION FOR TIME SERIES STOCHASTIC MODELS
ASYMPTOTICS OF SIMULATED ANNEALING
NON PARAMETRIC ASYMPTOTIC ESTIMATION
RARE EVENTS AND
SMALL RANDOM PERTURBATIONS OF DYNAMIC SYSTEMS
My first brush up with large deviations and rare events was a very intensive joint work with Gabriel RUGET. We had both just been involved in a one year reflexion group at Ecole Normale Sup. Paris on adaptive control of dynamic systems, jointly organized with Albert BENVENISTE and Philippe COURREGE, and the connected question of probabilistic models for learning processes was evoked there, only to be quickly dismissed due to the definitely “soft mathematics” colour of the then available litterature on probabilistic learning.
Fueled by our tête-à-tête discussions on sufficiently generic but still manageable probabilistic models for learning processes, G. RUGET and myself started an exciting quest to discover computable asymptotics for “slow learning” random dynamics.
We decided to study a large class of continuous random dynamic systems, determined by a finite set of random vector fields on a differentiable manifold : at each point in space, the dynamic system picks its tangent vector at random among the finite set of vector fields. These local random choices are defined by a smooth field (µx) of probabilities carried at each space point x by the tangent space Tx .
Slowing down the time clock of these “learning” dynamics brings trajectories close to the “mean field” deterministic dynamics, as could be deduced from adequate extensions of local laws of large numbers. We then studied the asymptotics of the necessarily very rare escape paths from basins of attraction of the mean field deterministic dynamics. This opened up a whole Pandora box of local large deviations theory for our slow learning processes, and we were naturally led to build a complex extension of the major work of VENTCELL-FREIDLIN on small diffusions. Our random systems could indeed be viewed as a whole new class of small random perturbations of deterministic dynamic systems.
We thus proved that for these random systems with slow dynamics, escape from stable fixed points of the deterministic mean field dynamics, were rare events with exponentially decaying probabilities, and we computed a closed formula for an “energy functional” L(f) defined on the path space, providing the exponential rate of decay for probability of escape along any given path f . Namely, we showed that L(f) was the integral energy of the path speed df/dt for a non Riemannian metrics, defined by a continuous field of positive convex functions Lx on the tangent spaces Tx. In fact, for each x, the function Lx is the CRAMER-CHERNOV transform of the probability µx .
This led ud to prove that escape from an equilibrium point of the mean field dynamics, whenever it occurred, most likely took place along “critical escape paths”, and that these critical paths solved an explicit variational problem. Namely, critical paths were geodesics of the non Riemannian metrics just defined. It was then possible to write the nonlinear Hamilton-Jacobi partial differential equations solved by the critical paths.
The masterly early papers of Srinavasa VARADHAN and Daniel STROOCK on large deviations theory of course offered us the right conceptual probabilistic framework in which to frame and formulate our results.
Inevitably, I had many more opportunities to rely on S.R.S. VARADHAN’s deep contributions to large deviations theory, in particular while directing the PhD thesis of Francis COMETS on large deviations for Gibbs fields.
F. COMETS proved interesting and technically difficult results on low temperature asymptotics for transitional dynamics between two equilibria of spins fields, as well as on large deviations for the empirical long time average dynamics of stationary Gibbs fields, in the spirit of VARADHAN’s fundamental papers on large deviations for stationary random processes. He went on to become a recognized expert in this field.
A few years later, I directed the PhD thesis of Isabelle NAGOT to extend the results RUGET and I had obtained for continuous measure valued random fields (µx) to measure valued random fields having discontinuities when x belongs to a smooth sub-manifold of the state space. This turned out to raise particularly delicate technicalities to analyze process behaviour across such discontinuity surfaces, but I. NAGOT obtained explicit formulations of the partial differential equations solved by critical paths
This in depth incursion into large deviations theory launched me on a long term systematic study of Rare Events theory, LEGENDRE duality between convex functions, and CRAMER-CHERNOV formulae for large deviations of sums of multivariate random variables. I soon understood clearly the existence of a powerful knot between the study of small random perturbations of dynamic systems (or “small diffusions”) à la VENTCEL-FREIDLIN and more classical large deviations theory.
The idea I developed was that diffusion processes with small diffusion coefficients could be seen in fact as “almost smooth” pathwise transforms of small ordinary brownian motion, and that it must hence be possible to transfer, by direct image of probabilities in path spaces, the easily formalized large deviations results for small gaussian processes, into large deviations in paths space for small diffusions.
This esthetic and intuitive machinery worked very well, and enabled me to give a new cleaner proof of VENTCELL-FREIDLIN results, and to obtain, as a side benefit, a natural extension of their results from elliptic diffusions to hypo-elliptic small diffusions, a difficult task which was not easily feasible with their original method.
I was then invited by Paul Andre MEYER, Jacques NEVEU, Didier DACUNHA-CASTELLE to give a summer course on Large Deviations in the yearly Ecole of Probabilites of Saint-Flour (France), where I linked these results to a systematic formalization of rare events theory, in the spirit of BAHADUR-ZABELL papers, and of course of the VARADHAN-STROOCK formulations. The intellectual impact of my course on a group of very active french probability PhDs encouraged me to intensify my activity in the field of large deviations theory, and to explore applications in physics of “small random perturbations in path spaces”.
I directed the PhD thesis of P. LOTI-VIAUD, who had begun exploring an idea of G. RUGET, and studied the “large deviations” asymptotics of random evolutions for populations where the generating functions governing the offspring on any individual depended on its spatial localization. The goal was to obtain and numerically solve the partial differential equations governing the asymptotics of population propagation in time and space.
This was a strong motivation to launch an intensive 2 years team work on “geodesics and small time behaviour of diffusion processes” following MOLCHANOV, as described above.
A collaboration with Halim DOSS then was the opportunity to apply probabilistic stationary phase methods on WIENER spaces of trajectories to the asymptotic “semi classical expansions” for the fundamental solution of the Schroedinger operator when the Planck constant tends to zero.
I then undertook the exploitation of “rare events” probabilistic theory in path spaces to generate rigourous and precise small time asymptotics expansions of fundamental solutions of smooth 2nd order parabolic operators, viewed as probability densities of diffusions processes. Indeed MOLCHANOV’s elegant work only yielded the proper geometric interpretation for the 1st term c(x,y) t-k/2 exp[-d2(x,y)/2t] in the small t expansion of the density pt(x,y) on a Riemannian manifold with dimension k .
My approach provided complete small time asymptotic expansions of diffusions probability densities pt(x,y) , replacing the constant c(x,y) by a polynomial in positive integer powers of t1/2, with good estimates of the remainder term.
I thus obtained interesting closed form stochastic integrals formulae for the coefficients of this polynomial, and introduced for this purpose a key probabilistic tool : the construction of explicit pathwise stochastic Taylor formulas for the random paths of diffusion processes in small time, where the coefficient of tn/2 in the pathwise Taylor series involves multiple stochastic integrals of order n . In the particular case of the logarithm of Brownian motion on Lie groups, the corresponding pathwise Taylor stochastic expansions coincided neatly with the explicit expansions obtained for the Lie group case by G. BENAROUS in his thesis.
This point of view, applied to delicate asymptotic expansions of the Brownian bridge on Riemannian manifolds, became helpful when the french BOURBAKI group asked me to present a synthetic analysis of J.M. BISMUT’s remarkable probabilist proof of ATIYAH-SINGER index theorem, and I was thus able to provide an interesting shortcut for a key probabilistic point of his complex and original approach.
My understanding of this BISMUT’s paper became the topic of a two months seminar I gave in 1986 at Courant Institute, New-York, at the invitation of S.R.S.VARADHAN.
Selected References R. Azencott :
[16] [21] [25] [28] [30] [31] [32] [33] and Springer Lecture Notes book [22]
LARGE DEVIATIONS AND APPLIED STATISTICS
DIMENSION ESTIMATION FOR TIME SERIES STOCHASTIC MODELS
During my two years of research in the Statistics Dept at Univ. of California (Berkeley), I participated intensively in seminars led by Lucien LECAM, and became quite interested by LECAM’s powerful approach to asymptotics of statistical experiments and parameter estimation. At the same time, I absorbed innovative ideas of HUBER and BICKEL on robust parameter estimators, built to resist the noisyness of random observations and biasedness of stochastic models.
The natural links of their approaches with LECAM’s distance between experiments, and Kullback distance between probability models, as well as with BAHADUR and ZABELL exponential rates of convergence for parameter estimators, clinched excitingly with my ongoing work on large deviations theory.
This led me to launch, at University Paris 11, in collaboration with Didier DACUNHA-CASTELLE, an intensive one year seminar on large deviations applied to asymptotic statistics.
I thus obtained concrete results for two large classes of robust estimators for the unknown mean of independent random variables, the so called M-estimators and R-estimators generated by adaptive censoring away of extreme observations.
Simultaneously, I had created at Universites Paris 7 and Paris 1, and directed since 1975, a new cursus “Mathematics and Economy”, in which I taught several applied graduate courses on auto-regressive/moving average models (ARMA techniques) for modelization and forecast of random time series. These popular statistical techniques, relied essentially on estimating adequate rational fractions to approximate the spectral density of 2nd order stationary random processes.
A key problem in this context, as noticed then by AKAIKE, is to avoid over-parametrization of the statistical model, which leads inexorably to asyptotically inconsistent estimators. But even the intuitive AKAIKE approach to control dimension estimates in ARMA models, had just been shown to still lead to asymptotically inconsistent dimension estimates for the underlying model.
I thus undertook, with D. DACUNHA-CASTELLE, a rigorous and thorough study of computable consistent estimates of parametric dimensions for ARMA models, which led us to publish an in depth book on random time series and their forecast. Our book had 3 editions, successively in french, english, and japanese.
The statistical expertise involved in adequate modeling of real life time series by ARIMA techniques was, in the applied statistics community, quite fuzzy and not well formalized.
I felt that this situation could be seriously improved, and collaborated with applied statisticians and computer scientists (Y. and B. GIRARD, P. ASTIER, M.M. MARTIN) to realize a desktop scientific software offering quick and easy implementations of consistent dimension estimates as well as efficient expert parametric estimates of random time series by ARIMA models.
This became a successful joint project with EDF/GDF, the french national electricity company, which supported the research, and adopted our scientic software MANDRAKE in its R&D dept.
A collaboration with INRETS (national research institute for transportations), on efficient stochastic modelization and forecasts for massive national data recorded on highway vehicule accidents, led me to direct Anne RICORDEAU’s applied PhD thesis, and to focus her work on Markov fields models for interacting families of Poisson stochastic processes.
A couple of years later, the INRETS collaboration was extended, with the financial support of the french transportation ministry, and I led an efficient scientific team (Y. and B. GIRARD, J. LACAILLE, B. DURAND) in the delicate extension of MANDRAKE automatic ARIMA modeling to Multivariate random Time series.
The number of ARIMA parameters to estimate increases very fast with the dimension of the observed random vectors, so that a good mathematical approach to avoid overparametrization becomes a key practical question, and it was an exciting challenge to link the theoretical results of RISSANEN on “model dimension estimates” to new concrete algorithms implementable in an easy to use scientific software.
We then collaborated with EDF/GDF, to implement at the national operational level a boosted up and customized version of our software for on line automatization of the massive nationwide regional short-term forecasts of gaz consumptions.
Selected References R. Azencott :
[19] [20] [29] [34] [40] [41] [42] [70] [71] , including books [29] [34] [41], and scientific softwares [42] [70] [71].
Scientific direction of several short MA theses in applied statistics, and of Anne RICORDEAU’s PhD thesis.
LARGE DEVIATIONS AND STOCHASTIC GIBBS DYNAMICS
ASYMPTOTICS OF SIMULATED ANNEALING
In 1983, I attended an advanced seminar at Les Houches, France, on new trends in statistical physics, sponsored by IBM, and organized by interesting physicists (SHERRINGTON, KIRKPATRICK, G. TOULOUSE, M. MEZARD). The seminar focused on Spin Glasses and Simulated Annealing, a seductive stochastic gradient descent algorithm, originally inspired by clever analogies with the physical dynamics of interactive fields of spin vectors, and by the actual cooling of large arrays of interactive particles. Simulated annealing seemed to have “universal” minimizing virtues for NP hard combinatorial optimization.
I was struck by the variety of massive optimization problems which simulated annealing could apparently attack, and by its “physical model” as semi-random progressive self organization capacities for large arrays of interactive particles submitted to “slowly cooling” temperature dynamics, which stabilizes the particle system in configurations minimizing a global energy function.
Given the startling lack of formal probabilistic study for asymptotics of simulated annealing, I jumped on this exciting new subject in 1984, and received another strong impetus when I read the ground breaking work of Donald and Stuart GEMAN (1984) in an exciting new domain : performing restoration of noisy digitalized image by Markov Random Fields “energy” modelization, and energy minimization by simulated annealing.
This opened for me a 10 years intensive research activity in three interwoven domains : Simulated annealing, Markov Random Fields, and Energy minimization for digital image analysis.
I decided to explore them in depth and simultaneously, and enrolled the active implication of a group of about twelve bright young applied mathematicians who became almost contemporaneously PhD students under my direction. With their support, I founded at Ecole Normale Sup. Rue d’Ulm Paris, a new laboratory called DIAM : distributed artificial intelligence and mathematics.
I centered our Markov fields and image analysis activities (see below) at University Paris 11 in Orsay, and our mathematical explorations on simulated annealing and neural networks at Ecole Normale Sup, Rue d' Ulm.
On the theoretical level, I soon understood that the detailed asyptotics study of simulated annealing was closely related to large deviations estimates of rare events such as excursions out of energy wells for the random gradient descent.
My own large deviations approach to VENTCEL-FREIDLIN random excursions for dynamic systems submitted to small perturbations then became an essential guideline to seriously attack precise asymptotics for simulated annealing.
I hence focused and directed along this path the PhD research of two unusually gifted mathematicians, Olivier CATONI and Alain TROUVE, to reach a large array of definitive and powerful theoretical results on exact asymptotic rates of convergence for simulated annealing.
In particular, CATONI’s PhD thesis completely clarified the links of these asymptotic rates with appropriate energy landscape descriptors, and with a “cycle decomposition” of this energy landscape in the spirit of VENTCEL-FREIDLIN; his deep results brought him a coveted IBM mathematical prize.
I then oriented my research group towards the extensions of these sophisticated large deviations techniques in a very rich direction : the effect of synchronous parallel dynamics on simulated annealing, in particular through the scientific direction of A. TROUVE’s PhD thesis.
The powerful and elegant results of A. TROUVE brought another mathematical layer to CATONI’s highly technical results, boosting up their clarity and generality, and used them to cleverly analyze the unexpected effects of partial and/or total synchronicity for parallel dynamics in simulated annealing.
TROUVE’s work was later an efficient building block in Raphael CERF’s clever asymptotic analysis of the so called “genetic” optimization algorithms.
My intense interest in parallelization schemes for stochastic relaxation of large arrays of interactive cells or sites was linked to the emergence and availability in France of massively parallel computers such as the Connection Machine with 64 000 processors or the MASPAR with 4096 processors.
I was convinced that massively parallel computing was an astonishing new potential for high acceleration of complex computing tasks which could naturally be mapped onto such arrays of processors, such as low level artificial vision tasks performable by artificial retinas.
Thus I began a joint theoretical and applied seminar with specialists of parallel computing and artificial retinas, such as Patrick GARDA, Laurent VIROT, and altri, and launched with my DIAM resarch lab a three years intense mathematical exploration of massively parallel dynamics for simulated annealing and their impact on convergence acceleration.
To complete our theoretical results by extensive simulations on massively parallel computers, I organized a network of scientific collaborations with major computer science labs in France (ETCA, Univ. of Orleans, IRISA, ENS Lyon).
This team work led me to impulse, coordinate, and edit a pioneering book on parallelization schemes for simulated annealing (R. AZENCOTT, O. CATONI, M. DREYFUS, P. GARDA, I. GAUDRON, C. GRAFFIGNE, A. TROUVE, L. VIROT).
Selected References R. Azencott :
[35] [36] [43] [59] [60] [61] [62] [63], including a book [59].
LARGE DEVIATIONS AND AUTOMATIC LEARNING THEORY
NON PARAMETRIC ASYMPTOTIC ESTIMATION
When I became interested in formal neural networks around 1985, I read the then major theoretical papers on the topic, mostly written by inventive physicists, and was struck by their lack of familiarity with deep and relevant mathematical tools, linked to advanced probability theory and theoretical statistics, which should have provided a powerful mathematical background to analyze learning capacities and complexity computation for learning problems.
I hence organized with O. CATONI, A. TROUVE, L. YOUNES several Ecole Normale Sup. seminars circa 1994-95 on analyses of major books and papers by VAPNIK, CERVONENKIS, VALIANT, BARON, RISSANEN, et altri, to get a better grasp of links between minimal coding theory, complexity computation, speed of functional approximations by finite dimensional subspaces, non parametric statistical asymptotics, V-C dimension and learning speed.
In 1996, I invited VAPNIK to give an advanced course at Ecole Normale Sup., which was an opportunity to technically deepen my own mathematical analysis of his work.
Recall that a very large class of automatic learning problems are essentially formalized by the following paradigm : in a fixed infinite dimensional function space H, select a good approximation of an unknown target function F, given N “training examples” , i.e. given the values of F at N randomly selected points. A learning problem can be consistently solved if there is a learning algorithm which optimally matches the number of hidden parameters (empirical model dimension) with the number N of “training examples”, in order to insure that fixed size estimation errors have probabilities converging to zero as N tends to infinity.
VAPNIK’s major results essentially show that if, for each target function F in the function space H, the corresponding learning problem can always be consistently solved, then convergence to zero for probabilities of fixed size learning errors must occur at exponential speed, the space H must have finite Vapnik-Cervonenkis dimension "V-Cdim", and the exponential rates of convergence can be universally computed from the V-C dimension of H.
I intuitively felt that these results could be understood differently, and quite as deeply, as consequences of “uniform” large deviations theory applied to wide classes of underlying probability distributions for the training examples. The universal exponential rate defined by the celebrated V-C dimension would then correspond only to the most pessimistic situation where nothing is known about the underlying probability distribution for the selection of training examples.
After sketching the intuitive steps for a potential proof of this conjecture, I explicited these ideas in an invited 1997 lecture at Brown University 50th aniversary Maths Congress. In 1998, I began directing Nicolas VAYATIS in his PhD thesis, during which, along with further interesting results, he proved this conjecture in 2001-2002.
I had in any case already started in 1996 to apply ideas derived from large deviations techniques and from VAPNIK’s deep statistical approaches such as “support vector machines”, to reach pragmatic and computer implementable algorithms for “parametric dimension selection” in very concrete learning problems, such as optimal automatic dimensionning of formal neural networks in function of the available number of training examples and of the regularity of the function to emulate.
From 2001 to 2002, we intensified our R&D efforts in this direction within the MIRIAD consulting group, to conceive and realize classes of completely autonomous software learning agents, endowed with the ability to automatically monitor their own learning.
Selected References R. Azencott : [84] [89]
|