## Statistician Dawn Woodard Joins ORIE Faculty

Professor Dawn B. Woodard received a B.S. degree in mathematical and computational science from Stanford University and holds M.S. and Ph.D. degrees from Duke University, where she studied in the Department of Statistical Science. She joined ORIE in Fall 2008 and currently teaches the Statistical Data Mining course, which deals with extracting information from large sets of data in order to draw conclusions relevant to business and other problems. In welcoming Woodard to ORIE and the wider Cornell community of Statistical Science, Professor Bruce W. Turnbull noted that "with the rapid advances in computer technology and electronic data capture, the vast data sets that are available in fields such as marketing, finance, manufacturing, genomics, geographic information systems, etc., present an added set of challenges to a statistician."

While a graduate student, Woodard was a contractor for Insightful Corporation, the maker of a popular statistical software package called S-PLUS. There she developed and tested software to assist users of S-PLUS in areas such as drug safety, determination of appropriate doses, and analysis of what happens to an ingested pharmaceutical in the body. She has also worked for SAS Institute (also a statistical software vendor); for Peakstone Corporation, a startup company providing tools to manage business performance; and for Rockwell Scientific, now part of Teledyne.

#### Research Area

Woodard's research deals with Bayesian statistics, one of two prominent methodologies for deriving inferences about some unknown aspect of a population from data about it. While the alternative frequentist approach asserts the existence of an inherent and experimentally measurable probability that an event occurs, the Bayesian approach (named for the 18th-century minister Thomas Bayes) starts with an estimate of the probability that is modified as evidence is accumulated. In the work of Woodard and other statisticians, the use of Bayesian methods extends beyond single events to analyzing the frequency (or probability) distribution of a range of possible outcomes, and to estimating properties of this distribution such as the mean (average), variance (dispersion about the average), maximum value, and the like. Bayesian methods are increasingly used for a wide variety of statistical problems, and are even considered a fundamental mechanism in human cognition.
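The idea of starting from an estimate and revising it as evidence accumulates can be illustrated with a minimal Python sketch. It assumes a textbook Beta-Binomial model, which is not taken from Woodard's work; the function names, the uniform starting estimate, and the batch counts are all hypothetical and chosen only for illustration.

```python
def update_beta(prior_a, prior_b, events, non_events):
    """Conjugate update: a Beta(a, b) prior plus binomial counts gives a Beta posterior."""
    return prior_a + events, prior_b + non_events


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, used here as the probability estimate."""
    return a / (a + b)


# Hypothetical data: (events, non-events) observed in three successive batches.
batches = [(3, 7), (4, 6), (2, 8)]

# Assumed starting point: a uniform Beta(1, 1) prior, i.e. an initial estimate of 0.5.
a, b = 1.0, 1.0
estimates = [beta_mean(a, b)]
for events, non_events in batches:
    a, b = update_beta(a, b, events, non_events)
    estimates.append(beta_mean(a, b))

print(estimates)
```

In this made-up example the estimate starts at 0.5 and ends at 10/32 = 0.3125, close to the observed frequency of 9 events in 30 trials. The result is a full distribution rather than a single number, so quantities such as its variance could be read off in the same way.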

#### Medical Application

Prior to joining the faculty, Woodard gave a talk at Cornell that described the application of Bayesian inference to a problem in what has come to be called "data-driven" medicine. A large amount of data (some 500,000 cases) is available about mammograms performed to detect breast cancer, including information about whether breast cancer was found within one year of the procedure. The accuracy (sensitivity and specificity) with which radiologists have read these mammograms varies, in part from differences in their medical practices (including concern about malpractice), in part from differences in patient demographics, and in part from the kind of variation that arises in all such situations characterized by underlying randomness. Woodard and her colleagues used Bayesian methods to tease apart the factors that influence accuracy and to adjust the data so that performance can be measured relative to a common standard.

Each dot shows the estimated difference between the specificity for a particular radiologist and a standard appropriate to that radiologist. The vertical bars represent the uncertainty in the difference estimates.

They found that the adjusted measure of sensitivity (the percentage of patients with breast cancer correctly identified as having it) does not vary significantly among radiologists, but specificity (the percentage of patients without breast cancer correctly identified as not having it) does vary significantly. The analytical approach combines two different statistical models in a hierarchical way that "is only possible in a Bayesian setting," according to Woodard. "We hope that these results can help radiologists refine their approach so as to reduce unnecessary and costly followup," she says.

#### Environmental Application

In a recent seminar talk to faculty and graduate students in ORIE, Woodard discussed a different statistical application, one relating to groundwater contamination in the mid-Atlantic states. The permissible amount of nitrates in groundwater, often the result of runoff from agricultural fertilizers, is regulated at various geographic scales, such as counties, watersheds and census blocks. According to Woodard, "regulators are interested in phrasing regulations in terms of risk to individuals, but generally make use of more *ad hoc* criteria because of difficulty in relating measurement data to such a formulation." She presented a method for estimating multiple risk measures, such as the average concentration, the probability of exceeding a particular limit, or the size of the maximum concentration in a region, and doing so simultaneously for various geographic scales. While there is considerable data available, taken from 929 wells measured during more than a decade, the challenge is to relate the risk measures to the data.

In this diagram, the symbols represent nitrate readings, color-coded by magnitude, while the shading represents the estimated nitrate concentrations using the Bayesian approach.

Woodard's modeling method, based on moving averages and Bayesian statistics, interpolates from the available data in order to obtain intensity maps for the whole region. From these, the various risk measures can be computed and mapped. The resulting maps show various 'hot spots,' for example in the Chesapeake Bay region. However, they also show that the probability of exceeding the regulatory limits is generally low in most areas despite the fact that 15% of the individual readings exceed the regulatory limit. According to Woodard, the results show that many of the high measurement values "can be attributed to local variability rather than to high regional nitrate levels."

#### Thesis Research

Woodard's thesis research deals with situations in which it is necessary to sample (repeatedly) from a probability distribution, for example in developing a computer simulation of a situation characterized by randomness or in a Bayesian analysis. Such situations occur in analyzing many physical and statistical problems of interest, including the nitrate contamination problem discussed above. In dealing with such problems, computer software is used to repeatedly draw sample values from the distribution so that the resulting values can be used to estimate quantities of interest in the world, to predict additional quantities of interest, to test hypotheses about the underlying phenomenon, or to fill in missing data values. Techniques, the most straightforward of which is called the Monte Carlo method, have been devised to estimate the quantities of interest.
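A minimal Python sketch of plain Monte Carlo estimation follows. It assumes a hypothetical lognormal distribution standing in for the quantity being studied; the distribution, the limit, and the function name `monte_carlo_summary` are illustrative choices and do not come from Woodard's analysis.

```python
import math
import random


def monte_carlo_summary(draw, n_samples, limit):
    """Estimate the mean, the chance of exceeding `limit`, and the largest value
    by repeatedly drawing from a distribution via the zero-argument `draw`."""
    samples = [draw() for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    prob_exceed = sum(x > limit for x in samples) / n_samples
    return mean, prob_exceed, max(samples)


# Assumed stand-in distribution: lognormal with mu = 0 and sigma = 1.
rng = random.Random(0)  # fixed seed so the run is repeatable
mean_est, p_est, max_est = monte_carlo_summary(
    lambda: rng.lognormvariate(0.0, 1.0), 100_000, math.e
)

print(mean_est, p_est, max_est)
```

For this particular distribution the exact answers are known (a mean of exp(0.5), about 1.649, and a probability of about 0.159 of exceeding e), so the sampled estimates can be checked against them. The method earns its keep when no such closed form is available.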

However, many problems that arise in physics and statistics entail mathematical formulations that require use of complex computational methods, such as Markov chain Monte Carlo methods, to obtain the samples on which the calculations are based. A common difficulty in the use of these methods (and even more advanced ones based on them) arises when the probability distribution has multiple statistical modes, i.e., several values at which the likelihood is greater than at any nearby value, some of which may be 'local' modes that are not the greatest overall. In such cases the computations may take a very long time, or may get 'trapped' in a local mode. In her thesis, Woodard has been able to characterize conditions under which computations will work well and under which they will unfortunately be slow, although at this point the conditions are difficult to verify for specific problems. "If we can understand the limitations of current methods for multimodal Bayesian statistics, that can help us design new computational methods and also to understand when we are using models that are simply too hard to compute."
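The 'trapping' behavior can be reproduced with a small Python sketch. It uses a generic random-walk Metropolis sampler, one of the simplest Markov chain Monte Carlo methods, on an assumed two-mode target; the target, the step sizes, and all names are hypothetical and are not the methods or conditions analyzed in Woodard's thesis.

```python
import math
import random

MODE = 10.0  # assumed: the two modes sit at -MODE and +MODE


def log_target(x):
    """Log density (up to a constant) of an equal mixture of N(-MODE, 1) and N(+MODE, 1)."""
    left = -0.5 * (x + MODE) ** 2
    right = -0.5 * (x - MODE) ** 2
    top = max(left, right)  # factor out the larger term for numerical stability
    return top + math.log(math.exp(left - top) + math.exp(right - top))


def metropolis(log_density, start, step_sd, n_steps, rng):
    """Random-walk Metropolis sampler with Gaussian proposals."""
    x = start
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_sd)
        log_ratio = log_density(proposal) - log_density(x)
        # Always accept uphill moves; accept downhill moves with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


def share_in_right_mode(chain):
    """Fraction of the chain spent on the positive side, i.e. near the right-hand mode."""
    return sum(x > 0 for x in chain) / len(chain)


rng = random.Random(1)
small_steps = metropolis(log_target, -MODE, 0.5, 20_000, rng)   # cautious proposals
large_steps = metropolis(log_target, -MODE, 10.0, 20_000, rng)  # bold proposals

print(share_in_right_mode(small_steps))
print(share_in_right_mode(large_steps))
```

Both modes carry equal weight, so a sampler that mixes well should spend roughly half its time near each. The chain with small steps starts in the left mode and, in a run of this length, is not expected to cross the low-probability region between the modes, while the chain with large steps jumps back and forth. In realistic problems the locations of the modes are unknown, so simply enlarging the step is rarely an adequate fix.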

In spring 2009, Woodard will teach an advanced graduate course on topics related to her research.
