Monday’s WSJ published a purported research piece assessing the forecast accuracy of FOMC participants. Unfortunately, the work is fatally flawed from a design and conceptual perspective for reasons that will be discussed below. But perhaps more importantly, such work implicitly carries with it potentially damaging implications for the policy formulation process. Before we delve into the issues in more detail, it is worth emphasizing one positive point about this work. Specifically, the Journal’s columnists have been quite transparent with respect to both their methodology and the data that they have employed. For those who opt to do some digging, the WSJ website accompanying the article provides useful detailed information.
What the Authors Did
The WSJ columnists attempt to assess the accuracy of FOMC participants’ forecasts of GDP growth, inflation, and employment. Their input data were generated by scouring the speeches, public statements, and testimony of FOMC participants concerning their expectations for the three variables of interest. In total, some 700-plus statements by FOMC participants were examined over the period from June 2009 through December 2012. The forecasts were distributed unequally across the three variables, yielding roughly 560 usable scores. Where possible, the columnists compared numerical forecast statements with the actual realizations of the variables in assigning arbitrary scores. A forecast received a score of 1 if it fell within one standard deviation of what the authors considered the normal variation in that variable, and a score of -1 if it fell outside that range. More qualitative statements were assigned scores of 0.5 or -0.5, depending on whether they were generally in the "right direction but lacked precision." When statements were found to contain mixtures of correct and incorrect predictions, a score of zero was sometimes assigned. The scores were then averaged for each FOMC participant for each of the three variables individually and then across all three, and ranks were computed. Reportedly, the work was evaluated by both outside academics and Moody’s.
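For readers who want to see the mechanics, the following is a minimal sketch, in Python, of how a scoring-and-ranking scheme of this kind works. The participants, statements, and numbers are hypothetical and the function names are my own; only the point values (plus or minus 1, plus or minus 0.5, and zero) follow the article’s description.

```python
# Illustrative reconstruction of a WSJ-style scoring scheme. All statements and
# numbers are hypothetical; only the point values follow the article's description.
from statistics import mean

def score_numerical(forecast, actual, sd):
    """+1 if the forecast is within one 'normal' standard deviation of the outcome, else -1."""
    return 1.0 if abs(forecast - actual) <= sd else -1.0

def score_directional(right_direction):
    """+0.5 for a qualitative call in the right direction, -0.5 otherwise."""
    return 0.5 if right_direction else -0.5

# Hypothetical scored statements, grouped by participant and variable.
scores = {
    "Participant A": {"gdp": [1.0, -1.0], "inflation": [1.0, 0.5, 1.0], "employment": [0.5]},
    "Participant B": {"gdp": [-1.0],      "inflation": [1.0, 1.0],      "employment": [1.0, 0.0]},
}

# Average within each variable, then across the three variables, then rank.
overall = {name: mean(mean(v) for v in by_var.values()) for name, by_var in scores.items()}
ranking = sorted(overall, key=overall.get, reverse=True)
print(overall, ranking)
```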
What’s Wrong: Design and Concept
There are three critical problems with the study that suggest no weight should be given to the results. First, the scores are based upon arbitrary weights, or subjectively defined points, that mix level and direction. The plus or minus 1 scores map whether a forecast fell within or outside one standard deviation, while the plus or minus half-point scores reflect only direction. A score of zero meant that a statement contained a mixture of correct and incorrect forecasts. The choice of each of these weights determines, by definition, the range of feasible scores and the ordinal rankings that flow from those scores. In other words, a different choice of point values would produce different rankings, and hence the rankings are essentially meaningless.
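A small, hypothetical illustration of the point: the same set of statements, scored under two equally arbitrary point assignments, produces two different orderings. The statements and the alternative weights below are invented for the example.

```python
# Two equally arbitrary point assignments applied to the same hypothetical statements.
from statistics import mean

# Each statement is (kind, was_it_right): kind is "numerical" or "directional".
statements = {
    "Participant A": [("numerical", True), ("numerical", True), ("directional", False)],
    "Participant B": [("numerical", False), ("directional", True), ("directional", True)],
}

def score(kind, right, weights):
    return weights[kind] if right else -weights[kind]

wsj_style = {"numerical": 1.0, "directional": 0.5}  # levels weighted more than direction
alt_style = {"numerical": 0.5, "directional": 1.0}  # direction weighted more than levels

for label, weights in (("WSJ-style", wsj_style), ("Alternative", alt_style)):
    totals = {p: mean(score(k, r, weights) for k, r in s) for p, s in statements.items()}
    print(label, sorted(totals, key=totals.get, reverse=True), totals)
# The ranking flips: Participant A leads under the first assignment, B under the second.
```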
There are other nits we can pick with the methodology. While the columnists looked at more than 700 statements, they do not tell us in the article what the final sample size was overall or what it was for individual FOMC participants; we can ferret that information out only by going to the website. Nor do they tell us how they defined the “normal standard deviation” upon which they based their arbitrary scoring system. Was the measure of the variability of GDP, for example, computed over some long time period or just over the sample period – or was it based upon measures of the variability of professionals' (or other) forecasts of GDP? The choice of the standard deviation measure determines whether a participant's statement was assigned a plus or a minus 1, and that choice could significantly affect a score, especially for those FOMC participants with only a few scored observations. For example, the standard deviation of GDP growth over the current recovery examined in the WSJ report is different from what it would be over a much longer time span or over the course of a typical recovery or economic cycle.
It should also be noted that the sample sizes for each participant varied significantly. For example, there were only 10 scored observations for Governor Duke, consisting of three each for inflation, growth, and employment, plus one score of zero. This is hardly a sample that would justify her high rank in the overall assessment of the FOMC participants’ forecasts. In contrast, there were about 35 scored observations for President Evans, but they were unevenly distributed across the three variables (18 for inflation, 8 for labor, and 9 for growth). Thus, President Evans’s overall score and ranking are heavily influenced by his inflation forecasts. Similar imbalances, and hence biases in rankings, are reflected in the distribution of scores for many of the other participants as well.
The second critical problem relates to how the structure of the weighting system treats errors across the three variables of interest. Some variables are easier to forecast than others. Research on forecast accuracy that I conducted with former colleagues at the Federal Reserve Bank of Atlanta showed that predicting GDP is much more difficult than predicting employment.[1] Hence we argued that any weighting system should penalize errors in easy-to-forecast variables more heavily than errors in hard-to-forecast variables. The WSJ article does consider the variability of the variables, but its weighting system treats an error in GDP the same as an error in employment. Indeed, using the columnists’ methodology, the standard deviation of real GDP growth over the June 2009–December 2012 period was 1.29 percentage points, around a mean of 1.96 percent. If this were the period employed to define a “normal standard deviation,” then any forecast of real GDP growth between 0.67 percent and 3.25 percent would be considered equally accurate and assigned a score of 1, despite the fact that the percentage errors at the edges of that band would be huge and would fail a test of reasonableness.
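The arithmetic behind that band follows directly from the mean and standard deviation quoted above; the assumed actual outcome in the sketch below is purely illustrative.

```python
# Worked arithmetic for the band implied by the figures quoted above.
mean_growth, sd = 1.96, 1.29
band = (mean_growth - sd, mean_growth + sd)        # (0.67, 3.25): every forecast in this band scores +1
# If actual growth were to come in at the bottom of the band, a forecast at the top
# would still be scored as accurate, even though the relative error is enormous.
actual, forecast = band[0], band[1]                # illustrative outcome and forecast
relative_error = abs(forecast - actual) / actual   # about 3.85, i.e., a miss of roughly 385 percent
print(band, round(relative_error, 2))
```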
Furthermore, the columnists’ methodology treats an error one standard deviation above the mean the same as an error one standard deviation below the mean, even though the actual value may have been, for example, below the mean. Such errors are clearly not equal in importance. And of course we know that the time frame encompassed by a prediction, whether it is quarterly or some longer period, is also critical to the assessment of error. It appears from the materials provided that the data are drawn from a mixture of time periods, and this inconsistency again taints the analysis and the weight assignment.
Finally, the problem introduced by the uneven distribution of scores across the three variables, noted earlier, becomes even more important once we recognize the relative ease of predicting inflation and employment compared with growth. Some forecasters, including Presidents Evans and Rosengren, had relatively few scored statements on the harder-to-forecast growth variable compared with the other two variables. In such instances, participants with relatively fewer scored growth forecasts tended to score better than those who provided more statements on growth.
The third critical problem, which my Federal Reserve colleagues and I also pointed to (and which is addressed in our own methodology), is that FOMC participants are making joint forecasts of employment, inflation, and GDP, and we know that certain combinations of these variables are more likely than others. The WSJ authors should have considered this joint-forecast issue in assigning weights. Our methodology, for example, gives higher weight to a combination of correct forecasts and penalizes a joint forecast that may be accurate in one dimension but inaccurate in the other two. The WSJ authors consider only part of that problem and assign a neutral score to a mixed forecast, when such errors should have received a more negative weight. Again, this is an example of how the choice of the weighting system can potentially skew the results.
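To give a flavor of what scoring a joint forecast can look like, here is a minimal sketch using a Mahalanobis-type distance, which penalizes combinations of errors that are historically unlikely. This is only an illustration of the general idea, not the methodology of the papers cited in the footnote; the covariance matrix and forecasts are hypothetical.

```python
# A sketch of a joint (multivariate) forecast score: measure the forecast-error
# vector against the covariance of the three variables. Smaller is better, and
# errors in historically unlikely combinations are penalized more heavily.
# All numbers below are hypothetical and chosen only for illustration.
import numpy as np

# Hypothetical covariance of (GDP growth, inflation, unemployment) outcomes.
cov = np.array([[ 1.66, 0.20, -0.60],
                [ 0.20, 0.49,  0.10],
                [-0.60, 0.10,  0.81]])

def joint_score(forecast, actual, cov):
    """Mahalanobis-type distance of the joint forecast error; smaller is better."""
    err = np.asarray(forecast) - np.asarray(actual)
    return float(err @ np.linalg.inv(cov) @ err)

actual = [2.0, 1.8, 7.8]
print(joint_score([2.5, 2.0, 7.5], actual, cov))   # modest miss in a plausible direction
print(joint_score([3.5, 1.0, 9.0], actual, cov))   # larger errors in an unlikely combination score worse
```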
The Bigger Problem
Unfortunately, the false sense of precision that many readers may derive from such work can have significant and potentially damaging implications for markets and for the policy-making process more generally. The rankings imply that certain people – Vice-Chairman Janet Yellen and President Dudley, for instance – are “better forecasters” than others. Hence, based upon the WSJ's rankings, markets might now give more credence to such people's positions than is warranted. A more prudent approach would be to recognize that, when we find ourselves, as we do now, well outside the normal range of economic activity and policy, with data being generated that are not accounted for in broad-based econometric models, we would do best to sift carefully through questions of policy, armed with a variety of analyses and opinions. In fact, we would hope and expect that reasonable people might disagree over both policy and the direction of the economy.
Our own research on forecast accuracy supports three key observations. First, virtually no forecaster is consistently right over protracted periods of time. Second, some forecasters are better at predicting particular parts of an economic cycle, and turning points, than they are at predicting others. This raises the important issue of deciding whose forecasts to give more weight to over which parts of the business cycle. Finally, we found that forecasts that are more model-based and consensus-based tend to consistently outperform those of individual forecasters.
For all these reasons, it would be a shame if work such as that published in the WSJ were to lead FOMC participants to downgrade or discount the views of some of their colleagues in the policy-formulation process. It would also be detrimental if the boards of directors of Reserve Banks were to react in any way to such misleading information. In either case, the damage to the policy-making process could be very significant for the country.
If one is truly interested in assessing forecast accuracy, then the appropriate data to examine are the sets of forecasts in the FOMC’s SEP (Summary of Economic Projections) process, using a much more sophisticated methodology than that employed by the WSJ. Unfortunately, at this time, those forecasts are available only inside the Federal Reserve System.
[1] See Eisenbeis, Robert A., Daniel Waggoner, and Tao Zha, “Evaluating Wall Street Journal Survey Forecasters: A Multivariate Approach,” Business Economics, July 2002; Eisenbeis, Robert A., Tao Zha, Daniel Waggoner, and Andrew Bauer, “Forecast Evaluation with Cross-Sectional Data: The Blue Chip Surveys,” Economic Review, Federal Reserve Bank of Atlanta (Second Quarter 2003); and Eisenbeis, Robert A., Andrew Bauer, Daniel F. Waggoner, and Tao Zha, “Transparency, Expectations, and Forecasts,” Economic Review, Federal Reserve Bank of Atlanta, 91 (First Quarter 2006). Incidentally, our methodology has been used to assess the accuracy of economic forecasters by both the Wall Street Journal and USA Today.
Source: Cumberland Advisors