Forecast#

Overview#

Forecast provides a systematic way to make predictions across numerous data environments:

  • Single time series (e.g., trading volume for a single grain company), with or without “exogenous” (“independent”) data (e.g., weather and advertising).

  • Multiple time series (e.g., trading volume of multiple grain companies that would likely be correlated with each other), with or without “exogenous” (“independent”) data (e.g., weather and advertising).

  • Cross-sectional data across many different units (e.g., regions).

  • “Fixed effects” data that allows for both time and categorical controls.

  • Data pooled across the cross-sectional and time dimensions.

Forecast includes many advanced estimation features:

  • Rigorously handles categorical (non-numerical) Exogenous (“right-hand side”) and Endogenous (“left-hand side”) data.

  • “Confidence intervals” (numerical Endogenous data) or “prediction ranges” (categorical Endogenous data) reflecting uncertainty in the forecast.

  • Recommends the best forecast methods consistent with your data. Methods are chosen from a range of advanced techniques available from the fields of statistics and machine learning.

  • When appropriate, calibrates (tunes) the “best” statistical method (e.g., determines the optimal (p, d, q) combination for time series models) against a suitable “information criterion.”

Examples#

Clicking on Forecast produces the following form (with some selections filled for a specific example):

Step 1: Select Data#

Select the relevant data.

Explanation:

  • “Endogenous data to forecast” points to the Excel location of the data desired to be forecasted, either into the future or across different scenarios. In statistics, Endogenous data is sometimes called “dependent” or “left-hand side” data. There are two optional settings for reading in the data:

    • “Top row is the header row” Click this option if the top row in the endogenous data should be treated as the header.

    • “The left most column is an index” Click this option if the leftmost column in the endogenous data should be treated as the index column.

  • “Is there exogenous data?” Click this option to allow the statistical forecast model to include additional Exogenous data that might help determine the values of the Endogenous data. Exogenous data is sometimes called “independent” or “right-hand side” data. This selection is optional because some Forecast methods do not require Exogenous data. The best fitting statistical models, however, typically use Exogenous data. The presence of Exogenous data also allows for statistical forecasting against hypothetical future changes. Clicking this option unlocks several more choices:

    • “Exogenous data” points to the Excel location of the Exogenous data.

    • “Preforecast exogenous data” points to the Excel location of the Exogenous data for the periods (or scenarios) to be forecasted, that is, the Exogenous values on which the out-sample forecast is based. This information is required if Exogenous data is provided.

    • “One-hot encoding of non-numerical exogenous data?” Click this option to allow categorical (non-numerical) Exogenous data to be included in the forecast by converting categorical data into a set of “dummy” (fixed effects) variables. If this option is not clicked, Finplicity will attempt to convert data stored as Excel text into numerical format (e.g., “500.3” becomes 500.3), generating an error if it fails (e.g., “dog” has no numerical value).

    • “Endogenous data should be treated as categorical” Click this option to force Endogenous data to be treated as categorical even if it is numerical. For example, a firm might have four distinct plant operations, labeled as 1, 2, 3, and 4. This option allows these plant identifiers to be treated appropriately as separate plants instead of, for example, representing four different levels of output of the same plant that varies with the exogenous data.

    • “Join rule if Endogenous data index and Exogenous data index ever differ:” Endogenous and Exogenous data must have a common table index (e.g., date) to line up. Select the index rule to be used if the actual indexes ever differ between Endogenous and Exogenous data.

      • Endogenous: select this option to insert or delete rows in Exogenous data so that its index values exactly match those of the Endogenous data. If rows are added, empty values can be filled in with the interpolation discussed below.

      • Exogenous: select this option to insert or delete rows in Endogenous data so that its index values exactly match those of the Exogenous data. If rows are added, empty values can be filled in with the interpolation discussed below.

      • Intersection: select this option to include only rows with index values common to both Endogenous and Exogenous data. All data is dropped if there are no index values in common.

      • Union: select this option to insert or delete rows in both Exogenous and Endogenous data to include all index values found in the Exogenous and Endogenous data. If rows are added, empty values can be filled in with interpolation discussed below.

    • “Select missing continuous data rule (missing categorical data is always dropped)” Decide how to treat missing data, either with linear interpolation using the surrounding data in the same column, or by dropping the missing data.
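The join rules and the missing-data rule described above resemble standard table joins followed by interpolation. The following is a minimal sketch using pandas, not Finplicity's actual implementation; the data values, the `JOIN_RULES` mapping, and the `align` helper are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical data: Endogenous volume and Exogenous price on partly different dates.
endog = pd.DataFrame(
    {"vol1": [100.0, 110.0, 130.0]},
    index=pd.to_datetime(["2018-04-02", "2018-04-03", "2018-04-05"]),
)
exog = pd.DataFrame(
    {"Close": [10.0, 12.0, 13.0]},
    index=pd.to_datetime(["2018-04-03", "2018-04-04", "2018-04-05"]),
)

# Assumed mapping from each join rule on the form to the analogous pandas join.
JOIN_RULES = {
    "Endogenous": "left",
    "Exogenous": "right",
    "Intersection": "inner",
    "Union": "outer",
}

def align(endog, exog, rule, missing="interpolate"):
    """Join on the index, then interpolate or drop missing numeric values."""
    joined = endog.join(exog, how=JOIN_RULES[rule]).sort_index()
    if missing == "interpolate":
        # Linear interpolation within each column using the surrounding rows.
        # Leading gaps have no earlier neighbor and therefore stay empty.
        return joined.interpolate(method="linear")
    return joined.dropna()

union = align(endog, exog, "Union")
inter = align(endog, exog, "Intersection")
print(union)
print(inter)
```

In this sketch, “Union” keeps all four dates and fills the missing vol1 value on 2018-04-04 from its neighbors, while “Intersection” keeps only the two dates present in both tables.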

Step 2: Select Model Options#

Select the desired statistical forecast model, choose the desired forecast output and where to place it.

Explanation:

  • “Select model” Finplicity automatically presents appropriate statistical models based on your data. The models are listed in the order in which they should be tried. The models listed will differ from the ones shown below, depending on the data provided.

  • “Output location for (out-sample) forecast” Select the location for the (“out-of-sample”) forecast of the Endogenous data based on the chosen “Number of Forecast Periods”. The number of forecast periods can be either fixed (static) as an “Input value” or read from an Excel “Input cell”. Selecting “Input cell” allows the number of forecast periods to be changed in Excel either directly on the worksheet or using the Control Board discussed below. This information is always required.

  • “Output location for (in-sample) fit” Select this option to examine how the statistical model would have performed (“fitted”) over the “in-sample” historical data used to form the statistical model.

  • “Confidence / Prediction Interval: in-sample fit” Select this option to show confidence intervals (if Endogenous data is numerical) or prediction intervals (if Endogenous data is categorical) for the in-sample fit.

  • “Confidence / Prediction Interval: out-sample forecast” Select this option to show confidence intervals (if Endogenous data is numerical) or prediction intervals (if Endogenous data is categorical) for the out-sample forecast.

  • “Launch Control Board” Select this option to launch the Control Board discussed below.

Step 3: Advanced Settings#

Choose advanced settings specific to the forecast method chosen.

Methods: Numerical Endogenous Data#

Forecast methods can be decomposed into two distinct sets of methods, depending on whether the Endogenous (“dependent” or “left-hand side”) data is numerical or categorical. This section reviews the methods available when the Endogenous data is numerical. A complete example is provided in Example: Numerical Endogenous Data.

The figure above shows the first few rows of a data set. The Endogenous data includes trading volumes for two firms (vol1 and vol2) that are indexed by the Date column. The Exogenous data includes intra-day prices (High, Open, Close) and a subjective categorical characterization of the weather that can take the values Calm or Storm. The data is messy, with blanks (shown in red), and the Exogenous data discussed below might have fewer rows than the Endogenous data. Still, Finplicity is robust to these and other issues.

Finplicity supports multiple types of numerical Endogenous data:

Case: Single-variable Endogenous Data with no Exogenous data#

Suppose we want to forecast future values of variable vol1 but do not have the shown Exogenous data. In Step 1, select the range A3:B63 (where 63 is the last row of data) but do not check the exogenous data indicator.

Only one appropriate forecasting model is shown in Step 2, SARIMA, which is documented in more detail within the Python “statsmodels” library, https://www.statsmodels.org.

Case: Multi-variable Endogenous Data with no Exogenous data#

Suppose we forecast future values of variables vol1 and vol2 together to take advantage of their correlation, which can increase forecasting power. We do not have access to Exogenous data. Now, select the range A3:C63 in Step 1 but do not check the exogenous data indicator. In this case, only one appropriate model, VAR, is shown in Step 2, which is documented in more detail within the Python “statsmodels” library, https://www.statsmodels.org.

Case: Single-variable Endogenous Data with Exogenous data#

Suppose we forecast future values of variable vol1. However, we now do have access to the Exogenous data. (Preforecast exogenous data might have been previously forecasted from the Exogenous historical data using the “Multi-variable Endogenous Data with no Exogenous Data” case discussed above.) Step 1 now looks like the following form:

In this case, several more models in Step 2 are appropriate:

  • OLS, Ordinary Least Squares, as documented in more detail within the Python “statsmodels” library, https://www.statsmodels.org.

  • SARIMAX, as documented in more detail within the Python “statsmodels” library, https://www.statsmodels.org.

  • Random Forest (Continuous), as documented in more detail within the Python “scikit-learn” library, https://scikit-learn.org.

  • SVR, Support Vector Regression, as documented in more detail within the Python “scikit-learn” library, https://scikit-learn.org.

Case: Multi-variable Endogenous Data with Exogenous data#

Suppose we forecast future values of variables vol1 and vol2 together to take advantage of their correlation, increasing their forecasting power. We now do have access to the Exogenous data. Step 1 looks like the following form:

In this case, the only appropriate model in Step 2 is VARMAX, as documented in more detail within the Python “statsmodels” library, https://www.statsmodels.org.

Methods: Categorical Endogenous Data#

Forecast methods can be decomposed into two distinct sets of methods, depending on whether the Endogenous (“dependent” or “left-hand side”) data is numerical or categorical. This section reviews the methods available when the Endogenous data is categorical. A complete example is provided in Example: Categorical Endogenous Data.

The figure above shows the first few rows of a data set. The Exogenous and Preforecast data is the same as before, but the Endogenous data is now categorical and indicates a decision (Buy, Sell, or Hold) previously made. The objective is to forecast the best future decision for the Preforecast data. Two rules must always be satisfied with categorical Endogenous data:

  • Endogenous data must be a single column, as there is no reliable statistical method that supports multiple columns or a mix of Endogenous categorical and numerical data.

  • Exogenous data (and, hence, Preforecasted exogenous data) must be provided, as there are no reliable statistical methods that permit forecasting of categorical data without Exogenous data.

The form for Step 1 looks like the following figure:

Notice that “Endogenous data should be treated as categorical” (see red arrow) must be checked.

The form for Step 2 should look like the following figure, with the forecasting options shown:

Several methods are available, as documented in more detail within the Python “scikit-learn” library, https://scikit-learn.org.

Example: Numerical Endogenous Data#

This section considers a complete example where the Endogenous data is numerical.

The above figure shows the first few rows of a data set. The Endogenous data includes trading volumes for two firms (vol1 and vol2) that are indexed by the Date column. The Exogenous data includes intra-day prices (High, Open, Close) and a subjective categorical characterization of the weather that can take the values Calm or Storm. The data is messy, with blanks (shown in red), and the Exogenous data discussed below might have fewer rows than the Endogenous data. Still, Finplicity is robust to these issues.

This example shows how to forecast vol1 data into the future dates listed in the Preforecast table.

Step 1: Select Data#

Fill out the form in Step 1 as shown below.

Step 2: Select Model Options#

Choose Ordinary Least Squares (OLS) from the model dropdown list. Then fill out the remainder of the form as follows:

Step 3: Advanced Settings#

For OLS, the only required decision is whether to allow for a constant in the model. In general, the available settings will vary by the model selected in Step 2. The shown default values are usually the best settings for the problem and should be changed only by advanced users.

Output to Excel Worksheet#

The output to Excel will look like the following figure.

Explanation:

  • “In-sample fit” reflects how the chosen Forecast method “fits” over the Original (“actual” or “historical”) data, sometimes called a “back test”. The in-sample fit is often used to judge the performance of the chosen Forecast method. (Alternatively, the Original Endogenous and Exogenous data can be split between “training” and “testing” sections, a method that has advantages and disadvantages relative to the simpler approach taken herein.) Missing data is handled as follows:

    • Data with an index value present but with missing data values are filled in by interpolation since “interpolate” was chosen in Step 1.

    • Data with missing index values are dropped because the index is an important determinant for interpolation. For example, dates might not be equally spaced.

  • “In-sample Confidence Interval” at level alpha = 0.05 (chosen above) indicates possible variation in the in-sample fit data. There is only a 5% chance that the Original (or historical) data for each date would be outside the “lower” and “upper” value found. Naturally, the lower-to-upper range includes the In-sample fit (point estimate) discussed above, which reflects an average value.

  • “Out-sample fit” reflects Forecast’s best guess for the future dates in the Preforecast data. The corresponding uncertainty is reflected in the “Out-sample Confidence Interval”.

Control Board#

Since the Control Board option was selected, it will be shown in a separate window, which can even be moved to a different computer display for computers with more than one active monitor.

An explanation of each pane in the Control Board is as follows.

Pane: Forecast

  • Choose which Endogenous variables to plot (vol1 is the only selection in this example).

  • The vertical dotted line separates Original (sometimes called “historical” or “actual”) data to the left of the vertical dotted line from Forecast data to the right of the vertical dotted line.

    • “Original Values” is the Endogenous data chosen in Step 1.

    • “Predicted Values” combines the in-sample fit data (left of dotted line) with the out-sample forecast data (right of dotted line).

  • This pane is interactive: Hover over the data to see values. Use box select, zoom, and other control options. Add or remove series by clicking on their labels in the legend area. Double-click on any blank area to un-zoom.

    • Box Select:

    • Zoom:

Pane: Compare Two Reports

  • Select available reports from the pull-down menus.

  • The available reports will vary significantly by the forecast method chosen in Step 2. However, they have a central theme in that they usually indicate a sense of the model fit by comparing the in-sample (fit) against the Original (“actual” or “historical”) values. Some methods also produce additional reports that should be recognizable to users familiar with those methods. We are constantly developing learning materials related to each method.

  • In this example, notice that the black dots in the leftward report are almost symmetrically distributed around the red 45-degree line without any noticeable systematic bias (above or below). This result suggests that the forecast method (OLS) is not biased for the current example. However, the fact that the dots do not actually line up on the 45-degree line suggests the current forecast method (OLS) fails to handle extreme values efficiently. This “lack of efficiency” can also be seen in the histogram in the rightward report, which shows that the fitted (red) values are more concentrated than the actual data (green).

  • Each report is static and summarizes the original data in the Forecast pane prior to any box select used in the Forecast pane.

Example: Categorical Endogenous Data#

The above figure shows the first few rows of a data set. The Exogenous and Preforecast data is the same as in the last example, but the Endogenous data is now categorical and indicates a decision (Buy, Sell, or Hold) previously made. In this example, we want to forecast this decision into the future based on the Exogenous and Preforecast data. To demonstrate the flexibility of the Forecast function, the data is formatted a bit more compactly, without a separate index for Exogenous data.

Step 1: Select Data#

Fill out the form in Step 1 as shown below.

Explanation:

  • For Exogenous data, the index column is not chosen. Since the index column was provided for Endogenous data, it will be used for Exogenous data by matching relative row positions (e.g., the 9th row in Exogenous data uses the index found in the 9th row of Endogenous data). The Exogenous data doesn’t even have to be located right next to the Endogenous data for this matching to occur.

  • “Endogenous data should be treated as categorical” is checked to indicate that the data is categorical.

Step 2: Select Model Options#

In this example, select the Logistic model. For most categorical data, the Logistic model is the most appropriate option — so much so that you should have a compelling reason to not use it. Select the other options shown on the form. Notice that the alpha value information for confidence intervals is not displayed, as that information does not apply when the endogenous variable is categorical. As discussed below, prediction intervals are instead used for categorical data.

Step 3: Advanced Settings#

The selected model does not require any additional settings for our purposes.

Output to Excel Worksheet#

The following output is produced at the specified location in the Excel Worksheet.

Explanation: Prediction Intervals are used instead of Confidence Intervals for in-sample fit and out-sample forecast.

  • A prediction interval gives the probability that a given value of the categorical variable will arise.

  • For example, for the “In-sample Prediction Interval” on 4/10/2018, Forecast estimated that there was an 8.4% chance that “Buy” would have materialized, an 88.8% chance that “Hold” would have materialized, and a 2.8% chance that “Sell” would have materialized. In fact, the actual data in the Endogenous data columns indicates that “Hold” materialized on 4/10/2018.

  • The same reasoning can apply to out-sample forecast values. On 6/28/2018, Forecast estimates a Buy decision with a 53.4% chance.
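The per-category probabilities described above are the kind of output a multinomial logistic model produces. The sketch below uses the scikit-learn `LogisticRegression` class on simulated data; the features, the rule that generates the Buy/Hold/Sell labels, and all variable names are hypothetical, and the numbers it prints are unrelated to the worksheet example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical Exogenous data (two numeric columns) and past decisions.
X_hist = rng.normal(size=(90, 2))
score = X_hist[:, 0] + 0.5 * X_hist[:, 1]
y_hist = np.where(score > 0.5, "Buy", np.where(score < -0.5, "Sell", "Hold"))
X_future = rng.normal(size=(3, 2))  # stands in for the Preforecast data

clf = LogisticRegression(max_iter=1000).fit(X_hist, y_hist)
proba = clf.predict_proba(X_future)  # one row per date, one column per category
predicted = clf.predict(X_future)    # the category with the highest probability

for label, row in zip(predicted, proba):
    print(label, dict(zip(clf.classes_, row.round(3))))
```

Each row of probabilities sums to 1 (100%), and the predicted decision is the category with the largest probability in that row.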

Control Board#

An explanation of each pane in the Control Board is as follows.

Pane: Forecast

  • Results are shown using a novel graph for the Trade categorical variable.

  • The vertical dotted line separates historical data (left) from forecast (right) using graphical triplet and paired bar graphs, which are described in more detail below.

    • Historical triplet bar graphs: (actual, fitted, stacked probabilities)

    • Forecast paired bar graphs: (predicted, stacked probabilities)

  • This pane is interactive: Hover over the data to see values. Use box select, zoom, and other control options. Add or remove series by clicking on their labels in the legend area.

Pane: Compare Two Reports

  • Select available reports from the pull-down menus.

  • The available reports will vary by the forecast method used and usually indicate a sense of the model fit by comparing the in-sample (fit) against the Actual (historical) values. In this example, the left report shows the “Prediction matrix normalized” (left) that compares predicted versus

  • Each report is static and summarizes the original data in the Forecast pane prior to any box select used in the Forecast pane.

Forecast Paired Bar Graphs in More Detail

Line graphs that are typical with numerical Endogenous data are meaningless with Endogenous categorical data. Forecasted data in the Forecast pane are grouped in (predicted, stacked probabilities) pairs. For example, use the Zoom option in the “hover dashboard” shown below that appears when you hover your mouse in the upper right-hand corner of the figure shown in the Forecast pane.

Highlight some of the forecast to the right of the vertical dotted line, as shown below.

The following zoomed-in figure now appears:

There are two bars for June 28, 2018. The first bar is always a solid color and indicates a “predicted Buy” in this example. The second bar is a stacked probability bar that indicates the prediction probabilities between “probability Sell” (blue), “probability Hold” (orange) and “probability Buy” (green). These probabilities add up to 1 (100%). Similar information is shown for June 29. Hover the mouse to see the underlying data, as shown for June 29, which indicates a 43.9% probability of Sell. Double left-click on a blank area in the graph to un-zoom.

Historical Triplet Bar Graphs

Historical data in the Forecast pane are grouped in (actual, fitted, stacked probabilities) triplets. For example, use the Zoom option in the “hover dashboard” shown below that appears when you hover your mouse in the upper right-hand corner of the figure shown in the Forecast pane.

Highlight some of the historical data to the left of the vertical dotted line, as shown below.

The following zoomed-in figure now appears:

There are three bars for April 9, 2018. The first bar is always a solid color and indicates that the original trade was “Sell”. The second bar is always a solid color and indicates that the predicted (fitted) trade was “Sell”. The third bar holds stacked probabilities that indicate the prediction probabilities between “probability Sell” (blue), “probability Hold” (orange) and “probability Buy.” Naturally, these probabilities add up to 1 (100%). Similar information is shown for April 10 and April 11. Hover the mouse on the graph to get more detailed information. Double left-click on a blank area in the graph to un-zoom.

Reports in More Detail

There are numerous potential reports available, including:

  • Histogram: A histogram of Original (historical) and In-sample fit (predicted for historical data) over the potential categorical values, normalized so that the total probability of each series sums to one.

    In the current example, notice that the distribution of buy, hold and sell decisions in the in-sample fit lines up closely with the distribution in the Original data. However, that does not mean that they line up “correctly” at each calendar date, only the total counts are similar.

  • In-sample fit matrix: The in-sample fit matrix helps determine if Forecast’s predictions for the in-sample fit actually line up with the Original data at the row level (e.g., at each calendar date).

    In the current example, when “Hold” was the “True” decision in the Original data (vertical axis), Forecast correctly “Predicted” the decision of “Hold” for the in-sample fit (the horizontal axis) 32 times. However, the decision “Buy” was incorrectly predicted 1 time and “Sell” was incorrectly predicted 1 time.

  • In-sample fit matrix, normalized: The normalized version of the “In-sample fit matrix” states the same information but as a percentage that sums to 1 (i.e., 100%) across each row.

    In the current example, when “Hold” was the “True” decision in the Original data (vertical axis), Forecast correctly “Predicted” the decision of “Hold” for the in-sample fit (the horizontal axis) 94% of the time. However, the decisions of “Buy” and “Sell” were each incorrectly predicted 2.9% of the time. A “perfect” fit would have 1.0 (100%) across the diagonal from left to right. But perfect fits might be the result of “data mining” where a model is “over-fitted” to the Original data.

  • In-sample fit probabilities: Confidence Intervals do not make sense when the Endogenous data is categorical. “Prediction Intervals” are used instead. The In-sample fit probabilities table provides even more granular information than the In-sample fit matrix by showing the uncertainty in Forecast at the row level. The column “Original” indicates the original decision made in the data.

    In the current example, look at the first row of the table for the date 2018-04-06. Forecast reports that if it is again provided with the same Exogenous data that it was previously provided for 2018-04-06, Forecast will predict “Buy” with a roughly 0% chance, “Hold” with a 0.3% chance, and “Sell” with a 99.7% chance.

  • Out-sample forecast probabilities: Similarly, the out-sample forecast probabilities table shows “Prediction Intervals” for the out-sample forecast. It indicates the “confidence” of Forecast in making the predicted decision.

    In the current example, Forecast predicts “Sell” for the date 2018-06-21. However, Forecast is only 77.8% confident of this choice. Forecast places a 22.1% chance on the decision “Hold”.

  • Confusion matrix: This matrix is provided for advanced users. It is very similar to the “In-sample fit matrix, normalized” discussed above but uses bootstrapping (sampling with replacement) to provide prediction probabilities.
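The relationship between the “In-sample fit matrix” and its normalized version is simple row-wise division, sketched below with pandas. Only the “Hold” row (1, 32, 1) is taken from the example above; the “Buy” and “Sell” rows are hypothetical placeholders added so that the table is complete.

```python
import pandas as pd

labels = ["Buy", "Hold", "Sell"]
# Rows are the "True" decisions; columns are the "Predicted" decisions.
# Only the Hold row comes from the example; the other rows are hypothetical.
counts = pd.DataFrame(
    [[10, 2, 0],
     [1, 32, 1],
     [0, 3, 11]],
    index=pd.Index(labels, name="True"),
    columns=pd.Index(labels, name="Predicted"),
)

# Divide each row by its total so that every row sums to 1 (100%).
normalized = counts.div(counts.sum(axis=1), axis=0)
print(normalized.round(3))
```

For the “Hold” row, 32 of 34 cases give roughly 94%, and 1 of 34 gives roughly 2.9% for each of “Buy” and “Sell”, matching the percentages quoted above.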

FAQ#

  • What is the difference between Endogenous and Exogenous data?

Endogenous data is sometimes called “left-hand side” or “dependent” data. It is the data that is desired to be forecasted, either into the future or across different scenarios. Endogenous data is always required for a forecast. Exogenous data is sometimes called “independent” or “right-hand side” data. It is additional data that might help determine the values of the Endogenous data. Some Forecast methods do not require Exogenous data, but the best fitting methods use Exogenous data.

  • What is the difference between numerical and categorical Endogenous data?

Consider an automaker. “Car sales” is numerical data since it can be represented by a number expressed in some units (in this case, dollars). “Car color” is categorical data since it has no meaningful numerical representation.

  • Can Endogenous data containing numbers still sometimes be categorical?

In some limited cases, yes. For example, integer numbers (e.g., 1, 2, and 3) could be intended as categorical if these numbers are purely symbolic representations, for example, factory identifiers of an automaker. More generally, whether numbers should be treated as categorical data can be determined by asking whether an inequality between those numbers has any meaning for the forecast objective. For example, 2 > 1 is meaningful if these numbers represent millions of car sales since more sales (2 million) are better than fewer sales (1 million). These numbers are not symbolic and should be treated as numerical. So, in Step 1 of the Forecast form, do not check the option “Endogenous data should be treated as categorical”. But 2 > 1 is meaningless if these numbers represent factory identifiers (e.g., plant #1, plant #2, etc.). Unlike numerical data, plant identifiers can be easily switched (e.g., former plant #1 is now called plant #2 and vice versa) and still be proper identifiers since they are just symbolic. It follows that factory identifiers are categorical, as indicated by checking “Endogenous data should be treated as categorical” in Step 1 of the Forecast form.

  • Why is there a distinction whether Endogenous data is numerical or categorical but not whether Exogenous data is numerical or categorical?

Whether Endogenous data is numerical or categorical determines the available set of Forecast methods. However, most of these Forecast methods support Exogenous numerical and categorical data if “one-hot encoding” is selected in Step 1 of the Forecast form.

  • In Step 1 for Forecast, what is “one-hot encoding”? Why is it selected by default?

One-hot encoding simply means that Exogenous categorical data is converted to a set of dummy variable indicators so that each categorical value plays a unique role in the statistical model as an “intercept shifter.” It is selected by default because this approach is consistent with most applications that have categorical Exogenous data. Data in Excel that is clearly numerical will not be one-hot encoded. However, data in Excel that is clearly non-numerical will be converted to dummy variable indicators. If one-hot encoding is not selected, Forecast will instead try to convert all Excel text data to numerical format in the hope that the text data is just poorly formatted and intended as numerical. So, for example, “310.5” text data will be converted to 310.5 but “dog” cannot be converted. If one-hot encoding is not selected and the text-to-numerical conversion fails, Finplicity will generate an error message.
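The two behaviors described in this answer can be imitated with pandas. This is only an analogy under stated assumptions: the column names and values are hypothetical, and `pd.get_dummies` and `pd.to_numeric` stand in for whatever conversion Finplicity performs internally.

```python
import pandas as pd

# Hypothetical Exogenous rows with one numeric and one categorical column.
exog = pd.DataFrame({
    "Close": [10.5, 11.0, 10.8, 11.4],
    "Weather": ["Calm", "Storm", "Calm", "Storm"],
})

# One-hot encoding selected: the text column becomes 0/1 dummy columns,
# while the numeric column passes through unchanged.
encoded = pd.get_dummies(exog, columns=["Weather"], dtype=int)
print(encoded)

# One-hot encoding not selected: try to read text as numbers instead.
converted = pd.to_numeric(pd.Series(["310.5", "42"]))  # succeeds
try:
    pd.to_numeric(pd.Series(["310.5", "dog"]))
    conversion_failed = False
except ValueError:
    conversion_failed = True  # "dog" has no numerical value
print(converted.tolist(), conversion_failed)
```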

  • Is there an example of when one-hot encoding might not be reasonably selected?

Technically, yes, but it is not best practice. One-hot encoding could be reasonably unselected in limited cases if all the text data in Excel is intended to be numerical. But, even then, we suggest first using Finplicity Data Wrangling methods to clean the data as part of a Workflow that transforms the data to a proper numerical format. In fact, simply reading the data from the worksheet using the related Finplicity data reader will usually do this conversion correctly upon output.

  • I heard that “one-hot encoding” reduces “degrees of freedom” of Forecast. Isn’t that a bad thing?

One-hot encoding does reduce degrees of freedom, but that is the cost of doing valid statistical analysis for categorical Exogenous data. In the automobile example discussed above, degrees of freedom could be increased by unselecting one-hot encoding so that plant identifiers are now treated as numeric. That forecast, however, would lack meaning since variation in the plant identifier is arbitrary if taken as a numerical value; it only has meaning as an identifier. In fact, without one-hot encoding checked, switching the plant identifiers would typically change the Forecast output, which is a nonsensical result. With one-hot encoding checked, this switch would have no impact, as intended.
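The claim that switching identifiers changes a numeric-coded forecast but not a one-hot-coded one can be checked with a small least-squares experiment. The sketch below uses NumPy only; the plant outputs and the helper functions `fitted`, `numeric_design`, and `one_hot_design` are hypothetical and are not part of Finplicity.

```python
import numpy as np

# Hypothetical output for three plants, two observations each.
plant = np.array([1, 1, 2, 2, 3, 3])
y = np.array([5.0, 5.2, 9.0, 9.1, 6.0, 6.3])
# Swap the labels of plant 1 and plant 2 (same plants, new names).
swapped = np.where(plant == 1, 2, np.where(plant == 2, 1, plant))

def fitted(design, y):
    """Least-squares fitted values for a given design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

def numeric_design(ids):
    # Identifier used as a number: an intercept plus a slope on the id value.
    return np.column_stack([np.ones(len(ids)), ids])

def one_hot_design(ids):
    # One dummy column per distinct identifier (each acts as its own intercept).
    return (ids[:, None] == np.unique(ids)[None, :]).astype(float)

numeric_changes = not np.allclose(fitted(numeric_design(plant), y),
                                  fitted(numeric_design(swapped), y))
one_hot_changes = not np.allclose(fitted(one_hot_design(plant), y),
                                  fitted(one_hot_design(swapped), y))
print(numeric_changes, one_hot_changes)  # prints: True False
```

With numeric coding the fitted values depend on the arbitrary labels, while with one-hot coding each plant is fitted by its own average regardless of what it is called.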

  • What is a better way to preserve “degrees of freedom” than treating categorical data as numeric?

A better approach is to drop Exogenous data that is not very meaningful a priori. If you do not have a strong reason to include some Exogenous data, then do not include it. So-called “kitchen-sink” forecasts that include extraneous Exogenous data typically lack predictive power.

  • The date indexes in my Endogenous and Exogenous data are not the same. Is that a problem?

The Toolkit’s Forecast is typically robust to these types of data issues. If dates do not match because of Excel formatting, Finplicity can usually figure out that distinction. However, if the date indexes do not initially match because the dates differ, Finplicity provides different “join rule” options in Step 1 to use matching dates in actual forecasts. The most conservative join rule is “intersection”, which could limit your available data because it keeps only rows whose index values appear in both the original Endogenous and Exogenous data. The most liberal join rule is “union”, which uses all index values available, potentially even temporarily creating missing Endogenous data and missing Exogenous data when necessary to maximize the data set potential. Missing data is then either filled in by interpolation or dropped, based on the next option shown in Step 1, “Select missing continuous data rule …”.

  • The date indexes in my Endogenous and Exogenous data have very little overlap. Is that a problem? What can I do about it?

If the dates in the original Endogenous and Exogenous data do not overlap sufficiently, then Forecast might produce an error, since a forecast fitted to so little matched data has very little meaning. To fix this, consider two options. First, consider dropping extraneous Exogenous data that has little a priori reason for its inclusion. Second, consider using Forecast methods that do not require Exogenous data.

  • The rows in my Endogenous and Exogenous data are intended to match on a relative basis. For example, the 1st row in the Endogenous data is meant to match the 1st row in the Exogenous data, similar for the 2nd row, and so on. But the data indexes do not match. Or, maybe one data index is missing or mostly missing. Or, maybe both data indexes are missing or mostly missing. How do I ensure that Forecast aligns the rows correctly?

If you want to use an index for one of the Endogenous or Exogenous data tables for both data tables, check the option “The left column is an index” in Step 1 of the Forecast form for the data with an index that you want to use. But for the other data, do not check this option. For example, suppose that you want to use the index for Exogenous data for both Exogenous and Endogenous data. Then, check “The left column is an index” for Exogenous data but not for Endogenous data. If you do not want to use an index for either data source, just leave this option unchecked for both. In this case, a generic index (0, 1, 2, …) is used and will ensure that relative row-matching data is used across Endogenous and Exogenous data, even if those data sources do not happen to line up row-by-row within Excel itself (e.g., the data sources come from different worksheets or places within the worksheet).
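The two choices described above (a generic index for both tables, or one table's index applied to both) can be pictured with pandas. This is only an illustrative sketch with hypothetical data, not a description of how Forecast reads Excel ranges.

```python
import pandas as pd

# Hypothetical tables whose rows correspond 1st-to-1st, 2nd-to-2nd, and so on,
# but whose date indexes do not match.
endog = pd.DataFrame(
    {"sales": [5.0, 6.0, 7.0]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)
exog = pd.DataFrame(
    {"ads": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2023-06-01", "2023-06-02", "2023-06-03"]),
)

# Matching on the indexes as given finds no common rows.
by_label = endog.join(exog, how="inner")

# Option 1: ignore both indexes and match on a generic index (0, 1, 2, ...).
generic = endog.reset_index(drop=True).join(exog.reset_index(drop=True))

# Option 2: use the Exogenous index for both tables.
endog_on_exog_index = endog.copy()
endog_on_exog_index.index = exog.index
shared = endog_on_exog_index.join(exog)

print(len(by_label), len(generic), len(shared))
```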

  • Why shouldn’t I always just ignore the indexes in my data, always leave “The left column is an index” unchecked in Step 1, and just let Forecast do the work to line things up?

There are two reasons to include an index with Endogenous and Exogenous data if one exists. First, suppose that the rows in the two data sources are not intended to match on a relative basis, as described above. The index information is then used by Forecast to create the match. Second, even if the rows are intended to match across Endogenous and Exogenous data on a relative basis, the index information is useful if the index is based on dates. Even if the date index appears to be nicely formatted, the dates still might not be equally spaced (e.g., daily data is not uniformly spaced if data on weekends is missing). Forecast can often use this additional information to improve accuracy.
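The point about unequal spacing is easy to see in a short pandas sketch. The dates below are hypothetical business-day observations; a generic index (0, 1, 2, 3) would hide the weekend gap that the date index reveals.

```python
import pandas as pd

# Hypothetical daily data with no weekend rows: Thu, Fri, Mon, Tue.
dates = pd.to_datetime(["2024-01-04", "2024-01-05", "2024-01-08", "2024-01-09"])

# Time elapsed between consecutive observations.
gaps = pd.Series(dates).diff().dropna()

# Evenly spaced data would have exactly one distinct gap length.
is_evenly_spaced = gaps.nunique() == 1
largest_gap = gaps.max()

print(is_evenly_spaced, largest_gap)
```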

  • What happens if the number of rows in the Endogenous and Exogenous tables is not the same?

Forecast will expand or contract the merged data required for forecasting based on the join and interpolation rules picked in Step 1.

  • How is Step 1 in the Forecast form to be completed if Endogenous and Exogenous data are inside a common Excel table with a single index?

As shown in Example: Categorical Endogenous Data, you can extract both Endogenous and Exogenous data from a common Excel table. Just select the index option for the data source that is adjacent to (to the right of) the intended index. For the other table, do not select the index option.

  • How is Step 1 in the Forecast form to be completed if Endogenous and Exogenous data are inside a common Excel table with a single index, but neither data set is adjacent to the index itself?

You have three options. First, do not select the index option for either the Endogenous or the Exogenous table. This approach is fine if using a generic table index (0, 1, 2, …), as described above, is reasonable for your forecast application. Second, use Data Select in the Toolkit to create a new common table in Excel (either in-place or at a new location) with the columns arranged so that the index is adjacent to either the Endogenous or the Exogenous table. Then, use Forecast as discussed above. This option is preferred over the first option if the index holds dates because date information is more specific (i.e., maybe not equally spaced) than a generic index. Third, use Data Select twice to create separate Endogenous and Exogenous data tables at new Excel locations, each data source with its own index. There is no forecasting benefit to this approach relative to the second option, except perhaps a subjective improvement in the readability of your Excel worksheet.

  • Why do I see the error: “In-sample Exogenous data provided without out-sample Preforecast exogenous data.”?

Exogenous data was provided in Step 1 of the Forecast form, but Preforecast data was not provided. Exogenous data and Preforecast data must be used together. To fix, if Exogenous data is provided in Step 1, then also provide Preforecast data. Otherwise, deselect Exogenous data in Step 1 to use a Forecast method that does not use Exogenous data.

  • Why do I see the error: “Out-sample Preforecast exogenous data provided without in-sample Exogenous data.”?

Preforecast exogenous data was provided in Step 1 of the Forecast form, but Exogenous data was not provided. Exogenous data and Preforecast data must be used together. To fix, if Preforecast exogenous data is provided in Step 1, then also provide Exogenous data. Otherwise, deselect Preforecast data in Step 1 to use a Forecast method that does not use Exogenous data.

  • Why do I see the error: “Endogenous data has non-numerical data that is not compatible with method ___? Attempts to convert this non-numerical data to float were unsuccessful.”

In Step 1 of the Forecast form, you did not check the option “Endogenous data should be treated as categorical” and so you were provided with forecasting methods in Step 2 that require Endogenous data to be numerical, not categorical. However, your Endogenous data contains non-numerical data anyway. Forecast tried to convert this data to a numerical format but was unable to do so. To fix, if you intended to check the option “Endogenous data should be treated as categorical” then check it in Step 1. Otherwise, your intended numerical Endogenous data must be formatted poorly enough that Forecast is unable to convert it to a numerical format. In this case, use the Finplicity Data Wrangling tools to address the problem.

  • Why do I see the error: “In-sample Exogenous data and out-sample Preforecast data must have the same number of data columns.”?

Preforecast exogenous data is either a forecast of the Exogenous data into the future or a set of Exogenous values under a scenario to be examined, e.g., a stress test. So, each column in the Preforecast exogenous data set must represent the same data concept as the corresponding column in the Exogenous data set, even if the values are different. For example, if the Exogenous data set has three columns (GDP, population, land mass) of Original (historical) data, then the Preforecast exogenous data must present the same three columns in the same column order, even though the actual values in each Exogenous column will, of course, differ from the values in the corresponding Preforecast column.
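Before running Forecast, a check along these lines can catch the mismatch early. This is a minimal sketch in pandas, assuming the two tables have already been loaded as DataFrames with header rows; the helper `column_mismatches` and the sample values are hypothetical and are not part of the Toolkit.

```python
import pandas as pd

def column_mismatches(exog, preforecast):
    """Return a list of human-readable problems; empty if the columns line up."""
    problems = []
    if exog.shape[1] != preforecast.shape[1]:
        problems.append(
            f"column count differs: {exog.shape[1]} vs {preforecast.shape[1]}"
        )
    elif list(exog.columns) != list(preforecast.columns):
        problems.append("same column count, but names or order differ")
    return problems

# Hypothetical historical Exogenous data with three columns.
exog = pd.DataFrame(
    {"GDP": [1.0, 1.1], "population": [10.0, 10.2], "land mass": [5.0, 5.0]}
)

# A matching Preforecast table: same columns, same order, different values.
good = pd.DataFrame({"GDP": [1.2], "population": [10.4], "land mass": [5.0]})

# A Preforecast table that is missing a column.
short = pd.DataFrame({"GDP": [1.2], "population": [10.4]})

# A Preforecast table with the right columns in the wrong order.
reordered = good[["population", "GDP", "land mass"]]

print(column_mismatches(exog, good))
print(column_mismatches(exog, short))
print(column_mismatches(exog, reordered))
```

The name comparison only helps when both tables carry headers; if they do not, the column order has to be verified by inspection.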