Desktop Client: Use of Random Number Distributions in a Model
Overview of Variable Data and Statistical Functions in iGrafx
This article generally applies to the iGrafx Process and iGrafx Process for Six Sigma Client tools, where simulation is possible. The setup, however, can also be defined in iGrafx FlowCharter.
Variable data is any data that is not constant; this includes attributes, equations, logic, built-in functions, and user-defined functions. iGrafx offers extensive options for representing variable data in a model. The locations where you may use variable data include, but are not limited to, transaction generation arrival times, generation quantities, per-transaction costs, activity task times, decisions, and the many places where attributes may be initialized, assigned, or manipulated.
The statistical function category in iGrafx contains an extensive list of random number distribution functions. It is not an exhaustive list for a statistician; rather, it is a comprehensive list with respect to dynamic process modeling with discrete event simulation. In iGrafx, a modeler has great flexibility to represent variable data in a model based on data collected from the real process.
Why Use Statistical Function Random Number Distributions?
If thousands or millions of data points were collected from a process, they could be used directly in a model, and the use of random number distributions might be eliminated. However, this would 'lock' the model into behaving exactly as it had in the past (at least if the values were used in sequence); using a distribution function that has the same characteristics as the data, but allows values to be picked in a different order, may be better for simulating the system under varying conditions. In addition, a modeler often does not have the luxury of that much data, or the desire to build the modeling constructs needed to use an external data source. For example, suppose a modeler commissions a time study of 30 samples. If the model used that real data while processing many thousands of transactions, the same 30 data points would be used over and over. This is where random number distributions (functions) come to the rescue. After a modeler uses his or her statistical tools, or the iGrafx Process for Six Sigma Fit Data functionality, to determine the appropriate random number distribution to represent the data, the modeler may select and use that distribution in the model. A modeler may not always be able to confidently determine the best random number distribution, so a few guidelines are worth following.
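To make the idea concrete, here is a minimal Python sketch (not iGrafx syntax) of the workflow described above: fit a distribution to a small time study and then draw as many simulated task times as needed, rather than replaying the raw values. The 30-point sample and the choice of an exponential fit are assumptions for illustration only.

```python
# Minimal sketch: fit a distribution to a small time study, then sample from it.
# The 30-point "time_study" data set here is invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
time_study = rng.exponential(2.5, 30)        # stand-in for 30 collected task times

# Fit an exponential distribution to the sample (location fixed at zero)
loc, scale = stats.expon.fit(time_study, floc=0)

# Any number of task times can now be drawn with the same characteristics,
# instead of cycling through the same 30 observations over and over.
simulated = stats.expon.rvs(loc=loc, scale=scale, size=100_000, random_state=rng)
print(f"sample mean = {time_study.mean():.2f}, simulated mean = {simulated.mean():.2f}")
```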
Some Common Statistical Distributions, the Function in iGrafx, and Common Uses:
- Uniform (the Between function in iGrafx): A uniform distribution gives an equal probability to any value in the range given by a minimum and maximum. It is a good starting place when you only have a rough guess, letting you get some randomness and your model dynamics started so that you can come back later when you have better data (see the sketch after this list).
- Between Normal (the BetweenNorm function): The Normal distribution is often used as a starting point for normally (bell-curve) distributed data, where the average (mean) value is more likely to occur than the minimum (min) or maximum (max) value. The difference from a normal distribution is that a 'between normal' distribution is specified simply by a min and max value, and iGrafx calculates the mean for you (which you can think of as (min + max)/2). This fits an approximate 3-sigma (plus or minus 3 sigma on either side of the mean) normal curve between the min and max. This eliminates outlying data beyond the min and max, which can be desirable or not depending on the objective.
- Normal (NormDist): The NormDist function allows you to specify the mean and standard deviation of a Normal (bell-shaped) curve. Again, the Normal distribution is often used as a good starting point, especially for task time data. In addition, with a large enough sample size, the Central Limit Theorem states that averages of data tend toward a Normal distribution, so using a Normal distribution may seem to be a natural choice.
- Exponential (ExponDist): Exponential distributions typically represent mean time between failures (MTBF) well, because the underlying causes of the failures are unknown and the result is failures that can either happen in rapid succession or go very long periods between observations. For the same reason, exponential distributions do a good job of representing mean time to repair (MTTR). In addition, Exponential distributions can model arrival rates of transactions to a system when the driving mechanism is unknown and you want random arrivals around a mean; for example, the underlying reason for arrivals at a fast food counter is largely unknown.
- Weibull (WeibullStdDist): Weibull distributions are good at modeling time data, as the Weibull function can 'skew' (stretch) to one side and have a 'tail': a few outlying values (e.g., times that take longer) beyond what a normal distribution would give.
- Log Normal (LogNormStdDist): Log Normal distributions are good for modeling usable life spans for products or product time to failure (PTTF). However, you may find that actual task times of equipment match this distribution as well.
- Beta (BetaDist): A Beta distribution can work when actual data is unknown. If you have a minimum and maximum from a Subject Matter Expert (SME), you can place a density function on an interval between 'a' and 'b' and assume the numbers are Beta distributed on that interval. In iGrafx, the BetaDist function has arguments of A, B, Min, and Max. This distribution provides extra modeling flexibility because of the wide variety of 'shapes' the Beta density function can assume (refer to Wikipedia on Beta distributions for more). For example, if A = B = 1, this is the Uniform(a, b), or Between(a, b) in iGrafx. When B > A > 1, the shape is skewed to the right.
- Binomial (BinomDist): Binomial distributions are good at representing a transaction that has binary states such as pass/fail, yes/no, true/false, good/bad, go/no-go, etc. You may want to use the PercentYes function in iGrafx, a special case of the Binomial distribution that returns a 'Yes' (a value of 1) with a frequency close to the specified percentage of the time.
- Poisson (PoissonDist): Poisson distributions are good at modeling arrival data. This distribution is commonly used to represent the arrival of flights in air traffic control and planning processes.
- Triangle (TriangleDist): Triangle distributions are often used as a starting point when a modeler has data that is skewed to one side, or has only the best guess of a Subject Matter Expert (SME). The specifications for a triangle distribution are min, max, and mode. Remember the difference between mode and mean: the mode is the most observed, or most frequent, value. When using a triangular distribution that is skewed to one side or the other, a modeler will get a higher number of observations far away from the mode than would be experienced with a normal or Weibull distribution. This is due to the actual shape of the triangle distribution: the number of observations drops off at a linear rate, unlike the curve of a normal or Weibull distribution.
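For readers who want to see these shapes side by side, the following short Python/NumPy sketch (not iGrafx syntax) draws samples from NumPy's equivalents of a few of the distributions above. The BetweenNorm line simply applies the mean = (min + max)/2 and approximately 3-sigma description given earlier, and all numeric parameter values are made up for illustration.

```python
# Conceptual NumPy sketch of a few of the distributions above.
# This is NOT iGrafx syntax; function names and parameters are NumPy's,
# and all numeric values are made up for illustration.
import numpy as np

rng = np.random.default_rng(seed=42)
n = 10_000

# Uniform / Between(min, max): every value between 2 and 8 is equally likely
uniform_s = rng.uniform(2.0, 8.0, n)

# BetweenNorm(min, max), per the description above: mean = (min + max) / 2 and
# roughly 3 sigma between the mean and either limit, so sigma = (max - min) / 6.
# (The elimination of values outside min/max is not reproduced here.)
lo, hi = 2.0, 8.0
between_norm_s = rng.normal((lo + hi) / 2, (hi - lo) / 6, n)

# Normal(mean, standard deviation)
normal_s = rng.normal(5.0, 1.0, n)

# Exponential inter-arrival times with a mean of 4 minutes
expon_s = rng.exponential(4.0, n)

# Triangular(min, mode, max): observations fall off linearly away from the mode
triangle_s = rng.triangular(1.0, 2.0, 10.0, n)

for name, s in [("Uniform", uniform_s), ("BetweenNorm", between_norm_s),
                ("Normal", normal_s), ("Exponential", expon_s),
                ("Triangle", triangle_s)]:
    print(f"{name:12s} mean={s.mean():6.2f}  std={s.std(ddof=1):6.2f}  "
          f"min={s.min():6.2f}  max={s.max():6.2f}")
```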
More on Normal Distributions, Particularly with Time Data:
Normal distributions are often where the trouble begins in a simulation model. The underlying problem occurs when a modeler uses a normal distribution to represent non-normal data. This is common for two reasons:
- All students of statistics have been taught to assume normality when they can. However, some do not study the data to demonstrate that it is, in fact, normal or approximately normal before assuming normality.
- A modeler forgets to go back and update the data representation beyond the initial guess made by using a normal distribution.
What is really important is the implication of using the wrong random distribution for some applications. One of the most misleading modeling situations is misrepresenting time data in a model. What's really dangerous is that the error is hidden unless a modeler thinks to test for it, and a modeler simply wouldn't think to test for it until simulation output results are unreasonable enough to go looking. Even then, the root cause might not be uncovered. If the results are off, but not enough to prompt looking for a problem, then the problem with data representation may go unnoticed and poor process management recommendations may result.
Time values cannot be negative. Neither can length, weight, nor many other measurements, but let's focus on time for this discussion. If a random number distribution assigns a negative number where a negative number cannot be used, the simulation engine rounds the number up to zero. The result is that the data actually used has experienced an upward shift in mean compared with the input data. If all a simulation model did was run a transaction into a shape, assign a number from a random number distribution that allows negative numbers, use that number as the task time, and then log the transaction time for each transaction, the resulting data would have a higher mean and a different distribution shape than the input data that was used to represent the real data. This may invalidate task times, cycle times, resource statistics, queue statistics, process capacity, and more.
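The shift is easy to demonstrate with a quick Monte Carlo sketch in Python. The mean and standard deviation below are hypothetical, chosen so the standard deviation is about the size of the mean, as often happens when skewed time data is forced into a normal distribution.

```python
# Quick Monte Carlo sketch of the mean shift described above.
# The mean and standard deviation are hypothetical values for illustration.
import numpy as np

rng = np.random.default_rng(seed=1)
raw = rng.normal(1.7, 1.7, 100_000)      # normal draws; negatives are allowed
used = np.maximum(raw, 0.0)              # negatives rounded up to zero, as a time must be

print(f"raw  mean={raw.mean():.3f}  std={raw.std(ddof=1):.3f}")
print(f"used mean={used.mean():.3f}  std={used.std(ddof=1):.3f}")
# The mean of the values actually used is noticeably higher than the mean that
# was specified, and the standard deviation is noticeably lower.
```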
Example: A certain activity in a physical process is difficult, costly, or too infrequent to observe more than around ten times in a given week. A ten-sample time study of this real-life physical process activity results in the following data, in minutes:
5.45 0.09 0.89 0.06 0.85 1.3 0.21 1.89 3.17 2.77
The mean of these values is 1.67. The standard deviation of these values is 1.71. Right away we can see that zero is less than one standard deviation below the mean, so this data is not even close to normal. If a modeler uses a normal distribution, a significant percentage of the values selected (roughly one in six) will be negative and therefore be rounded up to zero when used as a time. The data generated by a random normal distribution for ten transactions was:
3.90 -1.51 0.82 3.51 0.76 1.31 -0.74 1.87 2.76 2.49
The mean of these values is 1.52. The standard deviation of these values is 1.75. When these values were used as time units for the activity task time, the result was:
3.90 0.00 0.82 3.51 0.76 1.31 0.00 1.87 2.76 2.49
Notice that the two negative data points were rounded to zero. The mean of these values is 1.74, which is higher due to a special cause (the data being changed to usable values). The standard deviation of these values is 1.39, which is lower than the original standard deviation due to the shift of some data points toward the mean, thereby changing the shape of the distribution.
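The statistics quoted in this example can be reproduced directly from the three data sets listed above. The following Python sketch simply recomputes them (sample standard deviation, n - 1 divisor) and applies the round-up-to-zero rule:

```python
# Recomputing the statistics quoted in this example from the three data sets
# listed above (sample standard deviation, n - 1 divisor).
import numpy as np

observed  = np.array([5.45, 0.09, 0.89, 0.06, 0.85, 1.30, 0.21, 1.89, 3.17, 2.77])
generated = np.array([3.90, -1.51, 0.82, 3.51, 0.76, 1.31, -0.74, 1.87, 2.76, 2.49])
used      = np.maximum(generated, 0.0)   # negatives rounded up to zero by the engine

for name, d in [("observed", observed), ("generated", generated), ("used", used)]:
    print(f"{name:9s} mean={d.mean():.2f}  std={d.std(ddof=1):.2f}")
# observed  mean=1.67  std=1.71
# generated mean=1.52  std=1.75
# used      mean=1.74  std=1.39
```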
If the original time study data had been analyzed, the modeler would have found that the data is best represented by a Weibull distribution, and secondarily by an Exponential distribution, relative to the other distribution options. Looking at goodness-of-fit values also reveals that the data does not necessarily fit any distribution well, and therefore there is good reason to go out and collect additional data points. A sample size of thirty would be a more acceptable minimum. Collecting additional data, and the cost of doing so, should be weighed against the cost of making a good or bad process management decision.
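As a rough illustration of how such a fit comparison might be done outside iGrafx, the sketch below fits Weibull, Exponential, and Normal candidates to the ten observed values with SciPy and reports a Kolmogorov-Smirnov statistic for each. With only ten points, and with parameters estimated from the same data being tested, these numbers are indicative at best, which reinforces the point about collecting more data.

```python
# Sketch of a goodness-of-fit comparison with SciPy, using the ten observed
# values from the example above. Indicative only: the sample is small and the
# parameters are estimated from the same data being tested.
import numpy as np
from scipy import stats

data = np.array([5.45, 0.09, 0.89, 0.06, 0.85, 1.30, 0.21, 1.89, 3.17, 2.77])

candidates = [("weibull_min", stats.weibull_min),
              ("expon", stats.expon),
              ("norm", stats.norm)]

for name, dist in candidates:
    # Fix the location at zero for the time-like candidates
    params = dist.fit(data) if name == "norm" else dist.fit(data, floc=0)
    ks = stats.kstest(data, name, args=params)
    print(f"{name:12s} KS statistic={ks.statistic:.3f}  p-value={ks.pvalue:.3f}")
```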