Physio Analytics - Technical Report

 

Number:         TR0502031

Date:               May 2, 2003

Author:           Thomas B. Waggener, Ph.D.

Title:

Use of Automated Graph Readers to Digitize Data

 

 

Automated graph readers are widely used to digitize published data for use in modeling or plotting. While these programs generally work as advertised, they can introduce subtle problems that lead to errors in the collected data.

The particular program to be used as an example is the freeware program “DataThief”, but similar problems may arise in other programs.

Two problems to be addressed in particular are errors in value and phase shifting. For those unfamiliar with the program, here is a quick rundown as to how it works.

The point of the program is to take a published graph and generate the data points represented by the graph. Step one is to get a digital form of the graph. It may be available online as a graphics file, or it can be scanned and saved as a graphics file. DataThief requires “gif” format, so a file in any other format must first be converted to “gif”, which is easily done using any standard graphics program. Step two is to run the file through the digitizing program, in this case DataThief. After starting the program, you open the graphics file and then move and scale the digitizing axes to match the axes on the graph. You then designate a curve to digitize, and the digitizing program automatically traces the curve and produces an ASCII file giving the data points that match the graph. Overall, the program is easy and effective. However, the user should be cautious about a couple of points.

 

How the curve is identified from the graph

            The program operates on pixels. The user identifies the color of the pixels that correspond to the curve by using a cursor to mark the curve. The program then follows the curve, laying down its own line on the graph as it goes, so you can see what it is identifying.

            Problems arise because the DataThief trace is only one pixel wide, while the graphics file curve it is tracing is usually considerably more than one pixel wide.

The DataThief trace does not track down the center of the graphics file curve. Instead, it tracks down the right-hand edge, presumably because it is using an edge technique to identify the curve and is scrolling through the file from right to left.

 

Errors in Value

The result of this asymmetry is that the numbers from the DataThief trace consistently underestimate the value of the curve when the curve’s slope is positive and overestimate the value of the curve when the slope of the curve is negative.

 

            More precisely, if:

                        curve width = w, in pixels

                        curve slope = Θ, in degrees

then, defining the error as the difference between the values of the original curve and the values of the digitized trace, the error e, in pixels, at any point is calculated as

                                    e = ((w-1)/2)/cos(Θ)

 

So the error is determined by the width of the graphics file curve and the slope of the curve. Note that at the local maxima and minima, the slope is zero and thus the error is at its minimum value, i.e. (w-1)/2.
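The relationship between curve width, slope, and error can be sketched in a few lines of Python. This is only an illustration of the formula given above; the function name and argument names are hypothetical and are not part of DataThief.

```python
import math


def edge_tracking_error(width_px, slope_deg):
    """Error in pixels between the curve and an edge-following trace.

    Implements e = ((w - 1) / 2) / cos(slope), where w is the curve width
    in pixels and slope is the slope angle of the curve in degrees.
    """
    if width_px < 1:
        raise ValueError("curve width must be at least one pixel")
    if abs(slope_deg) >= 90:
        raise ValueError("slope must be strictly between -90 and 90 degrees")
    return ((width_px - 1) / 2) / math.cos(math.radians(slope_deg))


# A 5-pixel-wide curve at a peak or trough (slope of zero): the minimum error.
flat_error = edge_tracking_error(5, 0)      # 2.0 pixels

# The same curve on a steep section: the error grows as 1/cos(slope).
steep_error = edge_tracking_error(5, 60)    # roughly 4 pixels
```

Because the cosine is an even function, the size of the error is the same for rising and falling sections of equal steepness; only its sign (underestimate or overestimate) differs.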

Published curves are often drawn with thick lines so that they can be easily seen. Furthermore, in order for a graph to be successfully scanned, the curve has to be thick enough to produce a clean and continuous line. It is not uncommon to have curves whose thickness is up to 10% of their average value. In such a case there is a minimum ±5% error just based on curve thickness. Whether this level of error is tolerable depends on how you are using the data.

 

Errors in Phase

            A more subtle problem seen with these graph readers is phase shifting. This results from the same process described above. Consider a local maximum for the graphics file curve. As the curve rises, the digitizing trace is just on the right of, and below, the curve. As the curve goes through its peak, the digitizing trace crosses through the thickness of the curve and starts to track the descending curve, also on the right, but now above the curve. The peak of the digitizing trace is thus shifted to the right relative to the original curve. This phenomenon is illustrated in Figures 1 and 2. How far to the right the peak is shifted depends on the thickness of the curve and the sharpness, i.e. radius of curvature, of the peak. A similar shift is seen for minima.

 

Figure 1.         An example which uses the DataThief curve digitizing program on a sinusoid. The yellow trace is the sinusoid from the “gif” file, the red trace is the fit from the DataThief program.

Figure 2.         Detail from Figure 1 showing the peak of the curve. The yellow trace is from the “gif” file, the red trace is the fit from the digitizing program. The arrows show the maximum of each trace. The distance between the arrows is the phase shift. In this case the shift is 1.2 units out of a cycle time of 100 units, for a phase shift of 4.3°.

 

            Because the phase shift depends on the radius of curvature, most real data will have different absolute shifts at different maxima and minima in the curve. This further complicates the use of the digitized data for identification of phase.
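The conversion from a peak shift measured in axis units to a phase shift in degrees, as quoted for Figure 2, is simply the shift taken as a fraction of one cycle of 360°. A minimal sketch follows; the function name is hypothetical and chosen only for illustration.

```python
def phase_shift_degrees(shift, period):
    """Convert a peak shift along the x-axis into degrees of phase.

    shift and period must be in the same units; one full period is 360 degrees.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return 360.0 * shift / period


# Values from Figure 2: a shift of 1.2 units in a cycle time of 100 units.
figure2_shift = phase_shift_degrees(1.2, 100.0)
print(round(figure2_shift, 1))  # 4.3
```

Since the absolute shift differs from peak to peak in real data, the conversion would have to be applied separately at each maximum or minimum of interest.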

 

Summary

            Automated graph digitizers can be extremely useful in appropriate circumstances. However, one must be careful when using them for quantitative studies, as they can introduce significant errors. The problems seen in DataThief are related to how it tracks the curve on the graph. A smarter tracking algorithm could be used to run the trace down the middle of the curve, rather than down the edge, thus eliminating the errors discussed here. However, one will still be left with the errors inherent in scanning a printed page and manually fitting the axes to the curve. One is also faced ultimately with the resolution limit of the single pixel.
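One way to run the trace down the middle of the curve is sketched below. This is an assumption about how such an algorithm might look, not a description of DataThief or of any existing program: the image is taken to be a list of pixel rows in which 1 marks a pixel of the curve color, and for each column the trace is placed at the mean row of the curve pixels in that column. The function name is hypothetical.

```python
def column_centers(image):
    """For each column, return the mean row index of the curve pixels.

    image is a list of rows, each a list of 0/1 values (1 = curve pixel).
    Columns containing no curve pixels yield None.
    """
    n_cols = len(image[0]) if image else 0
    centers = []
    for col in range(n_cols):
        rows = [r for r, line in enumerate(image) if line[col]]
        centers.append(sum(rows) / len(rows) if rows else None)
    return centers


# A 3-pixel-thick rising stroke; row 0 is the top of the image.
image = [
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
centers = column_centers(image)
print(centers)  # [4.0, 3.0, 2.0, 1.0]
```

Taking the vertical midpoint in each column treats rising and falling sections symmetrically, so it avoids the one-sided bias of an edge-following trace. It is still only an approximation to the true center line where the curve is steep or sharply curved, and it assumes a single curve of that color in each column.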

In general, if you need numbers which are more precise than half the width of the plotted curve, or 5% as a rule of thumb, or phase information more accurate than 5°, you shouldn’t be using this technique. You should contact the author of the original graph and try to get access to the table of values used to generate the original curve.

 

 

© 2003 by Physio Analytics, 24 Wyoming Road, Newton, MA  USA 02460-1235