ABSTRACT
The first and most important step in the data
analysis process is to display the data using
graphical methods. Where two features in a study
can be measured accurately, a visual presentation
such as a scatter plot may reveal an interesting
relationship if the points do not appear random.
This helps researchers to understand the relationship
between different variables in a particular dataset.
It also helps investigators to decide how to analyze
the data further by applying the most suitable
statistical models. The chief aim of the present
article, therefore, is to examine one of the most
powerful graphical tools for data visualization,
i.e. the scatter plot, using a real public health
dataset.
Key words: Accurate
measurement, Scatter plot, Correlation coefficient,
Data visualization
INTRODUCTION
It has been argued that health
care professionals, including family physicians, are increasingly
becoming involved in public health data analysis and,
as a result, should be familiar with the different
approaches to data analysis(1,2). The first and
most important step in the data analysis process is
to display the data using graphical tools(3,4). This
helps researchers to understand the relationship between
different variables in a particular dataset. It also
helps investigators to decide how to analyze the data
further by applying the most suitable statistical
models(5).
Maps and box plots are among
the best-known graphical tools for displaying public health
data(1,6). For instance, it has been documented
that maps reveal spatial relationships that might
not be seen in the corresponding tables(7). There are also
other important graphical tools for displaying public health
data, such as the scatter plot, which may exhibit patterns
that cannot be expressed easily in writing. The chief
aim of the present article, therefore, is to examine
this powerful method for data visualization using a
real public health dataset.
SCATTER PLOT
One of the most powerful graphical
methods, especially for describing the relationship
between two continuous variables, is the scatter plot(8).
To create this graph, the value of the dependent
variable is plotted on the Y axis, whilst the value
of the independent variable is plotted on the X axis
(9). As a result, the graph displays each pair of data
values as a point with (x, y) coordinates in a plane(8).
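The construction described above can be sketched in a
few lines of Python. This is only a minimal text-based
illustration: the function name ascii_scatter, the grid
size and the heights and weights are all made up for
the example and are not taken from the study. In
practice a plotting library would normally be used.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a text grid with one '*' per (x, y) pair.

    The independent variable runs left to right (X axis) and the
    dependent variable runs bottom to top (Y axis). Assumes that xs
    and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Rescale each coordinate to a column and a row of the grid
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top line
    return "\n".join("".join(line) for line in grid)


# Hypothetical heights (cm) and weights (kg), for illustration only
heights = [110, 118, 125, 131, 138, 144, 150, 157]
weights = [18, 21, 24, 27, 31, 36, 41, 47]
plot = ascii_scatter(heights, weights)
print(plot)
```

Each star marks one (height, weight) pair, so a band of
stars rising from the lower left to the upper right
suggests that the two variables increase together.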
The shape of this plot, which
is used to describe the relationship between the two
variables, has two elements: position and variability.
The position might be summarized by a line or a curve
that runs through the bulk of the data, and the
variability might be measured in terms of the deviation
of the (x, y) points from that line or curve(8).
These two elements help researchers
to detect any relationship that might exist between two
variables. They also help investigators to decide
how to analyze the data further by applying
the most suitable statistical models. For instance,
if the relationship between two variables appears to be
linear, then a linear regression model could be
applied to analyze the data further(9).
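As a hedged illustration of that follow-up step, the
sketch below fits a straight line by ordinary least
squares using only standard Python. The function name
fit_line and the data are hypothetical; the article does
not state which software or fitting procedure was used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and sum of cross-products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical heights (cm) and weights (kg) lying exactly on a line
heights = [110, 120, 130, 140, 150]
weights = [18, 24, 30, 36, 42]
slope, intercept = fit_line(heights, weights)
predicted = intercept + slope * 135  # fitted weight at a height of 135 cm
print(slope, intercept, predicted)
```

The slope estimates the average change in the dependent
variable for a one-unit change in the independent
variable, which is the quantity a linear model is meant
to capture.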
Sometimes we may also need to
supplement the scatter plot by calculating a statistic
known as the correlation coefficient, which is usually
denoted by r(10). This coefficient quantifies both
the direction and the strength of the observed linear
relationship and can take any value between +1 and -1.
A positive value indicates that the two variables tend
to be either large or small together, and a negative
value implies that one variable takes large values when
the other is small, and vice versa. A value near zero
implies that there is little or no linear relationship
between the two variables(10).
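For readers who wish to see the calculation, the sketch
below computes the usual product-moment (Pearson)
coefficient, r = Sxy / sqrt(Sxx * Syy), where Sxy is the
sum of cross-products and Sxx and Syy are the sums of
squares about the means. It assumes that this is the
coefficient meant here; the function name pearson_r and
the data are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights (cm) and weights (kg), for illustration only
heights = [110, 118, 125, 131, 138, 144, 150, 157]
weights = [18, 21, 24, 27, 31, 36, 41, 47]
r = pearson_r(heights, weights)
print(round(r, 2))

# A perfectly decreasing relationship gives the lower limit of -1
r_negative = pearson_r([1, 2, 3], [3, 2, 1])
```

Because r only measures linear association, it is best
read alongside the scatter plot rather than instead of
it.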
PUBLIC HEALTH DATASET
The data come from a cross-sectional
study conducted from September
to October 2005 on 606 Afghani pupils aged 6-14 years
in Shahriar County, Tehran Province, Iran.
The sample included 312 (56.1%) boys and 284 (46.9%)
girls, who were originally recruited in order to determine
their nutritional status.
Among the variables under study
were two continuous variables, i.e. the height and
the weight of the pupils, which were measured with
minimal clothing and no shoes. The weights of the pupils
were measured with an accurately calibrated Seca digital
scale, with an accepted error of 0.1 kg, and their heights
with a stadiometer, with an accepted error of 0.1 cm.
Let us imagine that the researchers
who conducted the above study are keen to determine
the relationship between the heights and the weights of
the pupils. In other words, they would like to know to
what extent the weights of the pupils increase as their
heights increase.
To answer this question, Diagram
1, which is a scatter plot, was produced. In this graph
the height of the pupils (i.e. the independent variable)
is shown on the X axis and the weight of the pupils
(i.e. the dependent variable) on the Y axis. As the diagram
depicts, a straight line runs through the bulk of the data,
and the correlation coefficient (r = 0.8) indicates that
there is a strong positive relationship between these
two variables. This implies that, to a large extent,
taller pupils also tend to be heavier. As the graph
depicts, whilst most observations are gathered relatively
closely around the straight line, there
are also some outliers. For instance, case number 490,
a 16-year-old boy with a height of 176 cm and a weight
of 70 kg, and case number 399, an 11-year-old girl
with a height of 133 cm and a weight of 17 kg, are
clearly two outliers.
Diagram 1. Scatter plot depicting the relation between
height and weight of 606 Afghani pupils aged 6-14 years
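The outliers in Diagram 1 were picked out by eye. As a
complementary and purely illustrative approach, points
can also be flagged when they lie unusually far from a
fitted straight line. The sketch below assumes a
least-squares line and an arbitrary cut-off of two
residual standard deviations; the function names, the
cut-off and the data are hypothetical and do not come
from the study.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(xs, ys, k=2.0):
    """Indices of points whose residual exceeds k residual standard deviations.

    The cut-off k is an arbitrary illustrative choice, not a formal rule.
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom
    sd = math.sqrt(sum(e * e for e in residuals) / (len(xs) - 2))
    return [i for i, e in enumerate(residuals) if abs(e) > k * sd]


# Hypothetical data: the sixth pupil is far lighter than the trend suggests
heights = [110, 115, 120, 125, 130, 135, 140, 145, 150, 155]
weights = [18, 21, 24, 27, 30, 15, 36, 39, 42, 45]
flagged = flag_outliers(heights, weights)
print(flagged)
```

Any point flagged in this way should still be inspected
on the plot and checked against the original records
before it is treated as an error.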

Since other variables such as
gender may confound the relationship between height
and weight, two separate scatter plots were produced
for boys (Diagram 2) and girls (Diagram 3), respectively.
Both diagrams show a strong positive relationship
between these two variables in each sex.
However, the observations in Diagram 2
are gathered more closely around the straight line than
those in Diagram 3. Furthermore, the values of the
correlation coefficients suggest that this
relationship is slightly stronger among boys (r = 0.81)
than among girls (r = 0.79).
Diagram 2. Scatter
plot depicting the relation between height and weight
of 312 Afghani boys aged 6-14 years
Diagram 3. Scatter
plot depicting the relation between height and weight
of 284 Afghani girls aged 6-14 years
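The stratified comparison in Diagrams 2 and 3 amounts to
computing the correlation coefficient separately within
each group. A minimal sketch of that idea follows; the
record layout, the function names and every value are
made up for illustration and are not the study data.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correlation_by_group(records):
    """Map each group label to the correlation between height and weight."""
    groups = defaultdict(lambda: ([], []))
    for label, height, weight in records:
        groups[label][0].append(height)
        groups[label][1].append(weight)
    return {label: pearson_r(hs, ws) for label, (hs, ws) in groups.items()}


# Hypothetical (sex, height in cm, weight in kg) records
records = [
    ("boy", 110, 18), ("boy", 120, 24), ("boy", 130, 30),
    ("boy", 140, 36), ("boy", 150, 42),
    ("girl", 112, 19), ("girl", 121, 22), ("girl", 129, 28),
    ("girl", 137, 30), ("girl", 146, 38),
]
r_by_sex = correlation_by_group(records)
print(r_by_sex)
```

A small difference between two group coefficients, such
as the one reported above, describes the samples only;
deciding whether it reflects a real difference would
need a formal comparison.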

CONCLUSION
The scatter plot is among the most
powerful graphical methods for examining the relationship
between two continuous variables in terms of position
and variability. Therefore, the use of this plot in
the early stages of data analysis, i.e. data visualization,
is strongly recommended(8-10).
ACKNOWLEDGMENT
The author would like to thank
Ian Enzer for his valuable comments on an earlier draft
of this article.
REFERENCES
1. Rezaeian M. How to visualize public health data?
Part one: Box plot and map. Middle East J Family Med
2008; 10: 20-24.
2. Rezaeian M. How to visualize public health data?
Part two: Direct and indirect standardization methods.
Middle East J Family Med 2009; 1: 42-44.
3. Cleveland WS. Visualising data. Summit, NJ: Hobart
Press, 1993.
4. Everitt BSE, Dunn G. Applied multivariate data analysis.
London: Arnold, 2001.
5. Rezaeian M, Dunn G, St. Leger S, Appleby L. Geographical
epidemiology, spatial analysis and geographical information
systems: a multidisciplinary glossary. J Epidemiol
Community Health 2007; 61: 98-102.
6. Rezaeian M, Dunn G, St. Leger S, Appleby L. The
production and interpretation of disease maps: A methodological
case-study. Soc Psychiatry Psychiatr Epidemiol 2004;
39: 947-954.
7. Bell BS, Broemeling LD. A Bayesian analysis for
spatial processes with application to disease mapping.
Stat Med 2000; 19: 957-974.
8. Martin MA, Welsh AH. Graphical display. In: Armitage
P, Colton T. International encyclopaedia of biostatistics,
pp 1750-1771. Chichester: John Wiley, 1998.
9. Harrington R. Scatter plot. In: Boslaugh S. Encyclopaedia
of epidemiology, pp 938-939. California: SAGE Publications,
Inc., 2008.
10. Dunn G, Everitt B. Clinical biostatistics. London:
Edward Arnold, 1995.