Chief editor
Abdulrazak Abyad MD, MPH, MBA, AGSF, AFCHSE

Editorial office:
Abyad Medical Centre & Middle East Longevity Institute
Azmi Street, Abdo Centre
PO Box 618
Tripoli, Lebanon
Tel: 961 6 443 684
Fax: 961 6 443 685
aabyad@cyberia.net.lb


Publisher
Lesley Pocock
medi+WORLD International

572 Burwood Road,
Hawthorn 3122, VIC
Australia
Tel: +61 3 9819 1224
Fax: +61 3 98193269
Lesleypocock@mediworld.com.au

 

 

While all efforts have been made to ensure the accuracy of the information in this journal, opinions expressed are those of the authors and do not necessarily reflect the views of The Publishers, Editor or the Editorial Board. The publishers, Editor and Editorial Board cannot be held responsible for errors or any consequences arising from the use of information contained in this journal; or the views and opinions expressed. Publication of any advertisements does not constitute any endorsement by the Publishers and Editors of the product advertised.

The contents of this journal are copyright. Apart from any fair dealing for purposes of private study, research, criticism or review, as permitted under the Australian Copyright Act, no part of this program may be reproduced without the permission of the publisher.

 
March 2009, Volume 6 - Issue 2

Getting to Know The Scatter Plot

Dr. Mohsen Rezaeian (PhD, Epidemiologist, Associate Professor)
Social Medicine Department, Rafsanjan Medical School, Rafsanjan, Iran.
Tel: +98 391 5234003
Fax: +98 391 5225209
Email: moeygmr2@yahoo.co.uk



ABSTRACT

The first and most important step in data analysis is to display the data graphically. Where two features in a study can be measured accurately, a visual presentation such as a scatter plot may reveal an interesting relationship if the points do not appear randomly scattered. This helps researchers to understand the relationships between different variables in a particular dataset. It also helps investigators to decide how to analyze the data further by applying the most suitable statistical models. The chief aim of the present article, therefore, is to examine one of the most powerful graphical tools for data visualization, the scatter plot, using a real public health dataset.

Key words: Accurate measurement, Scatter plot, Correlation coefficient, Data visualization

INTRODUCTION

It has been argued that health care professionals, including family physicians, are increasingly becoming involved in public health data analysis and should therefore be familiar with the different approaches to data analysis(1,2). The first and most important step in the data analysis process is to display the data using graphical tools(3,4). This helps researchers to understand the relationships between different variables in a particular dataset. It also helps investigators to decide how to analyze the data further by applying the most suitable statistical models(5).

Maps and box plots are among the best-known graphical tools for displaying public health data(1,6). For instance, it has been documented that maps reveal spatial relationships that might not be seen in the corresponding tables(7). There are also other important graphical tools for displaying public health data, such as the scatter plot, which may exhibit patterns that cannot easily be expressed in writing. The chief aim of the present article, therefore, is to examine this powerful method for data visualization using a real public health dataset.


SCATTER PLOT

One of the most powerful graphical methods, especially for describing the relationship between two continuous variables, is the scatter plot(8). To create this graph, the value of the dependent variable is plotted on the Y axis, whilst the value of the independent variable is plotted on the X axis(9). As a result, the graph displays each pair of data values as a point with (x, y) coordinates in a plane(8).
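The construction described above can be sketched in a few lines of Python. This is only a minimal, text-based illustration that uses the standard library; in practice a dedicated plotting package would normally be used. The function name `text_scatter`, the grid size and the height and weight values are all assumptions made for the example and are not taken from the study.

```python
def text_scatter(pairs, width=40, height=12):
    """Render (x, y) pairs on a character grid.

    Returns a list of strings; the first string is the top row (largest y).
    """
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Fall back to a span of 1 if all values on an axis are identical.
    x_span = (x_max - x_min) or 1
    y_span = (y_max - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for x, y in pairs:
        # Scale each value to a column (X axis) and a row (Y axis).
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        grid[height - 1 - row][col] = "*"
    return ["".join(cells) for cells in grid]


# Illustrative values only (not from the study):
# independent variable = height in cm, dependent variable = weight in kg.
heights = [110, 118, 125, 131, 138, 144, 150, 157]
weights = [18, 21, 24, 27, 31, 36, 40, 46]
pairs = list(zip(heights, weights))

for line in text_scatter(pairs):
    print("|" + line)
print("+" + "-" * 40)
```

Each asterisk corresponds to one (height, weight) pair, with the independent variable running left to right and the dependent variable running bottom to top.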

The shape of this plot, which is used to describe the relationship between the two variables, has two elements. These two elements are related to its position and variability. The position might be measured as a line or a curve that runs through the bulk of the data, and the variability might be measured in terms of deviation of (x, y) points from the curve(8).
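As a rough sketch of these two elements, the position can be summarised by a straight line fitted by ordinary least squares, and the variability by the spread of the points around that line. The choice of a least-squares line and of the residual standard deviation is an assumption made here for illustration; the source does not prescribe a particular method, and a curve may be more appropriate for other data. The function names and the data values are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x (the 'position')."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sd(xs, ys, intercept, slope):
    """Spread of the points around the line (the 'variability').

    Uses the usual n - 2 denominator for a fitted straight line.
    """
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))


# Illustrative values only (not from the study): height in cm, weight in kg.
heights = [110, 118, 125, 131, 138, 144, 150, 157]
weights = [18, 21, 24, 27, 31, 36, 40, 46]

intercept, slope = fit_line(heights, weights)
spread = residual_sd(heights, weights, intercept, slope)
print(f"line: weight = {intercept:.2f} + {slope:.2f} * height")
print(f"residual standard deviation: {spread:.2f} kg")
```

A small residual standard deviation means the points lie close to the line; a large one means they are widely scattered around it.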

These features help researchers to detect any relationship that might exist between two variables. They also help investigators to decide how to analyse the data further by applying the most suitable statistical models. For instance, if the relationship between two variables appears to be linear, then a linear regression model can be applied to analyse the data further(9).

Sometimes we may also need to supplement the scatter plot by calculating a statistic known as the correlation coefficient, usually denoted by r(10). This coefficient quantifies both the direction and the strength of the observed linear relationship and can take any value between -1 and +1. A positive value indicates that the two variables tend to be either large or small together, whereas a negative value implies that one variable takes large values when the other is small, and vice versa. A value near zero implies that there is little or no linear relationship between the two variables(10).
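The behaviour of r can be illustrated with a short calculation. The sketch below assumes that r is the Pearson product-moment correlation coefficient, i.e. the sum of the products of the deviations of x and y from their means, divided by the square root of the product of their sums of squared deviations. The function name `pearson_r` and the three small datasets are invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # large with large, small with small
falling = [10, 8, 6, 4, 2]   # large with small, and vice versa
no_trend = [3, 1, 4, 1, 3]   # no linear trend in either direction

for label, y in [("rising", rising), ("falling", falling), ("no trend", no_trend)]:
    print(f"{label}: r = {pearson_r(x, y):.2f}")
```

The first dataset lies exactly on an increasing straight line, the second on a decreasing one, and the third shows no linear trend, which correspond to the three situations described above.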


PUBLIC HEALTH DATASET

The data come from a cross-sectional study performed from September to October 2005 on 606 Afghani pupils aged 6-14 years in Shahriar County, Tehran Province, Iran. The sample included 312 (56.1%) boys and 284 (46.9%) girls, who were originally recruited in order to determine their nutritional status.

Among the variables under study were two continuous variables, the height and the weight of the pupils, which were measured with minimal clothing and no shoes. Weights were measured with an accurately calibrated Seca digital scale, with an accepted error of 0.1 kg, and heights with a stadiometer, with an accepted error of 0.1 cm.

Let us imagine that the researchers who conducted the above study are keen to determine the relationship between the heights and the weights of the pupils. In other words, they would like to know to what extent the pupils' weights increase as their heights increase.

To answer this question, Diagram 1, a scatter plot, was produced. In this graph the height of the pupils (i.e. the independent variable) is shown on the X axis and the weight of the pupils (i.e. the dependent variable) on the Y axis. As the diagram depicts, a straight line runs through the bulk of the data, and the correlation coefficient (r = 0.8) indicates a strong positive relationship between these two variables. This implies that taller pupils tend, to a large extent, to be heavier. Although most observations are gathered relatively closely around the line, there are also some outliers. For instance, case number 490, a 16-year-old boy with a height of 176 cm and a weight of 70 kg, and case number 399, an 11-year-old girl with a height of 133 cm and a weight of 17 kg, are clearly two outliers.

Diagram 1. Scatter plot depicting the relation between height and weight of 606 Afghani pupils aged 6-14 years


Since other variables such as gender may confound the relationship between height and weight, two separate scatter plots were produced for boys (Diagram 2) and girls (Diagram 3). Both diagrams show a strong positive relationship between these two variables in both sexes. However, the observations in Diagram 2 are gathered more closely around the line than those in Diagram 3. Furthermore, the values of the correlation coefficients indicate that this relationship is slightly stronger among boys (r = 0.81) than among girls (r = 0.79).
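Examining the relationship separately within each subgroup can be sketched as follows. This is an illustration only: the records are made-up values rather than the study data, the names `pearson_r` and `r_by_group` are hypothetical, and r is again assumed to be the Pearson correlation coefficient.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_by_group(records):
    """Compute r separately for each group.

    records: iterable of (group, x, y) tuples.
    Returns a dict mapping each group label to its correlation coefficient.
    """
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in records:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {group: pearson_r(xs, ys) for group, (xs, ys) in grouped.items()}


# Made-up records (sex, height in cm, weight in kg); not the study data.
records = [
    ("boy", 120, 22), ("boy", 130, 27), ("boy", 140, 33), ("boy", 150, 41),
    ("girl", 118, 21), ("girl", 128, 27), ("girl", 139, 31), ("girl", 149, 42),
]

for sex, r in r_by_group(records).items():
    print(f"{sex}: r = {r:.2f}")
```

Plotting and summarising each subgroup separately in this way shows whether the overall pattern also holds within each group.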

Diagram 2. Scatter plot depicting the relation between height and weight of 312 Afghani boys aged 6-14 years


Diagram 3. Scatter plot depicting the relation between height and weight of 284 Afghani girls aged 6-14 years


CONCLUSION

The scatter plot is among the most powerful graphical methods for examining the relationship between two continuous variables in terms of position and variability. Its use in the early stages of data analysis, i.e. data visualization, is therefore strongly recommended(8-10).

 

ACKNOWLEDGMENT

The author would like to thank Ian Enzer for his valuable comments on an earlier draft of this article.



REFERENCES

  1. Rezaeian M. How to visualize public health data? Part one: Box plot and map. Middle East J Family Med 2008; 10: 20-24.
  2. Rezaeian M. How to visualize public health data? Part two: Direct and indirect standardization methods. Middle East J Family Med 2009; 1: 42-44.
  3. Cleveland WS. Visualizing data. Summit, NJ: Hobart Press, 1993.
  4. Everitt BSE, Dunn G. Applied multivariate data analysis. London: Arnold, 2001.
  5. Rezaeian M, Dunn G, St. Leger S, Appleby L. Geographical epidemiology, spatial analysis and geographical information systems: a multidisciplinary glossary. J Epidemiol Community Health 2007; 61: 98-102.
  6. Rezaeian M, Dunn G, St. Leger S, Appleby L. The production and interpretation of disease maps: A methodological case-study. Soc Psychiatry Psychiatr Epidemiol 2004; 39: 947-954.
  7. Bell BS, Broemeling LD. A Bayesian analysis for spatial processes with application to disease mapping. Stat Med 2000; 19: 957-974.
  8. Martin MA, Welsh AH. Graphical display. In: Armitage P, Colton T. International encyclopaedia of biostatistics, pp 1750-1771. Chichester: John Wiley, 1998.
  9. Harrington R. Scatter plot. In: Boslaugh S. Encyclopaedia of epidemiology, pp 938-939. California: SAGE Publications, Inc., 2008.
  10. Dunn G, Everitt B. Clinical biostatistics. London: Edward Arnold, 1995.