Data Analysis With R – ICT110 Introduction to Data Science

This assignment is for ICT110 Introduction to Data Science. It involves analysing World Bank Health and Population Statistics data for the years 2001 to 2015.

Solution
Introduction

The data for this analysis were extracted from World Bank statistics (http://databank.worldbank.org) on health and population in Asia-Pacific countries between 2001 and 2015. Several variables of interest to public- and private-sector organisations (for example, government health services, academic researchers and the pharmaceutical industry) were analysed in one-variable and two-variable analyses, and the results are summarised.

The dataset contains the following attributes:
• Birth rate, crude (per 1,000 people)
• Fertility rate, total (births per woman)
• Adolescent fertility rate (births per 1,000 women ages 15-19)
• Death rate, crude (per 1,000 people)
• Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)
• Cause of death, by injury (% of total)
• Cause of death, by non-communicable diseases (% of total)
• Mortality caused by road traffic injury (per 100,000 people)
• Health expenditure per capita (current US$)
• GNI per capita, Atlas method (current US$)
• Health expenditure, private (% of GDP)
• Health expenditure, public (% of GDP)
• Health expenditure, total (% of GDP)
• Maternal mortality ratio (national estimate, per 100,000 live births)
• Immunization, BCG (% of one-year-old children)
• Life expectancy at birth, male (years)
• Life expectancy at birth, female (years)
• Life expectancy at birth, total (years)
• School enrollment, primary (% gross)
• School enrollment, secondary (% gross)
• School enrollment, tertiary (% gross)
• School enrollment, tertiary, female (% gross)
• Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
• Unemployment, female (% of female labor force) (modeled ILO estimate)
• Unemployment, male (% of male labor force) (modeled ILO estimate)
• Unemployment, total (% of total labor force) (modeled ILO estimate)

Data dimensions and structures
There were many gaps in the data: complete series for the entire period were available for only a few countries. Analysis was therefore limited to those variables and countries for which the data were complete for at least 10 consecutive years.
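The selection rule above (at least 10 consecutive years without gaps) could be checked along the following lines. This is only a sketch in base R: the data frame `dat`, its column names and the helper `longest_run` are hypothetical, and the report does not show how the screening was actually done.

```r
# Hypothetical toy data: one row per country-year; NA marks a missing value.
years <- 2001:2015
dat <- data.frame(
  country = rep(c("A", "B"), each = length(years)),
  year    = rep(years, times = 2),
  value   = c(rep(1, 15),                                          # A: complete
              1, NA, 1, 1, 1, NA, 1, 1, NA, 1, 1, 1, 1, NA, 1)     # B: many gaps
)

# Length of the longest run of consecutive non-missing values in x.
longest_run <- function(x) {
  runs <- rle(!is.na(x))
  if (!any(runs$values)) return(0L)
  max(runs$lengths[runs$values])
}

# Assumes every year has a row; sort so that runs follow calendar order.
dat <- dat[order(dat$country, dat$year), ]
run_by_country <- tapply(dat$value, dat$country, longest_run)

# Countries with at least 10 consecutive complete years.
keep <- names(run_by_country)[run_by_country >= 10]
```

In this toy example only country A would be kept, because the longest gap-free stretch for country B is 4 years.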

The following R packages were used:

ggplot2
reshape
cluster

One-variable analyses

Table 1: Adolescent fertility rate (births per 1,000 women aged 15-19) in ten Asia-Pacific countries; mean and standard deviation over 15 years (2001-2015)

Country	Mean	Std Dev
Australia	16.2	1.1
Cambodia	48.5	1.5
China	7.7	0.4
Indonesia	50.6	1.3
Japan	4.9	0.6
Korea	1.9	0.2
Malaysia	12.6	1.1
New Zealand	27.1	2.1
Thailand	42.8	0.8
Vietnam	32.5	3.9
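A per-country mean and standard deviation of the kind reported in Table 1 can be computed with base R's `aggregate()`. The sketch below uses a small hypothetical data frame `afr` with made-up values, not the World Bank figures, and is not necessarily how the table was produced.

```r
# Hypothetical long-format data: one adolescent fertility rate per country-year.
afr <- data.frame(
  country = rep(c("X", "Y"), each = 3),
  year    = rep(2001:2003, times = 2),
  rate    = c(10, 12, 14, 40, 45, 50)
)

# Mean and standard deviation of the rate for each country.
country_mean <- aggregate(rate ~ country, data = afr, FUN = mean)
country_sd   <- aggregate(rate ~ country, data = afr, FUN = sd)

# Combine into one table with columns rate_mean and rate_sd, rounded to 1 decimal.
summary_tab <- merge(country_mean, country_sd, by = "country",
                     suffixes = c("_mean", "_sd"))
summary_tab[, -1] <- round(summary_tab[, -1], 1)
```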

The analyses are presented as graphs. For Figures 1 and 2 a graph over time is appropriate because the purpose is to show trends over a period of 15 years in the selected countries: the number of births to adolescent females (Figure 1) and the percentage of adolescents enrolled in education (Figure 2). School enrollment shows an increasing trend, particularly in the low-income countries. This is encouraging, because pregnancy among young females is mostly an issue in low-income countries. Such information can only be obtained by looking at trends over several years, as these graphs do.

Figure 3, on the other hand, is a bar graph of per capita healthcare expenditure in the same ten countries for the year 2015 only. Showing a single year as a bar graph is appropriate here because, when the data are examined, the ranking of the countries is the same throughout the 15-year period.
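The claim that the country ranking does not change from year to year can be verified directly. The following is a sketch with a hypothetical matrix `spend` of made-up expenditure values (rows are countries, columns are years); the real check would use the World Bank series.

```r
# Hypothetical per capita health expenditure (current US$); values are made up.
spend <- matrix(
  c(4000, 4200, 4500,
     900,  950, 1000,
     100,  120,  150),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P", "Q", "R"), c("2013", "2014", "2015"))
)

# Rank countries within each year (rank 1 = highest spender that year).
ranks <- apply(-spend, 2, rank)

# TRUE if every year's ranking equals the first year's ranking.
same_ranking <- all(ranks == ranks[, 1])
```

If `same_ranking` is `TRUE`, a bar graph of any single year conveys the same ordering as the whole period.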

Figure 1: No. of Births per 1000 female adolescents across all years from 2001 to 2015

Figure 2: Plot of % of Secondary School Enrollment
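Trend plots such as Figures 1 and 2 can be drawn with ggplot2, one of the packages listed earlier. The sketch below assumes long-format data in a hypothetical data frame `trend` with made-up values; the original plotting code is not shown in the report, so the layout here is an assumption.

```r
library(ggplot2)

# Hypothetical long-format data: one rate per country-year (values are made up).
trend <- data.frame(
  country = rep(c("X", "Y"), each = 5),
  year    = rep(2001:2005, times = 2),
  rate    = c(50, 48, 47, 45, 44, 16, 16, 15, 15, 14)
)

# One line per country, so that trends over time can be compared.
p <- ggplot(trend, aes(x = year, y = rate, colour = country)) +
  geom_line() +
  geom_point() +
  labs(x = "Year",
       y = "Births per 1,000 women aged 15-19",
       colour = "Country")
```

Calling `print(p)` draws the plot on the active graphics device.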

Table 2: Percentage of adolescents in tertiary education in ten Asia-Pacific countries; mean and standard deviation over 15 years (2001-2015)

Countries	Mean	S.D.
Australia	76.9	6.4
Cambodia	17.6	5.1
China	21.1	5.8
Indonesia	22.4	6.6
Japan	56.8	4.1
Korea	93.2	4.9
Malaysia	31.9	4.3
New Zealand	78.6	6.4
Thailand	46.6	4.9
Vietnam	18.5	6.6

Figure 3: Plot of per capita healthcare expenditure, 2015

From left to right: New Zealand, Australia, Japan, Korea, Malaysia, China, Thailand, Vietnam, Indonesia, Cambodia

Two-variable analyses

The results of two two-variable analyses follow: 1. the relationship between adolescent engagement in education and teenage pregnancy; 2. the relationship between per capita gross national income (GNI) and life expectancy at birth.
In both analyses, the selected graph is appropriate because the purpose is simply to show the relationship between the two variables.

Figure 4: Relationship between Child births and Secondary School enrollment
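The direction and strength of such a relationship can be summarised with a correlation coefficient. The sketch below uses the country means from Tables 1 and 2. Note the assumption: Table 2 reports tertiary enrollment, so it only stands in for the secondary enrollment series plotted in Figure 4, which is not tabulated in this report.

```r
# Country means taken from Table 1 (births) and Table 2 (tertiary enrollment).
tab <- data.frame(
  country  = c("Australia", "Cambodia", "China", "Indonesia", "Japan",
               "Korea", "Malaysia", "New Zealand", "Thailand", "Vietnam"),
  births   = c(16.2, 48.5, 7.7, 50.6, 4.9, 1.9, 12.6, 27.1, 42.8, 32.5),
  tertiary = c(76.9, 17.6, 21.1, 22.4, 56.8, 93.2, 31.9, 78.6, 46.6, 18.5)
)

# Pearson correlation; a negative value means births fall as enrollment rises.
r <- cor(tab$births, tab$tertiary)

# cor.test() adds a confidence interval and p-value for the same coefficient.
ct <- cor.test(tab$births, tab$tertiary)
```

With these ten country means the coefficient is negative, in line with the direction of the relationship described in the conclusion.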

Clustering

Clustering is the grouping of a set of data objects so that objects in the same cluster are more similar to one another than to objects in other clusters (Cornish, 2007).
Example: in marketing, cluster analysis helps marketers identify distinct groups in their customer base (for example, 18-30-year-old males, or those living in a particular suburb) and then use this knowledge to develop targeted marketing programs.

The k-means clustering algorithm attempts to split an unlabelled data set (one containing no information as to class identity) into a fixed number (k) of clusters, assigning each observation to the cluster whose mean is closest to it.

The data in the present exercise can be clustered into, for example, country groups based on GNI (High, Medium and Low), and the various variables can then be analysed within and between groups.
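The GNI grouping described above could be sketched with `kmeans()` from base R's stats package (the report lists the cluster package but does not show which function was used). The GNI values and country labels below are hypothetical placeholders, not World Bank figures.

```r
# Hypothetical GNI per capita (current US$) for nine unnamed countries.
gni <- c(A = 500, B = 800, C = 1200,
         D = 9000, E = 10000, F = 11000,
         G = 40000, H = 45000, I = 52000)

set.seed(1)                                   # k-means starts from random centres
fit <- kmeans(gni, centers = 3, nstart = 25)  # k = 3 clusters, 25 random starts

# Cluster numbers are arbitrary, so relabel them by ascending cluster mean.
income_levels <- c("Low", "Medium", "High")
ord   <- order(fit$centers[, 1])
group <- factor(income_levels[match(fit$cluster, ord)], levels = income_levels)
names(group) <- names(gni)
```

The factor `group` can then be used to split the other variables for within-group and between-group comparisons.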

Linear regression analysis

Simple linear regression is a statistical method for studying the relationship between two continuous (quantitative) variables, usually referred to as the independent and dependent variables. For example, if we study the growth of a child, we find that the child’s body weight increases with age, up to a point. Age (years) is therefore the independent variable and weight (kg) is the dependent variable. If we draw a graph with age on the X axis and weight on the Y axis, we see a positive relationship. In general, a fitted regression line (see Figures 6 & 7) goes down for a negative relationship and up for a positive relationship.

The regression line is fitted by calculating a Y value for each X data point from the regression equation Y = a + b(X), where Y is the value to be estimated, a is the intercept of the line on the Y axis, b is the regression coefficient (the slope of the line), and X is the known value.
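To make the roles of a and b concrete, the sketch below computes them from the usual least-squares formulas and compares the result with R's `lm()`. The age and weight values are hypothetical and chosen only for illustration.

```r
# Hypothetical data: age (years) and body weight (kg) of a growing child.
age    <- c(1, 2, 3, 4, 5, 6)
weight <- c(10, 12, 14, 16, 18, 21)

# Least-squares estimates: b is the slope, a is the intercept.
b <- cov(age, weight) / var(age)
a <- mean(weight) - b * mean(age)

# Fitted Y value for each X data point, from Y = a + b(X).
fitted_by_hand <- a + b * age

# lm() should give the same intercept and slope.
fit <- lm(weight ~ age)
coef(fit)
```

The slope b is positive here, matching the positive relationship between age and weight described above.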

For example, a shoe manufacturing company can model its production costs with the formula Y = a + b(X), where Y is the total cost. There is an overhead cost (factory rent, electricity, etc.) that remains the same whether the company makes 1 pair or 10,000 pairs; this is represented by ‘a’. The regression coefficient b, calculated from the regression analysis of the data, is the additional cost of each pair. X is the number of pairs of shoes (e.g. 20,000 pairs).

The usefulness of regression analysis is that when the regression equation is known, values of Y can be predicted from it. For example, what will be the production cost of 50,000 pairs of shoes?
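The shoe example can be worked through as follows. The production figures are hypothetical, so the fitted coefficients and the prediction are illustrative only. Note also that 50,000 pairs lies outside the range of the made-up observations, so the prediction assumes the linear relationship continues to hold there.

```r
# Hypothetical production records: pairs made and total cost (US$).
pairs <- c(1000, 5000, 10000, 20000, 30000)
cost  <- c(36000, 94000, 171000, 322000, 468000)

# Fit cost = a + b * pairs; a estimates the overhead, b the cost per pair.
fit <- lm(cost ~ pairs)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["pairs"]]

# Predicted total cost of producing 50,000 pairs.
pred <- predict(fit, newdata = data.frame(pairs = 50000))
```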

Figure 5: Linear regression of Child births vs Secondary school enrollment

Conclusion

Analysis was performed on a set of World Bank data on several health-related characteristics of countries in the Asia-Pacific region. Three one-variable analyses examined teenage pregnancy rates, teenage engagement in secondary education and per capita expenditure on healthcare services in ten selected countries. The countries were selected solely on the availability of comprehensive data, as data were missing for some countries and for some years. The one-variable analyses yielded the country mean and standard deviation (a measure of variation) of each variable over a 15-year period.

Two two-variable analyses were performed on subjects of particular interest. The first examined the relationship between adolescent engagement in education and the teenage pregnancy rate, and showed a negative relationship: in countries where a lower percentage of teenagers were enrolled in secondary school, there were more births to adolescent females. This is important for those concerned with social issues.

The second analysis showed a positive relationship between a country’s GNI and life expectancy at birth. In other words, people in rich countries live longer than those in poorer countries. This finding is of interest to those concerned with world politics and with explaining such a difference. However, it must be noted that part of the difference may be due to the genetic make-up of the people, and perhaps to lifestyle, including diet, rather than to the country’s wealth. Several genetically different groups of people live in the countries of the Asia-Pacific region.

Reflections
No difficulties were experienced in the data analysis because R is a highly user-friendly program. The only issue was the time taken to extract, organise and transfer the data to the R environment.

Data analysis was done using R. The latter was found to be too complicated.

References
Cornish, R. 2007. Cluster analysis. Mathematics Learning Support Centre. www.statstutor.ac.uk/resources/uploaded/clusteranalysis.pdf
