SIT718 Real World Analytics


This post presents worked statistical solutions in R, covering data analysis, relationships between variables, aggregation-function models, and linear regression.

Solution

1. Understand the Data

The data come from the forest fires dataset (Happe, 2017).

i. the.data <- as.matrix(read.table("Forest718.txt"))

Running this command loads the data into a matrix.

ii. my.data <- the.data[sample(1:517, 200), c(1:13)]

newdata <- subset(my.data, select = c(V5, V6, V7, V8, V13))

cor(newdata)

           V5        V6        V7        V8       V13

V13 0.5007270 0.6123075 0.6070099 0.5609810

Scatter Plots

pairs(newdata)

To study the relationship between two or more variables, correlation coefficients and scatter plots are used together: the coefficient measures the strength and direction of the linear association, and the plot shows its shape (Cohen et al., 2014).
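As a minimal sketch of what the correlation coefficient measures, the snippet below computes Pearson's r from its definition and compares it with the built-in cor(). The vectors x and y and the helper pearson_manual are hypothetical illustrations, not values or functions from the assignment.

```r
# Hypothetical toy vectors standing in for one index column and the area column.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation from its definition:
# sum of cross-deviations divided by the root of the product of squared deviations.
pearson_manual <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_manual(x, y)
r_builtin <- cor(x, y)  # Pearson is the default method

print(c(manual = r_manual, builtin = r_builtin))
```

Both values agree, which confirms that cor() with its default method reports the Pearson coefficient.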

· Relationship of FFMC and target variable Area

The Fine Fuel Moisture Code [FFMC] is a numerical rating of the moisture content of surface litter and fine fuels. Its association with the burnt area of the forest is assessed through correlation. The scatter plot of FFMC against area shows a positive linear trend, and the correlation coefficient is about 0.50.

· Relationship of DMC and target variable Area

The Duff Moisture Code [DMC] is a numerical rating of the moisture content of loosely compacted organic layers of moderate depth. Its association with the burnt area of the forest is assessed through correlation. The scatter plot of DMC against area shows a positive linear trend, and the correlation coefficient is about 0.61.

· Relationship of DC and target variable Area

The Drought Code [DC] is a numerical rating of the moisture content of deep, compact organic layers. Its association with the burnt area of the forest is assessed through correlation. The scatter plot of DC against area shows a positive linear trend, and the correlation coefficient is about 0.61.

· Relationship of ISI and target variable Area

The Initial Spread Index [ISI] is a numerical rating of how quickly a fire spreads in its early stages. Its association with the burnt area of the forest is assessed through correlation. The scatter plot of ISI against area shows a positive linear trend, and the correlation coefficient is about 0.56.

Histograms

hist(newdata[, 1], main = "Histogram of FFMC", xlab = "FFMC")

hist(newdata[, 2], main = "Histogram of DMC", xlab = "DMC")

hist(newdata[, 3], main = "Histogram of DC", xlab = "DC")

hist(newdata[, 4], main = "Histogram of ISI", xlab = "ISI")

hist(newdata[, 5], main = "Histogram of Area", xlab = "Area")

The histogram of FFMC shows that ratings between 90 and 100 occur far more often than any other range.

The histogram of DMC shows that ratings between 100 and 150 occur far more often than any other range.

The histogram of DC shows that ratings between 600 and 800 occur far more often than any other range.

The histogram of ISI shows that ratings between 5 and 10 occur far more often than any other range.

The histogram of area shows that values between 0.010 and 0.015 occur far more often than any other range.
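The most frequent range can also be read off numerically rather than by eye. The sketch below, which assumes a hypothetical vector named values in place of a real column, calls hist() with plot = FALSE and picks the bin with the highest count.

```r
# Hypothetical ratings standing in for one column of the data.
values <- c(91, 92, 93, 94, 95, 96, 85, 70, 92, 93)

# plot = FALSE returns the bin breaks and counts without drawing anything.
h <- hist(values, plot = FALSE)

# Index of the bin with the largest count, and that bin's lower and upper edge.
modal <- which.max(h$counts)
modal_range <- c(lower = h$breaks[modal], upper = h$breaks[modal + 1])

print(modal_range)
```

The same call on a column such as newdata[, 1] would report the range that the corresponding histogram shows as its tallest bar.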

2. Transforming the data

i. Assign the transformation to the variable x

Apply appropriate transformations to the four variables X5, X6, X7, X8 and to the variable of interest X13 = Y, assign the result to x, and write it to a file.

x <- array(newdata, c(200, 5, 2))

write.table(x, "name-transformed.txt")
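One common choice of transformation, not necessarily the one used here, is min-max scaling, which maps every column onto the unit interval so that variables on different scales become comparable. The sketch below is an illustration under that assumption; the helper min_max and the matrix toy are hypothetical.

```r
# Min-max scaling: map a numeric vector linearly onto [0, 1].
# Assumes the vector is not constant (max > min).
min_max <- function(v) {
  (v - min(v)) / (max(v) - min(v))
}

# Hypothetical three-row matrix with the same column names as newdata.
toy <- cbind(
  V5  = c(86.2, 90.6, 91.7),
  V6  = c(26.2, 35.4, 43.7),
  V13 = c(0.0, 0.5, 1.2)
)

# Apply the scaling column by column; the result keeps the column names.
scaled <- apply(toy, 2, min_max)

print(scaled)
```

After scaling, the smallest value in each column is 0 and the largest is 1.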

ii. Explanation about relationship of the variables

Relationship of FFMC and target variable Area

FFMC and area are positively and linearly correlated (r of about 0.50): area tends to increase as FFMC increases.

Relationship of DMC and target variable Area

DMC and area are positively and linearly correlated (r of about 0.61): area tends to increase as DMC increases.

Relationship of DC and target variable Area

DC and area are positively and linearly correlated (r of about 0.61): area tends to increase as DC increases.

Relationship of ISI and target variable Area

ISI and area are positively and linearly correlated (r of about 0.56): area tends to increase as ISI increases.

3. Development of models and investigation of variable importance

i. source("AggWaFit718.R")

ii. Parameters from fitting functions

Weighted Arithmetic Mean

function(x) {x}

This function takes a single argument x and returns it unchanged, which corresponds to the weighted arithmetic mean.

Weighted Power Means with p = 0.5

function(x) {x^0.5}

This function takes a single argument x and raises it to the power 0.5, which corresponds to the weighted power mean with p = 0.5.

Weighted Power Means with p = 2

The function for the weighted power mean with p = 2 takes the parameters x (the inputs), w (the weights), and p (the power).
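To make the role of x, w, and p concrete, here is a minimal sketch of a weighted power mean. It is an illustration only and not the code in AggWaFit718.R; the function name power_mean and the example inputs are assumptions.

```r
# Weighted power mean: (sum_i w_i * x_i^p)^(1/p), for p != 0.
# With p = 1 this is the weighted arithmetic mean.
power_mean <- function(x, w = rep(1 / length(x), length(x)), p = 1) {
  stopifnot(length(x) == length(w), abs(sum(w) - 1) < 1e-9, p != 0)
  sum(w * x^p)^(1 / p)
}

# Hypothetical inputs on the unit interval and weights that sum to 1.
x <- c(0.2, 0.4, 0.6, 0.8)
w <- c(0.1, 0.2, 0.3, 0.4)

wam  <- power_mean(x, w, p = 1)    # weighted arithmetic mean
pm05 <- power_mean(x, w, p = 0.5)  # power mean with p = 0.5
qm   <- power_mean(x, w, p = 2)    # power mean with p = 2 (quadratic mean)

print(c(p0.5 = pm05, p1 = wam, p2 = qm))
```

For fixed inputs and weights the power mean does not decrease as p grows, so the p = 0.5 result is the smallest of the three and the p = 2 result is the largest.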

Ordered Weighted Averaging function

The ordered weighted averaging (OWA) function takes the parameters x and w. The corresponding fitting function, fit.OWA, takes the data x together with output.1 and stats.1.
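The sketch below illustrates how an OWA function uses x and w, together with the orness measure that summarises the weights. It assumes the inputs are sorted in ascending order before weighting; the names owa and orness are hypothetical and this is not the code in AggWaFit718.R.

```r
# OWA: weights are attached to positions in the sorted input, not to variables.
# Assumption: inputs are sorted from smallest to largest.
owa <- function(x, w) {
  stopifnot(length(x) == length(w))
  sum(w * sort(x))
}

# Orness under the same ascending convention: 0 means the minimum, 1 the maximum.
orness <- function(w) {
  n <- length(w)
  sum(w * (seq_len(n) - 1) / (n - 1))
}

x <- c(0.7, 0.2, 0.9, 0.4)  # hypothetical inputs

w_min <- c(1, 0, 0, 0)      # all weight on the smallest input
w_max <- c(0, 0, 0, 1)      # all weight on the largest input
w_avg <- rep(0.25, 4)       # equal weights give the plain mean

print(c(min_like = owa(x, w_min), max_like = owa(x, w_max), mean_like = owa(x, w_avg)))
print(c(orness_min = orness(w_min), orness_max = orness(w_max), orness_avg = orness(w_avg)))
```

Under this convention a weight vector of the form (1, 0, 0, 0) has orness 0 and makes the OWA return the smallest input.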

Choquet Integral

The Choquet integral function involves x (the inputs), v (the fuzzy measure), n (the length of x), and w (an array from 0 to n).

The corresponding fitting function, fit.choquet, takes the data x together with output.1 and stats.1.
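As a rough illustration of how x and v combine, the sketch below computes a discrete Choquet integral. It assumes the fuzzy measure v is stored in binary order, so that variable i contributes 2^(i-1) to the index of a subset (index 10 would then be the subset {2, 4}). The function name choquet_sketch and this indexing are assumptions for illustration, not a description of AggWaFit718.R.

```r
# Discrete Choquet integral:
#   sum_i (x_(i) - x_(i-1)) * v({variables whose input is at least x_(i)})
# where x_(1) <= ... <= x_(n) and x_(0) = 0.
# Assumption: v has length 2^n - 1 and v[k] is the measure of the subset
# whose binary representation is k (variable i <-> bit i-1).
choquet_sketch <- function(x, v) {
  n <- length(x)
  stopifnot(length(v) == 2^n - 1)
  ord <- order(x)       # variable indices from smallest to largest input
  xs <- c(0, x[ord])    # prepend 0 so the first increment is x_(1) itself
  total <- 0
  for (i in seq_len(n)) {
    members <- ord[i:n]            # variables not yet "used up"
    idx <- sum(2^(members - 1))    # binary index of that subset
    total <- total + (xs[i + 1] - xs[i]) * v[idx]
  }
  total
}

x <- c(0.5, 0.2)                 # hypothetical inputs for two variables

v_additive <- c(0.3, 0.7, 1)     # v({1}) = 0.3, v({2}) = 0.7, v({1,2}) = 1
v_min      <- c(0, 0, 1)         # only the full set has weight: behaves like min
v_max      <- c(1, 1, 1)         # every non-empty set has weight: behaves like max

print(c(additive = choquet_sketch(x, v_additive),
        min_like = choquet_sketch(x, v_min),
        max_like = choquet_sketch(x, v_max)))
```

With an additive measure the integral reduces to a weighted arithmetic mean, which is a quick sanity check on the implementation.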

iii. The fitting functions are executed with arguments for the data, output.1, and stats.1

fit.OWA(newdata, output.1 = "E:/output1.txt", stats.1 = "E:/stats1.txt")

When the function runs, the output and the statistics are written to the given paths. The following tables show the error measures and weight summaries for the fitted models.

Table showing error measures and weight summaries for Ordered Weighted Averaging (OWA) model

Error Measures	Values
RMSE	10.1221619736531
Av. abs error	9.23358681685263
Pearson correlation	0.561829942314725
Spearman correlation	0.491228745329846
Orness	0

i	w_i
1	1
2	0
3	0
4	0

Table showing error measures and weight summaries for Choquet model

Error Measures	Values
RMSE	10.1221619736547
Av. abs error	9.23358681685425
Pearson correlation	0.561829942314806
Spearman correlation	0.49528209412681
Orness	0.222222222222222

i	Shapley i
1	0
2	0.499999999999998
3	0
4	0.499999999999998

Binary Number	Fm.Weights
1	0
2	0
3	0
4	0
5	0
6	0
7	0
8	0
9	0
10	0.999999999999997
11	0.999999999999997
12	0
13	0
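The four error measures reported in the tables can be reproduced from any pair of observed and predicted vectors. The sketch below shows one way to compute them; the helper error_measures and the toy vectors are hypothetical and are not the fitted values from the models above.

```r
# Compute the four accuracy measures for a set of predictions.
error_measures <- function(observed, predicted) {
  c(
    RMSE         = sqrt(mean((observed - predicted)^2)),   # root mean squared error
    av_abs_error = mean(abs(observed - predicted)),        # average absolute error
    pearson      = cor(observed, predicted),               # linear association
    spearman     = cor(observed, predicted, method = "spearman")  # rank association
  )
}

# Hypothetical observed and predicted values: every prediction is off by 0.5.
observed  <- c(1, 2, 3, 4, 5)
predicted <- c(1.5, 1.5, 3.5, 3.5, 5.5)

m <- error_measures(observed, predicted)
print(m)
```

Lower RMSE and average absolute error indicate a closer fit, while higher Pearson and Spearman correlations indicate that predictions track the observed values more closely.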

iv. Comparison and interpretation of the data in tables

The results in the two tables are compared and interpreted below.

a. The fit appears acceptable, with a root mean squared error (RMSE) of about 10.12 and an average absolute error of about 9.23 for both the OWA and the Choquet model.

b. Among the four variables, the second and fourth inputs (DMC and ISI) have the largest Shapley values, about 0.5 each, so they influence the model more than the other two variables, whose Shapley values are 0.

c. Yes, the variables are complementary.

d. A fitted model may favour either higher or lower inputs. Here there are 4 input variables and 200 observations. The orness values in the tables (0 for the OWA model and about 0.22 for the Choquet model) are both below 0.5, which indicates that the fitted models lean towards the lower inputs.

All four inputs are retained, since each is needed to analyse the variables of interest.

4. Model for Prediction

i. For the following inputs, the area is predicted with the best fitting model:

X5=91.6; X6=181.3; X7=613; X8=7.6; X9=24.6; X10=44; X11=4; X12=0.

The best-fitting model is developed with the lm function (Draper and Smith, 2014). The array is converted to a data frame, dd, which is passed to lm to predict the area.

lm(formula = X13 ~ X5 + X6 + X7 + X8, data = dd)

Output

Residuals:
       Min         1Q     Median         3Q        Max
-0.0034333 -0.0010825 -0.0003257  0.0005763  0.0063718

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.243e-03  2.093e-03   0.594 0.553251
X5          7.862e-05  2.486e-05   3.163 0.001812 **
X6          1.203e-05  2.732e-06   4.405 1.74e-05 ***
X7          2.536e-06  6.940e-07   3.654 0.000332 ***
X8          2.081e-04  3.441e-05   6.047 7.38e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.001734 on 195 degrees of freedom
Multiple R-squared:  0.5921,          Adjusted R-squared:  0.5837
F-statistic: 70.77 on 4 and 195 DF,  p-value: < 2.2e-16

The output shows that X5, X6, X7, and X8 are all statistically significant: X5 at the 0.01 level and X6, X7, and X8 at the 0.001 level. The intercept is not significant (p = 0.55).
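For readers who want to see the fit-then-predict workflow end to end, here is a minimal sketch on made-up data. The data frame toy, its columns x1, x2, y, and the new observation are all hypothetical; the response is constructed as an exact linear function so the recovered coefficients are known in advance.

```r
# Hypothetical data: y is built as 1 + 2*x1 + 3*x2 with no noise.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 5)
)
toy$y <- 1 + 2 * toy$x1 + 3 * toy$x2

# Fit a linear model with two predictors.
fit <- lm(y ~ x1 + x2, data = toy)

# Predict the response for a new observation; column names must match the predictors.
new_obs <- data.frame(x1 = 2, x2 = 3)
pred <- predict(fit, newdata = new_obs)

print(coef(fit))
print(pred)
```

Using predict() avoids copying rounded coefficients by hand, which is where small arithmetic discrepancies usually creep in.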

As per linear regression, the equation is given as follows:

X13 = 1.243e-03 + 7.862e-05 *X5 + 1.203e-05 *X6 + 2.536e-06*X7 + 2.081e-04*X8

Substituting the given values of X5, X6, X7, and X8 into this equation gives the predicted value of X13:

X13 = 0.001243 + (0.00007862 * 91.6) + (0.00001203 * 181.3) + (0.000002536 * 613) + (0.0002081 * 7.6)

X13 = 0.001243 + 0.007202 + 0.002181 + 0.001555 + 0.001582

X13 ≈ 0.0138
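The substitution can be checked in R directly. The sketch below uses only the coefficients and inputs quoted above; the vector names coefs and inputs are chosen for illustration.

```r
# Coefficients as reported in the regression output.
coefs <- c(intercept = 1.243e-03, X5 = 7.862e-05, X6 = 1.203e-05,
           X7 = 2.536e-06, X8 = 2.081e-04)

# Input values given in the question for the four predictors used by the model.
inputs <- c(X5 = 91.6, X6 = 181.3, X7 = 613, X8 = 7.6)

# Intercept plus the sum of coefficient-times-input, matched by name.
pred <- unname(coefs["intercept"] + sum(coefs[names(inputs)] * inputs))

print(round(pred, 4))
```

The unrounded sum is about 0.01376, so the prediction is 0.0138 to four decimal places.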

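ii. Based on the given inputs, the value of Y = X13 is predicted using the linear regression model

The predicted value, roughly 0.014, falls within the 0.010 to 0.015 range that the histogram of area shows to be most frequent, so it is a plausible value for the area.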

iii. Ideal conditions for the selected variables under which an area will result

For FFMC, ratings are concentrated between 90 and 100, a range associated with high area coverage.

For DMC, ratings are concentrated between 100 and 150, a range associated with high area coverage.

For DC, ratings are concentrated between 600 and 800, a range associated with high area coverage.

For ISI, ratings are concentrated between 5 and 10, a range associated with high area coverage.

Each of these variables has a positive linear relationship with the area.

References

Happe, H., 2017. Meteomalaga. https://Malagaweather.com [Accessed 29 Apr. 2017].

Cohen, P., West, S.G. and Aiken, L.S., 2014. Applied multiple regression/correlation analysis for the behavioral sciences. Psychology Press.

Draper, N.R. and Smith, H., 2014. Applied regression analysis (Vol. 326). John Wiley & Sons.
