SIT718 Real World Analytics


This post works through statistical solutions in R, covering data analysis, relationships between variables, linear regression, and more.

Solution

1.    Understand the Data
The data come from the forest fires dataset (Happe, Harry, 2017).
i.      the.data <- as.matrix(read.table("Forest718.txt"))

Running this command loads the data into the matrix the.data.

ii.    my.data <- the.data[sample(1:517, 200), c(1:13)]
      newdata <- subset(my.data, select = c(V5, V6, V7, V8, V13))
      cor(newdata)

           V5        V6        V7        V8       V13
V13 0.5007270 0.6123075 0.6070099 0.5609810 
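
The sampling and correlation steps above can be sketched on stand-in data. The matrix below is randomly generated (the real Forest718.txt is not used), and the call to set.seed is an addition, assumed here only so that the random sample is repeatable.

```r
# Stand-in for the.data: 517 rows, 13 columns named V1..V13 (synthetic values,
# not the real forest fires data).
set.seed(718)                                   # assumption: fixed seed for repeatability
the.data <- matrix(runif(517 * 13), nrow = 517,
                   dimnames = list(NULL, paste0("V", 1:13)))

# Draw 200 of the 517 rows without replacement, keeping all 13 columns.
my.data <- the.data[sample(1:517, 200), 1:13]

# Keep the four inputs and the target, selecting columns by name.
newdata <- my.data[, c("V5", "V6", "V7", "V8", "V13")]

# Correlation of each input with the target V13 (one row of the full matrix).
cor.with.area <- cor(newdata)["V13", c("V5", "V6", "V7", "V8")]
print(cor.with.area)
```

Because the rows are sampled at random, the coefficients change from run to run unless a seed is fixed.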

Scatter Plots
pairs(newdata)
Correlation coefficients and scatter plots together describe how two or more variables relate to one another (Cohen, P., et al., 2014).
·         Relationship of FFMC and target variable Area
The Fine Fuel Moisture Code [FFMC] is a numerical rating of the moisture content of surface litter and fine fuels. Its association with the burnt area of the forest is assessed using correlation. The scatter plot of FFMC against area shows a positive linear trend, and the correlation coefficient is about 0.50.
·         Relationship of DMC and target variable Area
The Duff Moisture Code [DMC] is a numerical rating of the moisture of loosely compacted organic layers at moderate depth. Its association with the burnt area of the forest is assessed using correlation. The scatter plot of DMC against area shows a positive linear trend, and the correlation coefficient is about 0.61.
·         Relationship of DC and target variable Area
The Drought Code [DC] is a numerical rating of the moisture of compact organic layers at depth. Its association with the burnt area of the forest is assessed using correlation. The scatter plot of DC against area shows a positive linear trend, and the correlation coefficient is about 0.61.
·         Relationship of ISI and target variable Area
The Initial Spread Index [ISI] is a rating of how fast a fire spreads in its early stages. Its association with the burnt area of the forest is assessed using correlation. The scatter plot of ISI against area shows a positive linear trend, and the correlation coefficient is about 0.56.
Histograms
hist(newdata[,c(1)],main = "Histogram of FFMC", xlab = "FFMC")
hist(newdata[,c(2)],main = "Histogram of DMC", xlab = "DMC")
hist(newdata[,c(3)],main = "Histogram of DC", xlab = "DC")
hist(newdata[,c(4)],main = "Histogram of ISI", xlab = "ISI")
hist(newdata[,c(5)],main = "Histogram of Area", xlab = "Area")

The histogram of FFMC shows that ratings between 90 and 100 occur far more often than any other range.

The histogram of DMC shows that ratings between 100 and 150 occur far more often than any other range.

The histogram of DC shows that ratings between 600 and 800 occur far more often than any other range.

The histogram of ISI shows that ratings between 5 and 10 occur far more often than any other range.

The histogram of area shows that values between 0.010 and 0.015 occur far more often than any other range.
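
The most frequent range can also be read off numerically rather than from the plot. The following is a minimal sketch; the helper name modal.bin and the example values are hypothetical and are not taken from the forest fires data.

```r
# Return the lower and upper edge of the histogram bin with the highest count.
modal.bin <- function(values, breaks) {
  h <- hist(values, breaks = breaks, plot = FALSE)  # compute counts only, no plot
  k <- which.max(h$counts)                          # index of the tallest bar
  c(lower = h$breaks[k], upper = h$breaks[k + 1])
}

# Illustrative values only.
example.values <- c(1, 2.5, 2.7, 3, 9)
modal.bin(example.values, breaks = seq(0, 10, by = 2))   # lower 2, upper 4
```

Applied to a column such as newdata[, 1], the same helper would return the edges of the tallest bar for whatever breaks are supplied.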

 

2.    Transforming the data
i.      Assign the transformation to the variable x
For the four variables X5, X6, X7, X8 and the variable of interest X13 = Y, appropriate transformations are applied; the result is assigned to x and written to a file.
x <- array(newdata, c(200, 5, 2))
write.table(x, "name-transformed.txt")
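
The array() call above stores two copies of the sampled values without rescaling them. If rescaling is wanted, one common option (an assumption here, not something this post specifies) is a min-max transformation that maps each column to the unit interval. A minimal sketch on synthetic data, with the helper name minmax being hypothetical:

```r
# Hypothetical helper: linearly rescale a numeric vector to [0, 1].
minmax <- function(v) (v - min(v)) / (max(v) - min(v))

# Synthetic stand-in for newdata (200 rows; 4 inputs and the target).
set.seed(1)
newdata <- matrix(runif(200 * 5, min = 0, max = 100), nrow = 200,
                  dimnames = list(NULL, c("V5", "V6", "V7", "V8", "V13")))

# Apply the rescaling column by column; apply() returns a 200 x 5 matrix.
x <- apply(newdata, 2, minmax)

range(x)   # every column now spans 0 to 1
```

Other transformations (for example a log transform of a skewed target) may suit the data better; the choice depends on the histograms of each variable.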
ii.     Explanation about relationship of the variables
Relationship of FFMC and target variable Area
FFMC and area have a positive linear correlation: area tends to increase as FFMC increases.
Relationship of DMC and target variable Area
DMC and area have a positive linear correlation: area tends to increase as DMC increases.
Relationship of DC and target variable Area
DC and area have a positive linear correlation: area tends to increase as DC increases.
Relationship of ISI and target variable Area
ISI and area have a positive linear correlation: area tends to increase as ISI increases.
3.    Development of models and investigation of variable importance
i.      source("AggWaFit718.R")

ii.     Parameters from fitting functions
Weighted Arithmetic Mean
function(x) {x}
The function associated with the weighted arithmetic mean takes the parameter x.
Weighted Power Means with p = 0.5
function(x) {x^0.5}
The function associated with the weighted power mean with p = 0.5 takes the parameter x.
Weighted Power Means with p = 2
The function associated with the weighted power mean with p = 2 takes the parameters x, w, and p.
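
The weighted means named above can be written out directly. This is a minimal sketch of the standard definitions; the function names wam and wpm and the example inputs are hypothetical and are not the functions defined in AggWaFit718.R.

```r
# Weighted arithmetic mean: sum of w_i * x_i, with weights summing to 1.
wam <- function(x, w) sum(w * x)

# Weighted power mean with exponent p (p != 0): (sum of w_i * x_i^p)^(1/p).
wpm <- function(x, w, p) sum(w * x^p)^(1 / p)

x <- c(1, 4)        # illustrative inputs
w <- c(0.5, 0.5)    # illustrative weights

wam(x, w)           # 2.5
wpm(x, w, p = 0.5)  # 2.25
wpm(x, w, p = 2)    # sqrt(8.5), about 2.915
```

With p = 1 the power mean reduces to the weighted arithmetic mean; larger p pulls the result towards the larger inputs.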

Ordered Weighted averaging function

The parameters of the ordered weighted averaging function are x and w. The fitting function fit.OWA takes the data x together with output.1 and stats.1.
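
An ordered weighted average applies its weights to the sorted inputs rather than to particular variables. The sketch below assumes the inputs are sorted in increasing order, so that orness is 0 for the minimum and 1 for the maximum; the function names owa and orness are hypothetical.

```r
# OWA: weights are applied to the inputs after sorting them in increasing order.
owa <- function(x, w) sum(w * sort(x))

# Orness for the increasing-order convention: 0 for the minimum, 1 for the maximum.
orness <- function(w) {
  n <- length(w)
  sum(w * (seq_len(n) - 1) / (n - 1))
}

x <- c(0.2, 0.8, 0.5, 0.4)   # illustrative inputs
w <- c(1, 0, 0, 0)           # all weight on the smallest input

owa(x, w)      # 0.2, the minimum of x
orness(w)      # 0
```

Under this convention, a weight vector of (0, 0, 0, 1) would return the maximum and have orness 1.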

Choquet Integral

The parameters of the Choquet integral function are x and v, with n the length of x and w an array indexed from 0 to n.
The fitting function fit.choquet takes the data x together with output.1 and stats.1.
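
The discrete Choquet integral can be sketched as follows. The indexing convention is an assumption: v[k] is taken to be the fuzzy measure of the subset whose members are the set bits of k, with input 1 as the lowest bit. The function name choquet and the example weights are hypothetical and are not the implementation in AggWaFit718.R.

```r
# Discrete Choquet integral of x with respect to a fuzzy measure v.
# Assumption: v[k] is the measure of the subset whose members are the set bits
# of k (input 1 = bit 1, input 2 = bit 2, input 3 = bit 4, ...), k = 1..2^n - 1.
choquet <- function(x, v) {
  n <- length(x)
  ord <- order(x)                       # positions of inputs, smallest first
  total <- 0
  previous <- 0
  for (i in seq_len(n)) {
    remaining <- ord[i:n]               # inputs at least as large as the i-th smallest
    k <- sum(2^(remaining - 1))         # binary index of that subset
    total <- total + (x[ord[i]] - previous) * v[k]
    previous <- x[ord[i]]
  }
  total
}

# Additive measure built from illustrative weights: v(S) = sum of weights in S.
w <- c(0.1, 0.2, 0.3, 0.4)
v.additive <- sapply(1:15, function(k) sum(w[(k %/% 2^(0:3)) %% 2 == 1]))

x <- c(0.2, 0.8, 0.5, 0.4)
choquet(x, v.additive)   # equals the weighted mean sum(w * x) = 0.49
```

With an additive measure the integral reduces to a weighted arithmetic mean, which gives a simple check; a measure that is 1 only on the full set returns the minimum of the inputs.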

iii.    The functions are executed using the parameters for the data, output.1, and stats.1
fit.OWA(newdata, output.1 = "E:/output1.txt", stats.1 = "E:/stats1.txt")
On executing the function, the output and statistics are written to the given paths. The following table shows the error measures and weight summaries of the data.

Table showing error measures and weight summaries for Ordered Weighted Averaging (OWA) model
Error Measures          Values               i    w_i
RMSE                    10.1221619736531     1    1
Av. abs error           9.23358681685263     2    0
Pearson correlation     0.561829942314725    3    0
Spearman correlation    0.491228745329846    4    0
Orness                  0
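
The error measures reported for a fitted model can be computed from observed and predicted values as follows. This is a sketch of the usual definitions with illustrative numbers; the helper name fit.stats is hypothetical and is not part of AggWaFit718.R.

```r
# Fit statistics comparing predictions with observed values.
fit.stats <- function(observed, predicted) {
  err <- predicted - observed
  c(rmse         = sqrt(mean(err^2)),                 # root mean squared error
    av.abs.error = mean(abs(err)),                    # average absolute error
    pearson      = cor(observed, predicted),          # linear correlation
    spearman     = cor(observed, predicted,
                       method = "spearman"))          # rank correlation
}

# Illustrative numbers only.
observed  <- c(1, 2, 3, 4)
predicted <- c(1, 2, 5, 6)
fit.stats(observed, predicted)
```

RMSE and average absolute error are in the units of the target, so they are best judged against the scale of the observed values.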



Table showing error measures and weight summaries for Choquet model
Error Measures          Values               Binary Number    Fm.Weights
RMSE                    10.1221619736547     1                0
Av. abs error           9.23358681685425     2                0
Pearson correlation     0.561829942314806    3                0
Spearman correlation    0.49528209412681     4                0
Orness                  0.222222222222222    5                0
i                       Shapley i            6                0
1                       0                    7                0
2                       0.499999999999998    8                0
3                       0                    9                0
4                       0.499999999999998    10               0.999999999999997
                                             11               0.999999999999997
                                             12               0
                                             13               0
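
Shapley values summarise how much each input contributes to a fuzzy measure. The sketch below uses the standard Shapley formula and assumes the binary indexing in which input 1 is the lowest bit. The example measure equals 1 on every subset containing both input 2 and input 4 and 0 elsewhere; this agrees with the listed entries for binary numbers 1 to 13, while the values for 14 and 15 are not listed in the table and are assumptions. The function name shapley is hypothetical.

```r
# Shapley value of each of n inputs for a fuzzy measure v, where v[k] is the
# measure of the subset whose members are the set bits of k (assumed indexing).
shapley <- function(v, n) {
  measure <- function(k) if (k == 0) 0 else v[k]   # empty set has measure 0
  phi <- numeric(n)
  for (i in seq_len(n)) {
    bit.i <- 2^(i - 1)
    for (k in 0:(2^n - 1)) {
      if ((k %/% bit.i) %% 2 == 1) next            # skip subsets containing i
      s <- sum((k %/% 2^(0:(n - 1))) %% 2)         # size of the subset
      weight <- factorial(s) * factorial(n - s - 1) / factorial(n)
      phi[i] <- phi[i] + weight * (measure(k + bit.i) - measure(k))
    }
  }
  phi
}

# Illustrative measure: 1 when both input 2 and input 4 are present, else 0.
v <- sapply(1:15, function(k) {
  bits <- (k %/% 2^(0:3)) %% 2
  as.numeric(bits[2] == 1 && bits[4] == 1)
})

shapley(v, 4)   # approximately 0, 0.5, 0, 0.5
```

Under these assumptions the result matches the Shapley column of the table: inputs 2 and 4 share the weight equally and inputs 1 and 3 contribute nothing.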

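iv. Comparison and interpretation of the data in tables
The results in the two tables are compared and interpreted below.
The OWA and Choquet models fit almost identically: both give a root mean squared error (RMSE) of about 10.12, an average absolute error of about 9.23, and a Pearson correlation of about 0.56.
Among the four variables, DMC and DC have the strongest correlation with area (about 0.61 each) and so are taken to influence the model most, while the Shapley values of the Choquet model place all of the weight on the second and fourth inputs (about 0.5 each).
c.    Yes, the variables are complementary.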
A better model may need either more or fewer inputs. In the given case, the models use 4 input variables and 200 observations.

From the results, all of the models use the full set of inputs, and all of these inputs are required for analysing the variables.
 
4.    Model for Prediction
i.      For the following inputs, the area is predicted with the best-fitting model:
X5=91.6; X6=181.3; X7=613; X8=7.6; X9=24.6; X10=44; X11=4; X12=0.

The best-fitting model is developed using the lm function (Draper, N.R. and Smith, H., 2014). The data in the array is converted to a data frame and passed to lm to predict the area.
 
lm(formula = X13 ~ X5 + X6 + X7 + X8, data = dd)

Output
Residuals:
       Min         1Q     Median         3Q        Max 
-0.0034333 -0.0010825 -0.0003257  0.0005763  0.0063718 
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.243e-03  2.093e-03   0.594 0.553251    
X5          7.862e-05  2.486e-05   3.163 0.001812 ** 
X6          1.203e-05  2.732e-06   4.405 1.74e-05 ***
X7          2.536e-06  6.940e-07   3.654 0.000332 ***
X8          2.081e-04  3.441e-05   6.047 7.38e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.001734 on 195 degrees of freedom
Multiple R-squared:  0.5921,          Adjusted R-squared:  0.5837 
F-statistic: 70.77 on 4 and 195 DF,  p-value: < 2.2e-16
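
The regression workflow can be sketched end to end on stand-in data. The data frame below is synthetic (its values and the linear rule that generates X13 are arbitrary assumptions), so the numbers it prints will not match the output above; only the sequence of calls is the point.

```r
# Synthetic stand-in for the sampled data: four inputs and a target column.
set.seed(42)
dd <- data.frame(X5 = runif(200, 80, 100),
                 X6 = runif(200, 0, 300),
                 X7 = runif(200, 0, 900),
                 X8 = runif(200, 0, 20))
# Target generated from an arbitrary linear rule plus noise (illustrative only).
dd$X13 <- 0.001 + 1e-4 * dd$X5 + 1e-5 * dd$X6 + 2e-6 * dd$X7 + 2e-4 * dd$X8 +
  rnorm(200, sd = 0.001)

# Fit the target on the four inputs and inspect the coefficient table.
fit <- lm(X13 ~ X5 + X6 + X7 + X8, data = dd)
print(summary(fit))

# Predict the target for one new observation.
new.obs <- data.frame(X5 = 91.6, X6 = 181.3, X7 = 613, X8 = 7.6)
pred <- predict(fit, newdata = new.obs)
print(pred)
```

Using predict() with a one-row data frame avoids copying rounded coefficients by hand, which is where small arithmetic slips tend to creep in.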

From the output, all four variables X5, X6, X7, and X8 are statistically significant (each p-value is below 0.01).
As per linear regression, the fitted equation is:
X13 = 1.243e-03 + 7.862e-05*X5 + 1.203e-05*X6 + 2.536e-06*X7 + 2.081e-04*X8

Substituting the given values of X5, X6, X7, and X8 into this equation gives the predicted value of X13. The inputs X9 to X12 do not appear in the model and are not used.
X13 = 0.001243 + (0.00007862*91.6) + (0.00001203*181.3) + (0.000002536*613) + (0.0002081*7.6)
X13 = 0.0138
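
The substitution can be checked in a few lines of R. The coefficients are copied from the regression output, and the variable names coefs, inputs, and x13 are chosen here for illustration.

```r
# Coefficients copied from the regression output above.
coefs  <- c(intercept = 1.243e-03, X5 = 7.862e-05, X6 = 1.203e-05,
            X7 = 2.536e-06, X8 = 2.081e-04)
# Given inputs, with a leading 1 to multiply the intercept.
inputs <- c(1, 91.6, 181.3, 613, 7.6)

x13 <- sum(coefs * inputs)   # about 0.01376
round(x13, 4)                # 0.0138
```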

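ii.     Based on the given inputs, the value of Y = X13 is predicted using the linear regression model.
The predicted value lies within the 0.010 to 0.015 range that the histogram of area shows to be most frequent, so it is an acceptable value for the area.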
iii.    Ideal conditions for the selected variables under which an area will result
For FFMC, a high rating of 90 to 100 results in high area coverage.
For DMC, a high rating of 100 to 150 results in high area coverage.
For DC, a high rating of 600 to 800 results in high area coverage.
For ISI, a high rating of 5 to 10 results in high area coverage.
These variables have a positive linear relationship with the area.

References
Happe, H., 2017. Meteomalaga. https://Malagaweather.com. Web. 29 Apr. 2017.
Cohen, P., West, S.G. and Aiken, L.S., 2014. Applied multiple regression/correlation analysis for the behavioral sciences. Psychology Press.
Draper, N.R. and Smith, H., 2014. Applied regression analysis (Vol. 326). John Wiley & Sons.