Title Application of statistical modeling and data mining method to the fish stock analyses
Authors Hiroshi SHONO
Keywords CPUE standardization, data mining, generalized linear model, model selection, Tweedie distribution
Citation Bull. Fish. Res. Agen. No. 22, 1-85, 2008
 In this thesis, we focused on various problems in the field of fish population analysis, especially the analysis of catch per unit effort (CPUE), which indicates relative abundance. We proposed several techniques to address these problems through statistical modeling and data mining approaches, using actual fishery data on tuna and related species together with computer simulation experiments.
 CPUE is an important quantity that corresponds to relative stock size and is proportional to stock abundance. However, nominal CPUE may include various spatiotemporal and environmental effects other than stock density, such as area, season and fishing gear, so these effects must be removed to grasp the annual variation of the stock. Traditionally, the factorial effect of year has therefore been estimated in one of two ways. The first is an analysis of covariance (ANCOVA) model (e.g. the CPUE Log-Normal model), in which the natural logarithm of CPUE is the response variable and the assumed factorial effects enter the model as explanatory variables under a normal error. The second is a generalized linear model (GLM) (e.g. the Catch Poisson model or the Catch Negative-Binomial model), in which catch, a discrete variable, is the response and a Poisson, negative binomial or similar distribution is assumed. This work is called CPUE standardization. In addition to statistical modeling, data mining approaches such as tree-regression models and neural networks have recently been used for it.
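 The CPUE Log-Normal model described above can be illustrated with a minimal sketch. It assumes a balanced design (the same number of records in every year-area cell) and strictly positive catches, in which case the least-squares year effect reduces to the year mean of log(CPUE) minus the grand mean. The records, the function name and the choice of two factors are hypothetical and are not taken from the thesis.

```python
import math
from collections import defaultdict

# Hypothetical records: (year, area, catch, effort). Values are illustrative only.
records = [
    (2001, "A", 120.0, 100.0), (2001, "B", 60.0, 100.0),
    (2002, "A", 90.0, 100.0),  (2002, "B", 45.0, 100.0),
    (2003, "A", 60.0, 100.0),  (2003, "B", 30.0, 100.0),
]

def standardized_year_index(records):
    """Year index from log(CPUE) = mu + year + area + error.

    Assumes a balanced design, so the least-squares year effect equals the
    year mean of log(CPUE) minus the grand mean. Requires catch > 0.
    """
    log_cpue_by_year = defaultdict(list)
    for year, _area, catch, effort in records:
        log_cpue_by_year[year].append(math.log(catch / effort))
    all_values = [v for vals in log_cpue_by_year.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)
    index = {}
    for year, vals in sorted(log_cpue_by_year.items()):
        year_effect = sum(vals) / len(vals) - grand_mean
        index[year] = math.exp(year_effect)  # relative scale
    return index

year_index = standardized_year_index(records)
```

 In this toy data set the area effect (area A always yields twice the CPUE of area B) is removed from the year index, which then reflects only the decline between years. An unbalanced design would need a full least-squares or GLM fit instead of cell means.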
 In this study, we took CPUE standardization, a major issue in fish population analysis, as the main theme and discussed in detail the following three problems in CPUE analysis:
1) Choice of the factorial effects and evaluation of model performance using various information criteria and stepwise tests in ANOVA-type models intended for CPUE standardization (Chapter 3)
2) Prediction of CPUE in time-area cells without fishing operations for southern bluefin tuna using neural networks, and a simple method for attribution analysis (i.e. a method for extracting the CPUE year trend) (Chapter 4)
3) Performance evaluation of the Tweedie model for data that include many zero catches, and comparison of the Tweedie model with the traditional methods (ad hoc ANCOVA method, Catch model) (Chapter 5)
 Chapter 1 is an introduction and describes the background and purpose of this research and the structure of this thesis. In Chapter 2, we outlined CPUE standardization from the viewpoints of statistical modeling, data mining approaches and problems specific to fish stocks, and reviewed several related issues, especially the three main problems addressed in this study.
 In Chapter 3, we performed model selection with various information criteria (AIC, BIC, CAIC, c-AIC, HQ, TIC, etc.) using generalized linear models corresponding to CPUE standardization, in several settings such as small and large samples. We also showed that, for actual fishery data, the result of model selection may differ depending on the information criterion used. We evaluated the selection performance of these criteria by computer simulation, in which data were generated as random numbers from a true model and we calculated how often each criterion chose the true model among several candidate models. Because stepwise tests such as the chi-square or F test can be applied to nested models, we also compared the performance of information criteria and stepwise tests by computer experiments. Variable selection is an important and essential issue because it determines statistically which factorial effects are taken to affect CPUE. In addition, differences in model selection between information criteria and stepwise tests may change the result of the attribution analysis (i.e. the estimated CPUE year trend), which may in turn lead to large differences in estimated absolute abundance in models that include CPUE year trends as tuning indices. The specific results of this chapter are as follows:
- In small samples, and when the number of parameters is large relative to the sample size, the result of model selection by c-AIC, a finite-sample correction of AIC, was found to differ from that by AIC. ANOVA-type simulations showed that the selection performance of c-AIC is better than that of AIC in such cases.
- Through analysis of actual fishery data and simulation with linear regression, it was shown that AIC may be biased in large samples, that the result of model selection differs depending on the information criterion used, and that the consistent information criteria (BIC, HQ and CAIC) are superior to AIC overall. We also suggested a recommended value and formula for the constant term in the consistent information criterion HQ.
- It was proved that the expectation of TIC, which has traditionally been regarded as performing well in nested models, is theoretically equal to that of AIC in a generalized linear model with normal error and identity link function. Computer simulation showed that the selection performance of TIC is almost the same as that of AIC.
- For nested models, our computer experiments showed that information criteria are generally slightly superior to stepwise tests, and that simple models with few parameters tend to be selected when the significance level of the stepwise test is low.
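 To make the comparison of criteria concrete, the following sketch computes several of the criteria named in this chapter from a maximized log-likelihood. The formulas are the commonly cited textbook forms, and it is an assumption that the thesis uses exactly these definitions. In particular, `hq_constant` defaults to 1.0 as a placeholder and is not the value recommended in the thesis. The candidate models and their log-likelihoods are hypothetical.

```python
import math

def information_criteria(log_lik, k, n, hq_constant=1.0):
    """Common information criteria for a fitted model.

    log_lik: maximized log-likelihood; k: number of estimated parameters;
    n: sample size (c-AIC needs n > k + 1, HQ needs n > 1).
    hq_constant: multiplier in HQ (placeholder default, an assumption).
    """
    deviance = -2.0 * log_lik
    aic = deviance + 2.0 * k
    return {
        "AIC": aic,
        "c-AIC": aic + 2.0 * k * (k + 1) / (n - k - 1),
        "BIC": deviance + k * math.log(n),
        "CAIC": deviance + k * (math.log(n) + 1.0),
        "HQ": deviance + 2.0 * hq_constant * k * math.log(math.log(n)),
    }

def select_model(candidates, n, criterion):
    """candidates: {name: (log_lik, k)}. Returns the name minimizing the criterion."""
    return min(
        candidates,
        key=lambda name: information_criteria(*candidates[name], n)[criterion],
    )

# Hypothetical nested candidates fitted to n = 30 records.
candidates = {
    "year": (-52.0, 3),
    "year+area": (-48.5, 6),
    "year+area+season": (-47.9, 9),
}
chosen_by_aic = select_model(candidates, 30, "AIC")
chosen_by_caic_small = select_model(candidates, 30, "c-AIC")
```

 With these illustrative numbers AIC prefers the larger `year+area` model while c-AIC prefers the simpler `year` model, which mirrors the point that the selected model can depend on the criterion when the sample is small relative to the number of parameters.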
 In Chapter 4, we focused on the interpretation of CPUE for southern bluefin tuna, namely the problem of predicting CPUE in spatiotemporal cells without observations, and carried out a CPUE analysis using neural networks. As a measure of relative abundance, it is reasonable to multiply the standardized CPUE by the relative area size; the result is called the abundance index (AI). For the southern bluefin tuna stock, the fishing ground has shrunk over time, so the abundance index is influenced by the assumption made about CPUE in cells that were fished in the past but are no longer fished, that is, whether CPUE in these cells is assumed to be the same as in the surrounding areas or to be 0. This choice causes differences in the CPUE year trend obtained from the abundance index.
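 The effect of the two assumptions about unfished cells can be sketched as follows. This is an illustration only: the function name, the cell labels and the numbers are hypothetical, and the mean CPUE of the fished cells is used as a crude stand-in for "the same as in the surrounding areas", which is an assumption rather than the procedure used in the thesis.

```python
def abundance_index(cpue_by_cell, area_weight, fill="neighbour_mean"):
    """Area-weighted CPUE summed over cells for a single year.

    cpue_by_cell: {cell: CPUE or None}; None marks a cell without operations.
    area_weight: {cell: relative area size}.
    fill: "zero" treats unfished cells as empty; "neighbour_mean" assigns them
    the mean CPUE of the fished cells (a simplifying assumption).
    """
    fished = [v for v in cpue_by_cell.values() if v is not None]
    substitute = 0.0 if fill == "zero" else sum(fished) / len(fished)
    return sum(
        area_weight[cell] * (substitute if cpue is None else cpue)
        for cell, cpue in cpue_by_cell.items()
    )

# Hypothetical year in which two of four cells were no longer fished.
cpue = {"c1": 2.0, "c2": 4.0, "c3": None, "c4": None}
weights = {"c1": 0.25, "c2": 0.25, "c3": 0.25, "c4": 0.25}
ai_same = abundance_index(cpue, weights, fill="neighbour_mean")
ai_zero = abundance_index(cpue, weights, fill="zero")
```

 In this toy case the index under the zero assumption is half of the index under the surrounding-area assumption, so a shrinking fishing ground can change the apparent year trend considerably depending on which assumption is adopted.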
 Therefore, we predicted the CPUE in such missing cells using error back-propagation, a typical algorithm for supervised neural networks, and suggested a simple method of attribution analysis to extract the CPUE year trend. To evaluate the accuracy of the neural networks, we compared them under the same conditions with the MCMC method based on the EM algorithm. Performance checks and model comparisons were carried out by n-fold cross-validation, using the correlation coefficient between observed and predicted values and the mean squared error (MSE).
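 The evaluation procedure described above, n-fold cross-validation scored by correlation and MSE, can be sketched generically. The model here is a deliberately simple stand-in (the training mean of each area) rather than a neural network, and the data, the function names and the interleaved fold assignment are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cross_validate(data, fit, n_folds=5):
    """data: list of (features, observed); fit(train) returns predict(features).

    Returns (correlation, MSE) between observed and out-of-fold predictions.
    """
    observed, predicted = [], []
    for fold in range(n_folds):
        test = data[fold::n_folds]  # records whose index is fold modulo n_folds
        train = [d for i, d in enumerate(data) if i % n_folds != fold]
        predict = fit(train)
        for features, y in test:
            observed.append(y)
            predicted.append(predict(features))
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return pearson(observed, predicted), mse

def fit_group_mean(train):
    """Hypothetical stand-in model: predict the training mean of the same area."""
    sums = {}
    for area, y in train:
        total, count = sums.get(area, (0.0, 0))
        sums[area] = (total + y, count + 1)
    overall = sum(y for _, y in train) / len(train)
    return lambda area: sums[area][0] / sums[area][1] if area in sums else overall

# Hypothetical data: CPUE around 1.1 in area A and around 3.1 in area B.
data = [("A" if i % 2 == 0 else "B", (1.0 if i % 2 == 0 else 3.0) + 0.1 * (i % 3))
        for i in range(30)]
correlation, mse = cross_validate(data, fit_group_mean, n_folds=5)
```

 Any model that exposes the same `fit(train)` interface could be substituted, which allows competing methods to be compared on identical folds.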
 As a result, the ratio of CPUE in cells without operations to that in cells with operations, based on the CPUE values predicted by the neural networks, ranged from 0.8 to 1.0. This is not in extreme contradiction with the CPUE ratio of about 0.7 recorded in the Japanese Experimental Fishing Program (EFP), which was conducted locally from 1998 to 2000, although the years, seasons and areas of that experiment were very limited. The predictive performance of the neural networks was somewhat superior to that of the MCMC method based on the EM algorithm under the same conditions, and the CPUE year trend calculated from the predicted CPUE was very similar to that obtained from generalized linear models including the ANCOVA. These results suggest the good predictive performance of the neural networks and the validity of the proposed simple method of attribution analysis.
 In Chapter 5, we discussed in detail the so-called zero-catch problem, in which the ANCOVA model (with the natural logarithm of CPUE as the response variable) cannot be applied when the data include zero catches, with shark species caught by the tuna longline fishery in mind. We carried out CPUE standardization for yellowfin tuna in the Indian Ocean caught by the Japanese commercial longline fishery, for which the zero-catch ratio is low (about 10%), and for silky shark in the North Pacific Ocean caught by Japanese training vessels, for which the zero-catch ratio is high (more than 80%). For both we used the Tweedie distribution, which is an extension of the compound Poisson model and can handle zero data in a unified way. Specifically, we compared the CPUE year trends obtained from the Tweedie model, the ad hoc ANCOVA model in which a constant is added to all CPUE values, and the Catch Negative-Binomial model. For yellowfin tuna in the Indian Ocean, a target species with a low zero-catch ratio, there was no extreme difference in year trends between the Tweedie model and the ad hoc method. In contrast, for silky shark in the North Pacific Ocean, a by-catch species with a high zero-catch ratio, the CPUE year trend obtained from the Tweedie model differed from those based on the Catch model and the ad hoc method.
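 Why a Tweedie model can accommodate zero catches directly can be seen by simulating it as a compound Poisson-gamma sum, which is the standard representation when the power parameter p lies strictly between 1 and 2. The sketch below uses that representation with hypothetical parameter values chosen to give roughly 80% zeros. It simulates the distribution only and is not the fitting procedure used in the thesis.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    threshold, count, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_tweedie(mu, phi, p, rng):
    """Draw from a Tweedie distribution with 1 < p < 2.

    The mean is mu and the variance is phi * mu**p. A draw is the sum of a
    Poisson number of gamma variates, so it is exactly 0 when no event occurs.
    """
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))    # Poisson rate of catch events
    shape = (2.0 - p) / (p - 1.0)                # gamma shape per event
    scale = phi * (p - 1.0) * mu ** (p - 1.0)    # gamma scale per event
    n_events = sample_poisson(lam, rng)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_events))

rng = random.Random(0)
mu, phi, p = 0.05, 2.0, 1.5  # hypothetical values, not estimates from the thesis
draws = [sample_tweedie(mu, phi, p, rng) for _ in range(20000)]
zero_share = sum(d == 0 for d in draws) / len(draws)
expected_zero_share = math.exp(-mu ** (2.0 - p) / (phi * (2.0 - p)))
sample_mean = sum(draws) / len(draws)
```

 Because the probability mass at zero is part of the distribution itself, no constant needs to be added to CPUE before modeling, in contrast to the ad hoc log-transformed approach.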
 Accuracy of the Tweedie model was higher in each case, judging from the performance check of the candidate models by n-fold cross-validation with both indicators, the correlation coefficient between observed and predicted values and the MSE, as in our analysis with the neural networks. The cross-validation showed that the superiority of the Tweedie model is not so clear when the zero-catch ratio is low, and that applying the ad hoc method then causes few practical problems. When the zero-catch ratio is high, on the other hand, the ranking by correlation coefficient was Tweedie model, Catch model, ad hoc method, and the ranking by MSE was Tweedie model, ad hoc method, Catch model. However, the ad hoc method has a large bias, because almost all of its estimated CPUE values are extremely low regardless of the magnitude of the observed CPUE. We therefore concluded that the ad hoc method is not adequate when the zero-catch ratio is high, as for shark species.
 The final chapter, Chapter 6, presents the conclusions of this thesis. We systematically described the results for the three issues dealt with in this thesis from the viewpoints of fish population analysis and applied statistics, together with research problems for the future.
URI http://www.fra.affrc.go.jp/bulletin/bull/bull22/shono.pdf