Distribution functions in Scilab

Introduction

Scilab provides several probability and statistics features, including several distribution functions.

In fact, a detailed analysis shows that the existing features could easily be enhanced on the following points:

The goal would be to provide a level of quality which could not easily be proved wrong. In its current state, it would be easy to investigate the accuracy of Scilab in the same way that the accuracy of Excel was investigated [1,2,3]. We notice that Matlab and R both provide distribution functions that are accurate and numerous. The small number of distribution functions in Scilab has been noticed in [4], where Scilab receives for this topic (section 3.5) a score of 35%, compared with 47% for Matlab (fortunately, this author did not investigate the numerical accuracy).

More tests of accuracy of distribution functions

The accuracy of distribution functions is a central point in the assessment of the quality of Scilab. This particular point led several researchers to investigate this topic in Excel, but also in Gnumeric, R and others [1,2,3]. But Scilab does not have tools to assess the quality of its distribution functions. Worse, we have evidence that the function cdfbeta only provides 8 accurate digits instead of roughly 16. In fact, it is extremely easy, by using symbolic computation systems such as Mathematica or Maple, to get the required number of significant digits and to compare them with the values computed by Scilab.
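As an illustration of how such a comparison could be automated, here is a minimal Python sketch that estimates how many significant decimal digits a computed value shares with a reference value. The names accurate_digits, binomial_cdf_exact and binomial_cdf_float are hypothetical and are not part of Scilab. The reference is obtained here with exact rational arithmetic instead of Mathematica or Maple, which is only possible for distributions whose probabilities are rational, such as the binomial with a rational parameter.

```python
import math
from fractions import Fraction


def accurate_digits(computed, reference, max_digits=16.0):
    """Estimate the number of correct significant decimal digits in `computed`.

    Uses -log10(relative error), capped to the range [0, max_digits].
    The error is formed in exact rational arithmetic, so the comparison
    itself adds no rounding error.
    """
    reference = Fraction(reference)
    error = abs(Fraction(computed) - reference)
    if error == 0:
        return max_digits
    if reference == 0:
        return 0.0
    rel_err = float(error / abs(reference))
    if rel_err == 0.0:  # relative error underflowed: treat as fully accurate
        return max_digits
    return max(0.0, min(max_digits, -math.log10(rel_err)))


def binomial_cdf_exact(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with p a Fraction (exact result)."""
    q = 1 - p
    return sum(math.comb(n, i) * p**i * q ** (n - i) for i in range(k + 1))


def binomial_cdf_float(k, n, p):
    """Same sum in double precision: the value under test in this sketch."""
    total = 0.0
    for i in range(k + 1):
        total += math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
    return total


# Example: compare a floating point result against the exact reference.
reference = binomial_cdf_exact(3, 10, Fraction(1, 4))
computed = binomial_cdf_float(3, 10, 0.25)
digits = accurate_digits(computed, reference)
```

With such a measure, a function that agrees with its reference only to a relative error near 1e-8 would report about 8 digits, which is the kind of symptom described above for cdfbeta.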

Update (03/2012). Bug #7569 (http://bugzilla.scilab.org/show_bug.cgi?id=7569) has been fixed, which increases the accuracy of many cdf* functions. Still, there is a need for improved tests, as shown by at least two known bugs.

More accurate probability distribution functions

Scilab only provides a limited number of CDFs, and a large number of very common PDFs are not provided. For example, the hypergeometric distribution function is not provided. Worse, if the user relies on toolboxes such as Stixbox, we have evidence that the hypergeometric distribution function provided in this package is numerically inaccurate, i.e. it does not provide a single significant digit for moderate input arguments.

This corresponds to the bug report: http://forge.scilab.org/index.php/p/stixbox/issues/98/
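To illustrate how a hypergeometric probability can break down for moderate arguments, the following Python sketch contrasts a naive formula built from floating point factorials with a variant that works on logarithms. This is not a reconstruction of the Stixbox code, whose implementation is not described here. The function names and the parameter convention (M items in the population, k of them marked, N draws, x marked items drawn) are assumptions made for the example.

```python
import math
from fractions import Fraction


def hypergeo_pmf_naive(x, M, k, N):
    """Naive formula: binomial coefficients from floating point factorials.

    math.gamma overflows for arguments above about 171, so this fails
    as soon as the population size M exceeds roughly 170.
    """
    def binom(n, r):
        return math.gamma(n + 1) / (math.gamma(r + 1) * math.gamma(n - r + 1))

    return binom(k, x) * binom(M - k, N - x) / binom(M, N)


def log_binom(n, r):
    """Logarithm of the binomial coefficient C(n, r), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def hypergeo_pmf_log(x, M, k, N):
    """Same probability computed on the log scale, which avoids overflow."""
    if x < max(0, N - (M - k)) or x > min(k, N):
        return 0.0  # x is outside the support
    return math.exp(log_binom(k, x) + log_binom(M - k, N - x) - log_binom(M, N))


def hypergeo_pmf_exact(x, M, k, N):
    """Exact rational reference (x assumed inside the support)."""
    return Fraction(math.comb(k, x) * math.comb(M - k, N - x), math.comb(M, N))


# Example with a moderate population size, where the naive formula fails.
p_log = hypergeo_pmf_log(50, 1000, 500, 100)
p_ref = float(hypergeo_pmf_exact(50, 1000, 500, 100))
```

The log-scale variant is only one possible remedy. It avoids overflow, but the subtraction of large logarithms still costs a few digits for large arguments, which is why a systematic comparison against exact references remains useful.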

The actual problem is not to fix this particular bug. The real problem is to test all the distributions in Stixbox, so that we can be sure that all functions are accurate. Since this requires a lot of work, it is more efficient to design a new set of functions.

More PDFs and CDFs

Scilab provides some cumulative distribution functions (CDF) but does not provide any probability distribution function (PDF). Practical experience shows that it is non trivial to implement an accurate probability distribution function. For example, it is very easy to develop an extremely inaccurate Poisson distribution function (see for example in Excel). But implementing an accurate PDF becomes manageable once we are aware of the limitations of floating point arithmetic.
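The Poisson case can be sketched as follows in Python. The naive version evaluates the textbook formula exp(-lambda) * lambda^x / x! directly, and the second version evaluates the same expression on the log scale. Both function names are hypothetical, and the log-scale formula is only one common way of working around overflow. It is not claimed to be the algorithm used by any particular Scilab module.

```python
import math


def poisson_pmf_naive(x, lam):
    """Textbook formula. lam**x and x! overflow for moderately large x."""
    return math.exp(-lam) * lam**x / math.factorial(x)


def poisson_pmf_log(x, lam):
    """Same probability as exp(x*log(lam) - lam - log(x!)), using lgamma."""
    if x < 0:
        return 0.0
    if lam == 0.0:
        # Degenerate case: all the mass is at x = 0.
        return 1.0 if x == 0 else 0.0
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


# For small arguments both versions agree.
p_small_naive = poisson_pmf_naive(3, 2.0)
p_small_log = poisson_pmf_log(3, 2.0)

# For x = lam = 1000 the naive version overflows (Python raises
# OverflowError), while the log-scale version still returns a finite value.
p_large = poisson_pmf_log(1000, 1000.0)
```

The intermediate quantities lambda^x and x! are far outside the range of double precision long before the probability itself becomes unrepresentable, which is the floating point limitation the paragraph above refers to.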

The progress during 2012-2013

The distfun project has improved a lot since its creation in 2012, when it provided only 5 distributions. Part of this success is based on the GSOC 2012 (see Contributor-stats-GSOC2012). During that period, we added several basic distributions not included in previous releases: Binomial, Poisson, Chi-Square, Hypergeometric, F, Geometric.

Another boost came after the completion of the GSOC, when most of the work was translated into C source code for increased performance and consistency. The T distribution was also added. A lot of accuracy bugs were fixed, typically for large or small input parameters. The nonlinear equation solver was also updated, leading to improved robustness, speed and portability. The uniform random number generator was updated, with a clarified API (and a clarified implementation). Distfun now provides 13 documented, tested, robust distributions.

Ideas

In this section, we provide a list of potential tasks related to this topic. We especially detail the expected outputs of each potential task. We also analyse the tools which might be required. In each case, a small scientific report at the end of the task will be welcome. We emphasize the benefits of the task for the student. We also detail the software management of the produced source code.

The expected output of these tasks is a collection of Scilab macros (.sci), unit tests (.tst) and help pages (.xml). If possible, a set of C source files (.c) may also be provided.

Other projects related to the same issues are welcome.

Sources of inspiration

http://cran.r-project.org/web/views/Distributions.html gives an overview of the distributions available in R.

Bibliography

Author

2013 - Michaël Baudin

public: Contributor - stats (last edited 2013-05-29 15:40:18 by paul.bignier@scilab-enterprises.com)