The Ideal Statistics Module

Abstract

In this document, we design a hypothetical "ideal" statistics module for Scilab.

Introduction

The goal of this document is to design a hypothetical "ideal" statistics module for Scilab. First, we analyse the limitations of the current statistics features provided by Scilab, by Stixbox and by other toolboxes. In the second part, we present the ideal statistics module and its features.

The goal of this document is not to provide an analysis of the current features in this field (see Documents and tutorials for Probabilities and Statistics in Scilab on this topic).

Issues with existing tools

Issues with Scilab

Here are the current functions in the statistics section.

Central Tendency:

Measures of Dispersion:

Measures of Shape:

Data with Missing Values:

Descriptive Statistics

Summaries

Sampling

Principal Component Analysis

Hypothesis Testing

Regression

Distribution functions:

Random number generators:

Scilab provides several probability and statistics features, including several distribution functions:

The statistics-related bugs are listed in Bugzilla at:

http://bugzilla.scilab.org/buglist.cgi?cmdtype=runnamed&namedcmd=statistics&list_id=16472

In fact, a detailed analysis shows that the existing features could easily be improved on the following points.

See the Statistics category in the bug reports for a complete list of the bugs:

http://bugzilla.scilab.org/buglist.cgi?cmdtype=runnamed&namedcmd=Bugs%20Stats&list_id=11752

Issues with Stixbox

Here are the functions provided by Stixbox.

There are many issues with Stixbox.

The issues with Stixbox are reported at:

http://forge.scilab.org/index.php/p/stixbox/issues/

Issues with regtools

Regtools is a toolbox which is packaged on Atoms:

http://atoms.scilab.org/toolboxes/regtools

The regtools module provides the following functions:

A review has been done at:

http://wiki.scilab.org/New%20Scientific%20Features%20in%202011#A6th_of_February_2011:_Regression_tools

There are several issues with Regtools.

Issues with CASCI

The CASCI toolbox includes various functions for probability & statistics that are used by P. Castagliola's lab at Université de Nantes.

http://atoms.scilab.org/toolboxes/casci

The toolbox is developed on the Forge:

http://forge.scilab.org/index.php/p/casci/source/tree/master/

The CASCI toolbox provides the following functions:

There are several issues with CASCI:

Other tools

During the 2012 International Open Source Software Contest, a statistical toolbox was created:

TODO : review this toolbox

Existing statistical tools outside of Scilab

Matlab

Matlab has a solid set of statistical functions.

The following page is the entry point for the statistical toolbox:

http://www.mathworks.fr/fr/help/stats/index.html

The following page presents the list of functions for hypothesis tests:

http://www.mathworks.fr/fr/help/stats/hypothesis-tests-1.html

R

The statistical features of R are so extensive that Scilab will probably never reach that level of specialization. Nevertheless, R is an excellent reference to look at.

http://stat.ethz.ch/R-manual/R-patched/library/stats/html/00Index.html

The following page presents the list of distributions in R:

http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html

Octave

The following modules and toolboxes exist for Octave:

SciPy

The ideal statistics API

Architecture of the modules

Rather than designing a single module, we may think of several separate modules with clear and orthogonal goals. This restricts the work on each module, increases the chances of reuse and limits the dependencies. For example, it allows us to separate the graphics issues from the computational issues. Mixing the two is one of the main difficulties with several modules (e.g. Metanet) and has been a strong obstacle to their maintenance over the years.
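The separation between computation and graphics can be sketched as follows. This is only an illustration under our own assumptions: the function names mystat_ecdf and mystat_ecdfplot are hypothetical and do not belong to any existing module. The first function only computes, the second only draws, so that the computational part can be tested and reused without any graphics.

```scilab
// Computational part: no dependency on graphics.
// Hypothetical name, for illustration only.
function [xs, p] = mystat_ecdf(x)
    xs = gsort(x(:), "g", "i");   // sort the data in increasing order
    n = size(xs, "*");
    p = (1:n)' / n;               // empirical probabilities P(X <= xs(i))
endfunction

// Graphical part: depends on the computational part, never the reverse.
// Hypothetical name, for illustration only.
function mystat_ecdfplot(x)
    [xs, p] = mystat_ecdf(x);
    plot2d2(xs, p);               // step-function plot
    xtitle("Empirical CDF", "x", "P(X <= x)");
endfunction
```

With this layout, mystat_ecdf could live in a purely computational module, while mystat_ecdfplot could live in a visualization module which depends on it.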

Here are the modules that we could create.

Design of experiments : scidoe

The scidoe toolbox was created for this purpose:

http://forge.scilab.org/index.php/p/scidoe/

The goal of this toolbox is to provide design of experiments techniques, along with functions for model building.

This project is part of the GSOC 2012, managed by Maria Christopoulou.
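As an illustration of the kind of design of experiments function such a toolbox may provide, here is a minimal sketch of a full factorial design. The name mydoe_fullfact is hypothetical and is not taken from the scidoe API; the algorithm is a plain enumeration of all combinations of levels.

```scilab
// Full factorial design: one row per experiment, one column per factor.
// levels(j) is the number of levels of factor j.
// Hypothetical name, for illustration only.
function H = mydoe_fullfact(levels)
    nf = size(levels, "*");
    nruns = prod(levels);
    H = zeros(nruns, nf);
    rep = 1;   // number of consecutive repetitions of each level
    for j = 1:nf
        // Repeat each level rep times (Kronecker product) ...
        col = (1:levels(j))' .*. ones(rep, 1);
        // ... then tile the block to fill the column.
        H(:, j) = ones(nruns / (rep * levels(j)), 1) .*. col;
        rep = rep * levels(j);
    end
endfunction
```

For example, H = mydoe_fullfact([2 3]) produces a matrix with 6 rows and 2 columns, where the first factor varies fastest.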

Here is the list of functions which are available.

Factorial Design

Response Surface Designs

Goals

Here is the list of functions which *will* be created at the end of the project.

Latin Hypercube Designs

Factorial Design

Response Surface Designs

Optimal Designs

Supersaturated Designs

Model Building

General Functions

Distribution functions : distfun

The goal of the distribution function toolbox is to provide the following functions:

This is the goal of the distfun project, which is developed on the Forge:

http://forge.scilab.org/index.php/p/distfun/

and available on Atoms:

http://atoms.scilab.org/toolboxes/distfun

This project is part of the GSOC 2012, managed by Prateek Papriwal.

For each distribution x, we provide five functions :
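To make the "five functions per distribution" idea concrete, here is a sketch for the normal distribution. We assume the usual convention (probability density, cumulative distribution, inverse cumulative distribution, random numbers, mean and variance); this choice and the names mydist_norm* are our own assumptions and are not taken from the distfun API. The sketch relies only on the cdfnor and grand functions of Scilab.

```scilab
// Probability density function.
function y = mydist_normpdf(x, mu, sigma)
    z = (x - mu) ./ sigma;
    y = exp(-0.5 * z.^2) ./ (sigma * sqrt(2 * %pi));
endfunction

// Cumulative distribution function: P(X <= x).
function p = mydist_normcdf(x, mu, sigma)
    // cdfnor requires arguments of the same size
    p = cdfnor("PQ", x, mu * ones(x), sigma * ones(x));
endfunction

// Inverse cumulative distribution function (quantile).
function x = mydist_norminv(p, mu, sigma)
    x = cdfnor("X", mu * ones(p), sigma * ones(p), p, 1 - p);
endfunction

// Random numbers: m-by-n matrix of normal variates.
function r = mydist_normrnd(mu, sigma, m, n)
    r = grand(m, n, "nor", mu, sigma);
endfunction

// Mean and variance of the distribution.
function [M, V] = mydist_normstat(mu, sigma)
    M = mu;
    V = sigma.^2;
endfunction
```

A simple consistency check is the round trip mydist_norminv(mydist_normcdf(x, mu, sigma), mu, sigma), which should return x up to rounding errors.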

Distributions available :

Support

Random Number Generator

Still, the work is not finished: many distributions are still missing from the distfun module.

Datasets

In this section, we present functions to manage datasets.

This section is a *DRAFT*.

Datasets distributed alongside R : rdataset

rdataset is a collection of 597 datasets that were originally distributed alongside the statistical software environment "R" and some of its add-on packages.

Datasets which are available in R can be used in Scilab with rdataset. The toolbox needs around 50 MB. The datasets can be used to test statistical functions in Scilab and to compare the results with the output of R. As the datasets are included in R, no data have to be loaded manually.

This project is developed on the Forge:

http://forge.scilab.org/index.php/p/rdataset/

For example, the dataset survey from the library MASS can be loaded using:

[data,desc] = rdataset_read("MASS","survey");

Statistical visualization : statvis

In this section, we present functions to produce statistical graphics.

This section is a *DRAFT*.

Descriptive statistics

In this section, we present functions to produce descriptive statistics.

This section is a *DRAFT*.

Regression analysis : regmod

Hypothesis testing : hypt

In this section, we present functions which compute tests, confidence intervals and model estimation.

One of the principles of this toolbox is to be compatible with MATLAB, especially the Hypothesis Testing sub-module:

http://www.mathworks.fr/fr/help/stats/hypothesis-tests-1.html
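As an illustration of what MATLAB compatibility could mean, here is a minimal sketch of a two-sided one-sample t-test whose outputs mirror the [h,p] = ttest(x,m) calling sequence of MATLAB. The name myhypt_ttest and the positional alpha argument are our own assumptions; this is not the actual hypt API.

```scilab
// Two-sided one-sample t-test of the hypothesis mean(x) == m.
// h = 1 if the hypothesis is rejected at level alpha, 0 otherwise.
// Hypothetical name, for illustration only.
function [h, p] = myhypt_ttest(x, m, alpha)
    x = x(:);
    n = size(x, "*");
    xbar = mean(x);
    s = sqrt(sum((x - xbar).^2) / (n - 1));   // sample standard deviation
    t = (xbar - m) / (s / sqrt(n));           // t statistic
    [P, Q] = cdft("PQ", t, n - 1);            // Student CDF, n-1 degrees of freedom
    p = 2 * min(P, Q);                        // two-sided p-value
    h = bool2s(p < alpha);
endfunction
```

A typical call would be [h, p] = myhypt_ttest(x, 0, 0.05), where x is a vector of observations.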

This section is a *DRAFT*.

High Priority List:

Low Priority List:

Excluded:

The following functions will not be part of the hypt toolbox:

ANOVA module : anova

TODO

Design:

Development:

Authors

public: The-Ideal-Statistics-Module (last edited 2013-03-17 21:39:22 by michael.baudin@contrib.scilab.org)