The Ideal Statistics Module
Abstract
In this document, we design a hypothetical "ideal" statistics module for Scilab.
Contents
Introduction
The goal of this document is to design a hypothetical "ideal" statistics module for Scilab. In the first part, we analyse the limitations of the current statistics features provided by Scilab, by Stixbox and by other toolboxes. In the second part, we present the ideal statistics module and its features.
The goal of this document is not to provide an analysis of the current features in this field (see Documents and tutorials for Probabilities and Statistics in Scilab on this topic).
Issues with existing tools
Issues with Scilab
Here are the current functions in the statistics section.
Central Tendency:
- geomean — geometric mean
- harmean — harmonic mean
- mean — mean (row mean, column mean) of vector/matrix entries
- meanf — weighted mean of a vector or a matrix
- trimmean — trimmed mean of a vector or a matrix
Measures of Dispersion:
- iqr — interquartile range
- mad — mean absolute deviation
- strange — range
Measures of Shape:
- cmoment — central moments of all orders
- moment — non central moments of all orders
- perctl — computation of percentiles
- quart — computation of quartiles
Data with Missing Values:
- nancumsum — This function returns the cumulative sum of the values of a matrix
- nand2mean — difference of the means of two independent samples
- nanmax — max (ignoring Nan's)
- nanmean — mean (ignoring Nan's)
- nanmeanf — mean (ignoring Nan's) with a given frequency.
- nanmedian — median of the values of a numerical vector or matrix
- nanmin — min (ignoring Nan's)
- nanstdev — standard deviation (ignoring the NANs).
- nansum — Sum of values ignoring NAN's
- thrownan — Eliminates nan values
Descriptive Statistics
- center — center
- wcenter — center and weight
- correl — correlation of two variables
- covar — covariance of two variables
- median — median
- msd — mean squared deviation
- mvvacov — computes variance-covariance matrix
- stdevf — standard deviation
- st_deviation — standard deviation
- variance — variance of the values of a vector or matrix
- variancef — standard deviation of the values of a vector or matrix
Summaries
- tabul — frequency of values of a matrix or vector
- nfreq — frequency of the values in a vector or matrix
Sampling
- sample — Sampling with replacement
- samplef — sample with replacement from a population and frequencies of its values.
- samwr — Sampling without replacement
Principal Component Analysis
- pca — Computes principal components analysis with standardized variables
- princomp — Principal components analysis
- show_pca — Visualization of principal components analysis results
Hypothesis Testing
- ftest — Fisher ratio
- ftuneq — Fisher ratio for samples of unequal size.
Regression
- regress — regression coefficients of two variables
Distribution functions:
- cdfbet — cumulative distribution function Beta distribution
- cdfbin — cumulative distribution function Binomial distribution
- cdfchi — cumulative distribution function chi-square distribution
- cdfchn — cumulative distribution function non-central chi-square distribution
- cdff — cumulative distribution function F distribution
- cdffnc — cumulative distribution function non-central f-distribution
- cdfgam — cumulative distribution function gamma distribution
- cdfnbn — cumulative distribution function negative binomial distribution
- cdfnor — cumulative distribution function normal distribution
- cdfpoi — cumulative distribution function poisson distribution
- cdft — cumulative distribution function Student's T distribution
Random number generators:
- grand
- rand
Scilab provides several probability and statistical features, including the following kinds of distribution functions:
- CDF: cumulative distribution function (e.g. the cdfnor function),
- iCDF: inverse cumulative distribution function (e.g. the cdfnor function, which also computes the inverse),
- RNG: random number generator (the grand function).
The statistics-related bugs are listed in Bugzilla at:
http://bugzilla.scilab.org/buglist.cgi?quicksearch=statistics&list_id=50800
In fact, a detailed analysis shows that the existing features could easily be enhanced on the following points.
- Accuracy of the CDF: currently, there is almost no accuracy test, and there are several accuracy bugs in Scilab. This can be seen first in the current bug reports, for example: http://bugzilla.scilab.org/show_bug.cgi?id=8019, http://bugzilla.scilab.org/show_bug.cgi?id=8030, http://bugzilla.scilab.org/show_bug.cgi?id=8031. But this is only the surface of a much larger set of accuracy problems, some of which have been revealed by the Distfun project. To see this, look at the bug reports at: http://forge.scilab.org/index.php/p/distfun/issues/status/closed/. For example, see: http://forge.scilab.org/index.php/p/distfun/issues/982/, http://forge.scilab.org/index.php/p/distfun/issues/976/, http://forge.scilab.org/index.php/p/distfun/issues/953/. There are many others.
- Creation of accurate PDFs: currently, there is no PDF in Scilab. Experience shows that this task may be challenging, especially if we want sufficiently high robustness and accuracy. PDF functions can be used, for example, to fit a distribution to data with the maximum likelihood method. They are also useful for Bayesian estimation.
- Creation of missing accurate PDF, CDF and inverse CDF: currently, a large number of distribution functions are missing in Scilab, for example the Exponential or Hypergeometric distributions.
- Consistency between the random number generators and the CDF and iCDF. The generators in grand are not consistent with the cdf* functions.
- correl: it is not clear which correlation is calculated. It is not clear what fre should be.
- mvvacov: Wikipedia says that the input must be a vector (http://en.wikipedia.org/wiki/Covariance_matrix), but the example uses a matrix. It is not clear how the variance-covariance matrix is calculated.
- covar: fre is not explained. Does covar calculate a cross-covariance? http://en.wikipedia.org/wiki/Covariance_matrix
- nfreq and tabul: they do almost the same thing.
- ftuneq: the help file is incomplete.
- regress: it only computes x\y. This is not a regression analysis.
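The accuracy point can be illustrated outside Scilab. The following Python sketch uses only the standard library; the function names are hypothetical and are not part of Scilab or distfun. It compares two ways of computing the upper tail Q(x) = 1 - P(x) of the standard normal distribution. For x = 10, subtracting the CDF from 1 returns exactly 0, while a formula based on the complementary error function returns a value close to 7.6e-24.

```python
import math

def normal_cdf(x):
    """Standard normal CDF P(X <= x), computed with erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def upper_tail_naive(x):
    """Upper tail computed as 1 - CDF: cancels catastrophically for large x."""
    return 1.0 - normal_cdf(x)

def upper_tail_accurate(x):
    """Upper tail computed directly with erfc: no subtraction, no cancellation."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

x = 10.0
naive = upper_tail_naive(x)
accurate = upper_tail_accurate(x)
print("naive   :", naive)
print("accurate:", accurate)
```

This kind of comparison is what a systematic accuracy test would check for each distribution, in both tails.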
See the Statistics category in the bug reports for a complete list of the bugs:
http://bugzilla.scilab.org/buglist.cgi?cmdtype=runnamed&namedcmd=Bugs%20Stats&list_id=11752
The bottom line is that distfun outperforms Scilab on the CDF, iCDF (quantile) and RNG functions.
Issues with Stixbox
Here are the functions provided by Stixbox.
- Datasets
- getdata — Returns a dataset.
- Graphics
- stixbox_graphics — Demos of the graphics.
- bubblechart — Plot a bubble chart
- bubblematrix — Plot a bubble chart matrix
- histo — Plot a histogram
- identify — Identify points on a plot with mouse clicks
- plotmatrix — Plot an X vs Y scatter plot matrix
- plotsym — Plot with symbols
- qqnorm — Normal probability paper
- qqplot — Create a QQ-plot
- stairs — Stairstep graph
- Logistic Regression
- lodds — Log odds function
- loddsinv — Inverse of log odds
- logitfit — Fit a logistic regression model
- Miscellaneous
- betaln — Logarithm of beta function
- corrcoef — Correlation coefficient
- cov — Covariance matrix
- ksdensity — Kernel smoothing density estimate
- quantile — Empirical quantile
- Polynomials
- polyfit — Polynomial curve fitting
- polyval — Polynomial evaluation
- Regression
- cmpmod — Compare linear submodel versus larger one
- lsfit — Fit a multiple regression normal model
- lsselect — Select a predictor subset for regression
- regres — Multiple linear regression
- regresprint — Print linear regression
- Resampling Techniques
- stixbox_resamplingT — How to use extra-arguments in T.
- ciboot — Bootstrap confidence intervals
- covboot — Bootstrap estimate of the variance of a parameter estimate.
- covjack — Jackknife estimate of the variance of a parameter estimate.
- rboot — Simulate a bootstrap resample
- stdboot — Bootstrap estimate of the parameter standard deviation.
- stdjack — Jackknife estimate of the standard deviation of a parameter estimate.
- Tests
- ciquant — Nonparametric confidence interval for quantile
- kstwo — Kolmogorov-Smirnov statistic from two samples
- test1b — Bootstrap test that mean equals zero
- test1n — Test that mean equals zero (Normal)
- test1r — Test for median equals 0 using rank test
- test2n — Tests two normal samples with equal variance
- test2r — Test location equality of two samples using rank test
There are many issues with Stixbox.
- Some help pages are almost empty (for example, the cmpmod function: http://forge.scilab.org/index.php/p/stixbox/issues/1082/)
- Some unit tests are missing
- The arguments of some functions are unchecked (e.g. number of input / output arguments, type, size, content).
- The functions are not localized.
- Some functions are duplicated in Scilab.
The Stixbox issues are reported at:
http://forge.scilab.org/index.php/p/stixbox/issues/
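To make the remark on unchecked arguments concrete, here is a minimal Python sketch of the kind of checks we have in mind (type, size, content) before any computation starts. The function empirical_quantile is a hypothetical helper written for this illustration; it is not the Stixbox quantile function, and its interpolation rule is an assumption.

```python
def empirical_quantile(x, p):
    """Empirical quantile of the sample x at probability p (hypothetical helper)."""
    # Type checks
    if not isinstance(x, (list, tuple)):
        raise TypeError("x must be a list or a tuple of numbers")
    if isinstance(p, bool) or not isinstance(p, (int, float)):
        raise TypeError("p must be a real number")
    # Size check
    if len(x) == 0:
        raise ValueError("x must not be empty")
    # Content checks
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in x):
        raise TypeError("x must contain only real numbers")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # Computation: linear interpolation between order statistics
    s = sorted(x)
    h = (len(s) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])
```

With such checks, a wrong call fails immediately with a clear message instead of producing an obscure error deep inside the computation.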
Issues with regtools
Regtools is a toolbox which is packaged on Atoms:
http://atoms.scilab.org/toolboxes/regtools
The regtools module provides the following functions:
- linregr : an interactive user interface for linear regression analysis, including plot facilities and the most relevant statistical information at the solution.
- nlinregr : an interactive user interface for performing non linear (weighted) regression analysis. Also here plot facilities and statistical information are available. Both functions can be called in silent command line mode.
- nlinlsq : non linear (weighted) regression analysis function - called by nlinregr(). nlinlsq() uses the scilab function optim() for solving the regression problem. Supports both analytical and numerical derivatives.
- qqplot - quantile-quantile plots.
A review has been done at:
There are several issues with Regtools.
- The project is not developed on the Forge, so that we cannot report bugs easily.
- On the design, regtools provides the qqplot function, which should be in a statistical graphics toolbox and not in regtools.
- There is no unit test.
- The functions contain the computations, the message prints and the GUI all at once.
- The function names are too short and may conflict with other toolboxes.
- The tests do not use the nistdataset module (http://forge.scilab.org/index.php/p/nistdataset/), while they could.
- The nlinlsq function uses optim, while it should use lsqrsolve (although leastsq may be a good choice too). See Non linear optimization for parameter fitting example for details.
Issues with CASCI
The CASCI toolbox includes various functions for probability & statistics that are used by P. Castagliola's lab at Université de Nantes.
http://atoms.scilab.org/toolboxes/casci
The toolbox is developed on the Forge:
http://forge.scilab.org/index.php/p/casci/source/tree/master/
The CASCI toolbox provides the following functions:
- intbinomial - binomial confidence interval
- intexponential - exponential confidence interval
- intnormalm - normal confidence interval for µ
- intnormals - normal confidence interval for σ
- intpoisson - Poisson confidence interval
- cdfbeta - beta type 1 cdf
- cdfbeta2 - beta type 2 cdf
- cdfbinomial - binomial cdf
- cdfchi2 - χ² (central and non-central) cdf
- cdfcv - sample coefficient of variation cdf
- cdfdphase - discrete Phase-Type cdf
- cdfexponential - exponential cdf
- cdffisher - Fisher (central and non-central) cdf
- cdffoldednormal - folded normal cdf
- cdfgamma - gamma cdf
- cdfgev - generalized Extreme Value cdf
- cdfhypergeometric - hypergeometric cdf
- cdfjohnson - Johnson’s cdf
- cdflognormal - lognormal cdf
- cdfmedian - normal sample median cdf
- cdfnormal - normal cdf
- cdfpareto - Pareto cdf
- cdfpascal - Pascal cdf
- cdfpoisson - Poisson cdf
- cdfrnge - normal range cdf
- cdfstandev - normal sample standard-deviation cdf
- cdfstudent - Student (central and non-central) cdf
- cdfweibull - Weibull cdf
- boxbehnken - Box-Behnken designs
- boxcoxlinear - Box-Cox linearity transformation
- centralcomposite - central composite designs
- coded2natural - coded to natural variables
- doxpand - design expansion
- doxptim - design optimisation
- equiradial - equiradial designs
- factorial2 - two levels full and fractional factorial designs
- mulreg - multilinear regression analysis
- mulregdisp - multilinear regression analysis results display
- mulregplot - multilinear regression analysis results plot
- natural2coded - natural to coded variables
- plackettburman - Plackett-Burman designs
- simpdex - simplex designs
- fitbeta - beta type 1 parameters estimation
- fitgamma - gamma parameters estimation
- fitgev - generalized Extreme Value parameters estimation
- fitjohnson - Johnson parameters estimation
- fitlognormal - lognormal parameters estimation
- fitweibull - Weibull parameters estimation
- idfbeta - beta type 1 idf
- idfbeta2 - beta type 2 idf
- idfchi2 - χ² (central and non-central) idf
- idfcv - sample coefficient of variation idf
- idfexponential - exponential idf
- idffisher - Fisher (central and non-central) idf
- idfgamma - gamma idf
- idfgev - generalized Extreme Value idf
- idfjohnson - Johnson’s idf
- idflognormal - lognormal idf
- idfmedian - normal sample median idf
- idfnormal - normal idf
- idfpareto - Pareto idf
- idfstandev - normal sample standard-deviation idf
- idfstudent - Student (central and non-central) idf
- idfweibull - Weibull idf
- allcombination - matrix element combinations
- allpermutation - matrix element permutations
- arrangement - number Ap of arrangements
- combination - number Cn of combinations
- confhyper - confluent hypergeometric function
- depth - non parametric multivariate depth
- hausdorff - Hausdorff (median) distance between polylines
- lowess - LOcally WEighted Scatterplot Smoothing
- momdphase - first moments of a Discrete Phase-Type distribution
- nearestneighbors - find the k nearest neighbors
- neldermead - Nelder Mead’s downhill simplex nonlinear optimization algorithm
- savitzkygolay - Savitzky-Golay smoothing filter
- simplex - simplex computation
- simplexolve - solve a system of non-linear equations
- torczon - Torczon’s multidirectional nonlinear optimization algorithm
- vandercorput - Van der Corput’s sequence
- boxplot - Box plot
- qplot - quantile plot
- qqplot - quantile-quantile plot
- pdfbeta - beta type 1 pdf
- pdfbeta2 - beta type 2 pdf
- pdfbinomial - binomial pdf
- pdfchi2 - χ² (central and non-central) pdf
- pdfcp - Cp pdf
- pdfcpk - Cpk pdf
- pdfcpm - Cpm pdf
- pdfcpmk - Cpmk pdf
- pdfcpuv - Vännman’s Cp(u, v) pdf
- pdfcv - sample coefficient of variation pdf
- pdfdphase - discrete phase-type pdf
- pdfexponential - exponential pdf
- pdffisher - Fisher (central and non-central) pdf
- pdffoldednormal - folded normal pdf
- pdfgamma - gamma pdf
- pdfgev - generalized Extreme Value pdf
- pdfhypergeometric - hypergeometric pdf
- pdfkernel - kernel smoothed pdf
- pdfjohnson - Johnson’s pdf
- pdflognormal - lognormal pdf
- pdfmedian - normal sample median pdf
- pdfmultinormal - multinormal pdf
- pdfnormal - normal pdf
- pdfpareto - Pareto pdf
- pdfpascal - Pascal pdf
- pdfpoisson - Poisson pdf
- pdfrnge - normal range pdf
- pdfstandev - normal sample standard-deviation pdf
- pdfstudent - Student pdf
- pdfweibull - Weibull pdf
- quadhermite - Gauss-Hermite quadrature
- quadlaguerre - Gauss-Laguerre quadrature
- quadlegendre - Gauss-Legendre quadrature
- quadsimpson - Simpson quadrature
- rndbeta - beta type 1 random number generator
- rndbeta2 - beta type 2 random number generator
- rndbinomial - binomial random number generator
- rndexponential - exponential random number generator
- rndfoldednormal - folded normal random number generator
- rndgamma - gamma random number generator
- rndgev - generalized Extreme Value random number generator
- rndjohnson - Johnson’s random number generator
- rndlognormal - lognormal random number generator
- rndmultinormal - multinormal random number generator
- rndnormal - normal random number generator
- rndpareto - Pareto random number generator
- rndpascal - Pascal random number generator
- rndpoisson - Poisson random number generator
- rndstandev - normal sample standard-deviation random number generator
- rndweibull - Weibull random number generator
- autocorrelation - autocorrelation coefficient
- bootstrap - bootstrap sampling
- correlation - correlation matrix
- crosscorrelation - crosscorrelation coefficient
- kurtosis - kurtosis coefficient
- quantile - quantile
- rnge - range
- skewness - skewness coefficient
- standev - standard deviation
- totalmedian - total median coefficients
- varcovar - variance-covariance matrix
- arlmean - ARL of the mean control chart
- arlmeanRR - ARL of the Run Rules mean control chart
- arlmedian - ARL of the median control chart
- arlmedianRR - ARL of the Run Rules median control chart
- arlrnge - ARL of the range control chart
- arlstandev - ARL of the standard-deviation control chart
- arlstandevRR - ARL of the Run Rules standard-deviation control chart
- cp - capability index Cp estimation and confidence interval
- cpk - capability index Cpk estimation and confidence interval
- krnge - range coefficients KR (n)
- kstandev - standard-deviation coefficients KS (n, r)
- mcpshahriari - Shahriari’s multivariate capability index CP
- mcptaam - Taam’s multivariate capability index CP
- andersondarling - Anderson-Darling’s normality test
- ansaribradley - Ansari-Bradley’s test
- bartlett - Bartlett’s test
- grubbs - Grubbs test
- kendall - Kendall’s test
- levene - Levene’s test
- mardia - Mardia’s test
- spearman - Spearman’s test
- tstbinomial1 - binomial one sample p test
- tstbinomial2 - binomial two samples p test
- tstexponential - exponential λ test
- tstnormalm1 - normal one sample µ test
- tstnormalm2 - normal two samples µ test
- tstnormals1 - normal one sample σ test
- tstnormals2 - normal two samples σ test
- tstsku - normal skewness and kurtosis test
- waldwolfowitz - Wald-Wolfowitz’s run test
- wilcoxon1 - Wilcoxon’s one sample (paired) test
- wilcoxon2 - Wilcoxon’s two samples test
There are several issues with CASCI:
- There is no help page (the help is provided in a PDF, http://forge.scilab.org/index.php/p/casci/source/tree/master/help/cascidoc.pdf, which is non standard).
- There is no unit test of the module. Hence, we cannot check how the functions perform.
- There are several other toolboxes which have the same purposes, and some of them serve these purposes more consistently than CASCI:
Confidence intervals: "intbinomial" can be found in Conint (http://atoms.scilab.org/toolboxes/Conint), but with help and tests.
Distributions: "cdfexponential" can be found in Distfun (http://atoms.scilab.org/toolboxes/distfun), but with help and tests.
Optimization: "neldermead" is already provided by Scilab http://help.scilab.org/fminsearch, but with help, tests and Matlab compatibility.
Permutations/Combinations : "combination" is provided by "specfun_nchoosek" (http://atoms.scilab.org/toolboxes/number), but with help and tests.
Low discrepancy: "vandercorput" is provided by lowdisc (http://atoms.scilab.org/toolboxes/lowdisc), but with help, tests and both compiled (for speed) and macro (for learning) formats.
- The function names are poorly chosen: e.g. "cp"
Other tools
During the 2012 International Open Source Software Contest, a statistical toolbox was created:
- Authors: Wang Shuaili, Wu Wei, Li Weicai, from Ecole Centrale de Pékin, Beihang University
TODO : review this toolbox
Existing statistical tools outside of Scilab
Matlab
Matlab has a solid set of statistical functions.
The following page is the entry point for the statistical toolbox:
http://www.mathworks.fr/fr/help/stats/index.html
The following page presents the list of functions for hypothesis tests:
http://www.mathworks.fr/fr/help/stats/hypothesis-tests-1.html
R
The statistical features of R are huge, so that Scilab will probably never reach that level of specialization. Anyway, this is an excellent reference to look at.
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/00Index.html
The following page presents the list of distributions in R:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html
Octave
The following modules and toolboxes exist for Octave:
- The statistics toolbox from Octave Forge : http://octave.sourceforge.net/statistics/overview.html
Source code is available at http://octave.svn.sourceforge.net/viewvc/octave/trunk/octave-forge/main/statistics/
- The statistics module in Octave itself : http://www.gnu.org/software/octave/doc/interpreter/Statistics.html
Source code: http://hg.savannah.gnu.org/hgweb/octave/file/f34bea431e4f/scripts/statistics
SciPy
Statistical functions (scipy.stats) http://docs.scipy.org/doc/scipy-0.10.0/reference/stats.html.
Tutorial: http://docs.scipy.org/doc/scipy-0.10.0/reference/tutorial/stats.html.
Source code: https://github.com/scipy/scipy/tree/master/scipy/stats
The ideal statistics API
Architecture of the modules
Rather than designing a single module, we may think of several separate modules, with clear and orthogonal goals. This restricts the work, increases the chances of reuse and limits the dependencies. For example, it allows us to separate the graphics issues from the computational issues. Mixing them is one of the main difficulties with several modules (e.g. Metanet), and it has been a strong obstacle to their maintenance over the years.
Here are the modules that we could create.
- scidoe : design of experiments. Only the functions to create the designs should be here. There should be no function to build models (e.g. linear models).
- distfun : distribution functions. Only the functions to manage the PDF, CDF, iCDF, RNG and stats (mean and variance) should be here. There should be no statistics functions in this module. This might be a problem, because testing these functions may require statistics functions, for example the cov function.
- datgra : graphics for data analysis. Only the functions to create the graphics should be here.
- regmod : regression analysis module. Only the functions to produce regression analysis should be here.
Designs of experiments : scidoe
The scidoe toolbox was created for this purpose:
http://forge.scilab.org/index.php/p/scidoe/
The goal of this toolbox is to provide design of experiments techniques, along with functions for model building.
This project is part of the GSOC 2012, managed by Maria Christopoulou.
Here is the list of functions which are available.
Factorial Design
- X = scidoe_fullfact(levels) // A full factorial design
- X = scidoe_ff2n(n) // A two-level full factorial design with n factors
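As an illustration of what these two designs contain, here is a short Python sketch. It is not the scidoe implementation: the function names only mirror the list above, and the coding of the levels (1..n for fullfact, 0 and 1 for ff2n) and the ordering of the rows are assumptions.

```python
import itertools

def fullfact(levels):
    """Full factorial design: one row per combination of factor levels.

    levels[i] is the number of levels of factor i; levels are coded 1..levels[i].
    """
    ranges = [range(1, n + 1) for n in levels]
    return [list(row) for row in itertools.product(*ranges)]

def ff2n(n):
    """Two-level full factorial design with n factors, levels coded 0 and 1."""
    return [list(row) for row in itertools.product((0, 1), repeat=n)]

design = fullfact([2, 3])
print(len(design), "runs")
for run in design:
    print(run)
```

A design with 2 levels for the first factor and 3 levels for the second has 2 x 3 = 6 runs, and ff2n(n) has 2^n runs.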
Response Surface Designs
- X = scidoe_bbdesign(n) // Box-Behnken design
Goals
Here is the list of functions which *will* be created at the end of the project.
Latin Hypercube Designs
- X = scidoe_lhsdesign(n,p) // LHS Design (without improvement).
- X = scidoe_lhsdesign(n,p,'criterion',criterion) // Optimized LHS Design based on a criterion. Criterion can be 'maximin','correlation' or 'centered'.
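The LHS design "without improvement" can be sketched as follows in Python (standard library only). This is a minimal illustration under our own assumptions, not the scidoe code: the name lhsdesign and the rng argument are hypothetical, and the points are drawn in the unit hypercube. Each dimension is cut into n strata of equal width, one point is drawn in each stratum, and the strata are then paired at random across dimensions.

```python
import random

def lhsdesign(n, p, rng=None):
    """Latin hypercube sample with n points in p dimensions, in [0, 1)."""
    rng = rng or random.Random()
    columns = []
    for _ in range(p):
        # One point in each of the n equal-width strata of [0, 1) ...
        column = [(i + rng.random()) / n for i in range(n)]
        # ... then shuffle so that strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    # Transpose: one row per point.
    return [[columns[j][i] for j in range(p)] for i in range(n)]

for point in lhsdesign(5, 2, random.Random(0)):
    print(point)
```

An optimized design (e.g. with the 'maximin' criterion) would typically generate several such candidates and keep the best one according to the criterion.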
Factorial Design
- X = scidoe_fracfact(generators) // A fractional factorial design
- X = scidoe_star(nb_var) // Star Design of Experiments
Response Surface Designs
- X = scidoe_ccdesign(n) // Central composite design
Optimal Designs
- X = scidoe_optdesign(n) // Optimal design (a-optimal)
- X = scidoe_optdesign(n,'criterion',criterion) // Optimal Design based on a criterion. Criterion can be 'a' for A-Optimal, 'd' for D-Optimal, 'g' for G-Optimal and 'o' for O-Optimal Design.
- X = mtlb_doptdesign(n) // Matlab compatible D-optimal Design
Supersaturated Designs
- X = scidoe_comp_ssd(M_doe, model) // Supersaturated Design ('a-optimal').
- X = scidoe_comp_ssd(M_doe, model,'criterion',criterion) // Supersaturated Design based on a criterion. Criterion can be 'a-optimal', 'd-optimal', 'average-khi2', 'maximum-khi2', 'r-value'(correlation coefficient).
Model Building
- X = scidoe_poly_model(mod_type, nb_var, order) // Produces a polynomial model
- X = scidoe_model_select(nb_var, model_old, measures, Log,criterion) // Produces a new polynomial model using forward or backward selection. Default is 'forward'.
- X = scidoe_plot_model(meas_learn, estim_learn, meas_valid, estim_valid) // Plots regression line and residuals distribution
- X = scidoe_build_regression_matrix(H,model,build) // Regression matrix of a model
- X = scidoe_var_regression_matrix(H, x, model, sigma) // Regression matrix of the variance of a model
- X = scidoe_lars(X, y, method, stop, useGram, Gram, _trace) // Least Angle Regression or Lasso Regression
- X = scidoe_rsquared(Y,Y_model) // R2 Computation
General Functions
- X = scidoe_unnorm_doe_matrix(H, min_levels, max_levels) // Adjusts high and low values of a design to specified maximum and minimum values
- scidoe_comp_WD2_crit.sci // Wrap-around L2 discrepancy criterion
- scidoe_comp_CL2_crit(Data).sci // Centered L2 discrepancy criterion
- scidoe_crossvalidate.sci // K-fold cross validation
- scidoe_cvplot.sci // Plots cross validation results
- scidoe_prbs.sci // A pseudo random binary signal generator
- scidoe_merge.sci // Merges two samples
- scidoe_diff.sci // Computes the difference of two samples
- scidoe_scramble.sci // Permutes a sample
- scidoe_standardize.sci // Center and normalize a sample
- scidoe_normalize.sci // Normalises a sample
Distribution functions : distfun
The goal of the distribution function toolbox is to provide the following functions:
- PDF: probability density function,
- CDF: cumulative distribution function,
- iCDF: inverse cumulative distribution function,
- RNG: random number generator.
This is the goal of the distfun project, which is developed on the Forge:
http://forge.scilab.org/index.php/p/distfun/
and available on Atoms:
http://atoms.scilab.org/toolboxes/distfun
This project is part of the GSOC 2012, managed by Prateek Papriwal.
For each distribution x, we provide five functions :
- distfun_xcdf — x CDF
- distfun_xinv — x Inverse CDF
- distfun_xpdf — x PDF
- distfun_xrnd — x random numbers
- distfun_xstat — x mean and variance
Distributions available :
- Beta (with x=beta)
- Exponential (with x=exp)
- Gamma (with x=gam)
- Geometric (with x=geo)
- LogNormal (with x=logn)
- Normal (with x=norm)
- Uniform (with x=unif)
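To show how the five functions of one distribution fit together, here is a Python sketch for the Exponential distribution (standard library only). It is not the distfun code: the names only mirror the naming scheme above, and the parametrization by the mean mu is an assumption. The random number generator is built by inversion of the CDF, so that the RNG and the inverse CDF are consistent by construction.

```python
import math
import random

def exppdf(x, mu):
    """Exponential PDF with mean mu."""
    return math.exp(-x / mu) / mu if x >= 0 else 0.0

def expcdf(x, mu):
    """Exponential CDF; -expm1(-t) is 1 - exp(-t) without cancellation for small t."""
    return -math.expm1(-x / mu) if x >= 0 else 0.0

def expinv(p, mu):
    """Inverse CDF (quantile); -log1p(-p) is -log(1 - p), accurate for small p."""
    return -mu * math.log1p(-p)

def exprnd(mu, rng):
    """Random number by inversion: the quantile of a uniform number in [0, 1)."""
    return expinv(rng.random(), mu)

def expstat(mu):
    """Mean and variance."""
    return mu, mu * mu

mu = 2.0
rng = random.Random(12345)
draws = [exprnd(mu, rng) for _ in range(10000)]
print("round trip :", expcdf(expinv(0.3, mu), mu))
print("sample mean:", sum(draws) / len(draws), "expected:", expstat(mu)[0])
```

The round trip CDF(iCDF(p)) = p and the comparison of the sample moments with the output of the stat function are two simple consistency tests that can be applied to every distribution.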
Support
- distfun_erfcinv — Inverse erfc function
- distfun_getpath — Returns path of current module
Random Number Generator
- rng_overview — An overview of the Random Number Generators of the Distfun toolbox.
- distfun_genget — Get the current random number generator
- distfun_genset — Set the current random number generator
- distfun_seedget — Get the current state of the current random number generator
- distfun_seedset — Set the current state of the current random number generator
- distfun_streamget — Get the current stream
- distfun_streaminit — Initializes the current substream
- distfun_streamset — Set the current stream
Still, the work is not finished and there are many distributions which are still missing in the distfun module.
Datasets
In this section, we present functions to manage datasets.
This section is a *DRAFT*.
- getdata : Famous datasets
Datasets distributed alongside R : rdataset
rdataset is a collection of 597 datasets that were originally distributed alongside the statistical software environment "R" and some of its add-on packages.
Datasets which are available in R can be used in Scilab with rdataset. The toolbox needs around 50 MByte. The datasets can be used to test statistical functions in Scilab and to compare the results with the output of R. As the datasets are included in R, no data has to be loaded manually.
This project is developed on the Forge:
http://forge.scilab.org/index.php/p/rdataset/
- rdataset_read : Reads a dataset from R
For example the dataset survey from the library MASS can be loaded using:
[data,desc] = rdataset_read("MASS","survey");
Statistical visualization : statvis
In this section, we present functions to produce statistical graphics.
This section is a *DRAFT*.
- statvis_identify : Identify points on a plot by clicking with the mouse (draft from Stixbox)
- statvis_plotsym : Plot with symbols (draft from Stixbox)
- statvis_qqnorm : Normal probability paper (draft from Stixbox)
- statvis_qqplot : Plot empirical quantile vs empirical quantile (draft from Stixbox, from Nan-Toolbox)
- statvis_boxplot : Draw a box-and-whiskers plot for data provided as column vectors (draft from Stixbox)
- statvis_cdfplot : plots the empirical cumulative distribution function (draft from Stixbox)
- statvis_normplot : Produce a normal probability plot for each column of X (draft from Stixbox)
- statvis_plotmatrix : Scatter plot matrix - http://www.mathworks.fr/help/techdoc/ref/plotmatrix.html (draft from Stixbox, from Nan-Toolbox)
- statvis_cdfplot : http://www.mathworks.fr/fr/help/stats/cdfplot.html. (draft = nan_cdfplot from Nan-Toolbox)
- statvis_gscatter : http://www.mathworks.fr/fr/help/stats/gscatter.html (draft = nan_gscatter from Nan-Toolbox)
- statvis_boxplot
- statvis_normplot
- statvis_andrewsplot
- statvis_hist : http://www.mathworks.fr/fr/help/matlab/ref/hist.html (draft = histo from Stixbox, and nan_hist from Nan-Toolbox)
- statvis_ecdfhist
- statvis_fscatter3
- statvis_gplotmatrix
- statvis_parallelcoords
- statvis_errorb
- statvis_errorbar
- statvis_nhist
- statvis_bubblechart — Plot a bubble chart
- statvis_bubblematrix — Plot a bubble chart matrix
- statvis_inthisto : Discrete histogram (draft is distfun_inthisto in distfun)
Descriptive statistics
In this section, we present functions to produce descriptive statistics.
This section is a *DRAFT*.
- Location Measures
- mean :
- median :
- rms : root mean square
- Dispersion Measures
- standardDeviation
- meandev : mean deviation
- mediandev : median deviation
- mad : estimates the Mean Absolute deviation
- iqr : calculates the interquartile range
- Shape Measures
- Skewness
- kurtosis : estimates the kurtosis
- moment
- variance
- General Measures
- quantile : Empirical quantile (percentile).
- sem : calculates the standard error of the mean
- Dependence
- cov : Covariance matrix. The Stixbox/cov function does this.
- corrcoef: calculates the correlation matrix from pairwise correlations. The Stixbox/corrcoef function does this. (gives back the confidence interval of estimated parameters and the R^2, F and p values)
- fss : feature subset selection and feature ranking
- parcoef : partial correlation
- Resampling
- ciboot : Various bootstrap confidence intervals
- covboot : Bootstrap estimate of the variance of a parameter estimate
- covjack : Jackknife estimate of the variance of a parameter estimate
- rboot : Simulate a bootstrap resample from a sample
- stdboot : Bootstrap estimate of the parameter standard deviation
- stdjack : Jackknife estimate of the standard deviation of a parameter estimate
- Miscellaneous
- zscore : removes the mean and normalizes the data to a variance of 1
- zScoreMedian : removes the median and standardizes by the 1.483*median absolute deviation
- kappa : estimates Cohen's kappa coefficient and related statistics
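The resampling functions listed above (stdboot and stdjack) can be sketched as follows in Python (standard library only). This is an illustration under our own assumptions, not the Stixbox code: the signatures, the default number of resamples and the rng argument are hypothetical. The estimator is any function which maps a sample to a number.

```python
import random
import statistics

def stdboot(x, estimator, n_resamples=200, rng=None):
    """Bootstrap estimate of the standard deviation of estimator(x)."""
    rng = rng or random.Random()
    n = len(x)
    estimates = []
    for _ in range(n_resamples):
        # Resample n values from x, with replacement.
        resample = [x[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

def stdjack(x, estimator):
    """Jackknife estimate of the standard deviation of estimator(x)."""
    n = len(x)
    # Leave-one-out estimates.
    estimates = [estimator(x[:i] + x[i + 1:]) for i in range(n)]
    center = sum(estimates) / n
    return ((n - 1) / n * sum((e - center) ** 2 for e in estimates)) ** 0.5

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print("jackknife:", stdjack(x, statistics.mean))
print("bootstrap:", stdboot(x, statistics.mean, rng=random.Random(1)))
```

For the mean, the jackknife estimate is equal to the classical standard error s/sqrt(n), which gives a simple unit test for such a function.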
Regression analysis : regmod
- regmod_regress : Multiple Linear Regression (http://www.mathworks.fr/help/toolbox/stats/regress.html). The scidoe_regress function does this: http://forge.scilab.org/index.php/p/scidoe/source/tree/master/macros/scidoe_regress.sci. All we have to do is to rename the function.
- regmod_linreg : Linear or polynomial regression
- regmod_simplelinreg : univariate linear regression
- regmod_multilinreg : multivariate linear regression
- regmod_yates : Yates Algorithm
- regmod_ryates : Inverse Yates Algorithm
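As a sketch of what a univariate linear regression function such as regmod_simplelinreg could return, here is a small Python example (standard library only). The signature and the outputs are assumptions: the point is that a regression analysis reports at least a measure of the quality of the fit (here R^2) in addition to the coefficients.

```python
def simplelinreg(x, y):
    """Least squares fit of y = intercept + slope * x, with the R^2 coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return intercept, slope, r_squared

intercept, slope, r_squared = simplelinreg([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope, r_squared)
```

A complete function would also return confidence intervals for the coefficients and the F and p values, which require the Student and F distribution functions.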
Hypothesis testing : hypt
In this section, we present functions which compute tests, confidence intervals and model estimation.
One of the principles of this toolbox is to be compatible with MATLAB, especially the Hypothesis Testing sub-module:
http://www.mathworks.fr/fr/help/stats/hypothesis-tests-1.html
This section is a *DRAFT*.
High Priority List:
- hypt_barttest : Bartlett test for homogeneity of variances (http://en.wikipedia.org/wiki/Bartlett_test). MATLAB Compatible. Source of implementation : CASCI ?
- hypt_crosstab : Cross-tabulation. MATLAB Compatible. Source of implementation : TODO ?
- hypt_chi2gof : Chi-squared goodness-of-fit test (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). MATLAB Compatible. Source of implementation : TODO ?
- hypt_kstest : One-sample Kolmogorov-Smirnov test of the null hypothesis that the sample x comes from a given distribution (http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test). MATLAB Compatible. Source of implementation : TODO ?
- hypt_kstest2 : Kolmogorov-Smirnov statistic from two samples (http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test). MATLAB Compatible. Source of implementation : kstwo in Stixbox ?
- hypt_ranksum : Wilcoxon rank sum test. MATLAB Compatible. Source of implementation : TODO ?
- hypt_signrank : Wilcoxon signed rank test (http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test). MATLAB Compatible. Source of implementation : TODO ?
- hypt_signtest : Sign test (http://en.wikipedia.org/wiki/Sign_test). MATLAB Compatible. Source of implementation : TODO ?
- hypt_ttest : One-sample and paired-sample t-test (http://en.wikipedia.org/wiki/Student%27s_t-test). MATLAB Compatible. Source of implementation : TODO ?
- hypt_ttest2 : Two-sample t-test (http://en.wikipedia.org/wiki/Student%27s_t-test). MATLAB Compatible. Source of implementation : TODO ?
- hypt_vartest : Chi-square variance test (http://en.wikipedia.org/wiki/Chi-squared_test). MATLAB Compatible. Source of implementation : TODO ?
- hypt_vartest2 : Two-sample F-test for equal variances (http://en.wikipedia.org/wiki/F-test_of_equality_of_variances). MATLAB Compatible. Source of implementation : TODO ?
- hypt_ztest : z-test (test for the mean of a normal sample with known variance) (http://en.wikipedia.org/wiki/Z-test). MATLAB Compatible. Source of implementation : TODO ?
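The z-test is the simplest test of this list, and it shows how a hypothesis test relies on the distribution functions. The following Python sketch (standard library only) computes the statistic and the two-sided p-value. The signature and the order of the outputs are assumptions made for this illustration; they do not reproduce the MATLAB ztest interface.

```python
import math

def ztest(x, mu0, sigma):
    """Two-sided z-test of the hypothesis mean = mu0, with known standard deviation sigma."""
    n = len(x)
    mean = sum(x) / n
    z = (mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)),
    # computed with erfc to stay accurate for large |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

z, p_value = ztest([0.98, 0.98, 0.98, 0.98], 0.0, 1.0)
print("z =", z, "p =", p_value)
```

The other tests of the list follow the same pattern, with the Student, chi-square or F distribution in place of the normal distribution, which is why the hypothesis testing module depends on accurate distribution functions.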
Low Priority List:
- hypt_ciquant : Nonparametric confidence interval for quantile
- hypt_cmpmod : Compare linear submodel versus larger one
- hypt_test1b : Bootstrap t test and confidence interval for the mean
- hypt_test1n : Tests and confidence intervals based on a normal sample
- hypt_test1r : Test for median equals 0 using rank test
- hypt_test2n : Tests and confidence intervals based on two normal samples with common variance
- hypt_welchtest : Welch two-sample t test (http://en.wikipedia.org/wiki/Welch%27s_t_test)
- hypt_kruskalwallis : Perform a Kruskal-Wallis one-factor "analysis of variance" (http://en.wikipedia.org/wiki/Kruskal-Wallis)
- hypt_utest (the same as test2r) : Perform a Mann-Whitney U-test (http://en.wikipedia.org/wiki/Mann-Whitney-Wilcoxon_test)
- hypt_mcnemartest : McNemar's test for symmetry (http://en.wikipedia.org/wiki/McNemar%27s_test)
Excluded:
The following functions will not be part of the hypt toolbox:
- lsfit : Fit a multiple regression normal model. This function is more appropriate in the "regression" toolbox.
- lsselect : Select a predictor subset for regression. This function is more appropriate in the "regression" toolbox.
ANOVA module : anova
- anova : Perform a one-way analysis of variance (ANOVA) (http://en.wikipedia.org/wiki/Anova)
- manova : Perform a one-way multivariate analysis of variance (MANOVA) (http://en.wikipedia.org/wiki/MANOVA)
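The core computation of the one-way ANOVA can be sketched as follows in Python (standard library only). The function name and its outputs are assumptions made for this illustration. The sketch returns the F statistic and its degrees of freedom; the p-value would then be computed from the upper tail of the F distribution, which is not available in the Python standard library and is therefore left out.

```python
def anova_oneway(groups):
    """F statistic of the one-way ANOVA, with its degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variability between the group means and within each group.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_between, df_within = anova_oneway(
    [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
)
print(f_stat, df_between, df_within)
```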
TODO
Design:
- Find a toolbox name for "Descriptive statistics"
- Find a toolbox name for "Hypothesis testing" (maybe hypt?)
- Rename the function to match Matlab
- Insert a prefix to avoid name conflicts
- Add missing functions
- Indicate potential implementations
Development:
- Create the associated missing projects on the Forge
- Write unit tests (R could be used as a reference)
- Create demos
- Create help pages for each function
Authors
- 2012-2013, Michaël Baudin
- 2012, Holger Nahrstaedt