Linear algebra performance

Abstract

This page presents the performance of various Scilab scripts involving linear algebra. We use Mflops (millions of floating point operations per second) as the measure of performance of the linear algebra routines used in Scilab. We consider two benchmarks:

the dense, real, matrix-matrix multiply,
the solution of dense, real linear systems of equations.

See "Programming in Scilab" [1] for more details on this topic.


Introduction

To get better performance, users may install ATLAS or the Intel MKL inside Scilab (see [1] for details).

In all cases, comparing results requires knowing the following parameters:

the version of Scilab,
the version of the operating system,
the parameters of the CPU (and, if possible, the amount of physical memory),
the linear algebra library.
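
The first two parameters can be read from within Scilab itself. The following is a minimal sketch, assuming the getversion and getos functions are available in your version of Scilab; the variable names are illustrative. The CPU, the physical memory and the linear algebra library are not queried here and have to be filled in by hand.

```scilab
// Collect the Scilab and OS versions to report next to a benchmark result.
// The variable names v, os and osver are illustrative.
v = getversion();        // version string of the running Scilab
[os, osver] = getos();   // operating system name and version
mprintf("Scilab: %s\n", v);
mprintf("OS: %s %s\n", os, osver);
```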

There are (at least) three linear algebra libraries for the benchmarks presented here:

Reference BLAS,
ATLAS,
the Intel MKL.

By default, Scilab uses the Intel MKL on Windows and Reference BLAS on Linux (see [1] for details).

The size n of the matrix is a parameter which can be changed to get higher performance. The run time should be kept in a reasonable range, say from 1 second to 10 seconds. To find the value of n at which your machine reaches its best performance, run the two attached scripts, bench_matmul.sce and bench_backslash.sce.

In the Scilab console, we launch the script, which loops over the size of the matrix. The following is the output of a typical session. Each run reports the size n, the time T in seconds, and the Mflops.

-->exec C:\Users\baudin\Desktop\bench_matmul.sce;
Memory: 1085 (MB)
Maximum n: 11646
Run #1: n=  1107, T=0.187 (s), Mflops= 14508
Run #2: n=  1329, T=0.249 (s), Mflops= 18854
Run #3: n=  1595, T=0.811 (s), Mflops= 10006
Run #4: n=  1914, T=0.645 (s), Mflops= 21741
Run #5: n=  2297, T=1.157 (s), Mflops= 20949
Run #6: n=  2757, T=1.929 (s), Mflops= 21727
Run #7: n=  3309, T=3.323 (s), Mflops= 21806
Run #8: n=  3971, T=4.680 (s), Mflops= 26759
Run #9: n=  4766, T=7.878 (s), Mflops= 27483
Best performance: N=4766, T=7.878 (s), MFLOPS=27483

We see that the performance generally increases with the size of the matrix, although not monotonically (see run #3). We retain the run with the largest Mflops as the best performance.
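
The loop over the matrix size can be sketched as follows. This is not the content of bench_matmul.sce: the starting size n0, the growth factor nfact and the time limit tmax are illustrative assumptions, and the stopping rule (stop after the first run longer than tmax) is only one possible choice. The call to stacksize follows the Scilab 5.x scripts on this page.

```scilab
// Sketch of a size sweep: grow n geometrically until one product
// takes longer than tmax seconds, and keep the run with the most Mflops.
// n0, nfact and tmax are illustrative, not taken from bench_matmul.sce.
stacksize("max");   // Scilab 5.x: allow large matrices
rand("normal");
n0 = 200;           // starting matrix size
nfact = 1.2;        // growth factor between two runs
tmax = 2;           // time limit for a single product, in seconds
n = n0;
k = 0;
best = [0 0 0];     // [n t mflops] of the best run so far
while %t
    k = k + 1;
    A = rand(n, n);
    B = rand(n, n);
    tic();
    C = A * B;
    t = max(toc(), %eps);        // guard against a zero elapsed time
    mflops = 2 * n^3 / t / 1.e6;
    mprintf("Run #%d: n=%6d, T=%.3f (s), Mflops=%6d\n", k, n, t, round(mflops));
    if mflops > best(3) then
        best = [n t mflops];
    end
    if t > tmax then
        break
    end
    n = ceil(nfact * n);
end
mprintf("Best performance: N=%d, T=%.3f (s), MFLOPS=%d\n", best(1), best(2), round(best(3)));
```

A geometric growth of n keeps the number of runs small while still covering sizes from a few hundred to several thousand.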

Matrix-Matrix Product

This benchmark computes the product of two square, real, dense matrices of doubles.

The script

The following is a short benchmark.

stacksize("max");
s = stacksize();
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" );
n = 1000;
A = rand(n,n);
B = rand(n,n);
tic();
C = A * B;
t = toc();
mflops = round(2*n^3/t/1.e6);
disp([n t mflops])

A more complete benchmark is available in bench_matmul.sce or [3].
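
The factor 2*n^3 in the script is the usual operation count for a dense product. As a short derivation, each of the n^2 entries of C requires n multiplications and n-1 additions:

```latex
c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
\quad\Rightarrow\quad
\text{flops} = n^2 \bigl( n + (n-1) \bigr) = 2n^3 - n^2 \approx 2n^3,
\qquad
\text{Mflops} = \frac{2n^3}{t \cdot 10^6}.
```

For n = 1000 this gives about 2 x 10^9 floating point operations, so a hypothetical run time of t = 0.1 s would correspond to 20000 Mflops.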

The results

Scilab	OS	CPU	Physical Memory	Library	n	Time (s)	MFLOPS
scilab-5.4.1	Windows Vista Business 32 bits	Intel Xeon 8*2.93GHz	24 GB	Intel MKL	3971	1.794	69808
scilab-5.3.0-beta-4-x64	Windows Seven Ultimate 64 bits	Intel Xeon X5570 16*2.93GHz	4 GB	Intel MKL	3309	1.248	58063
scilab-5.3.0-beta-4	Windows Vista Ultimate 32 bits	Intel Xeon E5410 4*2.33 GHz	4 GB	Intel MKL	4766	8.172	26494
scilab-5.2.2-x64	Windows Seven Ultimate 64 bits	Intel Core 2 6600 4*2.4 GHz	8 GB	Intel MKL	3971	4.727	26493
scilab-5.3.0-beta-4	Debian GNU/Linux 32 bits	Intel Core2 4*2.66 GHz	4 GB	ATLAS 32 bits tuned (sse&mt)	4766	8.073	26819
scilab-5.3.3	Windows 7 Prof. 32 bits	Intel i5 2520M 4*2.5GHz	4 GB	Intel MKL	3971	6.656	18815
scilab-5.3.3 x64	Windows 7 64 bits	Intel Pent. P6200 2*2.13GHz	4 GB	Intel MKL	3309	7.928	9140
scilab-5.3.0-beta-4	Debian GNU/Linux 32 bits	Intel Core2 4*2.66 GHz	4 GB	AMD ACML 4.3.0	3309	8.694	8334
scilab-5.4.1	Windows 7 Prof. 32 bits	Intel Celeron T3100 2*1.90GHz	4 GB	Intel MKL	3309	10.199	7104
scilab-5.3.0-beta-4	Fedora Linux 13 64 bits	Intel Core2 6600 2*2.4 GHz	4 GB	ATLAS 64 bits sse2 (tuned)	2757	10.140	4133
scilab-5.3.0-beta-4	Fedora Linux 13 64 bits	Intel Core2 6600 2*2.4 GHz	4 GB	ATLAS 64 bits sse2	2297	5.897	4110
scilab-5.3.2	Windows Seven Ultimate 64 bits	AMD Fusion E-350 1.6 GHz	8 GB	Intel MKL	1914	5.504	2547
scilab-5.3.0-beta-4	Windows Vista Ultimate 32 bits	Intel Xeon E5410 4*2.33 GHz	4 GB	ATLAS	1595	3.698	2194
scilab-5.3.0-beta-4	Fedora Linux 13 64 bits	Intel Core2 6600 2*2.4 GHz	4 GB	Ref. BLAS 64 bits	533	0.162	1869
scilab-5.3.0-beta-4	Windows Vista Ultimate 32 bits	Intel Xeon E5410 4*2.33 GHz	4 GB	Ref. BLAS	444	0.125	1400
scilab-5.3.0-beta-4	Debian GNU/Linux 32 bits	Intel Core2 4*2.66 GHz	4 GB	Ref. BLAS	444	0.129	1357
scilab-5.3.3	Windows 7 64 bits	Intel Pent. P6200 2*2.13GHz	4 GB	Intel MKL	1914	13.187	1063
scilab-5.3.0-beta-4	Windows XP 32 bits	AMD Athlon 3200+ 2 GHz	1 GB	ATLAS	1500	?	~2300
scilab-5.3.0-beta-4	Windows XP 32 bits	AMD Athlon 3200+ 2 GHz	1 GB	Intel MKL	1500	?	~2300
scilab-5.3.0-beta-4	Windows XP 32 bits	AMD Athlon 3200+ 2 GHz	1 GB	Ref. BLAS	1000	?	~500

Some comments

The Intel MKL and ATLAS libraries improve performance over the Ref. BLAS. For example, in the following experiment the performance ratio is about x5 (roughly 2300 versus 500 Mflops) on a single-core processor.

Scilab	OS	CPU	Physical Memory	Library	n	Time (s)	MFLOPS
scilab-5.3.0-beta-4	Windows XP 32 bits	AMD Athlon 3200+ 2 GHz	1 GB	ATLAS	1500	?	~2300
scilab-5.3.0-beta-4	Windows XP 32 bits	AMD Athlon 3200+ 2 GHz	1 GB	Intel MKL	1500	?	~2300
scilab-5.3.0-beta-4	Windows XP 32 bits	AMD Athlon 3200+ 2 GHz	1 GB	Ref. BLAS	1000	?	~500

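On a 64 bits system, the 64 bits Scilab improves performance over the 32 bits Scilab (the build without the x64 suffix). For example, in the following experiment, with the same machine and the same library, the performance ratio is about x9 (9140 versus 1063 Mflops) on a dual-core processor.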

Scilab	OS	CPU	Physical Memory	Library	n	Time (s)	MFLOPS
scilab-5.3.3 x64	Windows 7 64 bits	Intel Pent. P6200 2*2.13GHz	4 GB	Intel MKL	3309	7.928	9140
scilab-5.3.3	Windows 7 64 bits	Intel Pent. P6200 2*2.13GHz	4 GB	Intel MKL	1914	13.187	1063

Backslash

This benchmark computes the solution of a dense, real linear system of equations with the backslash operator. This is often called the "LINPACK" benchmark [2], but Scilab uses LAPACK.

The script

stacksize("max");
s = stacksize();
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" );
n = 1000;
A = rand(n,n);
b = rand(n,1);
tic();
x = A\b;
t = toc();
mflops = round((2/3*n^3 + 2*n^2)/t/1.e6);
disp([n t mflops])

A more complete benchmark is available in bench_backslash.sce or [4].
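
The operation count used in the script can be read as follows, assuming that the backslash operator solves the square system by Gaussian elimination (LU factorization with pivoting) followed by two triangular solves, and keeping only the leading terms:

```latex
\underbrace{\tfrac{2}{3} n^3}_{\text{LU factorization}}
+ \underbrace{n^2}_{\text{forward substitution}}
+ \underbrace{n^2}_{\text{back substitution}}
= \tfrac{2}{3} n^3 + 2n^2,
\qquad
\text{Mflops} = \frac{\tfrac{2}{3} n^3 + 2n^2}{t \cdot 10^6}.
```

For n = 1000 this gives about 6.69 x 10^8 floating point operations, roughly a third of the count for the matrix-matrix product of the same size.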

The results

Scilab	OS	CPU	Physical Memory	Library	n	Time (s)	MFLOPS
scilab-5.2.2-x64	Windows Seven Ultimate 64 bits	Intel Core2 6600 4*2.4 GHz	8 GB	Intel MKL	6864	9.655	22339
scilab-5.3.0-beta-4	Windows Vista Ultimate 32 bits	Intel Xeon E5410 4*2.33 GHz	4 GB	Intel MKL	5720	6.376	19578
scilab-5.3.0-beta-4	Debian GNU/Linux 32 bits	Intel Core2 4*2.66 GHz	4 GB	ATLAS 32 bits tuned (sse&mt)	6864	11.304	19080
scilab-5.3.0-beta-4	Debian GNU/Linux 32 bits	Intel Core2 4*2.66 GHz	4 GB	AMD ACML 4.3.0	3971	5.498	7598
scilab-5.3.0-beta-4	Fedora Linux 13 64 bits	Intel Core2 6600 2*2.4 GHz	4 GB	ATLAS 64 bits sse2 (tuned)	2757	10.140	4133
scilab-5.3.2	Windows Seven Ultimate 64 bits	AMD Fusion E-350 1.6 GHz	8 GB	Intel MKL	3309	10.802	2238
scilab-5.3.0-beta-4	Fedora Linux 13 64 bits	Intel Core2 6600 2*2.4 GHz	4 GB	Ref. BLAS 64 bits	1914	2.570	1821
scilab-5.3.0-beta-4	Windows Vista Ultimate 32 bits	Intel Xeon E5410 4*2.33 GHz	4 GB	Ref. BLAS	2757	10.514	1330
scilab-5.3.0-beta-4	Windows Vista Ultimate 32 bits	Intel Xeon E5410 4*2.33 GHz	4 GB	ATLAS	3309	12.074	2002
scilab-5.3.0-beta-4	Debian GNU/Linux 32 bits	Intel Core2 4*2.66 GHz	4 GB	Ref. BLAS	1914	3.29	1422
scilab-5.3.0-beta-4	Linux Ubuntu 32 bits	Intel Pentium M 2 GHz	1 GB	Ref. BLAS	1000	?	~700
scilab-5.3.0-beta-4	Linux Ubuntu 32 bits	Intel Pentium M 2 GHz	1 GB	ATLAS	3000	?	~1400

Notes

The backslash operator may use several cores of the machine, depending on the configuration of Scilab.
Both benchmarks may fail if the maximum stack size is reached.
The timer function should not be used, because it measures the CPU time and not the elapsed time. On multi-core machines, the CPU time measured by the timer function is the sum of the times of all cores. This is why the tic()/toc() functions should be used instead (see bug #8276: http://bugzilla.scilab.org/show_bug.cgi?id=8276 for the lack of documentation of this point in the help page of timer).
For large matrices, the backslash test may fail, because the backslash operator switches to a least squares algorithm instead of continuing with Gaussian elimination. This is bug #7497: http://bugzilla.scilab.org/show_bug.cgi?id=7497
See a message on this topic: http://lists.scilab.org/cgi-bin/ezmlm-browse?list=dev&cmd=showmsg&msgnum=1849
We have packaged these benchmarks into an ATOMS module:

atomsInstall("scibench")
atomsLoad("scibench")

To run the matmul benchmark:

lines(0);
stacksize("max");
scf();
perftable = scibench_matmul ( %t , %t , 0.1 , 8 , 1.2 )

To run the backslash benchmark:

lines(0);
stacksize("max");
scf();
perftable = scibench_backslash ( %t , %t , 0.1 , 8 , 1.2 )
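
To illustrate the note above about timer versus tic()/toc(), the following sketch times the same matrix-matrix product both ways. The variable names tcpu and twall are illustrative. On a multi-core machine with a multi-threaded library, the CPU time may be noticeably larger than the elapsed time, in which case Mflops computed from it would be underestimated.

```scilab
// Compare CPU time (timer) and elapsed time (tic/toc) on one product.
// tcpu and twall are illustrative names.
stacksize("max");   // Scilab 5.x: allow large matrices
rand("normal");
n = 1000;
A = rand(n, n);
B = rand(n, n);
timer();            // reset the CPU time counter
tic();              // start the elapsed time counter
C = A * B;
tcpu = timer();     // CPU time since the previous call to timer()
twall = toc();      // elapsed time since tic()
mprintf("CPU time: %.3f (s), elapsed time: %.3f (s)\n", tcpu, twall);
mprintf("Mflops from elapsed time: %d\n", round(2 * n^3 / twall / 1.e6));
```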

References

[1] "Programming in Scilab", Michael Baudin, 2010

[2] "Benchmarks: LINPACK and MATLAB - Fame and fortune from megaflops", Cleve Moler, 1994

[3] Benchmarking matrix-matrix product, Michael Baudin, 2010, (bench_matmul.sce)

[4] Benchmarking backslash, Michael Baudin, 2010, (bench_backslash.sce)

[5] Benchmark programs and reports, http://www.netlib.org/benchmark/

[6] Automatically tuned linear algebra software, R. Clint Whaley and Jack J. Dongarra. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1-27, Washington, DC, USA, 1998. IEEE Computer Society.

[7] Automated empirical optimizations of software and the ATLAS project, R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra, 2000