Linear algebra performances

Abstract

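This page presents the performance of Scilab scripts that involve linear algebra. We use Mflops (millions of floating-point operations per second) as the measure of performance of the linear algebra routines used in Scilab. We consider two benchmarks: the matrix-matrix product and the backslash operator, which solves a linear system of equations.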

See "Programming in Scilab" [1] for more details on this topic.

Introduction

To get better performance, users may install ATLAS or the Intel MKL inside Scilab (see [1] for details).

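In all cases, comparing performance results requires knowing the parameters recorded in the result tables on this page: the Scilab version, the operating system, the CPU, the physical memory, the linear algebra library and the matrix size n.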

At least three linear algebra libraries can be used for the benchmarks presented here: the Reference BLAS, ATLAS and the Intel MKL.

By default, Scilab uses the Intel MKL on Windows and the Reference BLAS on Linux (see [1] for details).

The size n of the matrix is a parameter that can be changed to reach higher performance. The time should be kept in a reasonable range, say from 1 second to 10 seconds. To find the value of n at which your machine reaches its best performance, run the two attached scripts, bench_matmul.sce [3] and bench_backslash.sce [4].

In the Scilab console, we launch the script, which loops over the size of the matrix. The following session is typical. Each run reports the size n, the time in seconds and the Mflops.

-->exec C:\Users\baudin\Desktop\bench_matmul.sce;
Memory: 1085 (MB)
Maximum n: 11646
Run #1: n=  1107, T=0.187 (s), Mflops= 14508
Run #2: n=  1329, T=0.249 (s), Mflops= 18854
Run #3: n=  1595, T=0.811 (s), Mflops= 10006
Run #4: n=  1914, T=0.645 (s), Mflops= 21741
Run #5: n=  2297, T=1.157 (s), Mflops= 20949
Run #6: n=  2757, T=1.929 (s), Mflops= 21727
Run #7: n=  3309, T=3.323 (s), Mflops= 21806
Run #8: n=  3971, T=4.680 (s), Mflops= 26759
Run #9: n=  4766, T=7.878 (s), Mflops= 27483
Best performance: N=4766, T=7.878 (s), MFLOPS=27483

The performance broadly increases with the size of the matrix, although not monotonically (run #3 is slower than run #2, for example). We keep the best performance, that is, the run with the largest Mflops.
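
The loop over n can be sketched as follows. This is only an illustration and not the content of bench_matmul.sce: the function name bench_matmul_sketch, its arguments (n0, factor, tmax, nmax) and the stopping rule are assumptions made for this example.

```scilab
// Hypothetical sketch: grow n geometrically until one product takes longer
// than tmax seconds (or n exceeds nmax), and keep the run with the largest Mflops.
function [nbest, tbest, mbest] = bench_matmul_sketch(n0, factor, tmax, nmax)
    n = n0;
    nbest = n0;
    tbest = 0;
    mbest = 0;
    t = 0;
    k = 0;
    while t < tmax & n <= nmax
        k = k + 1;
        A = rand(n, n);
        B = rand(n, n);
        tic();
        C = A * B;
        t = toc();
        t = max(t, 1.e-6);   // guard against a zero timing on tiny matrices
        mflops = round(2 * n^3 / t / 1.e6);
        mprintf("Run #%d: n=%6d, T=%.3f (s), Mflops=%6d\n", k, n, t, mflops);
        if mflops > mbest then
            nbest = n;
            tbest = t;
            mbest = mflops;
        end
        n = ceil(factor * n);
    end
endfunction

// Assumed settings: start at n=100, grow by 20% per run,
// stop after a run of 1 second or at n=1000.
[nbest, tbest, mbest] = bench_matmul_sketch(100, 1.2, 1, 1000);
mprintf("Best performance: N=%d, T=%.3f (s), MFLOPS=%d\n", nbest, tbest, mbest);
```

The cap nmax keeps the three n-by-n matrices within the available memory. Raise it (and tmax) on a machine with more memory to reach the 1 to 10 second range suggested above.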

Matrix-Matrix Product

This benchmark computes the product of two square, real, dense matrices of doubles.
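
The Mflops figures on this page rely on a flop count of about 2*n^3 for this product. The sketch below shows where that number comes from, assuming the textbook algorithm in which each of the n^2 entries of the result costs n multiplications and n-1 additions. The function names matmul_flops_exact and matmul_mflops are hypothetical helpers introduced for this example. Optimized libraries may organize the computation differently, but the same conventional count is used to report Mflops.

```scilab
// Exact flop count of the textbook product of two n-by-n matrices:
// n^2 entries, each needing n multiplications and n-1 additions.
function f = matmul_flops_exact(n)
    f = n^2 * (2*n - 1);
endfunction

// Mflops from the leading-order count 2*n^3 and a time t in seconds.
function m = matmul_mflops(n, t)
    m = round(2 * n^3 / t / 1.e6);
endfunction

n = 1000;
relerr = abs(2*n^3 - matmul_flops_exact(n)) / matmul_flops_exact(n);
mprintf("Relative error of the 2*n^3 count at n=%d: %e\n", n, relerr);
mprintf("Mflops for n=%d and t=1 s: %d\n", n, matmul_mflops(n, 1));
```

For the sizes used in the benchmark the difference between the exact count and 2*n^3 is negligible, which is why the simpler expression is used.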

The script

The following is a short benchmark.

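stacksize("max"); // Use the largest available stack (Scilab 5)
s = stacksize(); // s(1) is the total stack size, in doubles
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" ); // Draw the entries from the normal distribution
n = 1000;
A = rand(n,n);
B = rand(n,n);
tic();
C = A * B;
t = toc();
mflops = round(2*n^3/t/1.e6); // The product costs about 2*n^3 flops
disp([n t mflops])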

A more complete benchmark is available in bench_matmul.sce [3].
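
The two size computations at the top of the script can be checked by hand. The sketch below uses a hypothetical helper, max_square_n, that takes the number of available doubles as an argument instead of calling stacksize, which is tied to the stack-based memory model of Scilab 5. The value of ndoubles is an assumption chosen so that the result matches the "Memory: 1085 (MB)" and "Maximum n: 11646" lines of the session shown earlier. The actual stack size of that session is not known.

```scilab
// Largest n such that nmat dense n-by-n matrices of doubles fit in ndoubles doubles.
function n = max_square_n(ndoubles, nmat)
    n = floor(sqrt(ndoubles / nmat));
endfunction

ndoubles = 135629316;                 // assumed value, equal to 11646^2
memMB = round(ndoubles * 8 / 10^6);   // 8 bytes per double
n1 = max_square_n(ndoubles, 1);       // one matrix only
n3 = max_square_n(ndoubles, 3);       // A, B and C = A*B together
mprintf("Memory: %d (MB)\n", memMB);
mprintf("Maximum n for one matrix: %d\n", n1);
mprintf("Maximum n for three matrices: %d\n", n3);
```

The product needs A, B and C in memory at the same time, so the usable size is smaller than the single-matrix bound printed by the script.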

The results

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.4.1 | Windows Vista Business 32 bits | Intel Xeon 8*2.93 GHz | 24 GB | Intel MKL | 3971 | 1.794 | 69808 |
| scilab-5.3.0-beta-4-x64 | Windows Seven Ultimate 64 bits | Intel Xeon X5570 16*2.93 GHz | 4 GB | Intel MKL | 3309 | 1.248 | 58063 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Intel MKL | 4766 | 8.172 | 26494 |
| scilab-5.2.2-x64 | Windows Seven Ultimate 64 bits | Intel Core 2 6600 4*2.4 GHz | 8 GB | Intel MKL | 3971 | 4.727 | 26493 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | ATLAS 32 bits tuned (sse&mt) | 4766 | 8.073 | 26819 |
| scilab-5.3.3 | Windows 7 Prof. 32 bits | Intel i5 2520M 4*2.5 GHz | 4 GB | Intel MKL | 3971 | 6.656 | 18815 |
| scilab-5.3.3 x64 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 3309 | 7.928 | 9140 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | AMD ACML 4.3.0 | 3309 | 8.694 | 8334 |
| scilab-5.4.1 | Windows 7 Prof. 32 bits | Intel Celeron T3100 2*1.90 GHz | 4 GB | Intel MKL | 3309 | 10.199 | 7104 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 (tuned) | 2757 | 10.140 | 4133 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 | 2297 | 5.897 | 4110 |
| scilab-5.3.2 | Windows Seven Ultimate 64 bits | AMD Fusion E-350 1.6 GHz | 8 GB | Intel MKL | 1914 | 5.504 | 2547 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | ATLAS | 1595 | 3.698 | 2194 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | Ref. BLAS 64 bits | 533 | 0.162 | 1869 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Ref. BLAS | 444 | 0.125 | 1400 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | Ref. BLAS | 444 | 0.129 | 1357 |
| scilab-5.3.3 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 1914 | 13.187 | 1063 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | ATLAS | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Intel MKL | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~500 |

Some comments

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | ATLAS | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Intel MKL | 1500 | ? | ~2300 |
| scilab-5.3.0-beta-4 | Windows XP 32 bits | AMD Athlon 3200+ 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~500 |

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.3.3 x64 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 3309 | 7.928 | 9140 |
| scilab-5.3.3 | Windows 7 64 bits | Intel Pent. P6200 2*2.13 GHz | 4 GB | Intel MKL | 1914 | 13.187 | 1063 |

Backslash

This benchmark computes the solution of a linear system of equations. This is often called the "LINPACK" benchmark [2], but Scilab uses LAPACK.
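
For a square dense system, the work done by the backslash operator can be pictured as an LU factorization with row pivoting followed by two triangular solves. The sketch below compares x = A\b with an explicit factorization through Scilab's lu function. It is an illustration under that assumption, not a description of the exact LAPACK calls made by Scilab, and the variable names are chosen for this example only.

```scilab
n = 200;
A = rand(n, n);
b = rand(n, 1);

// Direct solve with backslash.
x1 = A \ b;

// Explicit route: [L, U, E] = lu(A) returns factors such that E*A = L*U,
// so A*x = b becomes L*(U*x) = E*b.
[L, U, E] = lu(A);
y = L \ (E * b);   // forward substitution
x2 = U \ y;        // back substitution

relres1 = norm(A*x1 - b) / norm(b);
relres2 = norm(A*x2 - b) / norm(b);
reldiff = norm(x1 - x2) / norm(x1);
mprintf("Relative residual, backslash:   %e\n", relres1);
mprintf("Relative residual, explicit LU: %e\n", relres2);
mprintf("Relative difference of the two solutions: %e\n", reldiff);
```

Both routes should give small residuals on a well-conditioned random matrix. The benchmark times only the backslash call.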

The script

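stacksize("max"); // Use the largest available stack (Scilab 5)
s = stacksize(); // s(1) is the total stack size, in doubles
floor(sqrt(s(1))) // The maximum size of a square dense matrix of doubles
round(s(1)*8/10^6) // The memory, in MB
rand( "normal" ); // Draw the entries from the normal distribution
n = 1000;
A = rand(n,n);
b = rand(n,1);
tic();
x = A\b;
t = toc();
mflops = round((2/3*n^3 + 2*n^2)/t/1.e6); // Factorization plus triangular solves
disp([n t mflops])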

A more complete benchmark is available in bench_backslash.sce [4].
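
The flop count in the script has two terms: about 2/3*n^3 flops for the LU factorization and about 2*n^2 flops for the forward and back substitutions. The sketch below wraps the formula in a hypothetical helper, backslash_mflops, and shows how small the second term is at the sizes used here. The helper name and the sample values are assumptions for illustration.

```scilab
// Mflops of a dense solve of size n that took t seconds,
// with the same flop count as the benchmark script.
function m = backslash_mflops(n, t)
    flops = 2/3 * n^3 + 2 * n^2;
    m = round(flops / t / 1.e6);
endfunction

n = 1000;
// Fraction of the flops spent in the two triangular solves.
share = (2 * n^2) / (2/3 * n^3 + 2 * n^2);
mprintf("Share of the triangular solves at n=%d: %.2f %%\n", n, 100 * share);
mprintf("Mflops for n=%d and t=0.1 s: %d\n", n, backslash_mflops(n, 0.1));
```

Since the factorization dominates, the backslash benchmark costs roughly one third of the flops of a matrix-matrix product of the same size.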

The results

| Scilab | OS | CPU | Physical Memory | Library | n | Time (s) | MFLOPS |
|---|---|---|---|---|---|---|---|
| scilab-5.2.2-x64 | Windows Seven Ultimate 64 bits | Intel Core2 6600 4*2.4 GHz | 8 GB | Intel MKL | 6864 | 9.655 | 22339 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Intel MKL | 5720 | 6.376 | 19578 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | ATLAS 32 bits tuned (sse&mt) | 6864 | 11.304 | 19080 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | AMD ACML 4.3.0 | 3971 | 5.498 | 7598 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | ATLAS 64 bits sse2 (tuned) | 2757 | 10.140 | 4133 |
| scilab-5.3.2 | Windows Seven Ultimate 64 bits | AMD Fusion E-350 1.6 GHz | 8 GB | Intel MKL | 3309 | 10.802 | 2238 |
| scilab-5.3.0-beta-4 | Fedora Linux 13 64 bits | Intel Core2 6600 2*2.4 GHz | 4 GB | Ref. BLAS 64 bits | 1914 | 2.570 | 1821 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | Ref. BLAS | 2757 | 10.514 | 1330 |
| scilab-5.3.0-beta-4 | Windows Vista Ultimate 32 bits | Intel Xeon E5410 4*2.33 GHz | 4 GB | ATLAS | 3309 | 12.074 | 2002 |
| scilab-5.3.0-beta-4 | Debian GNU/Linux 32 bits | Intel Core2 4*2.66 GHz | 4 GB | Ref. BLAS | 1914 | 3.29 | 1422 |
| scilab-5.3.0-beta-4 | Linux Ubuntu 32 bits | Intel Pentium M 2 GHz | 1 GB | Ref. BLAS | 1000 | ? | ~700 |
| scilab-5.3.0-beta-4 | Linux Ubuntu 32 bits | Intel Pentium M 2 GHz | 1 GB | ATLAS | 3000 | ? | ~1400 |

Notes

The benchmarks are also provided by the scibench ATOMS module, which is installed and loaded as follows:

atomsInstall("scibench")
atomsLoad("scibench")

To run the matmul benchmark:

lines(0);
stacksize("max");
scf();
perftable = scibench_matmul ( %t , %t , 0.1 , 8 , 1.2 )

To run the backslash benchmark:

lines(0);
stacksize("max");
scf();
perftable = scibench_backslash ( %t , %t , 0.1 , 8 , 1.2 )

References

[1] "Programming in Scilab", Michael Baudin, 2010, (HTTP)

[2] "Benchmarks: LINPACK and MATLAB - Fame and fortune from megaflops", Cleve Moler, 1994, (PDF)

[3] "Benchmarking matrix-matrix product", Michael Baudin, 2010, (bench_matmul.sce)

[4] "Benchmarking backslash", Michael Baudin, 2010, (bench_backslash.sce)

[5] "Benchmark programs and reports", http://www.netlib.org/benchmark/

[6] "Automatically tuned linear algebra software", R. Clint Whaley and Jack J. Dongarra. In Supercomputing '98: Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM), pages 1-27, Washington, DC, USA, 1998. IEEE Computer Society.

[7] "Automated empirical optimizations of software and the ATLAS project", R. Clint Whaley, Antoine Petitet and Jack J. Dongarra, 2000

public: Linalg performances (last edited 2013-12-09 15:49:31 by michael.baudin@contrib.scilab.org)