1. Machine Learning features in Scilab
1.1. Motivations
1.1.1. In education:
Why Is Machine Learning (CS 229) The Most Popular Course At Stanford? http://www.forbes.com/sites/anthonykosner/2013/12/29/why-is-machine-learning-cs-229-the-most-popular-course-at-stanford/
1.1.2. In the industry:
Dassault Aviation is learning from simulation:
https://www.scilab.io/use-cases/design-of-experiments-and-optimization-of-aircraft-design/
Self-organizing maps (a special kind of neural network) display the impact of the different parameters on the design of the aircraft.
1.2. Categories:
For machine learning capabilities in Scilab, we need to consider both supervised and unsupervised approaches.
The different kinds of algorithms are then:
- regression
- classification
- clustering
1.3. Presentation of the results
For every machine learning algorithm considered, you should detail precisely, with an example:
- the hypothesis function
- the cost function
- the optimization used to learn (gradient descent or other...)
1.4. Machine Learning Example
Let us look at a Machine Learning problem in Classification.
Classification, in Machine Learning lingo, refers to separating given data into different classes. The machine learning task here is to assign each data point to the correct class automatically. A simple example is a baby learning different colours from pictures of differently coloured circles. As the baby learns more and more, he or she is expected to distinguish colours not only on circles, but also on more complex objects like cars or fruits.
Let us look at one of the most commonly used Classification algorithms in Machine Learning: Logistic Regression. Despite its name, Logistic Regression is a Classification algorithm, not a Regression algorithm. Let us explore Logistic Regression through an example:
Suppose you have a set of data points and their (binary) class given by
D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where each xi belongs to R^d (d-dimensional real space) and yi ∈ {0, 1} is the class to be identified.
Let us take a simple linear model which could help us identify the class y of a single data point x: w0 + w1x1 + w2x2 + ... + wdxd = wTx, where x1, ..., xd now denote the d components of x, and x is extended with a constant component 1 so that the bias w0 is included in wTx.
Here the wj are the weights associated with each component of x. The problem is that wTx is not a probability: it can take any real value. To turn it into a probability for the class y, we apply a sigmoid function to wTx,
which is given by sig(wTx) = 1/(1 + e^(-wTx)). The sigmoid function takes values between 0 and 1 and can thus be read as a probability. In general, for binary classification, this will be defined as:
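In the standard formulation of Logistic Regression, which is assumed here, the two class probabilities and their compact combined form can be written as:

```latex
P(y = 1 \mid x) = \mathrm{sig}(w^T x) = \frac{1}{1 + e^{-w^T x}},
\qquad
P(y = 0 \mid x) = 1 - \mathrm{sig}(w^T x)

P(y \mid x) = \mathrm{sig}(w^T x)^{\,y} \, \bigl(1 - \mathrm{sig}(w^T x)\bigr)^{1 - y},
\qquad y \in \{0, 1\}
```

The compact form is convenient because the probability of the whole data set is then a product of such terms, one per data point.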
Now, we want to maximize this probability over the whole data set. So we tweak the function a bit to make it easier to work with. Instead of maximizing the original function, we maximize its logarithm, which has the same maximizer. Further, since we are more used to minimizing functions, instead of maximizing the logarithm we minimize its negative. We also write α for the sigmoid function defined above. So this is how our function looks now:
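Assuming the n data points are independent, and writing α_i = sig(wTx_i), the negative log-likelihood takes the standard form below. The name E(w) is introduced here only for convenience; the second line gives its gradient with respect to w.

```latex
E(w) = -\sum_{i=1}^{n} \Bigl[ y_i \ln \alpha_i + (1 - y_i) \ln (1 - \alpha_i) \Bigr],
\qquad \alpha_i = \mathrm{sig}(w^T x_i)

\nabla E(w) = \sum_{i=1}^{n} (\alpha_i - y_i)\, x_i
```

The gradient follows from the identity sig'(z) = sig(z)(1 - sig(z)).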
Setting the gradient of this function to zero gives one equation per weight, but these equations are non-linear in w and have no closed-form solution. Thus we need iterative algorithms to solve them:
1. Gradient Descent: A simple way to find the w that minimizes the above function is to 'roll downhill' on the convex function with a step size given by η. The update rule for w is: w ← w - η ∇E(w), where E(w) denotes the function above and ∇E(w) its gradient with respect to w.
* Step-size problem: The difficulty with this algorithm is that the step size η has to be chosen very carefully. If the step size is too small, the algorithm takes too many steps to converge. On the other hand, if the step size is too large, the algorithm may keep overshooting the minimum and never converge.
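The gradient descent update and the effect of the step size can be sketched as follows. This is a minimal illustration in plain Python, not a Scilab implementation: the data set is made up, the function names (`sigmoid`, `predict`, `cost`, `gradient`, `gradient_descent`) are hypothetical, and the cost is assumed to be the standard negative log-likelihood, whose gradient is the sum over the data points of (α_i - y_i) x_i.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, x):
    """P(y = 1 | x) for weights w, where w[0] is the bias w0."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def cost(w, X, y):
    """Negative log-likelihood of the labels y under the weights w."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        a = predict(w, xi)
        total -= yi * math.log(a + eps) + (1 - yi) * math.log(1 - a + eps)
    return total

def gradient(w, X, y):
    """Gradient of the cost: sum over points of (alpha_i - y_i) * [1, x_i]."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = predict(w, xi) - yi
        g[0] += err
        for j, xj in enumerate(xi):
            g[j + 1] += err * xj
    return g

def gradient_descent(X, y, eta, n_iter):
    """Apply w <- w - eta * gradient(w) n_iter times, starting from w = 0."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        g = gradient(w, X, y)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Made-up one-dimensional data; the two classes overlap around x = 2.25.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

for eta in (0.001, 0.1):
    w = gradient_descent(X, y, eta, 500)
    print("eta =", eta, "w =", w, "cost =", cost(w, X, y))
```

For the same number of iterations, the smaller step size leaves the cost further from its minimum. A much larger value, for example η = 1.0 on this data, can be tried to observe overshooting: the cost then increases after the first step.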
To address this problem, the Newton-Raphson method can be used. It takes a variable step that depends on the curvature (second derivative) of the convex function we want to minimize, and not only on its slope: where the curvature is large, the algorithm takes small steps, and where the curvature is small, it allows larger steps. The update rule for w then becomes: w ← w - H^(-1) ∇E(w), where ∇E(w) and H are the gradient and the Hessian (matrix of second derivatives) of the function being minimized.
Convergence to the optimal w is then typically achieved in fewer iterations.
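The Newton-Raphson update can be sketched in the same spirit. This plain Python sketch is only an illustration under stated assumptions: it handles a single feature plus a bias w0, so that the Hessian is a 2x2 matrix that can be inverted by hand; it uses the standard Hessian of the negative log-likelihood, the sum over the data points of α_i (1 - α_i) x_i x_iT; and the data set and the function names (`grad_and_hessian`, `newton_logistic`) are made up.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def grad_and_hessian(w0, w1, xs, ys):
    """Gradient and symmetric 2x2 Hessian of the negative log-likelihood."""
    g0 = g1 = 0.0
    h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        a = sigmoid(w0 + w1 * x)   # alpha_i
        g0 += a - y
        g1 += (a - y) * x
        r = a * (1.0 - a)          # curvature weight alpha_i * (1 - alpha_i)
        h00 += r
        h01 += r * x
        h11 += r * x * x
    return (g0, g1), (h00, h01, h11)

def newton_logistic(xs, ys, n_iter=10):
    """Fit P(y = 1 | x) = sigmoid(w0 + w1 * x) with Newton-Raphson steps."""
    w0 = w1 = 0.0
    for _ in range(n_iter):
        (g0, g1), (h00, h01, h11) = grad_and_hessian(w0, w1, xs, ys)
        det = h00 * h11 - h01 * h01
        # step = H^(-1) g, using the closed-form inverse of a 2x2 matrix
        w0 -= (h11 * g0 - h01 * g1) / det
        w1 -= (h00 * g1 - h01 * g0) / det
    return w0, w1

# Made-up one-dimensional data; the two classes overlap around x = 2.25.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w0, w1 = newton_logistic(xs, ys)
(g0, g1), _ = grad_and_hessian(w0, w1, xs, ys)
print("w0 =", w0, "w1 =", w1)
print("gradient norm =", math.hypot(g0, g1))
```

No step size has to be chosen here: the Hessian rescales each step. The price is that every iteration has to build and invert the Hessian, which becomes more expensive as the number of weights grows.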
1.5. Content existing in Scilab
ANN Toolbox Based on “Matrix techniques for ANN” Ryurick Hristev, 2000
Neural Network Module http://atoms.scilab.org/toolboxes/neuralnetwork/2.0
libSVM (Support Vector Machine) http://atoms.scilab.org/toolboxes/libsvm This tool provides a simple interface to LIBSVM, a library for support vector machines (http://www.csie.ntu.edu.tw/~cjlin/libsvm).
NaN Toolbox https://atoms.scilab.org/toolboxes/nan https://pub.ist.ac.at/~schloegl/matlab/NaN/ A toolbox for data with and without missing values, where missing values are encoded as NaN.
For more content refer to the tutorials on scilab.io: http://scilab.io/category/machine-learning/
1.6. Implementation
The algorithms can be implemented directly in the Scilab language, as described in this example on logistic regression:
http://scilab.io/machine-learning-logistic-regression-tutorial/
Another approach, by "integration", could be to interface Scilab with already existing libraries from other languages:
Python Scikit-learn: http://scikit-learn.org
Tensorflow: https://www.tensorflow.org/
MLC++: https://www.sgi.com/tech/mlc/source.html
R libraries
1.7. Sources of inspiration
Matlab (R) Statistics and Machine Learning Toolbox: https://www.mathworks.com/products/statistics.html
Massive Open Online Course (MOOC) of Andrew Ng, Stanford, on Coursera https://www.coursera.org/learn/machine-learning