What is decimal scaling normalization?

How and why do normalization and feature scaling work?


I have seen that many machine learning algorithms work better with mean centering and covariance equalization. For example, neural networks tend to converge faster, and K-Means generally gives better clustering with preprocessed features. I don't see the intuition behind why these preprocessing steps lead to better performance. Can anyone explain this to me?

Answer:


It's simply a matter of getting all of your data onto the same scale: if the scales of the various features are very different, this can have a knock-on effect on your ability to learn (depending on which methods you are using). Ensuring standardized feature values implicitly weights all features equally in their representation.
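As a minimal sketch of what "getting everything onto the same scale" usually means in practice, here is z-score standardization applied to two made-up features with very different units (the values are purely illustrative):

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
# on very different scales (e.g. age in years, income in dollars).
X = np.array([[25.0,  40_000.0],
              [47.0,  82_000.0],
              [31.0,  55_000.0],
              [58.0, 120_000.0]])

# Z-score standardization: every column ends up with mean 0 and unit variance,
# so no single feature dominates purely because of its units.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # ~[1, 1]
```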





It is true that preprocessing in machine learning is something of a black art. It is not often written down in papers why several preprocessing steps are essential to make a method work, and I am not sure it is understood in every case. To complicate matters further, it depends heavily on the method you use and also on the problem domain.

For example, some methods are invariant to affine transformations. If you have a neural network and only apply an affine transformation to your data, the network in theory neither loses nor gains anything. In practice, however, a neural network works best when the inputs are centered and whitened, that is, when their covariance is diagonal and their mean is the zero vector. Why does this make things better? It is only because the optimization of the neural network behaves more gracefully: the hidden activation functions do not saturate as quickly, and so you do not get gradients close to zero early in learning.
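A small sketch of what "centered and whitened" means in practice, using PCA whitening on made-up correlated inputs (the data and the particular whitening transform are illustrative, not tied to any specific network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated, non-centered toy inputs (values chosen only for illustration).
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]]) + [10.0, -4.0]

# Center: subtract the mean so the mean becomes the zero vector.
Xc = X - X.mean(axis=0)

# Whiten (PCA whitening): rotate onto the eigenvectors of the covariance
# and rescale so the covariance becomes the identity (hence diagonal).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = Xc @ eigvecs / np.sqrt(eigvals)

print(np.round(X_white.mean(axis=0), 3))           # ~[0, 0]
print(np.round(np.cov(X_white, rowvar=False), 3))  # ~identity matrix
```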

Other methods, e.g. K-Means, can yield completely different solutions depending on the preprocessing. This is because an affine transformation implies a change in the metric space: the Euclidean distance between two samples is different after such a transformation.
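A tiny numeric illustration (the values are made up): merely changing the units of one feature changes the Euclidean distances that k-means operates on, and therefore potentially the clusters it finds:

```python
import numpy as np

a = np.array([1.0, 100.0])
b = np.array([2.0, 300.0])

# Euclidean distance in the original units.
print(np.linalg.norm(a - b))                  # ~200.0, dominated by the second feature

# Rescale the second feature (e.g. grams -> kilograms): same data,
# different metric space, so distances -- and k-means clusters -- change.
scale = np.array([1.0, 0.001])
print(np.linalg.norm(a * scale - b * scale))  # ~1.02, dominated by the first feature
```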

At the end of the day, you want to understand what you are doing to the data. Whitening in computer vision and divisive normalization are things the human brain does in its own vision pipeline as well.


Some ideas, references, and illustrations of why input normalization can be useful for ANNs and k-means:

K-means:

K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation, leaving the variances unequal is equivalent to putting more weight on variables with smaller variance.

Example in Matlab:

(FYI: how can I tell whether my dataset is clustered or unclustered, that is, forming a single cluster?)
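The Matlab example mentioned above is not reproduced here; as a rough Python stand-in (with made-up data and group structure), the sketch below shows k-means splitting along whichever axis has the largest spread when the features are left raw, and recovering the intended groups once the features are standardized:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two true groups separated only in feature 0 (small numeric range);
# feature 1 is pure high-variance noise (values chosen for illustration).
f0 = np.concatenate([rng.normal(0.0, 0.1, 100), rng.normal(1.0, 0.1, 100)])
f1 = rng.normal(0.0, 100.0, 200)
X = np.column_stack([f0, f1])
truth = np.repeat([0, 1], 100)

def agreement(labels, truth):
    # Label permutation does not matter, so take the better of the two matchings.
    acc = (labels == truth).mean()
    return max(acc, 1 - acc)

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_
std = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    (X - X.mean(axis=0)) / X.std(axis=0)).labels_

print("raw features:         ", agreement(raw, truth))  # near 0.5 -- splits on the noisy axis
print("standardized features:", agreement(std, truth))  # near 1.0 -- recovers the true groups
```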

Distributed clustering:

The comparative analysis shows that the results of the distributed clustering depend on the type of normalization procedure.

Artificial neural network (inputs):

If the input variables are combined linearly, as in an MLP, it is rarely strictly necessary, at least in theory, to standardize the inputs. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with exactly the same outputs as before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
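A quick sketch of the "rescaling can be undone by the weights and biases" point, for a single linear unit (the shift and scale values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

# A single linear unit, as found inside an MLP: y = w . x + b
w = rng.normal(size=3)
b = 0.5
x = np.array([200.0, 0.01, 7.0])      # inputs on wildly different scales

# Affine rescaling of the inputs (illustrative per-feature shift and scale):
shift = np.array([150.0, 0.008, 5.0])
scale = np.array([100.0, 0.005, 3.0])
x_scaled = (x - shift) / scale

# The rescaling can be undone exactly by adjusting the weights and bias,
# so in theory the unit computes the same output either way.
w_adj = w * scale
b_adj = b + w @ shift

print(w @ x + b)                 # original output
print(w_adj @ x_scaled + b_adj)  # identical output on the rescaled input
```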

Artificial neural network (inputs / outputs):

Should you do any of these things with your data? The answer is, it depends.

Standardizing either input or target variables tends to make the training process better behaved by improving the numerical conditioning (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html) of the optimization problem and by ensuring that the various default values involved in initialization and termination are appropriate. Standardizing targets can also affect the objective function.

Standardization of cases should be approached with caution as it discards information. When this information is irrelevant, standardizing cases can be very helpful. When this information is important, standardizing cases can be disastrous.


Interestingly, changing the units of measure can even lead to a very different cluster structure: Kaufman, Leonard and Peter J. Rousseeuw. "Finding Groups in Data: An Introduction to Cluster Analysis." (2005).

In some applications, changing the measurement units may even lead one to see a very different clustering structure. For example, the age (in years) and height (in centimeters) of four imaginary people are given in Table 3 and plotted in Figure 3. It appears that {A, B} and {C, D} are two well-separated clusters. On the other hand, when height is expressed in feet, one obtains Table 4 and Figure 4, where the obvious clusters are now {A, C} and {B, D}. This partition is completely different from the first because each subject has received another companion. (Figure 4 would have been flattened even more if age had been measured in days.)

To avoid this dependence on the choice of measurement units, one has the option of standardizing the data. This converts the original measurements into unitless variables.
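A small sketch of the unit dependence described above, using hypothetical numbers in the spirit of the book's Table 3 (the actual table values are not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical ages and heights for four people A, B, C, D
# (made-up values, not those from Kaufman & Rousseeuw).
age_years = np.array([20.0, 45.0, 20.0, 45.0])
height_cm = np.array([190.0, 185.0, 160.0, 155.0])

def clusters(X):
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_

# Height in centimetres: the height axis dominates -> clusters {A, B} and {C, D}.
print(clusters(np.column_stack([age_years, height_cm])))

# Same people, height in feet: now age dominates -> clusters {A, C} and {B, D}.
height_ft = height_cm / 30.48
print(clusters(np.column_stack([age_years, height_ft])))

# Standardizing each column makes the result independent of the unit choice.
def zscore(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

print(clusters(zscore(np.column_stack([age_years, height_cm]))))
print(clusters(zscore(np.column_stack([age_years, height_ft]))))
```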

Kaufman et al. continue with some interesting considerations (page 11):

From a philosophical point of view, standardization does not really solve the problem. Indeed, the choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units leads to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter expertise (see, e.g., Abrahamowicz, 1985). On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present, and the programs described in this book leave the choice up to the user.


Why does feature scaling work? I can give you an example (from Quora)

Let me answer this from a general ML perspective, not just for neural networks. When you collect data and extract features, the data is often collected on different scales. For example, the ages of employees in a company might be between 21 and 70 years, the sizes of the houses they live in between 500 and 5000 square feet, and their salaries in the range of $30,000 to $80,000. If you then use a simple Euclidean metric, the age feature will not play any role, because it is several orders of magnitude smaller than the other features. However, it may contain important information that is useful for the task. In such a case, you may want to normalize the features independently to the same scale, say [0, 1], so that they contribute equally when computing the distance.
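A short sketch of that example with made-up employee data, comparing raw and min-max-normalized Euclidean distances:

```python
import numpy as np

# Hypothetical employees: [age in years, house size in sq ft, salary in dollars]
X = np.array([[25.0,   900.0, 35_000.0],
              [62.0, 4_200.0, 78_000.0],
              [38.0, 1_500.0, 52_000.0]])

# Raw Euclidean distance is dominated by salary, so age is effectively ignored.
print(np.linalg.norm(X[0] - X[1]))

# Min-max normalization maps every feature independently to [0, 1],
# so each one can contribute comparably to the distance.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.linalg.norm(X_minmax[0] - X_minmax[1]))
```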



There are two different problems:

a) Learning the right function, e.g. with k-means: the input scale basically defines the similarity, so the clusters found depend on the scaling. Regularization, e.g. l2 regularization of the weights, assumes that each weight should be "equally small"; if your data is not scaled "appropriately", this will not be the case.

b) Optimization, namely by gradient descent (e.g. most neural networks). For gradient descent, you need to choose the learning rate. But a good learning rate (at least on the first hidden layer) depends on the input scaling: small [relevant] inputs will typically require larger weights, so you would want a larger learning rate for those weights (to get there faster), and vice versa for large inputs. Since you only want to use a single learning rate, you rescale your inputs instead. (Whitening, i.e. decorrelating, matters for the same reason.)
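A small illustration of point (b), using a made-up regression problem: the gradient component belonging to the large-scale input is orders of magnitude bigger than the one belonging to the small-scale input, so no single learning rate suits both weights unless the inputs are rescaled first:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression with one small-scale and one large-scale input feature.
x_small = rng.normal(0.0, 1.0, 200)
x_large = rng.normal(0.0, 1000.0, 200)
X = np.column_stack([x_small, x_large])
y = 2.0 * x_small + 0.003 * x_large + rng.normal(0.0, 0.1, 200)

# Gradient of the mean squared error at w = 0: dL/dw_j = -2 * mean(x_j * residual).
w = np.zeros(2)
residual = y - X @ w
grad = -2.0 * X.T @ residual / len(y)

# The two gradient components differ by roughly three orders of magnitude,
# which is exactly the learning-rate conflict described above.
print(grad)
```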



This paper is only about k-means, but it explains and demonstrates the need for data preprocessing quite well.

Standardization is the central preprocessing step in data mining for standardizing values of features or attributes from different dynamic ranges into a specific range. In this paper, we have analyzed the performance of three standardization methods on a conventional K-means algorithm. Comparing the results on infectious-disease datasets, it was found that the result obtained by the z-score standardization method is more effective and efficient than the min-max and decimal-scaling standardization methods.


... if there are some features with a large size or great variability, these kinds of features will strongly affect the clustering result. In this case, data standardization is an important preprocessing task to scale or control the variability of the datasets.


... the features must be dimensionless, since the numerical values of the ranges of dimensional features depend on the units of measurement, and hence a choice of the units of measurement may significantly alter the results of clustering. Therefore, one should not employ distance measures like the Euclidean distance without first normalizing the datasets.

Source: http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf
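For reference, a minimal sketch of the three normalization methods the paper compares (z-score, min-max, and decimal scaling), applied to one made-up feature column:

```python
import numpy as np

# One feature column with an illustrative dynamic range.
x = np.array([120.0, 455.0, 870.0, 230.0, 990.0])

# z-score: zero mean, unit variance.
z_score = (x - x.mean()) / x.std()

# min-max: linearly mapped into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# decimal scaling: divide by 10^j, with j the smallest integer such that
# every scaled value has absolute value below 1.
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))
decimal_scaled = x / 10**j

print(z_score)
print(min_max)
print(decimal_scaled)  # here j = 3, so values like 0.12, 0.455, ...
```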


Often preprocessing works because it removes features of the data that are unrelated to the classification problem you are trying to solve. Think, for example, of classifying audio data from different speakers. Fluctuations in loudness (amplitude) may be irrelevant, whereas the frequency spectrum is the really relevant aspect. In this case, amplitude normalization will be very helpful for most ML algorithms, because it removes an irrelevant aspect of the data that would otherwise cause a neural network to fit spurious patterns.
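A tiny sketch of amplitude (peak) normalization on a synthetic signal, showing that two clips differing only in volume become identical after normalization (the waveform is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical recordings of the same utterance at different volumes:
# identical waveform shape, different amplitude.
quiet = 0.05 * np.sin(np.linspace(0, 40 * np.pi, 8000)) + 0.001 * rng.normal(size=8000)
loud = 10.0 * quiet

# Peak-normalize each clip so volume differences disappear and only the
# (relevant) spectral/temporal structure remains.
def peak_normalize(signal):
    return signal / np.max(np.abs(signal))

print(np.allclose(peak_normalize(quiet), peak_normalize(loud)))  # True
```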


I think this is done simply so that the feature with larger values does not overshadow the effects of the feature with smaller values when learning a classifier. This becomes particularly important if the smaller-valued feature actually contributes to class separability. Classifiers like logistic regression would have difficulty learning the decision boundary if, for example, it exists at the micro level of one feature while other features are of the order of millions. Scaling also helps the algorithm converge better. That is why we do not take chances when encoding features into our algorithms: it is much easier for a classifier to learn the contributions (weights) of features this way. The same holds for K-means when Euclidean norms are used (confusion because of scale). Some algorithms can also work without normalization.
