Feature Selection Using Expected Entropy Loss: Perl Script
Many papers on applications of neural networks and classifiers use feature
selection as the first step in preparing the input data. One way to do it is to use expected entropy loss.
A brief description is given in [1], [2].
Our goal here is to create source code that can be used for problems like this.
The problem used to illustrate the feature selection algorithm is simple and is defined as follows.
Say we have an object made of 3 cells. Each cell can be white or black, coded as 0 or 1.
The object as a whole is also assigned to a class (white or black).
By definition (although in a real situation we would not know this rule),
the object is black if at least 2 of its 3 cells are black. Otherwise it is white.
Thus, for example, 0,0,0 and 1,0,0 are white, while 1,1,0 and 1,1,1 are black.
The data is randomly generated.
In addition, the following features are added:
if the first cell is black and the object class is black, we use 0 (true), otherwise 1 (false);
if 2 cells are black and the object class is black, we use 0 (true), otherwise 1 (false);
a feature that is an exact copy of the object class data.
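The data set described above can be sketched in a few lines. This is a Python illustration, not the Perl script from this post; the function names, the row layout (3 cells, 3 added features, class last) and the reading of "2 cells are black" as exactly two black cells are all assumptions.

```python
import random

def make_object(rng):
    """Return one row: 3 cells, 3 added features, and the class label last."""
    cells = [rng.randint(0, 1) for _ in range(3)]   # 1 = black, 0 = white
    n_black = sum(cells)
    label = 1 if n_black >= 2 else 0                # black when 2 or 3 cells are black

    # Added features use the coding from the text: 0 means true, 1 means false.
    first_and_black = 0 if (cells[0] == 1 and label == 1) else 1
    # Assumption: "2 cells are black" is read as exactly two black cells.
    two_and_black = 0 if (n_black == 2 and label == 1) else 1
    class_copy = label                              # same values as the class

    return cells + [first_and_black, two_and_black, class_copy, label]

def make_dataset(n_rows, seed=0):
    """Generate n_rows random objects; the seed only makes runs repeatable."""
    rng = random.Random(seed)
    return [make_object(rng) for _ in range(n_rows)]

rows = make_dataset(200)
```

Each row then holds seven values, and the last one is the class that the other six are scored against.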
Which features are the most descriptive of the object class?
To answer this we use the following property: a feature with a higher expected entropy loss
is more descriptive than the others.
The formulas for calculating the expected entropy loss can be found in [1];
they require estimating probabilities and computing entropies.
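As a minimal sketch of that calculation, the code below uses the usual definition of expected entropy loss: the entropy of the class before looking at a feature, minus the expected entropy of the class once the feature value is known. It is written in Python rather than Perl, assumes binary 0/1 features and classes, and is assumed (not verified here) to match the formulas in [1]; the function names are hypothetical.

```python
from math import log2

def entropy(p):
    """Binary entropy in bits of a probability p; 0*log(0) is taken as 0."""
    return -sum(q * log2(q) for q in (p, 1.0 - p) if q > 0)

def expected_entropy_loss(feature, labels):
    """Prior class entropy minus expected class entropy after seeing the feature."""
    n = len(labels)
    prior = entropy(sum(labels) / n)
    posterior = 0.0
    for value in (0, 1):
        subset = [y for f, y in zip(feature, labels) if f == value]
        if subset:  # a feature value that never occurs contributes nothing
            posterior += len(subset) / n * entropy(sum(subset) / len(subset))
    return prior - posterior

labels    = [0, 0, 1, 1]
copy_of_y = [0, 0, 1, 1]   # identical to the class
unrelated = [0, 1, 0, 1]   # independent of the class
print(expected_entropy_loss(copy_of_y, labels))   # 1.0
print(expected_entropy_loss(unrelated, labels))   # 0.0
```

A feature that reproduces the class removes all of the class entropy (1 bit here), while a feature unrelated to the class removes none.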
A Perl script for feature selection is provided here.
The output of this script is an index of the features.
It shows that the highest expected entropy loss belongs to the feature that most closely
describes the class data.
For the general case, a feature selection Perl script that takes its input from a data file
can be used. The input is a text file with space-separated fields. The script uses all columns as features and the last column as the Y (class) column.
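The input format can be illustrated with a short parsing sketch. It is in Python, not the Perl of the script itself, and it follows the description literally: every column is kept as a feature and the last column is also returned as Y. The function name `read_columns` and the sample rows are made up for illustration.

```python
import io

def read_columns(handle):
    """Parse space-separated 0/1 rows into (feature_columns, class_column)."""
    rows = [[int(tok) for tok in line.split()] for line in handle if line.strip()]
    columns = [list(col) for col in zip(*rows)]
    # Every column is kept as a feature, the last one included,
    # and the last column doubles as the class (Y) column.
    return columns, columns[-1]

sample = io.StringIO("1 1 0 1\n0 0 1 0\n1 0 1 1\n")   # made-up rows for illustration
features, y = read_columns(sample)
print(len(features), y)   # 4 [1, 0, 1]
```

Working column by column keeps the scoring step simple, since each feature is compared with Y independently of the others.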
In this example I noticed that the logarithm is sometimes taken of 0, which
gives an error, so the program changes the argument to a small value such as 0.001,
and this appears to work well.
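One possible reason the workaround behaves well: if each entropy term has the usual form -p*log2(p), then a zero probability multiplies whatever the logarithm returns, so the term is 0 anyway. That is the same result as the common convention 0*log(0) = 0. The Python sketch below compares the two approaches; the form of the term and the names `term_clamped` and `term_skipped` are assumptions, not taken from the Perl script.

```python
from math import log2

EPSILON = 0.001   # the small replacement value mentioned in the text

def term_clamped(p):
    """One -p*log2(p) term, with the log argument replaced by EPSILON when p is 0."""
    return -p * log2(p if p > 0 else EPSILON)

def term_skipped(p):
    """The same term using the convention 0*log2(0) = 0."""
    return -p * log2(p) if p > 0 else 0.0

for p in (0.0, 0.25, 0.5, 1.0):
    # Both versions agree for every p, including p = 0.
    print(p, term_clamped(p) == term_skipped(p))
```

The two versions would differ only if the probability itself, rather than just the argument of the logarithm, were replaced by 0.001.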
Thus, source code for feature selection was developed.