Applied Math & Computer Science Lab
Data Analysis, Optimization & Mathematical Modeling, Artificial Intelligence, Neural Net For Everyday Life Applications

Feature Selection Using Expected Entropy Loss: Perl Script

    Many papers on applications of neural nets and classifiers use feature selection as the first step in preparing the input data. One way to do it is to use expected entropy loss, which is briefly described in [1] and [2]. Our goal here is to create source code that can be used for problems of this kind.
    The problem used to illustrate the feature selection algorithm is simple and is defined as follows. Suppose we have an object made of 3 cells. Each cell can be white or black, coded as 0 or 1 respectively. The object is also assigned to a class (white or black). By definition (which we would not know in a real situation), the object is black if at least 2 of its 3 cells are black; otherwise it is white.
     Thus 0,0,0 and 1,0,0 are examples of white objects, while 1,1,0 and 1,1,1 are examples of black objects.
     The data is randomly generated.
   In addition, the following features are added. The first is 0 (true) if the first cell is black and the object class is black, and 1 (false) otherwise. The second is 0 (true) if 2 cells are black and the object class is black, and 1 (false) otherwise. The third is simply a copy of the object class data. The question is: which features are most descriptive of the object class?
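The setup above can be sketched in a few lines. This is a minimal Python illustration, not the Perl script itself; the names `object_class`, `make_row` and `generate` are hypothetical, and "2 cells are black" is read here as "exactly 2 cells are black", which is an assumption.

```python
import random

def object_class(cells):
    """Black (1) if at least 2 of the 3 cells are black, else white (0)."""
    return 1 if sum(cells) >= 2 else 0

def make_row(cells):
    """Build one data row: 3 cells, 3 derived features, then the class."""
    y = object_class(cells)
    # Derived features use the article's coding: 0 = true, 1 = false.
    first_black_and_black = 0 if (cells[0] == 1 and y == 1) else 1
    # Assumption: "2 cells are black" means exactly 2 cells are black.
    two_black_and_black = 0 if (sum(cells) == 2 and y == 1) else 1
    class_copy = y  # the same values as the class column
    return list(cells) + [first_black_and_black, two_black_and_black,
                          class_copy, y]

def generate(n, seed=0):
    """Generate n random rows; the seed makes the run repeatable."""
    rng = random.Random(seed)
    return [make_row([rng.randint(0, 1) for _ in range(3)])
            for _ in range(n)]

if __name__ == "__main__":
    for row in generate(5):
        print(" ".join(map(str, row)))
```

Each printed line is space-separated with the class in the last column, so it matches the input layout described for the file-based script.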
    Now we use the following property: a higher expected entropy loss means that the feature is more descriptive than the others. The formulas for calculating expected entropy loss can be found in [1]; they require estimating probabilities from counts and computing entropies.
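As a rough illustration of the calculation, here is a Python sketch based on my reading of the definition in the cited papers. The prior entropy of the class is e = -Pr(C) lg Pr(C) - Pr(not C) lg Pr(not C); the posterior entropies e_f and e_notf are the same expression conditioned on the feature being present or absent; and the expected entropy loss is e - (e_f Pr(f) + e_notf Pr(not f)). The function names are hypothetical, and zero probabilities are handled with the convention 0 * log 0 = 0, which is an assumption and not necessarily what the Perl script does.

```python
from math import log2

def entropy(p):
    """Binary entropy in bits of probability p, taking 0 * log(0) as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)

def expected_entropy_loss(feature, labels):
    """Prior class entropy minus the expected posterior entropy.

    feature and labels are equal-length lists of 0/1 values.
    """
    n = len(labels)
    prior = entropy(sum(labels) / n)
    posterior = 0.0
    for value in (0, 1):
        # Class labels of the rows where the feature takes this value.
        subset = [y for f, y in zip(feature, labels) if f == value]
        if subset:
            weight = len(subset) / n  # estimate of Pr(feature == value)
            posterior += weight * entropy(sum(subset) / len(subset))
    return prior - posterior

def rank_features(rows):
    """rows: lists of 0/1 values with the class in the last column.

    Returns (feature index, loss) pairs, highest loss first.
    """
    labels = [r[-1] for r in rows]
    scores = [(j, expected_entropy_loss([r[j] for r in rows], labels))
              for j in range(len(rows[0]) - 1)]
    return sorted(scores, key=lambda s: s[1], reverse=True)

if __name__ == "__main__":
    # All 8 cell combinations, a copy of the class as a 4th feature, then the class.
    rows = []
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                y = 1 if a + b + c >= 2 else 0
                rows.append([a, b, c, y, y])
    for index, loss in rank_features(rows):
        print(index, round(loss, 4))
```

On this small table the copy of the class (index 3) gets a loss of 1 bit, the whole prior entropy, while each single cell gets about 0.19 bits, which agrees with the property stated above.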
    A feature selection Perl script is provided in [3]. The script outputs the indexes of the features, and the results show that the feature with the highest expected entropy loss is the one that most closely describes the class data.
    For the general case, a feature selection Perl script that takes its input from a data file can be used [4]. The input is a text file in which fields are separated by spaces. The script treats the last column as the Y (class) column and all the other columns as features.
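The input layout described above could be read as follows. This is a hedged Python sketch, not the Perl script; `load_table` is a hypothetical name, and it assumes every field is an integer such as 0 or 1.

```python
import os
import tempfile

def load_table(path):
    """Read a space-separated text file; return (feature_columns, labels).

    The last column is the Y (class) column; all others are features.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([int(v) for v in fields])
    labels = [row[-1] for row in rows]
    # Transpose so that features[j] holds column j over all rows.
    features = [list(col) for col in zip(*(row[:-1] for row in rows))]
    return features, labels

if __name__ == "__main__":
    # Write a tiny example file, load it back, then clean up.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("0 0 0 0\n1 0 0 0\n1 1 0 1\n1 1 1 1\n")
    try:
        features, labels = load_table(tmp.name)
        print(features)
        print(labels)
    finally:
        os.remove(tmp.name)
```

Returning the data column by column keeps the later scoring step simple, since each feature is compared against the class labels on its own.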
    In this example I noticed that the calculation sometimes takes the log of 0, which gives an error, so the program replaces the argument with a small value such as 0.001; this appears to work well.
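A small Python sketch of this workaround, next to a common alternative. The names `clamped_log2`, `plogp_clamped` and `plogp_limit` are hypothetical, and how the Perl script applies the substitution internally is not shown here, so treat this as an illustration under assumptions. In Python, `math.log2(0)` raises `ValueError`, which corresponds to the error described above.

```python
from math import log2

EPSILON = 0.001  # the small stand-in value mentioned in the text

def clamped_log2(x, eps=EPSILON):
    """log2 that replaces a zero (or negative) argument with eps."""
    return log2(x if x > 0 else eps)

def plogp_clamped(p):
    """p * log2(p) using the clamp; for p == 0 the product is still 0."""
    return p * clamped_log2(p)

def plogp_limit(p):
    """p * log2(p) using the limit convention 0 * log2(0) = 0."""
    return 0.0 if p <= 0 else p * log2(p)

if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 1.0):
        print(p, plogp_clamped(p), plogp_limit(p))
```

When the log is multiplied by the same zero probability, as in the entropy terms, both versions give 0, which may explain why the substitution works well in practice. If the log of a zero value were used on its own, the choice of 0.001 would affect the result.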
  Thus, source code for feature selection using expected entropy loss was developed.

References

1. What’s the code? Automatic Classification of Source Code Archives
2. Using Web Structure for Classifying and Describing Web Pages
3. Feature Selection Perl Script
4. Feature Selection Perl Script: Data input from file