ROSE (Rough Outlier Set Extraction)
ROSE is a machine learning algorithm that implements outlier detection of an unlabeled spatiotemporal dataset using a rough set approach. It also provides a representative subset of the original data, describing the same structure, with which it is possible to detect the same outliers, named kernel set.
The paper that reports all the details and should be cited when the code is used is
Albanese A, Sankar K P, Petrosino A., IEEE Transactions on Knowledge and Data Engineering, Vol. 26, no. 1, pp. 194-207, 2014, DOI: 10.1109/TKDE.2012.234.
ROSE is written in Java and can be downloaded here. It is platform independent and bundled as jar package. Java Runtime Environment (JRE) is needed to run this software.
ROSE package content:
- ROSE.jar (java executable file)
- Original paper 2014_Albanese_Sankar_Petrosino.pdf
- Supplementary material folder ttk2014010194
ROSE can be executed from a Unix-like shell, a Terminal for Mac-OS or Windows CMD as follows:
java -jar ROSE.jar <path input dataset> <number of outliers> <number of neighbors> <number of elements once> <alpha>
This is an example using the bundled sample dataset (assuming you are in the ROSE root folder):
java -jar ROSE.jar sample_dataset.txt 16 16 18 1
path input dataset - Input dataset filename (complete absolute path). This has to be a text file where each row contains numerical values, ranging in 0, 1. Each numeric value must be space separated (two or three space characters are allowed). These values are the coordinates of the instances to be processed.
Each row specifies:
required configuration: (x,y,t) triplets where (x,y) is a spatial coordinate and (t) is a timestamp.
0.84506904 0.10248589 0.15307587
0.84543708 0.10339779 0.15308264
0.84524589 0.10418321 0.15308941
0.84540431 0.10549813 0.15309618
0.84588570 0.10667086 0.15310294
optional configuration: 13 values are allowed at most, (x,y,t) + (f1,…,f10) where the first 3 values are inherited from required configuration and (f1,…,f10) is a list of other features.
0.84506904 0.10248589 0.15307587 0.84588570 0.10667086 0.15310294
0.84543708 0.10339779 0.15308264 0.84524589 0.10418321 0.15308941
0.84540431 0.10549813 0.15309618 0.84588570 0.10667086 0.15310294
standard configuration: not all 13 values need to be specified and just some features are specified (the sample dataset only contains (x,y,t)).
number of outliers - the number of outliers to be extracted from the data set.
number of neighbors - the number of neighbors to be considered.
number of elements once - it is the number of items that are extracted from the input data set at each iteration (and then compared with the rest of the file). This must be greater than the number of outliers. number of elements once > number of outliers
alpha - is a multiplier value of the linear combination of the weights, ranging between 0,1 and 1 - alpha (for temporal weight). Special cases: alpha = 1 only spatial components are considered. alpha = 0 only temporal component is considered.
Each execution of the ROSE software will produce an output folder named as the dataset input file name. Each execution on the same input dataset will overwrite the existing output folder.
- TopOutlier-numoutlier-alpha-weight-milliseconds.txt - (milliseconds value in the filename is used to maintain processing history, TopOutlier-inputfilename-numoutliers-alpha-weight.txt is the most recent iteration). These files contain TopOutliers search results at each iteration. The last two files contain the Lower and Upper approximation of the outlier set respectively.
- WeightedData-inputfilename-numoutliers-weight.txt - contains the starting points set with weights for each point.
- SolvingSet-milliseconds.txt - contains the kernel set (milliseconds value in the filename is used to maintain processing history, SolvingSet.txt is the most recent iteration).
- NegativeRegion.txt - contains discarded instances.