Outlier detection with data mining techniques and statistical methods

Authors

  • Marcos Patricio Orellana Cordero, Universidad del Azuay
  • Priscila Cedillo, Universidad de Cuenca

DOI:

https://doi.org/10.29019/enfoque.v11n1.584

Keywords:

outlier, data mining, KNN, chi-square, financial fraud.

Abstract

Outlier detection in data mining (DM) and in the process of knowledge discovery in databases (KDD) is of great interest in areas that require decision support systems. A straightforward application is found in the financial area, where DM can potentially detect financial fraud or find errors introduced by users. It is therefore essential to evaluate the veracity of the information by using methods that detect unusual behavior in the data. This paper proposes a method to detect values that are considered outliers in a database of nominal data. The method combines a global k-nearest neighbors (KNN) algorithm, the k-means clustering algorithm, and the chi-square statistical test. These techniques were applied to a database of clients who had applied for financial credit. The experiment was performed on a data set of 1180 tuples into which outliers were deliberately introduced. The results showed that the proposed method was able to detect all of the introduced outliers.
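
As an illustration of the global nearest-neighbor idea mentioned in the abstract, the following minimal sketch scores each nominal record by its mean distance to its k nearest neighbors. It is not the authors' implementation: the Hamming distance, the mean-distance score, the attribute names, and the sample records are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two nominal records differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_outlier_scores(records: List[Tuple[str, ...]], k: int) -> List[float]:
    """Score each record by its mean distance to its k nearest neighbors."""
    if not 1 <= k < len(records):
        raise ValueError("k must be between 1 and len(records) - 1")
    scores = []
    for i, rec in enumerate(records):
        # Distances to every other record (the record itself is excluded).
        dists = sorted(
            hamming(rec, other) for j, other in enumerate(records) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical credit-application records: (employment, housing, purpose).
records = [
    ("employed", "owner", "car"),
    ("employed", "owner", "car"),
    ("employed", "renter", "car"),
    ("employed", "owner", "home"),
    ("retired", "family", "travel"),  # deliberately unusual record
]

scores = knn_outlier_scores(records, k=2)
# Index of the record with the highest score, i.e. the strongest outlier candidate.
most_unusual = max(range(len(records)), key=scores.__getitem__)
print(scores, most_unusual)
```

A record whose score is far above the rest is a candidate outlier. In practice the choice of k and of the cut-off for the score would have to be tuned to the data set.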

Published

2020-01-31

How to Cite

Orellana Cordero, M. P., & Cedillo, P. (2020). Outlier detection with data mining techniques and statistical methods. Enfoque UTE, 11(1), pp. 56 - 67. https://doi.org/10.29019/enfoque.v11n1.584

Section

Computer Science, ICTs