Outlier detection with data mining techniques and statistical methods
DOI: https://doi.org/10.29019/enfoque.v11n1.584
Keywords: outlier, data mining, KNN, chi-square, financial fraud.
Abstract
Outlier detection in data mining (DM) and in knowledge discovery in databases (KDD) is of great interest in areas that rely on decision support systems. A straightforward application is in finance, where DM can help detect fraud or find errors introduced by users. It is therefore essential to evaluate the veracity of the information with methods that detect unusual behavior in the data. This paper proposes a method to detect outliers in a database of nominal data. The method uses a global k-nearest neighbors (KNN) algorithm, the k-means clustering algorithm, and the chi-square statistical test. These techniques were applied to a database of clients who had applied for credit. The experiment was performed on a data set of 1180 tuples into which outliers were deliberately introduced. The results showed that the proposed method detected all of the introduced outliers.
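As a rough illustration of two of the techniques named in the abstract, the sketch below scores nominal records with a global KNN outlier score and computes a Pearson chi-square statistic between two nominal attributes. It is not the authors' implementation: the Hamming distance, the toy credit records, the attribute values, and the function names are all assumptions made for this example. The abstract does not say how the techniques are combined, and k-means is left out because it would need a numeric encoding that the abstract does not specify.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two nominal records differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def knn_outlier_scores(records, k=3):
    """Global KNN score: mean distance from each record to its k nearest neighbors."""
    scores = []
    for i, rec in enumerate(records):
        dists = sorted(
            hamming(rec, other) for j, other in enumerate(records) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


def chi_square_statistic(records, col_a, col_b):
    """Pearson chi-square statistic for independence of two nominal columns."""
    n = len(records)
    observed = Counter((r[col_a], r[col_b]) for r in records)
    row_totals = Counter(r[col_a] for r in records)
    col_totals = Counter(r[col_b] for r in records)
    stat = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (observed.get((a, b), 0) - expected) ** 2 / expected
    return stat


# Hypothetical credit applications: (employment, housing, loan purpose).
records = [
    ("employed", "owner", "car"),
    ("employed", "owner", "car"),
    ("employed", "renter", "car"),
    ("employed", "owner", "home"),
    ("employed", "renter", "home"),
    ("self-employed", "owner", "car"),
    ("unemployed", "none", "yacht"),  # deliberately injected outlier
]

scores = knn_outlier_scores(records, k=3)
suspect = scores.index(max(scores))
print("KNN scores:", [round(s, 2) for s in scores])
print("Most anomalous record:", records[suspect])
print("Chi-square (employment vs. housing):", chi_square_statistic(records, 0, 1))
```

Records with a high KNN score are far from all of their neighbors, so they are candidates for review. A large chi-square statistic suggests that two attributes are associated, which could be used to flag value combinations that are rarely observed together.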
License
Copyright (c) 2020 Enfoque UTE
This work is licensed under a Creative Commons Attribution 3.0 Unported License.
The articles and research published by the UTE University are carried out under the Open Access regime in electronic format. This means that all content is freely available without charge to the user or his/her institution. Users are allowed to read, download, copy, distribute, print, search, or link to the full texts of the articles, or use them for any other lawful purpose, without asking prior permission from the publisher or the author. This is in accordance with the BOAI definition of open access. By submitting an article to any of the scientific journals of the UTE University, the author or authors accept these conditions.
The UTE applies the Creative Commons Attribution (CC-BY) license to articles in its scientific journals. Under this open access license, as an author you agree that anyone may reuse your article in whole or in part for any purpose, free of charge, including commercial purposes. Anyone can copy, distribute or reuse the content as long as the author and original source are correctly cited. This facilitates freedom of reuse and also ensures that content can be extracted without barriers for research needs.
The Enfoque UTE journal guarantees and declares that authors always retain all copyrights and full publishing rights without restrictions [© The Author(s)]. Acknowledgment (BY): Any exploitation of the work is allowed, including a commercial purpose, as well as the creation of derivative works, the distribution of which is also allowed without any restriction.