MyMedR

Displaying publications 1 - 20 of 64 in total

Fulltext A review of subsequence time series clustering

Zolhavarieh S, Aghabozorgi S, Teh YW

ScientificWorldJournal, 2014;2014:312521.
PMID: 25140332 DOI: 10.1155/2014/312521

Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews some definitions and backgrounds related to subsequence time series clustering. The categorization of the literature reviews is divided into three groups: preproof, interproof, and postproof period. Moreover, various state-of-the-art approaches in performing subsequence time series clustering are discussed under each of the following categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.

Matched MeSH terms: Data Mining*
Fulltext A new split based searching for exact pattern matching for natural texts

Hakak S, Kamsin A, Shivakumara P, Idna Idris MY, Gilkar GA

PLoS One, 2018;13(7):e0200912.
PMID: 30048486 DOI: 10.1371/journal.pone.0200912

Exact pattern matching algorithms are popular and widely used in several applications, such as molecular biology, text processing, image processing, web search engines, network intrusion detection systems and operating systems. The focus of these algorithms is to achieve time efficiency according to the application, but not to reduce memory consumption. In this work, we propose a novel idea to improve both time efficiency and memory consumption by splitting the query string for searching in a corpus. For a given text, the proposed algorithm splits the query pattern into two equal halves and uses the second (right) half as the query string for searching in the corpus. Once a match is found for the second half, the proposed algorithm applies a brute-force procedure to match the remaining characters relative to the location of the right half. Experimental results on different S1 Dataset, namely Arabic, English, Chinese, Italian and French text databases, show that the proposed algorithm outperforms the existing S1 Algorithm in terms of time efficiency and memory consumption as the length of the query pattern increases.
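
A minimal sketch of the split-and-verify idea described in this abstract, not the authors' implementation. It assumes that an odd-length pattern gives its extra character to the right half, and it uses Python's built-in `str.find` as a stand-in for whatever search the paper applies to the right half. The function name `split_search` is hypothetical.

```python
def split_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every exact occurrence of pattern in text."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    half = m // 2              # length of the left half
    left = pattern[:half]
    right = pattern[half:]     # right half, used as the search key
    matches = []
    # The right half cannot begin before index `half` in a full match.
    pos = text.find(right, half)
    while pos != -1:
        start = pos - half
        # Brute-force check of the remaining (left) half, character by character.
        if all(text[start + i] == left[i] for i in range(half)):
            matches.append(start)
        pos = text.find(right, pos + 1)   # allow overlapping occurrences
    return matches
```

Calling `split_search("abracadabra", "abra")` searches for `"ra"` first and then verifies `"ab"` in front of each hit.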

Matched MeSH terms: Data Mining/methods*
Fulltext A novel approach for heart disease prediction using strength scores with significant predictors

Yazdani A, Varathan KD, Chiam YK, Malik AW, Wan Ahmad WA

BMC Med Inform Decis Mak, 2021 06 21;21(1):194.
PMID: 34154576 DOI: 10.1186/s12911-021-01527-5

BACKGROUND: Cardiovascular disease is the leading cause of death in many countries. Physicians often diagnose cardiovascular disease based on current clinical tests and previous experience of diagnosing patients with similar symptoms. Patients who suffer from heart disease require quick diagnosis, early treatment and constant observations. To address their needs, many data mining approaches have been used in the past in diagnosing and predicting heart diseases. Previous research was also focused on identifying the significant contributing features to heart disease prediction, however, less importance was given to identifying the strength of these features.
METHOD: This paper is motivated by the gap in the literature, thus proposes an algorithm that measures the strength of the significant features that contribute to heart disease prediction. The study is aimed at predicting heart disease based on the scores of significant features using Weighted Associative Rule Mining.
RESULTS: A set of important feature scores and rules were identified in diagnosing heart disease and cardiologists were consulted to confirm the validity of these rules. The experiments performed on the UCI open dataset, widely used for heart disease research yielded the highest confidence score of 98% in predicting heart disease.
CONCLUSION: This study managed to provide a significant contribution in computing the strength scores with significant predictors in heart disease prediction. From the evaluation results, we obtained important rules and achieved highest confidence score by utilizing the computed strength scores of significant predictors on Weighted Associative Rule Mining in predicting heart disease.

Matched MeSH terms: Data Mining
Fulltext A novel association rule mining approach using TID intermediate itemset

Aqra I, Herawan T, Abdul Ghani N, Akhunzada A, Ali A, Bin Razali R, et al.

PLoS One, 2018;13(1):e0179703.
PMID: 29351287 DOI: 10.1371/journal.pone.0179703

Designing an efficient association rule mining (ARM) algorithm for multilevel knowledge-based transactional databases that is appropriate for real-world deployments is of paramount concern. However, dynamic decision making that needs to modify the threshold, either to minimize or to maximize the output knowledge, requires the extant state-of-the-art algorithms to rescan the entire database. Consequently, the process incurs heavy computation cost and is not feasible for real-time applications. The paper efficiently addresses the problem of dynamically updating the threshold for a given purpose. The paper contributes by presenting a novel ARM approach that creates an intermediate itemset and applies a threshold to extract categorical frequent itemsets with diverse threshold values. This improves the overall efficiency, as we no longer need to scan the whole database. After the entire itemset is built, we are able to obtain the real support without rebuilding the itemset (e.g., the itemset list is intersected to obtain the actual support). Moreover, the algorithm supports extracting many frequent itemsets according to a pre-determined minimum support with an independent purpose. Additionally, the experimental results of our proposed approach demonstrate the capability to be deployed in any mining system in a fully parallel mode, consequently increasing the efficiency of the real-time association rule discovery process. The proposed approach outperforms the extant state-of-the-art and shows promising results that reduce computation cost, increase accuracy, and produce all possible itemsets.
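
The remark that the itemset list is intersected to obtain the actual support can be illustrated with a generic transaction-ID (TID) list sketch. This is an assumption-laden illustration of TID-list intersection in general, not the paper's intermediate-itemset structure; the function names and the toy transactions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_tid_lists(transactions):
    """Scan the database once and map each item to the set of TIDs containing it."""
    tids = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tids[item].add(tid)
    return dict(tids)

def support(itemset, tid_lists):
    """Support count of an itemset = size of the intersection of its items' TID sets."""
    sets = [tid_lists.get(item, set()) for item in itemset]
    return len(set.intersection(*sets)) if sets else 0

def frequent_pairs(tid_lists, min_support):
    """Frequent 2-itemsets for a given threshold, computed without rescanning the data."""
    result = {}
    for pair in combinations(sorted(tid_lists), 2):
        count = support(pair, tid_lists)
        if count >= min_support:
            result[pair] = count
    return result

# Toy database: the TID lists are built once, then reused for different thresholds.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
tid_lists = build_tid_lists(transactions)
pairs_at_2 = frequent_pairs(tid_lists, min_support=2)
pairs_at_3 = frequent_pairs(tid_lists, min_support=3)
```

Changing `min_support` only re-filters the intersections, which is the property the abstract relies on when the threshold is updated dynamically.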

Matched MeSH terms: Data Mining/methods*
Gaining Insights on Nasopharyngeal Carcinoma Treatment Outcome Using Clinical Data Mining Techniques

Ghaibeh AA, Kasem A, Ng XJ, Nair HLK, Hirose J, Thiruchelvam V

Stud Health Technol Inform, 2018;247:386-390.
PMID: 29677988

The analysis of Electronic Health Records (EHRs) is attracting a lot of research attention in the medical informatics domain. Hospitals and medical institutes started to use data mining techniques to gain new insights from the massive amounts of data that can be made available through EHRs. Researchers in the medical field have often used descriptive statistics and classical statistical methods to prove assumed medical hypotheses. However, discovering new insights from large amounts of data solely based on experts' observations is difficult. Using data mining techniques and visualizations, practitioners can find hidden knowledge, identify interesting patterns, or formulate new hypotheses to be further investigated. This paper describes a work in progress on using data mining methods to analyze clinical data of Nasopharyngeal Carcinoma (NPC) cancer patients. NPC is the fifth most common cancer among Malaysians, and the data analyzed in this study was collected from three states in Malaysia (Kuala Lumpur, Sabah and Sarawak), and is considered to be the largest up-to-date dataset of its kind. This research is addressing the issue of cancer recurrence after the completion of radiotherapy and chemotherapy treatment. We describe the procedure, problems, and insights gained during the process.

Matched MeSH terms: Data Mining*
Fulltext Bioactive Molecule Prediction Using Extreme Gradient Boosting

Babajide Mustapha I, Saeed F

Molecules, 2016 Jul 28;21(8).
PMID: 27483216 DOI: 10.3390/molecules21080983

Following the explosive growth in chemical and biological data, the shift from traditional methods of drug discovery to computer-aided means has made data mining and machine learning methods integral parts of today's drug discovery process. In this paper, extreme gradient boosting (Xgboost), which is an ensemble of Classification and Regression Tree (CART) and a variant of the Gradient Boosting Machine, was investigated for the prediction of biological activity based on quantitative description of the compound's molecular structure. Seven datasets, well known in the literature were used in this paper and experimental results show that Xgboost can outperform machine learning algorithms like Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) for the prediction of biological activities. In addition to its ability to detect minority activity classes in highly imbalanced datasets, it showed remarkable performance on both high and low diversity datasets.

Matched MeSH terms: Data Mining/methods*
Fulltext Multi-population Black Hole Algorithm for the problem of data clustering

Salih SQ, Alsewari AA, Wahab HA, Mohammed MKA, Rashid TA, Das D, et al.

PLoS One, 2023;18(7):e0288044.
PMID: 37406006 DOI: 10.1371/journal.pone.0288044

The retrieval of important information from a dataset requires applying a special data mining technique known as data clustering (DC). DC classifies similar objects into groups with similar characteristics. Clustering involves grouping the data around k cluster centres that are typically selected randomly. The issues behind DC have called for a search for an alternative solution. Recently, a nature-based optimization algorithm named the Black Hole Algorithm (BHA) was developed to address several well-known optimization problems. The BHA is a population-based metaheuristic that mimics the natural phenomenon of black holes, whereby individual stars represent the potential solutions revolving around the solution space. The original BHA showed better performance compared to other algorithms when applied to a benchmark dataset, despite its poor exploration capability. Hence, this paper presents a multi-population version of BHA as a generalization of the BHA, called MBHA, wherein the performance of the algorithm does not depend on the single best-found solution but on a set of generated best solutions. The formulated method was tested using a set of nine widespread and popular benchmark test functions. The experimental outcomes indicated that the method generates highly precise results compared to BHA and comparable algorithms in the study, as well as excellent robustness. Furthermore, the proposed MBHA achieved a high rate of convergence on six real datasets (collected from the UCL machine learning lab), making it suitable for DC problems. Lastly, the evaluations conclusively indicated the appropriateness of the proposed algorithm for resolving DC issues.

Matched MeSH terms: Data Mining/methods
Fulltext Adapting Document Similarity Measures for Ligand-Based Virtual Screening

Himmat M, Salim N, Al-Dabbagh MM, Saeed F, Ahmed A

Molecules, 2016 Apr 13;21(4):476.
PMID: 27089312 DOI: 10.3390/molecules21040476

Quantifying the similarity of molecules is considered one of the major tasks in virtual screening. There are many similarity measures that have been proposed for this purpose, some of which have been derived from document and text retrieving areas as most often these similarity methods give good results in document retrieval and can achieve good results in virtual screening. In this work, we propose a similarity measure for ligand-based virtual screening, which has been derived from a text processing similarity measure. It has been adopted to be suitable for virtual screening; we called this proposed measure the Adapted Similarity Measure of Text Processing (ASMTP). For evaluating and testing the proposed ASMTP we conducted several experiments on two different benchmark datasets: the Maximum Unbiased Validation (MUV) and the MDL Drug Data Report (MDDR). The experiments have been conducted by choosing 10 reference structures from each class randomly as queries and evaluate them in the recall of cut-offs at 1% and 5%. The overall obtained results are compared with some similarity methods including the Tanimoto coefficient, which are considered to be the conventional and standard similarity coefficients for fingerprint-based similarity calculations. The achieved results show that the performance of ligand-based virtual screening is better and outperforms the Tanimoto coefficients and other methods.
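
For reference, the Tanimoto coefficient used as the baseline above can be sketched for binary fingerprints as follows. The fingerprints are represented as sets of "on" bit positions, which is an assumption made for brevity, and returning 0.0 for two empty fingerprints is a convention chosen here, not something stated in the abstract. The proposed ASMTP measure itself is not reproduced.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto coefficient of two binary fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0                      # convention chosen for this sketch
    common = len(a & b)                 # bits set in both fingerprints
    return common / (len(a) + len(b) - common)

def rank_by_similarity(query: set[int], library: dict[str, set[int]]) -> list[str]:
    """Rank library molecules from most to least similar to the query."""
    return sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)
```

A recall-at-cutoff evaluation such as the 1% and 5% cut-offs mentioned above would then count how many known actives appear in the top fraction of the ranked list.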

Matched MeSH terms: Data Mining*
Stacking with Recursive Feature Elimination-Isolation Forest for classification of diabetes mellitus

Idris NF, Ismail MA, Jaya MIM, Ibrahim AO, Abulfaraj AW, Binzagr F

PLoS One, 2024;19(5):e0302595.
PMID: 38718024 DOI: 10.1371/journal.pone.0302595

Diabetes Mellitus is one of the oldest diseases known to humankind, dating back to ancient Egypt. The disease is a chronic metabolic disorder that heavily burdens healthcare providers worldwide due to the steady yearly increase in patients. Worryingly, diabetes affects not only the aging population but also children. It is important to control this problem, as diabetes can lead to many health complications. As technology evolves, humankind has started integrating computer technology with the healthcare system. The use of artificial intelligence helps healthcare become more efficient in diagnosing diabetes patients, improves healthcare delivery, and makes care more patient-centric. Among the advanced data mining techniques in artificial intelligence, stacking is among the most prominent methods applied in the diabetes domain. Hence, this study opts to investigate the potential of stacking ensembles. The aim of this study is to reduce the high complexity inherent in stacking, which contributes to longer training time, and to reduce the outliers in the diabetes data to improve the classification performance. In addressing this concern, a novel machine learning method called the Stacking Recursive Feature Elimination-Isolation Forest was introduced for diabetes prediction. Stacking is combined with Recursive Feature Elimination to design an efficient model for diabetes diagnosis while using fewer features as resources. This method also incorporates Isolation Forest as an outlier removal method. The study uses accuracy, precision, recall, F1 measure, training time, and standard deviation metrics to assess the classification performance. The proposed method achieved an accuracy of 79.077% for PIMA Indians Diabetes and 97.446% for the Diabetes Prediction dataset, outperforming many existing methods and demonstrating effectiveness in the diabetes domain.

Matched MeSH terms: Data Mining/methods
Fulltext Mining personal data using smartphones and wearable devices: a survey

Habib ur Rehman M, Liew CS, Wah TY, Shuja J, Daghighi B

Sensors (Basel), 2015 Feb 13;15(2):4430-69.
PMID: 25688592 DOI: 10.3390/s150204430

The staggering growth in smartphone and wearable device use has led to a massive scale generation of personal (user-specific) data. To explore, analyze, and extract useful information and knowledge from the deluge of personal data, one has to leverage these devices as the data-mining platforms in ubiquitous, pervasive, and big data environments. This study presents the personal ecosystem where all computational resources, communication facilities, storage and knowledge management systems are available in user proximity. An extensive review on recent literature has been conducted and a detailed taxonomy is presented. The performance evaluation metrics and their empirical evidences are sorted out in this paper. Finally, we have highlighted some future research directions and potentially emerging application areas for personal data mining using smartphones and wearable devices.

Matched MeSH terms: Data Mining
Comparative analysis of statistical tools for oil palm phytochemical research

Ishak NA, Tahir NI, Mohd Sa'id SN, Gopal K, Othman A, Ramli US

Heliyon, 2021 Feb;7(2):e06048.
PMID: 33553773 DOI: 10.1016/j.heliyon.2021.e06048

Recent advances in phytochemical analysis have allowed the accumulation of data for crop researchers due to its capacity to footprint and distinguish metabolites that are present within an organism, tissue or cell. Apart from genotypic traits, slight changes caused by either biotic or abiotic stimuli will have a significant impact on metabolite abundances and will eventually be observed through physicochemical characteristics. Apposite data mining to interpret the mounds of phytochemical information from such a dynamic system is thus incumbent. In this investigation, several statistical software platforms, covering exploratory and confirmatory techniques of multivariate data analysis from four different statistical tools (COVAIN, SIMCA-P+, MetaboAnalyst and RIKEN Excel Macro), were appraised using an oil palm phytochemical data set. As each software tool has its own advantages and limitations, the insights gained from this assessment were documented to highlight several aspects of function and suitability for adapting the tools into the oil palm phytochemistry pipeline. This comparative analysis will provide scientists with salient notes on data assessment and data mining that will later allow the depiction of the overall oil palm status in-situ and ex-situ.

Matched MeSH terms: Data Mining
Fulltext Modified multi-class classification using association rule mining

Yuhanis Yusof, Mohammed Hayel Refai

Pertanika Journal of Science & Technology, 2013;21(1):205-216.
MyJurnal

As the number of documents increases, automated classification that aids the analysis and management of documents receives focal attention. Classification based on association rules that are generated from a collection of documents is a recent data mining approach that integrates association rule mining and classification. The existing approaches produce either high accuracy with a large number of rules or a small number of association rules that yield low accuracy. This work presents an association rule mining method that employs a new item production algorithm, generates a small number of rules, and produces an acceptable accuracy rate. The proposed method is evaluated on UCI datasets and measured based on prediction accuracy and the number of generated association rules. A comparison is then made against an existing classifier, Multi-class Classification based on Association Rule (MCAR). The experiments show that the proposed method produces an accuracy rate similar to MCAR while using fewer rules.

Matched MeSH terms: Data Mining
Fulltext An Empirical Analysis of Rough Set Categorical Clustering Techniques

Uddin J, Ghazali R, Deris MM

PLoS One, 2017;12(1):e0164803.
PMID: 28068344 DOI: 10.1371/journal.pone.0164803

Clustering a set of objects into homogeneous groups is a fundamental operation in data mining. Recently, much attention has been paid to categorical data clustering, where data objects are made up of non-numerical attributes. For categorical data clustering, rough set based approaches such as Maximum Dependency Attribute (MDA) and Maximum Significance Attribute (MSA) have outperformed their predecessor approaches such as Bi-Clustering (BC), Total Roughness (TR) and Min-Min Roughness (MMR). This paper presents the limitations and issues of the MDA and MSA techniques on special types of data sets where both techniques fail to select, or face difficulty in selecting, their best clustering attribute. This analysis motivates the need for a better and more generalized rough set theory approach that can cope with the issues of MDA and MSA. Hence, an alternative technique named Maximum Indiscernible Attribute (MIA) for clustering categorical data using rough set indiscernibility relations is proposed. The novelty of the proposed approach is that, unlike other rough set theory techniques, it uses the domain knowledge of the data set. It is based on the concept of the indiscernibility relation combined with a number of clusters. To show the significance of the proposed approach, the effect of the number of clusters on rough accuracy, purity and entropy is described in the form of propositions. Moreover, ten different data sets from previously utilized research cases and the UCI repository are used for experiments. The results, produced in tabular and graphical forms, show that the proposed MIA technique provides better performance in selecting the clustering attribute in terms of purity, entropy, iterations, time, accuracy and rough accuracy.

Matched MeSH terms: Data Mining
Fulltext Performance of SVM with multiple kernel learning for classification tasks of imbalanced datasets

Saeed, Sana, Ong, Hong Choon

Pertanika Journal of Science & Technology, 2019;27(1):527-545.
MyJurnal

Support vector machine (SVM) is one of the most popular algorithms in machine learning and data mining. However, its reduced efficiency is usually observed for imbalanced datasets. To improve the performance of SVM for binary imbalanced datasets, a new scheme based on oversampling and the hybrid algorithm were introduced. Besides the use of a single kernel function, SVM was applied with multiple kernel learning (MKL). A weighted linear combination was defined based on the linear kernel function, radial basis function (RBF kernel), and sigmoid kernel function for MKL. By generating the synthetic samples in the minority class, searching the best choices of the SVM parameters and identifying the weights of MKL by minimizing the objective function, the improved performance of SVM was observed. To prove the strength of the proposed scheme, an experimental study, including noisy borderline and real imbalanced datasets was conducted. SVM was applied with linear kernel function, RBF kernel, sigmoid kernel function and MKL on all datasets. The performance of SVM with all kernel functions was evaluated by using sensitivity, G Mean, and F measure. A significantly improved performance of SVM with MKL was observed by applying the proposed scheme.
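
The weighted linear combination of linear, RBF and sigmoid kernels mentioned in this abstract can be sketched as below. The weights, `gamma` and `coef0` values are placeholder assumptions; in the paper the weights are found by minimizing an objective function, which is not reproduced here.

```python
import math

def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    """tanh(gamma * <x, y> + coef0)."""
    return math.tanh(gamma * linear_kernel(x, y) + coef0)

def combined_kernel(x, y, weights=(0.4, 0.4, 0.2)):
    """Weighted linear combination of the three base kernels (weights are assumed)."""
    w_lin, w_rbf, w_sig = weights
    return (w_lin * linear_kernel(x, y)
            + w_rbf * rbf_kernel(x, y)
            + w_sig * sigmoid_kernel(x, y))
```

A Gram matrix built from `combined_kernel` could be passed to any SVM implementation that accepts a precomputed kernel.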

Matched MeSH terms: Data Mining
Fulltext Statistical estimators as an alternative to standard deviation in weighted euclidean distance cluster analysis

Dalatu, Paul Inuwa, Habshah Midi

Pertanika Journal of Science & Technology, 2018;26(4):1823-1836.
MyJurnal

Clustering is basically one of the major sources of primary data mining tools. It helps researchers understand the natural grouping of attributes in datasets. Clustering is an unsupervised classification method with the major aim of partitioning, where objects in the same cluster are similar, and objects which belong to different clusters vary significantly with respect to their attributes. However, the classical Standardized Euclidean distance, which uses standard deviation to down weight maximum points of the ith features on the distance clusters, has been criticized by many scholars because the method produces outliers, lacks robustness, and has a 0% breakdown point. It also has low efficiency in normal distribution. Therefore, to remedy the problem, we suggest two statistical estimators which have 50% breakdown points, namely the Sn and Qn estimators, with 58% and 82% efficiency, respectively. The proposed methods evidently outperformed the existing methods in down weighting the maximum points of the ith features in distance-based clustering analysis.
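
To illustrate how a robust scale estimator can replace the standard deviation in a weighted Euclidean distance, here is a naive O(n^2) sketch of the Qn estimator. It assumes the commonly quoted Gaussian consistency constant 2.2219 and omits the finite-sample correction factors; the Sn estimator and the paper's exact weighting scheme are not reproduced, and the function names are hypothetical.

```python
import math
from itertools import combinations

def qn_scale(values):
    """Naive Qn: constant times the k-th smallest pairwise absolute difference."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    h = n // 2 + 1
    k = h * (h - 1) // 2                 # k = C(h, 2)
    diffs = sorted(abs(a - b) for a, b in combinations(values, 2))
    return 2.2219 * diffs[k - 1]         # assumed consistency constant, no small-n correction

def robust_weighted_distance(x, y, scales):
    """Euclidean distance with each feature divided by its (robust) scale."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scales)))
```

Because Qn is based on an order statistic of pairwise differences, a single extreme value such as 100 in `[1, 2, 3, 4, 100]` does not inflate the scale the way it inflates the standard deviation. A scale of zero (many tied values) would need separate handling before dividing.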

Matched MeSH terms: Data Mining
Fulltext Fischer Linear Discrimination and Quadratic Discrimination Analysis-Based Data Mining Technique for Internet of Things Framework for Healthcare

Hasan MK, Ghazal TM, Alkhalifah A, Abu Bakar KA, Omidvar A, Nafi NS, et al.

Front Public Health, 2021;9:737149.
PMID: 34712639 DOI: 10.3389/fpubh.2021.737149

The internet of reality or augmented reality has been considered a breakthrough and an outstanding critical mutation with an emphasis on data mining leading to dismantling of some of its assumptions among several of its stakeholders. In this work, we study the pillars of these technologies connected to web usage as the Internet of things (IoT) system's healthcare infrastructure. We used several data mining techniques to evaluate the online advertisement data set, which can be categorized as high dimensional with 1,553 attributes, and the imbalanced data set, which automatically simulates an IoT discrimination problem. The proposed methodology applies Fischer linear discrimination analysis (FLDA) and quadratic discrimination analysis (QDA) within random projection (RP) filters to compare our runtime and accuracy with support vector machine (SVM), K-nearest neighbor (KNN), and Multilayer perceptron (MLP) in IoT-based systems. Finally, the impact on number of projections was practically experimented, and the sensitivity of both FLDA and QDA with regard to precision and runtime was found to be challenging. The modeling results show not only improved accuracy, but also runtime improvements. When compared with SVM, KNN, and MLP in QDA and FLDA, runtime shortens by 20 times in our chosen data set simulated for a healthcare framework. The RP filtering in the preprocessing stage of the attribute selection, fulfilling the model's runtime, is a standpoint in the IoT industry. Index Terms: Data Mining, Random Projection, Fischer Linear Discriminant Analysis, Online Advertisement Dataset, Quadratic Discriminant Analysis, Feature Selection, Internet of Things.

Matched MeSH terms: Data Mining
Feature selection and risk prediction for patients with coronary artery disease using data mining

Md Idris N, Chiam YK, Varathan KD, Wan Ahmad WA, Chee KH, Liew YM

Med Biol Eng Comput, 2020 Dec;58(12):3123-3140.
PMID: 33155096 DOI: 10.1007/s11517-020-02268-9

Coronary artery disease (CAD) is an important cause of mortality across the globe. Early risk prediction of CAD could reduce the death rate by allowing early and targeted treatments. In healthcare, some studies have applied data mining techniques and machine learning algorithms to the risk prediction of CAD using patient data collected by hospitals and medical centers. However, most of these studies used all the attributes in the datasets, which might reduce the performance of prediction models due to data redundancy. The objective of this research is to identify significant features to build models for predicting the risk level of patients with CAD. In this research, significant features were selected using three methods (i.e., Chi-squared test, recursive feature elimination, and Embedded Decision Tree). The Synthetic Minority Over-sampling Technique (SMOTE) was implemented to address the imbalanced dataset issue. The prediction models were built based on the identified significant features and eight machine learning algorithms, utilizing Acute Coronary Syndrome (ACS) datasets provided by the National Cardiovascular Disease Database (NCVD) Malaysia. The prediction models were evaluated and compared using six performance evaluation metrics, and the top-performing models achieved an AUC of more than 90%.
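
The oversampling step can be sketched in a SMOTE-like form: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration with hypothetical names and parameters, not the configuration used in the study.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, by squared Euclidean distance
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()               # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
new_points = smote_like(minority, n_new=5)
```

Every synthetic point lies between two existing minority samples, so the minority class is enlarged without duplicating records exactly.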

Matched MeSH terms: Data Mining
Fulltext Review of context-based similarity measure for categorical data

Nurul Adzlyana, M.S., Rosma, M.D., Nurazzah, A.R.

Pertanika Journal of Science & Technology, 2017;25(2):619-630.
MyJurnal

Data mining processes such as clustering, classification, regression and outlier detection are developed based on similarity between two objects. Data mining processes of categorical data is found to be most challenging. Earlier similarity measures are context-free. In recent years, researchers have come up with context-sensitive similarity measure based on the relationships of objects. This paper provides an in-depth review of context-based similarity measures. Descriptions of algorithm for four context-based similarity measure, namely Association-based similarity measure, DILCA, CBDL and the hybrid context-based similarity measure, are described. Advantages and limitations of each context-based similarity measure are identified and explained. Context-based similarity measure is highly recommended for data-mining tasks for categorical data. The findings of this paper will help data miners in choosing appropriate similarity measures to achieve more accurate classification or clustering results.

Matched MeSH terms: Data Mining
Fulltext Text mining in mosquito-borne disease: A systematic review

Ong SQ, Pauzi MBM, Gan KH

Acta Trop, 2022 Jul;231:106447.
PMID: 35430265 DOI: 10.1016/j.actatropica.2022.106447

Mosquito-borne diseases are emerging and re-emerging across the globe, especially after the COVID19 pandemic. The recent advances in text mining in infectious diseases hold the potential of providing timely access to explicit and implicit associations among information in the text. In the past few years, the availability of online text data in the form of unstructured or semi-structured text with rich information content from this domain has enabled many studies to provide solutions in this area, e.g., disease-related knowledge discovery, disease surveillance, early detection systems, etc. However, to the best of our knowledge, a recent review of text mining in the domain of mosquito-borne disease was not available. In this review, we survey the recent works on the text mining techniques used in combating mosquito-borne diseases. We highlight the corpus sources, technologies, applications, and the challenges faced by the studies, followed by the possible future directions that can be taken further in this domain. We present a bibliometric analysis of the 294 scientific articles that have been published in Scopus and PubMed in the domain of text mining in mosquito-borne diseases, from the year 2016 to 2021. The papers were further filtered and reviewed based on the techniques used to analyze the text related to mosquito-borne diseases. Based on the corpus of 158 selected articles, we found 27 of the articles were relevant and used text mining in mosquito-borne diseases. These articles mostly covered Zika (38.70%), Dengue (32.26%), and Malaria (29.03%), with extremely low numbers or none for other crucial mosquito-borne diseases like chikungunya, yellow fever, and West Nile fever. Twitter was the dominant corpus resource for text mining in mosquito-borne diseases, followed by the PubMed and LexisNexis databases. Sentiment analysis was the most popular text mining technique for understanding the discourse of the disease, followed by information extraction, which used dependency relation and co-occurrence-based approaches to extract relations and events. Surveillance was the main usage in most of the reviewed studies, followed by treatment, which focused on the drug-disease or symptom-disease association. The advance in text mining could improve the management of mosquito-borne diseases. However, the technique and application posed many limitations and challenges, including biases like user authentication and language, real-world implementation, etc. We discussed the future direction which can be useful to expand this area and domain. This review paper contributes mainly as a library for text mining in mosquito-borne diseases and could further explore the system for other neglected diseases.

Matched MeSH terms: Data Mining
Fulltext A hybrid interpretable deep structure based on adaptive neuro-fuzzy inference system, decision tree, and K-means for intrusion detection

Liu J, Yinchai W, Siong TC, Li X, Zhao L, Wei F

Sci Rep, 2022 Dec 01;12(1):20770.
PMID: 36456582 DOI: 10.1038/s41598-022-23765-x

To generate an interpretable deep architecture for identifying deep intrusion patterns, this study proposes an approach that combines ANFIS (Adaptive Network-based Fuzzy Inference System) and DT (Decision Tree) for interpreting the deep pattern of intrusion detection. Meanwhile, to improve the efficiency of training and prediction, Pearson correlation analysis, standard deviation, and a new adaptive K-means are used to select attributes and make fuzzy interval decisions. The proposed algorithm was trained, validated, and tested on the NSL-KDD (National security lab-knowledge discovery and data mining) dataset. Using 22 attributes that are highly related to the target, the proposed method achieves a 99.86% detection rate and a 0.14% false alarm rate on the KDDTrain+ dataset, and a 77.46% detection rate on the KDDTest+ dataset, which is better than many classifiers. In addition, the interpretable model can help demonstrate the complex and overlapping patterns of intrusions and analyze the patterns of various intrusions.

Matched MeSH terms: Data Mining
