Bauxite mining in the Kuantan district of Pahang has raised health concerns among communities residing near the mining areas. Bauxite mining and transportation activities have contributed substantially to environmental pollution. Residents of these areas fear that the soil may contain naturally occurring radioactive substances. Therefore, the objective of this study was to detect the presence of natural radioactive elements in the soil of the bauxite mining field at Bukit Goh, Kuantan.
Recent advances in phytochemical analysis have allowed crop researchers to accumulate data, owing to its capacity to footprint and distinguish metabolites that are present within organisms, tissues or cells. Apart from genotypic traits, slight changes caused by either biotic or abiotic stimuli will have a significant impact on metabolite abundances and will eventually be observed through physicochemical characteristics. Appropriate data mining to interpret the mounds of phytochemical information from such a dynamic system is thus essential. In this investigation, four statistical software tools (COVAIN, SIMCA-P+, MetaboAnalyst and RIKEN Excel Macro), covering both exploratory and confirmatory techniques of multivariate data analysis, were appraised using an oil palm phytochemical data set. As each software tool has its own advantages and limitations, the insights gained from this assessment were documented to clarify the functions of the tools and their suitability for adaptation into the oil palm phytochemistry pipeline. This comparative analysis should provide scientists with salient notes on data assessment and data mining that will later allow the depiction of the overall oil palm status in situ and ex situ.
As the number of documents increases, automated classification that aids the analysis and management of documents receives focal attention. Classification based on association rules generated from a collection of documents is a recent data mining approach that integrates association rule mining and classification. Existing approaches produce either high accuracy with a large number of rules or a small number of association rules with low accuracy. This work presents an association rule mining method that employs a new item production algorithm, which generates a small number of rules and produces an acceptable accuracy rate. The proposed method is evaluated on UCI datasets and measured based on prediction accuracy and the number of generated association rules. A comparison is then made against an existing classifier, Multi-class Classification based on Association Rule (MCAR). The experiments show that the proposed method produces an accuracy rate similar to that of MCAR while using fewer rules.
Clustering a set of objects into homogeneous groups is a fundamental operation in data mining. Recently, much attention has been paid to categorical data clustering, where data objects are made up of non-numerical attributes. For categorical data clustering, rough set based approaches such as Maximum Dependency Attribute (MDA) and Maximum Significance Attribute (MSA) have outperformed their predecessors, such as Bi-Clustering (BC), Total Roughness (TR) and Min-Min Roughness (MMR). This paper presents the limitations and issues of the MDA and MSA techniques on special types of data sets where both techniques fail to select, or face difficulty in selecting, their best clustering attribute. This analysis motivates the need for a better and more general rough set theory approach that can cope with the issues of MDA and MSA. Hence, an alternative technique named Maximum Indiscernible Attribute (MIA) for clustering categorical data using rough set indiscernibility relations is proposed. The novelty of the proposed approach is that, unlike other rough set theory techniques, it uses the domain knowledge of the data set. It is based on the concept of the indiscernibility relation combined with the number of clusters. To show the significance of the proposed approach, the effects of the number of clusters on rough accuracy, purity and entropy are described in the form of propositions. Moreover, ten different data sets from previously utilized research cases and the UCI repository are used for experiments. The results, presented in tabular and graphical forms, show that the proposed MIA technique provides better performance in selecting the clustering attribute in terms of purity, entropy, iterations, time, accuracy and rough accuracy.
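The rough set notions this abstract relies on (indiscernibility classes and rough accuracy) can be illustrated with a minimal sketch. It follows the standard Pawlak definitions and is not the MIA algorithm itself, whose details the abstract does not give. The toy records, the attribute names and the helper names `partition` and `rough_accuracy` are all hypothetical.

```python
from collections import defaultdict


def partition(rows, attr):
    """Group row indices into the equivalence classes of IND({attr})."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[row[attr]].add(i)
    return list(classes.values())


def rough_accuracy(rows, attr, target):
    """Return |lower approximation| / |upper approximation| of `target`
    with respect to the indiscernibility relation induced by `attr`."""
    lower, upper = set(), set()
    for block in partition(rows, attr):
        if block <= target:      # block lies entirely inside the target set
            lower |= block
        if block & target:       # block overlaps the target set
            upper |= block
    return len(lower) / len(upper) if upper else 1.0


# Hypothetical categorical records (illustration only).
rows = [
    {"colour": "red", "shape": "round"},
    {"colour": "red", "shape": "square"},
    {"colour": "blue", "shape": "round"},
    {"colour": "blue", "shape": "round"},
]
target = {0, 2, 3}  # indices of the objects forming the set to approximate

for attr in ("colour", "shape"):
    print(attr, rough_accuracy(rows, attr, target))
```

In this toy case `shape` describes the target set exactly while `colour` only approximates it, which is the kind of contrast an attribute-selection criterion can exploit.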
A preliminary study on selected insect communities of Tasik Chini was conducted from 27th May to 2nd June 2004 along three trails, namely the trail to Sg. Gumum, the trail to Kampung Melai and the trail to the old tin mining area. A total of eight Malaise traps were installed along the trail to Sg. Gumum, while sweep nets and 10 yellow pan traps per trail were used to sample insects along the other two trails. A total of 502 insect individuals from eight orders and 46 families were collected. Of these, the hymenopterans (ants and wasps) were the most numerous (298 individuals and 11 families), while the Blattaria were the least numerous (six individuals and two families). Of the hymenopterans, the ichneumonids had the most individuals collected (52), followed by evaniid (50) and vespid wasps (41). For the coleopterans, Cleridae were the most collected (26) during this short study, followed by Anthribidae (13). There were 62 Odonata individuals, comprising 9 identified species.
Support vector machine (SVM) is one of the most popular algorithms in machine learning
and data mining. However, its efficiency is usually reduced on imbalanced
datasets. To improve the performance of SVM for binary imbalanced datasets, a new scheme
based on oversampling and a hybrid algorithm was introduced. Besides the use of a
single kernel function, SVM was applied with multiple kernel learning (MKL). A weighted
linear combination was defined based on the linear kernel function, radial basis function
(RBF kernel), and sigmoid kernel function for MKL. By generating the synthetic samples
in the minority class, searching the best choices of the SVM parameters and identifying
the weights of MKL by minimizing the objective function, the improved performance of
SVM was observed. To demonstrate the strength of the proposed scheme, an experimental study
including noisy borderline and real imbalanced datasets was conducted. SVM was applied
with linear kernel function, RBF kernel, sigmoid kernel function and MKL on all datasets.
The performance of SVM with all kernel functions was evaluated by using sensitivity,
G Mean, and F measure. A significantly improved performance of SVM with MKL was
observed by applying the proposed scheme.
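The weighted linear combination of kernels and the three evaluation measures named in this abstract can be sketched as follows. This is a minimal illustration and not the authors' scheme: the kernel parameters (`gamma`, `coef0`), the weight normalisation and all function names are assumptions, and the hybrid search that would tune the weights is not shown.

```python
import numpy as np


def linear_kernel(X, Y):
    return X @ Y.T


def rbf_kernel(X, Y, gamma=0.5):
    # Squared Euclidean distances between every row of X and every row of Y.
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * (X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))


def sigmoid_kernel(X, Y, gamma=0.1, coef0=0.0):
    return np.tanh(gamma * (X @ Y.T) + coef0)


def combined_kernel(X, Y, weights):
    """Weighted sum of the linear, RBF and sigmoid Gram matrices.
    Weights must be non-negative; they are rescaled to sum to one (an assumption)."""
    w = np.asarray(weights, dtype=float)
    if w.shape != (3,) or np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be three non-negative numbers, not all zero")
    w = w / w.sum()
    return (w[0] * linear_kernel(X, Y)
            + w[1] * rbf_kernel(X, Y)
            + w[2] * sigmoid_kernel(X, Y))


def imbalance_metrics(tp, fn, fp, tn):
    """Sensitivity, G-mean and F-measure from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    g_mean = (sensitivity * specificity) ** 0.5
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, g_mean, f_measure
```

The resulting Gram matrix could be passed to any SVM implementation that accepts a precomputed kernel. Note that the sigmoid kernel is not positive semidefinite in general, so the combined matrix is not guaranteed to be a valid Mercer kernel.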
Clustering is one of the primary tools of data mining. It helps
researchers understand the natural grouping of attributes in datasets. Clustering is an
unsupervised classification method whose major aim is partitioning, such that objects in the
same cluster are similar and objects that belong to different clusters differ significantly
with respect to their attributes. However, the classical Standardized Euclidean distance,
which uses the standard deviation to down-weight the maximum points of the ith features on the
distance clusters, has been criticized by many scholars on the grounds that it produces outliers,
lacks robustness, and has a 0% breakdown point. It also has low efficiency under the normal
distribution. Therefore, to remedy the problem, we suggest two statistical estimators
with 50% breakdown points, namely the Sn and Qn estimators, with 58% and 82%
efficiency, respectively. The proposed methods evidently outperformed the existing methods
in down-weighting the maximum points of the ith features in distance-based clustering
analysis.
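The idea of replacing the standard deviation with a robust scale in the standardized Euclidean distance can be sketched as below. The estimators follow the commonly cited Rousseeuw and Croux definitions with consistency constants 1.1926 (Sn) and 2.2219 (Qn). This is a naive O(n²) version without finite-sample corrections, the function names are hypothetical, and the exact weighting used in the paper is an assumption.

```python
import numpy as np


def sn_scale(x):
    """Naive Sn: 1.1926 * lomed_i( himed_j |x_i - x_j| )."""
    x = np.asarray(x, dtype=float)
    n = x.size
    diffs = np.abs(x[:, None] - x[None, :])        # n x n absolute differences
    inner = np.sort(diffs, axis=1)[:, n // 2]      # high median over j, per i
    return 1.1926 * np.sort(inner)[(n - 1) // 2]   # low median over i


def qn_scale(x):
    """Naive Qn: 2.2219 * k-th smallest |x_i - x_j| (i < j), k = C(h, 2), h = n // 2 + 1."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = n // 2 + 1
    k = h * (h - 1) // 2
    i, j = np.triu_indices(n, k=1)
    return 2.2219 * np.sort(np.abs(x[i] - x[j]))[k - 1]


def standardized_euclidean(a, b, scale):
    """Euclidean distance after dividing each feature difference by its scale."""
    a, b, scale = (np.asarray(v, dtype=float) for v in (a, b, scale))
    return float(np.sqrt((((a - b) / scale) ** 2).sum()))


# One feature with a single gross outlier: the standard deviation is inflated,
# while Sn and Qn stay close to the spread of the bulk of the data.
feature = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)
print(np.std(feature), sn_scale(feature), qn_scale(feature))
```

Passing a vector of per-feature `sn_scale` or `qn_scale` values as `scale` gives a robust variant of the distance, so one extreme value no longer shrinks every distance along that feature.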
In an era of electronics, recovering precious metals such as gold from ever-increasing piles of electronic waste and metal-ion-infested soil has become one of the prime concerns for researchers worldwide. Biological mining is an attractive, economical and non-hazardous way to recover gold from low-grade auriferous ore-containing waste or soil. This review presents the recent major biological methods used to bio-mine gold. The biomining methods discussed in this review include bioleaching, bio-oxidation, bio-precipitation, bio-flotation, bio-flocculation, bio-sorption, bio-reduction, bio-electrometallurgical technologies and bioaccumulation. The mechanism of gold biorecovery by microbes is explained in detail to explore the intracellular mechanisms that help them withstand high concentrations of gold without fatal consequences. Major challenges and future opportunities associated with each method, and how they will dictate the fate of gold bio-metallurgy from metal wastes or bioremediation of metal-infested soil, are also discussed. With the help of concurrent advancements in high-throughput technologies, gold bio-exploratory methods will speed up progress towards maximum gold retrieval from such low-grade ore-containing sources, while keeping gold mining clean and more sustainable.
The internet of reality, or augmented reality, has been considered a breakthrough and an outstanding critical mutation with an emphasis on data mining, leading to the dismantling of some of its assumptions among several of its stakeholders. In this work, we study the pillars of these technologies connected to web usage as the Internet of things (IoT) system's healthcare infrastructure. We used several data mining techniques to evaluate the online advertisement data set, which can be categorized as a high-dimensional (1,553 attributes) and imbalanced data set and which automatically simulates an IoT discrimination problem. The proposed methodology applies Fisher linear discriminant analysis (FLDA) and quadratic discriminant analysis (QDA) within random projection (RP) filters and compares their runtime and accuracy with those of support vector machine (SVM), K-nearest neighbor (KNN), and multilayer perceptron (MLP) in IoT-based systems. Finally, the impact of the number of projections was tested experimentally, and the sensitivity of both FLDA and QDA with regard to precision and runtime was found to be challenging. The modeling results show not only improved accuracy but also runtime improvements. Compared with SVM, KNN, and MLP, runtime with QDA and FLDA is shortened by a factor of 20 on our chosen data set simulated for a healthcare framework. The RP filtering in the preprocessing stage of the attribute selection, fulfilling the model's runtime, is a standpoint in the IoT industry. Index Terms: Data Mining, Random Projection, Fisher Linear Discriminant Analysis, Online Advertisement Dataset, Quadratic Discriminant Analysis, Feature Selection, Internet of Things.
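The pipeline described here, a random projection filter followed by a linear discriminant, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the data are synthetic stand-ins (not the online advertisement data set), the projection is a dense Gaussian matrix, the discriminant is a two-class Fisher rule with equal priors and a small ridge term, and all names and sizes are hypothetical.

```python
import numpy as np


def random_projection(X, k, rng):
    """Project d-dimensional rows onto k dimensions with a Gaussian random matrix."""
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R


def fit_fisher_lda(X, y, ridge=1e-6):
    """Two-class Fisher discriminant: w = Sw^{-1} (m1 - m0), midpoint threshold."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)  # within-class scatter
    Sw += ridge * np.eye(X.shape[1])                         # keep Sw invertible
    w = np.linalg.solve(Sw, m1 - m0)
    threshold = (w @ (m0 + m1)) / 2.0
    return w, threshold


def predict(X, w, threshold):
    return (X @ w > threshold).astype(int)


# Synthetic high-dimensional data: two Gaussian classes with shifted means.
rng = np.random.default_rng(0)
n, d, k = 200, 500, 20
X = rng.normal(size=(2 * n, d))
y = np.repeat([0, 1], n)
X[y == 1] += 1.5

Z = random_projection(X, k, rng)          # 500 features reduced to 20
w, t = fit_fisher_lda(Z, y)
accuracy = (predict(Z, w, t) == y).mean()
print(Z.shape, accuracy)
```

Because the discriminant is fitted on `k` rather than `d` features, the scatter matrix that has to be inverted shrinks from d x d to k x k, which is where the runtime saving comes from.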
Coronary artery disease (CAD) is an important cause of mortality across the globe. Early risk prediction of CAD could reduce the death rate by allowing early and targeted treatments. In healthcare, some studies have applied data mining techniques and machine learning algorithms to the risk prediction of CAD using patient data collected by hospitals and medical centers. However, most of these studies used all the attributes in the datasets, which might reduce the performance of prediction models due to data redundancy. The objective of this research is to identify significant features to build models for predicting the risk level of patients with CAD. In this research, significant features were selected using three methods (i.e., Chi-squared test, recursive feature elimination, and Embedded Decision Tree). The Synthetic Minority Over-sampling Technique (SMOTE) was implemented to address the imbalanced dataset issue. The prediction models were built based on the identified significant features and eight machine learning algorithms, utilizing Acute Coronary Syndrome (ACS) datasets provided by the National Cardiovascular Disease Database (NCVD) Malaysia. The prediction models were evaluated and compared using six performance evaluation metrics, and the top-performing models achieved an AUC of more than 90%.
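The oversampling step can be illustrated with a bare-bones version of the SMOTE idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified sketch, not the implementation used in the study. The function name, the default `k`, and the uniform choice of base samples are assumptions.

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances; the diagonal is masked so a point
    # is never selected as its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                     # base sample per new point
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour of each base
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.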
Thiocyanate (SCN⁻) is a contaminant requiring remediation in gold mine tailings and wastewaters globally. Seepage of SCN⁻-contaminated waters into aquifers can occur from unlined or structurally compromised mine tailings storage facilities. A wide variety of microorganisms are known to be capable of biodegrading SCN⁻; however, little is known regarding the potential of native microbes for in situ SCN⁻ biodegradation, a remediation option that is less costly than engineered approaches. Here we experimentally characterize the principal biogeochemical barrier to SCN⁻ biodegradation for an autotrophic microbial consortium enriched from mine tailings, to arrive at an environmentally realistic assessment of in situ SCN⁻ biodegradation potential. Upon amendment with phosphate, the consortium completely degraded up to ∼10 mM SCN⁻ to ammonium and sulfate, with some evidence of nitrification of the ammonium to nitrate. Although similarly enriched in known SCN⁻-degrading strains of thiobacilli, this consortium differed in its source (mine tailings) and metabolism (autotrophy) from those of previous studies. Our results provide a proof of concept that phosphate limitation may be the principal barrier to in situ SCN⁻ biodegradation in mine tailing waters and also yield new insights into the microbial ecology of in situ SCN⁻ bioremediation involving autotrophic sulfur-oxidizing bacteria.
Data mining processes such as clustering, classification, regression and outlier detection are developed based on the similarity between two objects. Data mining of categorical data is found to be the most challenging. Earlier similarity measures are context-free. In recent years, researchers have come up with context-sensitive similarity measures based on the relationships of objects. This paper provides an in-depth review of context-based similarity measures. The algorithms of four context-based similarity measures, namely the Association-based similarity measure, DILCA, CBDL and the hybrid context-based similarity measure, are described. The advantages and limitations of each context-based similarity measure are identified and explained. Context-based similarity measures are highly recommended for data mining tasks on categorical data. The findings of this paper will help data miners choose appropriate similarity measures to achieve more accurate classification or clustering results.
Mosquito-borne diseases are emerging and re-emerging across the globe, especially after the COVID-19 pandemic. Recent advances in text mining for infectious diseases hold the potential to provide timely access to explicit and implicit associations among information in text. In the past few years, the availability of online text data in the form of unstructured or semi-structured text with rich information content from this domain has enabled many studies to provide solutions in this area, e.g., disease-related knowledge discovery, disease surveillance, and early detection systems. However, to the best of our knowledge, no recent review of text mining in the domain of mosquito-borne diseases was available. In this review, we survey recent works on the text mining techniques used in combating mosquito-borne diseases. We highlight the corpus sources, technologies, applications, and challenges faced by the studies, followed by possible future directions for this domain. We present a bibliometric analysis of the 294 scientific articles published in Scopus and PubMed in the domain of text mining in mosquito-borne diseases from 2016 to 2021. The papers were further filtered and reviewed based on the techniques used to analyze the text related to mosquito-borne diseases. Of the corpus of 158 selected articles, we found that 27 were relevant and used text mining in mosquito-borne diseases. These articles mainly covered Zika (38.70%), Dengue (32.26%), and Malaria (29.03%), with extremely few or no articles on other crucial mosquito-borne diseases such as chikungunya, yellow fever, and West Nile fever. Twitter was the dominant corpus resource for text mining in mosquito-borne diseases, followed by the PubMed and LexisNexis databases.
Sentiment analysis was the most popular text mining technique for understanding the discourse on the disease, followed by information extraction, which used dependency relation and co-occurrence-based approaches to extract relations and events. Surveillance was the main use in most of the reviewed studies, followed by treatment, which focused on drug-disease or symptom-disease associations. Advances in text mining could improve the management of mosquito-borne diseases. However, the techniques and applications pose many limitations and challenges, including biases related to user authentication and language, real-world implementation, etc. We discuss future directions that can be useful for expanding this area and domain. This review mainly contributes a library of text mining work on mosquito-borne diseases, and the approach could be further explored for other neglected diseases.
To generate an interpretable deep architecture for identifying deep intrusion patterns, this study proposes an approach that combines ANFIS (Adaptive Network-based Fuzzy Inference System) and DT (Decision Tree). To improve the efficiency of training and prediction, Pearson correlation analysis, standard deviation, and a new adaptive K-means are used to select attributes and make fuzzy interval decisions. The proposed algorithm was trained, validated, and tested on the NSL-KDD (National security lab-knowledge discovery and data mining) dataset. Using 22 attributes that are highly related to the target, the proposed method achieves a 99.86% detection rate and a 0.14% false alarm rate on the KDDTrain+ dataset and a 77.46% detection rate on the KDDTest+ dataset, which is better than many classifiers. In addition, the interpretable model can help demonstrate the complex and overlapping patterns of intrusions and analyze the patterns of various intrusions.
Land exploitation for the mining sector may leave a series of environmental impacts on our ecosystem if not appropriately managed. Therefore, the present study attempts to evaluate the various environmental aspects of abandoned metal mining, including former iron ore, bauxite, and tin mining lands, in view of their hydrogeochemical behavior. Mine-impacted waters and sediments were obtained from former mining ponds, mine tailings, and impacted streams for interpretation of aqueous and sediment geochemistry, major and trace elements, hydrochemical facies, chemical weathering rate and CO2 consumption, and water quality classification. Results indicated that the environmental impact of the long-abandoned iron ore mine was still evident, with some high concentrations of metals and acidic pH. Higher concentrations of Fe and Mn in water were noticeable in some areas, while other trace elements (Pb, Zn, As, Cd, Cr, and Cu) were found below the recommended guideline values. Sediment quality reflected the trend of water quality variables mainly associated with metal(loid) elements, resulting in a potential ecological risk classified as low to moderate. There were variations in the hydrochemical facies of the waters, suggesting the influence of minerals in water. The chemical weathering rate suggests that the contribution of carbonate mineral weathering was more important (up to 60%) than that of silicate weathering. The resulting CO2 consumption by mineral weathering was estimated to be in the range of 1.7-9.8 × 10^7 mol/year (former bauxite and tin mining areas can act as temporary sinks for CO2). Water quality classifications according to several chemical indices (Kelly's ratio, sodium absorption ratio, soluble sodium percentage, residual sodium carbonate, magnesium absorption ratio, and permeability index) were also discussed with regard to mine water reuse for irrigation purposes.
The findings suggest that a holistic approach that integrates all important hydrogeochemical aspects is essential for a thorough evaluation of the implications of medium- to long-term mining exploitation for its surrounding ecosystems. This would be beneficial in light of the restoration potential of degraded mining land as well as for future mitigation strategies in the mining sector.
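Several of the irrigation indices listed in this abstract have widely used textbook formulas, sketched below for concentrations expressed in meq/L. The abstract does not state which formulations the authors used, so the definitions here are assumptions. The sodium index is implemented under its common name, sodium adsorption ratio. Soluble sodium percentage and permeability index are omitted because their definitions vary between sources. The function name and the sample values are hypothetical.

```python
from math import sqrt


def irrigation_indices(na, ca, mg, hco3, co3=0.0):
    """Common irrigation suitability indices; all inputs are ion
    concentrations in meq/L (assumed textbook definitions)."""
    return {
        # SAR = Na / sqrt((Ca + Mg) / 2)
        "sodium_adsorption_ratio": na / sqrt((ca + mg) / 2.0),
        # Kelly's ratio = Na / (Ca + Mg)
        "kelly_ratio": na / (ca + mg),
        # RSC = (CO3 + HCO3) - (Ca + Mg)
        "residual_sodium_carbonate": (co3 + hco3) - (ca + mg),
        # Magnesium ratio = 100 * Mg / (Ca + Mg)
        "magnesium_ratio_percent": 100.0 * mg / (ca + mg),
    }


# Hypothetical water sample (meq/L), for illustration only.
sample = irrigation_indices(na=4.0, ca=3.0, mg=1.0, hco3=2.0)
for name, value in sample.items():
    print(f"{name}: {value:.3f}")
```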
Diagnosing diabetes early is critical as it helps patients live with the disease in a healthy way - through healthy eating, taking appropriate medical doses, and making patients more vigilant in their movements/activities to avoid wounds that are difficult to heal for diabetic patients. Data mining techniques are typically used to detect diabetes with high confidence to avoid misdiagnoses with other chronic diseases whose symptoms are similar to diabetes. Hidden Naïve Bayes is one of the algorithms for classification, which works under a data-mining model based on the assumption of conditional independence of the traditional Naïve Bayes. The results from this research study, which was conducted on the Pima Indian Diabetes (PID) dataset collection, show that the prediction accuracy of the HNB classifier achieved 82%. As a result, the discretization method increases the performance and accuracy of the HNB classifier.
Recent decades have witnessed a surge in research interest in bio-nanocomposite-based packaging materials, but still, a lack of systematic analysis exists in this domain. Bio-based packaging materials pose a sustainable alternative to petroleum-based packaging materials. The current work employs bibliometric analysis to deliver a comprehensive outline on the role of bio nanocomposites in packaging. India, Iran, and China were revealed to be the top three nations actively engaged in this domain in total publications. Islamic Azad University in Iran and Universiti Putra Malaysia in Malaysia are among the world's best institutions in active research and publications in this field. The extensive collaboration between nations and institutions highlights the significance of a holistic approach towards bio-nanocomposite. The National Natural Science Foundation of China is the leading funding body in this field of research. Among authors, Jong whan Rhim secured the topmost citations (2234) in this domain (13 publications). Among journals, Carbohydrate Polymers secured the maximum citation count (4629) from 36 articles; the initial one was published in 2011. Bio nanocomposite is the most frequently used keyword. Researchers and policymakers focussing on sustainable packaging solutions will gain crucial insights on the current research status on packaging solutions using bio-nanocomposites from the conclusions.
Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified as low and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of the newly proposed similarity or distance measures with traditional ones.
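A benchmark of this kind rests on swapping the distance function inside an otherwise fixed clustering step. The sketch below shows four common measures and a nearest-centroid assignment that takes the measure as a parameter. The choice of measures, the function names and the toy points are assumptions and do not reproduce the study's framework.

```python
import numpy as np


def euclidean(a, b):
    return float(np.sqrt(((a - b) ** 2).sum()))


def manhattan(a, b):
    return float(np.abs(a - b).sum())


def chebyshev(a, b):
    return float(np.abs(a - b).max())


def cosine_distance(a, b):
    # Undefined for zero vectors; callers are assumed to avoid them.
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))


MEASURES = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
    "cosine": cosine_distance,
}


def assign_to_nearest(X, centroids, measure):
    """Assignment step of a distance-based clustering algorithm:
    label each row of X with the index of its closest centroid."""
    return np.array([
        min(range(len(centroids)), key=lambda c: measure(x, centroids[c]))
        for x in X
    ])


# Toy points and centroids (illustration only).
X = np.array([[2.0, 0.1], [0.1, 3.0]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
for name, measure in MEASURES.items():
    print(name, assign_to_nearest(X, centroids, measure))
```

Keeping the measure as a plain function argument means a newly proposed distance can be added to `MEASURES` and evaluated against the same datasets without touching the clustering code.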
This paper presents a novel feature mining approach for documents that cannot be mined via optical character recognition (OCR). By identifying the intimate relationship between the text and graphical components, the proposed technique extracts the Start, End, and Exact values for each bar. Furthermore, the word 2-gram and Euclidean distance methods are used to accurately detect and determine plagiarism in bar charts.
Thermal structure and water quality in a large and shallow lake in Malaysia were studied between January 2012 and June 2013 in order to understand variations in relation to water level fluctuations and in-stream mining activities. Environmental variables, namely temperature, turbidity, dissolved oxygen, pH, electrical conductivity, chlorophyll-a and transparency, were measured using a multi-parameter probe and a Secchi disk. Measurements of environmental variables were performed at 0.1 m intervals from the surface to the bottom of the lake during the dry and wet seasons. High water level and strong solar radiation increased temperature stratification. River discharges during the wet season and unsustainable sand mining activities led to increased turbidity exceeding 100 NTU and reduced transparency, which changed the temperature variation and subsequently altered the water quality pattern.