Clustering texts is an essential task in data mining and information retrieval, whose aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, short text clustering (STC) is difficult because short texts are typically sparse, ambiguous, noisy, and lacking in information. One of the challenges for STC is finding a proper representation for short text documents that yields cohesive clusters. Typically, however, STC relies on a single-view representation for clustering. A single-view representation is inadequate for representing text because it cannot capture the different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), obtained by finding the best combination of different single-view representations, to enhance STC. We explore different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by fixed-length concatenation via the principal component analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of the various MVRs on STC. Based on the experimental results, the best combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We therefore conclude that MVR improves the performance of STC; however, designing an MVR requires careful selection of the single-view representations.
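As a minimal sketch (not the authors' code), the pipeline above could be approximated by projecting each precomputed single-view embedding matrix to a fixed length with PCA, concatenating the projections, and clustering the result; the embedding matrices, dimensions, and cluster count below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def build_mvr(views, n_components=50):
    """Project each single-view matrix (n_docs x dim) to a fixed length
    with PCA and concatenate the projections into one multi-view matrix."""
    reduced = []
    for X in views:
        k = min(n_components, X.shape[0], X.shape[1])
        reduced.append(PCA(n_components=k).fit_transform(X))
    return normalize(np.hstack(reduced))

# Hypothetical precomputed single-view representations for 1000 short texts
# (stand-ins for BERT, GPT, TF-IDF, FastText, and GloVe embeddings).
rng = np.random.default_rng(0)
views = [rng.normal(size=(1000, d)) for d in (768, 768, 5000, 300, 300)]
mvr = build_mvr(views)
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(mvr)
print(np.bincount(labels))
```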
Crude oil prices play a significant role in the global economy and are a key input into option pricing formulas, portfolio allocation, and risk measurement. In this paper, a hybrid model integrating wavelet analysis and multiple linear regression (MLR) is proposed for crude oil price forecasting. In this model, the Mallat wavelet transform is first used to decompose the original time series into several subseries at different scales. Then, principal component analysis (PCA) is used to process the subseries data for the MLR forecast of crude oil prices. Particle swarm optimization (PSO) is used to select the optimal parameters of the MLR model. To assess the effectiveness of this model, the daily West Texas Intermediate (WTI) crude oil market is used as the case study. The time series forecasting performance of the WMLR model is compared with that of the MLR, ARIMA, and GARCH models using various statistical measures. The experimental results show that the proposed model outperforms the individual models in forecasting the crude oil price series.
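A rough sketch of the wavelet-plus-regression idea, assuming PyWavelets for the discrete wavelet decomposition and omitting the PSO parameter search; the synthetic price series, window length, and wavelet choice are illustrative only, not the WTI data or the authors' WMLR model.

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def wavelet_features(series, wavelet="db4", level=3, window=64):
    """Decompose each rolling window into wavelet subseries and flatten
    the coefficients into one feature vector; the target is the next price."""
    X, y = [], []
    for t in range(window, len(series) - 1):
        coeffs = pywt.wavedec(series[t - window:t], wavelet, level=level)
        X.append(np.concatenate(coeffs))
        y.append(series[t + 1])
    return np.array(X), np.array(y)

# Hypothetical daily price series; a real study would load the WTI data here.
prices = 60.0 + np.cumsum(np.random.default_rng(1).normal(0, 1, 2000))
X, y = wavelet_features(prices)
X_pca = PCA(n_components=10).fit_transform(X)        # compress the subseries features
model = LinearRegression().fit(X_pca[:-250], y[:-250])
rmse = np.sqrt(np.mean((model.predict(X_pca[-250:]) - y[-250:]) ** 2))
print("out-of-sample RMSE:", round(rmse, 3))
```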
The location model proposed in the past is a predictive discriminant rule that can classify new observations into one of two predefined groups based on mixtures of continuous and categorical variables. The ability of the location model to classify new observations correctly is highly dependent on the number of multinomial cells created by the number of categorical variables. This study conducts a preliminary investigation showing that the location model based on maximum likelihood estimation has a high misclassification rate, up to 45% on average, when dealing with more than six categorical variables, for all 36 datasets tested. Such a model gives highly inaccurate predictions, as it performs badly for large numbers of categorical variables even with a large sample size. To alleviate the high misclassification rate, a new strategy is embedded in the discriminant rule by introducing nonlinear principal component analysis (NPCA) into the classical location model (cLM), mainly to handle the large number of categorical variables. This new strategy is investigated on simulated and real datasets through estimation of the misclassification rate using the leave-one-out method. The results of the numerical investigations demonstrate the feasibility of the proposed model, as the misclassification rate is dramatically lower than that of the cLM for all 18 different data settings. A practical application to a real dataset demonstrates a significant improvement and yields results comparable to the best of the methods compared. The overall findings reveal that the proposed model extends the applicability range of the location model, which was previously limited to six categorical variables for acceptable performance. This study shows that the proposed model, with its new discrimination procedure, can be used as an alternative for mixed-variable classification problems, particularly when facing a large number of categorical variables.
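The leave-one-out misclassification estimation described above could look roughly like the sketch below, with PCA on the binary indicators as a crude stand-in for NPCA and a linear discriminant rule in place of the location model; the data are invented and the NPCA step is not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

# Hypothetical mixed data: 3 continuous variables and 10 binary variables, 2 groups.
rng = np.random.default_rng(0)
n = 120
X_cont = rng.normal(size=(n, 3))
X_bin = rng.integers(0, 2, size=(n, 10)).astype(float)
y = (X_cont[:, 0] + X_bin[:, 0] + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

# Reduce the categorical block before classification (stand-in for NPCA),
# fitted once on the full data to keep the sketch short.
X_red = np.hstack([X_cont, PCA(n_components=4).fit_transform(X_bin)])

errors = 0
for train, test in LeaveOneOut().split(X_red):
    clf = LinearDiscriminantAnalysis().fit(X_red[train], y[train])
    errors += int(clf.predict(X_red[test])[0] != y[test][0])
print("leave-one-out misclassification rate:", errors / n)
```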
A study was carried out to determine the concentrations of dissolved heavy metals in the Sungai Semenyih and to use environmetric methods to evaluate the influence of different pollution sources on heavy metal concentrations. Cluster analysis (CA) classified the eight sampling stations into two clusters based on the similarity of their characteristics: cluster 1 included stations 1, 2, 3, and 4 (low pollution area), whereas cluster 2 comprised stations 5, 6, 7, and 8 (high pollution area). Principal component analysis (PCA) of the two datasets yielded two factors for the low pollution area and three factors for the high pollution area at eigenvalues > 1, representing 92.544% and 100% of the total variance in each heavy metal dataset, and allowed selected heavy metals to be grouped according to anthropogenic and lithologic sources of contamination.
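A hedged sketch of the CA-then-PCA workflow with the eigenvalue > 1 (Kaiser) criterion, using made-up station-by-metal data rather than the Sungai Semenyih measurements; the station grouping and variance figures will not match the study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix: rows = 8 sampling stations, columns = mean metal concentrations.
rng = np.random.default_rng(0)
metals = np.vstack([rng.normal(1.0, 0.2, (4, 6)),    # stations 1-4 (low pollution)
                    rng.normal(3.0, 0.5, (4, 6))])   # stations 5-8 (high pollution)

# Cluster analysis on the standardized station profiles.
Z = linkage(StandardScaler().fit_transform(metals), method="ward")
clusters = fcluster(Z, t=2, criterion="maxclust")

# PCA within each cluster; retain components with eigenvalue > 1 (Kaiser criterion).
for c in (1, 2):
    X = StandardScaler().fit_transform(metals[clusters == c])
    pca = PCA().fit(X)
    keep = pca.explained_variance_ > 1
    print(f"cluster {c}: {keep.sum()} factors, "
          f"{100 * pca.explained_variance_ratio_[keep].sum():.1f}% of total variance")
```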
Statistical classification remains the most useful statistical tool for forensic chemists to assess the relationships between samples. Many clustering techniques, such as principal component analysis and hierarchical cluster analysis, have been employed to analyze chemical data for pattern recognition. Because novice drug chemists often have a weak foundation in such statistics, a tetrahedron method was designed to simulate how advanced chemometrics operates. In this paper, the development of the graphical tetrahedron and the computational matrices derived from the possible tetrahedrons are discussed. The tetrahedron method was applied to four selected parameters obtained from nine illicit heroin samples. Pattern analysis and mathematical computation of the differences in area for assessing the dissimilarity between the nine tetrahedrons were found to be user-friendly and straightforward for novice cluster analysts.
This study investigates the latent pollution sources and the most significant parameters causing spatial variation, and develops the best input set for water quality modelling using principal component analysis (PCA) and an artificial neural network (ANN). The dataset, comprising 22 water quality parameters, was obtained from the Department of Environment Malaysia (DOE). The PCA generated six significant principal component scores (PCs) which explained 65.40% of the total variance. The parameters driving water quality variation are mainly related to mineral components, anthropogenic activities, and natural processes. In the ANN, three input combination models (ANN A, B, and C) were developed to identify the best model that can predict the water quality index (WQI) with very high precision. The ANN A model had the best prediction capacity, with a coefficient of determination (R2) of 0.9999 and a root mean square error (RMSE) of 0.0537. These results show that the PCA and ANN methods can be applied as decision-making and problem-solving tools for better management of river quality.
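A minimal sketch of feeding principal component scores into an ANN and reporting R2 and RMSE, assuming scikit-learn's MLPRegressor and synthetic data in place of the DOE dataset; the number of components and network size are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 500 samples x 22 water quality parameters and a WQI target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 22))
wqi = 60 + 10 * X[:, :6].sum(axis=1) + rng.normal(scale=2, size=500)

# PCA on the standardized parameters; six principal component scores become the ANN inputs.
scores = PCA(n_components=6).fit_transform(StandardScaler().fit_transform(X))
X_tr, X_te, y_tr, y_te = train_test_split(scores, wqi, test_size=0.3, random_state=0)
ann = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0).fit(X_tr, y_tr)
pred = ann.predict(X_te)
print("R2 =", round(r2_score(y_te, pred), 4),
      "RMSE =", round(float(np.sqrt(np.mean((pred - y_te) ** 2))), 4))
```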
Recent advances in imaging technologies, such as intra-oral surface scanning, have rapidly generated large datasets of high-resolution three-dimensional (3D) sample reconstructions. These datasets contain a wealth of phenotypic information that can provide an understanding of morphological variation and evolution. The geometric morphometric method (GMM), with landmarks and the development of sliding and surface semilandmark techniques, has greatly enhanced the quantification of shape. This study aimed to determine whether there are significant differences in 3D palatal rugae shape between siblings. Digital casts representing 25 pairs of full siblings from each group, male-male (MM), female-female (FF), and female-male (FM), were digitized and transferred to a geometric morphometric system. The palatal rugae were determined, quantified, and visualized using GMM computational tools with MorphoJ software (University of Manchester). Principal component analysis (PCA) and canonical variates analysis (CVA) were employed to analyze palatal rugae shape variability and to distinguish between sibling groups based on shape. Additionally, regression analysis examined the potential impact of size on palatal rugae shape. The study revealed that the first nine principal components accounted for 71.3% of the variation in palatal rugae shape. In addition, the size of the palatal rugae had a negligible impact on their shape. Whilst palatal rugae are known for their individuality, it is noteworthy that three palatal rugae (right first, right second, and left third) can differentiate sibling groups, which may be attributed to genetics. Therefore, it is suggested that palatal rugae morphology can serve as a means of forensic identification for siblings.
The motivation behind this research is to combine innovative methods, namely wavelet analysis, principal component analysis (PCA), and artificial neural network (ANN) approaches, to analyze trading in today's increasingly difficult and volatile financial futures markets. The main focus of this study is to facilitate forecasting by using an enhanced denoising process on market data, taken as a multivariate signal, in order to remove the common noise from the open-high-low-close signal of a market. This research offers evidence on the predictive ability and the profitability of abnormal returns of a new hybrid forecasting model using Wavelet-PCA denoising and ANN (named WPCA-NN) on futures contracts of Hong Kong's Hang Seng futures, Japan's NIKKEI 225 futures, Singapore's MSCI futures, South Korea's KOSPI 200 futures, and Taiwan's TAIEX futures from 2005 to 2014. Using a host of technical analysis indicators consisting of RSI, MACD, MACD Signal, Stochastic Fast %K, Stochastic Slow %K, Stochastic %D, and Ultimate Oscillator, the empirical results show that the annual mean returns of WPCA-NN exceed those of the benchmark buy-and-hold strategy for the validation, test, and evaluation periods; this is inconsistent with the traditional random walk hypothesis, which insists that mechanical rules cannot outperform buy-and-hold. The findings, however, are consistent with the literature that advocates technical analysis.
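One plausible reading of Wavelet-PCA denoising of an open-high-low-close signal is sketched below (not the authors' WPCA-NN): each channel is wavelet-decomposed, and PCA across the four channels keeps only the shared component of each detail band before reconstruction. The wavelet, decomposition level, and data are assumptions.

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA

def wpca_denoise(ohlc, wavelet="db4", level=2, keep=1):
    """Decompose each OHLC column with a discrete wavelet transform, then use
    PCA across the four channels of each detail band to keep only the shared
    component(s) and discard channel-specific noise before reconstruction."""
    n, m = ohlc.shape
    coeffs = [pywt.wavedec(ohlc[:, j], wavelet, level=level) for j in range(m)]
    out = []
    for j in range(m):
        new_c = [coeffs[j][0]]                      # leave the approximation band intact
        for band in range(1, level + 1):
            stacked = np.column_stack([coeffs[i][band] for i in range(m)])
            pca = PCA(n_components=keep).fit(stacked)
            new_c.append(pca.inverse_transform(pca.transform(stacked))[:, j])
        out.append(pywt.waverec(new_c, wavelet)[:n])
    return np.column_stack(out)

# Hypothetical open-high-low-close series standing in for a futures contract.
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 1, 512))
ohlc = np.column_stack([close + rng.normal(0, 0.3, 512) for _ in range(4)])
print(wpca_denoise(ohlc).shape)
```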
Visible and near infrared spectroscopy is a non-destructive, green, and rapid technology that can be used to estimate the components of interest without conditioning the sample, in contrast to classical analytical methods. The objective of this paper is to compare the performance of an artificial neural network (ANN) (a nonlinear model) and principal component regression (PCR) (a linear model) based on visible and shortwave near infrared (VIS-SWNIR) (400-1000 nm) spectra for the non-destructive measurement of the soluble solids content of apples. First, we used multiplicative scatter correction to pre-process the spectral data. Second, PCR was applied to estimate the optimal number of input variables. Third, that optimal number of input variables was used as the input to both multiple linear regression and ANN models. The initial weights and the number of hidden neurons were adjusted to optimize the performance of the ANN. The findings suggest that the predictive performance of the ANN with two hidden neurons outperforms that of PCR.
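A compact sketch of the multiplicative scatter correction pre-processing step followed by principal component regression, with synthetic spectra standing in for the VIS-SWNIR apple data; the number of retained components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def msc(spectra):
    """Multiplicative scatter correction: regress each spectrum on the mean
    spectrum and remove the fitted offset and slope."""
    ref = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, offset = np.polyfit(ref, s, 1)
        corrected[i] = (s - offset) / slope
    return corrected

# Hypothetical VIS-SWNIR spectra (400-1000 nm) and soluble solids content values.
rng = np.random.default_rng(0)
spectra = rng.normal(scale=0.05, size=(120, 600)) + np.linspace(0.2, 1.0, 600)
ssc = 10 + 20 * spectra[:, 100] + rng.normal(scale=0.3, size=120)

pcr = make_pipeline(PCA(n_components=8), LinearRegression())   # principal component regression
pcr.fit(msc(spectra[:90]), ssc[:90])
print("held-out predictions:", np.round(pcr.predict(msc(spectra[90:]))[:5], 2))
```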
A combined Support Vector Machine (SVM) and Principal Component Analysis (PCA) approach was used to recognize infant cries with asphyxia. An SVM classifier based on features selected by PCA was trained to differentiate between pathological and healthy cries. PCA was applied to reduce the dimensionality of the feature vectors that serve as inputs to the SVM. The performance of the SVM with linear and RBF kernels was examined. Experimental results showed that the SVM with an RBF kernel yields good performance. The classification accuracy in identifying infant cries with asphyxia using SVM-PCA is 95.86%.
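The SVM-PCA pipeline could be sketched as below, assuming scikit-learn and hypothetical acoustic feature vectors rather than the actual cry recordings; the PCA dimension and RBF hyperparameters are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical acoustic feature vectors (e.g. MFCC statistics) for cry recordings:
# label 1 = asphyxia, 0 = healthy.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 60))
y = rng.integers(0, 2, size=300)
X[y == 1] += 0.8   # separate the classes slightly so the example is non-trivial

clf = make_pipeline(StandardScaler(), PCA(n_components=20),
                    SVC(kernel="rbf", C=10, gamma="scale"))
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```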
An improved classification of Orthosiphon stamineus using a data fusion technique is presented. Five different commercial sources, along with freshly prepared samples, were discriminated using an electronic nose (e-nose) and an electronic tongue (e-tongue). Samples from the different commercial brands were evaluated by the e-tongue and then by the e-nose. When Principal Component Analysis (PCA) was applied separately to the e-tongue and e-nose data, only five distinct groups were projected. However, by employing a low-level data fusion technique, six distinct groupings were achieved. Hence, this technique can enhance the ability of PCA to analyze the complex samples of Orthosiphon stamineus. Linear Discriminant Analysis (LDA) was then used to further validate and classify the samples. The LDA performance was also improved when the responses from the e-nose and e-tongue were fused together.
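A hedged sketch of low-level data fusion: the e-nose and e-tongue blocks are standardized, concatenated, projected with PCA, and then classified with LDA. The sensor counts, replicate numbers, and group structure below are assumptions, not the study's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Hypothetical responses: 6 sample groups x 20 replicates, 32 e-nose and 7 e-tongue sensors.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(6), 20)
e_nose = rng.normal(size=(120, 32)) + y[:, None] * 0.3
e_tongue = rng.normal(size=(120, 7)) + (y[:, None] % 3) * 0.5

# Low-level data fusion: concatenate the standardized sensor blocks, then project with PCA.
fused = np.hstack([StandardScaler().fit_transform(e_nose),
                   StandardScaler().fit_transform(e_tongue)])
scores = PCA(n_components=5).fit_transform(fused)
lda = LinearDiscriminantAnalysis()
print("LDA accuracy on fused PCA scores:", cross_val_score(lda, scores, y, cv=5).mean())
```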
Color is one of the most prominent features of an image and is used in many skin and face detection applications. Color space transformation is widely used by researchers to improve face and skin detection performance. Despite substantial research efforts in this area, choosing a color space that performs well for skin and face classification while addressing issues such as illumination variation, varying camera characteristics, and diversity in skin color tones has remained an open issue. This research proposes a new three-dimensional hybrid color space, termed SKN, which employs a Genetic Algorithm heuristic and Principal Component Analysis to find the optimal representation of human skin color across more than seventeen existing color spaces. The Genetic Algorithm heuristic is used to find the optimal color component combination in terms of skin detection accuracy, while Principal Component Analysis projects the optimal Genetic Algorithm solution onto a lower-dimensional space. Pixel-wise skin detection was used to evaluate the performance of the proposed color space. We employed four classifiers, Random Forest, Naïve Bayes, Support Vector Machine, and Multilayer Perceptron, to generate the human skin color predictive model. The proposed color space was compared with several existing color spaces and shows superior results in terms of pixel-wise skin detection accuracy. Experimental results show that with the Random Forest classifier, the proposed SKN color space obtained an average F-score and True Positive Rate of 0.953 and a False Positive Rate of 0.0482, outperforming the existing color spaces in terms of pixel-wise skin detection accuracy. The results also indicate that, among the classifiers used in this study, Random Forest is the most suitable classifier for pixel-wise skin detection applications.
Principal component analysis (PCA) is capable of handling large sets of data. However, the lack of a consistent data pre-treatment method, and the importance of such pre-treatment, are limitations in PCA applications. This study examined pre-treatment methods (log(x + 1) transformation, outlier removal, and granulometric and geochemical normalization) on a dataset of mangrove surface sediment from Mengkabong Lagoon, Sabah, at high and low tides. The study revealed that geochemical normalization using Al combined with outlier removal resulted in a better classification of the mangrove surface sediment than outlier removal alone, granulometric normalization using clay, or log(x + 1) transformation. The PCA output using geochemical normalization with outlier removal demonstrated associations between environmental variables and tides for the mangrove surface sediment of Mengkabong Lagoon, Sabah. The PCA outputs at high and low tides also helped to better interpret information about the sediment and its controlling factors in the intertidal zone. The study showed data pre-treatment to be a useful procedure for standardizing the datasets and reducing the influence of outliers.
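A minimal sketch of the winning pre-treatment (geochemical normalization to Al plus outlier removal) before PCA, with synthetic sediment data and an arbitrary z-score outlier rule standing in for whatever criterion the study used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pretreat(metals, al, z_cut=3.0):
    """Geochemical normalization to Al followed by simple outlier removal
    (rows with any |z| above z_cut are dropped)."""
    norm = metals / al[:, None]               # normalize each metal to Al
    z = np.abs(StandardScaler().fit_transform(norm))
    return norm[(z < z_cut).all(axis=1)]

# Hypothetical sediment data: metal concentrations and the Al content per sample.
rng = np.random.default_rng(0)
metals = rng.lognormal(mean=1.0, sigma=0.4, size=(80, 5))
al = rng.lognormal(mean=2.0, sigma=0.2, size=80)

cleaned = pretreat(metals, al)
pca = PCA().fit(StandardScaler().fit_transform(cleaned))
print("variance explained per component:", np.round(pca.explained_variance_ratio_, 3))
```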
This work proposes a functional data analysis approach for morphometrics in classifying three shrew species (S. murinus, C. monticola, and C. malayana) from Peninsular Malaysia. Functional data geometric morphometrics (FDGM) for 2D landmark data is introduced and its performance is compared with classical geometric morphometrics (GM). The FDGM approach converts 2D landmark data into continuous curves, which are then represented as linear combinations of basis functions. The landmark data was obtained from 89 crania of shrew specimens based on three craniodental views (dorsal, jaw, and lateral). Principal component analysis and linear discriminant analysis were applied to both GM and FDGM methods to classify the three shrew species. This study also compared four machine learning approaches (naïve Bayes, support vector machine, random forest, and generalised linear model) using predicted PC scores obtained from both methods (a combination of all three craniodental views and individual views). The analyses favoured FDGM and the dorsal view was the best view for distinguishing the three species.
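A rough sketch of the FDGM idea: each ordered 2D landmark configuration is expressed as coefficients of a small basis (a Fourier basis here, as a stand-in for the basis functions used in the paper), and the coefficients are fed to PCA and LDA for species classification. Landmark counts, species labels, and data are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def fourier_basis(t, n_basis):
    """Columns: a constant term, then sin/cos pairs at increasing frequencies."""
    cols = [np.ones_like(t)]
    k = 1
    while len(cols) < n_basis:
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
        k += 1
    return np.column_stack(cols[:n_basis])

def curve_coefficients(landmarks, n_basis=7):
    """Represent an ordered 2D landmark configuration as a continuous curve:
    fit x(t) and y(t) by least squares on the basis and return the coefficients."""
    t = np.linspace(0, 1, len(landmarks))
    B = fourier_basis(t, n_basis)
    cx, *_ = np.linalg.lstsq(B, landmarks[:, 0], rcond=None)
    cy, *_ = np.linalg.lstsq(B, landmarks[:, 1], rcond=None)
    return np.concatenate([cx, cy])

# Hypothetical aligned landmark data: 89 crania x 15 landmarks x 2 coordinates, 3 species.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(89, 15, 2))
species = rng.integers(0, 3, size=89)

coefs = np.array([curve_coefficients(lm) for lm in landmarks])
scores = PCA(n_components=10).fit_transform(coefs)
print("cross-validated accuracy:",
      cross_val_score(LinearDiscriminantAnalysis(), scores, species, cv=5).mean())
```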
In this dataset, we distinguish 15 accessions of Garcinia mangostana from Peninsular Malaysia using Fourier transform-infrared spectroscopy coupled with chemometric analysis. We found that the position and intensity of the characteristic peaks at 3600-3100 cm⁻¹ in the IR spectra allowed discrimination of G. mangostana from different locations. Further principal component analysis (PCA) of all the accessions suggests that two main clusters were formed: samples from Johor, Melaka, and Negeri Sembilan (South) clustered together in one group, while samples from Perak, Kedah, Penang, Selangor, Kelantan, and Terengganu (North and East Coast) formed another cluster.
We evaluated the species richness and beta diversity of epiphyllous assemblages from three selected localities in Sabah, i.e. Mt. Silam in Sapagaya Forest Reserve, and Ulu Senagang and Mt. Alab in Crocker Range Park. A total of 98 species were found and a phytosociological survey was carried out based on the three study areas. A detailed statistical analysis including standard correlation and regression analyses, ordination of species and leaves using centered principal component analysis, and the SDR simplex method to evaluate the beta diversity, was conducted. Beta diversity is very high in the epiphyllous liverwort assemblages in Sabah, with species replacement as the major component of pattern formation and less pronounced richness difference. The community analysis of the epiphyllous communities in Sabah makes possible their detailed description and comparison with similar communities of other continents.
Banana peel flours (BPF) prepared from green or ripe Cavendish and Dream banana fruits were assessed for their total starch (TS), digestible starch (DS), resistant starch (RS), total dietary fibre (TDF), soluble dietary fibre (SDF), and insoluble dietary fibre (IDF). Principal component analysis (PCA) identified that a single component was responsible for 93.74% of the total variance in the starch and dietary fibre components that differentiated ripe and green banana flours. Cluster analysis (CA) applied to similar data yielded two statistically significant clusters (green and ripe bananas), indicating differences in behaviour according to the stage of ripeness based on the starch and dietary fibre components. We conclude that the starch and dietary fibre components can be used to discriminate between flours prepared from peels of fruits at different stages of ripeness. The results are also suggestive of the potential of green and ripe BPF as functional ingredients in food.
Parkinson's disease (PD) belongs to a larger group of neuromotor diseases marked by the progressive death of dopamine-producing cells in the brain. Providing computational tools for Parkinson's disease based on datasets containing medical information is highly desirable, as it can help the many people who want to discover their risk of the disease at an early stage, when symptoms can still be alleviated. This paper proposes a new hybrid intelligent system for the prediction of PD progression using noise removal, clustering, and prediction methods. Principal Component Analysis (PCA) and Expectation Maximization (EM) are employed, respectively, to address the multi-collinearity problems in the experimental datasets and to cluster the data. We then apply an Adaptive Neuro-Fuzzy Inference System (ANFIS) and Support Vector Regression (SVR) for the prediction of PD progression. Experimental results on public Parkinson's datasets show that the proposed method remarkably improves the accuracy of predicting PD progression. The hybrid intelligent system can assist medical practitioners in healthcare practice with the early detection of Parkinson's disease.
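A hedged sketch of the PCA, EM clustering, and per-cluster regression pipeline, using scikit-learn's GaussianMixture for EM and SVR only (ANFIS has no standard scikit-learn implementation); the features, target, and cluster count are synthetic placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical voice-measurement features and a UPDRS-like progression score.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 16))
updrs = 20 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=1.5, size=600)

# PCA reduces multi-collinearity; EM (a Gaussian mixture) clusters the records.
scores = PCA(n_components=8).fit_transform(StandardScaler().fit_transform(X))
clusters = GaussianMixture(n_components=3, random_state=0).fit_predict(scores)

# One SVR model per cluster, evaluated on a held-out split.
for c in np.unique(clusters):
    Xc, yc = scores[clusters == c], updrs[clusters == c]
    X_tr, X_te, y_tr, y_te = train_test_split(Xc, yc, test_size=0.3, random_state=0)
    svr = SVR(kernel="rbf", C=10).fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((svr.predict(X_te) - y_te) ** 2))
    print(f"cluster {c}: test RMSE = {rmse:.2f}")
```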
Pyrolysis-gas chromatography-mass spectrometry (Py-GC-MS) has been recognised as an effective technique for analysing car paint. This study was conducted to assess the combination of Py-GC-MS and chemometric techniques for classifying car paint primer, the inner layer of the car paint system. Fifty car paint primer samples from various manufacturers were analysed using Py-GC-MS, and the dataset of identified pyrolysis products was subjected to principal component analysis (PCA) and discriminant analysis (DA). The PCA rendered 16 principal components accounting for 86.33% of the total variance. The DA was useful for classifying the car paint primer samples according to their type (1k and 2k primer), with 100% correct classification in the test set for all three modes (standard, stepwise forward, and stepwise backward). Three compounds, indolizine, 1,3-benzenedicarbonitrile, and p-terphenyl, were the most significant compounds for discriminating the car paint primer samples.
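A minimal sketch of PCA followed by discriminant analysis on a peak-area table, with invented pyrolysis-product data in place of the fifty primer samples; cross-validation replaces the study's standard/stepwise DA modes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Hypothetical peak-area table: 50 primer samples x 40 identified pyrolysis products.
rng = np.random.default_rng(0)
peaks = rng.lognormal(sigma=0.5, size=(50, 40))
primer_type = np.array([0] * 25 + [1] * 25)          # 1k vs 2k primer
peaks[primer_type == 1, :3] *= 2.5                   # a few discriminating compounds

scores = PCA(n_components=16).fit_transform(StandardScaler().fit_transform(peaks))
da = LinearDiscriminantAnalysis()
print("cross-validated classification accuracy:",
      cross_val_score(da, scores, primer_type, cv=5).mean())
```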
The issue of classifying objects into groups when the variables measured in an experiment are mixed has attracted the attention of statisticians. The Smoothed Location Model (SLM) is a popular classification method for handling data containing both continuous and binary variables simultaneously. However, the SLM is infeasible for a large number of binary variables due to the occurrence of numerous empty cells. Therefore, this study aims to construct new SLMs by integrating the SLM with two variable extraction techniques, Principal Component Analysis (PCA) and two types of Multiple Correspondence Analysis (MCA), in order to reduce the large number of mixed variables, primarily the binary ones. The performance of the newly constructed models, namely SLM+PCA+Indicator MCA and SLM+PCA+Burt MCA, is examined based on the misclassification rate. Results from simulation studies for a sample size of n=60 show that the SLM+PCA+Indicator MCA model provides perfect classification when the number of binary variables (b) is 5 or 10. For b=20, the SLM+PCA+Indicator MCA model produces misclassification rates of 0.3833, 0.6667, and 0.3221 for n=60, n=120, and n=180, respectively. Meanwhile, the SLM+PCA+Burt MCA model provides perfect classification when the number of binary variables is 5, 10, 15, or 20, and yields a misclassification rate as small as 0.0167 when b=25. Investigation of a real dataset demonstrates that both newly constructed models yield low misclassification rates of 0.3066 and 0.2336, respectively, with the SLM+PCA+Burt MCA model performing the best among all the classification methods compared. The findings reveal that the two new models of SLM integrated with variable extraction techniques can be good alternative methods for classification in handling mixed-variable problems, particularly when dealing with a large number of binary variables.