The Internet of Things (IoT) is driving the convergence of the physical and digital worlds. Real-time, massive-scale connectivity produces large volumes of varied data, which is where Big Data comes into the picture. Big Data refers to large, diverse sets of information whose dimensions exceed what widely used database management systems or standard data processing software tools can manage within a given limit. Almost every big dataset is dirty: it may contain missing values, typing errors, inaccuracies, and many other issues that degrade the performance of Big Data analytics. One of the biggest challenges in Big Data analytics is discovering and repairing dirty data; failing to do so can lead to inaccurate analytics results and unreliable conclusions. We experimented with different missing value imputation techniques and compared the performance of machine learning (ML) models across imputation methods. We propose a hybrid model for missing value imputation that combines ML and sample-based statistical techniques. We then carried forward the best imputed dataset, chosen on the basis of ML model performance, for feature engineering and hyperparameter tuning. We used k-means clustering and principal component analysis. Model performance improved dramatically, with the XGBoost model reaching a root mean squared logarithmic error (RMSLE) of around 0.125. To guard against overfitting, we used K-fold cross-validation.
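For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how RMSLE and a K-fold split can be computed, using only the Python standard library. The function names `rmsle` and `k_fold_indices` are illustrative assumptions and are not taken from the study; the contiguous, unshuffled fold layout is likewise an assumption, since the abstract does not state how the folds were formed.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(x) = log(1 + x), so zero-valued targets are handled safely.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (hypothetical helper)."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop
```

In a cross-validated evaluation, a model would be fitted on each `train_idx` subset, scored with `rmsle` on the matching `val_idx` subset, and the per-fold scores averaged. Because the metric works on logarithms, it penalizes relative rather than absolute errors.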
Here, we report the draft genome sequence and assembly of Penicillium sp. strain E22, which was isolated from Antarctic soil on Deception Island, South Shetland Islands, close to the Antarctic Peninsula. The genome was sequenced using a 2 × 250 bp paired-end method on the Illumina MiSeq 6000. Genome assembly was performed using software implemented in the KBase web service. A phylogenetic tree comparing the internal transcribed spacer (ITS) region of strain E22 with those of other Penicillium species showed high genetic similarity to Penicillium griseofulvum MN545450 and Penicillium camemberti MT530220. The draft genome of Penicillium sp. strain E22 comprises 33,653 coding sequences, with a high G + C content of 48.32% and a total size of 37,484,944 bp. This draft genome assembly version has been deposited at GenBank under accession JASJUN000000000.
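The summary statistics quoted above (total size and G + C content) can be derived from an assembly in FASTA format. The sketch below is illustrative only: `read_fasta` and `gc_stats` are hypothetical helper names, the input is a toy string rather than the E22 assembly, and it assumes every base, including any ambiguous ones, counts toward the total length. Assembly tools may handle ambiguous bases differently.

```python
def read_fasta(text):
    """Return the sequences in a FASTA-formatted string, one per record."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def gc_stats(contigs):
    """Return (total length in bp, G + C content in percent) over all contigs."""
    total = 0
    gc = 0
    for seq in contigs:
        seq = seq.upper()
        total += len(seq)
        gc += seq.count("G") + seq.count("C")
    if total == 0:
        raise ValueError("no sequence data")
    return total, 100.0 * gc / total


# Toy input for illustration; not real assembly data.
toy_fasta = ">contig_1\nATGC\nGGCC\n>contig_2\natat\n"
total_bp, gc_percent = gc_stats(read_fasta(toy_fasta))
```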