Data 2 Pattern
Data science isn't about the quantity of data but rather the quality. - Joo Ann Lee
🔍 Understanding the Impact of Feature Selection vs. Feature Extraction in Dimensionality Reduction for Big Data 📊



In the era of big data, working with high-dimensional datasets presents major challenges in processing, visualization, and model performance. A recent study titled "Comparison of Feature Selection and Feature Extraction Role in Dimensionality Reduction of Big Data" (Journal of Techniques, 2023) offers a comprehensive evaluation of Feature Selection (FS) and Feature Extraction (FE) using the ANSUR II dataset, a U.S. Army anthropometric dataset with 109 features and 6068 observations.



📌 Study Goals

To compare FS and FE techniques in terms of:

➡️ Dimensionality reduction

➡️ Predictive performance

➡️ Information retention

⚙️ Techniques Explored

🧹 Feature Selection:

🔸 Highly Correlated Filter – drops one feature from each pair whose correlation exceeds 0.88

🔸 Recursive Feature Elimination (RFE) – iteratively eliminates the least important features
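Both selection techniques above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: it assumes pandas and scikit-learn, uses a small synthetic table in place of ANSUR II, and picks a decision tree as the RFE estimator. The helper names `drop_highly_correlated` and `select_with_rfe` are hypothetical; only the 0.88 threshold comes from the study.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier


def drop_highly_correlated(df, threshold=0.88):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def select_with_rfe(X, y, n_features):
    """Remove the least important feature one at a time until n_features remain."""
    # Assumption: a decision tree supplies the feature importances.
    estimator = DecisionTreeClassifier(random_state=0)
    rfe = RFE(estimator, n_features_to_select=n_features, step=1)
    rfe.fit(X, y)
    return list(X.columns[rfe.support_])


# Toy data (hypothetical, not ANSUR II): "b" is a near-duplicate of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
    "d": rng.normal(size=200),
})
labels = (toy["a"] + toy["c"] > 0).astype(int)

filtered = drop_highly_correlated(toy)          # removes the redundant "b"
selected = select_with_rfe(filtered, labels, n_features=2)
print("after correlation filter:", list(filtered.columns))
print("after RFE:", selected)
```

Both helpers return original column names, which is why selection keeps the dataset interpretable.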

🔄 Feature Extraction:

🔹 Principal Component Analysis (PCA) – transforms original features into orthogonal components
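To see what "orthogonal components" means in practice, here is a short PCA sketch with scikit-learn. The data is synthetic (10 features driven by 3 hidden factors), and the 95% explained-variance target is an assumption chosen for illustration, not a setting reported by the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 10 observed features built from 3 latent factors plus small noise.
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

# Standardize, then keep as many components as needed to explain 95% of the variance.
pca_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_reduced = pca_pipeline.fit_transform(X)

pca = pca_pipeline.named_steps["pca"]
print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# The component vectors are orthonormal: their Gram matrix is the identity.
gram = pca.components_ @ pca.components_.T
print("components orthogonal:", np.allclose(gram, np.eye(len(gram))))
```

Each new column is a weighted mix of all the original features, which is where the loss of interpretability comes from.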

🧪 Methodology

🧼 Data preprocessing using Missing Value Ratio

🧠 Classification using ML models:

✅ K-Nearest Neighbors (KNN)

✅ Decision Tree

✅ Support Vector Machine (SVM)

✅ Neural Network

✅ Random Forest

🔁 Post-reduction classification using the same models
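The workflow above (filter by missing value ratio, classify, reduce, classify again) might look roughly like this in scikit-learn. Everything here is an assumption for illustration: the data is synthetic, the 0.5 missing-ratio cutoff, 5-fold cross-validation, and model hyperparameters are arbitrary, and PCA stands in as the single reduction step. The helpers `drop_by_missing_ratio` and `mean_accuracy` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def drop_by_missing_ratio(df, max_ratio=0.5):
    """Drop columns whose fraction of missing values exceeds max_ratio."""
    ratio = df.isna().mean()
    return df.loc[:, ratio <= max_ratio]


def mean_accuracy(model, X, y, reducer=None):
    """Cross-validated accuracy of scaler -> optional reducer -> model."""
    steps = [StandardScaler()]
    if reducer is not None:
        steps.append(reducer)
    steps.append(model)
    return cross_val_score(make_pipeline(*steps), X, y, cv=5).mean()


# Hypothetical synthetic data standing in for the real dataset.
X_arr, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])
X["mostly_missing"] = np.nan
X.loc[:30, "mostly_missing"] = 1.0  # only about 10% of rows have a value

X_clean = drop_by_missing_ratio(X)  # removes "mostly_missing"

models = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    before = mean_accuracy(model, X_clean, y)
    after = mean_accuracy(model, X_clean, y, reducer=PCA(n_components=5))
    results[name] = (before, after)
    print(f"{name:15s} before={before:.3f}  after={after:.3f}")
```

Putting the reducer inside the pipeline means it is refit on each training fold, so the post-reduction score is not inflated by information from the held-out data.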

📈 Key Results

🏆 KNN consistently performed best, maintaining 83% accuracy pre- and post-reduction

🧠 RFE achieved the highest accuracy among the reduction techniques, at 66% post-reduction

🧩 PCA effectively reduced the number of features but slightly decreased accuracy and interpretability

💡 Takeaways

✅ Use Feature Selection when interpretability and maintaining original structure are important

✅ Use Feature Extraction for noisy or highly redundant datasets

🎯 The choice depends on your data and modeling objectives

📖 Read the full paper here: DOI: 10.51173/jt.v5i1.1027



This is an excellent reference for anyone navigating the complexities of dimensionality reduction in ML pipelines. Whether you're optimizing models or just curious about FS vs. FE, this study is gold! 🧠✨

#MachineLearning #DataScience #FeatureEngineering #DimensionalityReduction #BigData #AI #KNN #PCA #RFE #MLResearch #DataAnalytics