Volume no: 6 | Issue no: 1
Article Type: Scholarly Article
Author: A.Manoj Prabaharan
Published Date: May, 2025
Publisher: Journal of Science Technology and Research (JSTAR)
Page No: 1 - 12
Abstract : With the explosive growth of unstructured textual data in various domains such as social media, scientific literature, and enterprise records, scalable and accurate topic modeling has become crucial for extracting meaningful insights. Traditional topic modeling techniques like Latent Dirichlet Allocation (LDA) often struggle to handle the volume, velocity, and variety of big data while maintaining interpretability and computational efficiency. This paper proposes a novel approach that leverages transformer-based language models to enhance text mining and enable scalable topic modeling suitable for big data environments. Transformers, characterized by their self-attention mechanisms, have revolutionized natural language processing by capturing long-range dependencies and contextual semantics more effectively than previous methods. However, their direct application to large-scale topic modeling faces challenges due to computational demands and the complexity of adapting generative topic models to transformer architectures. To address these limitations, our approach integrates transformer embeddings with scalable clustering and dimensionality reduction techniques, creating a hybrid pipeline optimized for both performance and scalability. First, raw textual data undergoes preprocessing including tokenization, stop-word removal, and normalization. Subsequently, transformer models such as BERT or its variants generate high-dimensional contextualized embeddings representing semantic features of the documents. These embeddings encapsulate rich linguistic information, enabling the discovery of nuanced and coherent topics that traditional bag-of-words models may miss. To manage scalability, we employ approximate nearest neighbor search and hierarchical clustering algorithms, which reduce the computational burden when grouping documents into topic clusters. 
Dimensionality reduction techniques such as UMAP or PCA further streamline the embedding space, facilitating efficient topic extraction and visualization. This combination ensures the framework can process millions of documents without compromising topic quality or requiring extensive computational resources. We evaluate the proposed method on multiple large-scale benchmark datasets, including social media posts, scientific articles, and customer feedback collections. The results demonstrate significant improvements in topic coherence, relevance, and interpretability compared to baseline models including LDA and non-transformer embedding approaches. Moreover, our model exhibits robust scalability, with linear or near-linear time complexity growth relative to dataset size, making it well-suited for real-time and streaming data scenarios. In addition, the transformer-enhanced pipeline offers flexibility for downstream applications such as sentiment analysis, trend detection, and knowledge discovery, by providing semantically rich topic representations that facilitate deeper understanding and contextualization of big data content. The modular nature of the system allows easy integration with existing big data platforms and supports incremental updates to topic models as new data arrives. This work highlights the potential of combining state-of-the-art transformer architectures with scalable clustering methods to advance the field of topic modeling in big data contexts. Future research directions include optimizing transformer architectures for even greater efficiency, exploring multilingual and cross-domain topic models, and incorporating user feedback mechanisms to refine topic quality interactively. In summary, our transformer-enhanced text mining framework presents a scalable, accurate, and flexible solution for extracting meaningful topics from massive unstructured text corpora, empowering organizations to unlock actionable insights from big data environments.
Keywords: Transformer models, Text mining, Topic modeling, Big data, Scalability, BERT embeddings, Clustering, Dimensionality reduction, UMAP, Latent Dirichlet Allocation, Natural language processing, Semantic analysis, Large-scale text analysis, Document clustering, Computational efficiency

Transformer-Enhanced Text Mining

A primary obstacle to using transformers directly for topic modeling is their computational
cost and the difficulty of interpreting the resulting embeddings in terms of explicit topics.
Big data environments impose further constraints, such as the need for real-time processing,
distributed computing, and incremental learning. To overcome these hurdles, there is growing
interest in hybrid approaches that combine transformer-based embeddings with scalable
clustering and dimensionality reduction techniques, harnessing the semantic power of
transformers while maintaining the efficiency required for big data applications.
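
The clustering half of such a hybrid approach can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes document embeddings have already been produced by a transformer encoder, and the three-dimensional `toy_embeddings` below are hypothetical stand-ins for those vectors. The choice of spherical k-means (k-means on unit vectors with cosine similarity) and the function names are likewise assumptions made for the example.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def farthest_point_init(vectors, k):
    """Deterministic seeding: start from the first vector, then repeatedly add
    the vector whose best similarity to the seeds chosen so far is lowest."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(min(vectors, key=lambda v: max(dot(v, s) for s in seeds)))
    return seeds

def spherical_kmeans(embeddings, k, max_iter=20):
    """Cluster embeddings by cosine similarity; returns one label per document."""
    vectors = [normalize(v) for v in embeddings]
    centroids = farthest_point_init(vectors, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its most similar centroid.
        new_labels = [max(range(k), key=lambda j: dot(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the normalized mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels

# Hypothetical stand-ins for transformer document embeddings (assumption).
toy_embeddings = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1],   # two documents on one theme
    [0.1, 0.9, 0.0], [0.0, 0.8, 0.2],   # two documents on a second theme
    [0.1, 0.0, 0.9], [0.2, 0.1, 0.8],   # two documents on a third theme
]
labels = spherical_kmeans(toy_embeddings, k=3)
print(labels)
```

Each resulting label plays the role of a topic cluster. At the scale the paper targets, the exhaustive assignment step would likely be replaced by an approximate nearest neighbor index, which this sketch does not attempt.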

This paper proposes a novel transformer-enhanced text mining framework designed specifically
for scalable topic modeling in big data environments. Our approach involves generating
contextual embeddings from large corpora using transformer models, followed by scalable
clustering methods to group semantically similar documents into coherent topics. Dimensionality
reduction techniques further optimize the embedding space, enabling efficient computation and
visualization. This integrated pipeline not only improves topic quality but also addresses the
scalability challenges posed by massive datasets.

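
The dimensionality reduction step can be illustrated with PCA, which the abstract names alongside UMAP. The sketch below is an assumption-laden toy rather than the authors' code: it computes principal components by power iteration with deflation using only the standard library, and the `rows` data are hypothetical three-dimensional vectors standing in for high-dimensional embeddings.

```python
import math

def center(rows):
    """Subtract the per-dimension mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def top_eigenvector(matrix, steps=200):
    """Power iteration: multiply by the matrix and renormalize until stable."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]   # fixed, non-uniform starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(steps):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def pca(rows, n_components=2):
    """Return (components, projected rows) for the leading components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v = top_eigenvector(cov)
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append(v)
        # Deflation: remove the direction just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
                 for row in centered]
    return components, projected

# Hypothetical data: most variance lies along the direction (1, 2, 0).
rows = [[-2.0, -4.0, 0.1], [-1.0, -2.0, -0.1], [0.0, 0.0, 0.1],
        [1.0, 2.0, -0.1], [2.0, 4.0, 0.0]]
components, projected = pca(rows, n_components=2)
print(projected)
```

The two-column `projected` output is the kind of reduced space that can be plotted or passed to a clustering step. Power iteration is used here only to keep the example dependency-free; a production pipeline would more plausibly rely on an established PCA or UMAP implementation.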
We systematically evaluate the performance of our framework on diverse large-scale datasets
drawn from social media, scientific publications, and customer feedback, benchmarking against
classical topic modeling methods and non-transformer embedding baselines. The results
illustrate that our method significantly enhances topic coherence, interpretability, and relevance
while maintaining computational efficiency. Furthermore, the approach is adaptable to real-time
and streaming data, supporting incremental updates that are crucial for dynamic big data
ecosystems.
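
The text does not say which coherence measure is used in the evaluation. As one possibility, the sketch below computes normalized pointwise mutual information (NPMI), a commonly used coherence score, from document-level co-occurrence counts. The choice of NPMI, the function name, and the tiny `docs` corpus are all assumptions made for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of topic words (needs at least two words).

    Probabilities are estimated as the fraction of documents containing
    the word (or both words of a pair).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)   # the words never co-occur
        elif p12 == 1.0:
            scores.append(0.0)    # degenerate case, treated as 0 by assumption
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical tokenized corpus (assumption): two themes, two documents each.
docs = [
    ["topic", "model", "text"],
    ["topic", "model", "cluster"],
    ["price", "refund", "order"],
    ["price", "refund", "delivery"],
]
coherent = npmi_coherence(["topic", "model"], docs)
mixed = npmi_coherence(["topic", "refund"], docs)
print(coherent, mixed)
```

Scores lie between -1 and 1. Words that always appear together score near 1, while words that never share a document score -1, so a topic whose top words mix unrelated themes is penalized.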

This paper presents a novel framework that leverages transformer-based language models to
enhance text mining and enable scalable topic modeling in big data environments. By integrating
powerful contextual embeddings generated from transformer architectures such as BERT with
scalable clustering and dimensionality reduction techniques, the proposed system addresses the
limitations of traditional topic modeling methods in handling massive and complex textual
datasets.
