Volume no: 6
Issue no: 1
Article Type: Scholarly Article
Author: Arul Selvan M
Published Date: May, 2025
Publisher: Journal of Science Technology and Research (JSTAR)
Page No: 1 - 12
Abstract : This paper presents a novel framework that leverages pretrained language models to enhance semantic-aware text extraction and neural topic discovery from large and unstructured text corpora. Traditional topic modeling techniques, such as Latent Dirichlet Allocation (LDA), often rely on bag-of-words representations, which ignore the semantic relationships between words and suffer from limited contextual understanding. To address these limitations, our approach integrates contextual embeddings derived from state-of-the-art pretrained language models, such as BERT and its variants, enabling a richer and more nuanced representation of text semantics. Our framework begins by extracting semantically meaningful text segments using attention-based mechanisms inherent to pretrained models, allowing us to capture relevant context beyond simple keyword frequency. This step significantly improves the quality of text extraction by focusing on conceptually important phrases and sentences rather than isolated terms. Subsequently, we utilize neural topic discovery techniques that operate directly on the dense embeddings produced by the language models, bypassing the traditional reliance on sparse count-based features. We introduce a novel neural architecture that combines contrastive learning with clustering objectives to discover coherent and interpretable topics from these semantic embeddings. This architecture dynamically adapts to the semantic structures within the data, improving both the granularity and coherence of discovered topics. Experimental results on multiple benchmark datasets demonstrate that our method outperforms conventional topic models in terms of topic coherence and relevance, as measured by standard evaluation metrics such as topic coherence scores (e.g., NPMI, UMass) and human qualitative assessments. 
Additionally, our approach exhibits robustness across various domains and scales efficiently to large datasets, making it applicable for real-world scenarios such as document summarization, content recommendation, and exploratory data analysis. We also provide comprehensive ablation studies illustrating the contributions of semantic-aware extraction and neural topic modeling components to overall system performance. In conclusion, our work highlights the potential of pretrained language models not only for improving semantic representation but also as foundational tools for downstream tasks involving text extraction and topic discovery. By bridging the gap between deep contextual understanding and neural topic modeling, this framework offers a powerful paradigm for interpreting and organizing complex textual information.
Keywords: pretrained language models, semantic-aware text extraction, neural topic discovery, BERT, contextual embeddings, topic modeling, contrastive learning, clustering, topic coherence, natural language processing, document summarization, text representation, unsupervised learning, deep learning, semantic understanding
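The abstract names NPMI as one of its coherence metrics but does not state how it is estimated. As a point of reference only, the sketch below computes the standard normalized pointwise mutual information, NPMI(w1, w2) = log(P(w1, w2) / (P(w1) P(w2))) / (-log P(w1, w2)), averaged over all pairs of a topic's top words. It assumes probabilities are estimated from document-level co-occurrence; the function name `npmi_coherence`, the toy corpus, and the edge-case conventions are illustrative assumptions, not the paper's evaluation code.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    P(w) is estimated as the fraction of documents containing w, and
    P(w1, w2) as the fraction containing both (an assumption of this sketch).
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words to form a pair")

    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # The pair never co-occurs: NPMI tends to -1 in the limit.
            scores.append(-1.0)
        elif p12 == 1.0:
            # Both words appear in every document (0/0); treated as 0 here
            # by convention, which is an assumption of this sketch.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [["cat", "dog"], ["cat", "dog"], ["fish"], ["fish"]]
always_together = npmi_coherence(["cat", "dog"], toy_docs)
never_together = npmi_coherence(["cat", "fish"], toy_docs)
```

Scores lie in [-1, 1]: words that always appear together score 1, statistically independent words score 0, and words that never co-occur score -1.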

Semantic-Aware Text Extraction

With the exponential growth of digital text data generated daily, from social media posts and
news articles to academic papers and corporate documents, there is an increasing need for
automated methods that can effectively extract meaningful information and uncover latent
topics within large unstructured text corpora. Understanding the thematic structure of
documents not only aids in organizing and summarizing content but also supports numerous
downstream applications such as information retrieval, recommendation systems, and
knowledge discovery. Traditional topic modeling approaches, such as Latent Dirichlet Allocation
(LDA) and Non-negative Matrix Factorization (NMF), have played a pivotal role in this domain by
identifying groups of co-occurring words that represent underlying topics. However, these
models often rely on bag-of-words representations that disregard the semantic context of terms,
resulting in limited interpretability and performance, especially on complex or nuanced datasets.
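To make the bag-of-words limitation concrete, here is a minimal sketch using only the Python standard library. The helper `bag_of_words` and the example sentences are illustrative assumptions; real LDA or NMF pipelines typically add tokenization, stop-word removal, and vocabulary pruning on top of this counting step.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts, discarding word order and context."""
    return Counter(text.lower().split())


# Word order is lost: opposite meanings, identical representation.
same_counts = bag_of_words("dog bites man") == bag_of_words("man bites dog")

# Context is lost: "bank" is the same feature in both senses.
river_sense = bag_of_words("she sat on the river bank")
money_sense = bag_of_words("the bank approved the loan")
bank_is_one_feature = river_sense["bank"] == money_sense["bank"] == 1
```

Both flags evaluate to `True`: a count-based model receives no signal that distinguishes the two sentence orders or the two senses of "bank".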
In recent years, the advent of pretrained language models (PLMs), such as BERT (Bidirectional
Encoder Representations from Transformers), GPT (Generative Pretrained Transformer), and
their derivatives, has revolutionized natural language processing (NLP). These models are trained
on massive corpora using self-supervised objectives, enabling them to capture deep contextual
relationships between words in a sentence or document. Unlike traditional methods that treat
words as discrete tokens, PLMs generate dense, context-aware embeddings that encapsulate
semantic meaning and syntactic nuances. Leveraging these embeddings for downstream tasks
has led to remarkable improvements in performance across a wide spectrum of NLP challenges,
including sentiment analysis, question answering, and text classification.
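One common way to turn a PLM's per-token outputs into a single dense vector for a sentence or segment is masked mean pooling, after which texts can be compared by cosine similarity. The sketch below illustrates that step with plain Python lists. The pooling strategy, the function names, and the toy vectors are assumptions made for illustration; the source does not state which pooling or similarity measure it uses, and real token vectors would come from a model such as BERT rather than being written by hand.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional token vectors; the last position is padding.
tokens = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

Here `sentence_vector` is `[2.0, 0.0]`: the padded position is ignored, and the result can be passed to `cosine_similarity` or to a clustering step that operates on dense vectors.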
Despite the success of PLMs, their potential for enhancing topic discovery remains relatively
underexplored. Most classical topic modeling techniques cannot directly incorporate dense
