The dataset focuses on the Pakistan military and covers five types of entities collected from Wikipedia: weapons, ranks, dates, operations, and locations. The data were annotated with an open-source NER annotator, then cleaned and balanced. The final dataset comprises 660 neutral and 660 anti-military sentiment samples, 1320 in total. Because the two classes are equal in size, the dataset is well suited to sentiment analysis of public attitudes toward military-related topics.
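
The balancing step can be illustrated with a minimal sketch. The source does not say how the classes were balanced, so downsampling to the smallest class is an assumption here, as are the function name `balance_by_downsampling`, the label strings, and the toy input sizes (only the 660/660 outcome comes from the description above).

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """Downsample every class to the size of the smallest class.

    `samples` is a list of (text, label) pairs. This is one common
    balancing strategy, assumed here for illustration only.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    # The smallest class sets the per-class target size.
    target = min(len(items) for items in by_label.values())

    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: placeholder sentences, not real dataset content.
raw = [(f"neutral sentence {i}", "neutral") for i in range(900)]
raw += [(f"anti sentence {i}", "anti-military") for i in range(660)]

balanced = balance_by_downsampling(raw)
counts = defaultdict(int)
for _, label in balanced:
    counts[label] += 1
print(len(balanced), dict(counts))
```

Downsampling discards data from the larger class; oversampling the smaller class would be the alternative if every collected sample had to be kept.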


This named-entity dataset is built by applying the widely used Large Language Model (LLM) BERT to the CORD-19 biomedical literature corpus. Fine-tuning the pre-trained BERT on the CORD-NER dataset enables the model to capture the context and semantics of biomedical named entities. The fine-tuned model is then applied to CORD-19 to extract more contextually relevant and up-to-date named entities. Fine-tuning LLMs on large datasets is challenging, however, so two distinct sampling methodologies are used to address this.
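
The description does not name the two sampling methodologies, so the sketch below should not be read as the authors' method. It only illustrates the general idea of shrinking a large corpus before fine-tuning, using two common choices picked for illustration: uniform random sampling and stratified sampling. The function names, the `source` field, and the toy corpus are all hypothetical.

```python
import random
from collections import defaultdict


def random_sample(docs, k, seed=0):
    """Uniform random sample of k documents without replacement."""
    return random.Random(seed).sample(docs, k)


def stratified_sample(docs, key, fraction, seed=0):
    """Sample the same fraction from every group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in docs:
        groups[key(doc)].append(doc)

    picked = []
    for name in sorted(groups):
        members = groups[name]
        # Keep at least one document per group so no group vanishes.
        n = max(1, round(len(members) * fraction))
        picked.extend(rng.sample(members, n))
    return picked


# Hypothetical toy corpus: 750 "journal" and 250 "preprint" records.
corpus = [
    {"id": i, "source": "journal" if i % 4 else "preprint"}
    for i in range(1000)
]

subset_a = random_sample(corpus, 100)
subset_b = stratified_sample(corpus, key=lambda d: d["source"], fraction=0.1)
print(len(subset_a), len(subset_b))
```

Uniform sampling is simpler, while stratified sampling preserves the proportion of each group in the reduced set, which matters when some groups are rare.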