Multilingual datasets for Main content extraction from web pages

Citation Author(s): Geunseong Jung (Hanyang University)
Submitted by: Geunseong Jung
Last updated: Thu, 03/21/2024 - 07:27
DOI: 10.21227/rj0q-t583
Research Article Link: Extracting the Main Content of Web Pages Using the First Impression Area
Links: Figshare mirror

Categories:

Other

Keywords:

web data

Abstract

This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.

The dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
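
Because MHTML is a MIME multipart container, the HTML of a crawled page can usually be recovered with a standard MIME parser. The following is a minimal sketch, assuming the raw bytes of one MHTML file have already been read from the dataset; the function name `extract_html_from_mhtml` and the `SAMPLE` document are hypothetical and are not part of the dataset or its framework. How the files are stored inside the MongoDB archive is described in the linked instructions, not here.

```python
import email
from email import policy
from typing import Optional


def extract_html_from_mhtml(raw: bytes) -> Optional[str]:
    """Return the first text/html part of an MHTML document, or None.

    Hypothetical helper: MHTML is parsed as an ordinary MIME message.
    """
    message = email.message_from_bytes(raw, policy=policy.default)
    # walk() visits the container and every nested MIME part in order.
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() decodes the part using its declared charset.
            return part.get_content()
    return None


# A tiny hand-written MHTML document used only for illustration.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html = extract_html_from_mhtml(SAMPLE)
```

Only the Python standard library is used, so the sketch does not depend on the extraction framework listed under the related resources.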

Related Resources:

- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-frame…
- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2

Instructions:

Please read the instructions here:

https://github.com/dreamwayjgs/main-content-extraction-assessment-framework#demo-datasets

Dataset Files

dataset.zip (Size: 5.21 GB)
