Multilingual datasets for Main content extraction from web pages

Citation Author(s):
Geunseong Jung, Hanyang University
Submitted by:
Geunseong Jung
Last updated:
Thu, 03/21/2024 - 03:27
DOI:
10.21227/rj0q-t583

Abstract 

This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.

The dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
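
MHTML is a MIME multipart container, so the HTML of a saved page can usually be recovered with a standard MIME parser. The following is a minimal sketch using Python's standard `email` package. The function name `extract_html_from_mhtml` and the inline sample document are hypothetical illustrations, not part of the dataset, and nothing is assumed here about how the MHTML files are stored inside the MongoDB archive or the PostgreSQL dump.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML (MIME multipart) document."""
    message = email.message_from_bytes(raw, policy=policy.default)
    # walk() visits the container and every nested part in document order.
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


# Hypothetical, hand-written MHTML snippet; it is NOT taken from the dataset.
SAMPLE_MHTML = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html = extract_html_from_mhtml(SAMPLE_MHTML)
print(html)
```

The recovered HTML string can then be passed to whichever extraction method is being evaluated. Images, stylesheets, and other resources of the page are separate MIME parts in the same file and can be reached through the same `walk()` loop.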

Related Resources:

- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework
- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2

Instructions: