Multilingual datasets for Main content extraction from web pages
- Submitted by: Geunseong Jung
- Last updated: Thu, 03/21/2024 - 03:27
- DOI: 10.21227/rj0q-t583
Abstract
This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.
The dataset contains MHTML files crawled from web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
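Because MHTML is a MIME `multipart/related` container, a crawled page can usually be opened with Python's standard `email` package. The following is a minimal sketch, not the loader used by the assessment framework. It assumes the stored files follow the MIME layout that browsers typically produce. The function name `extract_html_from_mhtml` and the inline `SAMPLE` document are hypothetical and exist only for illustration.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML document as a string.

    Hypothetical helper: assumes `raw` is a standard MIME multipart/related
    message, which is how browsers typically save MHTML pages.
    """
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes
            # the bytes using the part's declared charset.
            return part.get_content()
    raise ValueError("no text/html part found")


# A tiny hand-written MHTML document (illustrative, not taken from the dataset).
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; '
    b'boundary="----sketch-boundary"\r\n'
    b"\r\n"
    b"------sketch-boundary\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello =EC=95=88=EB=85=95</p></body></html>\r\n"
    b"------sketch-boundary--\r\n"
)

if __name__ == "__main__":
    print(extract_html_from_mhtml(SAMPLE))
```

The returned HTML string can then be passed to whichever extraction method is being evaluated. How the MHTML payloads are stored inside the MongoDB archive or the PostgreSQL dump is not described here, so reading them out of the database is left to the linked instructions.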
Related Resources:
- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework
- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2
Instructions:
Please read the instructions here:
https://github.com/dreamwayjgs/main-content-extraction-assessment-framew...