Standard Dataset

Multilingual datasets for Main content extraction from web pages

Citation Author(s): Geunseong Jung (Hanyang University)
Submitted by: Geunseong Jung
DOI: 10.21227/rj0q-t583

Abstract

This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.

The dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
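
Because MHTML is a MIME multipart format, a crawled page can usually be opened with a standard MIME parser. The following is a minimal sketch in Python, assuming the MHTML bytes of one page have already been exported from the database. The function name `extract_html_from_mhtml` and the sample document are hypothetical and do not come from the dataset or its framework; how pages are stored in the MongoDB and PostgreSQL files is described in the linked instructions, not here.

```python
import email
from email import policy


def extract_html_from_mhtml(mhtml_bytes: bytes) -> str:
    """Return the first text/html part of an MHTML (MIME multipart) document.

    Hypothetical helper for illustration; not part of the dataset's framework.
    """
    message = email.message_from_bytes(mhtml_bytes, policy=policy.default)
    # walk() yields the multipart container first, then each nested part.
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


# Tiny hand-written MHTML sample (assumption: NOT taken from the dataset).
sample = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"; type="text/html"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html = extract_html_from_mhtml(sample)
print(html)
```

The extracted HTML string can then be passed to whichever extraction method is being evaluated. Images, stylesheets, and other resources remain available as the other parts of the same message.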

Related Resources:

- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-frame…
- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2

Instructions:

Please read the instructions here:

https://github.com/dreamwayjgs/main-content-extraction-assessment-framework#demo-datasets