Multilingual datasets for main content extraction from web pages
This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.
The dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
- Main Content Extraction Framework: https://github.com/dreamwayjgs/main-content-extraction-assessment-framework
- GCE Algorithm (on the above framework): https://gitlab.com/dreamwayjgs/main-content-extractor-v2
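
Because the crawled pages are stored as MHTML, which is a MIME multipart container, the HTML of a page can be pulled out with a generic MIME parser. The following is a minimal sketch using only the Python standard library. It is not the loader used by the framework above; the function name `extract_html_from_mhtml` and the inline `SAMPLE` document are hypothetical and exist only for illustration.

```python
import email
from email import policy
from typing import Optional


def extract_html_from_mhtml(raw: bytes) -> Optional[str]:
    """Return the first text/html part of an MHTML document, or None.

    Assumption: the MHTML file is a standard MIME multipart message,
    as produced by common browsers when saving a page.
    """
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() decodes the transfer encoding and charset.
            return part.get_content()
    return None


# Hypothetical, hand-written MHTML document used only for this example.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; '
    b'boundary="----sample-boundary"\r\n'
    b"\r\n"
    b"------sample-boundary\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>\r\n"
    b"------sample-boundary--\r\n"
)

if __name__ == "__main__":
    html = extract_html_from_mhtml(SAMPLE)
    print(html)
```

To apply the same function to a file from the dataset, read the file in binary mode and pass its bytes to `extract_html_from_mhtml`. How the MHTML files are named and stored is described in the dataset's own instructions and is not assumed here.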
Please read the instructions here: