web data

Multilingual datasets for Main content extraction from web pages

This dataset supports research on main content extraction from web pages. It is distributed as an archived MongoDB file and a PostgreSQL dump file.

This dataset contains crawled MHTML files of web pages in nine languages, including Korean, Japanese, Indonesian, French, Russian, Arabic (Saudi Arabia), and Chinese.
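MHTML is a MIME multipart container, so a page's HTML can usually be recovered with a standard MIME parser before any content extraction is attempted. The following is a minimal sketch using Python's standard `email` module. The function name `extract_html` and the hand-made `SAMPLE` document are illustrative assumptions, not part of the dataset or its tooling.

```python
import email
from email import policy
from typing import Optional


def extract_html(mhtml_bytes: bytes) -> Optional[str]:
    """Return the first text/html part of an MHTML document, or None.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    msg = email.message_from_bytes(mhtml_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes
            # the bytes using the part's declared charset.
            return part.get_content()
    return None


# A tiny hand-made MHTML document (assumption: real files in the dataset
# are larger and also carry images, stylesheets, and scripts as parts).
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Location: http://example.com/\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html = extract_html(SAMPLE)
```

How the MHTML files are stored inside the MongoDB archive or the PostgreSQL dump is not described here, so reading the raw bytes out of either store is left out of the sketch.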


Categories: Other


Free dataset from news, message boards, and blogs about coronavirus (4 months of data, 5.2M posts)

The data covers December 2019 through March 2020. The posts are in English and mention at least one of the following: "Covid" OR CoronaVirus OR "Corona Virus".
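The keyword criterion above can be approximated with a short filter. This is a sketch under stated assumptions: the listing does not say whether matching was case-sensitive or how text was tokenized, so the case-insensitive substring match below, and the names `KEYWORDS` and `mentions_coronavirus`, are illustrative choices rather than the provider's actual query logic.

```python
import re

# Assumption: case-insensitive substring matching of "covid",
# "coronavirus", or "corona virus" (one optional whitespace character).
KEYWORDS = re.compile(r"covid|corona\s?virus", re.IGNORECASE)


def mentions_coronavirus(text: str) -> bool:
    """Return True if the text contains any of the listed keywords."""
    return KEYWORDS.search(text) is not None


# Made-up example posts, not taken from the dataset.
posts = [
    "New COVID-19 cases reported today",
    "The Corona Virus outbreak is spreading",
    "Corona beer sales are up this quarter",
]
matching = [p for p in posts if mentions_coronavirus(p)]
```

Because the match is on substrings, a post mentioning only "Corona" without "virus" is excluded, which mirrors the quoted phrases in the criterion.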

Categories: COVID-19
