A Multimodal Dataset for Bank Failure Prediction in Vietnam

Abstract 

This dataset integrates textual, financial, and macroeconomic indicators to support research on bank failure prediction and financial distress forecasting in Vietnam. It includes financial news from the BKAI News Corpus Dataset (2009–2023) and financial crisis data from "A Dataset for the Vietnamese Banking System (2002–2021)" (Tu Le et al., 2022), covering crisis-related events such as restructuring, special control, mergers, and acquisitions.

The dataset was systematically processed to align textual and financial data by time and entity (banks). Bank mentions in news articles were identified using common spelling variations and abbreviations. Macroeconomic and banking sector indicators were collected from the World Bank and the General Statistics Office of Vietnam. A categorical variable indicates whether a bank was classified as weak based on official State Bank of Vietnam reports.
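
The bank-mention step described above can be sketched as a small alias lookup. This is only an illustration under stated assumptions: the alias table below is hypothetical (the variants actually used to build the dataset are not listed here), and the names `BANK_ALIASES`, `compile_patterns`, and `find_banks` are introduced for this example only.

```python
import re

# Hypothetical alias table for illustration; the real list of spelling
# variants and abbreviations used for this dataset is not documented here.
BANK_ALIASES = {
    "Vietcombank": ["Vietcombank", "VCB"],
    "Sacombank": ["Sacombank", "STB"],
}

def compile_patterns(aliases):
    """Build one case-insensitive regex per bank from its alias list."""
    patterns = {}
    for bank, names in aliases.items():
        # Try longer aliases first so a long name wins over a shorter prefix.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        # The lookarounds stop an alias from matching inside a longer word.
        patterns[bank] = re.compile(
            r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE
        )
    return patterns

def find_banks(text, patterns):
    """Return the banks mentioned in text, ordered by first occurrence."""
    hits = []
    for bank, pattern in patterns.items():
        match = pattern.search(text)
        if match:
            hits.append((match.start(), bank))
    return [bank for _, bank in sorted(hits)]

PATTERNS = compile_patterns(BANK_ALIASES)
mentions = find_banks("Cổ phiếu STB tăng mạnh, trong khi VCB đi ngang.", PATTERNS)
print(mentions)
```

Ordering matches by first occurrence is one possible convention; it is not necessarily how the primary bank of an article was chosen in this dataset.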

This dataset is suitable for time-series analysis and machine learning applications, enabling researchers to explore relationships between financial narratives, macroeconomic factors, and banking crises. It facilitates the development of predictive models, including deep learning architectures, for early warning systems in financial risk management.

Proper attribution to BKAI, Tu Le et al. (2022), the World Bank, and the General Statistics Office of Vietnam is required when using this dataset.

Keywords: Bank failure prediction, multimodal dataset, financial news analysis, CAMELS rating, time-series analysis, machine learning, deep learning, Vietnam banking system.

Instructions: 
Overview

This dataset integrates financial news about 13 Vietnamese banks (2009–2023) with key financial indicators to support research on bank failure prediction and financial distress forecasting. The data are preprocessed and standardized for use in machine learning and statistical analysis. Combining textual and quantitative data enables a multimodal approach to early banking crisis detection in emerging markets.

Structure

The main dataset (processed_data.csv) contains the following columns:

  • id: Unique identifier for each news article.

  • link: URL of the news article.

  • publish: Date and time when the news article was published.

  • text: Content of the news article.

  • year: The year in which the article was published.

  • First_Bank: The primary bank mentioned in the article (or the most important bank if multiple banks are referenced).

  • signal: Binary label indicating financial distress (1 = distressed, 0 = not distressed).
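
A minimal loading sketch with pandas may help when first inspecting these columns. The two rows below are made up so that the example runs on its own, and the timestamp format of `publish` is an assumption; with the real file, the in-memory buffer would be replaced by the path to processed_data.csv.

```python
import io
import pandas as pd

# Made-up rows standing in for processed_data.csv. The column names follow
# the list above; the values and the timestamp format are assumptions.
sample_csv = io.StringIO(
    "id,link,publish,text,year,First_Bank,signal\n"
    "1,https://example.com/a,2012-03-01 08:30:00,Sample article A,2012,BankA,1\n"
    "2,https://example.com/b,2015-07-15 14:00:00,Sample article B,2015,BankB,0\n"
)

# Parse the publication timestamp while reading.
df = pd.read_csv(sample_csv, parse_dates=["publish"])

# Sanity check: the year column should agree with the publication date.
year_matches = (df["publish"].dt.year == df["year"]).all()

# Share of articles labelled as distressed, per primary bank.
distress_share = df.groupby("First_Bank")["signal"].mean()
print(year_matches, distress_share.to_dict())
```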

CAMELS Indicators
  • ETA (Equity-to-Total Assets Ratio): Measures a bank’s financial stability by comparing equity to total assets.

  • LLP (Loan Loss Provision): The amount set aside by banks to cover potential loan defaults.

  • ROA (Return on Assets): A profitability ratio indicating how efficiently a bank uses its assets to generate profit.

  • NIM (Net Interest Margin): Measures the difference between interest income and interest expenses as a percentage of total assets.

  • CIR (Cost-to-Income Ratio): A measure of a bank’s efficiency, calculated as operating costs divided by income.

  • LTD (Liquid Assets Over Total Deposits): The proportion of a bank’s liquid assets relative to its total deposits, indicating liquidity strength. 
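
As a worked illustration of the ratio definitions above, the sketch below computes ETA, ROA, CIR, and LTD from balance-sheet figures. All figures are hypothetical, ROA is taken as net income over total assets (a common convention, assumed here), and whether the dataset stores these ratios as fractions or percentages is not specified, so the scaling is also an assumption.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator

# Hypothetical figures for one bank-year (billion VND), for illustration only.
equity = 8_000.0
total_assets = 100_000.0
net_income = 1_200.0
operating_costs = 1_500.0
operating_income = 3_000.0
liquid_assets = 20_000.0
total_deposits = 80_000.0

eta = ratio(equity, total_assets)               # equity to total assets
roa = ratio(net_income, total_assets)           # return on assets
cir = ratio(operating_costs, operating_income)  # cost to income
ltd = ratio(liquid_assets, total_deposits)      # liquid assets to deposits

print(f"ETA={eta:.3f} ROA={roa:.3f} CIR={cir:.3f} LTD={ltd:.3f}")
```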

Macroeconomic Indicators
  • NPL (Non-Performing Loan Ratio): The overall percentage of non-performing loans in the Vietnamese banking industry.

  • Fsector (Personal Remittances Received): Total personal remittances received, measured in US dollars.

  • CPI (Consumer Price Index Change %): Year-over-year percentage change in the Consumer Price Index, reflecting inflation trends.

  • Gold (Gold Price Change %): Year-over-year percentage change in gold prices, indicating market fluctuations.

  • US dollar (Exchange Rate Change %): Year-over-year percentage change in the exchange rate of the Vietnamese đồng (VND) against the US dollar.

  • Inflat (Inflation Rate Change %): Year-over-year percentage change in inflation, showing overall price level movements.
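
Several of the indicators above are year-over-year percentage changes. A minimal sketch of that calculation with pandas follows, using made-up index levels; the dataset already ships the computed changes, so this only shows how such a column could be derived from a level series.

```python
import pandas as pd

# Made-up annual index levels, for illustration only.
cpi_level = pd.Series([100.0, 110.0, 121.0], index=[2009, 2010, 2011])

# Year-over-year change in percent: (x_t / x_{t-1} - 1) * 100.
# The first year has no predecessor, so its change is NaN.
cpi_change_pct = cpi_level.pct_change() * 100

print(cpi_change_pct.round(2).to_dict())
```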

Categorical Indicators
  • Weak: A binary indicator showing whether the bank was classified as weak (1 = weak, 0 = not weak).

  • Not Weak: A binary indicator that is the complement of Weak (1 = not weak, 0 = weak).
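
Because the two indicators are defined as opposites, they carry the same information, and a model that uses both as features has perfectly collinear inputs. The sketch below, on made-up rows, checks that the columns are complementary and keeps only one of them; this is a suggested preprocessing step, not part of the published dataset.

```python
import pandas as pd

# Made-up rows using the two column names listed above.
flags = pd.DataFrame({"Weak": [1, 0, 0], "Not Weak": [0, 1, 1]})

# The two columns should sum to 1 in every row if they are true complements.
is_complementary = ((flags["Weak"] + flags["Not Weak"]) == 1).all()

# Keep a single indicator to avoid redundant, perfectly collinear features.
features = flags.drop(columns=["Not Weak"])
print(is_complementary, list(features.columns))
```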

Attribution

If you use this dataset, please acknowledge the original sources: