Large vision-language models (LVLMs) suffer from hallucination, generating responses that apparently contradict to the image content occasionally. The key problem lies in its weak ability to comprehend detailed content in multi-modal contexts, which can be mainly attributed its training data. The vision instruction dataset primarily focuses on global description that are highly relevant to the image, with few samples containing image details.