Scene understanding is essential for a wide range of robotic tasks, such as grasping. Simplifying the scene into predefined forms enables a robot to perform its task more reliably, especially in an unknown environment. This paper proposes combining simulation-based and real-world datasets for domain adaptation and for grasping in practical settings. To compensate for the weakness of depth images in clearly representing object boundaries, reported in previous studies, RGB images are also fed as input in the RGB and RGB-D input modalities.