Standard Dataset
Palmer Penguins 100k
- Submitted by: Ifeanyi Idiaye
- Last updated: Wed, 11/13/2024 - 13:01
- DOI: 10.21227/q92n-mr26
Abstract
To give machine learning and data science practitioners a larger dataset for model training, the well-known Palmer Penguins dataset has been expanded from its original 344 rows to 100,000 rows. The additional rows are synthetic, generated with an adversarial random forest technique that preserves the key patterns and features of the original data. The method achieved an accuracy of 88%, indicating that the expanded dataset remains realistic and suitable for classification tasks. With the larger dataset, users can explore more complex models, train more nuanced classifiers, and run broader experiments on penguin data than the 344-row original allowed, including model performance testing, more detailed training procedures, and wider feature exploration. The aim of expanding this widely used dataset is to support innovation and deeper insight within the machine learning community.
To load the scaled Palmer Penguins dataset as a CSV file in both R and Python, follow these steps:
- Locate the CSV File: Make sure the CSV file of the scaled dataset is saved on your computer. Note its file path, as it will be needed to load the data into R and Python.
- Load the Dataset in R:
  - Use R’s read.csv() function to load the dataset by specifying the file path. This function reads the data and stores it as a data frame, a common structure for data manipulation in R.
  - To confirm the data has loaded correctly, you can use the head() function, which displays the first few rows, allowing you to inspect the dataset's columns and content.
- Load the Dataset in Python:
  - In Python, the popular pandas library provides a read_csv() function to load the CSV file. As in R, you specify the file path, and pandas imports the data as a DataFrame, which is well suited to analysis in Python.
  - Preview the data by using the .head() method on the DataFrame. This will display the first few rows, helping you verify that the dataset loaded as expected.
These steps will ensure the scaled Palmer Penguins dataset is ready for further exploration and model training in both R and Python.
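The Python steps above can be sketched as follows. This is a minimal illustration, not the author's own script: the file name palmer_penguins_100k.csv and the helper load_penguins() are hypothetical placeholders, so substitute the path where you actually saved the CSV. Only pandas' read_csv() and the DataFrame .head() method come from the steps described above.

```python
from pathlib import Path

import pandas as pd


def load_penguins(csv_path):
    """Read the CSV at csv_path into a pandas DataFrame.

    Hypothetical helper for illustration; it only wraps pd.read_csv()
    with an explicit check that the file exists.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"CSV file not found: {path}")
    return pd.read_csv(path)


# Assumed file name; replace with the path to your downloaded copy.
CSV_PATH = Path("palmer_penguins_100k.csv")

# Only attempt the load if the file is actually present.
if CSV_PATH.is_file():
    penguins = load_penguins(CSV_PATH)
    print(penguins.head())   # first five rows
    print(penguins.shape)    # (number of rows, number of columns)
```

Printing the shape alongside the preview is a quick way to confirm that the row count matches what you expect from the scaled dataset.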