What questions do you still have about the model and the associated data?
- Licensing & usage rights: What license covers each dataset used? Are there any license restrictions that could affect model training or the use of the model in projects?
- Consent & privacy issues: How were the datasets collected, and where did they come from? Did the collectors obtain consent from the data owners or from the authors who published the content?
- Bias & training process: How do the researchers try to reduce bias during training? What specific methods do they use?
Are there elements you would propose including in the biography?
- Datasheet summary for each dataset: to clarify selection biases and ethical constraints.
- Detailed dataset information: how to access each dataset used.
- Attributions & contributions: to acknowledge those who provided the dataset, helped train the model, cleaned the data, etc.
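As a rough illustration, the three elements proposed above could be recorded in a small structured form. This is only a sketch in Python: the class names (`DatasetEntry`, `Contribution`, `ModelBiography`), the field names, and the `missing_fields` helper are all hypothetical and not part of any existing biography format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetEntry:
    """One dataset used by the model (hypothetical structure)."""
    name: str
    datasheet_summary: str = ""  # selection biases and ethical constraints
    access: str = ""             # how or where the dataset can be accessed


@dataclass
class Contribution:
    """One acknowledgement (hypothetical structure)."""
    contributor: str
    role: str  # e.g. "data provider", "model training", "data cleaning"


@dataclass
class ModelBiography:
    model_name: str
    datasets: List[DatasetEntry] = field(default_factory=list)
    contributions: List[Contribution] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """List the proposed elements that are still empty."""
        gaps = []
        for ds in self.datasets:
            if not ds.datasheet_summary.strip():
                gaps.append(f"{ds.name}: datasheet summary")
            if not ds.access.strip():
                gaps.append(f"{ds.name}: access information")
        if not self.contributions:
            gaps.append("attributions & contributions")
        return gaps


# Placeholder values only; they do not describe a real model or dataset.
bio = ModelBiography(
    model_name="example-model",
    datasets=[DatasetEntry(name="example-dataset",
                           datasheet_summary="placeholder summary")],
)
print(bio.missing_fields())
```

A check like `missing_fields` would make it easy to see which parts of a biography are still undocumented before relying on the model.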
How does understanding the provenance of the model and its data inform your creative process?
- It gives me a perspective for researching a model more carefully before I use it, so I can avoid models that carry a risk of privacy violations, labor exploitation, or violations of consent.
- I can turn these downsides into a project that showcases the biases and problems involved in training these models, to help people better understand the risks behind these technologies and to call for attention and improvement in the relevant field.
- I would carefully design the dataset and the collection process if I ever find it necessary to train a model of my own.