The Data Audit
The purpose of the data audit is to understand the datasets you are using and to make sure that your data is reusable and your work is reproducible. It is also a mechanism for understanding the origins of the data you’re working with, which can help you uncover bias. Specifically, your data audit should address the following (adapted from the Mozilla Science Guide):
- What
  - What is the title of the dataset? Provide as much context as possible, including a proper citation for the dataset where possible (you can get more information about citing datasets from the Tufts Library Research Guide).
  - Are there any relevant research publications related to the dataset? This can include datasets that were used to create your dataset, or briefs written in support of the dataset.
- Who
  - Who is responsible for the data? This can include PIs, research groups, or institutions that contributed to the data collection. This might also include specific contact information for a person who can answer questions about the data.
  - Who else can use this data? If possible, identify the specific license assigned to the dataset.
- Where
  - Where was the data collected? This can include multiple geographic locations.
  - Where does the data live now?
- When
  - When was the data collected? When writing about time, be sure to use ISO format (YYYY-MM-DD hh:mm:ss) and be as specific as possible.
  - What timespan does the data cover?
- How
  - How was the data collected? What were the steps and instruments used to collect the data?
  - How was the data processed? What steps were taken to clean the data, how were null values handled, and what pre-processing has been done?
- What else? Who else?
  - What or who might be missing in the data?
  - What do you still not know about the process of creating the dataset after answering the questions above?
  - Do you have any other questions or concerns about the data?
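If your dataset is tabular, a short script can help you answer the "When" and "How" questions above. The following is a minimal sketch in Python, not a required method: the sample data, the column names (`station`, `collected_at`, `temperature_c`), the `NULL_MARKERS` set, and the `audit_rows` function are all hypothetical and chosen only for illustration. It reports the timespan covered, written in the YYYY-MM-DD hh:mm:ss format, and counts the missing values in each column.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data; replace with your own file and column names.
SAMPLE_CSV = """station,collected_at,temperature_c
A,2021-03-01 08:30:00,4.2
B,2021-03-02 14:05:00,
A,2021-03-05 09:15:00,NA
"""

# Strings treated as missing in this sketch (an assumption; adjust as needed).
NULL_MARKERS = {"", "NA", "NaN", "null"}


def audit_rows(rows, time_column):
    """Return the timespan covered and a per-column count of missing values."""
    timestamps = []
    missing = {}
    for row in rows:
        for column, value in row.items():
            if column is None:
                # Extra fields beyond the header; ignored in this sketch.
                continue
            if value is None or value.strip() in NULL_MARKERS:
                missing[column] = missing.get(column, 0) + 1
            elif column == time_column:
                # Parses timestamps such as "2021-03-01 08:30:00".
                timestamps.append(datetime.fromisoformat(value.strip()))
    return {
        # sep=" " writes a space instead of "T" between date and time.
        "start": min(timestamps).isoformat(sep=" ") if timestamps else None,
        "end": max(timestamps).isoformat(sep=" ") if timestamps else None,
        "missing": missing,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = audit_rows(rows, "collected_at")
print(summary)
```

For the sample data, the summary shows a timespan from 2021-03-01 to 2021-03-05 and two missing temperature values. A count like this only tells you how many values are missing; you would still need to find out from the data's documentation or its maintainers why they are missing and how they were handled.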
The data audit can be submitted as a bulleted list for now, but keep in mind that it will eventually be transformed into narrative form and included as a section of your final paper called “Data Overview.”