Long (2009) conceptualizes data verification as having four dimensions.
1. Values review: The process of assessing "whether the values held by each variable are appropriate and whether all values that should be represented are found in your data" (p. 211).
2. Substantive review: The process of checking values in terms of their substantive meaning.
3. Missing data review: The process of confirming that missing values and skip patterns are coded correctly.
4. Internal consistency review: The process of "checking that responses are consistent across variables...[in order to] uncover subtle problems with the data" (p. 211).
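The four reviews above might be sketched in Stata roughly as follows. This is only an illustration, not Long's own code: the nlsw88 example dataset that ships with Stata, the chosen variables, and the age bounds are all assumptions made for the example.

```stata
* Illustrative sketch only; dataset, variables, and bounds are assumptions
sysuse nlsw88, clear

* 1. Values review: are all expected categories present, and no unexpected ones?
tabulate race, missing

* 2. Substantive review: do the values make sense for what is measured?
* (the 30-50 age window is an assumed plausible range for this sample)
count if !inrange(age, 30, 50) & !missing(age)

* 3. Missing data review: how many missing values does each variable have?
misstable summarize

* 4. Internal consistency review: a currently married respondent
*    should not also be flagged as never married
count if married == 1 & never_married == 1
```

A nonzero count in step 2 or step 4 does not prove an error, but it identifies cases worth inspecting by hand.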
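describe (abbreviated d), together with the closely related ds command, lets users quickly view characteristics of a given dataset, including variable types (numeric, string, etc.), variable formats, and "whether variable labels, value labels, and characteristics...[are] attached" (p. 448). Note that ds is a separate command rather than an abbreviation of describe: it lists variable names compactly and can restrict the list to variables with specified properties. summarize is also useful for viewing key characteristics of the data, especially with the detail option, which adds percentiles, the smallest and largest values, skewness, and kurtosis.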
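A minimal sketch of these commands, assuming Stata's bundled auto dataset (the source does not say which dataset was used):

```stata
* Assumes the auto example dataset that ships with Stata
sysuse auto, clear

* Storage type, display format, value label, and variable label for every variable
describe

* The same information for a subset of variables
describe price mpg foreign

* Basic summary statistics, then the extended set from the detail option
summarize price
summarize price, detail
```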
For example, with ds one can retrieve a list of the numeric variables in a dataset by using the has() option with the type keyword followed by the variable type, as in ds, has(type numeric).
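As an illustration, again assuming the bundled auto dataset:

```stata
* Assumes the auto example dataset that ships with Stata
sysuse auto, clear

* List only the numeric variables
ds, has(type numeric)

* List only the string variables
ds, has(type string)
```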
The has() option can also be used to list the variables that have value labels attached, as in ds, has(vallabel).
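A short sketch under the same assumption (the auto dataset); the not() option is shown as the natural complement:

```stata
* Assumes the auto example dataset that ships with Stata
sysuse auto, clear

* Variables that have a value label attached
ds, has(vallabel)

* Variables that do not have a value label attached
ds, not(vallabel)
```

Comparing the two lists is a quick way to spot categorical variables that were never labeled.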
ds, detail allows Stata users to view variable storage types, formats, and labels for all of the variables in the dataset.
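For instance, assuming the auto dataset:

```stata
* Assumes the auto example dataset that ships with Stata
sysuse auto, clear

* Storage type, format, and labels for all variables
ds, detail

* The same, restricted to a few variables
ds price mpg rep78, detail
```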
codebook, compact gives users a quick overview of the variables in the dataset, with one line per variable showing summary statistics and the variable label. Users can detect problems by, for example, comparing each variable's minimum and maximum against its valid range.
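A minimal example, assuming the auto dataset:

```stata
* Assumes the auto example dataset that ships with Stata
sysuse auto, clear

* One line per variable: observations, unique values, mean, min, max, label
codebook, compact

* Restrict the overview to selected variables
codebook rep78 foreign, compact
```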
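Long (2009) recommends using inspect, which produces a small text histogram along with counts of negative, zero, positive, and missing values. Because the histogram is drawn in text rather than as a graph, it is saved in the user's log file with the rest of the output.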
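A sketch of inspect with logging turned on; the auto dataset, the variable mpg, and the log file name are assumptions for illustration:

```stata
* Assumes the auto example dataset; the log file name is hypothetical
sysuse auto, clear

* Open a plain-text log so the text histogram is captured
log using verify_inspect.log, text replace

* Text histogram plus counts of negative, zero, positive, and missing values
inspect mpg

log close
```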
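When verifying and cleaning data, tab1 offers an advantage over tab. Specifically, tab1 accepts a variable list and produces a separate one-way frequency table for each variable, whereas tab given two variables produces a single two-way table.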
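The difference can be seen in a short sketch, assuming the auto dataset:

```stata
* Assumes the auto example dataset that ships with Stata
sysuse auto, clear

* tab1: one separate one-way table per variable, including missing values
tab1 rep78 foreign, missing

* tabulate with two variables: a single two-way cross-tabulation
tabulate rep78 foreign, missing
```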
dotplot and stem lend themselves well to checking variables with many values. This graph was created using the command dotplot mpg.
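The dotplot command can be reproduced roughly as follows. The source does not name the dataset; this sketch assumes the bundled auto dataset, which contains a variable called mpg.

```stata
* Assumes the auto example dataset, which contains mpg
sysuse auto, clear

* Dot plot of the distribution of mpg
dotplot mpg

* Optional: separate columns of dots for each category of foreign
dotplot mpg, over(foreign)
```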
Here is an example using the stem command, which prints to the user's log file. This graph allows researchers to easily observe spikes in the data, which may warrant further investigation.
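A minimal stem example; the choice of the auto dataset and the variable mpg is an assumption, since the source does not say which variable was plotted:

```stata
* Assumes the auto example dataset; the variable choice is illustrative
sysuse auto, clear

* Text stem-and-leaf display, which is captured in the log like other output
stem mpg
```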