Long (2009) conceptualizes data verification as having four dimensions.


1. Values review: The process of assessing "whether the values held by each variable are appropriate and whether all values that should be represented are found in your data" (p. 211).

2. Substantive review: The process of checking values in terms of their substantive meaning.

3. Missing data review: The process of ensuring that missing data and skip patterns are correctly coded.

4. Internal consistency review: The process of "checking that responses are consistent across variables...[in order to] uncover subtle problems with the data" (p. 211).


describe is a helpful command that lets users quickly view characteristics of a given dataset, including variable types (numeric, string, etc.), variable formats, and "whether variable labels, value labels, and characteristics...[are] attached" (p. 448). The closely related ds command lists the variables in the dataset and can filter them by these characteristics. summarize is also a useful command for viewing key characteristics of the data, especially when used with the detail option.
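As a quick illustration (these examples assume Stata's bundled auto dataset, loaded with sysuse auto):

```stata
* load Stata's example dataset
sysuse auto, clear

* storage types, display formats, and variable/value labels
describe

* detailed summary statistics, including percentiles and skewness
summarize mpg price, detail
```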


For example, one can retrieve a list of the numeric variables in a dataset by using ds with the has() option followed by the variable type.
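With the auto dataset loaded, for instance:

```stata
* list all numeric variables in the dataset
ds, has(type numeric)
```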




The has() option can also be used to list the variables that have value labels attached.
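For example:

```stata
* list variables that have value labels attached
ds, has(vallabel)
```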




ds, detail allows Stata users to view variable storage types, formats, and labels for all of the variables in the dataset.
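For example:

```stata
* storage type, display format, and label for every variable
ds, detail
```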


codebook, compact gives users a quick overview of the variables in the dataset along with their corresponding summary statistics and labels. Users can detect problems with their data by checking the variables' ranges, for example.
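For example:

```stata
* one line per variable: unique values, mean, range, and label
codebook, compact
```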


Long (2009) recommends using inspect, which produces a small text-based histogram for each variable. Because the histogram is rendered as text rather than as a graph, it is captured in the user's log file.
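For example, with the auto dataset loaded:

```stata
* text histogram of mpg, plus counts of negative, zero,
* positive, and missing values; appears in the log file
inspect mpg
```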


When verifying and cleaning data, tab1 offers an advantage over tab: tab1 accepts a variable list and produces a one-way frequency table for each variable in it.
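For example, with the auto dataset loaded:

```stata
* one-way frequency tables for each listed variable
tab1 rep78 foreign, missing
```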


dotplot and stem lend themselves well to checking variables that take many values. A dot plot of mileage, for example, can be created with the command dotplot mpg.


The stem command produces a stem-and-leaf display; because the output is text, it prints to the user's log file. This display allows researchers to easily observe spikes in the data, which may warrant further investigation.
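For example, with the auto dataset loaded:

```stata
* stem-and-leaf display of mpg, printed as text to the log
stem mpg
```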


All of the above commands are useful for substantively reviewing the imported data. During this phase of data verification, researchers should think carefully about the relationships between the variables and whether these relationships are accurately represented in the data. scatter provides yet another visual means of verifying and familiarizing oneself with recently imported data; a scatter plot of mileage against price, for example, can be created with the command scatter mpg price.


Formatting options such as jitter() and msymbol(circle_hollow) allow the user to more easily see individual cases within the dataset. Such a scatter plot can be created with scatter mpg price, msymbol(circle_hollow) jitter(8).


Stata recognizes the period symbol . as system missing. It is important to note that in numeric comparisons Stata treats system missing as larger than any nonmissing value, so it behaves like positive infinity. Stata also features 26 extended missing values (.a through .z), which sort above system missing and are useful for distinguishing between different types of missing data (e.g., refused, not applicable).


In order to retrieve a count of missing values for a given variable, use the tab command in combination with the missing option.
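For example, with the auto dataset loaded (rep78 contains missing values):

```stata
* frequency table of rep78, with missing values shown as a category
tab rep78, missing
```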


When recoding variables with conditional statements, remember that Stata treats missing values as larger than any nonmissing number. By consistently using <. as the upper boundary in a conditional statement, users can avoid accidentally recoding missing values.
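A minimal sketch of this pattern, using the auto dataset (the 6,000-dollar threshold is arbitrary, chosen only for illustration):

```stata
* recode price into a binary indicator without touching missings;
* "price < ." excludes both system and extended missing values
generate byte price_hi = .
replace price_hi = 0 if price < 6000
replace price_hi = 1 if price >= 6000 & price < .
```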


Another method for reviewing missing values involves creating indicator variables that flag whether each observation is missing on a given variable.
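One possible sketch of this strategy, using rep78 from the auto dataset (the indicator's name is illustrative):

```stata
* flag observations where rep78 is missing (1 = missing, 0 = observed)
generate byte m_rep78 = missing(rep78)
tab m_rep78
```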


Throughout this process, it can be helpful to attach notes to your variables using notes varname: text.
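For example (the note text is illustrative):

```stata
* attach a note to price, then display the notes for that variable
notes price: range verified against the source codebook
notes price
```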





Long (2009) suggests thinking about "logical links among variables" and using assert to test these links. In this way, researchers can detect internal inconsistencies in their data.


For example, one would expect that no cars in this dataset have a price of zero dollars. Furthermore, the maximum value of price is slightly over 15,000, so no observations should have values of 50,000 or more for this variable. When an assertion holds, Stata does not provide output. When an assertion fails, Stata reports the number of contradictions and exits with an error.
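These expectations can be tested directly:

```stata
* silent when true; reports contradictions and errors out otherwise
assert price > 0
assert price < 50000
```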


Long (2009) recommends using the commands tab, compare, and inspect, among others, to investigate internal inconsistencies. In some cases, researchers may generate new variables to assist in this process.





Before saving the data, Long (2009) points out that researchers may want to do one or more of the following (p. 261):


1. Drop variables and/or observations (e.g., drop varname or keep varlist)

2. Create new variables

3. Rearrange variables within the dataset

4. Add internal documentation that indicates when and how the dataset was created

5. Compress the data to minimize the file size (i.e., compress)

6. Run diagnostics to look for data problems (see below)

7. Add a data signature to guard against inadvertent changes to the data (i.e., datasignature set)


Users can save their data using the save filename, replace command.


Furthermore, a few simple commands can help users diagnose problems. codebook, problems detects "variables that have no variation," "variables with nonexisting value labels," and "incompletely labeled variables" (p. 265).


isid varlist checks that the variables in varlist uniquely identify the observations, although users may sometimes expect some IDs to be non-unique (e.g., in the case of panel data, where the combination of panel and time identifiers should be checked instead). isid produces no output when the check passes and an error when duplicate IDs are detected. duplicates is a similar but more general command that detects duplicate observations in the dataset.
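For example, in the auto dataset make should uniquely identify each car:

```stata
* errors out if make does not uniquely identify the observations
isid make

* tabulate how many observations are exact duplicates of another
duplicates report
```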


Finally, a data signature can help researchers when they're collaborating on a project and need to ensure that they're working with the most recent version of a dataset.
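A minimal sketch of this workflow:

```stata
* store a checksum of the current data in the dataset
datasignature set

* later, after reloading: errors out if the data have changed
datasignature confirm
```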