Getting Started

An overview of the usage of ReadStatTables.jl is provided below. For instructions on installation, see Installation.

Reading a Data File

Suppose we have a Stata .dta file located at data/sample.dta. To read this file into Julia:

julia> using ReadStatTables

julia> tb = readstat("data/sample.dta")
5×7 ReadStatTable:
 Row │  mychar    mynum      mydate                dtime         mylabl        ⋯
     │ String3  Float64       Date?            DateTime?  Labeled{Int8}  Label ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │       a      1.1  2018-05-06  2018-05-06T10:10:10           Male        ⋯
   2 │       b      1.2  1880-05-06  1880-05-06T10:10:10         Female        ⋯
   3 │       c  -1000.3  1960-01-01  1960-01-01T00:00:00           Male        ⋯
   4 │       d     -1.4  1583-01-01  1583-01-01T00:00:00         Female        ⋯
   5 │       e   1000.3     missing              missing           Male        ⋯
                                                               2 columns omitted

Here is how we read the above result:[1]

  • Variable names from the data file are displayed in the first row.
  • Element type of each variable is displayed below the corresponding variable name.
  • The values of each variable are displayed column-wise starting from the third row.

Some additional details to be noted:

  • If a variable contains any missing value, there is a question mark ? in the displayed element type.
  • By default, every kind of missing value in the data file is read as missing, Julia's special value for missing data.
  • The date and time values have been translated into Date and DateTime respectively.[2]
  • Labels instead of the numeric values are displayed for variables with value labels.
  • Labeled{Int8} is an abbreviation for LabeledValue{Int8}.
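
As a brief illustration of how such columns behave in Julia, the sketch below builds by hand a small vector resembling mydate (the values are illustrative and are not read from any file) and applies a few Base functions for handling missing. It relies only on Base and the Dates standard library; the variable names are chosen for this example.

```julia
using Dates

# A column with one missing entry, mimicking `mydate` (illustrative values).
mydate = [Date(2018, 5, 6), Date(1880, 5, 6), missing]

# The element type allows `missing`; tables display such a type as `Date?`.
T = eltype(mydate)

# Count the missing entries.
nmissing = count(ismissing, mydate)

# Drop the missing entries and collect the rest into a plain vector.
observed = collect(skipmissing(mydate))

# Replace each missing entry with a placeholder value.
filled = coalesce.(mydate, Date(1960, 1, 1))
```

Whether to drop or replace missing entries depends on the analysis; skipmissing leaves the original vector untouched, while coalesce. allocates a new one.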

Accessing Individual Objects

A vector of all variable names can be obtained as follows:

julia> columnnames(tb)
7-element Vector{Symbol}:
 :mychar
 :mynum
 :mydate
 :dtime
 :mylabl
 :myord
 :mytime

To retrieve the array containing data for a specific variable:

julia> tb.mylabl
5-element LabeledVector{Union{Missing, Int8}, Vector{Union{Missing, Int8}}, Union{Char, Int32}}:
 1 => Male
 2 => Female
 1 => Male
 2 => Female
 1 => Male

Note

The returned array is exactly the same array holding the data for the table. Therefore, modifying elements in the returned array will also change the data in the table. To avoid such changes, please copy the array first.

Metadata for the data file can be accessed from tb using methods that are compatible with DataAPI.jl.

julia> metadata(tb)
ReadStatMeta:
  row count           => 5
  var count           => 7
  modified time       => 2021-04-22T21:36:00
  file format version => 118
  file label          => A test file
  file extension      => .dta

julia> colmetadata(tb)
ColMetaIterator{ReadStatColMeta} with 7 entries:
  :mychar => ReadStatColMeta(character, %-1s)
  :mynum  => ReadStatColMeta(numeric, %16.2f)
  :mydate => ReadStatColMeta(date, %td)
  :dtime  => ReadStatColMeta(datetime, %tc)
  :mylabl => ReadStatColMeta(labeled, %16.0f)
  :myord  => ReadStatColMeta(ordinal, %16.0f)
  :mytime => ReadStatColMeta(time, %tcHH:MM:SS)

julia> colmetadata(tb, :myord)
ReadStatColMeta:
  label         => ordinal
  format        => %16.0f
  type          => READSTAT_TYPE_INT8
  value label   => myord
  storage width => 1
  display width => 16
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_RIGHT

Type Conversions

The interface provided by ReadStatTables.jl covers basic tasks. When more complicated operations are needed, the objects can easily be converted into other types.

Converting ReadStatTable

The table returned by readstat is a ReadStatTable. Converting a ReadStatTable to another table type is easy, thanks to the widely supported Tables.jl interface.

For example, to convert a ReadStatTable to a DataFrame from DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(tb)
5×7 DataFrame
 Row │ mychar     mynum  mydate      dtime                mylabl     myord      ⋯
     │ String3  Float64  Date?       DateTime?            LabeledV…  LabeledV…  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ a            1.1  2018-05-06  2018-05-06T10:10:10  Male       low        ⋯
   2 │ b            1.2  1880-05-06  1880-05-06T10:10:10  Female     medium
   3 │ c        -1000.3  1960-01-01  1960-01-01T00:00:00  Male       high
   4 │ d           -1.4  1583-01-01  1583-01-01T00:00:00  Female     low
   5 │ e         1000.3  missing     missing              Male       missing    ⋯
                                                                1 column omitted

Metadata contained in a ReadStatTable are preserved in the converted DataFrame when working with DataFrames.jl v1.4.0 or above, which supports the same DataAPI.jl interface for metadata:

julia> metadata(df)
Dict{String, Any} with 13 entries:
  "file_ext"             => ".dta"
  "modified_time"        => DateTime("2021-04-22T21:36:00")
  "file_format_version"  => 118
  "file_format_is_64bit" => true
  "table_name"           => ""
  "notes"                => String[]
  "file_encoding"        => ""
  "file_label"           => "A test file"
  "var_count"            => 7
  "row_count"            => 5
  "creation_time"        => DateTime("2021-04-22T21:36:00")
  "endianness"           => READSTAT_ENDIAN_LITTLE
  "compression"          => READSTAT_COMPRESS_NONE

julia> colmetadata(df, :myord)
Dict{String, Any} with 8 entries:
  "label"         => "ordinal"
  "format"        => "%16.0f"
  "display_width" => 16
  "measure"       => READSTAT_MEASURE_UNKNOWN
  "alignment"     => READSTAT_ALIGNMENT_RIGHT
  "type"          => READSTAT_TYPE_INT8
  "storage_width" => 0x0000000000000001
  "vallabel"      => :myord

Converting LabeledArray

Variables with value labels are stored in LabeledArrays. To convert a LabeledArray to another array type, we may either obtain an array of LabeledValues or collect the values and labels separately. The data values can be directly retrieved by calling refarray:

julia> refarray(tb.mylabl)
5-element Vector{Union{Missing, Int8}}:
 1
 2
 1
 2
 1

Note

The array returned by refarray is exactly the same array underlying the LabeledArray. Therefore, modifying the elements of the array will also mutate the values in the associated LabeledArray.

If only the value labels are needed, we can obtain an iterator of the value labels via valuelabels. For example, to convert a LabeledArray to a CategoricalArray from CategoricalArrays.jl:

julia> using CategoricalArrays

julia> CategoricalArray(valuelabels(tb.mylabl))
5-element CategoricalArray{String,1,UInt32}:
 "Male"
 "Female"
 "Male"
 "Female"
 "Male"

It is also possible to only convert the type of the underlying data values:

julia> convertvalue(Int32, tb.mylabl)
5-element LabeledVector{Int32, Vector{Int32}, Union{Char, Int32}}:
 1 => Male
 2 => Female
 1 => Male
 2 => Female
 1 => Male

ReadStatTables.convertvalue - Function
convertvalue(T, x::LabeledArray)

Convert the type of data values contained in x to T. This method is equivalent to convert(AbstractArray{LabeledValue{T, K}, N}, x).

Writing a Data File

To write a table to a supported data file format:

julia> # Create a data frame for illustration
       df = DataFrame(readstat("data/alltypes.dta")); emptycolmetadata!(df)
3×9 DataFrame
 Row │ vbyte      vint       vlong      vfloat     vdouble    vstr     vstrL   ⋯
     │ LabeledV…  LabeledV…  LabeledV…  LabeledV…  LabeledV…  String3  String  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ A          A          A          A          A          ab       This is ⋯
   2 │ missing    missing    missing    missing    missing
   3 │ missing    missing    missing    missing    missing
                                                               3 columns omitted

julia> out = writestat("data/write_alltypes.dta", df)
3×9 ReadStatTable:
 Row │          vbyte             vint            vlong             vfloat     ⋯
     │ Labeled{Int8?}  Labeled{Int16?}  Labeled{Int32?}  Labeled{Float32?}  La ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │              A                A                A                  A     ⋯
   2 │        missing          missing          missing            missing     ⋯
   3 │        missing          missing          missing            missing     ⋯
                                                               5 columns omitted

The returned table out contains the actual data (including metadata) that are exposed to the writer.

Value labels attached to a LabeledArray are always preserved in the output file. If the input table contains any column of type CategoricalArray or PooledArray, value labels are created and written automatically by default:

julia> using PooledArrays

julia> df[!,:vbyte] = CategoricalArray(valuelabels(df.vbyte))
3-element CategoricalArray{String,1,UInt32}:
 "A"
 "missing"
 "missing"

julia> df[!,:vint] = PooledArray(valuelabels(df.vint))
3-element PooledVector{String, UInt32, Vector{UInt32}}:
 "A"
 "missing"
 "missing"

julia> out = writestat("data/write_alltypes.dta", df)
3×9 ReadStatTable:
 Row │           vbyte             vint            vlong             vfloat    ⋯
     │ Labeled{Int32?}  Labeled{Int32?}  Labeled{Int32?}  Labeled{Float32?}  L ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │               A                A                A                  A    ⋯
   2 │         missing          missing          missing            missing    ⋯
   3 │         missing          missing          missing            missing    ⋯
                                                               5 columns omitted

Notice that in the returned table, the columns vbyte and vint are LabeledArrays:

julia> out.vbyte
3-element LabeledVector{Union{Missing, Int32}, SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}, Union{Char, Int32}}:
 1 => A
 2 => missing
 2 => missing

julia> out.vint
3-element LabeledVector{Union{Missing, Int32}, SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}, Union{Char, Int32}}:
 1 => A
 2 => missing
 2 => missing

It is possible to specify the format of certain variables. This can be important, for example, for variables representing date/time. In the above example, a default format has been selected:

julia> out.vdate
3-element mappedarray(ReadStatTables.Num2DateTime{Date, Dates.Day}(Date("1960-01-01"), Dates.Day(1)), ReadStatTables.DateTime2Num{ReadStatTables.Num2DateTime{Date, Dates.Day}}(ReadStatTables.Num2DateTime{Date, Dates.Day}(Date("1960-01-01"), Dates.Day(1))), ::SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}) with eltype Union{Missing, Date}:
 1960-01-02
 missing
 missing

julia> colmetadata(out, :vdate, "format")
"%td"

To specify a different format, a convenient approach is to specify the varformat keyword argument:

julia> out2 = writestat("data/write_alltypes.dta", df, varformat=Dict(:vdate=>"%tm"));

julia> out2.vdate
3-element mappedarray(ReadStatTables.Num2DateTime{Date, Dates.Month}(Date("1960-01-01"), Dates.Month(1)), ReadStatTables.DateTime2Num{ReadStatTables.Num2DateTime{Date, Dates.Month}}(ReadStatTables.Num2DateTime{Date, Dates.Month}(Date("1960-01-01"), Dates.Month(1))), ::SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}) with eltype Union{Missing, Date}:
 1960-01-01
 missing
 missing

julia> colmetadata(out2, :vdate, "format")
"%tm"

Notice that since the format "%tm" is for months, the day within a month has been ignored and becomes 1 for the printed value of Date in the above example.

Info

When a variable has a Stata format "%tw", "%tm", "%tq" or "%th", a displayed Date is always the first day within the corresponding Stata period. In particular, Stata counts week numbers starting from the first day of each year and hence a displayed Date may not correspond to the first day of a calendar week.
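
The mapping from a Stata period value to its first day can be sketched with the Dates standard library. The function stata_period_start below is hypothetical and is not part of ReadStatTables.jl; it only assumes Stata's conventions that period values count from the start of 1960 and that "%tw" uses exactly 52 weeks per year, with week 1 beginning on January 1.

```julia
using Dates

# Stata's reference date for date and period values.
const STATA_EPOCH = Date(1960, 1, 1)

# Hypothetical helper: first day of the period encoded by the integer `n`
# under the Stata format `fmt`.
function stata_period_start(n::Integer, fmt::AbstractString)
    if fmt == "%td"          # days since 1960-01-01
        return STATA_EPOCH + Day(n)
    elseif fmt == "%tm"      # months since 1960m1
        return STATA_EPOCH + Month(n)
    elseif fmt == "%tq"      # quarters since 1960q1
        return STATA_EPOCH + Month(3n)
    elseif fmt == "%th"      # half-years since 1960h1
        return STATA_EPOCH + Month(6n)
    elseif fmt == "%tw"      # weeks since 1960w1, 52 weeks per year
        year = 1960 + fld(n, 52)
        week = mod(n, 52)    # zero-based week within the year
        return Date(year, 1, 1) + Day(7 * week)
    else
        throw(ArgumentError("unsupported format: $fmt"))
    end
end
```

For instance, week 0 under "%tw" maps to 1960-01-01, which was a Friday rather than the start of a calendar week.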

Warning

The write support is experimental and requires further testing. Caution should be taken when writing the data files.

More Options

The behavior of readstat can be adjusted by passing keyword arguments:

ReadStatTables.readstat - Function
readstat(filepath; kwargs...)

Return a ReadStatTable that collects data (including metadata) from a supported data file located at filepath.

Supported File Formats

  • Stata: .dta
  • SAS: .sas7bdat and .xpt
  • SPSS: .sav and .por

Keywords

  • ext = lowercase(splitext(filepath)[2]): extension of data file for choosing the parser.
  • usecols::Union{ColumnSelector, Nothing} = nothing: only collect data from the specified columns (variables); collect all columns if usecols=nothing.
  • row_limit::Union{Integer, Nothing} = nothing: restrict the total number of rows to be read; read all rows if row_limit=nothing.
  • row_offset::Integer = 0: skip the specified number of rows.
  • ntasks::Union{Integer, Nothing} = nothing: number of tasks spawned to read data file in concurrent chunks with multiple threads; with ntasks being nothing or smaller than 1, select a default value based on the size of data file and the number of threads available (Threads.nthreads()); not applicable to .xpt and .por files where row count is unknown from metadata.
  • apply_value_labels::Bool = true: apply value labels to the associated columns.
  • inlinestring_width::Integer = ext ∈ (".sav", ".por") ? 0 : 32: use a fixed-width string type that can be stored inline for any string variable with width below inlinestring_width and pool_width; a non-positive value avoids using any inline string type; not recommended for SPSS files.
  • pool_width::Integer = 64: only attempt to use PooledArray for string variables with width of at least 64.
  • pool_thres::Integer = 500: do not use PooledArray for string variables if the number of unique values exceeds pool_thres; a non-positive value avoids using PooledArray.
  • file_encoding::Union{String, Nothing} = nothing: manually specify the file character encoding; needs to be an iconv-compatible name.
  • handler_encoding::Union{String, Nothing} = nothing: manually specify the handler character encoding; defaults to UTF-8.
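
A few of these keywords can be combined as in the sketch below. The helper names default_ext and read_window are hypothetical and exist only for this illustration; the sketch also assumes that a vector of Symbols is an accepted value for usecols.

```julia
using ReadStatTables

# The default parser choice mirrors the default value of `ext`:
# the lowercased extension of the file path.
default_ext(path::AbstractString) = lowercase(splitext(path)[2])

# Hypothetical helper: skip `skip` rows, then read at most `nrows` rows,
# keeping only the columns in `cols` (all columns if `cols` is nothing).
function read_window(path::AbstractString; cols = nothing, skip::Integer = 0, nrows = nothing)
    return readstat(path; usecols = cols, row_offset = skip, row_limit = nrows)
end

# Usage with the sample file from this page (assumes the file exists):
# tb = read_window("data/sample.dta"; cols = [:mychar, :mynum], skip = 1, nrows = 3)
```

Passing ext explicitly is only needed when the extension of the file name does not reflect the actual file format.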

The accepted types of values for selecting certain variables (data columns) are shown below:

File-level metadata can be obtained without reading the entire data file:

ReadStatTables.readstatmeta - Function
readstatmeta(filepath; kwargs...)

Return a ReadStatMeta that collects file-level metadata without reading the full data from a supported data file located at filepath. See also readstatallmeta.

Supported File Formats

  • Stata: .dta
  • SAS: .sas7bdat and .xpt
  • SPSS: .sav and .por

Keywords

  • ext = lowercase(splitext(filepath)[2]): extension of data file for choosing the parser.
  • file_encoding::Union{String, Nothing} = nothing: manually specify the file character encoding; needs to be an iconv-compatible name.
  • handler_encoding::Union{String, Nothing} = nothing: manually specify the handler character encoding; defaults to UTF-8.

To additionally collect variable-level metadata and all value labels:

ReadStatTables.readstatallmeta - Function
readstatallmeta(filepath; kwargs...)

Return all metadata including value labels without reading the full data from a supported data file located at filepath. The four returned objects are for file-level metadata, variable names, variable-level metadata and value labels respectively. See also readstatmeta.

Supported File Formats

  • Stata: .dta
  • SAS: .sas7bdat and .xpt
  • SPSS: .sav and .por

Keywords

  • ext = lowercase(splitext(filepath)[2]): extension of data file for choosing the parser.
  • usecols::Union{ColumnSelector, Nothing} = nothing: only collect variable-level metadata from the specified columns (variables); collect all columns if usecols=nothing.
  • file_encoding::Union{String, Nothing} = nothing: manually specify the file character encoding; needs to be an iconv-compatible name.
  • handler_encoding::Union{String, Nothing} = nothing: manually specify the handler character encoding; defaults to UTF-8.
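
Since readstatallmeta returns four objects, they can be unpacked in a single assignment. The sketch below wraps this in a hypothetical helper, describe_file, whose name and returned field names are chosen for illustration only.

```julia
using ReadStatTables

# Hypothetical helper: gather all metadata of a file without reading its data.
function describe_file(path::AbstractString)
    # The order follows the docstring: file-level metadata, variable names,
    # variable-level metadata and value labels.
    meta, varnames, colmeta, vallabels = readstatallmeta(path)
    return (meta = meta, varnames = varnames, colmeta = colmeta, vallabels = vallabels)
end

# Usage with the sample file from this page (assumes the file exists):
# info = describe_file("data/sample.dta")
# info.varnames
```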

For writing tables to data files, one may gain more control by first converting a table to a ReadStatTable:

ReadStatTables.writestat - Function
writestat(filepath, table; ext = lowercase(splitext(filepath)[2]), kwargs...)

Write a Tables.jl-compatible table to filepath as a data file supported by ReadStat. File format is determined based on the extension contained in filepath and may be overridden by the ext keyword.

Any user-provided table is converted to a ReadStatTable first before being handled by a ReadStat writer. Therefore, to gain fine-grained control over the content to be written, especially for metadata, one may directly work with a ReadStatTable (possibly converted from another table type such as DataFrame from DataFrames.jl) before passing it to writestat. Alternatively, one may pass any keyword argument accepted by a constructor of ReadStatTable to writestat. The actual ReadStatTable handled by the writer is returned after the writer finishes.

Supported File Formats

  • Stata: .dta
  • SAS: .sas7bdat and .xpt (Note: SAS may not recognize the produced .sas7bdat files due to a known limitation with ReadStat.)
  • SPSS: .sav and .por

Conversion

For data values, Julia objects are converted to the closest ReadStat type for either numerical values or strings. However, depending on the file format of the output file, a data column may be written in a different type when the closest ReadStat type is not supported.

For metadata, if the user-provided table is not a ReadStatTable, an attempt will be made to collect table-level or column-level metadata with a key that matches a metadata field in ReadStatMeta or ReadStatColMeta via the metadata and colmetadata interface defined by DataAPI.jl. If the table is a ReadStatTable, then the associated metadata will be written as long as their values are compatible with the format of the output file. Value labels associated with a LabeledArray are always preserved even when the name of the dictionary of value labels is not specified in metadata (column name will be used by default). If a column is of an array type that makes use of DataAPI.refpool (e.g., CategoricalArray and PooledArray), value labels will be generated automatically by default (with keyword refpoolaslabel set to be true) and the underlying numerical reference values instead of the values returned by getindex are written to files (with value labels attached).

ReadStatTables.ReadStatTable - Method
ReadStatTable(table, ext::AbstractString; kwargs...)

Construct a ReadStatTable by wrapping a Tables.jl-compatible column table for a supported file format with extension ext. An attempt is made to collect table-level or column-level metadata with a key that matches a metadata field in ReadStatMeta or ReadStatColMeta via the metadata and colmetadata interface defined by DataAPI.jl.

This method is used by writestat when the provided table is not already a ReadStatTable. Hence, it is useful for gaining fine-grained control over the content to be written. Metadata may be manually specified with keyword arguments.

Keywords

  • copycols::Bool = true: copy data columns to ReadStatColumns; this is required for writing columns of date/time values (that are not already represented by numeric values).
  • refpoolaslabel::Bool = true: generate value labels for columns of an array type that makes use of DataAPI.refpool (e.g., CategoricalArray and PooledArray).
  • vallabels::Dict{Symbol, Dict} = Dict{Symbol, Dict}(): a dictionary of all value label dictionaries indexed by their names.
  • hasmissing::Vector{Bool} = Vector{Bool}(): a vector of indicators for whether any missing value is present in the corresponding column; irrelevant for writing tables.
  • meta::ReadStatMeta = ReadStatMeta(): file-level metadata.
  • colmeta::ColMetaVec = ReadStatColMetaVec(): variable-level metadata stored in a StructArray of ReadStatColMetas; values are always overwritten.
  • varformat::Union{Dict{Symbol,String}, Nothing} = nothing: specify variable-level format for certain variables with the key being the variable name (as Symbol) and value being the format string.
  • styles::Dict{Symbol, Symbol} = _default_metastyles(): metadata styles.
  • maxdispwidth::Integer = 60: maximum display_width set for any variable.
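
The two-step workflow of constructing a ReadStatTable and then writing it can be sketched as follows. The helper write_with_formats is hypothetical; it is meant for tables that are not already a ReadStatTable, and it uses only the constructor signature and the varformat keyword documented above.

```julia
using ReadStatTables

# Hypothetical helper: wrap `table` as a ReadStatTable with explicit
# variable formats, then pass the wrapped table to the writer.
function write_with_formats(path::AbstractString, table, formats::Dict{Symbol,String})
    # Same rule as the default value of the `ext` keyword of writestat.
    ext = lowercase(splitext(path)[2])
    tb = ReadStatTables.ReadStatTable(table, ext; varformat = formats)
    # The wrapped table could be inspected or adjusted here before writing.
    return writestat(path, tb)
end

# Usage mirroring the earlier example (assumes `df` has a date column `vdate`):
# out2 = write_with_formats("data/write_alltypes.dta", df, Dict(:vdate => "%tm"))
```

Passing varformat directly to writestat, as shown earlier on this page, is shorter; the explicit construction is mainly useful when the table or its metadata needs to be checked before anything is written.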

ReadStatTables.ReadStatTable - Method
ReadStatTable(table::ReadStatTable, ext::AbstractString; kwargs...)

Construct a ReadStatTable from an existing ReadStatTable for a supported file format with extension ext.

Keywords

  • update_width::Bool = true: determine the storage width for each string variable by checking the actual data columns instead of any existing metadata value.

  • [1] The printed output is generated with PrettyTables.jl.
  • [2] The time types Date and DateTime are from the Dates module of Julia.