Getting Started
An overview of the usage of ReadStatTables.jl is provided below. For instructions on installation, see Installation.
Reading a Data File
Suppose we have a Stata .dta file located at data/sample.dta. To read this file into Julia:
julia> using ReadStatTables
julia> tb = readstat("data/sample.dta")
5×7 ReadStatTable:
 Row │  mychar    mynum      mydate                dtime         mylabl  ⋯
     │ String3  Float64       Date?            DateTime?  Labeled{Int8}  Label ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │       a      1.1  2018-05-06  2018-05-06T10:10:10           Male  ⋯
   2 │       b      1.2  1880-05-06  1880-05-06T10:10:10         Female  ⋯
   3 │       c  -1000.3  1960-01-01  1960-01-01T00:00:00           Male  ⋯
   4 │       d     -1.4  1583-01-01  1583-01-01T00:00:00         Female  ⋯
   5 │       e   1000.3     missing              missing           Male  ⋯
                                                               2 columns omitted
Here is how we read the above result:[1]
- Variable names from the data file are displayed in the first row.
- Element type of each variable is displayed below the corresponding variable name.
- The values of each variable are displayed column-wise starting from the third row.
Some additional details to be noted:
- If a variable contains any missing value, there is a question mark ? in the displayed element type.
- By default, all missing values are treated as missing, a special value in Julia.
- The date and time values have been translated into Date and DateTime respectively.[2]
- Labels instead of the numeric values are displayed for variables with value labels. Labeled{Int8} is an abbreviation for LabeledValue{Int8}.
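As a rough illustration of the first two points, the sketch below uses only Base Julia and the Dates standard library. The vector mydate is a hand-built stand-in for the column shown above, not data read from a file. It shows that the displayed Date? corresponds to Union{Missing, Date} and one common way of skipping missing entries.

```julia
using Dates

# Hand-built stand-in for a column displayed with element type `Date?`
mydate = Union{Missing, Date}[Date(2018, 5, 6), Date(1880, 5, 6), missing]

# The question mark in the display abbreviates a union with Missing
has_missing_type = eltype(mydate) == Union{Missing, Date}

# Count the missing entries and work with the remaining ones
nmissing = count(ismissing, mydate)
earliest = minimum(skipmissing(mydate))
```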
Accessing Individual Objects
A vector of all variable names can be obtained as follows:
julia> columnnames(tb)
7-element Vector{Symbol}:
 :mychar
 :mynum
 :mydate
 :dtime
 :mylabl
 :myord
 :mytime
To retrieve the array containing data for a specific variable:
julia> tb.mylabl
5-element LabeledVector{Union{Missing, Int8}, Vector{Union{Missing, Int8}}, Union{Char, Int32}}:
 1 => Male
 2 => Female
 1 => Male
 2 => Female
 1 => Male
The returned array is exactly the same array that holds the data for the table. Therefore, modifying elements in the returned array will also change the data in the table. To avoid such changes, copy the array first.
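The aliasing behavior can be sketched with plain Julia objects. Here tb_like is a hypothetical stand-in (a NamedTuple holding one vector), not a ReadStatTable, so the example only illustrates the general difference between a shared array and a copy.

```julia
# Stand-in for a table: a NamedTuple holding one column vector
tb_like = (mynum = [1.1, 1.2, -1000.3],)

col = tb_like.mynum          # the same array as the one held by the table
safe = copy(tb_like.mynum)   # an independent copy

col[1] = 0.0                 # also changes tb_like.mynum[1]
safe[2] = 0.0                # leaves tb_like.mynum[2] untouched
```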
Metadata for the data file can be accessed from tb using methods that are compatible with DataAPI.jl.
julia> metadata(tb)
ReadStatMeta:
  row count           => 5
  var count           => 7
  modified time       => 2021-04-22T21:36:00
  file format version => 118
  file label          => A test file
  file extension      => .dta
julia> colmetadata(tb)
ColMetaIterator{ReadStatColMeta} with 7 entries:
  :mychar => ReadStatColMeta(character, %-1s)
  :mynum => ReadStatColMeta(numeric, %16.2f)
  :mydate => ReadStatColMeta(date, %td)
  :dtime => ReadStatColMeta(datetime, %tc)
  :mylabl => ReadStatColMeta(labeled, %16.0f)
  :myord => ReadStatColMeta(ordinal, %16.0f)
  :mytime => ReadStatColMeta(time, %tcHH:MM:SS)
julia> colmetadata(tb, :myord)
ReadStatColMeta:
  label         => ordinal
  format        => %16.0f
  type          => READSTAT_TYPE_INT8
  value label   => myord
  storage width => 1
  display width => 16
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_RIGHT
Type Conversions
The interface provided by ReadStatTables.jl covers basic tasks. When more complicated operations are needed, the objects can easily be converted into other types.
Converting ReadStatTable
The table returned by readstat is a ReadStatTable. Converting a ReadStatTable to another table type is easy, thanks to the widely supported Tables.jl interface.
For example, to convert a ReadStatTable to a DataFrame from DataFrames.jl:
julia> using DataFrames
julia> df = DataFrame(tb)
5×7 DataFrame
 Row │ mychar   mynum    mydate      dtime                mylabl     myord      ⋯
     │ String3  Float64  Date?       DateTime?            LabeledV…  LabeledV…  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ a            1.1  2018-05-06  2018-05-06T10:10:10  Male       low        ⋯
   2 │ b            1.2  1880-05-06  1880-05-06T10:10:10  Female     medium
   3 │ c        -1000.3  1960-01-01  1960-01-01T00:00:00  Male       high
   4 │ d           -1.4  1583-01-01  1583-01-01T00:00:00  Female     low
   5 │ e         1000.3  missing     missing              Male       missing    ⋯
                                                                1 column omitted
Metadata contained in a ReadStatTable are preserved in the converted DataFrame when working with DataFrames.jl version v1.4.0 or above, which supports the same DataAPI.jl interface for metadata:
julia> metadata(df)
Dict{String, Any} with 13 entries:
  "file_ext" => ".dta"
  "modified_time" => DateTime("2021-04-22T21:36:00")
  "file_format_version" => 118
  "file_format_is_64bit" => true
  "table_name" => ""
  "notes" => String[]
  "file_encoding" => ""
  "file_label" => "A test file"
  "var_count" => 7
  "row_count" => 5
  "creation_time" => DateTime("2021-04-22T21:36:00")
  "endianness" => READSTAT_ENDIAN_LITTLE
  "compression" => READSTAT_COMPRESS_NONE
julia> colmetadata(df, :myord)
Dict{String, Any} with 8 entries:
  "label" => "ordinal"
  "format" => "%16.0f"
  "display_width" => 16
  "measure" => READSTAT_MEASURE_UNKNOWN
  "alignment" => READSTAT_ALIGNMENT_RIGHT
  "type" => READSTAT_TYPE_INT8
  "storage_width" => 0x0000000000000001
  "vallabel" => :myord
Converting LabeledArray
Variables with value labels are stored in LabeledArrays. To convert a LabeledArray to another array type, we may either obtain an array of LabeledValues or collect the values and labels separately. The data values can be directly retrieved by calling refarray:
julia> refarray(tb.mylabl)
5-element Vector{Union{Missing, Int8}}:
 1
 2
 1
 2
 1
The array returned by refarray is exactly the same array underlying the LabeledArray. Therefore, modifying the elements of the array will also mutate the values in the associated LabeledArray.
If only the value labels are needed, we can obtain an iterator of the value labels via valuelabels. For example, to convert a LabeledArray to a CategoricalArray from CategoricalArrays.jl:
julia> using CategoricalArrays
julia> CategoricalArray(valuelabels(tb.mylabl))
5-element CategoricalArray{String,1,UInt32}:
 "Male"
 "Female"
 "Male"
 "Female"
 "Male"
It is also possible to only convert the type of the underlying data values:
julia> convertvalue(Int32, tb.mylabl)
5-element LabeledVector{Int32, Vector{Int32}, Union{Char, Int32}}:
 1 => Male
 2 => Female
 1 => Male
 2 => Female
 1 => Male
ReadStatTables.convertvalue — Function
convertvalue(T, x::LabeledArray)
Convert the type of data values contained in x to T. This method is equivalent to convert(AbstractArray{LabeledValue{T, K}, N}, x).
Writing a Data File
To write a table to a supported data file format:
julia> # Create a data frame for illustration
       df = DataFrame(readstat("data/alltypes.dta")); emptycolmetadata!(df)
3×9 DataFrame
 Row │ vbyte      vint       vlong      vfloat     vdouble    vstr     vstrL    ⋯
     │ LabeledV…  LabeledV…  LabeledV…  LabeledV…  LabeledV…  String3  String   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ A          A          A          A          A          ab       This is  ⋯
   2 │ missing    missing    missing    missing    missing
   3 │ missing    missing    missing    missing    missing
                                                               3 columns omitted
julia> out = writestat("data/write_alltypes.dta", df)
3×9 ReadStatTable:
 Row │          vbyte             vint            vlong             vfloat     ⋯
     │ Labeled{Int8?}  Labeled{Int16?}  Labeled{Int32?}  Labeled{Float32?}  La ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │              A                A                A                  A     ⋯
   2 │        missing          missing          missing            missing     ⋯
   3 │        missing          missing          missing            missing     ⋯
                                                               5 columns omitted
The returned table out contains the actual data (including metadata) that are exposed to the writer.
Value labels attached to a LabeledArray are always preserved in the output file. If the input table contains any column of type CategoricalArray or PooledArray, value labels are created and written automatically by default:
julia> using PooledArrays
julia> df[!,:vbyte] = CategoricalArray(valuelabels(df.vbyte))
3-element CategoricalArray{String,1,UInt32}:
 "A"
 "missing"
 "missing"
julia> df[!,:vint] = PooledArray(valuelabels(df.vint))
3-element PooledVector{String, UInt32, Vector{UInt32}}:
 "A"
 "missing"
 "missing"
julia> out = writestat("data/write_alltypes.dta", df)
3×9 ReadStatTable:
 Row │           vbyte             vint            vlong             vfloat    ⋯
     │ Labeled{Int32?}  Labeled{Int32?}  Labeled{Int32?}  Labeled{Float32?}  L ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │               A                A                A                  A    ⋯
   2 │         missing          missing          missing            missing    ⋯
   3 │         missing          missing          missing            missing    ⋯
                                                               5 columns omitted
Notice that in the returned table, the columns vbyte and vint are LabeledArrays:
julia> out.vbyte
3-element LabeledVector{Union{Missing, Int32}, SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}, Union{Char, Int32}}:
 1 => A
 2 => missing
 2 => missing
julia> out.vint
3-element LabeledVector{Union{Missing, Int32}, SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}, Union{Char, Int32}}:
 1 => A
 2 => missing
 2 => missing
It is possible to specify the format of certain variables. This can be important, for example, for variables representing date/time. In the above example, a default format has been selected:
julia> out.vdate
3-element mappedarray(ReadStatTables.Num2DateTime{Date, Dates.Day}(Date("1960-01-01"), Dates.Day(1)), ReadStatTables.DateTime2Num{ReadStatTables.Num2DateTime{Date, Dates.Day}}(ReadStatTables.Num2DateTime{Date, Dates.Day}(Date("1960-01-01"), Dates.Day(1))), ::SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}) with eltype Union{Missing, Date}:
 1960-01-02
 missing
 missing
julia> colmetadata(out, :vdate, "format")
"%td"
To specify a different format, a convenient approach is to specify the varformat keyword argument:
julia> out2 = writestat("data/write_alltypes.dta", df, varformat=Dict(:vdate=>"%tm"));
julia> out2.vdate
3-element mappedarray(ReadStatTables.Num2DateTime{Date, Dates.Month}(Date("1960-01-01"), Dates.Month(1)), ReadStatTables.DateTime2Num{ReadStatTables.Num2DateTime{Date, Dates.Month}}(ReadStatTables.Num2DateTime{Date, Dates.Month}(Date("1960-01-01"), Dates.Month(1))), ::SentinelArrays.SentinelVector{Int32, Int32, Missing, Vector{Int32}}) with eltype Union{Missing, Date}:
 1960-01-01
 missing
 missing
julia> colmetadata(out2, :vdate, "format")
"%tm"
Notice that since the format "%tm" is for months, the day within a month has been ignored and becomes 1 for the printed value of Date in the above example.
When a variable has a Stata format "%tw", "%tm", "%tq" or "%th", a displayed Date is always the first day within the corresponding Stata period. In particular, Stata counts week numbers starting from the first day of each year and hence a displayed Date may not correspond to the first day of a calendar week.
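The effect of period formats on displayed dates can be sketched with the Dates standard library alone. The helper names first_of_month and first_of_stata_week are hypothetical and are not part of ReadStatTables.jl. The week helper assumes Stata's convention that weeks are counted from January 1 and that every year has exactly 52 weeks, with the last week absorbing the remaining days.

```julia
using Dates

# First day of the month containing `d`, as implied by a "%tm" format
first_of_month(d::Date) = Date(year(d), month(d), 1)

# First day of the Stata week containing `d` (assumed convention:
# weeks start on January 1 and each year has exactly 52 weeks)
function first_of_stata_week(d::Date)
    w = min((dayofyear(d) - 1) ÷ 7, 51)   # zero-based week within the year
    return Date(year(d), 1, 1) + Day(7 * w)
end

m = first_of_month(Date(1960, 1, 2))          # 1960-01-01
w = first_of_stata_week(Date(2021, 4, 22))    # 2021-04-16, a Friday
```

Because January 1, 2021 was a Friday, every Stata week of 2021 starts on a Friday in this sketch, which is why the result need not coincide with the start of a calendar week.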
The write support is experimental and requires further testing. Caution should be taken when writing the data files.
More Options
The behavior of readstat can be adjusted by passing keyword arguments:
ReadStatTables.readstat — Function
readstat(filepath; kwargs...)
Return a ReadStatTable that collects data (including metadata) from a supported data file located at filepath.
Supported File Formats
- Stata: .dta
- SAS: .sas7bdat and .xpt
- SPSS: .sav and .por
Keywords
- ext = lowercase(splitext(filepath)[2]): extension of data file for choosing the parser.
- usecols::Union{ColumnSelector, Nothing} = nothing: only collect data from the specified columns (variables); collect all columns if usecols=nothing.
- row_limit::Union{Integer, Nothing} = nothing: restrict the total number of rows to be read; read all rows if row_limit=nothing.
- row_offset::Integer = 0: skip the specified number of rows.
- ntasks::Union{Integer, Nothing} = nothing: number of tasks spawned to read data file in concurrent chunks with multiple threads; with ntasks being nothing or smaller than 1, select a default value based on the size of data file and the number of threads available (Threads.nthreads()); not applicable to .xpt and .por files where row count is unknown from metadata.
- apply_value_labels::Bool = true: apply value labels to the associated columns.
- inlinestring_width::Integer = ext ∈ (".sav", ".por") ? 0 : 32: use a fixed-width string type that can be stored inline for any string variable with width below inlinestring_width and pool_width; a non-positive value avoids using any inline string type; not recommended for SPSS files.
- pool_width::Integer = 64: only attempt to use PooledArray for string variables with width of at least 64.
- pool_thres::Integer = 500: do not use PooledArray for string variables if the number of unique values exceeds pool_thres; a non-positive value avoids using PooledArray.
- file_encoding::Union{String, Nothing} = nothing: manually specify the file character encoding; must be an iconv-compatible name.
- handler_encoding::Union{String, Nothing} = nothing: manually specify the handler character encoding; defaults to UTF-8.
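A few of these keywords can be combined as in the sketch below. It assumes that the sample file used in the earlier examples exists at data/sample.dta. The variable names tb_small and tb_raw are chosen here for illustration only.

```julia
using ReadStatTables

# Assumes the sample file from the earlier examples exists at this path
path = "data/sample.dta"

# Read only two variables and at most three rows, skipping the first row
tb_small = readstat(path; usecols = [:mychar, :mynum], row_limit = 3, row_offset = 1)

# Keep the stored numeric values instead of applying value labels
tb_raw = readstat(path; apply_value_labels = false)
```

Restricting columns and rows in this way may be useful for taking a quick look at a large file before reading it in full.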
The accepted types of values for selecting certain variables (data columns) are shown below:
ReadStatTables.ColumnIndex — Type
ColumnIndex
A type union for values accepted by readstat and ReadStatTable for selecting a column. A column can be selected either with the column name as Symbol or String; or with an integer (Int) index based on the position in a table. See also ColumnSelector.
ReadStatTables.ColumnSelector — Type
ColumnSelector
A type union for values accepted by readstat for selecting a single column or multiple columns. The accepted values can be of type ColumnIndex, a UnitRange of integers, an array or a set of ColumnIndex.
File-level metadata can be obtained without reading the entire data file:
ReadStatTables.readstatmeta — Function
readstatmeta(filepath; kwargs...)
Return a ReadStatMeta that collects file-level metadata without reading the full data from a supported data file located at filepath. See also readstatallmeta.
Supported File Formats
- Stata: .dta
- SAS: .sas7bdat and .xpt
- SPSS: .sav and .por
Keywords
- ext = lowercase(splitext(filepath)[2]): extension of data file for choosing the parser.
- file_encoding::Union{String, Nothing} = nothing: manually specify the file character encoding; must be an iconv-compatible name.
- handler_encoding::Union{String, Nothing} = nothing: manually specify the handler character encoding; defaults to UTF-8.
To additionally collect variable-level metadata and all value labels:
ReadStatTables.readstatallmeta — Function
readstatallmeta(filepath; kwargs...)
Return all metadata including value labels without reading the full data from a supported data file located at filepath. The four returned objects are for file-level metadata, variable names, variable-level metadata and value labels respectively. See also readstatmeta.
Supported File Formats
- Stata: .dta
- SAS: .sas7bdat and .xpt
- SPSS: .sav and .por
Keywords
- ext = lowercase(splitext(filepath)[2]): extension of data file for choosing the parser.
- usecols::Union{ColumnSelector, Nothing} = nothing: only collect variable-level metadata from the specified columns (variables); collect all columns if usecols=nothing.
- file_encoding::Union{String, Nothing} = nothing: manually specify the file character encoding; must be an iconv-compatible name.
- handler_encoding::Union{String, Nothing} = nothing: manually specify the handler character encoding; defaults to UTF-8.
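The two metadata functions might be used as sketched below. The example assumes that the sample file from the earlier examples exists at data/sample.dta and that the four objects returned by readstatallmeta can be destructured in the order stated in the docstring. The variable names on the left-hand side are chosen here for illustration.

```julia
using ReadStatTables

# Assumes the sample file from the earlier examples exists at this path
path = "data/sample.dta"

# File-level metadata only
meta = readstatmeta(path)

# File-level metadata, variable names, variable-level metadata and value labels
meta_all, varnames, colmeta, vallabels = readstatallmeta(path)
```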
For writing tables to data files, one may gain more control by first converting a table to a ReadStatTable:
ReadStatTables.writestat — Function
writestat(filepath, table; ext = lowercase(splitext(filepath)[2]), kwargs...)
Write a Tables.jl-compatible table to filepath as a data file supported by ReadStat. File format is determined based on the extension contained in filepath and may be overridden by the ext keyword.
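The default value of ext shown in the signature can be evaluated with Base Julia alone. In the sketch below, default_ext is a hypothetical helper name that simply wraps the expression from the signature, and the file paths other than data/write_alltypes.dta are made up for illustration.

```julia
# Wraps the default expression for the `ext` keyword from the signature
default_ext(filepath) = lowercase(splitext(filepath)[2])

e1 = default_ext("data/write_alltypes.dta")   # ".dta"
e2 = default_ext("data/SURVEY.SAV")           # ".sav"
e3 = default_ext("data/export")               # "" since the path has no extension
```

When the path carries no recognizable extension, as in the last case, passing ext explicitly is presumably the way to select the output format.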
Any user-provided table is converted to a ReadStatTable first before being handled by a ReadStat writer. Therefore, to gain fine-grained control over the content to be written, especially for metadata, one may directly work with a ReadStatTable (possibly converted from another table type such as DataFrame from DataFrames.jl) before passing it to writestat. Alternatively, one may pass any keyword argument accepted by a constructor of ReadStatTable to writestat. The actual ReadStatTable handled by the writer is returned after the writer finishes.
Supported File Formats
- Stata: .dta
- SAS: .sas7bdat and .xpt (Note: SAS may not recognize the produced .sas7bdat files due to a known limitation with ReadStat.)
- SPSS: .sav and .por
Conversion
For data values, Julia objects are converted to the closest ReadStat type for either numerical values or strings. However, depending on the file format of the output file, a data column may be written in a different type when the closest ReadStat type is not supported.
For metadata, if the user-provided table is not a ReadStatTable, an attempt will be made to collect table-level or column-level metadata with a key that matches a metadata field in ReadStatMeta or ReadStatColMeta via the metadata and colmetadata interface defined by DataAPI.jl. If the table is a ReadStatTable, then the associated metadata will be written as long as their values are compatible with the format of the output file. Value labels associated with a LabeledArray are always preserved even when the name of the dictionary of value labels is not specified in metadata (column name will be used by default). If a column is of an array type that makes use of DataAPI.refpool (e.g., CategoricalArray and PooledArray), value labels will be generated automatically by default (with keyword refpoolaslabel set to be true) and the underlying numerical reference values instead of the values returned by getindex are written to files (with value labels attached).
ReadStatTables.ReadStatTable — Method
ReadStatTable(table, ext::AbstractString; kwargs...)
Construct a ReadStatTable by wrapping a Tables.jl-compatible column table for a supported file format with extension ext. An attempt is made to collect table-level or column-level metadata with a key that matches a metadata field in ReadStatMeta or ReadStatColMeta via the metadata and colmetadata interface defined by DataAPI.jl.
This method is used by writestat when the provided table is not already a ReadStatTable. Hence, it is useful for gaining fine-grained control over the content to be written. Metadata may be manually specified with keyword arguments.
Keywords
- copycols::Bool = true: copy data columns to ReadStatColumns; this is required for writing columns of date/time values (that are not already represented by numeric values).
- refpoolaslabel::Bool = true: generate value labels for columns of an array type that makes use of DataAPI.refpool (e.g., CategoricalArray and PooledArray).
- vallabels::Dict{Symbol, Dict} = Dict{Symbol, Dict}(): a dictionary of all value label dictionaries indexed by their names.
- hasmissing::Vector{Bool} = Vector{Bool}(): a vector of indicators for whether any missing value is present in the corresponding column; irrelevant for writing tables.
- meta::ReadStatMeta = ReadStatMeta(): file-level metadata.
- colmeta::ColMetaVec = ReadStatColMetaVec(): variable-level metadata stored in a StructArray of ReadStatColMetas; values are always overwritten.
- varformat::Union{Dict{Symbol,String}, Nothing} = nothing: specify variable-level format for certain variables with the key being the variable name (as Symbol) and value being the format string.
- styles::Dict{Symbol, Symbol} = _default_metastyles(): metadata styles.
- maxdispwidth::Integer = 60: maximum display_width set for any variable.
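Constructing the ReadStatTable explicitly before writing might look like the sketch below. The input tbl is a made-up NamedTuple of vectors standing in for any Tables.jl-compatible column table, the output goes to a temporary directory, and only the ext argument and the varformat keyword documented above are used. Given that write support is described as experimental, this should be read as an illustration rather than a tested recipe.

```julia
using ReadStatTables, Dates

# A small made-up column table; any Tables.jl-compatible column table
# is expected to be handled in the same way
tbl = (id = Int32[1, 2, 3],
       vdate = [Date(1960, 1, 2), Date(1960, 2, 15), Date(1961, 3, 1)])

# Wrap the table for the Stata format and request a monthly format for `vdate`
rtb = ReadStatTable(tbl, ".dta"; varformat = Dict(:vdate => "%tm"))

# Inspect the format recorded in the column-level metadata before writing
fmt = colmetadata(rtb, :vdate, "format")

# Write to a temporary location; the table handled by the writer is returned
out = writestat(joinpath(mktempdir(), "example.dta"), rtb)
```

Working with rtb directly leaves room for checking or adjusting metadata before anything is written to disk.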
ReadStatTables.ReadStatTable — Method
ReadStatTable(table::ReadStatTable, ext::AbstractString; kwargs...)
Construct a ReadStatTable from an existing ReadStatTable for a supported file format with extension ext.
Keywords
- update_width::Bool = true: determine the storage width for each string variable by checking the actual data columns instead of any existing metadata value.
[1] The printed output is generated with PrettyTables.jl.
[2] The time types Date and DateTime are from the Dates module of Julia.