Date and Time Values
Date and time values in the data files are recognized based on the format of each variable. Many data/time formats can be recognized without user intervention.[1] In case certain date/time formats are not recognized, they can be added easily.
Translating Date and Time Values
For all date/time formats from Stata, SAS and SPSS, the date and time values are stored as the numbers of periods elapsed since a reference date or time point (epoch) chosen by the software. Therefore, knowing the reference data/time and the length of a single period is sufficient for uncovering the represented date/time values for a given format.
Two exceptions are Stata format "%tw"
for weeks and "%ty"
for years. Stata always counts the week numbers starting from the first day of a year. Each year always consists of 52 weeks. Any remaining day at the end of a year is counted as the 52th week within that year. Conversion for a variable with format "%tw"
is therefore handled differently. For "%ty"
, the recorded numerical values are simply the calendar years without any transformation. A variable with format "%ty"
is not converted to Julia Date
or DateTime
.
If a variable is in a date/time format that can be recognized, the values will be displayed as Julia Date
or DateTime
when printing a ReadStatTable
. Notice that the underlying numerical values are preserved and the conversion to the Julia Date
or DateTime
happens only lazily via a MappedArray
when working with a ReadStatTable
.
julia> using ReadStatTables, DataFrames
julia> tb = readstat("data/sample.dta")
5×7 ReadStatTable: Row │ mychar mynum mydate dtime mylabl ⋯ │ String3 Float64 Date? DateTime? Labeled{Int8} Label ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ a 1.1 2018-05-06 2018-05-06T10:10:10 Male ⋯ 2 │ b 1.2 1880-05-06 1880-05-06T10:10:10 Female ⋯ 3 │ c -1000.3 1960-01-01 1960-01-01T00:00:00 Male ⋯ 4 │ d -1.4 1583-01-01 1583-01-01T00:00:00 Female ⋯ 5 │ e 1000.3 missing missing Male ⋯ 2 columns omitted
julia> tb.mydate
5-element mappedarray(ReadStatTables.Num2DateTime{Date, Dates.Day}(Date("1960-01-01"), Dates.Day(1)), ReadStatTables.DateTime2Num{ReadStatTables.Num2DateTime{Date, Dates.Day}}(ReadStatTables.Num2DateTime{Date, Dates.Day}(Date("1960-01-01"), Dates.Day(1))), ::SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}) with eltype Union{Missing, Date}: 2018-05-06 1880-05-06 1960-01-01 1583-01-01 missing
julia> tb.mydate.data
5-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}: 21310.0 -29093.0 0.0 -137696.0 missing
julia> colmetadata(tb, :mydate, "format")
"%td"
The variable-level metadata key named format
informs ReadStatTable
whether the variable represents date/time and how the numerical values should be interpreted. Changing the format
directly affects how the values are displayed, although the numerical values remain unchanged.
julia> colmetadata!(tb, :mydate, "format", "%tm")
ColMetaIterator{ReadStatColMeta} with 7 entries: :mychar => ReadStatColMeta(character, %-1s) :mynum => ReadStatColMeta(numeric, %16.2f) :mydate => ReadStatColMeta(date, %tm) :dtime => ReadStatColMeta(datetime, %tc) :mylabl => ReadStatColMeta(labeled, %16.0f) :myord => ReadStatColMeta(ordinal, %16.0f) :mytime => ReadStatColMeta(time, %tcHH:MM:SS)
julia> tb.mydate
5-element mappedarray(ReadStatTables.Num2DateTime{Date, Dates.Month}(Date("1960-01-01"), Dates.Month(1)), ReadStatTables.DateTime2Num{ReadStatTables.Num2DateTime{Date, Dates.Month}}(ReadStatTables.Num2DateTime{Date, Dates.Month}(Date("1960-01-01"), Dates.Month(1))), ::SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}) with eltype Union{Missing, Date}: 3735-11-01 -0465-08-01 1960-01-01 -9515-05-01 missing
julia> colmetadata!(tb, :mydate, "format", "%8.0f")
ColMetaIterator{ReadStatColMeta} with 7 entries: :mychar => ReadStatColMeta(character, %-1s) :mynum => ReadStatColMeta(numeric, %16.2f) :mydate => ReadStatColMeta(date, %8.0f) :dtime => ReadStatColMeta(datetime, %tc) :mylabl => ReadStatColMeta(labeled, %16.0f) :myord => ReadStatColMeta(ordinal, %16.0f) :mytime => ReadStatColMeta(time, %tcHH:MM:SS)
julia> tb.mydate
5-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}: 21310.0 -29093.0 0.0 -137696.0 missing
Copying a ReadStatTable
(e.g., converting to a DataFrame
) may drop the underlying numerical values. Hence, users who wish to directly work with the underlying numerical values may want to preserve the ReadStatTable
generated from the data file.
julia> df = DataFrame(tb)
5×7 DataFrame Row │ mychar mynum mydate dtime mylabl myord ⋯ │ String3 Float64 Float64? DateTime? LabeledV… LabeledV… ⋯ ─────┼────────────────────────────────────────────────────────────────────────── 1 │ a 1.1 21310.0 2018-05-06T10:10:10 Male low ⋯ 2 │ b 1.2 -29093.0 1880-05-06T10:10:10 Female medium 3 │ c -1000.3 0.0 1960-01-01T00:00:00 Male high 4 │ d -1.4 -137696.0 1583-01-01T00:00:00 Female low 5 │ e 1000.3 missing missing Male missing ⋯ 1 column omitted
julia> df.mydate
5-element Vector{Union{Missing, Float64}}: 21310.0 -29093.0 0.0 -137696.0 missing
In the above example, df.mydate
only contains the Date
values and the underlying numerical values are lost when constructing the DataFrame
.
The full lists of recognized date/time formats for the statistical software are stored as dictionary keys; while the associated values are tuples of reference date/time and period length.[2] If a date/time format is not found in the dictionary, no type conversion will be attempted. Additional formats may be added by inserting key-value pairs to the relevant dictionaries.
julia> ReadStatTables.stata_dt_formats
Dict{String, Tuple{Union{Date, DateTime}, Dates.Period}} with 6 entries: "%tc" => (DateTime("1960-01-01T00:00:00"), Millisecond(1)) "%tw" => (Date("1960-01-01"), Week(1)) "%td" => (Date("1960-01-01"), Day(1)) "%tq" => (Date("1960-01-01"), Month(3)) "%tm" => (Date("1960-01-01"), Month(1)) "%th" => (Date("1960-01-01"), Month(6))
julia> ReadStatTables.sas_dt_formats["MMDDYY"]
(Date("1960-01-01"), Dates.Day(1))
julia> ReadStatTables.spss_dt_formats["TIME"]
(DateTime("1582-10-14T00:00:00"), Dates.Second(1))
- 1For Stata, all date/time formats except
"%tC"
and"%d"
are supported. The only difference between the"%tC"
format and the"%tc"
format is that"%tC"
takes into account leap seconds while"%tc"
does not. Since theDateTime
type in the Dates module does not allow leap seconds, only the"%tc"
format is supported. The"%d"
format that appears in earlier versions of Stata is no longer documented in recent versions. For SAS and SPSS, the coverage of date/time formats might be less comprehensive. - 2For Stata, the reference for date/time value translation is the official Stata documentation. Only the first three characters in the format strings affect the coding. For SAS and SPSS, the reference is
pyreadstat/_readstat_parser.pyx
.