ReadStatTables.jl

Welcome to the documentation site for ReadStatTables.jl!

ReadStatTables.jl is a Julia package for reading and writing Stata, SAS and SPSS data files with Tables.jl-compatible tables.[1] It relies on the ReadStat C library developed by Evan Miller to parse and write the data files. The same C library is also the backend of popular packages in other languages, such as pyreadstat for Python and haven for R. As their Julia counterpart, ReadStatTables.jl leverages the Julia ecosystem for usability and performance. According to the project's benchmark results, its read performance, especially when taking advantage of multiple threads, surpasses that of all related packages by a sizable margin.


Features

ReadStatTables.jl provides the following features in addition to wrapping the C interface of ReadStat:

  • Fast multi-threaded data collection from ReadStat parsers to a Tables.jl-compatible ReadStatTable
  • Interface of file-level and variable-level metadata compatible with DataAPI.jl
  • Integration of value labels into data columns via a custom array type LabeledArray
  • Translation of date and time values into Julia time types Date and DateTime
  • Write support for Tables.jl-compatible tables (experimental)
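
As a rough illustration of how the features listed above fit together, the sketch below reads a file and inspects one column. The file name survey.dta and the column income are hypothetical placeholders, and the sketch assumes that readstat, colmetadata and the "label" metadata key are available after `using ReadStatTables`; check the package reference before relying on it.

```julia
using ReadStatTables

# Hypothetical path; replace it with a real .dta, .sas7bdat, .xpt, .sav or .por file.
path = "survey.dta"

if isfile(path)
    # Parse the whole file into a Tables.jl-compatible ReadStatTable.
    tb = readstat(path)

    # Columns can be accessed by name (assumes the file has a column `income`).
    income = tb.income

    # Variable-level metadata through the DataAPI.jl-style interface
    # (assumes the key "label" holds the variable label).
    println(colmetadata(tb, :income, "label"))
else
    println("No file found at $path; nothing to read.")
end
```

Because the result is a Tables.jl-compatible table, it should be possible to pass it on to other Tables.jl consumers, for example a DataFrame constructor.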

Supported File Formats

ReadStatTables.jl currently recognizes data files with the following file extensions:

  • Stata: .dta
  • SAS: .sas7bdat and .xpt
  • SPSS: .sav and .por
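
The mapping above can be checked before a file is handed to the package. The helper below is a minimal sketch and is not part of ReadStatTables.jl: the names KNOWN_EXTENSIONS and guess_format are hypothetical, and lowercasing the extension is a choice made here for illustration rather than documented package behavior.

```julia
# Hypothetical lookup table built from the list of supported extensions.
const KNOWN_EXTENSIONS = Dict(
    ".dta"      => "Stata",
    ".sas7bdat" => "SAS",
    ".xpt"      => "SAS",
    ".sav"      => "SPSS",
    ".por"      => "SPSS",
)

# Return the name of the statistical software for a file path,
# or `nothing` if the extension is not in the list.
function guess_format(path::AbstractString)
    ext = lowercase(splitext(path)[2])  # splitext keeps the leading dot
    return get(KNOWN_EXTENSIONS, ext, nothing)
end

println(guess_format("data/survey.dta"))
println(guess_format("export.sas7bdat"))
println(guess_format("notes.csv"))
```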

Installation

ReadStatTables.jl can be installed with the Julia package manager Pkg. From the Julia REPL, type ] to enter the Pkg REPL and run:

pkg> add ReadStatTables
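
For scripts or other non-interactive setups, the same installation can be expressed with the functional Pkg API instead of the Pkg REPL. This is a sketch of the standard Pkg usage and needs network access to the package registry.

```julia
using Pkg

# Equivalent to `pkg> add ReadStatTables` in the Pkg REPL.
Pkg.add("ReadStatTables")
```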

Known Limitations

The development of ReadStatTables.jl is not yet complete. The main limitations to be addressed are the following:

  • Read support for SAS value labels is temporarily absent.
  • All missing values are represented by a single value missing.[2]
  • Write support for the file formats is experimental and not fully developed.

[1] Development of the reading capability is temporarily prioritized over that of the writing capability. Implementation of the write support started only recently and should be considered experimental.

[2] The statistical software may accept multiple values for representing missing values (e.g., .a, .b, ..., .z in Stata). These original values can be recognized by the parser but are not integrated into the output at this moment.
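
Since every kind of missing value arrives as the single Julia value missing, the usual tools for missing data in Julia apply to the columns. The sketch below uses a hand-written vector named income as a hypothetical stand-in for a column read from a file; it does not call ReadStatTables.jl at all.

```julia
using Statistics

# Hypothetical column: values that were ., .a or .b in Stata would all
# appear as `missing` here, so they can no longer be told apart.
income = Union{Float64, Missing}[52_000.0, missing, 61_500.0, missing, 48_250.0]

# Count the missing entries.
n_missing = count(ismissing, income)

# Compute the mean over the non-missing entries only.
avg = mean(skipmissing(income))

# Replace missing entries with a default value.
filled = coalesce.(income, 0.0)

println((n_missing, avg, filled))
```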