Convert a document (.docx, .pdf, .odt, .rtf, or .html) to a plain text file

This function used to be a thin wrapper around textreadr::read_document() that also wrote the result to output, doing its best to correctly write UTF-8 (based on the approach recommended in this blog post). However, textreadr was archived from CRAN. The function now directly wraps the functions that textreadr wrapped: pdftools::pdf_text() for .pdf files and striprtf::read_rtf() for .rtf files. It uses xml2 to import .docx and .odt files and rvest to import .html files, reusing code from the textreadr package.
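The mapping from file type to backend described above can be sketched as a simple dispatch on the file extension. This is only an illustration in base R: `pick_backend()` is a hypothetical helper, not a function in rock, and it returns the name of the backend instead of calling it.

```r
# Hypothetical helper (not part of rock): map a file extension to the
# backend named in the description above.
pick_backend <- function(input) {
  ext <- tolower(tools::file_ext(input))
  switch(
    ext,
    pdf  = "pdftools::pdf_text()",
    rtf  = "striprtf::read_rtf()",
    docx = ,                        # falls through to the next entry
    odt  = "xml2",
    html = "rvest",
    stop("Unsupported extension: '", ext, "'")
  )
}

pick_backend("interview-01.docx")
pick_backend("report.PDF")
```

Lowercasing the extension is an assumption made for this sketch; whether doc_to_txt() treats extensions case-insensitively is not stated here.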

doc_to_txt(
  input,
  output = NULL,
  encoding = rock::opts$get("encoding"),
  newExt = NULL,
  preventOverwriting = rock::opts$get("preventOverwriting"),
  silent = rock::opts$get("silent")
)

Arguments

input: The path to the input file.
output: The path and filename to write to. If this is a path to an existing directory (without a filename specified), the input filename will be used, and the extension will be replaced with newExt.
encoding: The encoding to use when writing the text file.
newExt: The extension to append: only used if output = NULL and newExt is not NULL, in which case the output will be written to a file with the same name as input but with newExt as extension.
preventOverwriting: Whether to prevent overwriting existing files.
silent: Whether to be silent or chatty.
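As a rough illustration of how newExt, encoding, and preventOverwriting could interact, here is a minimal base R sketch. Both helpers are hypothetical and are not part of rock. Three things are assumptions: that newExt may be given with or without a leading dot, that an existing file is skipped with a message instead of raising an error, and that UTF-8 is written by converting with enc2utf8() and writing the bytes unchanged (one common approach, not necessarily the one from the blog post mentioned above).

```r
# Hypothetical helper (not part of rock): give 'input' the extension 'newExt'.
swap_extension <- function(input, newExt) {
  # Assumption: 'newExt' may be passed with or without a leading dot
  paste0(tools::file_path_sans_ext(input), ".", sub("^\\.", "", newExt))
}

# Hypothetical helper (not part of rock): write lines as UTF-8 bytes.
write_utf8 <- function(text, output, preventOverwriting = TRUE) {
  if (preventOverwriting && file.exists(output)) {
    # Assumption: skip the write and report it, rather than raise an error
    message("File '", output, "' already exists; not overwriting it.")
    return(invisible(FALSE))
  }
  con <- file(output, open = "wb")
  on.exit(close(con), add = TRUE)
  # Convert to UTF-8 and write the bytes as-is, bypassing re-encoding
  writeLines(enc2utf8(text), con, useBytes = TRUE)
  invisible(TRUE)
}

input <- file.path(tempdir(), "interview-01.docx")  # placeholder path; never read
output <- swap_extension(input, newExt = "txt")
unlink(output)  # make sure the demonstration starts without an existing file
text <- c("This is a word document.", "It doesn\u2019t have much fancy content.")
first_write <- write_utf8(text, output)
second_write <- write_utf8("something else", output)  # skipped: file exists
readLines(output, encoding = "UTF-8")
```

The second call leaves the file untouched, which is the behaviour one would expect from preventOverwriting = TRUE.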

Value

The converted source, as a character vector.

Examples

### This example requires the {xml2} package
if (requireNamespace("xml2", quietly = TRUE)) {
  print(
    rock::doc_to_txt(
      input = system.file(
        "extdata/doc-to-test.docx", package="rock"
      )
    )
  );
}
#> [1] "This is a word document."                                                          
#> [2] "It doesn’t have much fancy content, but it’s 12kb large nonetheless."              
#> [3] "Because some people use Word to transcribe, it can be useful to import Word files."
#> [4] "Note that this does mean you’ll lose markup."