This function used to be a thin wrapper around textreadr::read_document() that also wrote the result to output, doing its best to write UTF-8 correctly (based on the approach recommended in this blog post). However, textreadr was archived from CRAN. The function now directly wraps the functions that textreadr wrapped: pdftools::pdf_text() for .pdf files and striprtf::read_rtf() for .rtf files. It uses xml2 to import .docx and .odt files and rvest to import .html files, using code from the textreadr package.

doc_to_txt(
  input,
  output = NULL,
  encoding = rock::opts$get("encoding"),
  newExt = NULL,
  preventOverwriting = rock::opts$get("preventOverwriting"),
  silent = rock::opts$get("silent")
)

Arguments

input

The path to the input file.

output

The path and filename to write to. If this is a path to an existing directory (without a filename specified), the input filename will be used, and its extension will be replaced with newExt.

encoding

The encoding to use when writing the text file.

newExt

The extension to append: only used if output = NULL and newExt is not NULL, in which case the output will be written to a file with the same name as input but with newExt as extension.

preventOverwriting

Whether to prevent overwriting existing files.

silent

Whether to be silent or chatty.

Value

The converted source, as a character vector.

Examples

### This example requires the {xml2} package
if (requireNamespace("xml2", quietly = TRUE)) {
  print(
    rock::doc_to_txt(
      input = system.file(
        "extdata/doc-to-test.docx", package="rock"
      )
    )
  );
}
#> [1] "This is a word document."                                                          
#> [2] "It doesn’t have much fancy content, but it’s 12kb large nonetheless."              
#> [3] "Because some people use Word to transcribe, it can be useful to import Word files."
#> [4] "Note that this does mean you’ll lose markup."
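
The example above only prints the converted text. As a minimal sketch of the file-writing path, the snippet below passes an explicit output path in a temporary directory. The variable names inputFile and outputFile are illustrative, and the .txt extension is an assumption rather than a requirement. The sketch assumes the file is written as described under output, and it reads the file back only if it exists.

```r
### This example requires the {rock} and {xml2} packages
if (requireNamespace("rock", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)) {

  # Example document shipped with rock (same file as in the example above)
  inputFile <- system.file("extdata/doc-to-test.docx", package = "rock")

  # Fresh temporary path (illustrative), so no existing file can be overwritten
  outputFile <- tempfile(fileext = ".txt")

  # Convert the document; the text is returned and, per the 'output'
  # argument, is expected to be written to outputFile as well
  converted <- rock::doc_to_txt(
    input = inputFile,
    output = outputFile,
    silent = TRUE
  )

  # Read the file back only if it was actually written
  if (file.exists(outputFile)) {
    print(readLines(outputFile, encoding = "UTF-8"))
  }
}
```

Using tempfile() keeps the sketch independent of the preventOverwriting setting, because the target path does not exist before the call.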