Tutorial: Text Files

Text Files

This assumes you have read the article on Character Encoding.

What are Text Files?

By text file, I mean any file for which the bits are intended to be decoded using a standard text format, such as ASCII or Unicode (e.g., 0100 0001 will be “A” and so on). Also called “plain text” or “human readable”, these files are easy to read with any software that supports text or other documents. Any other file is called a binary file, as the bits can be interpreted in any way the creator desires. A binary file can be opened as if it were text, but the result is gibberish.
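To make the decoding idea concrete, here is a minimal Python sketch. The bit patterns are illustrative values chosen for this example; in standard ASCII, 0100 0001 (decimal 65) is “A”. The last part shows why a binary file opened as text turns into gibberish.

```python
# Decode raw bits as text using a standard encoding (ASCII).
bits = "01000001 01000010 01000011"   # three 8-bit groups

# Convert each group of bits to an integer, then collect them as bytes.
raw = bytes(int(group, 2) for group in bits.split())

text = raw.decode("ascii")
print(text)

# Bytes that were never meant to be text do not decode cleanly.
# These four bytes are the start of a PNG image, a binary format.
binary_sample = bytes([0x89, 0x50, 0x4E, 0x47])
try:
    binary_sample.decode("ascii")
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False

print(decodes_as_ascii)
# Forcing the decode replaces the unreadable byte with a placeholder symbol.
print(binary_sample.decode("ascii", errors="replace"))
```

A text editor does roughly the last step when you open a binary file: it shows whatever characters the bytes happen to map to.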

Most text files are actually meant to be further processed or interpreted by software to produce a specific type of content. The files use special characters or sequences of characters to provide instructions to the interpreters. Some people use plain text to refer only to those files that do not get further interpreted, or only those in ASCII.

In general, there are three main types of text files based on how the content is meant to be used. There is no standard breakdown, but the following is a useful way of classifying files that researchers encounter. Also shown are examples of file extensions of that type. Thus, each file type listed here is a text file.

  • Scripts – For computer instructions or code.
    • .R, .py, .bat, .sps, .do, .sas, .sql, .ps, .js, .vbs, .scpt, .php, .asp
  • Documents – For the display of textual documents for reading.
    • .rtf, .md, .markdown, .html, .xml, .log, .tex, .txt, .text
  • Data – For structured text, numeric data, or graphical display.
    • .csv, .tab, .tsv, .por, .json, .xml, .ris, .bib, .eml, .ics, .css, .svg, .ps

An empirical researcher can work almost entirely with just text files: keep data, store rules for processing and analyzing the data, and save reports. Most common image files and other media are binary, but a few graphical formats (ex. .svg) are text, too.

A text editor is general-purpose software that interprets bits as text but does not further process or interpret the contents. Such programs often open quicker than document-editing programs like Microsoft Word, and automatically save documents in the original text format. It is good to have one for the quick examination of files, especially data files that may not be opening properly. There are many free text editors. Windows users are advised to try Notepad++, though the built-in Notepad will also work. Mac users should try Atom or the built-in TextEdit application (though it does not use plain text by default). Try opening various files in your text editor and see what you learn about them.

For general document writing or scripting, you will want to find a task- and language-specific editor with additional tools, like previewing the document or running the script. Some editors work well with multiple languages. See the articles on each file type for recommendations.

Text File Types

Scripts

Every programming language saves scripts in text files, including all the statistical software packages. Below is a section from a script in the statistical programming language R, as seen in a text editor (left image) and in RStudio (right image), a popular R editor. The designated software will often add coloring and other formatting for keywords to help with writing the language, but only the text itself is saved. Of course, scripts will also perform tasks when run using the designated program. This example was to process demographic variables by collapsing and creating categories. Learn more about scripts and scripting for researchers working with data in the separate article.

[Images: the R script in a text editor (left) and in RStudio (right)]

Documents

Document files contain markup: characters or sequences of characters that provide information above and beyond the document text. Software can recognize and interpret standardized annotations to control the display or processing of a text file. Below is a section of a document (this one!) in the markup language Markdown as viewed in a text editor (left image) and how it looked in the interpreter Typora (right image). Learn more about Text Documents in the separate article.

[Images: the Markdown source in a text editor (left) and rendered in Typora (right)]

Data

Data files typically contain tags, keys, or delimiters. Below is a portion of data stored in comma delimited format (left image) as viewed in a text editor, and the same file opened in Excel (right image). Learn more about Text Data in the separate article.

[Images: the comma-delimited data in a text editor (left) and opened in Excel (right)]

Combination Types

XML

eXtensible Markup Language (XML) grew out of similar languages (HTML and SGML) as a general type of markup for any data content or document structure. It is very flexible and equally useful for files that are data-centric or document-centric. XML and related schemes are easy to identify from the tags (keywords) surrounded by angle brackets (< and >), for example:

<catalog>
   <book id="101">
      <title>Text Files and You: A Journey</title>
      <authors>
         <author>Mason, George</author>
         <author>Washington, Gregory</author>
      </authors>
   </book>
</catalog>

XML allows you to define a specific set of tag words for particular purposes. This extensibility has been used in many different fields. XML document formats can be specified with a Document Type Definition (DTD), written in its own EBNF-like notation, or an XML Schema Definition (XSD), which is written in XML itself. Learn more about XML in both the Data and Document articles.
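Because the tags are plain text with a predictable structure, general-purpose tools can read them. Here is a minimal sketch using Python's built-in xml.etree.ElementTree module on a catalog like the one above; the dictionary layout is just one convenient choice for this illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """
<catalog>
   <book id="101">
      <title>Text Files and You: A Journey</title>
      <authors>
         <author>Mason, George</author>
         <author>Washington, Gregory</author>
      </authors>
   </book>
</catalog>
"""

# Parse the text into a tree of elements.
root = ET.fromstring(xml_text)

# Walk the tree and pull out attributes and element text.
books = []
for book in root.findall("book"):
    books.append({
        "id": book.get("id"),                      # attribute value
        "title": book.findtext("title"),           # text of a child element
        "authors": [a.text for a in book.findall("authors/author")],
    })

print(books)
```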

Zipped XML

Several document file formats are actually compressed (zip) files containing a set of XML files plus attachments like images. To see the contents, you can change the extension to .zip and use your normal unzipping process. Or, use standalone unzipping software (like 7-Zip for Windows). Remember, changing the extension does not change the encoding, but such files are already in zipped format. Zipped XML file types include the newer Microsoft Office formats (.docx, .xlsx, .pptx) and the OpenDocument formats (.odt, .ods, .odp).

Older Microsoft Office document types (ex. .doc, .xls) are binary. A zip file itself is also binary, though the standard is well supported and appropriate for long term preservation (see below). Depending on the context, .docx and the others may be considered both a binary file and a text file.
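You can also look inside such a file from a script without renaming it. The sketch below uses Python's zipfile module. So that it runs anywhere, it first builds a tiny stand-in archive in memory; the member names imitate the layout of a .docx file, but the XML content is a placeholder, and report.docx in the comment is a hypothetical file name.

```python
import io
import zipfile

# Build a small stand-in archive in memory (placeholder content, not a real document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document><p>Hello</p></document>")

# Reading works the same way for a file on disk,
# e.g. zipfile.ZipFile("report.docx") for a hypothetical report.docx.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()                                 # files inside the zip
    document_xml = archive.read("word/document.xml").decode("utf-8")  # one member, as text

print(member_names)
print(document_xml)
```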

Research Notebooks

More and more researchers are combining documents and scripts into what is called a research notebook. The document has embedded code along with the results, and these notebooks are all text files. Jupyter Notebooks’ .ipynb files are in JSON format, a type of text data file, with sections of document text and sections of script. RStudio’s RMarkdown files (.rmd) are in a document markup called Markdown, with special notations to indicate sections of R script.
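Since a notebook is just JSON text, any JSON reader can take it apart. The sketch below parses a heavily trimmed, hypothetical notebook with Python's json module; real .ipynb files carry more metadata and fields than shown here.

```python
import json

# A trimmed, made-up notebook: one document cell and one script cell.
notebook_text = """
{
  "cells": [
    {"cell_type": "markdown", "source": ["# Results\\n", "Summary of the data."]},
    {"cell_type": "code", "source": ["print(1 + 1)"]}
  ],
  "nbformat": 4
}
"""

notebook = json.loads(notebook_text)

# List the kind of each section, then rebuild the text of the first one.
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
first_cell = "".join(notebook["cells"][0]["source"])

print(cell_types)
print(first_cell)
```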

Why are Text Files important?

You may be wondering why using text files matters. Scripts are always saved in text format, in part because of its tremendous advantages, which enable important capabilities such as version control.

Universally Accessible

A text file is decoded using a long-standing universal scheme into human-readable characters. This is simple and straightforward for almost any software or programming language, on any operating system, in any country, and will continue to be so for the foreseeable future.

In contrast, programs like Stata can change their binary file format from one version to the next. Both R and Python rely on the same program to open files from statistical software.

Long-term Preservation

Text files are typically preferred for long term storage and preservation of data and documentation because the content can always be read in a text editor even if the original interpretive software is unavailable. But, there are some binary standards that are open and supported (and used) by a variety of software and are acceptable. For instance, even though pdf files are binary, the PDF/A (ISO 19005-1 compliant) standard is a good preservation format because it is open and well-supported.

Data files produced by statistical software are typically binary (ex. .sav, .dta, .sas7bdat). Such files can more easily store metadata, like labels and data types, along with the data, and may even produce smaller file sizes than the text alternatives. Although it is good to preserve the original file because of that metadata, it is best to also save the data in a text format. Because text formats do not store labels and other data information, be sure to make a codebook and store it with the file as well (also in text format).

These are data files in text format:

  • Delimited: All software can save in this format (see more in the article on data files), in which the columns are separated by a symbol. For example, the file extension .csv refers to comma-separated values.
  • SPSS: The text .por format (for “portable”) was meant for data storage and transfer.
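  • R: Writes binary files by default, but functions such as save() and saveRDS() include an ascii argument that will create an ASCII file.
  • Python: The pickle module saves data in a binary format by default in Python 3. Its oldest protocol (protocol 0) produces an ASCII-based format instead, though that format is specific to Python.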
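As an illustration of the delimited format, here is a minimal sketch that writes and re-reads comma-separated values with Python's built-in csv module. The column names and values are made up for this example. Notice that everything comes back as text, which is one reason a codebook is worth keeping alongside the file.

```python
import csv
import io

# Made-up example rows; the second region contains a comma on purpose.
rows = [
    {"id": 1, "age_group": "18-29", "region": "North"},
    {"id": 2, "age_group": "30-44", "region": "South, coastal"},
]

# Write to an in-memory text buffer (a real script would open a .csv file).
out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=["id", "age_group", "region"])
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)

# Read the same text back. Values return as strings because the
# format stores no data types or labels.
read_back = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(read_back[0]["id"], type(read_back[0]["id"]).__name__)
```

The writer wraps the value containing a comma in quotes so that it is not mistaken for two columns.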

Collaboration

Even if your collaborators do not have the interpretation software, they can open the file in a text editor and still get the information. You can even copy text into an email without losing anything.

Machine Readable

Searching & Processing

It is easy for computers to search through text files, to determine how they differ, or to open and process text files with any programming language you want. If a file was binary, the people who created the file would have to build a way to make this happen.

Version control systems like Git, Mercurial, and Subversion are much more useful for text files than binary files. Although most will store both types of files, one of the most useful features of such systems is the ability to show line-by-line changes between one version and the next. Document text files with markup allow for both rich or structured text and the ability to easily see modifications. Further, many version control systems are able to just save the changes to text files, reducing the amount of storage necessary since the entire file is not backed up every time. To show differences in binary files like Microsoft Word requires converting them to a text format first, or using Word’s own clunky tools.
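To give a feel for line-by-line comparison, here is a small sketch using Python's built-in difflib module. This is not how any particular version control system is implemented; it only shows the kind of output such tools produce. The file names and text are hypothetical.

```python
import difflib

# Two versions of a short, made-up document, split into lines.
old_version = """Methods
We surveyed 100 people.
Results are below.
""".splitlines(keepends=True)

new_version = """Methods
We surveyed 120 people.
Results are below.
""".splitlines(keepends=True)

# Produce a unified diff: removed lines start with "-", added lines with "+".
diff_lines = list(difflib.unified_diff(
    old_version, new_version,
    fromfile="report_v1.md", tofile="report_v2.md",
))

print("".join(diff_lines))
```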

Simple

Text files can help avoid issues with file and data transfer. Because every markup and other character can be seen and modified, it is easier to diagnose issues with formatting. For data files it is not uncommon to have difficulty opening Excel files in statistical programs. The solution is to save the file as csv, a text format, so that it just holds the data. The only difference (assuming only values are there) is that each Excel sheet must be saved in its own csv file.

Text document files open faster than binary files, in part because they are smaller than their binary equivalents, sometimes by a lot. For data, binary files can be smaller than the text version, though not always, because they often contain additional information.

Text document files encourage focusing on the structure as opposed to the look because factors such as font are not specified directly. This offers the same advantages as using Styles in Microsoft Word, including formatting consistency and easier outlining.

Special Characters and Escaping

Because most text files are further interpreted, some standard characters actually have special meaning. In Markdown for example (which is being used to write this), using asterisks tells the interpreter to put a word in italics.

Displaying Special Characters

So, how would one display an asterisk itself?

  • Use the escape character: \* displays *
  • Use an escape sequence: &#42; displays *
  • Specify it is a raw string: `*` displays *

Escape Character

The escape character tells the software interpreter to change the default meaning of the next character. Thus, a special character will be treated as a standard character. It is most commonly the backslash \, so if there is an issue with a special character, try adding a backslash before it (I used a lot of seemingly-invisible backslashes for the table in the next section). The backslash looks like it is leaning back toward the beginning of the sentence. Having the backslash as the escape character is useful in some circumstances, but leads to issues in many contexts (see below). Here are some more examples:

Typed | Displayed
**escape** | escape (in bold)
\**escape\** | *escape*
\*\*escape\*\* | **escape**
\\escape\\ | \escape\

Escape Sequence

An escape sequence is a combination of characters that can be used to represent a single character. They usually start with one of a few escape characters, such as \, &, or %. It is common to utilize the character’s ASCII or Unicode value. For example, in HTML and XML these sequences start with an & and end with ;. Between, you can use the character name, the decimal code, or the equivalent hex code. See the table below for details.

Percent-encoding is used for special characters in URLs or other web applications, and is the percent followed by the ASCII value in hex. This is most often seen for the forward slash (%2F), the space (%20), or the percent sign itself (%25).

Character | Name | Decimal | Hex | URL
< | &lt; | &#60; | &#x3C; | %3C
& | &amp; | &#38; | &#x26; | %26
/ | &sol; | &#47; | &#x2F; | %2F
space | multiple | &#32; | &#x20; | %20
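You rarely need to look these codes up by hand. As a sketch, Python's standard library can produce and reverse both kinds of escape sequence: html.escape and html.unescape for the & style, and urllib.parse.quote and unquote for percent-encoding. The sample strings are made up for illustration.

```python
import html
from urllib.parse import quote, unquote

# HTML/XML style: special characters become &name; sequences.
escaped_html = html.escape("a < b & c")
print(escaped_html)

# The name, decimal, and hex forms all decode to the same character.
decoded = html.unescape("&lt; &#60; &#x3C;")
print(decoded)

# Percent-encoding for URLs. safe="" asks quote() to encode the slash too.
encoded_url = quote("my file/50%.txt", safe="")
print(encoded_url)
print(unquote(encoded_url))
```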

Raw String

A raw string is a feature of many programming languages to prevent any internal interpretation. In Markdown, formatting text as code will do the same, which can be done by surrounding it in backticks: `raw`. In Python and R (v4.0+), you put an “r” in front of a specially quoted string. Python will accept ordinary single or double quotes instead of triple quotes if you do not need to use that quote character inside the string (ex. r"Hi"), but R always requires the parentheses inside the quotes.

r"""This is a "raw" string that will print \exactly\\."""   # Python
r"(This is a "raw" string that will print \exactly\\.)"     # R (4.0+)

Printing either string (with print() in Python or cat() in R) displays:

This is a "raw" string that will print \exactly\\.

Issues with Specific Characters

There are several common issues related to backslashes and escaping in scripts.

Control Codes

The backslash actually works more like a switch. As above, if you put it before a character that is special, that character will be treated normally. But, the backslash is also used as part of escape sequences to give regular letters special meaning. It indicates markup in some document files, and is used in most programming languages (as far back as C) and in search patterns to represent otherwise-unseen control characters: a tab is \t and a line break is \n. Here is an example in Python showing the backslash used in both ways. In R, use the function cat() with the same string.

print("Hello!\nThis is an \"Expert\" Question:\tWhat does \\n mean?")

Once interpreted, this would print:

Hello!
This is an "Expert" Question:    What does \n mean?

Windows File Paths

File paths in Windows (the location of the file on the computer) are often a source of problems because Windows computers traditionally separate folder names with a backslash, the traditional escape character. Mac and Linux computers use the forward slash for file paths, which is not a problem. The interpreter sees \ as the start of an escape sequence and tries to interpret the next character accordingly, leading to errors. To specify file paths in programs like R or Python, you can:

  • Use Mac/Linux Style: "C:/Users/Name/Documents/Project"
  • Escape the Separator: "C:\\Users\\Name\\Documents\\Project"
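The two options can be checked in Python, along with the raw string approach described earlier. This is a small sketch: the folder names are placeholders, and PureWindowsPath from the built-in pathlib module is used only to compare the paths without touching the file system.

```python
from pathlib import PureWindowsPath

forward = "C:/Users/Name/Documents/Project"        # Mac/Linux style separators
escaped = "C:\\Users\\Name\\Documents\\Project"    # each backslash escaped
raw = r"C:\Users\Name\Documents\Project"           # raw string, no escaping needed

# The escaped and raw versions are the very same string once interpreted.
print(escaped)
print(escaped == raw)

# Windows treats forward and back slashes as the same separator.
same_path = PureWindowsPath(forward) == PureWindowsPath(escaped) == PureWindowsPath(raw)
print(same_path)
```

Writing the path as an ordinary string with single backslashes would fail here, because Python reads \U as the start of a Unicode escape sequence.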

Quotes

Quotes are often special characters in statistical programming because labels, messages, and string-type values must be surrounded by quotes. But, sometimes the content needs to include quotes as well. Most languages will accept either single or double-quoted strings (when matched). Thus, if you need to include a string with double-quotes, use outer single quotes and vice versa. Here are additional options for relevant software. Learn more about Escape Sequences from a data perspective.

Software | What to do | Example
Most software | Use the opposite quote type | 'Is it "good"?'
R or Python | Escape the quotes | "Is it \"good\"?"
R | Use a raw string | r"(Is it "good"?)"
SPSS, SAS, or Excel | Double the quotes | "Is it ""good""?"
Stata | Use compound quotes (open with backtick ` + double quote, close with double quote + single quote ') | `"Is it "good"?"'
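Here is a short Python sketch of the first few options. The last part shows the doubled-quote convention in practice: Python's csv module doubles quotes by default when it writes a value that contains them. The field name q1 is made up for this example.

```python
import csv
import io

opposite = 'Is it "good"?'       # outer single quotes, inner double quotes
escaped = "Is it \"good\"?"      # inner quotes escaped with backslashes
raw = r'Is it "good"?'           # raw string with the opposite quote type

# All three spellings produce exactly the same text.
all_equal = opposite == escaped == raw
print(all_equal)

# Writing the value to CSV wraps it in quotes and doubles the inner ones.
out = io.StringIO(newline="")
csv.writer(out).writerow(["q1", opposite])
csv_line = out.getvalue().strip()
print(csv_line)
```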

Working with Text Files

The first step is to make sure that you will not lose information in the transition. Check out the article on Data files for more specifics on that format.

Make sure you know how to open files in a text editor and/or change the default program to open the file.

Highlighting

It is common for people to use formatting like highlighting or other coloring to mark values or sections that need follow up. But, formatting like that should not be used with text files as it will not be retained. Instead, you can use characters or numbers that are distinctive and easy for you or the computer to find. A rule of thumb from the Carpentries is that if you can’t find it by Ctrl+F/Command+F, it will not work. See the articles on Data and Document file types for specific recommendations.
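As a sketch of the idea, the Python below flags rows with a distinctive marker and then finds them again. The marker FOLLOWUP: and the data are invented for this illustration; any string that never occurs in your real values would do.

```python
# Made-up data with a text marker in place of highlighting.
notes = """age,income,comment
34,52000,
29,,FOLLOWUP: income missing
41,61000,
57,990000,FOLLOWUP: check for extra zero
"""

MARKER = "FOLLOWUP:"   # hypothetical marker, easy to find with Ctrl+F too

# Collect the line number and content of every flagged line.
flagged = [
    (line_number, line)
    for line_number, line in enumerate(notes.splitlines(), start=1)
    if MARKER in line
]

for line_number, line in flagged:
    print(line_number, line)
```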

Save As

In order to use new document formats, it is important to know how to adjust the file types that programs are showing you.

More information on Long-Term Preservation and Retention

Open Files with the Non-Default Software

Operating systems are built to automatically open each file with the intended software, and do not specifically indicate whether a file is Text or Binary. But, it is a useful characteristic to know. You can search online for the file extension. Or, just try opening the file in a text or document editor. When doing so on a Windows computer, you may need to select “All Files” in the drop-down box on the lower right, as seen below.

[Image: Open File dialog box]
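If you would rather check from a script, the sketch below applies a rough rule of thumb: data with no NUL bytes that decodes as UTF-8 is probably text. This is only a heuristic I am assuming for illustration, not an official test, and it will misjudge text saved in other encodings. The function name looks_like_text is made up.

```python
def looks_like_text(data: bytes) -> bool:
    """Rough heuristic: no NUL bytes, and the bytes decode as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Sample inputs: the bytes of a small CSV file and the first bytes of a PNG image.
csv_bytes = "id,name\n1,Zoë\n".encode("utf-8")
png_header = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

print(looks_like_text(csv_bytes))
print(looks_like_text(png_header))
```

For a real file, you would pass in the bytes read with open(path, "rb").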