A Fast Method of Identifying Plain Text Files
=============================================

Introduction
------------

Given a file coming from an unknown source, it is generally impossible
to conclude automatically, and with 100% accuracy, whether that file is
a plain text file, without performing a heavy-duty semantic analysis of
its contents. It is, however, possible to obtain a fairly high degree
of accuracy by employing various simple heuristics.

Previous versions of the zip tools used a crude detection scheme,
originally employed by PKWare in its PKZip programs: if more than 80%
(4/5) of the bytes are within the range [7..127], the file is labeled
as plain text, otherwise it is labeled as binary. A prominent
limitation of this scheme is its restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes in the range [128..255], and texts using these alphabets are
frequently mis-identified by this scheme; in other words, the rate of
false negatives is high, which means the recall is low. Another
weakness of this scheme is reduced precision, due to the false
positives that occur when binary files containing large amounts of
textual data are mis-identified as plain text.

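For comparison, the older heuristic can be sketched in a few lines of
C. This is an illustration written for this article (the function name
and the buffer-based interface are ours), not the actual PKZip or zip
source:

    #include <stddef.h>

    /* Sketch of the old 80% heuristic: returns 1 (plain text) if more
       than 4/5 of the bytes fall within [7..127], 0 (binary) otherwise. */
    static int old_is_text(const unsigned char *buf, size_t len)
    {
        size_t n, in_range = 0;

        for (n = 0; n < len; n++)
            if (buf[n] >= 7 && buf[n] <= 127)
                in_range++;

        return len > 0 && in_range * 5 > len * 4;
    }
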
In this article we propose a new detection scheme, with much increased
accuracy and precision, and a near-100% recall. This scheme is designed
to work on ASCII and ASCII-derived alphabets, and it handles
single-byte alphabets (ISO-8859, OEM, KOI-8, etc.) and variable-sized
alphabets (DBCS, UTF-8, etc.). However, it cannot handle alphabets
built on multi-byte code units, such as UCS-2, UCS-4 or UTF-16, whose
byte streams routinely contain NUL and other control bytes even for
purely textual content. The principle used by this scheme can easily
be adapted to non-ASCII alphabets like EBCDIC.

The Algorithm
-------------

The algorithm works by dividing the set of byte values [0..255] into
three categories:

  - The white list of textual bytecodes:
      9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
  - The gray list of tolerated bytecodes:
      7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
  - The black list of undesired, non-textual bytecodes:
      0 (NUL) to 6, 14 to 25, and 28 to 31.

If a file contains at least one byte that belongs to the white list,
and no byte that belongs to the black list, then the file is
categorized as plain text. Otherwise, it is categorized as binary.
(Bytes from the gray list may appear, but they neither qualify nor
disqualify a file.)

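In C, this rule can be transcribed almost literally. The following is
a sketch written for this article, following the categorization above;
the function names and the buffer-based interface are ours, not the
exact code shipped in the zip tools:

    #include <stddef.h>

    enum { BLACK, GRAY, WHITE };

    /* Categorize a single byte value according to the three lists. */
    static int category(unsigned char c)
    {
        if (c >= 32 || c == 9 || c == 10 || c == 13)
            return WHITE;
        if (c == 7 || c == 8 || c == 11 || c == 12 || c == 26 || c == 27)
            return GRAY;
        return BLACK;   /* 0..6, 14..25, 28..31 */
    }

    /* Returns 1 if buf[0..len-1] is plain text: at least one
       white-listed byte and no black-listed byte.  An empty buffer
       yields 0 (binary). */
    static int is_plain_text(const unsigned char *buf, size_t len)
    {
        size_t n;
        int seen_white = 0;

        for (n = 0; n < len; n++) {
            int cat = category(buf[n]);
            if (cat == BLACK)
                return 0;       /* a single black byte suffices */
            if (cat == WHITE)
                seen_white = 1;
            /* gray bytes neither qualify nor disqualify */
        }
        return seen_white;
    }

Note that the scan can stop as soon as a black-listed byte is seen, so
genuinely binary files are typically rejected after reading only a few
bytes.
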
Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice. The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF), and 13 (CR). There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text. Even the newer, portable text formats, such as
XML, avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL). Even though the older text
detection schemes observe the presence of non-ASCII codes from the
range [128..255], the precision rarely suffers if this upper range is
labeled as textual, because files that are genuinely binary tend to
contain both control characters and codes from the upper range. On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions. In particular, this range
is heavily used to encode non-Latin scripts.

Given these two observations, the plain text detection algorithm
becomes straightforward. There must be at least some printable
material, or some portable whitespace such as TAB, CR or LF; otherwise
the file is not labeled as plain text. (The boundary case of an empty
file, which contains no white-listed byte, thus falls into the binary
category.) At the same time, there must be no non-portable control
characters; otherwise it is very likely that the intended reader of
the file is a machine, rather than a human.

Since there is no counting involved, other than simply observing the
presence or absence of some byte values, the algorithm produces
uniform results on any particular text file, no matter what alphabet
encoding is used for that text. (In contrast, if counting were
involved, it would be possible to obtain different results on a text
encoded, say, using ISO-8859-2 versus UTF-8.) There remains the
category of plain text files that are "polluted" with one or a few
black-listed codes, either by mistake or by peculiar design
considerations. In such cases, a scheme that tolerates a small
percentage of black-listed codes would provide an increased recall
(i.e. more true positives). This, however, incurs a reduced precision,
since false positives are also more likely to appear in binary files
that contain large chunks of textual data. "Polluted" plain text may,
in fact, be regarded as binary, on which text conversions should not
be performed. Under this premise, it is safe to say that the detection
method provides a near-100% recall.

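To illustrate the encoding-uniformity point, consider the word "über"
followed by a newline: in ISO-8859-2 it is the byte sequence
FC 62 65 72 0A, and in UTF-8 it is C3 BC 62 65 72 0A. Every byte in
both sequences is white-listed, so the is_plain_text() sketch above
labels the two encodings identically. (The byte values are the
standard ones for these encodings; the test harness itself is ours.)

    #include <assert.h>

    /* Assumes the is_plain_text() sketch from the previous section. */
    int main(void)
    {
        static const unsigned char latin2[] =
            { 0xFC, 'b', 'e', 'r', '\n' };          /* ISO-8859-2 */
        static const unsigned char utf8[] =
            { 0xC3, 0xBC, 'b', 'e', 'r', '\n' };    /* UTF-8 */

        assert(is_plain_text(latin2, sizeof latin2) == 1);
        assert(is_plain_text(utf8, sizeof utf8) == 1);
        return 0;
    }
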
Experiments have been run on a large set of files of various
categories, including plain old texts, system logs, source code,
formatted office documents, compiled object code, etcetera. The
results confirm the optimistic assumptions about the high accuracy,
precision and recall offered by this algorithm.

--
Cosmin Truta
Last updated: 2005-Feb-27