A Fast Method of Identifying Plain Text Files
=============================================


Introduction
------------

Given a file coming from an unknown source, it is generally impossible
to conclude automatically, and with 100% accuracy, whether that file is
a plain text file, without performing a heavy-duty semantic analysis of
the file contents. It is, however, possible to obtain a fairly high
degree of accuracy by employing various simple heuristics.

Previous versions of the zip tools used a crude detection scheme,
originally employed by PKWare in its PKZip programs: if more than 80%
(4/5) of the bytes are within the range [7..127], the file is labeled
as plain text; otherwise it is labeled as binary. A prominent
limitation of this scheme is its restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often mis-identified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low. Another weakness of this scheme is a reduced precision, due to
the false positives that may occur when binary files containing a large
amount of textual characters are mis-identified as plain text.

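For reference, here is a minimal sketch of the old scheme in C; the
function name and the exact threshold comparison are illustrative
assumptions, not a verbatim copy of PKZip's code:

    #include <stddef.h>

    /* Old-style heuristic (illustrative sketch): label the buffer as
     * plain text if more than 4/5 of its bytes fall within [7..127]. */
    static int is_text_old_scheme(const unsigned char *buf, size_t len)
    {
        size_t textual = 0;
        size_t i;

        for (i = 0; i < len; i++)
            if (buf[i] >= 7 && buf[i] <= 127)
                textual++;
        return len > 0 && textual * 5 > len * 4;  /* strictly > 80% */
    }
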
In this article we propose a new detection scheme, with much increased
accuracy and precision, and a near-100% recall. This scheme is designed
to work on ASCII and ASCII-derived alphabets: it handles single-byte
encodings (ISO-8859, OEM, KOI-8, etc.) and variable-size encodings
(DBCS, UTF-8, etc.). However, it cannot handle fixed-size, multi-byte
encodings (UCS-2, UCS-4), nor UTF-16. The principle used by this scheme
can easily be adapted to non-ASCII alphabets like EBCDIC.


The Algorithm
-------------

The algorithm works by dividing the set of byte values [0..255] into
three categories:

- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 25, and 28 to 31.

If a file contains at least one byte that belongs to the white list and
no byte that belongs to the black list, then the file is categorized as
plain text; otherwise, it is categorized as binary.

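A minimal sketch of this rule in C follows; the function names are
illustrative assumptions, not taken from any particular implementation:

    #include <stddef.h>

    enum { BLACK, GRAY, WHITE };  /* byte categories */

    /* Classify one byte according to the three lists above. */
    static int category(unsigned char c)
    {
        if (c == 9 || c == 10 || c == 13 || c >= 32)
            return WHITE;                /* TAB, LF, CR, SPACE..255 */
        if (c == 7 || c == 8 || c == 11 || c == 12 || c == 26 || c == 27)
            return GRAY;                 /* BEL, BS, VT, FF, SUB, ESC */
        return BLACK;                    /* remaining control codes */
    }

    /* Return 1 if the buffer is plain text: at least one white-listed
     * byte and no black-listed byte.  An empty buffer is binary. */
    static int is_text(const unsigned char *buf, size_t len)
    {
        int seen_white = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            int cat = category(buf[i]);
            if (cat == BLACK)
                return 0;
            if (cat == WHITE)
                seen_white = 1;
        }
        return seen_white;
    }

A 256-entry lookup table mapping each byte to its category is a common
alternative to the chained comparisons above; the result is the same
either way.
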

Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice. The only
widely used, almost universally portable control codes are 9 (TAB),
10 (LF), and 13 (CR). There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text. Even the newer, portable text formats, such as
XML, avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL). Even though older text
detection schemes observe the presence of non-ASCII codes from the
range [128..255], precision rarely suffers if this upper range is
labeled as textual, because files that are genuinely binary tend to
contain both control characters and codes from the upper range. On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions. In particular, this range
is heavily used to encode non-Latin scripts.

Given these two observations, the plain text detection algorithm
becomes straightforward. There must be at least some printable
material, or some portable whitespace such as TAB, CR or LF; otherwise
the file is not labeled as plain text. (The boundary case, when the
file is empty, automatically falls into this category.) However, there
must be no non-portable control characters; otherwise it is very likely
that the intended reader of the file is a machine rather than a human.

Since there is no counting involved, other than simply observing the
presence or absence of certain byte values, the algorithm produces
uniform results on any particular text file, no matter what alphabet
encoding is used for that text. (In contrast, if counting were
involved, it would be possible to obtain different results on the same
text encoded, say, in ISO-8859-2 versus UTF-8.)

There is a category of plain text files that are "polluted" with one or
a few black-listed codes, either by mistake or by peculiar design
considerations. In such cases, a scheme that tolerates a small
percentage of black-listed codes would provide an increased recall
(i.e. more true positives). This, however, incurs a reduced precision,
since false positives are also more likely to appear in binary files
that contain large chunks of textual data. "Polluted" plain text may,
in fact, be regarded as binary, on which text conversions should not be
performed. Under this premise, it is safe to say that the detection
method provides a near-100% recall.

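To make the trade-off concrete, a tolerant variant could be sketched as
follows; the 1% threshold is an arbitrary assumption chosen purely for
illustration, and is not part of the scheme proposed here:

    /* Tolerant variant (illustrative only): accept the buffer as text
     * if it has at least one white-listed byte and black-listed bytes
     * make up at most 1% of the total.  Reuses category() from the
     * sketch above; the 1% threshold is an assumed example value. */
    static int is_text_tolerant(const unsigned char *buf, size_t len)
    {
        size_t black = 0;
        int seen_white = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            int cat = category(buf[i]);
            if (cat == BLACK)
                black++;
            else if (cat == WHITE)
                seen_white = 1;
        }
        return seen_white && black * 100 <= len;
    }
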
Experiments have been run on a large set of files of various
categories, including plain texts, system logs, source code, formatted
office documents, compiled object code, etc. The results confirm the
optimistic assumptions about the high accuracy, precision and recall
offered by this algorithm.

--
Cosmin Truta
Last updated: 2005-Feb-27