File Encoding — A Short Primer

This page explains one idea you meet in Lab 2: a text file can be perfectly good data and still refuse to open correctly, because of its encoding. It is a five-minute read.

What an encoding is

A text file is not really stored as letters. It is stored as bytes — numbers. An encoding is the rulebook that maps those bytes back to characters you can read. Open a file with the wrong rulebook and the text comes out wrong: garbled, spaced out with gaps, or it fails to open at all. The data was fine; the rulebook was wrong.
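
A minimal sketch of the idea, using a made-up one-word sample rather than the lab file: the same text becomes different bytes under different rulebooks, and only the matching rulebook turns those bytes back into the text.

```python
# The same text stored under different rulebooks (encodings).
text = "Company"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")   # little-endian UTF-16, no BOM
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)                          # one byte per letter for plain English text
print(utf16_bytes)                         # two bytes per letter
print(len(utf8_bytes), len(utf16_bytes))

# Decoding with the matching rulebook returns the original text.
round_trip = utf16_bytes.decode("utf-16-le")
print(round_trip)
```

For plain English letters, UTF-8 and Latin-1 happen to produce identical bytes, which is why a wrong guess between those two can go unnoticed until an accented character appears.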

The encodings you will meet

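The encodings named in this primer are UTF-8, UTF-16 (including its utf-16-le variant), and single-byte encodings such as Latin-1. They are not interchangeable. A file written as UTF-16 has to be read as UTF-16.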

The BOM — why the first line can look strange

Some files begin with a few invisible bytes called a byte-order mark (BOM) that announce the encoding. If your editor reads the file with the wrong rulebook, that BOM is what shows up as odd symbols at the very start of the first line. It is not corruption — it is the file telling you what it is.
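
To make the BOM visible, the sketch below encodes a sample word (not the lab file) as UTF-16 and looks at the first two bytes. It relies on Python's "utf-16" codec, which writes a BOM when encoding and strips it when decoding.

```python
import codecs

# Encoding as "utf-16" (no -le/-be suffix) writes a BOM first,
# in the byte order of the machine running the code.
data = "Company".encode("utf-16")
bom = data[:2]
has_bom = bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(bom, has_bom)

# Decoding as "utf-16" consumes the BOM, so it never appears in the text.
clean = data.decode("utf-16")
print(repr(clean))

# Decoding with a wrong single-byte rulebook leaves the BOM bytes
# visible as two stray symbols at the start of the text.
wrong = data.decode("latin-1")
print(repr(wrong[:2]))
```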

Why it breaks an import

In Lab 2, JEA Detail.txt is UTF-16. A tool that assumes UTF-8 hits the very first bytes, cannot make sense of them, and stops before it has read a single row of data. Read the same file with a wrong single-byte encoding instead and the text comes back with a gap between every pair of characters (C o m p a n y), because UTF-16 stores each of these letters as two bytes and the single-byte rulebook turns the extra byte into a blank character of its own. The file is fine. The encoding was guessed wrong.
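
Both failure modes can be reproduced without the lab file. The sketch below uses an invented header row as a stand-in for the first bytes of a UTF-16 file; the column names are assumptions for illustration only.

```python
# Stand-in for the start of a UTF-16 file (hypothetical header row).
raw = "Company\tAmount\n".encode("utf-16")

# Wrong guess 1: UTF-8. The BOM bytes are not valid UTF-8,
# so decoding fails at the very first byte.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
print(utf8_error)

# Wrong guess 2: a single-byte encoding. Nothing fails, but every second
# byte becomes a NUL character, which shows up as a gap between letters.
spaced = raw.decode("latin-1")
print(repr(spaced))

# Right rulebook: the text comes back intact.
correct = raw.decode("utf-16")
print(repr(correct))
```

The second case is the more dangerous one in a script, because no error is raised and the damaged text flows on into later steps.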

How to handle it

  1. Look first. Open the file in a text editor — it tells you the encoding. VS Code shows it in the blue bar at the bottom-right; Notepad shows it in the status bar at the bottom.
  2. In a Python script, state the encoding instead of relying on the default (pandas assumes UTF-8 when you do not say otherwise):
    import pandas as pd
    df = pd.read_csv("JEA Detail.txt", sep="\t", encoding="utf-16")
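    If utf-16 does not work, try utf-16-le, then utf-8, then latin-1, and check the first rows after each attempt. latin-1 never raises an error, so a load that succeeds is not proof that the text is right.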
  3. In Excel Power Query, the import preview has a File Origin box, which is the encoding setting. Power Query usually detects it for you automatically.

The Lab 2 point. Power Query detects the encoding for you. A Python script does not: it does only what you tell it, and it will not guess. So when you direct an AI to write a script, the encoding is a requirement your specification has to name out loud. That is the whole lesson in one detail.
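
If a script has to "look first" on its own, one option is to read the first few bytes and check for a BOM. This is a sketch under stated assumptions, not part of the lab: sniff_encoding is a hypothetical helper name, it only recognises the UTF-8 and UTF-16 BOMs, and it returns None for files that have no BOM at all.

```python
import codecs

# BOMs this sketch knows about, paired with the encoding name to use.
# Limitation: a UTF-32 little-endian file starts with the same two bytes
# as UTF-16 little-endian and would be misreported here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(path):
    """Return the encoding suggested by the file's BOM, or None if there is no BOM."""
    with open(path, "rb") as f:   # binary mode: look at raw bytes, decode nothing
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

The returned name could then be passed as the encoding argument to pd.read_csv, as in step 2. When the helper returns None, the script still needs an encoding named in the specification, since the absence of a BOM says nothing about which rulebook was used.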
