pandas — A Short Primer

In Lab 2 you direct an AI to build a Python script. That script is written with pandas — the standard Python library for working with tables of data. You are not asked to write pandas code, and you are not graded on it. But you should be able to recognize what the script is doing, because Part 3 asks you to explain it. This page is the reference for that.

1. What pandas is

pandas does for a Python script what Excel does for you on screen: it loads a table, cleans it, reshapes it, and summarizes it. The difference is that pandas does it through written instructions rather than clicks — which is exactly why a specification matters. Excel's Power Query quietly made good guesses for you in Part 1. pandas does only what the script says.

2. The DataFrame — a table in memory

The one object to know is the DataFrame. It is a table: rows and named columns, held in memory while the script runs. By convention it is called df.

import pandas as pd

df = pd.read_csv("JEA Detail.txt", sep="\t", encoding="utf-16")

After that line, df holds the file as a table. Every later step is something done to df.

3. Reading the file — the arguments that matter

pd.read_csv reads a delimited text file. Most of the work in Lab 2 is telling it how to read this file correctly:

ArgumentWhat it does in this lab
sep="\t"The file is tab-separated, not comma-separated. Without this, every row loads as one long column.
encoding="utf-16"The file is UTF-16, not the usual UTF-8. Get this wrong and the read fails before a single row loads. This is the detail your specification has to name.
skiprows= / header=Tells pandas how many lines at the top are report header, and which line holds the real column names.

A script can read the file from your computer or straight from a web address — pandas treats both the same way:

url = "https://raw.githubusercontent.com/sean-mccaman/acctg5150-090/main/week-02/JEA%20Detail.txt"
df = pd.read_csv(url, sep="\t", encoding="utf-16")

4. The cleaning steps

Each cleaning step you did by hand in Part 1 has a pandas equivalent. You do not need the exact syntax — you need to recognize the move.

Drop the junk rows

The report header at the top and the page-break rows in the middle are not data. The script removes them — by skipping the top rows on read, and by filtering out rows where the data columns are empty.

Fix the Amount column

The amounts arrive as text like " 50,000.00 " — quotes, spaces, and a thousands comma. The script strips those characters and converts the column to a real number, so it can be summed and compared.

df["Amount"] = df["Amount"].str.replace(",", "").str.strip().astype(float)

That one line chains four small operations. Reading it left to right:

PieceWhat it does
.str The string accessor. It tells pandas "the next operation is a text operation — apply it to every value in this column at once." The methods after it only work because .str put them in text mode.
.replace(",", "") Find-and-replace on the text. It replaces every comma with "" — an empty string, meaning nothing. So 50,000.00 becomes 50000.00: the thousands comma is removed.
.strip() Removes whitespace — spaces, tabs, line breaks — from the start and end of each value, never the middle. So " 50000.00 " becomes "50000.00": the padding spaces are gone.
.astype(float) Converts the cleaned text into an actual number (a float — a number that can have a decimal point). Only now can the column be summed or averaged. The text "50000.00" cannot be added up; the number 50000.0 can.

Order matters here: the commas and spaces must come out before .astype(float) runs, because converting to a number fails on text that still contains a comma or a stray space.

Clean the user column

Some values in the user column begin with an invisible non-printing character — a stray line break. The fix is the same .strip() you just met:

df["ID"] = df["ID"].str.strip()

A line break counts as whitespace, so .strip() removes it from the front of each value — turning a broken-looking "\nBeanCounter25" back into a plain "BeanCounter25" that groups correctly when the script counts entries per user.

5. Looking at the data at scale

Once the table is clean, a few short commands describe the whole dataset at once. This is the part to notice: pandas turns 23,000-plus rows into a handful of numbers instantly.

CommandAnswers the question
df.shapeHow many rows and columns are there?
df.head()What do the first few rows look like?
df["ID"].nunique()How many distinct users are there?
df["ID"].value_counts()How many journal entries does each user have?
df["Amount"].sum()What do all the amounts add up to?
df.describe()Count, mean, min, max for every numeric column at once.

value_counts() is the one that produces your Lab 2 result: the count of journal entries per user — the same table you built by hand with a PivotTable in Part 1.

6. Writing the result out

A printed result disappears when the script ends. To keep it — and to have a file you can open, check, and hand to the validator — the script writes a CSV:

df.to_csv("lab_02_cleaned.csv", index=False)
counts.to_csv("lab_02_user_counts.csv")
Why a CSV. A CSV is a plain, inspectable file. You can open it, compare it to your by-hand result, and run it through the Output Validator. An analysis you cannot open is an analysis you cannot check.
The point of this page. You are not expected to write any of this. You are expected to recognize it — to read your script and see "this line loads the file, this line fixes the amounts, this line counts per user." When you can do that, Part 3 is straightforward.

Related: encoding primer · Lab 2 Output Validator · back to the Lab 2 page