In Lab 2 you direct an AI to build a Python script. That script is written with pandas — the standard Python library for working with tables of data. You are not asked to write pandas code, and you are not graded on it. But you should be able to recognize what the script is doing, because Part 3 asks you to explain it. This page is the reference for that.
pandas does for a Python script what Excel does for you on screen: it loads a table, cleans it, reshapes it, and summarizes it. The difference is that pandas does it through written instructions rather than clicks — which is exactly why a specification matters. Excel's Power Query quietly made good guesses for you in Part 1. pandas does only what the script says.
The one object to know is the DataFrame. It is a table:
rows and named columns, held in memory while the script runs. By convention it
is called df.
import pandas as pd
df = pd.read_csv("JEA Detail.txt", sep="\t", encoding="utf-16")
After that line, df holds the file as a table. Every later step
is something done to df.
pd.read_csv reads a delimited text file. Most of the work in
Lab 2 is telling it how to read this file correctly:
| Argument | What it does in this lab |
|---|---|
sep="\t" | The file is tab-separated, not comma-separated. Without this, every row loads as one long column. |
encoding="utf-16" | The file is UTF-16, not the usual UTF-8. Get this wrong and the read fails before a single row loads. This is the detail your specification has to name. |
skiprows= / header= | Tells pandas how many lines at the top are report header, and which line holds the real column names. |
A script can read the file from your computer or straight from a web address — pandas treats both the same way:
url = "https://raw.githubusercontent.com/sean-mccaman/acctg5150-090/main/week-02/JEA%20Detail.txt"
df = pd.read_csv(url, sep="\t", encoding="utf-16")
Each cleaning step you did by hand in Part 1 has a pandas equivalent. You do not need the exact syntax — you need to recognize the move.
The report header at the top and the page-break rows in the middle are not data. The script removes them — by skipping the top rows on read, and by filtering out rows where the data columns are empty.
The amounts arrive as text like " 50,000.00 " — quotes, spaces,
and a thousands comma. The script strips those characters and converts the
column to a real number, so it can be summed and compared.
df["Amount"] = df["Amount"].str.replace(",", "").str.strip().astype(float)
That one line chains four small operations. Reading it left to right:
| Piece | What it does |
|---|---|
.str |
The string accessor. It tells pandas "the next operation is a text operation — apply it to every value in this column at once." The methods after it only work because .str put them in text mode. |
.replace(",", "") |
Find-and-replace on the text. It replaces every comma with "" — an empty string, meaning nothing. So 50,000.00 becomes 50000.00: the thousands comma is removed. |
.strip() |
Removes whitespace — spaces, tabs, line breaks — from the start and end of each value, never the middle. So " 50000.00 " becomes "50000.00": the padding spaces are gone. |
.astype(float) |
Converts the cleaned text into an actual number (a float — a number that can have a decimal point). Only now can the column be summed or averaged. The text "50000.00" cannot be added up; the number 50000.0 can. |
Order matters here: the commas and spaces must come out before
.astype(float) runs, because converting to a number fails on text
that still contains a comma or a stray space.
Some values in the user column begin with an invisible non-printing character
— a stray line break. The fix is the same .strip() you just met:
df["ID"] = df["ID"].str.strip()
A line break counts as whitespace, so .strip() removes it from
the front of each value — turning a broken-looking "\nBeanCounter25"
back into a plain "BeanCounter25" that groups correctly when the
script counts entries per user.
Once the table is clean, a few short commands describe the whole dataset at once. This is the part to notice: pandas turns 23,000-plus rows into a handful of numbers instantly.
| Command | Answers the question |
|---|---|
df.shape | How many rows and columns are there? |
df.head() | What do the first few rows look like? |
df["ID"].nunique() | How many distinct users are there? |
df["ID"].value_counts() | How many journal entries does each user have? |
df["Amount"].sum() | What do all the amounts add up to? |
df.describe() | Count, mean, min, max for every numeric column at once. |
value_counts() is the one that produces your Lab 2 result: the
count of journal entries per user — the same table you built by hand with a
PivotTable in Part 1.
A printed result disappears when the script ends. To keep it — and to have a file you can open, check, and hand to the validator — the script writes a CSV:
df.to_csv("lab_02_cleaned.csv", index=False)
counts.to_csv("lab_02_user_counts.csv")
Related: encoding primer · Lab 2 Output Validator · back to the Lab 2 page