Lab 2 — Direct an AI to Clean a Dataset

Part 1 — Clean it yourself in Excel 25 pts

You will import the messy JEA Detail.txt into Excel, strip out everything that is not real data, and end with a clean ten-column table plus a count of journal entries for each user. Follow the steps as written — that exactness is the point. Power Query may be new to you. That is expected — the steps below assume no prior experience with it; do them in order. The Week 2 webinar also includes a short Power Query walkthrough and lab hints.

Look at the file first

Before Excel, open JEA Detail.txt in a plain-text editor — VS Code if you have it (preferred), otherwise Notepad on Windows or TextEdit on Mac. Just look. You are checking two things.

The file will look strange — that is the point, not a problem. Every letter may appear spread out with spaces between it, and the first line may begin with odd symbols. The file is not broken and your download is fine. That strangeness is the encoding issue you are here to see.

What is wrong with the data — see the list below.
The file's encoding. Every text file has an encoding — the system that turns stored bytes into readable characters. Your editor names it: in VS Code, in the blue bar at the bottom-right; in Notepad, in the status bar at the bottom, the second item from the right. You are looking for a short label such as UTF-16 LE, UTF-8, or ANSI. Write down exactly the one your editor shows — this file is not the usual UTF-8. You will need it in Part 2, and an import that assumes the wrong encoding fails before it reads a single row. New to the idea of encodings? Read the short encoding primer posted in this Week 2 module.

The file is a tab-separated export with several deliberate problems:

A report header. Several lines at the top — Company XYZ, REPORT DATE, GENERAL LEDGER DETAIL, a row of ==== — before the real data starts.
A two-line column header. The column names are split across two rows: an upper line carrying the longer labels (Account, Post, User, Transaction), and a lower, complete line (Account, Category, Date, Period, ID, and the rest). Only the lower line is the one to keep.
Page-break rows. A handful of times in the data, the export inserts two non-data rows where a printed report would break to a new page — a row of ==== and a row reading PAGE 0002 (and so on). They sit scattered in the data and must come out.
Quoted, spaced amounts. The Amount values are wrapped in quotes and padding spaces, with a thousands comma: " 50,000.00 " instead of a clean number.
A hidden character in the ID column. Some ID cells begin with an invisible non-printing character — a stray line break. The value displays as blank or pushed out of place, so the user looks like missing data. Excel's CLEAN function removes these characters; TRIM does not, because TRIM only removes spaces.

Underneath all of that are ten real columns: Account, Category, Date, Period, ID, Manual, Description, Number, Location, Amount.

No Excel? Read this first.

Excel is the intended tool for Part 1. If you do not have it, Google Sheets is an allowed substitute — use File → Import → Upload, choose tab as the separator, and follow the same logic (delete the header lines, delete the repeating page-break rows, keep the ten real columns, count entries per user). The screenshots will look slightly different but the task is identical. You can also check out a laptop with Excel from the University Library.

1.1 — Open Power Query and point it at the file

In Excel, go to the Data tab and choose Get Data → From File → From Text/CSV. This opens Power Query, Excel's import-and-transform tool.

Excel Data tab with the Get Data menu open, showing the From File option used to launch Power Query. — 1.1 — Launch Power Query from the Data tab.

1.2 — Select JEA Detail.txt

Browse to the JEA Detail.txt file you downloaded and open it.

The file picker dialog with JEA Detail.txt selected, ready to open in Power Query. — 1.2 — Choose the JEA Detail.txt file.

1.3 — Check the import preview

Power Query shows a preview. Confirm the delimiter is set to Tab. Notice the File Origin box near the top of the preview — that is the encoding, and Power Query detected it for you automatically. Remember that it had to be set; in Part 2 nothing will detect it for you. You will see the messy header lines at the top — that is expected. Click Transform Data (not Load) to open the Power Query editor.

The Power Query import preview window showing JEA Detail.txt newly opened with the tab delimiter and the messy report header visible at the top. — 1.3 — The import preview. The report header is still there; the File Origin box shows the detected encoding.

1.4 — See the raw data in the editor

The Power Query editor now shows every row, including the junk. The first several rows are the report header; somewhere below them the real column headers and data begin.

The Power Query editor with the raw JEA Detail.txt data loaded, report header rows still present above the real table. — 1.4 — Data loaded into the Power Query editor, still messy.

1.5 — Remove the report header and the extra header line

Two things sit above the real data: the report header, and then the upper of the file's two header lines. On the Home tab, use Remove Rows → Remove Top Rows and remove every row down to — but not including — the lower, complete column-header line, the one that reads Account, Category, Date, Period, ID, Manual, Description, Number, Location, Amount. Leave that complete line as the first row.

The Remove Top Rows dialog in Power Query, used to delete the report header lines above the real data. — 1.5 — Remove Top Rows drops the report header.

1.6 — Promote the header row

The first remaining row is now the complete column-header line. Use Use First Row as Headers so Account, Category, ID, Amount and the rest become actual column names instead of data.

Power Query after promoting the first data row to column headers, so the ten real column names appear as headers. — 1.6 — Use First Row as Headers promotes the real header line.

1.7 — Remove the page-break rows, fix the columns, and load

The page-break rows are scattered through the data — each is a row of ==== or a row reading PAGE and a number, with the rest of that row empty. The simplest way to drop them: pick a column that a real journal entry always fills — Category works well — and Filter it to remove the blank (null) rows. The page-break rows have nothing in that column, so they fall out; real entries stay. Then set each column's data type: click the small data-type icon to the left of the column name in its header, and choose the type — Amount to Decimal Number, the date column to Date. Setting Amount to a number is the step that strips the quotes, spaces, and comma from values like " 50,000.00 ", so do not skip it. When the step list on the right looks complete, click Close & Load.

The Power Query editor showing the full Applied Steps list and the cleaned ten-column journal-entry table with headers in row one. — 1.7 — The completed Applied Steps list and the cleaned ten-column table.

Look at your Applied Steps list. Every fix you just made is one line in that list on the right — remove top rows, promote headers, filter, change type. That list is a specification: an ordered, exact set of operations. Excel gave the same result each time because the steps were exact. Hold onto that idea — in Part 2 you write the same list in words.

1.8 — Count the journal entries for each user

On your cleaned table, you now need a count of how many journal entries belong to each user — the ID column. The straightforward way is a PivotTable: select your table, choose Insert → PivotTable, put ID in the Rows area and ID again in the Values area (it will summarize as Count). You get one row per user with a count beside it.

The hidden character will break a count row. One user's ID cells begin with an invisible non-printing character. In the PivotTable that user shows up with a blank or broken-looking label — the count beside it is right, but you cannot tell whose it is. Clean the ID column first (Excel's CLEAN function removes non-printing characters; TRIM alone will not, because TRIM only removes spaces). Note this trap — your Part 2 specification has to mention it.

New to PivotTables? There is a short practice tool in this Week 2 module, and the Week 2 lecture covers them.

Deliverable for Part 1 — your cleaned workbook. Save your Excel workbook with two things in it: the cleaned ten-column table (headers in row 1, the first several data rows showing), and the per-user count (each ID and its number of journal entries). You will upload that workbook file (.xlsx) to the Lab 2 quiz — if you used Google Sheets, download it first with File → Download → Microsoft Excel (.xlsx). Keep the per-user count where you can see it: that small table is the by-hand result you hold the AI's work against in Part 2.

Part 1 done when: you have a clean ten-column table (no report header, no page-break rows, amounts as real numbers) and a correct count of journal entries per ID — both saved in the workbook you will upload to the Lab 2 quiz.

Part 2 — Direct an AI to do the same job 15 + 25 pts

Now you get the same result a second way: you write a specification, and an AI builds a Python script that carries it out. You are not going to write Python, and you are not expected to read it. Your job is the specification — the precise instructions that make the AI build the right tool.

2.1 — Write your specification

Part 2 works in two stages: first everyone writes the specification, then you pick one of two flows to turn it into a working script. Start with the specification — open the Lab 2 Specification Template (linked here and in the Week 2 module). It is built on the same pattern as your Week 1 spec document: a file manifest, the Build Requirements, review / acceptance criteria, the grading rubric, and a self-assessment gate. The template marks which parts are already filled in for you and which you write.

The part you write — Build Requirements — is the centerpiece of the lab, the piece graded on precision. It has four parts, each mapping to something you already did in Part 1:

Input — what the file is. A tab-separated text export of journal-entry detail. State its encoding (the one you wrote down in Part 1) and name what is wrong with it: the report header, the two-line column header, the page-break rows, the quoted and spaced amounts, the non-printing character in the ID column.
Operations — the cleaning steps, in order. Describe to an AI what you did by hand in Part 1, have it draft a numbered step-by-step, then correct it: load the file with the right encoding, skip the report header and the upper header line, promote the complete header row, drop the page-break rows, turn the amounts into real numbers, strip the non-printing character out of the ID column, then count the journal entries per user. Order matters — say which comes first.
Edge cases — the specific traps, named exactly: the encoding, the two-line header (say which line to keep), the page-break rows (the ==== and PAGE lines), the quoted-and-spaced Amount values, the non-printing character in the ID column. For each, say what should happen.
Expected output — already filled in for you: the ten clean columns, a printed descriptive summary, and the two files the script writes (lab_02_cleaned.csv and lab_02_user_counts.csv).

The template's Review / Acceptance Criteria are given too — the script's per-user counts should match your Part 1 by-hand result, and the cleaned CSV should pass the Output Validator. Before you submit, work the template's Self-assessment gate.

This is the heart of the lab. Power Query quietly detected the file's encoding for you. A Python script will not — it does exactly what it is told and nothing more. Every problem Excel handled for you is now something your specification has to say out loud. A vague specification ("clean the file") produces a script that does the wrong thing, confidently. A precise one produces a script you can trust. The difference is the skill this lab teaches.

Save your work. Write your specification in your own copy of the template and save the file as you go — not only in a browser tab. You will upload the finished specification to the Lab 2 quiz and also hand its text to the AI in the next step.

2.2 — Choose your flow

Two routes turn your specification into a working script. They reach the same place — a working lab_02_analysis.py and the two CSV files — so pick the one that matches what you have on your computer. The prompting steps in 2.3 are the same in both.

Flow A — Google Colab (no Python install needed)

Use this if you do not have Python set up on your computer. Everything runs in your browser; you need a free Google account.

Open Google Colab and start a new notebook: File → New notebook.
Direct an AI to build the script — the three moves in step 2.3 below. Colab has a built-in AI assistant; if it is not available on your account, use any AI assistant (ChatGPT, Claude, or Gemini) and bring the script back.
Paste the finished script into a code cell and run it (the round run button, or Shift+Enter). The data file is read straight from the web — there is no upload step.

Flow B — Codex or ChatGPT + local Python

Use this if you already have Python installed and want to work on your own machine, with a terminal agent such as Codex or with the ChatGPT app.

Make a folder for the lab, and save your specification into it as a text file.
Direct the AI to build lab_02_analysis.py in that folder from your specification.
Run it from a terminal: python lab_02_analysis.py. The script reads the data file from the web, or you can download JEA Detail.txt into the folder.
Iterate: when you run the script, paste its output — or any error — back to the AI so it can refine the script against your specification.

A heads-up on Codex. Codex works well here, but it can use up the credits on your account quickly. If that is a concern, Flow A (Colab) is free.

2.3 — Direct the AI to build the script

Whichever flow you chose, directing the AI is a process, not a single prompt — and knowing what to prompt about is the skill. Three moves:

Move 1 — Hand over the spec and set the job. Give the AI your specification and tell it plainly what to build:

Here is a specification for a data-cleaning task. Write a single
Python script, with pandas, that does exactly what it describes.
I will run the script myself.

[your specification]

Move 2 — Pin down the output. A script that runs is not the same as a script that produced what you need. Tell the AI exactly what the script must hand back — this is the move students most often skip:

The script must:
 1. read the data file directly from this web address, with no
    upload step:
    https://raw.githubusercontent.com/sean-mccaman/acctg5150-090/main/week-02/JEA%20Detail.txt
 2. print a short descriptive summary -- the number of journal
    entries, the number of distinct users, and the count of
    entries per user;
 3. write two files: lab_02_cleaned.csv (the cleaned dataset)
    and lab_02_user_counts.csv (the per-user count).

The two CSV files are how you check the work. lab_02_user_counts.csv is what you compare to your Part 1 by-hand result; lab_02_cleaned.csv is what the Output Validator inspects. An analysis you cannot open is an analysis you cannot trust.

Move 3 — Build and run. Have the AI produce the script, then run it — in your Colab notebook for Flow A, in your terminal for Flow B. New to what pandas is doing? The pandas primer is a short, plain-language reference — you are not asked to write the code, only to recognize it. When the script is final, save it as lab_02_analysis.py — or keep the .ipynb notebook you ran it in. You can submit either.

You do not need to understand the Python yet. That is Part 3. For now your job is the specification and the output it produces.

Keep your first draft. Before you revise the specification in step 2.5, save a copy of your first version — the reflection questions ask what changed between your first specification and your final one.

2.4 — Run the script and check it against your by-hand result

Run the script. It prints a descriptive summary — read that first; the row count and per-user counts should look sane. Then check the output two ways.

The Output Validator. Open the Lab 2 Output Validator and upload your lab_02_cleaned.csv. It looks for the common mistakes a cleaning script makes — leftover header rows, page-break junk, an Amount column still stored as text — and gives you a short report.
The by-hand result. The real check: does the per-user count in lab_02_user_counts.csv match the count you built by hand in Part 1?

If the script cannot even read the file, that is almost always the encoding — see the encoding primer.

2.5 — If it does not match, fix the specification

A mismatch does not always mean the AI failed. Usually it means your specification left something vague or unsaid — but it can also mean your by-hand count in Part 1 had a small slip (the hidden-character trap in the ID column is the likely spot). Re-check both sides, then find the requirement that caused the gap. Common culprits:

The spec never named the encoding, so the script could not even load the file.
The spec did not say how to identify the page-break rows (the ==== and PAGE lines), so the script kept some — and the validator flags them.
The spec did not mention the two-line header, so the script promoted the wrong row.
The spec did not mention the non-printing character in the ID column, so a user's count row came out blank or broken.

Fix that requirement in your specification, ask the AI to rebuild the script, and run it again.

Diagnose — do not just re-ask. Re-running a vague request and hoping is the habit this lab teaches you to drop. Point at the exact requirement that was missing or unclear, fix that, and rebuild. Naming the requirement is the graded skill.

Deliverables for Part 2. Your finished specification and your script (lab_02_analysis.py) — both uploaded to the Lab 2 quiz.

Part 2 done when: your specification covers all five sections, you have a working lab_02_analysis.py, and its per-user count matches your Part 1 by-hand result (or you have explained the mismatch you are still chasing).

Part 3 — Understand what you built 20 pts

You directed an AI to build a script, and the script produced a number. That number now has your name on it. You do not need to be able to write the Python — but you do need to understand what it does, well enough to explain it. This part is how you get there.

3.1 — Bring your learning profile

In Week 1 you wrote about_me.md. Section 4 — "How I learn and what makes concepts stick" — describes how you actually absorb new material. Open that file and copy that section. You are about to put it to use. If your Section 4 is thin, take two minutes now to add real detail about how you learn — that detail is what makes this part work.

3.2 — Get an explainer, built the way you learn

Do not just ask the AI to "explain the code." Ask it for options, and steer. The goal is a single self-contained HTML page — an explainer — that walks through how each part of your script works, in the format that fits how you learn.

These prompts show you how to ask. The wording is yours — fill in the brackets.

Present me with 5 options for an HTML learning document that explains
how each part of this code works. I will paste the code next. I learn
best when [paste Section 4 of your about_me.md] -- lean the options
that way.

Paste your script. Read the five options. Pick the one that fits how you learn, and refine it:

Option 3 is closest. Build that one out as a single HTML file, but
[what you would change]. If anything I asked for is unclear, ask me
before you build it.

If none of them fit:

None of these fit how I learn. Give me 5 more, and make them more
[visual / example-driven / broken into smaller steps].

When you have the explainer, save it as a file named lab_02_explainer.html. Ask the AI to give it to you as a single downloadable HTML file; if it gives you the HTML as text instead, paste that text into a plain-text editor and save it with a .html extension. Open it in your browser to confirm it renders.

3.3 — Keep asking until it is clear

An explainer you do not understand is not finished. If a part does not land, say exactly what does not land, and ask again:

I still don't follow [the specific part]. Explain just that part
again, assume I have never written code, and use a real row from
the file as your example.

The bar: you could explain, in plain language, what your script does and why its number can be trusted. That is the test — not whether you could write the code, but whether you understand the tool you directed into being.

Deliverable for Part 3. The explainer file lab_02_explainer.html, uploaded to the Lab 2 quiz.

Part 3 done when: you have an lab_02_explainer.html that genuinely explains your script, and you could walk a classmate through what the script does without reading the code aloud.

The 3 reflection questions 15 pts

The Lab 2 quiz includes three short reflection questions. Answer all three — with specifics, pointing at your own work, not generalities:

Reflection 1 — what was wrong with the raw file, and which defect would most quietly break an analysis if someone missed it.
Reflection 2 — how your specification changed when you checked the script against your by-hand result: the one requirement you added or sharpened. If it matched on the first run, say what in your specification made that happen.
Reflection 3 — explain, to a non-technical colleague, what your script does and why its number can be trusted.

Save your work. Write your answers in your own document first, then paste them into the quiz. A Canvas session can time out; a file you saved cannot.

Component	Pts	Credit when…
Part 1 — your cleaned workbook	25	The file is cleaned in Excel or Power Query; the workbook shows the clean ten-column table and a correct per-user count.
Part 2 — the script and the check	25	A working `.py` script is submitted, and its per-user count was checked against the Part 1 by-hand result. If the counts differed, the mismatch was diagnosed to a specific requirement and the spec corrected; if they matched on the first run, the reflection names what in the spec made that happen.
Part 3 — the explainer	20	The explainer page is submitted and is about your own script.
The three reflection questions	15	All three answered, in your own words and pointing at your own work — they show you understand what you built.

Lab 2 — Direct an AI to Clean a Dataset

The story of this lab — start here

What you'll do

Get set up

Look at the file first

1.1 — Open Power Query and point it at the file

1.2 — Select JEA Detail.txt

1.3 — Check the import preview

1.4 — See the raw data in the editor

1.5 — Remove the report header and the extra header line

1.6 — Promote the header row

1.7 — Remove the page-break rows, fix the columns, and load

1.8 — Count the journal entries for each user

2.1 — Write your specification

2.2 — Choose your flow

2.3 — Direct the AI to build the script

2.4 — Run the script and check it against your by-hand result

2.5 — If it does not match, fix the specification

3.1 — Bring your learning profile

3.2 — Get an explainer, built the way you learn

3.3 — Keep asking until it is clear

Grading rubric

Submission

Need help?