Using AI & LLMs to reconnect to my hometown
The Great Recession was hard on a lot of industries, including journalism. I remember when the local paper, the Gloucester County Times of New Jersey, shut down in 2012. It was merged with other regional papers into the South Jersey Times, a subsidiary of a state-wide media conglomerate.
The South Jersey Times is now having problems and is shutting down more of its operations. While people in New York City are lucky to have the Times, and even non-profits like The City and Hell Gate, there is a journalistic retreat from non-urban areas.
It’s clear that local journalism is important, and I’m not the first one to say this. There’s a lot of value in people knowing what their government is doing. Local government tends to be covered the least. However, the economics and business model need more work.
Today I’m going to talk about a newsletter I’ve just launched, the Gloucester County Gazette, which is going to take advantage of large language models to make civic engagement more accessible and transparent. This blog post is also going to be a bit technical.
Accessing raw sources
Glassboro has regular town council meetings, but what happens in them? If you haven’t attended, you may not know. All of these meetings are publicly accessible, and their minutes are posted on Google Drive. Beyond that, the town’s Board of Education and Rowan University’s Board of Trustees make their meeting minutes available online too.
To obtain the borough minutes from Google Drive, I can use the Google Drive API, which is fairly easy. To get files from the other sources, I scrape the websites with a script, using the Cheerio library to parse the HTML I get back.
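As a rough sketch of the Google Drive side, the helpers below build a search query for PDFs in a folder and turn one file's metadata into a link. The names `DriveFile`, `MinutesLink`, `folderQuery`, and `toMinutesLink` are hypothetical, and the folder ID is a placeholder. The fields `id`, `name`, `createdTime`, and `webViewLink` are the ones the Drive v3 file resource exposes, and the query string is meant to be passed as the `q` parameter of `files.list`. The actual API call is left out.

```typescript
// Subset of the Drive v3 file metadata this sketch relies on.
interface DriveFile {
  id: string
  name: string
  createdTime: string // RFC 3339 timestamp, e.g. 2024-03-05T12:00:00.000Z
  webViewLink?: string
}

// Hypothetical intermediate shape: one link per minutes document.
interface MinutesLink {
  title: string
  url: string
  date: Date
}

// Build the `q` search string for listing PDFs directly inside one folder.
function folderQuery(folderId: string): string {
  return `'${folderId}' in parents and mimeType = 'application/pdf' and trashed = false`
}

// Convert one file's metadata into a link, falling back to the standard
// viewer URL when the API response did not include webViewLink.
function toMinutesLink(file: DriveFile): MinutesLink {
  return {
    title: file.name.replace(/\.pdf$/i, ''),
    url: file.webViewLink ?? `https://drive.google.com/file/d/${file.id}/view`,
    date: new Date(file.createdTime),
  }
}
```

Keeping the mapping separate from the network call means it can be checked against a saved API response without credentials.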
These links are then compiled into RssArticle objects, which can then be compiled into an RssFeed. RSS is a fairly standard way to present and subscribe to a data feed. By merging all three data sources into a common data format, I have more flexibility over later parts of the data pipeline. I can even host the RSS feed for anyone to subscribe to. The board of education scraper is below:
// Assumed imports; the original snippet omitted them
import * as cheerio from 'cheerio'
import * as fetch from 'node-fetch'

async function getBoardMinutes(): Promise<RssArticle[]> {
  // Index page that links to one sub-page of minutes per year
  const boeMinutes = 'https://www.gpsd.us/Page/31'
  const boeFetch = await fetch.default(boeMinutes)
  const boeHtml = await boeFetch.text()
  const $ = cheerio.load(boeHtml)
  const navs = $('ul.page-navigation')
  const links = $(navs).find('a')
  const articles: RssArticle[] = []
  for (const link of links) {
    // Load each annual page and collect its list of minutes
    const annualUrl = $(link).attr('href')!
    const annualFetch = await fetch.default(`https://www.gpsd.us${annualUrl}`)
    const annualHtml = await annualFetch.text()
    const annual$ = cheerio.load(annualHtml)
    const minutes = annual$('ul.ui-articles li')
    for (const article of minutes) {
      // Query with annual$, the document these elements belong to
      const ahref = annual$(article).find('h1 a')
      const title = ahref.text().trim()
      // Strip the meeting-type labels so only the date text remains
      const sanitizedTitle = title
        .replace('Reorganization Meeting Minutes -', '')
        .replace('Reorganization Meeting -', '')
        .replace('Public Budget Hearing -', '')
        .replace('Public Minutes -', '')
        .replace('Special Board Meeting -', '')
        .replace('Special Board meeting', '')
        .replace('Reorganization', '')
        .replace('Reorg', '')
        .replace('Public Budget Hearing', '')
        .replace('Public Board Minutes', '')
        .replace('Board Retreat', '')
        .replace('Approved Minutes', '')
        .trim()
      if (!sanitizedTitle) continue
      // Links are relative to the site root, so make them absolute
      const url = ahref.attr('href')!.replace('../../', 'https://www.gpsd.us/')
      const date = new Date(sanitizedTitle)
      articles.push({
        authors: ['Glassboro Board of Education'],
        content: 'Open the link to view article',
        guid: url,
        link: url,
        pubDate: date,
        title,
      })
    }
  }
  return articles
}
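The merge into a common format described before the scraper could look something like the sketch below. The `RssArticle` fields mirror the object the scraper pushes; the `RssFeed` shape and the helpers `mergeSources`, `renderFeed`, and `escapeXml` are assumptions of mine, not the actual pipeline code. Since the scraper derives dates from titles, the merge drops any article whose date failed to parse.

```typescript
interface RssArticle {
  authors: string[]
  content: string
  guid: string
  link: string
  pubDate: Date
  title: string
}

// Assumed feed shape: channel metadata plus the merged articles.
interface RssFeed {
  title: string
  link: string
  description: string
  articles: RssArticle[]
}

// Escape the five characters that are special in XML.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

// Combine any number of sources, skipping duplicate guids and
// unparseable dates, then sort newest first.
function mergeSources(...sources: RssArticle[][]): RssArticle[] {
  const seen = new Set<string>()
  const merged: RssArticle[] = []
  for (const source of sources) {
    for (const article of source) {
      if (seen.has(article.guid)) continue
      if (Number.isNaN(article.pubDate.getTime())) continue
      seen.add(article.guid)
      merged.push(article)
    }
  }
  return merged.sort((a, b) => b.pubDate.getTime() - a.pubDate.getTime())
}

// Serialize the feed as a minimal RSS 2.0 document (authors are omitted).
function renderFeed(feed: RssFeed): string {
  const items = feed.articles.map((a) =>
    [
      '    <item>',
      `      <title>${escapeXml(a.title)}</title>`,
      `      <link>${escapeXml(a.link)}</link>`,
      `      <guid>${escapeXml(a.guid)}</guid>`,
      `      <pubDate>${a.pubDate.toUTCString()}</pubDate>`,
      `      <description>${escapeXml(a.content)}</description>`,
      '    </item>',
    ].join('\n')
  )
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<rss version="2.0">',
    '  <channel>',
    `    <title>${escapeXml(feed.title)}</title>`,
    `    <link>${escapeXml(feed.link)}</link>`,
    `    <description>${escapeXml(feed.description)}</description>`,
    ...items,
    '  </channel>',
    '</rss>',
  ].join('\n')
}
```

A call such as `mergeSources(boroughMinutes, boardMinutes, trusteeMinutes)` would then feed straight into `renderFeed`; those three variable names are placeholders for the per-source results.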
All three sources connect to PDFs, which adds a bit of a challenge in processing.
Document AI
By default, I cannot make any sense of what is in these PDFs. Common PDF parsers, like the ones I tried on Hugging Face, didn’t help. Borough minutes are not a common type of PDF, and the only custom models which exist are for things like W-2s.
After talking with some colleagues, I settled on simply using Google Cloud’s optical character recognition (OCR) system. Until I can get the text pulled out of the documents, I can’t do anything more.
With just a bit of setup, the OCR process actually works very effectively. However, the system can only manage fifteen pages at a time. I’ve just hard-coded the tool to do the first fifteen. Additionally, as I upload all my files to Google Cloud Storage I can just use the cloud link to avoid the problems of moving files back and forth.
import * as fs from 'fs'
import { DocumentProcessorServiceClient } from '@google-cloud/documentai'
import { google } from '@google-cloud/documentai/build/protos/protos'

const ocrProcessorId = 'projects/<...>/locations/us/processors/<...>'
const key = require('../src/gloucester-gazette-efbc71e4fda6.json')
const client = new DocumentProcessorServiceClient({ credentials: key })

export async function convertPdfToText(gcsFile: string) {
  // Point the processor at the PDF already sitting in Cloud Storage
  const request: google.cloud.documentai.v1.IProcessRequest = {
    name: ocrProcessorId,
    gcsDocument: {
      gcsUri: `gs://${gcsFile}`,
      mimeType: 'application/pdf',
    },
    processOptions: {
      fromStart: 15, // only the first 15 pages (API maximum)
    },
  }
  const [result] = await client.processDocument(request)
  const text = result.document?.text ?? ''
  // Cut off the long boilerplate epilogue some minutes end with
  const limits = text.indexOf('THE TAPE IS AVAILABLE')
  if (limits > -1) {
    return text.substring(0, limits)
  }
  return text
}
I get a lot of text out of this function, since it’s fifteen pages of PDFs. I had to add that check at the end since some of the documents have long epilogues that created a lot of clutter.
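Two small pieces of this step can be pulled out as plain functions, sketched here under a few assumptions. `pageBatches` splits a longer document into groups of fifteen page numbers, which would be one way to get past the hard-coded first fifteen pages. It assumes the total page count is known from somewhere else, and that the processor accepts an explicit page list in place of `fromStart` (worth checking against the Document AI reference). `trimEpilogue` is the same epilogue check as above, written so that a marker at position 0 is also handled. Both function names are hypothetical.

```typescript
// Split pages 1..totalPages into consecutive groups of at most batchSize.
function pageBatches(totalPages: number, batchSize = 15): number[][] {
  const batches: number[][] = []
  for (let start = 1; start <= totalPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, totalPages)
    const pages: number[] = []
    for (let page = start; page <= end; page++) {
      pages.push(page)
    }
    batches.push(pages)
  }
  return batches
}

// Return the text before the marker, or the whole text if it is absent.
function trimEpilogue(text: string, marker = 'THE TAPE IS AVAILABLE'): string {
  const cut = text.indexOf(marker)
  return cut === -1 ? text : text.substring(0, cut)
}
```

Each batch would need its own OCR request, with the resulting text concatenated in page order before trimming.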
Large Language Models
The final ingredient to this process is using a large language model like ChatGPT. In my case, since I’m already using Google Cloud, I went with Gemini. I’m just using the plain Gemini Pro, version 1.
What I want from the model is to summarize all of the text I obtained. I want it to provide key points and give me a title for the whole thing. In general, I want it to write a report on what happened since I didn’t attend.
The neat thing about language models is their flexibility. I can just throw all of my text into it, along with a certain task, and it’ll figure it out. Since this is in the backend, I just went with simple synchronous operations. Then, I wrote out the report into a new file.
const ocrText = await convertPdfToText(`glassboro-minutes/${article.title}`)
const llmHeader = await prompt(`Provide a title for the following borough minutes document as though you were a local journalist writing a column in paragraph form\n\n${ocrText}`)
const llmReport = await prompt(`Summarize the following borough minutes document as though you were a local journalist writing a column in paragraph form\n\n${ocrText}`)
const llmDetails = await prompt(`Describe key points and potential controversial topics minutes document as though you were a local journalist writing a column in bullet points\n\n${ocrText}`)
const header = llmHeader.candidates[0].content.parts[0].text
const report = llmReport.candidates[0].content.parts[0].text
const details = llmDetails.candidates[0].content.parts[0].text
fs.writeFileSync(`${article.title}-report.md`, `# ${header}\n\n${report}\n\n${details}`)
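The direct `candidates[0].content.parts[0].text` lookups above will throw if the model returns no candidates, for example when a response is blocked or empty. A more defensive version might look like this sketch. The `LlmResponse` interface only mirrors the access path used above and is not the SDK's real type, and `extractText` and `composeReport` are hypothetical helper names.

```typescript
// Minimal shape matching the path candidates[0].content.parts[0].text.
interface LlmResponse {
  candidates?: { content?: { parts?: { text?: string }[] } }[]
}

// Join the text parts of the first candidate, or return null if there are none.
function extractText(response: LlmResponse): string | null {
  const parts = response.candidates?.[0]?.content?.parts ?? []
  const text = parts.map((part) => part.text ?? '').join('').trim()
  return text.length > 0 ? text : null
}

// Build the same markdown layout as the report file, skipping empty sections.
function composeReport(header: LlmResponse, report: LlmResponse, details: LlmResponse): string {
  const title = extractText(header) ?? 'Untitled minutes'
  const sections = [extractText(report), extractText(details)].filter(
    (section): section is string => section !== null
  )
  return [`# ${title}`, ...sections].join('\n\n')
}
```

The result of `composeReport(llmHeader, llmReport, llmDetails)` could be passed to `fs.writeFileSync` in place of the template string.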
Editorial Control
When you do all of these together, you get a lot of text. Language models are good at generating text. However, the three responses tend to overlap a lot, so I can’t simply throw that output into a new post and call it a day. Still, it gets you 90–95% of the way there. This is great for journalists who don’t have time to cover every single meeting from every single town in Gloucester County.
With some proof-reading and adding a handy cover image, you can see the results for yourself.
I did go through the source document to make sure these things were true, and anyone else is absolutely welcome to do it themselves. I’d say this report is a B-, not great but enough to get the point across. A professional, human journalist would certainly do a better job. If that happens, these AI-generated reports won’t matter. Until that happens, this is a good tool for many small municipalities which are underserved.
AI in journalism can be a problem. Large language models are designed to generate language. They have no sense of truth. They can just go on a tangent without any care for accuracy or sense. Previous experiments have been embarrassing and I wouldn’t want to repeat that. But taking five minutes to proofread something is a lot better than spending several hours from scratch. For small towns, I can understand that journalists cannot afford to do the latter, but that doesn’t help the residents.
What’s Next?
While I called the publication the Gloucester County Gazette, I don’t have a strong personal affiliation to the rest of the county. If other people in the area want me to cover some other government source, I can definitely add it to my data pipeline.
The code for this isn’t quite open source (it’s a bit of a mess under the hood), but I’d be willing to help out any budding journalists who want to try out new tools. And if you want to read the newsletter, you can visit the website and become a subscriber.
This is just a new project for me, so I’m definitely going to keep tweaking it. But already I have learned quite a bit about what’s happening in my hometown which is something that I didn’t have available to me before. That’s pretty cool.