Building an AI Powered Virtual Newsroom
Can Prompt Engineering provide a simple, cost-effective solution?
Perfecting Equilibrium Volume Four, Issue 2
Those evil-natured robots
They're programmed to destroy us
She's gotta be strong to fight them
So she's takin' lots of vitamins
'Cause she knows that
It'd be tragic
If those evil robots win
I know she can beat them
Oh, Yoshimi, they don't believe me
But you won't let those robots defeat me
Yoshimi, they don't believe me
But you won't let those robots eat me, Yoshimi
The Sunday Reader, April 13, 2025
What is the finest automobile in the world? Do you think it’s a Bugatti? A Rolls-Royce? A Maybach?
Or maybe your ultimate driving machine has two wheels. A Ducati Monster?
And now that we’ve established your choice for transcendent transportation...what do you actually drive?
What!?!?! Not a single Perfecting Equilibrium reader has dropped $2 million on a Bugatti, or half a million on a Rolls? Why not?
Because we don’t want to live in our cars, is why. When we have that kind of money we spend it on a house and buy more reasonable wheels.
Which leads, of course, to the question of what is reasonable.
That’s been the real problem for the Virtual Newsroom Project. From the beginning there have been lots of technical solutions we could have built that would have provided excellent editorial support, such as writing background information paragraphs for news stories by summarizing previous articles.
The problem is that the best of these solutions costs Bugatti money. Training a model from scratch on clean known data would be best. But it requires millions of dollars in Graphics Processing Units – video cards – enough electricity to run a good-sized city, and truly enormous amounts of data. For example, Google pays Reddit millions annually to use every post in that site’s history to train.
All of Reddit is a single data set.
Also, using Reddit as your training dataset works out about as well as you’d expect.
Similarly, Meta – parent of Facebook and Instagram – trains its LLMs on every post ever made on those sites. And X/Twitter was just sold to xAI so that site can be used to train Grok. Again, every tweet in the two-decade history of that site adds up to a single training data set.
Worse, even if you have access to a data set that large – the Associated Press archives must be in that range, no? – you can’t just crawl the data on the website. LLMs learn from every character fed to them, so you have to clean the data first, stripping out everything you don’t want as part of their training: the formatting, HTML and scripts, for example. Then you have to chunk the data, chopping it up into 500- to 1,000-character “chunks” for efficient LLM processing.
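To make the clean-then-chunk step concrete, here is a minimal sketch in Python. The regex-based tag stripping is purely illustrative – a real pipeline would use a proper HTML parser – and the chunk sizes simply follow the 500-to-1,000-character range mentioned above:

```python
import re

def clean_html(raw: str) -> str:
    """Strip scripts, styles, and tags from a crawled page, keeping only text."""
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", raw)  # drop scripts/styles
    text = re.sub(r"(?s)<[^>]+>", " ", text)                      # drop remaining tags
    return re.sub(r"\s+", " ", text).strip()                      # collapse whitespace

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split cleaned text into ~size-character chunks, overlapping slightly
    so sentences cut at a boundary still appear whole in one chunk."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Every page in the archive would be run through `clean_html` and then `chunk` before any of it touches a model.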
There’s no way a startup newsroom can afford all that. The only organizations that have that kind of money are large corporations.
There are less expensive methods. For example, you can tune an already trained LLM with a smaller subject-specific data set. Or you can add a RAG – a Retrieval-Augmented Generation subsidiary database – with a subject-specific data set.
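The RAG idea can be sketched in a few lines: before prompting the model, you search your subject-specific store for the most relevant chunks and prepend them. The toy version below ranks chunks by simple word overlap; a production RAG would use embeddings and a vector database instead:

```python
def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank stored chunks by word overlap with the query; return the top k.
    A stand-in for the embedding search a real RAG system would use."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:k]
```

The retrieved chunks are then pasted into the prompt ahead of the question, grounding the model’s answer in your own data.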
But each step down the cost curve is less and less accurate. And even these cheaper options remain way too expensive for our purposes.
The point of Virtual Newsroom is to provide low and no-cost tools that can fertilize the flowering of thousands of new newsrooms. We want to make it viable for individual journalists and small teams to successfully cover, say, a small town or school district with nothing more than a cheap laptop and an internet connection. The key should be to reduce the launch costs of journalism startups as close to zero as possible.
For all their flaws, Large Language Models are a good fit for this task. That’s because the vast majority of news is largely known in advance. We don’t know who will win the game or the election, or how the council will vote on an issue. But we do know in advance when the game is, who is playing, what their records are, how their seasons have gone so far, and who is hurt and will miss the action. Elections are much the same; indeed, most of us are sick of hearing about the contestants by the time the actual voting rolls around. And councils and other government bodies all publish agendas in advance.
So there isn’t one story about the game; there’s an advance story about the teams, their records, the injury list and some analysis of what they’ll have to do to win. Then there’s another story about what actually happens in the game. The same is true about council and government meetings.
Each new story is built upon the last, because each story must be complete. There is no guarantee that your reader has read or remembers any of the previous stories, so each update must contain the essence of all the previous entries.
This task is a perfect fit for a Large Language Model that is tuned with local news. Pulling together the earlier articles and then writing the background paragraphs would free reporters to focus on what’s new and therefore be more productive.
21st Century story generation is another task that fits right into the AI wheelhouse. As we all know, in many ways the proverbial village green has moved online. As we also all know, 98 percent of what’s online is garbage. But trawling through all that social media and those local message boards can yield complaints about dangerous roads, land-use problems, polluters, corporate abuse and a host of other problems. None of this should be compiled directly into stories, by AIs or humans. These are leads, like any other complaints or charges. A reporter needs to follow up and do some reporting, find what’s true, then report the facts.
It turns out that the best fit for this is the least amount of tech possible. The prompt you feed to a Large Language Model is in effect a natural language query. That means you can tune and control the AI and its output by what is called prompt engineering.
Let’s create a hypothetical to examine this: Our Hometown City has just released its proposed budget. While our intrepid reporter goes off to interview the Mayor and the leading opposition city councilman about the proposal, we’re going to use an AI to pull together and summarize the background for the story.
You could just prompt Grok or any other “AI” to write a summary of the budget. But you’ll get much, much better results if you treat the AI like you would a student or a cub reporter, and provide lots and lots of structure within your prompt.
So, yes, you could just prompt an AI: “Write a 3-paragraph summary of the new budget.”
But you’ll get much better results with highly structured prompts that provide specific directions:
“Write a summary comparing the budget proposal found here: HTTPS://WWW.MyCity.Gov/Budget2025.html
with the existing budget found here:
HTTPS://WWW.MyCity.Gov/Budget2024.html
Provide three bullet points with the largest changes in dollars from 2024 to 2025, and three bullet points with the largest percentage changes from 2024 to 2025.”
Note several interesting constraints in this prompt, especially compared to “Write a 3-paragraph summary of the new budget.” This advanced prompt tells the AI exactly how to do the comparisons, and how to format the results.
Most importantly, we’re showing the AI where to find the data to use in the comparison. In other words, we’re getting most of the clean data benefits of building a RAG without any of the work!
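That structured prompt is really just a template with slots for the two budget URLs, which means a newsroom tool can assemble it automatically for any city and any year. A minimal sketch, using the hypothetical MyCity URLs from the example above:

```python
def build_budget_prompt(new_url: str, old_url: str, bullets: int = 3) -> str:
    """Assemble the structured budget-comparison prompt from two budget URLs.
    The wording mirrors the example prompt in the article."""
    return (
        f"Write a summary comparing the budget proposal found here: {new_url}\n"
        f"with the existing budget found here: {old_url}\n"
        f"Provide {bullets} bullet points with the largest changes in dollars "
        f"from 2024 to 2025, and {bullets} bullet points with the largest "
        f"percentage changes from 2024 to 2025."
    )

# The hypothetical city-budget URLs from the example:
prompt = build_budget_prompt(
    "HTTPS://WWW.MyCity.Gov/Budget2025.html",
    "HTTPS://WWW.MyCity.Gov/Budget2024.html",
)
```

The resulting string is what you would paste, or send through an API, to Grok or any other LLM; the structure travels with it, so every budget story starts from the same disciplined prompt.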
Next week we’re going to do a little journalism and walk through this in depth at the same time. We’re going to feed some budgets into AIs using these prompt techniques and see how this nascent Virtual Newsroom performs.