The opinions stated here are my own, not those of my company.
It’s been three months since I published an article explaining how I’ve been pulling events from New York websites and converting them into iCalendar events. Since that original post, I’ve continued to work on this project, finding more venues that publish their events online.
At this point my system tracks 19 different venues and theater companies. Some, like Madison Square Garden, host events for a handful of theaters, while others, like New York’s Parks Service, are even more scattered across the city.
In the interim, I’ve evolved my system to make it more extensible and work in more contexts.
Breaking up my calendars
The first thing I learned was that Google Calendar seems to have an upper limit of 1MB for calendar data. This does pose a bit of a problem as 19 different feeds merged together can easily exceed that limit. Although my calendar worked at first, I noticed it was no longer getting updates as events shifted.
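A quick way to see whether a merged feed is near that ceiling is to measure its encoded size before serving it. This is only a sketch: the exact limit is my observation rather than a documented number, so the 1024 * 1024 figure and the function names here are assumptions.

```javascript
// Assumption: "1MB" is treated as 1024 * 1024 bytes. The real cut-off
// used by the calendar client is not documented in this post.
const ASSUMED_LIMIT_BYTES = 1024 * 1024;

// Size of the feed as it would travel over the wire in UTF-8.
function feedSizeInBytes(icsText) {
  return new TextEncoder().encode(icsText).length;
}

// True when the feed is larger than the assumed limit.
function isOverAssumedLimit(icsText, limitBytes = ASSUMED_LIMIT_BYTES) {
  return feedSizeInBytes(icsText) > limitBytes;
}
```

Counting bytes rather than characters matters because venue names and descriptions often contain non-ASCII text.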
My first fix was to reduce the number of events in the feed by dropping anything that took place more than one month ago.
While this slightly reduced the number of events in my feed, it didn’t shrink the feed enough. Each webpage I fetched usually only shows upcoming events, so there were few old events to drop in the first place. I needed to find a different way to minimize feed size.
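A cut-off like this can be sketched as a simple filter. The event shape (an object with a `start` Date) and the 30-day approximation of "one month" are assumptions made for illustration, not necessarily how my feed code is structured.

```javascript
// Assumption: "one month" is approximated as 30 days.
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Hypothetical event shape: { title: string, start: Date }.
// Keeps events that start on or after the cut-off, including future ones.
function dropOldEvents(events, now = new Date()) {
  const cutoff = new Date(now.getTime() - THIRTY_DAYS_MS);
  return events.filter((event) => event.start >= cutoff);
}
```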
I did this by adding a query parameter to my endpoint. With the c param, I could specify the calendar name I wanted. Then in my calendar app I could import each event feed separately.
While this approach did force my subscriptions list to grow, it’s successfully allowed me to keep every calendar at a healthy size.
Technically this allows my system to have more flexibility. I’ve made the c param an array, so an individual could select just the calendars they care about to merge them into one.
https://us-central1-redside-shiner.cloudfunctions.net/ical_fetch?c=downtownbrooklyn&debug=true shows an example of the URL scheme, with the debug parameter changing the response output to plaintext so it can be viewed in the browser.
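Reading those two parameters might look roughly like the sketch below. It assumes the array form is expressed by repeating c in the query string, and the function name and returned shape are hypothetical. `text/calendar` is the standard iCalendar media type, while `text/plain` is what lets a browser display the output instead of downloading it.

```javascript
// Hypothetical helper: turns a request URL into the list of requested
// calendars plus the Content-Type the response should carry.
function parseFeedRequest(requestUrl) {
  const params = new URL(requestUrl).searchParams;

  // Assumption: several calendars are requested as ?c=one&c=two.
  const calendars = params.getAll("c").filter((name) => name.length > 0);

  // debug=true switches to plaintext so the feed renders in a browser tab.
  const debug = params.get("debug") === "true";

  return {
    calendars,
    contentType: debug
      ? "text/plain; charset=utf-8"
      : "text/calendar; charset=utf-8",
  };
}
```

Calling it with a URL that repeats c, such as `?c=downtownbrooklyn&c=citywinery` (the second name is made up), would yield both names in `calendars`, ready to be merged into one feed.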
Rendering pages happens in many ways
In inspecting calendar pages, often I would just run a fetch on that same URL in a Node context and be able to parse the HTML. It’s a brute-force approach that works.
For some venues, particularly concert venues, I would find that the page does not load events by default. Instead, the page calls some other script that renders the HTML after the initial fetch. Some of these are actually helpful, exposing the events directly as JSON, which makes parsing much easier.
City Winery, however, embeds these events within the page itself as window.articleContent. I need to parse this from the page body and then convert it into a JSON object that can be read.
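One way to pull an embedded value like this out of the raw HTML is to locate the variable name, then walk forward from the first opening brace until the braces balance. The sketch below assumes the assigned value is an object literal that is also valid JSON. The function name is hypothetical, and the sample data in the usage note is invented rather than the venue's real format.

```javascript
// Hypothetical helper: finds `variableName = { ... }` in a page body and
// parses the object. Assumes the literal is valid JSON (double-quoted keys).
function extractAssignedJson(html, variableName) {
  const marker = html.indexOf(variableName);
  if (marker === -1) return null;
  const start = html.indexOf("{", marker);
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < html.length; i++) {
    const ch = html[i];
    if (inString) {
      // Braces inside string values must not affect the depth count.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return JSON.parse(html.slice(start, i + 1));
    }
  }
  return null; // Braces never balanced, so the page was likely truncated.
}
```

For a page containing `window.articleContent = {"events": [...]};`, calling `extractAssignedJson(html, "window.articleContent")` returns the parsed object, or `null` when the variable is absent.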
A number of pages seem to fail or load different content when just calling fetch. My guess is that it’s some sort of rudimentary anti-spam defense. As such, my fetch call needs to add headers with fields read from the browser HTTP request. One of them seems to allow my call to identify as a regular user to load the events.
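Here is a rough sketch of what sending browser-like headers can look like. The header values are examples of the kind a desktop browser sends, not the exact ones my code uses, and I have not pinned down which single header makes the difference. The function name and the injectable `fetchImpl` parameter are assumptions added to keep the example testable.

```javascript
// Example values only: copy the real ones from your own browser's
// network inspector for the page you are fetching.
const BROWSER_LIKE_HEADERS = {
  "User-Agent":
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.9",
};

// Fetches a page body while presenting the headers above.
// fetchImpl defaults to the global fetch available in Node 18+.
async function fetchLikeBrowser(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, { headers: BROWSER_LIKE_HEADERS });
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  return response.text();
}
```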
Web standards should be employed here
When the web expanded and blogging became a national pastime, browsers made it easy to discover the RSS feed on your webpage.
Website administrators, in addition to creating and maintaining their calendars in some machine-readable format, should come together on a standardized tag that points to their calendar.
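To make the idea concrete: RSS discovery works because pages include a `<link rel="alternate" type="application/rss+xml">` tag. A calendar equivalent could plausibly reuse the same pattern with the `text/calendar` media type. No such convention is standardized today as far as I know, so the tag shape and the function below are purely illustrative, and the regex-based parsing is a shortcut that only handles double-quoted attributes.

```javascript
// Hypothetical discovery: looks for tags such as
//   <link rel="alternate" type="text/calendar" href="/events.ics">
// and returns their absolute URLs.
function findCalendarLinks(html, pageUrl) {
  const links = [];
  for (const [tag] of html.matchAll(/<link\b[^>]*>/gi)) {
    // Collect double-quoted attributes from this one tag.
    const attrs = {};
    for (const [, name, value] of tag.matchAll(/([\w-]+)\s*=\s*"([^"]*)"/g)) {
      attrs[name.toLowerCase()] = value;
    }
    if (attrs.rel === "alternate" && attrs.type === "text/calendar" && attrs.href) {
      // Resolve relative hrefs against the page they were found on.
      links.push(new URL(attrs.href, pageUrl).href);
    }
  }
  return links;
}
```

With something like this in place, a calendar app could offer a subscribe button on any venue page, the same way feed readers once did for blogs.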
Until that happens, I will continue to look at ways to fetch and parse webpages to pull out the information that I’m actually interested in.
It’d be great if this could improve. Having to open up 19 different webpages every week to find out what’s happening is burdensome. I don’t want to download an app just to check once a week. That introduces too much friction and development cost.
But my calendar is an app that I use every day and it’s already installed on my phone. If your goal is to get your event in front of potential patrons, you may want to start looking at iCal.