How to implement RFC3339 in any language

As part of my work on JSON Type Definition, I’ve had the pleasure of seeing exactly how you implement RFC3339 (the IETF-recommended standard format for date/time on the Internet) across all major modern programming languages.

In some languages, there is no builtin way to parse or produce RFC3339 timestamps. This is a challenge for jtd-codegen, a tool which produces code from JSON Type Definition schemas. For instance, Python doesn’t ship with a parser for RFC3339: the stdlib’s datetime.fromisoformat and the third-party dateutil.parser.isoparse don’t actually parse RFC3339; they parse custom subsets of ISO 8601.

For RFC3339-less languages like Python, jtd-codegen emits its own RFC3339 implementation directly into the generated code. Furthermore, the JavaScript (another RFC3339-less language) and Python implementations of JSON Type Definition validation ship their own RFC3339 implementation, so they don’t need any external dependencies.

So as a result, I know a bit more about parsing and serializing RFC3339 than I care to admit. Here’s my playbook for implementing RFC3339 in any language.

Step 1: Identify what will store the RFC3339 timestamp

What most people (including myself) mean by “an RFC 3339 timestamp” is what the RFC calls the “Internet date/time format”. Here is an example:

1985-04-12T23:20:50.52Z

Conceptually, an RFC 3339 timestamp gives you three pieces of information:

  • A date, using the Gregorian calendar. If you didn’t even know there were other calendars, then you probably use the Gregorian calendar every day.

    You will sometimes hear nerds talk of the “proleptic Gregorian calendar”. That just refers to the idea of “extending” the Gregorian calendar to dates before the calendar was decreed / adopted.

    For RFC3339, the Gregorian calendar consists of:

    • A year, which is between 0 and 9999.
    • A month, which is between 1 and 12.
    • A day, which is between 1 and 28, 29, 30, or 31 depending on the year and the month. If you’re from the West, you’ve probably already internalized this complexity; we’ll cover it later here anyway.
  • A time, which consists of:

    • An hour, which is between 0 and 23.
    • A minute, which is between 0 and 59.
    • A second, which is between 0 and 60. Yes, 60: not 59. Because of leap seconds, you sometimes have minutes that last 61 seconds. We’ll discuss this later.
    • A fraction of a second. This is very frequently left out, in which case the fractional part is typically interpreted to be zero.
  • An offset, which is a time delay/advance relative to UTC. Specifically, the offset is precise to the minute. This is valid (in fact, it’s given as an example in RFC3339):

    1937-01-01T12:00:27.87+00:20
    

    That +00:20 at the end says that the timestamp has an offset of 20 minutes ahead of UTC. Keep in mind that India is on UTC+05:30 and Nepal is on UTC+05:45, so ignore the minute part at your own peril.

Note that RFC3339 does not talk about timezones. An offset is not a timezone. A timezone is a geographic part of the world that agrees what time it is. At any given instant, a timezone has an associated UTC offset, but that offset can change for whatever reason, such as for summer time (Americans call this “daylight savings”).

California and Oregon are on the same time zone, because people in California and Oregon always agree what time it is. People in Arizona and California agree on what time it is during the summer, but not during the winter (Arizona doesn’t do daylight savings), so they’re not on the same time zone.

If California were to one day decide that they want to unilaterally stay on UTC-8 all year, then California would form a new timezone.

In other words: timezones are a complicated and political topic. They change at the whim of governments, wars, and revolutions. RFC3339 wisely avoids them.

So when you’re parsing an RFC3339 timestamp, you need to find something in your language which can store these pieces of information. Often, you’ll need to compromise on some points:

  • Most languages don’t support representing leap seconds. My recommendation for such languages is to support parsing a 61st second (a seconds value of 60), but to convert that to 59 before storing it in your language’s data structure. Either way, document clearly what you do with leap seconds. We’ll discuss this more later.

  • Most languages don’t support arbitrary UTC offsets, even though there’s no fundamental law of nature that makes arbitrary offsets inconceivable. For instance, Python’s datetime.timezone (it’s misnamed, it’s actually just an offset) is limited to 24 hours before/after UTC. Java’s ZoneOffset limits you to 18 hours before/after UTC. C#’s DateTimeOffset is limited to 14 hours before/after UTC.

    American Samoa (among others) is on UTC-11 and parts of Kiribati are on UTC+14. So C#’s restriction is as tight as you probably ever should go.

  • Some languages, in particular C#, have some abstraction like “Cultures” or some such. RFC3339 is a format meant for machines; it is not meant to be accommodating to any actual human culture (“people who like IETF standards” are not a culture). You will need to disable or choose some sort of “Neutral” option. For instance, in C# you will want to use CultureInfo.InvariantCulture.

    Do not mistakenly write software that uses the local computer’s locale when parsing or storing RFC3339 information. If you’re a Westerner, then it’s quite possible that your computer’s locale magically works well with RFC3339. You may find that your software breaks for people whose locale is different from yours. Not everyone uses the Gregorian calendar. This is mostly relevant to software that runs on users’ computers, less relevant for software running on your own servers.

JavaScript, at least until the Temporal API lands, doesn’t have any appropriate data structure directly in the language; all it has is Date, which can store a date and time but not an offset. But what you could do is use a JavaScript Date in combination with a number to store the UTC offset.
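
To make the compromises above concrete, here is a small sketch of what storing the RFC’s own example looks like in Python. It assumes you settle for the stdlib’s datetime plus a fixed-offset datetime.timezone; the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# The RFC's own example: 1937-01-01T12:00:27.87+00:20
offset = timezone(timedelta(minutes=20))  # a fixed UTC offset, not a real timezone
stamp = datetime(1937, 1, 1, 12, 0, 27, 870_000, tzinfo=offset)
print(stamp.utcoffset())  # the stored offset, as a timedelta

# Compromise 1: offsets of a full day or more are rejected.
try:
    timezone(timedelta(hours=24))
except ValueError as err:
    print("offset rejected:", err)

# Compromise 2: there is no way to store a leap second (second == 60).
try:
    datetime(1990, 12, 31, 23, 59, 60, tzinfo=timezone.utc)
except ValueError as err:
    print("leap second rejected:", err)
```

Both restrictions surface as a ValueError, so whatever parser you write on top of this needs to decide what to do before handing values to datetime.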

Step 2: Parsing timestamps

You can use a regular expression to parse an RFC3339 timestamp. The RFC has a pretty good ABNF that we can work from:

date-fullyear   = 4DIGIT
date-month      = 2DIGIT  ; 01-12
date-mday       = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
                          ; month/year
time-hour       = 2DIGIT  ; 00-23
time-minute     = 2DIGIT  ; 00-59
time-second     = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
                          ; rules
time-secfrac    = "." 1*DIGIT
time-numoffset  = ("+" / "-") time-hour ":" time-minute
time-offset     = "Z" / time-numoffset

partial-time    = time-hour ":" time-minute ":" time-second
                  [time-secfrac]
full-date       = date-fullyear "-" date-month "-" date-mday
full-time       = partial-time time-offset

date-time       = full-date "T" full-time

Note that where the ABNF above says “T” and “Z”, it also tolerates “t” and “z”.

What we can do with a regular expression is extract the following pieces of data:

  • Year
  • Month
  • Day
  • Hour
  • Minute
  • Second
  • Fractional Seconds (Optional)
  • Hour Offset and Minute Offset (Optional)

The regex with capturing groups for everything except the fractional seconds and offset is:

(\d{4})-(\d{2})-(\d{2})[tT](\d{2}):(\d{2}):(\d{2})

The regex for an offset is:

([zZ]|(\+|-)(\d{2}):(\d{2}))

Notice that this regex will return either 1 or 4 groups: just the whole offset for “Z”, or the whole offset plus its sign, hour, and minute. The regex for the fractional seconds part, which is allowed to have as many digits as you want, is:

(\.\d+)?

So now we need to combine all of these regexes together, producing this pretty big regex that will return either 8 or 11 matching groups:

(\d{4})-(\d{2})-(\d{2})[tT](\d{2}):(\d{2}):(\d{2})(\.\d+)?([zZ]|(\+|-)(\d{2}):(\d{2}))

The first 6 groups will always be year, month, day, hour, minute, second. The 7th group will be the fractional seconds, including the leading dot (it will be empty or unset, depending on your regex engine, if there are no fractional seconds specified). If only 8 groups matched, then the offset is just “Z” (or “z”), aka an offset of zero; many regex engines still report groups 9 through 11 in that case, but as empty or unset. Otherwise all 11 groups matched: the 8th is pretty useless, the 9th is the sign (+ or -) of the offset, the 10th is the hour part of the offset, and the 11th is the minute part.
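
Here is roughly how that regex could be used from Python. This is a sketch, not a reference implementation: the names RFC3339_RE and split_rfc3339 are made up for this example, and it only does the syntactic split, leaving range checks for later.

```python
import re

# The combined pattern from this section. re.ASCII keeps \d limited to 0-9;
# without it, Python's \d also matches other Unicode decimal digits.
RFC3339_RE = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})[tT](\d{2}):(\d{2}):(\d{2})"
    r"(\.\d+)?([zZ]|(\+|-)(\d{2}):(\d{2}))",
    re.ASCII,
)


def split_rfc3339(text):
    """Return the raw string pieces of a timestamp, or None on a syntax error."""
    match = RFC3339_RE.fullmatch(text)  # fullmatch rejects leading/trailing junk
    if match is None:
        return None
    (year, month, day, hour, minute, second,
     frac, _whole_offset, sign, off_hour, off_minute) = match.groups()
    return {
        "year": year, "month": month, "day": day,
        "hour": hour, "minute": minute, "second": second,
        # Python returns None for groups that did not take part in the match.
        "fraction": frac or "",
        # A bare "Z" leaves the sign/hour/minute groups unset: treat as +00:00.
        "offset_sign": sign or "+",
        "offset_hour": off_hour or "00",
        "offset_minute": off_minute or "00",
    }


print(split_rfc3339("1937-01-01T12:00:27.87+00:20"))
```

Note the use of fullmatch: the regex as written has no anchors, so a plain search would happily find a timestamp in the middle of a longer string.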

Step 3: Validating timestamps

Now that you’ve parsed the timestamp into its constituent parts, you’ve done the most basic sort of syntactic validation. But RFC3339 requires further validation:

  • You don’t need to validate the year. 0 to 9999 are all valid.

  • You need to validate that the month is in range.

    Implement this by checking that the month’s numerical value is between 1 and 12, inclusive.

  • You need to validate that the day is in range (for that month and year).

    Implement this by first parsing the numerical value of the year, month, and day. The day should be between:

    • 1 and 28 if the month is 2 (February) and the year is not a leap year.
    • 1 and 29 if the month is 2 and the year is a leap year.
    • 1 and 30 if the month is one of:
      • 4
      • 6
      • 9
      • 11
    • 1 and 31 if the month is one of:
      • 1
      • 3
      • 5
      • 7
      • 8
      • 10
      • 12

    A year is a leap year if this expression evaluates to true:

    (year % 4 == 0 && (year % 100 != 0 || year % 400 == 0))
    

    In other words: leap years are all multiple-of-4 years, except for multiple-of-100 years that aren’t also multiple-of-400 years.

  • You need to validate that the hour is in range.

    Implement this by checking that the hour’s numerical value is between 0 and 23, inclusive.

  • You need to validate that the minute is in range.

    Implement this by checking that the minute’s numerical value is between 0 and 59, inclusive.

  • You need to validate that the second is in range.

    Implement this by checking that the second’s numerical value is between 0 and 60, inclusive.

    If your language doesn’t actually support representing a minute with a leap-second number of seconds in it (i.e. a minute lasting 61 seconds), then you should probably accept 60 as input but convert it to 59. Most of all, you should document that you do this.

  • If there’s an offset hour and minute, do the same validations you did for the time hour and minute on those.

Then take all that validated information and put it into your data structure. You’ve just parsed an RFC3339 timestamp.
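
The validation rules above might look something like this in Python. It is a sketch under a few assumptions: the fields have already been converted to integers (except the fraction, which is the raw “.123” text), the target is the stdlib’s datetime, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year, month):
    if month == 2:
        return 29 if is_leap_year(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


def build_timestamp(year, month, day, hour, minute, second,
                    fraction="", sign="+", off_hour=0, off_minute=0):
    """Validate parsed fields and return an offset-aware datetime.

    Raises ValueError for any out-of-range field. Python's datetime also
    rejects year 0, which shows up here as a ValueError from the constructor.
    """
    if not 1 <= month <= 12:
        raise ValueError("month out of range")
    if not 1 <= day <= days_in_month(year, month):
        raise ValueError("day out of range")
    if not (0 <= hour <= 23 and 0 <= off_hour <= 23):
        raise ValueError("hour out of range")
    if not (0 <= minute <= 59 and 0 <= off_minute <= 59):
        raise ValueError("minute out of range")
    if not 0 <= second <= 60:
        raise ValueError("second out of range")

    # datetime cannot represent second 60, so a leap second is stored as 59.
    second = min(second, 59)
    # datetime keeps microseconds; any further fractional digits are truncated.
    micro = int(fraction[1:7].ljust(6, "0")) if fraction else 0

    offset = timedelta(hours=off_hour, minutes=off_minute)
    if sign == "-":
        offset = -offset
    return datetime(year, month, day, hour, minute, second, micro,
                    tzinfo=timezone(offset))


print(build_timestamp(1985, 4, 12, 23, 20, 50, ".52"))
```

The two lossy steps (clamping second 60 to 59, and truncating the fraction to microseconds) are exactly the kind of thing you should call out in your documentation.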

Step 4: Producing timestamps

This is comparatively easy. Assuming you have a data structure that can destructure into the data we’ve been talking about so far, then you just concatenate the following strings:

  • The year, zero-padded to be exactly four chars long
  • The char -
  • The month, zero-padded to be exactly two chars long
  • The char -
  • The day, zero-padded to be exactly two chars long
  • The char T
  • The hour, zero-padded to be exactly two chars long
  • The char :
  • The minute, zero-padded to be exactly two chars long
  • The char :
  • The second, zero-padded to be exactly two chars long
  • (This is optional) Is the offset exactly zero? If yes, add the char Z and stop immediately.
  • Is the offset positive (or equal to zero, if you didn’t implement the previous step)?
    • If yes, the char +
    • If no, the char -
  • The hour part of the offset, zero-padded to be exactly two chars
  • The char :
  • The minute part of the offset, zero-padded to be exactly two chars
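
As a rough illustration, here is that concatenation in Python, assuming the timestamp lives in an offset-aware datetime. The function name format_rfc3339 is made up for this sketch, and it emits hh:mm:ss with no fractional seconds.

```python
from datetime import datetime, timedelta, timezone


def format_rfc3339(stamp):
    """Format an offset-aware datetime as date, T, time, then the offset."""
    offset = stamp.utcoffset()
    if offset is None:
        raise ValueError("an RFC3339 timestamp needs a UTC offset")
    # Fractional seconds are dropped in this sketch.
    text = (f"{stamp.year:04d}-{stamp.month:02d}-{stamp.day:02d}"
            f"T{stamp.hour:02d}:{stamp.minute:02d}:{stamp.second:02d}")
    if offset == timedelta(0):
        return text + "Z"  # the optional shortcut for a zero offset
    sign = "+" if offset > timedelta(0) else "-"
    # Whole minutes only; any sub-minute part of the offset is truncated.
    total_minutes = abs(offset) // timedelta(minutes=1)
    return f"{text}{sign}{total_minutes // 60:02d}:{total_minutes % 60:02d}"


plus_20 = timezone(timedelta(minutes=20))
print(format_rfc3339(datetime(1937, 1, 1, 12, 0, 27, tzinfo=plus_20)))
```

Splitting the offset into hours and minutes from its absolute value avoids the classic bug where a negative offset like -05:30 gets formatted from a negative minute count.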