What's in a Tweet?

The Answer is Accompanied by a Build System Recommendation.

Once upon a time, I worked at Twitter (it was fun back then). One unsatisfying part of that software was Tweet text processing, which was defined by a nest of regular expressions that sometimes needed to be applied in a certain order. These regular expressions were implemented in at least four languages: Ruby, Java, JavaScript, and Objective-C. To keep these languages in sync, other Twitter engineers wisely wrote a single test corpus in YAML. That was a lot of work, but it does preserve interoperability most of the time.

[Photo: Twitter HQ. March 2, 2015.]

What are the rules for hashtags, @mentions, URLs, replies, tweet length, and all that? It’s way more complicated than one might expect. The first step of this project was to write a Pest grammar for use with Rust and attempt to match the Twitter conformance suite. Without that test suite, matching the Twitter behavior would have been very difficult.

The first comment in the grammar is: “A tweet is composed of a sequence of text and entities.”

The next comment says: “An ‘entity’ represents some special content embedded in the tweet. This grammar only covers those that have representations in the tweet text. There are other types, such as polls, that do not appear in the text.”

A third comment explains the ordering: “Pest uses a Parsing Expression Grammar (PEG), so the order of the choices here is significant. In particular, placing URLs before other entities helps resolve ambiguities where URL fragments might also be interpreted as hashtags.”

The original implementations in Ruby and the other languages used a lot of regular expressions. As a result, there are mitigations for problems like Regular expression Denial of Service (ReDoS). There are also some workarounds where regular expressions aren’t quite the right fit for the task. For example, only two levels of nested parentheses are allowed in Tweet URLs.

In Ruby 2.6, the one that ships with macOS 26 (Tahoe), the problems with backtracking in regular expressions detailed by Russ Cox in his article Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, …) are easy to observe. At least for some obvious test cases, these issues are much better in Ruby 3.2 and later; the Ruby team seems to have applied the memoization techniques mentioned in the article. Twitter was written with Ruby 1.8.7, so Tweets are designed to avoid most of these pitfalls.

Parsing Expression Grammar

The “entity” production is defined as:

entity =
  _{
    possible_url |
    possible_hashtag |
    possible_mention |
    possible_cashtag
  }

The “_” underscore marks what Pest calls a silent rule: the grammar checks it, but it will not show up in the parse tree. These “possible” matches exist because all of these entities have characters that are not allowed to precede them, and other characters that are not allowed to follow them. The Pest grammar inspects tokens until there is a certain match. For example, “@@sayrer” might be a mention, but it’s ruled out because mentions can’t contain “@” and cannot be preceded by “@”. Writing “@sayrer” will match.
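The preceding-character rule can be sketched in a few lines. This is not the real twitter-text code, just a minimal illustration of why “@@sayrer” is rejected while “@sayrer” matches; the real rules also restrict more characters around the mention.

```rust
// Sketch (not the real implementation): a mention candidate at byte
// position `at_pos` is rejected when the character immediately before
// the "@" is itself "@" or alphanumeric.
fn mention_at(text: &str, at_pos: usize) -> Option<&str> {
    if let Some(prev) = text[..at_pos].chars().next_back() {
        if prev == '@' || prev.is_alphanumeric() {
            return None; // disallowed preceding character
        }
    }
    let body = &text[at_pos + 1..];
    // Mention names here are ASCII alphanumerics and underscores.
    let end = body
        .char_indices()
        .find(|&(_, c)| !(c.is_ascii_alphanumeric() || c == '_'))
        .map(|(i, _)| i)
        .unwrap_or(body.len());
    if end == 0 { None } else { Some(&body[..end]) }
}

fn main() {
    assert_eq!(mention_at("@sayrer", 0), Some("sayrer"));
    // "@@sayrer": the second "@" is preceded by "@", so it is not a mention.
    assert_eq!(mention_at("@@sayrer", 1), None);
}
```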

Other entities, like cashtags and hashtags, have similar constraints. URLs are a bit more complex. With URLs that begin with “http://” or “https://”, the main issue in English text is strings along the lines of “https://google.com/.”. That is a valid URL, but the period is part of the text, not the URL. That trailing period should not be linked or highlighted: “https://google.com/.”. There is a further twist, though. Tweets also highlight text like “google.com”. There’s no URL scheme to start off the link there, so Tweet parsers must examine possible URLs by probing the text for potential URLs until they find a valid TLD (top-level domain) followed by characters that are allowed. This rule means text like “anything.google.com” or “anything.google.co.uk” will be linked and highlighted, but “google.com.notatld” will not.
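Both behaviors can be sketched with a toy TLD list (the real grammar embeds a long one, and the real rules also constrain the surrounding characters):

```rust
use std::collections::HashSet;

// A bare-domain candidate is linkable only if its last dot-separated
// label is a known TLD.
fn has_valid_tld(domain: &str, tlds: &HashSet<&str>) -> bool {
    domain.rsplit('.').next().map_or(false, |last| tlds.contains(last))
}

// A trailing "." or "," after a scheme URL belongs to the sentence,
// not the URL, so it is trimmed off the match.
fn trim_trailing_punct(url: &str) -> &str {
    url.trim_end_matches(|c| c == '.' || c == ',')
}

fn main() {
    let tlds: HashSet<&str> = ["com", "uk"].into_iter().collect();
    assert!(has_valid_tld("anything.google.com", &tlds));
    assert!(!has_valid_tld("google.com.notatld", &tlds));
    assert_eq!(trim_trailing_punct("https://google.com/."), "https://google.com/");
}
```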

Sometimes, these rules produce unexpected results, as in this incident. The user meant to type “…G-20. In…”, but missed a space, so “G-20.In” was linked and highlighted. It was an unregistered domain, so someone registered it and then used it for heckling.

In addition to a long list of TLDs, the Pest grammar also includes a long rule for matching emoji. These need to be distinct from text, because emoji composed from combining characters must only count as one emoji toward the character limit.
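A toy counter shows why emoji sequences must be matched as units. This is a simplification for illustration (the real rule also covers skin-tone modifiers, flags, keycaps, and so on): anything joined by U+200D (zero-width joiner) counts as a single unit toward the limit.

```rust
// Count "units" in text, treating a ZWJ-joined sequence such as
// man + ZWJ + woman + ZWJ + girl (the family emoji) as one unit.
fn emoji_unit_count(text: &str) -> usize {
    const ZWJ: char = '\u{200D}';
    let mut count = 0;
    let mut prev_zwj = false;
    for c in text.chars() {
        if c == ZWJ {
            prev_zwj = true; // the next char continues the current unit
            continue;
        }
        if !prev_zwj {
            count += 1; // this char starts a new unit
        }
        prev_zwj = false;
    }
    count
}

fn main() {
    // Family emoji: five scalar values, but one unit.
    assert_eq!(emoji_unit_count("👨\u{200D}👩\u{200D}👧"), 1);
    assert_eq!(emoji_unit_count("ab"), 2);
}
```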

Performance

This project didn’t start out with the aim of building a really fast Tweet parser, but its performance is monitored. The Pest grammar is very clear, but as shown below, it has some performance issues. One would expect it to be at least as fast as the Ruby and Java implementations, which spend most of their time in their regex engines.

This section contains some language comparisons, and all of the usual benchmarking hazards are present. They might not be measuring quite the same thing, some benchmarks might be returning incorrect results, and maybe a language could be faster with a different approach. All of the tests run from the same data, and all of the conformance tests pass for all implementations. But feel free to submit a pull request.

[Chart: Twitter Text: Rust/Pest vs Ruby vs Java (ops/sec, higher is better). Operations per second comparing the Pest, Ruby, and Java backends.]

It seemed like the Rust code could be a lot faster. Profiles showed that the TLD and emoji rules were the most time-consuming parts of the grammar, even though they had been hand-tuned a bit. Given the extensive test suite and a clear grammar, Claude Code seemed like a very good tool for writing optimizations. The first improvement attempted with Claude was to replace parts of the grammar with procedural code and lookup data structures for the TLDs and emoji rules. Claude suggested adapting the code to use the emojis crate (“✨ Lookup emoji in O(1) time”), and similarly using its underlying phf crate (perfect hash functions) for the TLD list. Reasonable.

Claude Code got things mostly right. It vastly reduced the size of the grammar by moving the TLD and emoji rules into procedural code. There was already procedural code for other reasons, so it adjusted the boundary between the grammar and that code. Its code needed a few tweaks, but fewer than ten or so. Claude Code was also strangely dismissive of failing tests and would sometimes give up.

[Chart: Twitter Text: Rust vs Ruby vs Java (ops/sec, higher is better). Operations per second comparing Rust with PHF emoji and TLD lookups, Ruby, and Java.]

That performance is much improved, but it’s still slower than Java in one test. JavaScript would certainly be faster still, even though it hadn’t been tested yet; the V8 folks have been hacking and optimizing that problem for a long time. All tests had been running against both versions of the Rust implementation, and Claude Code suggested moving more code from the grammar into the procedural code.

Another way to go would have been to use regular expressions in Rust. That would probably perform pretty well. It would require the fancy-regex crate, since Tweet text uses some lookahead assertions, and the normal regex crate (intentionally) does not support these. It would also still need some of the uglier and inefficient parts of the regex-based implementations, like removing overlapping entities in post-processing.
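The overlap-removal pass mentioned above is easy to sketch. This is an illustration of the general technique, not the actual twitter-text code: entities are collected as byte ranges, then overlapping ranges are dropped, keeping the earliest-starting (and, on ties, longest) match.

```rust
// Drop overlapping (start, end) entity ranges, keeping the
// earliest-starting match; on equal starts, prefer the longer one.
fn remove_overlaps(mut entities: Vec<(usize, usize)>) -> Vec<(usize, usize)> {
    entities.sort_by(|a, b| a.0.cmp(&b.0).then(b.1.cmp(&a.1)));
    let mut kept: Vec<(usize, usize)> = Vec::new();
    for e in entities {
        // Keep e only if it starts at or after the end of the last kept range.
        if kept.last().map_or(true, |last| e.0 >= last.1) {
            kept.push(e);
        }
    }
    kept
}

fn main() {
    assert_eq!(
        remove_overlaps(vec![(5, 10), (0, 8), (12, 15)]),
        vec![(0, 8), (12, 15)]
    );
}
```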

Nom

Since it would be possible to create a procedural parser by following the Pest grammar, Claude Code attempted to read the Pest grammar and write a Nom parser. It got all of the easier parts correct, but many complex tests were still failing. With some back-and-forth instruction, Claude Code was able to get everything working. In hindsight, it probably would have been faster to write these fixes by hand.

The Nom parser was profiled in a few ways. The first step was to use samply, an easy-to-use profiler that produces flame graphs viewable in the browser. The workflow was to read the profiles and talk over any hotspots with Claude Code. It would suggest optimizations, some of which were quite bogus, but it was possible to pick and choose. Then, Claude Code would write some micro-optimizations, break tests, and then fix them. Sometimes, the suggested optimization would end up being slower by the time it was correct. A lot of them made it through, though. The biggest wins were from switching to memchr where possible, and avoiding allocations. Not coincidentally, the memchr crate is written by the author of the Rust regex crate.
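The memchr-style win can be sketched with the standard library alone (the real code uses the memchr crate, which does the same scan with SIMD): instead of walking the text character by character, jump straight to the next byte that could start an entity.

```rust
// Find the next byte that could begin an entity, scanning raw bytes
// rather than decoding chars. Trigger set here is illustrative.
fn next_entity_trigger(text: &str, from: usize) -> Option<usize> {
    text.as_bytes()[from..]
        .iter()
        .position(|&b| b == b'@' || b == b'#' || b == b'$')
        .map(|i| from + i)
}

fn main() {
    assert_eq!(next_entity_trigger("hello @world", 0), Some(6));
    assert_eq!(next_entity_trigger("no triggers", 0), None);
}
```

This works because all of the trigger characters are ASCII, so a byte-level scan can never land in the middle of a multi-byte UTF-8 sequence.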

At some point, samply stopped finding new things to optimize. Gungraun, which wraps Valgrind, counts instructions and many other metrics that don’t rely on wall-clock time. It was able to highlight some more optimizations that were not visible in the samply flame graphs.

The downside to all of these microoptimizations is that the code is not exactly fun to read. It is not what the parser combinator folks that wrote Nom had in mind. Or, as Claude.ai put it: “End result: a hand-rolled recursive descent parser wearing a Nom trenchcoat.” or “…and suddenly you’ve got something that’s 3x faster but looks like it was written by someone who hates you.”

Well, it is faster. The following charts change the y-axis maximum from 2,500 to 160,000. That is a really big difference, but the benchmarks are simple. Lies, damned lies, statistics, and benchmarks. However, the results are consistent across machines and in many languages. Claude Code didn’t suggest any optimizations that were obviously cheating, and all optimizations were in the Tweet code, not the test harness. They all focused on the trickier parts of the Tweet grammar.

[Chart: Twitter Text: Rust vs Ruby vs Java (ops/sec, higher is better). Operations per second comparing Rust with Nom, Ruby, and Java.]

Those results are extremely good. Again, these benchmarks are simple. But this comparison has been run on other data sets: large numbers of tweets, artificially long tweets, international languages, and so on. The ugly Nom implementation is always at least an order of magnitude faster than the alternatives, and its interface is the same as the pure Pest version’s, so the two are tested against each other.

Other Languages via FFI

Since the interface to the Rust library is so simple, it is easy to use from other languages via a Foreign Function Interface (FFI). Each language has its own way of calling the library, detailed below. The y-axes of these charts are scaled to 160k to match the Rust/Nom chart above.
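On the Rust side, exposing such an interface means exporting C-ABI functions. The function name and behavior below are illustrative, not the real library’s API: a stand-in “parser” that counts entity trigger characters in a NUL-terminated UTF-8 buffer, callable from C, Objective-C, or Java’s FFM API.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// Hypothetical C-ABI entry point (name and logic are illustrative).
#[no_mangle]
pub extern "C" fn tweet_entity_count(text: *const c_char) -> i32 {
    if text.is_null() {
        return -1; // error sentinel for null input
    }
    let s = unsafe { CStr::from_ptr(text) }.to_string_lossy();
    // Stand-in for the real parser: count '@' and '#' triggers.
    s.bytes().filter(|&b| b == b'@' || b == b'#').count() as i32
}

fn main() {
    let c = CString::new("hi @a and #b").unwrap();
    assert_eq!(tweet_entity_count(c.as_ptr()), 2);
}
```

Built as a `cdylib`, a function like this gets a C header (by hand or via cbindgen) and each language binds to that.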

[Chart: Twitter Text: Java Rust FFI vs Java (ops/sec, higher is better). Operations per second comparing the Rust FFI and original Java backends.]

The Java FFI implementation is accomplished via the Java Foreign Function and Memory API (FFM API) and jextract.

There’s obviously a slowdown vs the Rust/Nom version, but the code is shared and it’s still pretty fast. Regular expressions in Java are not particularly fast compared to normal Java code, so the pure Java version would probably be faster with a hand-rolled parser that didn’t cross an FFI boundary. This wouldn’t necessarily be true for some of the other languages, since the regex engine is often faster than the language itself.

[Chart: Twitter Text: Ruby Rust FFI vs Ruby (ops/sec, higher is better). Operations per second comparing the Rust FFI and original Ruby backends.]

The Ruby FFI implementation is accomplished via magnus. The Ruby FFI implementation is a little faster than the Java one. This shows that most of the time is spent in Rust, but the FFI boundary is a bit more expensive in the Java version. It’s probably string conversion (UTF-8 to UTF-16) in this case, but it hasn’t been measured.

This test was performed with Ruby 3.4.7. The original Twitter implementation used Ruby Enterprise Edition 1.8.7 and later. Ruby 3.4.7 is about 1.2-1.5x faster than REE 1.8.7 on the older ruby benchmarks for twitter-text 1.0.

[Chart: Twitter Text: Obj-C Rust FFI vs Obj-C (ops/sec, higher is better). Operations per second comparing the Rust FFI and original Objective-C backends.]

Tweets need to be processed on Apple clients, so there’s an Objective-C library. There’s an FFI directory that wraps the Rust implementation to make it callable from C (or Objective-C or C++). Rust has solid support for this task.

The original Objective-C implementation doesn’t have an Autolinker, because that’s a server-side task. The Java implementation is used both on the server and in the Android client, where the Autolinker presumably goes unused.

[Chart: Twitter Text: JS Rust WASM vs JavaScript (ops/sec, higher is better). Operations per second comparing the Rust WASM and original JavaScript backends.]

This JavaScript/WASM test was profiled many times, and Claude Code diagnosed performance problems with a few truisms (it’s the WASM boundary, etc.). Everything it mentioned was in the profile, but none of it really dominated. The slowdown vs the native Rust implementation is large, but it seems to come from WASM itself being slower. That’s expected, but it seems to be really costly in this test case.

This is quite a good result, given that the competition is V8 and its highly optimized regex JIT that generates native code.

New Languages

It’s possible to add other languages that haven’t previously had a Tweet parser.

[Chart: Twitter Text: Rust vs C++ vs Swift vs Python (ops/sec, higher is better). Operations per second comparing Rust, C++, Swift, and Python.]

This chart shows the result of adding C++ (as well as an ASAN build), Swift on macOS/Linux, and Python through PyO3.

Mastodon

Another thing that’s easier to do now is add new syntax across all of these implementations. Mastodon, a decentralized social network, happens to use twitter-text for highlighting text inside posts. They omit some features, like cashtags, and add their own @mention name highlighting.

Their mention syntax is something like @user@domain.tld. Support for this syntax was added in one commit across all languages. It was not a difficult change, but it was accomplished all at once.
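The shape of that syntax is simple to sketch. This is an illustration, not the actual commit: split a “@user@domain.tld” candidate into user and domain parts, with simplified character rules (the real change also validates the TLD and the surrounding characters).

```rust
// Sketch of Mastodon-style mention parsing: "@user@domain.tld".
fn mastodon_mention(text: &str) -> Option<(&str, &str)> {
    let rest = text.strip_prefix('@')?;
    let (user, domain) = rest.split_once('@')?;
    let ok_user = !user.is_empty()
        && user.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'_');
    let ok_domain = domain.contains('.')
        && domain
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-');
    (ok_user && ok_domain).then(|| (user, domain))
}

fn main() {
    assert_eq!(
        mastodon_mention("@gargron@mastodon.social"),
        Some(("gargron", "mastodon.social"))
    );
    // A plain Twitter-style mention has no second "@", so no match here.
    assert_eq!(mastodon_mention("@sayrer"), None);
}
```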

Bazel

How is this all held together? Bazel. The MODULE.bazel file has all of the dependencies for the project. Rust, Java, C++/LLVM, Python, Swift, etc. Once Bazel is working, it will download all of that and start building.
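A minimal sketch of what such a MODULE.bazel looks like. The module name and versions here are placeholders, not the project’s actual pins; the rule sets named are real Bazel Central Registry modules.

```starlark
# Hypothetical MODULE.bazel sketch; versions are placeholders.
module(name = "twitter_text", version = "0.1.0")

bazel_dep(name = "rules_rust", version = "0.0.0")    # placeholder version
bazel_dep(name = "rules_java", version = "0.0.0")    # placeholder version
bazel_dep(name = "rules_python", version = "0.0.0")  # placeholder version
bazel_dep(name = "rules_swift", version = "0.0.0")   # placeholder version
```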

There is one exception to that hermetic setup. The Bazel Ruby rules don’t work well with Bazel’s remote execution, so those require a local Ruby installation, and the Ruby tests don’t run in the remote build configuration.

The kind folks at BuildBuddy run all of the tests remotely. Here’s an example build log. So, on every commit, BuildBuddy runs the tests, but it only builds and tests what has changed. GitHub Actions are used to trigger rustfmt and clippy.

Cross Platform Applications

Building everything as a light wrapper around Rust is possible, but it doesn’t make sense if an application uses a lot of platform-native code (detailed UI, cameras, motion sensors, etc). In those cases, trying to do everything in Rust just makes the task more difficult, and probably duplicates code anyway. OpenAI has a case study where they used LLMs for their Sora Android port. For now, there does seem to be a sweet spot where server and client side logic can be written in Rust or equivalent, while porting the UI quickly to a native platform with LLMs.

This project should also serve as a recipe for a fairly complex polyglot monorepo.
