Today's Deep-Dive: Harper
Ep. 266

Episode description

The Deep Dive explores Harper, an open-source, privacy-first grammar checker developed by Automattic, the company behind WordPress.com. Harper addresses the 'triple threat' of modern editing software: poor privacy, bad performance, and unnecessary cost. Unlike cloud-based tools such as Grammarly, which send user data to remote servers, Harper operates entirely offline, ensuring complete privacy. It also outperforms the major open-source alternative, LanguageTool, which is resource-intensive and slow. Harper achieves this by using carefully optimized, hand-crafted rules and efficient algorithms, all built with the programming language Rust. Rust's memory safety and speed contribute to Harper's impressive efficiency: it uses less than 1/150th of LanguageTool's memory. Harper is also portable, able to run via WebAssembly, making it easy to integrate into various applications. The project is open-source under the Apache 2.0 license and has gained significant community support, with extensive documentation for integration into major text editors. Harper's commitment to performance and privacy makes it a promising solution for users concerned about data security and efficiency.

Gain digital sovereignty now and save costs

Let's have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? And what about the state of your backups and security updates?

Digital sovereignty is easily achieved with Open Source software (which usually costs far less, too). Our division SafeServer offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now for 1 Euro - 30 days free!

0:00

Welcome back to the Deep Dive, the place where we distill complex information into

0:04

high-value knowledge nuggets custom-tailored for you.

0:08

Before we jump into today's fascinating analysis of a really interesting privacy-first

0:12

software project, we want to acknowledge the support that makes these deep dives

0:16

possible.

0:17

This exploration is supported by Safe Server. They manage hosting for the next

0:20

generation of software and support clients with digital transformation.

0:24

You know, when we talk about incredible new, efficient software like we are today,

0:28

someone needs to host it securely. Safe Server is there for you. Find out more at

0:32

www.safeserver.de.

0:34

Okay, let's unpack this. Our mission today is a deep dive into Harper, an open-source

0:40

grammar checker.

0:42

It's a project created by the people at Automattic, you know, the company behind

0:45

WordPress.com. So, if you're like most people, you probably use writing tools, right?

0:49

But maybe you also worry about what happens to your words after you hit save.

0:52

Harper promises to be the solution.

0:54

We're going to try and understand how this tool tackles what we're calling the

0:57

triple threat of modern editing software.

0:59

Poor privacy, really bad performance, and unnecessary cost.

1:02

Yeah, and if we connect this to the bigger picture, I mean, this project signals a

1:05

pretty crucial shift.

1:07

The developers weren't just aiming for, like, feature parity with other tools. They

1:11

were aiming for a total paradigm shift toward user control.

1:15

The stated goal is really clear. Harper is an offline, privacy-first grammar checker.

1:20

Fast, open-source, Rust-powered.

1:23

And that last phrase, Rust-powered, that's kind of the secret ingredient that lets

1:26

them achieve everything else.

1:28

Absolutely. And the source material is fantastic because it doesn't just introduce

1:32

a new tool.

1:33

It really acts as a kind of postmortem on the failures of the tools we've, well,

1:37

basically been forced to use until now.

1:39

So why did the creators feel the necessity to build a new grammar checker from

1:43

scratch?

1:44

Well, it seems like it was born out of some profound professional frustration.

1:47

Let's start with the critique of the market leader, Grammarly.

1:51

The developers considered it fundamentally flawed, especially for professional,

1:55

high-volume use.

1:57

First, they called it too expensive and too overbearing.

2:01

Now, that's a bit subjective, sure, but the objective issues are maybe more critical.

2:06

The suggestions often lacked necessary context or, as the source materials bluntly

2:10

put it, were sometimes just plain wrong.

2:13

Right. And then we get to the core ethical problem, the thing that really dictates

2:16

how we feel about using these tools.

2:18

Exactly. It was definitively labeled a privacy nightmare.

2:22

I mean, when you use a cloud-based checker like Grammarly, everything you type,

2:26

every sensitive document, every internal memo, it gets sent off to their remote

2:30

servers.

2:31

And while companies might claim they don't sell the data, the source material

2:34

correctly points out the concern.

2:36

This data can be used for training large language models or maybe other proprietary

2:40

purposes we simply don't have visibility into.

2:43

And that ties directly back to performance too, doesn't it?

2:46

Because even if you somehow trust the vendor completely, sending every single word

2:49

you write back and forth across the internet just to get a basic suggestion, that

2:54

must slow down the process considerably.

2:56

Oh, it absolutely does. That network round-trip time, you know, waiting for the

3:00

server to process and respond, it made revising work tedious and slow.

3:05

For someone trying to maintain that flow state while writing, that kind of lag is

3:09

just a constant disruption. It pulls you right out.

3:12

OK, so if the cloud-based solution is a privacy disaster zone, what about the major

3:17

open-source alternative, LanguageTool?

3:20

I mean, that should solve the privacy problem, shouldn't it? But the source

3:22

material suggests it has its own pretty massive issues.

3:26

Yeah, LanguageTool is characterized as great, but, and it's a big but, only if you're

3:31

willing to dedicate an enormous amount of computing muscle to it.

3:35

Its resource demands are staggeringly high. It requires gigabytes of RAM, and crucially,

3:40

it forces the user to download this massive statistical package.

3:44

It's known as an n-gram dataset, and it weighs in at around 16 gigabytes.

3:49

16 gigs, just for a grammar checker. Wow. We need to pause there for a second. What

3:56

exactly is an n-gram dataset, and why does LanguageTool need such a colossal file?

4:01

Right, that's essential context. An n-gram dataset is basically a massive

4:05

statistical table. It's built from analyzing billions and billions of sentences.

4:10

It tells the software how likely certain words are to follow other words. So if you

4:14

see "the dog ran", the n-gram model confirms, yeah, that's a highly probable

4:18

sequence.

4:19

LanguageTool uses this statistical approach to catch errors. But to make those

4:23

predictions accurate across many languages, you need a gigantic corpus of data, and

4:28

that data is the 16 gigabyte download.
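To make that concrete, here's a toy sketch of the idea behind an n-gram (specifically bigram) model: count how often each word pair appears in a corpus, then turn the counts into probabilities. Everything here, the BigramModel type and the tiny corpus, is purely illustrative and not LanguageTool's actual code; it's the billions-of-sentences scale that turns a table like this into a multi-gigabyte download.

    use std::collections::HashMap;

    /// Toy bigram model: counts how often `next` follows `prev`, then turns
    /// those counts into probabilities. Real n-gram datasets do this over
    /// billions of sentences, which is why they weigh in at gigabytes.
    struct BigramModel {
        pair_counts: HashMap<(String, String), u32>,
        prev_counts: HashMap<String, u32>,
    }

    impl BigramModel {
        fn train(corpus: &[&str]) -> Self {
            let mut pair_counts = HashMap::new();
            let mut prev_counts = HashMap::new();
            for sentence in corpus {
                let words: Vec<&str> = sentence.split_whitespace().collect();
                for pair in words.windows(2) {
                    *pair_counts
                        .entry((pair[0].to_string(), pair[1].to_string()))
                        .or_insert(0u32) += 1;
                    *prev_counts.entry(pair[0].to_string()).or_insert(0u32) += 1;
                }
            }
            BigramModel { pair_counts, prev_counts }
        }

        /// P(next | prev): how plausible is this two-word sequence?
        fn probability(&self, prev: &str, next: &str) -> f64 {
            let pair = (prev.to_string(), next.to_string());
            match (self.pair_counts.get(&pair), self.prev_counts.get(prev)) {
                (Some(&c), Some(&t)) if t > 0 => c as f64 / t as f64,
                _ => 0.0,
            }
        }
    }

    fn main() {
        let corpus = ["the dog ran home", "the dog slept", "the cat ran away"];
        let model = BigramModel::train(&corpus);
        println!("P(ran | dog) = {:.2}", model.probability("dog", "ran"));   // common sequence
        println!("P(away | dog) = {:.2}", model.probability("dog", "away")); // unlikely sequence
    }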

4:30

Okay, so LanguageTool is privacy-friendly because it runs locally, right on your

4:33

machine. But it's resource intensive because it's essentially hauling around this

4:38

library the size of a small country just to check sentence flow.

4:41

Precisely. And that sheer bulk translates directly into speed problems. The creator

4:45

found it too slow. It often took several seconds to lint even a moderately sized

4:49

document.

4:50

Let's quickly clarify that term for our listeners. We keep using the word lint.

4:53

What does linting actually mean here?

4:56

Sure. Linting is essentially the software checking the quality and correctness of

5:01

your text or sometimes code.

5:04

So when we say lint time, we mean the time it takes the software to scan your

5:07

document, run all its rules against it, and then present the errors or suggestions.

5:12

And if that takes several seconds, yeah, it just kills your productivity, you lose

5:16

momentum.
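As a rough picture of what a lint pass produces, here's a minimal sketch using invented names (Lint, double_space_rule) rather than Harper's real internals: every rule scans the text and emits findings with a location, a message, and an optional suggestion, which the editor then renders as underlines or quick fixes.

    /// One finding from a lint pass: where the issue is and what to do about it.
    /// (Invented type for illustration; Harper's real diagnostics differ.)
    struct Lint {
        start: usize,
        end: usize,
        message: String,
        suggestion: Option<String>,
    }

    /// "Linting" = run every rule over the text and collect the findings.
    fn lint(text: &str, rules: &[fn(&str) -> Vec<Lint>]) -> Vec<Lint> {
        rules.iter().flat_map(|rule| rule(text)).collect()
    }

    /// Example rule: flag consecutive spaces and suggest a single space.
    fn double_space_rule(text: &str) -> Vec<Lint> {
        text.match_indices("  ")
            .map(|(i, _)| Lint {
                start: i,
                end: i + 2,
                message: "Consecutive spaces".to_string(),
                suggestion: Some(" ".to_string()),
            })
            .collect()
    }

    fn main() {
        let rules: &[fn(&str) -> Vec<Lint>] = &[double_space_rule];
        for f in lint("This  sentence has a  problem.", rules) {
            println!("{}..{}: {} -> {:?}", f.start, f.end, f.message, f.suggestion);
        }
    }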

5:17

Okay, that sets the stage perfectly for Harper then, because here's where the

5:23

comparison becomes really stark. Harper was explicitly engineered to be like the

5:27

Goldilocks solution: fast, small, and completely private.

5:30

Exactly. Harper's gains are, well, transformational seems like the right word.

5:34

First, the privacy promise is absolute. It is completely private. It functions

5:38

entirely offline.

5:40

Your machine is the beginning and the end of the data flow, period. But the

5:43

performance metrics, I mean, that's what really blows the competition out of the

5:47

water.

5:48

While LanguageTool takes several seconds and demands gigabytes of RAM, Harper

5:52

checks the same document in milliseconds. And regarding memory footprint, it uses

5:56

less than 1/150th of LanguageTool's memory requirements.

6:00

That's a staggering efficiency gain. But OK, this is the point where I have to be a

6:04

bit skeptical. Surely for a tool to be that fast and that small, it must compromise

6:08

on intelligence, right?

6:10

How can Harper possibly manage accuracy without that huge 16-gigabyte n-gram file

6:15

and without cloud-based AI training?

6:18

That's the critical question, you're right. And it's answered by the choice of

6:22

technology. Instead of relying on that brute-force statistics approach, the n-grams,

6:26

Harper relies on carefully optimized, hand-crafted rules, dictionaries, and highly

6:31

efficient algorithms. And crucially, these are all built using a programming

6:35

language called Rust.
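For a feel of the rules-and-dictionary approach, here's a deliberately simplified sketch, not Harper's actual implementation: a small word list catches spelling slips and a hand-written rule catches repeated words, with no statistical corpus in sight.

    use std::collections::HashSet;

    fn main() {
        // A tiny in-memory word list; Harper ships a curated dictionary rather
        // than a multi-gigabyte statistical corpus. (Words here are illustrative.)
        let dictionary: HashSet<&str> =
            ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
                .into_iter()
                .collect();

        let text = "the quik brown fox jumps over the the lazy dog";

        // Rule 1: spelling — flag any word the dictionary doesn't know.
        for word in text.split_whitespace() {
            if !dictionary.contains(word) {
                println!("unknown word: {word}");
            }
        }

        // Rule 2: a hand-crafted grammar rule — repeated adjacent words ("the the").
        let words: Vec<&str> = text.split_whitespace().collect();
        for pair in words.windows(2) {
            if pair[0] == pair[1] {
                println!("repeated word: {}", pair[0]);
            }
        }
    }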

6:37

OK, we've mentioned Rust powered several times now. Why is Rust the key here,

6:41

especially for that staggering efficiency gain, particularly regarding memory use?

6:46

Well, Rust is known for a couple of things. Memory safety and extreme speed. And

6:50

the memory savings compared to LanguageTool, which is Java-based, are enormous.

6:56

It really comes down to how the languages handle memory management. You see, Java

6:59

relies on something called a garbage collector, or GC.

7:04

It's a process that runs in the background to automatically find and clean up

7:07

memory that's no longer being used.

7:09

But that GC itself requires constant runtime resources and adds significant memory

7:13

overhead. It's always kind of there.

7:16

Rust, on the other hand, manages memory differently. It figures it out at compile

7:19

time when the program is built.

7:21

It guarantees memory safety without needing a runtime garbage collector.

7:25

And this mechanism, this absence of a GC, is why Harper can deliver the same or

7:29

similar results with almost no memory overhead.

7:32

That's how it achieves that astonishing reduction, down to less than a hundred and

7:35

fiftieth of LanguageTool's requirements.
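A tiny example of that compile-time memory management, using plain Rust (this is general language behavior, nothing Harper-specific): ownership tells the compiler exactly where each allocation can be freed, so no collector has to run alongside the program.

    fn word_count(text: String) -> usize {
        // `text` is owned here; the compiler schedules its cleanup for the
        // closing brace below. No garbage collector ever has to find it.
        text.split_whitespace().count()
    } // `text` is freed deterministically at this point, decided at compile time.

    fn main() {
        let document = String::from("Harper checks this sentence locally.");
        let words = word_count(document); // ownership of the String moves into the call
        // `document` can no longer be used here; the compiler enforces that,
        // which is how Rust gets memory safety without runtime GC overhead.
        println!("{words} words checked");
    }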

7:38

Ah, okay. That distinction, getting rid of the garbage collector overhead, that's

7:42

the missing technical piece I needed.

7:44

That helps understand the efficiency gains. It sounds like a design choice that

7:47

prioritizes performance above everything else.

7:49

Exactly. And that focus on efficiency also leads directly to its portability.

7:54

The source notes that Harper is small enough to load via WebAssembly, or Wasm.

7:59

Right, Wasm. For our listeners, what's the practical implication? What's the so-what

8:03

of WebAssembly?

8:05

Think of Wasm as like a tiny, standardized virtual machine.

8:09

It lets pre-compiled code run almost instantly inside any modern web browser or

8:13

application environment, so the practical implication is huge.

8:17

You don't need to install anything complex. You don't need an API key. You

8:20

certainly don't need a server.

8:22

It makes Harper truly ubiquitous. You could integrate a grammar checker that runs

8:26

at near-native speeds right into a browser application almost instantly.

8:29

It really reinforces that promise of quick, local, and private processing.
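As a sketch of what that looks like from the Rust side, this is roughly how a function gets exposed to the browser through WebAssembly with the widely used wasm-bindgen toolchain; the function itself is a made-up placeholder, not Harper's actual API.

    // Cargo.toml would add: wasm-bindgen = "0.2" and crate-type = ["cdylib"].
    use wasm_bindgen::prelude::*;

    /// Exported to JavaScript once compiled to WebAssembly.
    /// A browser can call this on every keystroke with no server round-trip.
    #[wasm_bindgen]
    pub fn count_issues(text: &str) -> u32 {
        // Placeholder standing in for a real grammar pass:
        // count runs of consecutive spaces, entirely on the user's machine.
        text.matches("  ").count() as u32
    }

Once compiled (for example with wasm-pack), the JavaScript side imports the module like any other package, which is what makes the no-install, no-server story possible.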

8:33

And finally, it is truly open source. It's under the Apache 2.0 license, available

8:36

at writewithharper.com.

8:39

Okay, so now let's talk about the practical application side. What does this all

8:42

mean for the person actually sitting down and writing?

8:45

Well, the ecosystem looks pretty robust. Currently, the project is focused on

8:49

English support only. That's important to note.

8:52

However, the core architecture was explicitly designed from the ground up to be extensible

8:56

for other languages, and the project clearly states they welcome contributions.

9:02

So expanding its linguistic reach seems like it's simply a matter of community

9:05

effort, building on what's already a very solid, very fast foundation.

9:10

And that commitment to performance, it isn't just a launch feature, is it? It

9:14

sounds like an ongoing philosophy.

9:16

It absolutely is. It seems like the defining feature, really. The development team

9:19

is so serious about maintaining peak performance that they state they consider

9:23

long lint times bugs.

9:25

That signals they are constantly profiling and optimizing the code. They treat any

9:30

slowdown not as a minor issue, but as a critical defect that must be addressed

9:34

immediately. That's a strong commitment.

9:36

And looking at the health of the project, the community metrics, they speak volumes,

9:39

don't they? 8.1k stars on GitHub, 202 forks, 57 contributors.

9:44

For an open source project, 8,100 stars indicate significant developer trust and viability.

9:50

It shows this is well past the side project stage. It looks like a major ecosystem

9:53

tool now.

9:54

Yeah, definitely. And that level of community engagement has translated directly

9:58

into real-world usability. They have extensive documentation for integrating Harper

10:03

into, well, virtually every major professional text editor you can think of.

10:07

They list support for Visual Studio Code, NeoVim, Helix, Emacs, and Zed. Pretty

10:11

comprehensive.

10:13

Why is integration into specialized tools like NeoVim and Helix such a big deal?

10:15

What does that tell us?

10:17

Well, it demonstrates its incredible lightness, its efficiency. NeoVim and Helix

10:22

are terminal-based, very power-user-focused environments. They're often preferred

10:27

by developers specifically because they're fast and use very few resources.

10:32

The fact that Harper integrates seamlessly there proves that it's truly lightweight

10:36

enough for even the most resource-conscious environments. It really solidifies its

10:40

place as a professional, ready-to-use tool, not just some proof of concept.

10:45

OK, so to summarize the key takeaway for you, the listener, Harper seems to

10:49

successfully solve that triple threat we talked about with modern editing software,

10:53

the lack of privacy, the high cost or resource usage, and the poor performance.

10:58

And it does this by leveraging the speed and memory efficiency of modern languages

11:02

like Rust, combined with the transparent, collaborative power of the open-source

11:06

model. It really could be a blueprint for how critical software like this can be

11:10

built in the future.

11:11

Yeah, and this whole thing raises an important question, I think, for you to

11:14

consider. As AI models and editing tools become more ubiquitous, more powerful, but

11:20

also demand more and more of your data to function, how much value do you

11:25

personally place on processing the sensitive documents you're writing locally, for

11:29

that guaranteed absolute privacy?

11:32

How does that compare to the perceived convenience, or maybe the occasional smarter

11:36

suggestion offered by those massive cloud-based services? That trade-off, you know,

11:41

between privacy and perhaps features or intelligence, that's a calculation we all

11:45

probably need to be ready to make more consciously going forward.

11:49

That's a fascinating dilemma to mull over. Thank you for diving deep with us today.

12:03

And remember, this deep dive was supported by SafeServer, your partner for hosting

12:03

your project at www.safeserver.de. We'll catch you next time for the next Deep Dive.
