Today's Deep-Dive: CollectiveAccess
Ep. 356

Episode description

In this episode of Deep Dive, we explore the powerful open-source platform quietly powering digital collections at museums, archives, and research institutions around the world: CollectiveAccess.

How do major cultural institutions manage vast and complex collections that include everything from historical manuscripts and paintings to audio recordings, films, and even high-resolution 3D scans of artifacts? The answer lies in a sophisticated digital infrastructure designed for scale, flexibility, and long-term preservation.

We break down the core architecture of CollectiveAccess and explain its unique two-part system: Providence, the robust backend where archivists catalog and manage collection data, and Pawtucket2, the public-facing interface that transforms structured metadata into engaging online exhibits.

Along the way, we unpack how the platform supports diverse metadata standards, handles massive media collections, and enables discovery through powerful tools like faceted search, interactive maps, and timelines. We also look at what’s new in CollectiveAccess 2.0 — from modernized APIs using GraphQL to AI-powered transcription and translation tools that are reshaping how institutions unlock audio and video archives.

Whether you’re a technologist, archivist, digital humanist, or simply curious about how the digital backbone of modern museums works, this episode offers a behind-the-scenes look at the systems preserving cultural heritage in the digital age.

Tune in as we dive deep into the architecture, innovation, and future of digital collections.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? How is the state of backups and security updates?

Digital sovereignty is easily achieved with open-source software (which usually costs far less, too). Our division SafeServer offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now!

Download transcript (.srt)
0:00

Welcome back to The Deep Dive.

0:01

Today we're tackling something huge

0:04

and often totally invisible.

0:06

Right.

0:07

How do the world's biggest cultural institutions,

0:10

we're talking major museums, huge archives, research

0:14

universities, how do they manage these just

0:17

colossal collections?

0:18

And we're not talking about a few spreadsheets here.

0:21

These collections have everything.

0:23

Digitized texts, paintings, high res video, audio histories,

0:26

and now even complex 3D scans of objects.

0:29

That's an incredible mix of data.

0:30

Exactly.

0:31

And handling all that, keeping it safe,

0:33

and making it accessible to people,

0:35

that needs a really industrial strength solution.

0:38

And that's our focus today.

0:39

We're looking at the open source software

0:41

that hundreds of these institutions

0:42

trust to solve this exact problem: CollectiveAccess.

0:45

We're going to get into its core architecture, which

0:47

has this powerful duality: Providence and Pawtucket 2,

0:51

and we'll look especially at the big modernizing updates

0:53

in the new version 2.0.

0:55

So we've been unpacking the official documentation,

0:57

deep diving into the GitHub repos,

0:59

and really trying to understand the system.

1:01

And our mission here is to give you

1:03

a clear structural understanding of it.

1:06

We want you to see past the pretty website

1:08

and get how the engine behind modern archives actually works.

1:12

OK, so let's start at the very beginning, the why.

1:15

CollectiveAccess, or CA.

1:16

It started back in 2003.

1:18

Right, and it wasn't born to be a commercial product.

1:20

It was a direct response to a massive gap in the market.

1:24

What was the gap?

1:25

The tools that existed were just prohibitively expensive.

1:29

They were proprietary, they were rigid.

1:31

Institutions had these really complex needs,

1:33

but tight budgets.

1:35

So CollectiveAccess came in as a free, open-source alternative.

1:39

Exactly.

1:40

It's under the GPL 3.0 license.

1:42

That meant anyone, a small local history group

1:45

or a huge national gallery, could just download it and use it.

1:48

No massive licensing fees.

1:50

And that accessibility is what let

1:51

it grow into what it is today.

1:53

Which brings us to, I think, the most fundamental thing

1:56

you have to understand about it.

1:57

The core duality.

1:58

The duality, yeah.

1:59

I like to think of it like a restaurant.

2:00

You have the kitchen in the back, super organized complex

2:03

where all the prep work happens.

2:05

And then you have the dining room out front.

2:08

Beautiful, simplified, where the guests actually

2:10

interact with the final product.

2:12

You need that separation.

2:14

So the kitchen, the back end, where the hard work is done,

2:16

that's called Providence.

2:17

Yes.

2:18

That's the data management and cataloging application.

2:21

That's where the curators and archivists spend their days,

2:24

meticulously documenting every single item.

2:27

And the dining room, the front end, is Pawtucket 2.

2:30

Pawtucket 2, that's the public interface.

2:32

It's the website layer that takes all that organized data

2:35

from Providence and makes it look good,

2:37

makes it interactive for researchers, and well, for you.

2:40

And separating them isn't just for convenience, is it?

2:43

It's a really core architectural choice.

2:46

Why is that so important for an archive?

2:48

It's about sanity, really, and security.

2:51

If you try to manage everything in one single application,

2:54

you risk compromising your complex internal data standards

2:58

just to make a slick website.

2:59

Ah, I see.

3:00

This duality separates the incredibly detailed

3:03

internal organization from the user-friendly public discovery.

3:07

It means a web designer can go and update Pawtucket 2

3:09

without ever, ever risking the archival integrity of the data

3:13

that's locked down in Providence.

3:15

That makes perfect sense.

3:17

So before we dive deeper into Providence,

3:19

let's just take a moment to thank the supporter of this deep dive.

3:22

Good idea.

3:23

SafeServer handles the hosting of exactly this kind

3:25

of complex, high-demand software,

3:27

and they support institutions with their digital transformation.

3:31

So if you're looking for reliable hosting

3:33

for something powerful like CollectiveAccess,

3:36

you can find more info at www.safeserver.de.

3:40

OK, let's go into that engine room.

3:42

Providence, this is the app that's

3:44

built to solve what I call the archivist's nightmare.

3:47

Handling extreme data diversity.

3:49

Exactly.

3:50

Yeah.

3:50

The number one feature of Providence

3:52

is just how configurable it is.

3:54

Unlike a lot of off-the-shelf software,

3:55

it's not built for just one type of collection.

3:57

It supports museums, archives, and research contexts

4:00

all at once.

4:01

When you say configurable, what does that actually

4:04

mean for the person cataloging something?

4:06

It means they aren't forced to sort of shoehorn their collection

4:10

into someone else's idea of what a field should be.

4:13

Providence lets you build catalogs

4:15

that conform to recognized standards, like Dublin Core,

4:18

or you can customize totally new fields.

4:21

And you can do that without a programmer?

4:24

Without writing custom code, yes.

4:25

The system adapts to the institution,

4:27

not the other way around, which is vital,

4:29

because a history museum's needs are just

4:31

wildly different from a natural science museum's.
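To make "configurable" concrete, here's a minimal Python sketch of the idea: a catalog profile that mixes standard Dublin Core elements with institution-specific custom fields. The field names and the `make_record` helper are hypothetical illustrations; CollectiveAccess's real installation profiles are XML documents, not code like this.

```python
# Sketch: a catalog "profile" mixing Dublin Core elements with custom
# fields. Field names and helper are hypothetical, for illustration only.

DUBLIN_CORE = {"title", "creator", "date", "format", "subject", "description"}

def make_record(profile_fields, **values):
    """Build a record, rejecting values for fields the profile doesn't define."""
    unknown = set(values) - set(profile_fields)
    if unknown:
        raise ValueError(f"fields not in profile: {sorted(unknown)}")
    return {field: values.get(field) for field in profile_fields}

# A history museum's profile: Dublin Core plus two custom fields.
profile = sorted(DUBLIN_CORE) + ["accession_number", "conservation_notes"]

record = make_record(
    profile,
    title="Letter from the mayor, 1910",
    creator="Unknown",
    format="application/pdf",
    accession_number="1998.014.003",
)
print(record["accession_number"])  # custom field sits alongside DC elements
```

The point is the shape of the system: the institution defines the profile once, and every record is validated against it, rather than the software dictating the fields.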

4:33

And the media it handles.

4:35

It's really impressive.

4:36

I mean, documents, images, audio, video, that's expected.

4:40

But it also handles 3D models.

4:41

And that detail tells you how modern the system is.

4:44

That's the key technical achievement, really.

4:46

How so?

4:47

Because managing a huge 3D scan of a fragile artifact, which

4:51

is basically a cloud of data points,

4:53

in the same system as a simple PDF of a letter from 1910,

4:57

that requires immense flexibility on the back end.

5:00

It treats them both as collection items.

5:02

So it's not just about the variety of data.

5:04

It's also got to handle the day-to-day work

5:05

of running an institution.

5:07

What kind of workflow features are built in?

5:09

It's all about bulk processing.

5:11

An archive might acquire 10,000 records at once.

5:14

Providence supports batch importing and exporting.

5:17

You can bring in massive data sets, clean them up,

5:19

and then quickly output reports, like inventory lists

5:22

as PDFs or spreadsheets.

5:23

It's built to scale.
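A batch import step like the one described might look roughly like this sketch: read many rows at once, validate minimally, and separate clean records from an error report. The column names are invented and this is not CollectiveAccess's actual import-mapping syntax, just the general pattern.

```python
import csv, io

# Sketch of batch import: validate rows in bulk, collecting clean records
# plus an error report. Column names are hypothetical illustrations.

def batch_import(csv_text, required=("idno", "title")):
    clean, errors = [], []
    # start=2: row 1 of the file is the header line
    for lineno, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        missing = [col for col in required if not (row.get(col) or "").strip()]
        if missing:
            errors.append((lineno, f"missing: {', '.join(missing)}"))
        else:
            clean.append({k: v.strip() for k, v in row.items()})
    return clean, errors

data = "idno,title,date\n2024.001,Oral history tape,1972\n2024.002,,1973\n"
records, problems = batch_import(data)
print(len(records), len(problems))  # 1 1
```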

5:25

So Providence is the highly granular professional tool

5:29

for data integrity.

5:30

Right.

5:31

But the public never sees any of that.

5:33

And all that effort is useless if people

5:35

can't find the stuff.

5:35

Absolutely.

5:36

So now we shift from the structured back end

5:38

to the elegant front end, Pawtucket II.

5:41

This is where the collection comes alive.

5:43

OK, so Pawtucket II takes that data

5:45

and focuses on the user experience.

5:47

You can style it to match a museum's brand.

5:50

But the real power is in the discovery tools, right?

5:52

That's right.

5:53

It goes way beyond a simple keyword search.

5:55

Pawtucket II lets you use customizable facets

5:58

and filters.

5:59

A facet is what for a non-technical user?

6:02

It's basically a preset category.

6:04

So instead of you having to search

6:05

for blue dress Victorian, you can just

6:08

filter by object type, dress, then era, Victorian,

6:14

and then color blue.

6:15

It lets you explore the collection,

6:17

even if you don't know exactly what you're looking for.
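The blue-Victorian-dress example can be sketched in a few lines of Python. The objects and field names are invented; the real system drives this from its database, but the mechanics of faceting are the same: count the values for the sidebar, then intersect the user's selections.

```python
# Sketch of faceted filtering over an illustrative mini-collection.
objects = [
    {"type": "dress", "era": "Victorian", "color": "blue"},
    {"type": "dress", "era": "Victorian", "color": "red"},
    {"type": "hat",   "era": "Edwardian", "color": "blue"},
]

def facet_counts(items, field):
    """What a facet sidebar shows: each value and how many items carry it."""
    counts = {}
    for o in items:
        counts[o[field]] = counts.get(o[field], 0) + 1
    return counts

def apply_facets(items, **facets):
    """Keep only items matching every selected facet value."""
    return [o for o in items if all(o.get(f) == v for f, v in facets.items())]

print(facet_counts(objects, "type"))                # {'dress': 2, 'hat': 1}
print(apply_facets(objects, type="dress", color="blue"))
```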

6:19

And it can turn static data into something interactive.

6:22

I've seen examples with maps and timelines.

6:25

How does it do that?

6:26

It leverages the structured data from Providence.

6:29

So if an archivist entered a date and a location

6:31

for an object, Pawtucket II can visualize it.

6:35

A timeline takes that static date field

6:37

and turns it into a navigable journey.

6:40

A map takes a GPS coordinate and shows you

6:42

where an object is from.

6:43

It completely changes the user's relationship

6:45

with the collection.

6:45

You're not just viewing, you're exploring.
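Here's a tiny sketch of that leap from a static date field to a navigable timeline: grouping records into decade buckets. The records are invented, and Pawtucket 2's actual timelines are configured rather than hand-coded, but this is the underlying transformation.

```python
# Sketch: turning structured date fields into timeline buckets (by decade).
records = [
    {"title": "Mill photograph", "year": 1893},
    {"title": "Strike leaflet",  "year": 1912},
    {"title": "Payroll ledger",  "year": 1915},
]

def by_decade(items):
    timeline = {}
    for r in items:
        decade = (r["year"] // 10) * 10   # 1912 -> 1910
        timeline.setdefault(decade, []).append(r["title"])
    return dict(sorted(timeline.items()))

print(by_decade(records))
# {1890: ['Mill photograph'], 1910: ['Strike leaflet', 'Payroll ledger']}
```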

6:48

And there are also features for public engagement, things

6:50

that let the community contribute.

6:51

Precisely.

6:52

The institution can decide to turn on public commenting

6:55

or tagging or even rating items.

6:57

For something like a local history archive,

6:59

that's invaluable.

7:01

You can have the community help identify people in old photos.

7:04

It turns the public into research assistants.

7:06

In a way, yes.

7:07

And this is not theoretical software.

7:09

It's running some really high-profile digital exhibits.

7:12

Oh, absolutely.

7:13

We're talking about projects like the Crest Digital

7:15

Archive at the National Gallery of Art or the Chicago Film

7:17

Archive.

7:18

They're all using this framework.

7:19

OK, so we've established the core duality, Providence,

7:23

the rock solid engine, and Pawtucket II,

7:25

the engaging public face.

7:28

Now let's get to the new stuff.

7:30

It was a long wait since version 1.7.

7:32

A very long wait.

7:34

But CollectiveAccess version 2.0 brings

7:36

some huge leaps forward.

7:38

The focus seems to be on stability, connectivity,

7:42

and this is the big one, AI integration.

7:44

Yeah, version 2.0 is basically the developers saying, OK,

7:47

we're ready for the next decade.

7:49

First, they just future-proof the platform.

7:51

It's now compatible with modern server tech,

7:54

like PHP 8.2 and 8.3.

7:56

Not a flashy feature, but essential for security

7:59

and performance.

8:00

Vital.

8:00

But the real user benefits are somewhere else.

8:02

One key improvement is in historical tracking.

8:05

You mean provenance, knowing where an object has been.

8:08

Exactly.

8:09

V2.0 has a new, much more flexible system

8:12

for tracking change over time.

8:14

It's not just the creation date anymore.

8:16

It's about granular history, location changes,

8:19

ownership shifts, you name it.

8:21

OK, so better accuracy.

8:23

But the features that got everyone talking

8:25

are about automation, the AI stuff.

8:27

I was genuinely surprised when I read

8:29

about the transcription feature.

8:30

It's a massive leap in efficiency.

8:33

V2.0 uses machine learning in two huge areas.

8:36

First, automated translation.

8:38

The system supports services like DeepL and Google

8:41

Translate, so staff using Providence

8:43

can work in their own language.

8:44

But that immediately lowers the barrier

8:46

to entry for global institutions.

8:47

It does.

8:48

But the truly transformative feature

8:50

is the automated transcription.

8:51

It uses models like OpenAI Whisper

8:54

to automatically transcribe audio and video materials

8:57

right inside the workflow.

8:58

Just think about that.

8:59

An archivist gets 50 years of oral history interviews,

9:01

manually transcribing that could take months, maybe years.

9:05

And this integration automates it.

9:07

It turns all that spoken word into searchable text

9:10

almost instantly.

9:11

It redefines what's possible for searching media.

9:14

So much of that content was basically locked away

9:16

from keyword searches before.

9:17

This unlocks it.
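To see why transcription "unlocks" media for search, consider what you can do once spoken word becomes timestamped text: build an index from words back to moments in the recording. The segments below are invented stand-ins for what a speech-to-text model such as Whisper would produce; the indexing itself is just a simple inverted index.

```python
import re

# Sketch: invented transcript segments (start_seconds, text), as a
# speech-to-text model might emit them.
segments = [
    (12.0, "we moved to the mill town in nineteen twenty"),
    (47.5, "the mill closed after the flood"),
    (90.2, "my mother kept the union ledger"),
]

def build_index(segs):
    """Map each word to the timestamps where it is spoken."""
    index = {}
    for start, text in segs:
        for word in re.findall(r"[a-z']+", text.lower()):
            index.setdefault(word, []).append(start)
    return index

index = build_index(segments)
print(index["mill"])   # [12.0, 47.5] -> jump straight to those moments
```

A researcher searching "mill" now lands at second 12 and second 47 of the audio instead of listening to the whole tape.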

9:18

Wow.

9:19

So that's efficiency through AI.

9:21

What about connectivity?

9:23

How does v2.0 talk to other systems?

9:26

They completely modernized it.

9:27

They introduced a new GraphQL-based API.

9:31

This is a huge deal for any developer trying

9:33

to build custom tools that need to talk to the Providence data.

9:36

For listeners who aren't developers,

9:38

why is switching to GraphQL so important?

9:40

Well, a traditional API often makes

9:42

you download a big packet of data,

9:44

even if you only need one tiny piece of it.

9:46

GraphQL is about precision.

9:48

A developer can ask for only the specific fields they need.

9:51

So it's faster, less resource intensive.

9:53

Much faster.

9:54

For custom visualizations, mobile apps,

9:56

any kind of integration, it's a major upgrade.
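The "ask for only the fields you need" idea can be sketched as the request body a client would send. The object type and field names here are hypothetical, not CollectiveAccess's actual GraphQL schema; the point is that the query document itself names exactly what comes back.

```python
import json

# Sketch: building a GraphQL request body that selects only named fields.
# Endpoint, type, and field names are hypothetical illustrations.

def graphql_body(object_id, fields):
    query = "query { object(id: %d) { %s } }" % (object_id, " ".join(fields))
    return json.dumps({"query": query})

# A mobile app that only needs a title and thumbnail asks for just that,
# instead of downloading the full record:
body = graphql_body(42, ["title", "thumbnail_url"])
print(body)
```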

9:59

And there's also a new support for external media, right?

10:01

Yes, another efficiency thing.

10:03

V2.0 can now reference content that

10:05

lives somewhere else: on YouTube, Vimeo, the Internet Archive.

10:08

So the museum doesn't have to eat up all its server space

10:11

storing a huge video file if it's already

10:13

hosted somewhere else.

10:13

Exactly, they just catalog the link.

10:15

They keep the metadata integrity in Providence,

10:17

but save a fortune on storage costs.

10:20

And finally, digital preservation.

10:22

Long term integrity is everything.

10:24

Right, and V2.0 has a new export system

10:27

that uses something called configurable BagIt packages.

10:31

BagIt packages.

10:32

Think of them like standardized tamper-proof digital shipping

10:37

containers.

10:38

They bundle up all the data and the metadata

10:40

in a way that can be validated for integrity checks years

10:44

or decades down the line.

10:45

It's the gold standard.
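The "tamper-proof shipping container" metaphor maps directly onto the BagIt layout (RFC 8493): payload files live under `data/`, and a manifest records a checksum for each so integrity can be re-verified decades later. This is a minimal stdlib illustration of that layout, not CollectiveAccess's actual exporter.

```python
import hashlib
import pathlib
import tempfile

def make_bag(bag_dir, files):
    """Write a minimal BagIt-style bag: data/ payload plus a SHA-256 manifest."""
    bag = pathlib.Path(bag_dir)
    (bag / "data").mkdir(parents=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    lines = []
    for name, content in files.items():
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}")
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")
    return bag

with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(pathlib.Path(tmp) / "bag1",
                   {"letter_1910.txt": b"Dear Mayor..."})
    manifest = (bag / "manifest-sha256.txt").read_text()
    print(manifest.split()[1])  # data/letter_1910.txt
```

Years later, a validator only has to re-hash `data/` and compare against the manifest to prove nothing has been corrupted or altered.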

10:46

So it's really clear.

10:48

Version 2.0 is all about automation, next gen

10:51

connectivity, and making these massive data

10:54

sets manageable and secure.

10:56

Absolutely, CollectiveAccess is now, I think,

10:58

firmly positioned as a cutting edge open source powerhouse.

11:01

You have the surgical precision of Providence for management

11:04

and the dynamic presentation of Pawtucket 2 for the public.

11:07

These updates, especially the AI and the new API,

11:10

they just solidify its role for the future.

11:13

So the practical takeaway for you, listening,

11:15

is that you now understand that two-part architecture.

11:18

When you see a beautiful online museum

11:21

collection with interactive maps and timelines,

11:23

you know there's a good chance software like this

11:25

is the engine behind it all, letting

11:27

them present their world without compromising their data.

11:30

But this does raise an important question, something

11:32

for you to mull over as these tools become standard.

11:36

Given the new support for automated transcription

11:39

and translation, what are the ethical or contextual challenges

11:43

that come up?

11:44

When we use machine learning to interpret or tag

11:47

historical records, can an algorithm

11:49

truly capture the necessary historical nuance or context?

11:54

Can a machine understand what's not being said in an interview?

11:57

Exactly.

11:57

That's something every institution

11:59

using these amazing new tools is going to have to grapple with.

12:02

A fascinating and really crucial point to end on.

12:05

Thank you for joining us for this deep dive

12:06

into CollectiveAccess.

12:08

And remember, this deep dive was supported by Safe Server.

12:11

If you are looking for reliable hosting for complex software

12:14

or support for your digital transformation projects,

12:16

you can find more information at www.safeserver.de.

12:21

Until then, keep digging.
