Today's Deep-Dive: CollectiveAccess
Ep. 356

Episode description

In this episode of Deep Dive, we explore the powerful open-source platform quietly powering digital collections at museums, archives, and research institutions around the world: CollectiveAccess.

How do major cultural institutions manage vast and complex collections that include everything from historical manuscripts and paintings to audio recordings, films, and even high-resolution 3D scans of artifacts? The answer lies in a sophisticated digital infrastructure designed for scale, flexibility, and long-term preservation.

We break down the core architecture of CollectiveAccess and explain its unique two-part system: Providence, the robust backend where archivists catalog and manage collection data, and Pawtucket2, the public-facing interface that transforms structured metadata into engaging online exhibits.

Along the way, we unpack how the platform supports diverse metadata standards, handles massive media collections, and enables discovery through powerful tools like faceted search, interactive maps, and timelines. We also look at what’s new in CollectiveAccess 2.0 — from modernized APIs using GraphQL to AI-powered transcription and translation tools that are reshaping how institutions unlock audio and video archives.

Whether you’re a technologist, archivist, digital humanist, or simply curious about how the digital backbone of modern museums works, this episode offers a behind-the-scenes look at the systems preserving cultural heritage in the digital age.

Tune in as we dive deep into the architecture, innovation, and future of digital collections.

Gain digital sovereignty now and save costs

Let’s have a look at your digital challenges together. What tools are you currently using? Are your processes optimal? How is the state of backups and security updates?

Digital sovereignty is easily achieved with open-source software (which usually costs far less, too). Our division SafeServer offers hosting, operation, and maintenance for countless Free and Open Source tools.

Try it now!

Download transcript (.srt)
0:00

Welcome back to The Deep Dive.

0:01

Today we're tackling something huge

0:04

and often totally invisible.

0:06

Right.

0:07

How do the world's biggest cultural institutions,

0:10

we're talking major museums, huge archives, research

0:14

universities, how do they manage these just

0:17

colossal collections?

0:18

And we're not talking about a few spreadsheets here.

0:21

These collections have everything.

0:23

Digitized texts, paintings, high res video, audio histories,

0:26

and now even complex 3D scans of objects.

0:29

That's an incredible mix of data.

0:30

Exactly.

0:31

And handling all that, keeping it safe,

0:33

and making it accessible to people,

0:35

that needs a really industrial strength solution.

0:38

And that's our focus today.

0:39

We're looking at the open source software

0:41

that hundreds of these institutions

0:42

trust to solve this exact problem: CollectiveAccess.

0:45

We're going to get into its core architecture, which

0:47

has this powerful duality: Providence and Pawtucket 2,

0:51

and we'll look especially at the big modernizing updates

0:53

in the new version 2.0.

0:55

So we've been unpacking the official documentation,

0:57

deep diving into the GitHub repos,

0:59

and really trying to understand the system.

1:01

And our mission here is to give you

1:03

a clear structural understanding of it.

1:06

We want you to see past the pretty website

1:08

and get how the engine behind modern archives actually works.

1:12

OK, so let's start at the very beginning, the why.

1:15

CollectiveAccess, or CA.

1:16

It started back in 2003.

1:18

Right, and it wasn't born to be a commercial product.

1:20

It was a direct response to a massive gap in the market.

1:24

What was the gap?

1:25

The tools that existed were just prohibitively expensive.

1:29

They were proprietary, they were rigid.

1:31

Institutions had these really complex needs,

1:33

but tight budgets.

1:35

So CollectiveAccess came in as a free, open-source alternative.

1:39

Exactly.

1:40

It's under the GPL 3.0 license.

1:42

That meant anyone, a small local history group

1:45

or a huge national gallery, could just download it and use it.

1:48

No massive licensing fees.

1:50

And that accessibility is what let

1:51

it grow into what it is today.

1:53

Which brings us to, I think, the most fundamental thing

1:56

you have to understand about it.

1:57

The core duality.

1:58

The duality, yeah.

1:59

I like to think of it like a restaurant.

2:00

You have the kitchen in the back, super organized complex

2:03

where all the prep work happens.

2:05

And then you have the dining room out front.

2:08

Beautiful, simplified, where the guests actually

2:10

interact with the final product.

2:12

You need that separation.

2:14

So the kitchen, the back end, where the hard work is done,

2:16

that's called Providence.

2:17

Yes.

2:18

That's the data management and cataloging application.

2:21

That's where the curators and archivists spend their days,

2:24

meticulously documenting every single item.

2:27

And the dining room, the front end, is Pawtucket 2.

2:30

Pawtucket 2, that's the public interface.

2:32

It's the website layer that takes all that organized data

2:35

from Providence and makes it look good,

2:37

makes it interactive for researchers, and well, for you.

2:40

And separating them isn't just for convenience, is it?

2:43

It's a really core architectural choice.

2:46

Why is that so important for an archive?

2:48

It's about sanity, really, and security.

2:51

If you try to manage everything in one single application,

2:54

you risk compromising your complex internal data standards

2:58

just to make a slick website.

2:59

Ah, I see.

3:00

This duality separates the incredibly detailed

3:03

internal organization from the user-friendly public discovery.

3:07

It means a web designer can go and update Pawtucket 2

3:09

without ever, ever risking the archival integrity of the data

3:13

that's locked down in Providence.

3:15

That makes perfect sense.

3:17

So before we dive deeper into Providence,

3:19

let's just take a moment to thank the supporter of this deep dive.

3:22

Good idea.

3:23

SafeServer handles the hosting of exactly this kind

3:25

of complex, high-demand software,

3:27

and they support institutions with their digital transformation.

3:31

So if you're looking for reliable hosting

3:33

for something powerful like CollectiveAccess,

3:36

you can find more info at www.safeserver.de.

3:40

OK, let's go into that engine room.

3:42

Providence, this is the app that's

3:44

built to solve what I call the archivist's nightmare.

3:47

Handling extreme data diversity.

3:49

Exactly.

3:50

Yeah.

3:50

The number one feature of Providence

3:52

is just how configurable it is.

3:54

Unlike a lot of off-the-shelf software,

3:55

it's not built for just one type of collection.

3:57

It supports museums, archives, and research contexts

4:00

all at once.

4:01

When you say configurable, what does that actually

4:04

mean for the person cataloging something?

4:06

It means they aren't forced to sort of shoehorn their collection

4:10

into someone else's idea of what a field should be.

4:13

Providence lets you build catalogs

4:15

that conform to recognized standards, like Dublin Core,

4:18

or you can customize totally new fields.

4:21

And you can do that without a programmer?

4:24

Without writing custom code, yes.

4:25

The system adapts to the institution,

4:27

not the other way around, which is vital,

4:29

because a history museum's needs are just

4:31

wildly different from a natural science museum's.
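To make "configurable" concrete, here's a minimal Python sketch of the idea: a catalog profile that mixes standard Dublin Core elements with institution-specific custom fields. The field names and the `make_record` helper are hypothetical illustrations; CollectiveAccess's real installation profiles are XML documents, not code like this.

```python
# Sketch: a catalog "profile" mixing Dublin Core elements with custom
# fields. Field names and helper are hypothetical, for illustration only.

DUBLIN_CORE = {"title", "creator", "date", "format", "subject", "description"}

def make_record(profile_fields, **values):
    """Build a record, rejecting values for fields the profile doesn't define."""
    unknown = set(values) - set(profile_fields)
    if unknown:
        raise ValueError(f"fields not in profile: {sorted(unknown)}")
    return {field: values.get(field) for field in profile_fields}

# A history museum's profile: Dublin Core plus two custom fields.
profile = sorted(DUBLIN_CORE) + ["accession_number", "conservation_notes"]

record = make_record(
    profile,
    title="Letter from the mayor, 1910",
    creator="Unknown",
    format="application/pdf",
    accession_number="1998.014.003",
)
print(record["accession_number"])  # custom field sits alongside DC elements
```

The point is the shape of the system: the institution defines the profile once, and every record is validated against it, rather than the software dictating the fields.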

4:33

And the media it handles.

4:35

It's really impressive.

4:36

I mean, documents, images, audio, video, that's expected.

4:40

But it also handles 3D models.

4:41

And that detail tells you how modern the system is.

4:44

That's the key technical achievement, really.

4:46

How so?

4:47

Because managing a huge 3D scan of a fragile artifact, which

4:51

is basically a cloud of data points,

4:53

in the same system as a simple PDF of a letter from 1910,

4:57

that requires immense flexibility on the back end.

5:00

It treats them both as collection items.

5:02

So it's not just about the variety of data.

5:04

It's also got to handle the day-to-day work

5:05

of running an institution.

5:07

What kind of workflow features are built in?

5:09

It's all about bulk processing.

5:11

An archive might acquire 10,000 records at once.

5:14

Providence supports batch importing and exporting.

5:17

You can bring in massive data sets, clean them up,

5:19

and then quickly output reports, like inventory lists

5:22

as PDFs or spreadsheets.

5:23

It's built to scale.
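A batch import step like the one described might look roughly like this sketch: read many rows at once, validate minimally, and separate clean records from an error report. The column names are invented and this is not CollectiveAccess's actual import-mapping syntax, just the general pattern.

```python
import csv, io

# Sketch of batch import: validate rows in bulk, collecting clean records
# plus an error report. Column names are hypothetical illustrations.

def batch_import(csv_text, required=("idno", "title")):
    clean, errors = [], []
    # start=2: row 1 of the file is the header line
    for lineno, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        missing = [col for col in required if not (row.get(col) or "").strip()]
        if missing:
            errors.append((lineno, f"missing: {', '.join(missing)}"))
        else:
            clean.append({k: v.strip() for k, v in row.items()})
    return clean, errors

data = "idno,title,date\n2024.001,Oral history tape,1972\n2024.002,,1973\n"
records, problems = batch_import(data)
print(len(records), len(problems))  # 1 1
```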

5:25

So Providence is the highly granular professional tool

5:29

for data integrity.

5:30

Right.

5:31

But the public never sees any of that.

5:33

And all that effort is useless if people

5:35

can't find the stuff.

5:35

Absolutely.

5:36

So now we shift from the structured back end

5:38

to the elegant front end, Pawtucket II.

5:41

This is where the collection comes alive.

5:43

OK, so Pawtucket II takes that data

5:45

and focuses on the user experience.

5:47

You can style it to match a museum's brand.

5:50

But the real power is in the discovery tools, right?

5:52

That's right.

5:53

It goes way beyond a simple keyword search.

5:55

Pawtucket II lets you use customizable facets

5:58

and filters.

5:59

A facet is what for a non-technical user?

6:02

It's basically a preset category.

6:04

So instead of you having to search

6:05

for blue dress Victorian, you can just

6:08

filter by object type, dress, then era, Victorian,

6:14

and then color blue.

6:15

It lets you explore the collection,

6:17

even if you don't know exactly what you're looking for.
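The blue-Victorian-dress example can be sketched in a few lines of Python. The objects and field names are invented; the real system drives this from its database, but the mechanics of faceting are the same: count the values for the sidebar, then intersect the user's selections.

```python
# Sketch of faceted filtering over an illustrative mini-collection.
objects = [
    {"type": "dress", "era": "Victorian", "color": "blue"},
    {"type": "dress", "era": "Victorian", "color": "red"},
    {"type": "hat",   "era": "Edwardian", "color": "blue"},
]

def facet_counts(items, field):
    """What a facet sidebar shows: each value and how many items carry it."""
    counts = {}
    for o in items:
        counts[o[field]] = counts.get(o[field], 0) + 1
    return counts

def apply_facets(items, **facets):
    """Keep only items matching every selected facet value."""
    return [o for o in items if all(o.get(f) == v for f, v in facets.items())]

print(facet_counts(objects, "type"))                # {'dress': 2, 'hat': 1}
print(apply_facets(objects, type="dress", color="blue"))
```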

6:19

And it can turn static data into something interactive.

6:22

I've seen examples with maps and timelines.

6:25

How does it do that?

6:26

It leverages the structured data from Providence.

6:29

So if an archivist entered a date and a location

6:31

for an object, Pawtucket II can visualize it.

6:35

A timeline takes that static date field

6:37

and turns it into a navigable journey.

6:40

A map takes a GPS coordinate and shows you

6:42

where an object is from.

6:43

It completely changes the user's relationship

6:45

with the collection.

6:45

You're not just viewing, you're exploring.
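Here's a tiny sketch of that leap from a static date field to a navigable timeline: grouping records into decade buckets. The records are invented, and Pawtucket 2's actual timelines are configured rather than hand-coded, but this is the underlying transformation.

```python
# Sketch: turning structured date fields into timeline buckets (by decade).
records = [
    {"title": "Mill photograph", "year": 1893},
    {"title": "Strike leaflet",  "year": 1912},
    {"title": "Payroll ledger",  "year": 1915},
]

def by_decade(items):
    timeline = {}
    for r in items:
        decade = (r["year"] // 10) * 10   # 1912 -> 1910
        timeline.setdefault(decade, []).append(r["title"])
    return dict(sorted(timeline.items()))

print(by_decade(records))
# {1890: ['Mill photograph'], 1910: ['Strike leaflet', 'Payroll ledger']}
```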

6:48

And there are also features for public engagement, things

6:50

that let the community contribute.

6:51

Precisely.

6:52

The institution can decide to turn on public commenting

6:55

or tagging or even rating items.

6:57

For something like a local history archive,

6:59

that's invaluable.

7:01

You can have the community help identify people in old photos.

7:04

It turns the public into research assistants.

7:06

In a way, yes.

7:07

And this is not theoretical software.

7:09

It's running some really high-profile digital exhibits.

7:12

Oh, absolutely.

7:13

We're talking about projects like the Crest Digital

7:15

Archive at the National Gallery of Art or the Chicago Film

7:17

Archive.

7:18

They're all using this framework.

7:19

OK, so we've established the core duality, Providence,

7:23

the rock solid engine, and Pawtucket II,

7:25

the engaging public face.

7:28

Now let's get to the new stuff.

7:30

It was a long wait since version 1.7.

7:32

A very long wait.

7:34

But CollectiveAccess version 2.0 brings

7:36

some huge leaps forward.

7:38

The focus seems to be on stability, connectivity,

7:42

and this is the big one, AI integration.

7:44

Yeah, version 2.0 is basically the developers saying, OK,

7:47

we're ready for the next decade.

7:49

First, they just future-proof the platform.

7:51

It's now compatible with modern server tech,

7:54

like PHP 8.2 and 8.3.

7:56

Not a flashy feature, but essential for security

7:59

and performance.

8:00

Vital.

8:00

But the real user benefits are somewhere else.

8:02

One key improvement is in historical tracking.

8:05

You mean provenance, knowing where an object has been.

8:08

Exactly.

8:09

V2.0 has a new, much more flexible system

8:12

for tracking change over time.

8:14

It's not just the creation date anymore.

8:16

It's about granular history, location changes,

8:19

ownership shifts, you name it.

8:21

OK, so better accuracy.

8:23

But the features that got everyone talking

8:25

are about automation, the AI stuff.

8:27

I was genuinely surprised when I read

8:29

about the transcription feature.

8:30

It's a massive leap in efficiency.

8:33

V2.0 uses machine learning in two huge areas.

8:36

First, automated translation.

8:38

The system supports services like DeepL and Google

8:41

Translate, so staff using Providence

8:43

can work in their own language.

8:44

But that immediately lowers the barrier

8:46

to entry for global institutions.

8:47

It does.

8:48

But the truly transformative feature

8:50

is the automated transcription.

8:51

It uses models like OpenAI Whisper

8:54

to automatically transcribe audio and video materials

8:57

right inside the workflow.

8:58

Just think about that.

8:59

An archivist gets 50 years of oral history interviews,

9:01

manually transcribing that could take months, maybe years.

9:05

And this integration automates it.

9:07

It turns all that spoken word into searchable text

9:10

almost instantly.

9:11

It redefines what's possible for searching media.

9:14

So much of that content was basically locked away

9:16

from keyword searches before.

9:17

This unlocks it.
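To see why transcription "unlocks" media for search, consider what you can do once spoken word becomes timestamped text: build an index from words back to moments in the recording. The segments below are invented stand-ins for what a speech-to-text model such as Whisper would produce; the indexing itself is just a simple inverted index.

```python
import re

# Sketch: invented transcript segments (start_seconds, text), as a
# speech-to-text model might emit them.
segments = [
    (12.0, "we moved to the mill town in nineteen twenty"),
    (47.5, "the mill closed after the flood"),
    (90.2, "my mother kept the union ledger"),
]

def build_index(segs):
    """Map each word to the timestamps where it is spoken."""
    index = {}
    for start, text in segs:
        for word in re.findall(r"[a-z']+", text.lower()):
            index.setdefault(word, []).append(start)
    return index

index = build_index(segments)
print(index["mill"])   # [12.0, 47.5] -> jump straight to those moments
```

A researcher searching "mill" now lands at second 12 and second 47 of the audio instead of listening to the whole tape.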

9:18

Wow.

9:19

So that's efficiency through AI.

9:21

What about connectivity?

9:23

How does v2.0 talk to other systems?

9:26

They completely modernized it.

9:27

They introduced a new GraphQL-based API.

9:31

This is a huge deal for any developer trying

9:33

to build custom tools that need to talk to the Providence data.

9:36

For listeners who aren't developers,

9:38

why is switching to GraphQL so important?

9:40

Well, a traditional API often makes

9:42

you download a big packet of data,

9:44

even if you only need one tiny piece of it.

9:46

GraphQL is about precision.

9:48

A developer can ask for only the specific fields they need.

9:51

So it's faster, less resource intensive.

9:53

Much faster.

9:54

For custom visualizations, mobile apps,

9:56

any kind of integration, it's a major upgrade.
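The "ask for only the fields you need" idea can be sketched as the request body a client would send. The object type and field names here are hypothetical, not CollectiveAccess's actual GraphQL schema; the point is that the query document itself names exactly what comes back.

```python
import json

# Sketch: building a GraphQL request body that selects only named fields.
# Endpoint, type, and field names are hypothetical illustrations.

def graphql_body(object_id, fields):
    query = "query { object(id: %d) { %s } }" % (object_id, " ".join(fields))
    return json.dumps({"query": query})

# A mobile app that only needs a title and thumbnail asks for just that,
# instead of downloading the full record:
body = graphql_body(42, ["title", "thumbnail_url"])
print(body)
```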

9:59

And there's also a new support for external media, right?

10:01

Yes, another efficiency thing.

10:03

V2.0 can now reference content that

10:05

lives somewhere else: on YouTube, Vimeo, the Internet Archive.

10:08

So the museum doesn't have to eat up all its server space

10:11

storing a huge video file if it's already

10:13

hosted somewhere else.

10:13

Exactly, they just catalog the link.

10:15

They keep the metadata integrity in Providence,

10:17

but save a fortune on storage costs.

10:20

And finally, digital preservation.

10:22

Long term integrity is everything.

10:24

Right, and V2.0 has a new export system

10:27

that uses something called configurable BagIt packages.

10:31

BagIt packages.

10:32

Think of them like standardized tamper-proof digital shipping

10:37

containers.

10:38

They bundle up all the data and the metadata

10:40

in a way that can be validated for integrity checks years

10:44

or decades down the line.

10:45

It's the gold standard.
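The "tamper-proof shipping container" metaphor maps directly onto the BagIt layout (RFC 8493): payload files live under `data/`, and a manifest records a checksum for each so integrity can be re-verified decades later. This is a minimal stdlib illustration of that layout, not CollectiveAccess's actual exporter.

```python
import hashlib
import pathlib
import tempfile

def make_bag(bag_dir, files):
    """Write a minimal BagIt-style bag: data/ payload plus a SHA-256 manifest."""
    bag = pathlib.Path(bag_dir)
    (bag / "data").mkdir(parents=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    lines = []
    for name, content in files.items():
        (bag / "data" / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        lines.append(f"{digest}  data/{name}")
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")
    return bag

with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(pathlib.Path(tmp) / "bag1",
                   {"letter_1910.txt": b"Dear Mayor..."})
    manifest = (bag / "manifest-sha256.txt").read_text()
    print(manifest.split()[1])  # data/letter_1910.txt
```

Years later, a validator only has to re-hash `data/` and compare against the manifest to prove nothing has been corrupted or altered.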

10:46

So it's really clear.

10:48

Version 2.0 is all about automation, next gen

10:51

connectivity, and making these massive data

10:54

sets manageable and secure.

10:56

Absolutely, CollectiveAccess is now, I think,

10:58

firmly positioned as a cutting edge open source powerhouse.

11:01

You have the surgical precision of Providence for management

11:04

and the dynamic presentation of Pawtucket 2 for the public.

11:07

These updates, especially the AI and the new API,

11:10

they just solidify its role for the future.

11:13

So the practical takeaway for you, listening,

11:15

is that you now understand that two-part architecture.

11:18

When you see a beautiful online museum

11:21

collection with interactive maps and timelines,

11:23

you know there's a good chance software like this

11:25

is the engine behind it all, letting

11:27

them present their world without compromising their data.

11:30

But this does raise an important question, something

11:32

for you to mull over as these tools become standard.

11:36

Given the new support for automated transcription

11:39

and translation, what are the ethical or contextual challenges

11:43

that come up?

11:44

When we use machine learning to interpret or tag

11:47

historical records, can an algorithm

11:49

truly capture the necessary historical nuance or context?

11:54

Can a machine understand what's not being said in an interview?

11:57

Exactly.

11:57

That's something every institution

11:59

using these amazing new tools is going to have to grapple with.

12:02

A fascinating and really crucial point to end on.

12:05

Thank you for joining us for this deep dive

12:06

into CollectiveAccess.

12:08

And remember, this deep dive was supported by Safe Server.

12:11

If you are looking for reliable hosting for complex software

12:14

or support for your digital transformation projects,

12:16

you can find more information at www.safeserver.de.

12:21

Until then, keep digging.
