Missing: billions of views (+ weekly update)

So I'm launching Challenge 3 to find them

Louis Barclay

Feb 23, 2024

I can’t stop obsessing over one thing from my piece about (lack of) basic TikTok data:

We have no idea what the most viewed content on social media is.

OK, yes — we do know things like the very top videos of all time.

But a list of, say, all TikTok videos above 10 million views? No way.

This is despite the fact that these videos are:

Public
Collectively viewed by billions of people
Responsible for the vast majority of activity on the platform

The final point is not just my own conjecture.

A brilliant paper published in December by researchers at UMass Amherst collected a random sample of YouTube videos. Among other findings, they were able to understand roughly how top-heavy YouTube is, in other words how much activity comes from just the top videos.

The answer? 94% of views came from the 4% of videos in their sample with more than 10,000 views.

And 51% of views came from the most viewed 0.16% of videos!
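To make that kind of figure concrete, here is a minimal sketch of how you could measure top-heaviness from any sample of view counts. It is not the researchers' method or code: the function names and the tiny `sample` list are made up for illustration, and a real analysis would need a properly random sample like the one in the paper.

```python
def share_of_views_above(view_counts, threshold):
    """Return (fraction of videos, fraction of views) for videos
    with more than `threshold` views."""
    total_videos = len(view_counts)
    total_views = sum(view_counts)
    if total_videos == 0 or total_views == 0:
        return 0.0, 0.0
    top = [v for v in view_counts if v > threshold]
    return len(top) / total_videos, sum(top) / total_views


def share_of_views_top_fraction(view_counts, fraction):
    """Return the fraction of all views that come from the most viewed
    `fraction` of videos (e.g. 0.0016 for the top 0.16%)."""
    total_views = sum(view_counts)
    if not view_counts or total_views == 0:
        return 0.0
    ranked = sorted(view_counts, reverse=True)
    # Always count at least one video, even for very small fractions.
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / total_views


# Hypothetical sample: three popular videos and 97 with 100 views each.
sample = [1_000_000, 50_000, 20_000] + [100] * 97

video_share, view_share = share_of_views_above(sample, 10_000)
top_share = share_of_views_top_fraction(sample, 0.01)

print(f"{video_share:.1%} of videos account for {view_share:.1%} of views")
print(f"The top 1% of videos account for {top_share:.1%} of views")
```

The catch, of course, is that this only works if you have the view counts in the first place, which is exactly the data we are missing.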

This power law distribution is not surprising: by definition, the most viewed videos account for a large share of the views on any platform.

What is surprising is that we aren’t banging on the doors of big tech, demanding that they tell us what these top 4%, or even top 0.16%, are, so we can understand the lion’s share of the information our societies are consuming.

As Ryan McGrady, the paper’s lead author, wrote in The Atlantic: “We’ve come to accept that the most basic information about the platforms organizing our lives is unavailable.”

The best way I can sum up the absurdity of this situation is as follows:

There are billions of views that no one knows about.

Billions of views, missing.

Everyone knows about some of the most viewed videos. But no one knows about all of the most viewed videos. There is no bird’s eye view.

This matters because, as I wrote previously, knowing the most viewed content is a shortcut to understanding what the algorithms on these platforms are doing, and therefore how they are influencing the information our societies consume.

Yes, it’s important to see the inner workings of the algorithm. But why not start somewhere more accessible, looking at the algorithm’s biggest consequences in the world — which is to say, the most viewed content?

I just can’t get this question off my mind, so I’ve chosen it as Challenge 3.

And I’m not going to limit it to TikTok — I’m going to start a campaign to get all large platforms to reveal their most viewed public content. Let’s call it the Missing Billions challenge.

It’s a big undertaking, but I have a good idea of where to start.

Into the weeds of EU law

I’ve spent a lot of time working on the question of how to get data from platforms, as an open source developer running a study on Facebook, as a Resident Fellow at Reset considering how to study disinformation, and through the articles I’ve written so far on this blog about TikTok, including one about the different methods I could use to hunt this data down.

Last year, along with 66 other civil society organizations and researchers, I signed an open letter organized by Mozilla, which made recommendations to the European Commission around how platforms should provide data under the Digital Services Act (DSA) article 40, paragraph 12.

This wonderful paragraph of the DSA says that platforms should ‘give access without undue delay to data’ provided that it’s ‘publicly accessible in their online interface’.

The letter I signed is an excellent set of recommendations for, generally, what platforms must do.

But now that the DSA has come into force, I believe it’s time for another ask — specifically, for data that platforms must provide under 40.12, immediately.

You’re probably thinking I’d like this data to include some version of most viewed content. Yes, I would — but I’d like to approach the whole thing a little more strategically, with the data demand that is most likely to work, and that can therefore act as a wedge to unlock future data access requests.

So what does that look like? A campaign:

To send a second open letter to the European Commission
Requesting a specific, limited set of public data from platforms
Which is at the intersection of:
- 1) Least controversial
- 2) Most obviously ‘public’, so that platforms can’t argue
- 3) Most useful for research
- 4) Hardest for platforms to push back on
Which can be agreed on, and signed by, a large number of civil society organizations
And which therefore is most likely to lead to the European Commission enforcing the data request

Now, I happen to believe that asking for most viewed content would fit the bill there. But if others disagree, and there’s common ground around another specific data ask, that’s totally fine by me. Because the important thing is to establish a precedent under DSA 40.12 of making demands for public data, to unlock future requests.

Importantly, this strategy completely sidesteps asking for access to research APIs, which in any case are often woefully limited. DSA 40.12 doesn’t mention APIs — it very simply lays the ground for a direct request for the exact data we want.

What if the strategy fails? That’s OK — either way, it’s an incredible test case to understand the powers (or lack thereof) of the DSA.

So that’s what I’m going to do. And of course, it’s very possible that Mozilla or some of the other folks who signed the first letter are already working on exactly this. That’s what I’m going to find out next.

Let’s find the missing billions!

Weekly update

Challenge 1, Linknames: Surprise, surprise — I did not hear back from either Donald Glover or will.i.am. However, that doesn’t mean I’m going to give up, and I have much cooler linkname news anyway:
- For the first time ever, I’ve come across someone whose entire name is a domain. That’s pretty darn awesome. Emilie Ma, a 12 Challenges reader, got in touch and it turns out she is the owner of the domain emilie.ma, which redirects to her personal site kewbi.sh (another linkname!).
- Emilie is a fellow domain-name aficionado and has an extremely cool startup called NestedName, which helps you find domain hacks around a specific name you have in mind. Love it.
Challenge 2, TrojanTok: More great work from Cole, but wow, it’s hard to make TikTok work. It feels like an uphill slog. We’re going to broaden the experimentation to try and find something that works.
Challenge 3, Missing Billions: Just launched it right here!
Typology of the social media feed: I wrote a piece introducing a series around the different aspects of social media feeds and how they’ve changed over time. I’m excited about this. I plan to make visualisations of how social media feeds have changed over time, to help articulate the paths we’ve all been going down as we’ve lurched from Facebook 2004 to Facebook 2019, to TikTok, to whatever’s next.
Why is no one making a new version of old Facebook? I got interested in this question off the back of a comment by Toby Mather in the Typology piece. Read my attempt to answer it here, and check out the great discussion on Hacker News too.
Read an awesome book about how and why big projects fail. It’s called How Big Things Get Done, and I liked it because I’m very good at doing zero planning and therefore failing hard. Which it turns out is the opposite of what you should do.
- I also recommended the book via Discord to one of my heroes, Martin Molin, the creator of the Marble Machine viral sensation, and he’s ordered a copy to help him out with the enormous undertaking of creating the third marble machine!
Social activity: Got a nice retweet from Cory Doctorow of a tweet about a tongue-in-cheek contraction (by Toby Mather) of the term enshittification: e14n. He also boosted it on Mastodon, which gave me hope that Mastodon might be a good place to talk about my work.
Stats: 158 subscribers, up 28 this week. Welcome! This shatters the 3-year timeline to get to 4.7k subscribers: if I have more weeks like this one, I should reach that point in just 22 weeks’ time. But I’m not going to reset my expectations.