I wrote last week about the insanity of not knowing the most viewed videos on TikTok — basic data which would tell us about the information entire societies are consuming via the platform.
These videos are hiding in plain sight. Each has been viewed hundreds of millions, sometimes billions, of times. But it’s impossible to get a simple list of them.
In that last post, I promised to write a follow-up about how I might hunt these videos down, so here are all the methods I can think of, along with what each method tells us about the internet in 2024.
The data goal
So first off, what is the data goal here? For the purposes of this article, let’s say I want to know:
The IDs of the top 10,000 TikTok videos
Sorted by view_count (TikTok’s own metric)
Published on TikTok
At any point in 2023
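Stated as code, the goal looks something like the sketch below. It is only an illustration: the `Video` record and its field names are made up for this post rather than taken from TikTok, and nothing here says how the records would be obtained, which is the hard part.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Video:
    # Hypothetical record; the field names are assumptions, not TikTok's schema.
    video_id: str
    view_count: int
    published_at: datetime


def top_videos(videos, year=2023, n=10_000):
    """Return the n most viewed videos published in the given year."""
    in_year = (v for v in videos if v.published_at.year == year)
    # nlargest holds only n items at a time, so `videos` can be a huge stream.
    return heapq.nlargest(n, in_year, key=lambda v: v.view_count)
```

The ranking itself is trivial. Every method below is really about how to fill `videos` with something close to the full set.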
Method 1: Google it
This seems like an obvious place to start. Searching for ‘most viewed TikTok videos’ and related terms brings up two things:
A bunch of articles listing most watched TikTok videos (usually around 10 of them). These do shed a little light on the very top of the pyramid, which is useful — although it’s hard to know whether this is comprehensive (i.e. includes all content types, not just entertainment).
A ton of TikTok auto-generated pages with whichever search terms you used, like this one for instance. This is not an actual view of most viewed TikTok videos, it’s just the search page for ‘Most Viewed TikTok Videos’ and it mostly includes videos that use those words in their title or description.
What this tells us about the internet in 2024: As far as I can tell, nearly everything we get from Google about this data is information managed by TikTok — both the media reports, which do not give a source but presumably are based on TikTok comms, and the thousands of TikTok auto-generated search pages, which crowd out more useful information on what the top TikTok videos are.
So although you might think Google is a great way to search for information you want, in fact, many big, rich companies have the resources to show you information they want you to see.
Conclusion: Useless.
Method 2: TikTok’s Research API
OK, but you’re doing research, right? So why not use the TikTok Research API?
First off, you’ll have to be affiliated with ‘not-for-profit academic institutions in the United States, and Europe’.
You’ll then have to sign up to incredibly restrictive terms governing how you can use the Research API and what you can do with the research you generate.
And the worst part? You won’t be able to come close to figuring out the top TikTok videos by view_count. You’ll be limited to 1,000 requests a day, each returning at most 100 videos, so 100,000 videos a day at best. That’s nothing set against the >10m videos uploaded to TikTok every day (an estimate based on a number of unreliable sources). And no, you won’t be able to request videos over a certain view_count, or order the results by view_count, because the API does not give you controls to do that.
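To put numbers on that limit, here is a back-of-the-envelope calculation. The request and page limits are the ones quoted above. The upload figure is my rough ">10m" estimate treated as a lower bound, and the sketch generously assumes every request returns 100 videos you have not seen before.

```python
# Limits quoted above; the upload volume is a rough, unreliable estimate.
REQUESTS_PER_DAY = 1_000
VIDEOS_PER_REQUEST = 100
UPLOADS_PER_DAY = 10_000_000  # ">10m", so this is a lower bound


def daily_quota() -> int:
    """Most videos the API could return in one day."""
    return REQUESTS_PER_DAY * VIDEOS_PER_REQUEST


def coverage() -> float:
    """Share of a single day's uploads that one day of quota could cover."""
    return daily_quota() / UPLOADS_PER_DAY


def quota_days_per_upload_day() -> float:
    """Days of quota needed to list one day's worth of uploads."""
    return UPLOADS_PER_DAY / daily_quota()


print(daily_quota())                # 100000
print(f"{coverage():.0%}")          # 1%
print(quota_days_per_upload_day())  # 100.0
```

On these assumptions, listing one year of uploads would take about a century of quota, and that is before any sorting by view_count.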
What this tells us about the internet in 2024: There are Research APIs and there are Research APIs. TikTok’s is useful for them to be able to say ‘we have a Research API’ — but it doesn’t come close to allowing researchers to get this critical data. So next time you hear a platform saying ‘Look, we’re an open book!’, take another look.
Conclusion: For my goal, the TikTok Research API is even more useless than Google.
Method 3: Scraping
Since we can’t get any easy answers from Google or TikTok’s own Research API, we could try to scrape TikTok’s platform, in other words try to download information from TikTok’s servers ourselves. This is going to be hard.
As we just found out, >10m videos are uploaded to TikTok every day, so billions a year. You can forget about scraping that dataset unless you are Google.
You could try to set up multiple devices (physical and/or virtual) that will use TikTok and try to swipe on the For You Page towards the videos with the highest view counts, but it’s pretty unlikely that’s going to work or be comprehensive. It’ll also cost you money to set up multiple devices, and it’ll be hard to make them look ‘real’ to TikTok so the accounts you run on them don’t get banned.
Also, whichever scraping method you choose, TikTok will probably allege that you break provisions in their terms of service that bar using automated means to engage with the platform.
What this tells us about the internet in 2024: That it’s enormous, obviously, and that even if you do stuff that seems completely legit, platforms may disagree and use their vast resources to bully you. In this case, TikTok would probably send their lawyers after you even though all of the information you’d be seeking is completely public and Google themselves scrape it all the time.
Conclusion: Don’t even bother trying. Leaving aside the legals, it would be impossible to scrape this much data without a ton of money.
Method 4: Scraping other platforms
Because the most viewed videos on TikTok are by their nature extremely popular, they are likely to end up on other platforms. This means scraping could be done through the other platforms (many thanks to the member of CITR who suggested this approach to me). It could look something like:
Use scraping methods on a bunch of other platforms (Twitter, Reddit would be the most obvious choices) to search for links to TikTok videos — then load the underlying videos with some sort of script, and obtain their titles, view counts and date published. Add that information to your database and start ranking videos by view counts.
Now you’d have 3 platforms probably alleging you broke their terms of service (unless you use official APIs on Twitter and Reddit, which are likely to be expensive). But you’d be getting somewhere.
You may also have to worry about the data set not being comprehensive. There are likely to be biases behind which TikTok videos end up on other platforms (e.g. maybe shorter/entertainment-based ones mostly).
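The link-hunting half of that pipeline can be sketched as below. This is a minimal illustration, not a working scraper: the URL pattern is an assumption based on how TikTok web links commonly look, short share links are ignored because they carry no ID, and the step that actually loads each video to read its view count is left out entirely.

```python
import re

# Assumed link shape: https://www.tiktok.com/@username/video/<digits>
# Short share links (e.g. vm.tiktok.com/...) would need resolving first.
VIDEO_URL = re.compile(r"https?://(?:www\.)?tiktok\.com/@[\w.\-]+/video/(\d+)")


def extract_video_ids(text: str) -> list[str]:
    """Return the distinct TikTok video IDs linked in `text`, in order seen."""
    return list(dict.fromkeys(VIDEO_URL.findall(text)))


def rank_by_views(view_counts: dict[str, int], n: int = 10_000) -> list[tuple[str, int]]:
    """Sort (video_id, view_count) pairs from most to least viewed, keep n."""
    ranked = sorted(view_counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]
```

Here `extract_video_ids` would run over the text of each tweet or Reddit post, and `rank_by_views` over whatever view counts you later manage to collect for those IDs.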
What this tells us about the internet in 2024: Content that does well in one place leaves traces all over. Reminds me of that great quote from Tom Eastman that “the internet is five giant websites, each filled with screenshots of text from the other four”.
Conclusion: I could don my detective’s cap and give it a go, but it would be painstaking work and it might not be comprehensive. Still, a very interesting angle.
Method 5: Crowdsourced data
What about finding some smart way to get this data using volunteers all around the world? That would be pretty cool — a little like the way Mozilla created RegretsReporter to study YouTube, or the way that Who Targets Me works. This is definitely a viable approach, if you have patience and a way to build up the base of volunteers who are going to help out. Some thoughts:
Browser extensions (as used by both of those examples) would be the easiest way forward. Users around the world can simply install them in their desktop browsers without TikTok etc. being able to interfere. In our case, the extension would scrape information on a volunteer’s behalf about any high view count videos, and send it back to our database.
You might think you’re safe from TikTok’s lawyers by going down this route. Unfortunately, precedents like the NYU Ad Observatory suggest this is not the case.
Alternatively, you could set up some sort of prize competition for the most viewed videos — give bounties to users who submit links to videos that end up being in the top 10,000. Expensive, but an interesting concept nonetheless — even more so if you limit the type of submission to news/politics videos instead of bringing in a ton of entertainment content that you’re not interested in (if you are studying disinformation, for instance).
Both approaches are going to require a long slog of getting lots of people to care enough about what you’re doing to participate. So no quick fix.
Your sample of TikTok videos will also likely be biased by the type of people you get to participate. TikTok is all about that recommended content based on your language, preferences, etc. — so if you get a bunch of developers to crowdsource your data, don’t be surprised if it is mainly about things they care about. In other words, the content is not going to be comprehensive, and you’ll even struggle to understand how un-comprehensive it is.
You’ll also have to be extremely careful with people’s data — make sure you only send strictly public information back to your server, and that you are being completely open with your users about what you are doing. In the EU, GDPR might be a problem.
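One way to enforce the "strictly public information" rule is an allowlist applied to every report before it is stored. The sketch below is a hypothetical server-side check, and the field names are my own invention: anything not on the list is dropped, so a bug in the extension cannot leak a volunteer's details into the database.

```python
# Hypothetical allowlist: fields anyone can see on a public video page.
PUBLIC_FIELDS = ("video_id", "view_count", "published_at")


def sanitize_report(report: dict) -> dict:
    """Keep only allowlisted fields, dropping anything about the volunteer."""
    missing = [field for field in PUBLIC_FIELDS if field not in report]
    if missing:
        raise ValueError(f"report is missing fields: {missing}")
    return {field: report[field] for field in PUBLIC_FIELDS}
```

An allowlist is safer than a blocklist here, because new fields are excluded by default rather than included by accident.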
What this tells us about the internet in 2024: People power! If you rally an audience, you can get a lot done.
Conclusion: Requires a lot of work, a lot of people, and will take a long time. But nonetheless, very interesting.
Method 6: Social monitoring tools
This one is basically a non-starter, but worth mentioning for the sake of completeness. There’s an entire industry of companies that scrape the major social platforms, including TikTok, and sell access — mostly to media and PR outfits who need to monitor various clients and industries. Some of these, for instance Tubular Labs, claim to have over 10 billion videos.
I probably won’t be using any of these tools anytime soon. They tend to cost thousands of dollars per month (prices are usually hidden until after you enquire), and even if I had the money, they would probably look suspiciously upon researchers who want to use them.
That’s because they are operating in a gray area, providing a service that the major platforms don’t make easy. How they get their hands on platform data is a mystery, but it seems safe to assume that either 1) they have contracts with the platforms directly or 2) they are scraping in ways that the platforms would allege are against their terms of service. That gives them an incentive to sell only to customers who won’t do anything adversarial to the platforms, like media and PR outfits.
What this tells us about the internet in 2024: You can get data from the platforms as long as you do it on their terms (or are willing to put up with the risk of getting shut down).
Conclusion: No way. If you’re a billionaire reading this who wants to pay the bills, feel free to get in touch.
Method 7: Through regulation
After a long slog through six methods, each with its own big problems (frankly, I’m amazed you’re still reading), we arrive at something of a Promised Land: using new regulations, specifically the European Union’s Digital Services Act (DSA), to request the data from TikTok directly.
The DSA’s Article 40 is where the action is at — paragraph 12, specifically:
12. Providers of very large online platforms or of very large online search engines shall give access without undue delay to data, including, where technically possible, to real-time data, provided that the data is publicly accessible in their online interface by researchers, including those affiliated to not for profit bodies, organisations and associations, who comply with the conditions set out in paragraph 8, points (b), (c), (d) and (e), and who use the data solely for performing research that contributes to the detection, identification and understanding of systemic risks in the Union pursuant to Article 34(1).
TikTok has been designated a very large online platform under these regulations so it sounds like in theory, I can just ask TikTok to give me the data I need. Great!
Well, kinda, I think.
It’s not actually clear what I should do now. Do I cold email whichever addresses I can find for TikTok’s data and legal team?
There’s no way TikTok aren’t going to push back. These regulations are brand new — TikTok have probably assembled an army of lawyers ready to fight every request made under them.
I don’t live in the EU, and I’m not an EU citizen, so I don’t even understand whether I can use this law. I think so? It doesn’t say that I have to be an EU citizen?
What this tells us about the internet in 2024: The regulators are coming, like it or not. The DSA is a huge step change in the way that platforms are regulated, and one of the best parts of it (in my humble opinion) is the way it’s going to force TikTok and pals to reveal all sorts of interesting data about how they work, and the information they recommend.
Conclusion: Although I have a few questions, this avenue is definitely worth exploring. So — let’s do it!
Coming up…
Follow along as I try out Method 7 (probably not in the next 12 Challenges piece, but in a couple of weeks).
Hello, it's me (Roi)
I was wondering if after all these years you'd like to meet (and discuss TikTok)
To go over everything (that can be done to find the most watched clip)
They say that time's supposed to heal ya (from FB's latest legal action)
But I ain't done much healing (so let's just share a pie).
Or more to the point.
Here is a list of EU NGOs you can partner with to take down TikTok and who may offer you legal support in the process.
https://edri.org/about-us/our-network/
Adele
A combo of methods 5 and 7 could be really interesting!