
Content-based search (academic research collaboration)

A topic by rndmcnlly created 85 days ago Views: 127 Replies: 8

I'm running a university research lab interested in improving how we organize and access interactive media like games and apps. Towards making a prototype search engine that understands the interactive content of games rather than just their metadata (e.g. title/description/tags), we are making a crawler that will download and automatically play the first few minutes of every game we can find. Can we coordinate with you so as to not interfere with site operations or throw off download counts?

Here's one of our recent research papers on the general idea:

In an ideal world, the search system works like Google News or Google Books. Just as those tools can determine that specific articles, or specific pages within larger books, are relevant to a user's information need, a game search engine can tell them that specific modes or moments in a game are relevant to their interest. The search engine doesn't host the content -- it redirects to some other site where the user may need to pay for access to the game and, for supported games, can follow a deep link into a specific moment.


That sounds interesting! Presumably you need Itch to recognize your bot and leave it out of statistics? And also to know how fast you can crawl the site without tripping any wires? Suggestions to improve the API are welcome, too.

However, grabbing every game from the site that has at least a free demo sounds ambitious. We have rather a lot of them, you know. And some creators might be leery of it. I'll bring it up with our admins and see what they have to say.


In the timescale of the next week, I'm only hoping to get a very specific letter from someone at the company that says literally "If the proposal submitted by Dr. [insert the full name of the Principal Investigator] entitled [insert the proposal title] is selected for funding by the NSF, it is my intent to collaborate and/or commit resources as detailed in the Project Description or the Facilities, Equipment or Other Resources section of the proposal." I'm still writing the proposal now (deciding whether to mention Itch by name if I can get a letter, or just say "public archives and marketplaces" otherwise). The only resources I actually need from Itch are rough agreement that you are okay with the effort to use your public data and willingness to answer questions about whether I'm breaking stuff -- super minimal. The letter mostly just shows that I've checked in with you folks and that my research plan won't come crumbling down the moment I try to start it up.

Anything beyond that, such as sending one of my graduate students there as an intern to replicate our prototype systems on top of your production stuff, can be saved for far far off in the future.

A bit off-topic, but I just re-read Terms of Service, in particular:

Publishers retain all ownership rights to the submitted content, and by submitting content to the Service, Publishers hereby grant the following:
To Users, a non-exclusive, perpetual license to access the content and to use, reproduce, distribute, display and perform such content as permitted through the functionality of the Service. Users shall retain a license to this content even after the content is removed from the Service.

That's an interesting phrasing. It says that users are allowed to "distribute" games they download, i.e. essentially waiving copyright? Usually terms of service say quite the opposite -- it looks like the words "you may not" are missing here :)

So I guess content scraping cannot be forbidden under such liberal terms of service, can it? Of course, there are still technical considerations, like not overwhelming the website with requests, etc. It would be nice to have official rules for scraping.

I once scraped games' public information for game ids < 200000 as an experiment. It took a few weeks or so (I used a very low request rate). I wanted to calculate some interesting stats about games, but abandoned the project for lack of time. Maybe I'll resume it someday. A few stats I got: the number of published games in this range was 125,775, and the total size of those games' uploads was ~5.7 TB (I didn't download them, only collected metadata). Now I can see that the number of games is more than 200k, and the max game id is in the ~450k range. Quite a lot, but doable :)
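For anyone curious what "very low request rate" looks like in practice, here's a minimal sketch of the loop I mean. The fetch function is a placeholder for whatever HTTP call you'd actually make -- this isn't a real itch.io API, just the pacing logic:

```python
import time


class PoliteCrawler:
    """Enforces a minimum delay between successive requests."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.last_request = 0.0

    def wait(self):
        # Sleep just long enough that requests are at least
        # `delay_seconds` apart, measured on a monotonic clock.
        remaining = self.delay - (time.monotonic() - self.last_request)
        if remaining > 0:
            time.sleep(remaining)
        self.last_request = time.monotonic()


def crawl_ids(crawler, ids, fetch):
    """Fetch metadata for each game id; skip ids that don't resolve.

    `fetch(game_id)` is a stand-in for the real HTTP request and should
    return a metadata dict, or None for missing/unpublished ids.
    """
    results = {}
    for game_id in ids:
        crawler.wait()
        meta = fetch(game_id)
        if meta is not None:
            results[game_id] = meta
    return results
```

At 5 seconds per request, 450k ids works out to roughly a month of wall-clock time, which matches the "few weeks" experience above.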


We can't grant users a license to waive copyright. Of course not. What we can do is assure users that once they bought a game they'll retain access to it even if it's later withdrawn from sale, so they can continue to download and play what they paid for. Otherwise you end up in situations like the Microsoft e-book store.

As for content scraping, do we need to make rules about it now? We were hoping not to need rules for every little thing. Thank you for doing this responsibly last time.

We can't grant users a license to waive copyright.

Yeah I know, that's why I said "essentially". Probably better to say "effectively". You cannot waive copyright, but you can require publishers to grant users a license with almost the same rights as copyright. Specifically, "license to distribute" sounds like it officially allows unlimited copying or pirating. I'm not sure I understand this right, I'm not a lawyer, I just saw it in terms and raised a question :)

As for content scraping, do we need to make rules about it now?

Ah, no, I don't really need rules. I guess being reasonable is enough. Although I would appreciate at least some informal guidance, e.g. what the recommended request rate is, or whether it's OK to download the games themselves. Downloading and analyzing terabytes of data is not that costly nowadays, especially since incoming traffic on clouds is usually free, but obviously some cost will fall on Itch for outgoing traffic and CDN charges.

To be clear, I'm not working on this project right now, and it's not that serious. What's implemented (quite a while ago) is a simple analyzer of publicly available projects. Example analysis (a game I'm working on): The idea is to expand the analysis to all projects (rather than doing it on-demand) and gather statistics about games' binaries/resources, like what percentage of all games use Unity, or what the minimal required Linux distributions are for games with a Linux version, etc. I hope I'll get to work on it as a weekend project at some point.
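The engine stats could start from simple file-layout heuristics like these. The patterns below are my own rough assumptions about how typical Unity/Godot/Ren'Py builds look on disk, not anything official:

```python
def guess_engine(file_names):
    """Very rough engine guess from an upload's file listing.

    Heuristics (assumptions, not official rules):
      - Unity Windows builds ship UnityPlayer.dll and a *_Data/ folder.
      - Godot games usually carry a .pck resource pack.
      - Ren'Py games include a renpy/ directory.
    """
    names = [n.lower() for n in file_names]
    if any(n.endswith("unityplayer.dll") or n.endswith("_data/") for n in names):
        return "Unity"
    if any(n.endswith(".pck") for n in names):
        return "Godot"
    if any("renpy" in n for n in names):
        return "Ren'Py"
    return "unknown"
```

Running something like this over the metadata-listed file names (no need to download the uploads themselves) would already give a decent engine-share breakdown.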

If you manage to make such a sentient crawler, you won't need to bother with research or games -- you can just sell the crawler!

Realistically, the first few minutes of my games are a logo screen, instructions, a menu screen, an awards screen -- then the game. Your crawler would have to press the right buttons to navigate across all that. It would need to read the instructions to know what to do. Only then could it "play" the game -- and only if someone programmed those instructions into it. I don't think you're going to be able to do that with a bot.

You might be more successful by just gathering the various screenshots and / or videos and getting your data from those instead.


One of our first projects in this area used screenshots collected while playing back speedrun input files: (tested on Super Mario World and Super Metroid). It showed that you could search by screenshot and find specific results within a given game, but it raised the question: where do you get the speedruns from?

In another project, we extracted screenshots and audio transcripts from YouTube Let's Play videos: We could find the "horses" in The Last of Us and the "selfie" in Life is Strange, but still, can a machine do the playing instead of a human?

Of course a machine can -- and even really dumb strategies can sometimes do well. We tried out some simple strategies on games spanning from Atari 2600 to Nintendo 64:

But, like, can the algorithm cover serious ground on a timescale that compares with human gameplay? We looked at that in Super Mario World and The Legend of Zelda: It turns out that both humans and machines lose steam as you let them explore for longer. Interleaving human and machine exploration led to the most ground covered for a given amount of time invested.

Do these things just press random buttons, or do they learn from experience? Our first algorithms just pressed random buttons, yes, but the newer ones (tested on in-development Unity games that integrate our machine playtesting script) do learn from experience and improve over time: They can bootstrap from human gameplay demonstrations or from their own random flailing.
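To make the "demonstrations vs. random flailing" idea concrete, here's a toy epsilon-greedy sketch. This is purely illustrative -- not our actual playtesting script -- and the action labels are made up:

```python
import random
from collections import Counter


def make_policy(demo_actions, epsilon=0.2, rng=None):
    """Build a button-pressing policy seeded by demonstration actions.

    With probability `epsilon` it explores (random flailing); otherwise
    it exploits by replaying the most common demonstrated action.
    """
    rng = rng or random.Random()
    counts = Counter(demo_actions)
    actions = list(counts)

    def act():
        if rng.random() < epsilon:
            return rng.choice(actions)        # explore: random button
        return counts.most_common(1)[0][0]    # exploit: imitate the demo
    return act
```

A real learner would also update `counts` from the reward of its own play, so the random flailing gradually feeds back into the exploit branch.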

We're in the business of creating all this stuff and just giving it away for free. My lab brings in money by proposing interesting new work to be done (to be paid for by taxpayers) rather than selling the technology we developed last year. I'm scratching on the door of Itch as part of a proposal to get more money (from the US National Science Foundation) to do the next steps in these projects. It's tricky to get funding for research around games, but we're trying.

If someone at Itch can contact me directly at, that would be best. I'm just looking for a one-sentence form letter stating an intent to collaborate (at the level of email feedback, no system or legal changes) that I can include in my official research proposal document.