Post by No Time To Play in Content-based search (academic research collaboration)

That sounds interesting! Presumably you need Itch to recognize your bot and leave it out of statistics? And also to know how fast you can crawl the site without tripping any wires? Suggestions to improve the API are welcome, too.

However, grabbing every game from the site that has at least a free demo sounds ambitious. We have rather a lot of them, you know. And some creators might be leery of it. I'll bring it up with our admins and see what they have to say.

rndmcnlly, 6 years ago

Excellent!

Over the next week, I'm only hoping to get a very specific letter from someone at the company that says literally "If the proposal submitted by Dr. [insert the full name of the Principal Investigator] entitled [insert the proposal title] is selected for funding by the NSF, it is my intent to collaborate and/or commit resources as detailed in the Project Description or the Facilities, Equipment or Other Resources section of the proposal." I'm still writing the proposal now (deciding whether to mention Itch by name if I can get a letter, or just "public archive and marketplaces" otherwise). The only resources I actually need from Itch are rough agreement that you are okay with the effort to use your public data and willingness to answer questions about whether I'm breaking stuff -- super minimal. The letter mostly just shows that I've checked in with you folks and that my research plan won't come crumbling down the moment I try to start it up.

Anything beyond that, such as sending one of my graduate students there as an intern to replicate our prototype systems on top of your production stuff, can be saved for far far off in the future.

quyse, 6 years ago

A bit off-topic, but I just re-read itch.io Terms of Service, in particular:

> Publishers retain all ownership rights to the submitted content, and by submitting content to the Service, Publishers hereby grant the following:
> ...
> To Users, a non-exclusive, perpetual license to access the content and to use, reproduce, distribute, display and perform such content as permitted through the functionality of the Service. Users shall retain a license to this content even after the content is removed from the Service.

That's an interesting phrasing. It says that users are allowed to "distribute" games they download, i.e. essentially waiving copyright? Usually terms of service say quite the opposite; it looks like the words "you may not" are missing here :)

So I guess content scraping cannot be forbidden with such liberal terms of service, can it? Of course, there are still technical considerations, like not flooding the website with requests, etc. It would be nice to have official rules for scraping.

I once scraped the public information of itch.io games with ids < 200000 as an experiment. It took a few weeks or so (I used a very low request rate). I wanted to calculate some interesting stats about itch.io games, but abandoned the project for lack of time. Maybe I'll resume it someday. A few stats I got: the number of published games in this range was 125775, and the total size of those games' uploads was ~5.7 TB (I didn't download them, only collected metadata). Now I can see that the number of games is more than 200k, and the max game id is in the ~450k range. Quite a lot, but doable :)
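
For a rough sense of scale, here is a minimal Python sketch of the arithmetic behind those numbers, plus a generic throttle for spacing out requests. It makes several assumptions that are not in the post: "a few weeks" is taken as 21 days, one request is counted per game id, and the ~5.7 figure is read as decimal terabytes. The `throttled` helper is hypothetical, not the code actually used for the crawl, and it calls no itch.io API.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

# Figures quoted in the post.
ID_RANGE = 200_000        # game ids scanned (ids < 200000)
PUBLISHED = 125_775       # published games found in that range
TOTAL_UPLOAD_TB = 5.7     # total upload size (metadata only), read as terabytes


def seconds_per_request(n_requests: int, days: float) -> float:
    """Average spacing if n_requests are spread evenly over `days`."""
    return days * 24 * 3600 / n_requests


def throttled(
    items: Iterable[T],
    min_interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Yield items no faster than one per `min_interval` seconds."""
    last: Optional[float] = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


published_share = PUBLISHED / ID_RANGE                    # about 0.63
avg_upload_mb = TOTAL_UPLOAD_TB * 1_000_000 / PUBLISHED   # about 45 MB per game
# ASSUMPTION: 21 days and one request per id; the post only says "a few weeks".
spacing = seconds_per_request(ID_RANGE, days=21)          # about 9 s between requests
```

Under those assumptions, roughly 63% of ids in the range were published games, the average game carried about 45 MB of uploads, and the crawl averaged about one request every nine seconds. Wrapping the id range as `for game_id in throttled(range(1, 200_000), spacing): ...` would reproduce that pacing.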

No Time To Play, 6 years ago

We can't grant users a license to waive copyright. Of course not. What we can do is assure users that once they bought a game they'll retain access to it even if it's later withdrawn from sale, so they can continue to download and play what they paid for. Otherwise you end up in situations like the Microsoft e-book store.

As for content scraping, do we need to make rules about it now? We were hoping not to need rules for every little thing. Thank you for doing this responsibly last time.

quyse, 6 years ago

> We can't grant users a license to waive copyright.

Yeah, I know, that's why I said "essentially". Probably better to say "effectively". You cannot waive copyright, but you can require publishers to grant users a license with almost the same rights as copyright. Specifically, "license to distribute" sounds like it officially allows unlimited copying or pirating. I'm not sure I understand this right; I'm not a lawyer, I just saw it in the terms and raised a question :)

> As for content scraping, do we need to make rules about it now?

Ah, no, I don't really need rules. I guess being reasonable is enough. Although I would appreciate at least some informal guidance, e.g. what the recommended request rate for itch.io is, or whether it's OK to download the games themselves. Downloading and analyzing terabytes of data is not that costly nowadays, especially since incoming traffic on cloud providers is usually free, but obviously itch.io would bear some cost for its outgoing traffic and CDN charges.

To be clear, I'm not working on this project right now, and it's not that serious. What's implemented (quite a while ago) is this: https://itchy.quyse.io/ - a simple analyzer of publicly available itch.io projects. Example analysis (of the game I'm working on): https://itchy.quyse.io/game/19957. The idea is to expand the analysis to all itch.io projects (rather than doing it on demand), and to gather statistics about game binaries/resources, like what percentage of all games use Unity, or what the minimum required Linux distributions are for games with a Linux version, etc. I hope I'll get to work on it as a weekend project at some point.
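
The "what percentage of all games use Unity" kind of statistic could be computed along these lines once per-game analysis results exist. This is only a sketch: the `engine` field, the record layout, and the sample values are all made up for illustration, and nothing here reflects how the analyzer linked above actually detects engines or stores its results.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def engine_shares(games: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the percentage of games per detected engine.

    Records without an "engine" key are counted as "unknown".
    """
    counts = Counter(game.get("engine", "unknown") for game in games)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {engine: 100 * n / total for engine, n in counts.items()}


# HYPOTHETICAL sample records, not real itch.io data.
sample = [
    {"id": "1", "engine": "Unity"},
    {"id": "2", "engine": "Unity"},
    {"id": "3", "engine": "Godot"},
    {"id": "4"},  # engine not detected
]

shares = engine_shares(sample)
```

Keeping an explicit "unknown" bucket matters for this kind of survey, since a share computed only over games with a detected engine would overstate every engine's percentage.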
