Board Thread:General Discussion/@comment-1274766-20130726204008

Hi,

I would like to introduce you to collaborative filtering, in case you don't know it. It is easy to explain and could be very helpful to Wikia. It allows you to show, on a wiki page, something like: "Lots of contributors to this wiki also contribute to X and Y." This mechanism is used by Amazon. It could replace the spotlights, for example, and it produces much more accurate suggestions. For instance, contributors to a TV series wiki might be interested in a video game wiki, which would be hard to guess.

How to do that
Don't be afraid. I will discuss the performance issues later. Here is the principle:
 * 1) Pick a wiki
 * 2) List all its contributors
 * 3) Choose another wiki
 * 4) List all the contributors of that wiki
 * 5) Compare the two lists of contributors
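The five steps above can be sketched in a few lines of Python. This is only an illustration of the principle, with hypothetical example data; the contributor lists are assumed to be available as plain sets of user names:

```python
# A minimal sketch of the principle: the relatedness of two wikis is
# the number of contributors they share.
def relatedness(contributors_a, contributors_b):
    """Count contributors shared by two wikis."""
    return len(set(contributors_a) & set(contributors_b))

# Hypothetical example data:
series_wiki = {"alice", "bob", "carol", "dave"}
game_wiki = {"bob", "carol", "eve"}

print(relatedness(series_wiki, game_wiki))  # → 2
```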

If lots of contributors contribute to both wikis, then those wikis are highly related. If you do this with all the wikis, you will know how closely each wiki is related to the first one. Then you can keep the n most related ones.

Wow! That's dramatically expensive! Well, that is only the principle. Now let's see how to do it for real. First, it won't be computed each time a wiki page is rendered: only a couple of wiki names will be stored in a database for each wiki, which takes less space than the wiki description. Second, the process will run only once in a while, for instance monthly, over all the wikis. To improve performance, it can be run on each language's wikis separately, and the script can ignore all the contributors who have contributed to only one wiki. It does not need to query the production database: it can run against a backup database instead. The script is not critical, since it is no problem to keep the previous month's data if a run fails. Now here is what I think is the most efficient way to implement the algorithm:


 * 1) Create a map whose keys are wiki names.
 * 2) Retrieve all the contributors who have contributed to at least two wikis.
 * 3) Load a batch of those contributors.
 * 4) For each contributor in the batch:
 ** Retrieve all the wiki names the contributor has contributed to.
 ** For each wiki name:
 *** If the map has no entry for that wiki name, add one, with the wiki name as the key and a new empty sub-map as the value.
 *** For each remaining wiki name:
 **** If the sub-map has no entry for that name, add one with the wiki name as the key and the value 1;
 **** or else, increment the value.
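The map-building pass above can be sketched like this. It assumes that `contributor_wikis` (a hypothetical name) yields, for each contributor who edited at least two wikis, the list of wiki names they contributed to:

```python
from collections import defaultdict

# Build the map of maps: wiki -> {other wiki -> number of shared contributors}.
# Counting each pair in both directions gives every wiki its own entry.
def build_cooccurrence(contributor_wikis):
    counts = defaultdict(lambda: defaultdict(int))
    for wikis in contributor_wikis:
        for i, wiki in enumerate(wikis):
            # For each remaining wiki name, create or increment the counter.
            for other in wikis[i + 1:]:
                counts[wiki][other] += 1
                counts[other][wiki] += 1
    return counts

# Hypothetical example: two contributors, two and three wikis each.
counts = build_cooccurrence([
    ["muppet", "sesame"],
    ["muppet", "sesame", "fraggle"],
])
print(counts["muppet"]["sesame"])  # → 2
```

Using a `defaultdict` avoids the explicit "if there is no key" checks: a missing entry is created on first access.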


 * 1) For each entry of the main map:
 ** Create an empty sorted list.
 ** For each entry of its sub-map:
 *** If the count of the current wiki is higher than that of the nth item in the sorted list, insert the wiki into the list and remove the (n+1)th item if any.

-> Now you have a sorted list of related wikis for each wiki.

An average contributor contributes to six wikis. If you have u multi-wiki contributors, the algorithm is O(u*6*6), that is to say O(u). Not all the contributors need to be loaded at the same time; they can be processed in batches of a hundred, for example. You can run this on all-time data or only on the current month's data. Do not process wikis theme by theme: the main interest is to find surprising links. If you want to suggest three wikis, you can store, say, the twenty most related wikis and remove the ones the current page reader has already contributed to.
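The display-time filtering in that last sentence could look like this, assuming the twenty most related wikis are stored per wiki and the reader's own contributions are known (all names here are hypothetical):

```python
# From the stored related-wikis list, suggest the first n that the
# reader has not already contributed to.
def suggestions(stored_related, reader_wikis, n=3):
    return [w for w in stored_related if w not in reader_wikis][:n]

stored = ["sesame", "fraggle", "dino", "labyrinth"]
print(suggestions(stored, {"fraggle"}))  # → ['sesame', 'dino', 'labyrinth']
```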

"The wiki founders will hate us"
Some wiki founders may not appreciate this tool, as they may think it drives their contributors away. You should explain to them that it will also bring new contributors in.

20:40, July 26, 2013 (UTC) 