Use visual search frontend for Wikipedia

From Meta, a Wikimedia project coordination wiki
This is a proposal for a new Wikimedia sister project.
Visual Search for Wikipedia
Status of the proposal
Status: under discussion
Details of the proposal
Project description: Add an alternative search that shows an image for each result along with differentiating keywords, modeled after www.scrappycito.com. See the samples for "small dog", "Bob Jones", and "Taylor Swift" below. For pages without images, text categorization is used to select a generic image based on the topic.
Is it a multilingual wiki? It will first be fully supported for English. Other languages can be handled as is, provided whitespace tokenization is sufficient. Languages such as Chinese or Japanese, which are written without whitespace between words, will require custom preprocessing (n.b., presumably already in place for the current search). For text categorization, a native speaker will need to pick a representative user category for each of a few dozen generic categories. This would require less than one week's work, including time for training the classifier. ScrappyCito, LLC would do this for the top 20 languages (e.g., based on wiki popularity) over the course of a year.
Potential number of languages: Many languages will be supported; see the previous entry. Most languages will work out of the box. The handful of languages written without whitespace between words would need to use the tokenizer from the corresponding Wikipedia's regular search. In addition, image selection for text-only pages requires a category mapping created by a native speaker, as described above.
Proposed tagline: Image-centric results with keywords: clutter-free search facilitating quicker browsing and enabling great disambiguation!
Technical requirements
New features to require: Software with source for the server will be provided by ScrappyCito, LLC. Advice will be provided for customization.
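
The language notes in the proposal details can be illustrated with a minimal sketch. It shows why whitespace tokenization works for English but not for Chinese, and how a hand-built category mapping could pick a generic image for a page without one. All names, categories, and file names here are hypothetical, and a plain dictionary lookup stands in for the trained text classifier the proposal mentions.

```python
# Sketch only: every name, category, and file name below is hypothetical.

def whitespace_tokenize(text: str) -> list[str]:
    """Split text on runs of whitespace (enough for English and similar languages)."""
    return text.split()

# Hypothetical mapping from a wiki's own category names to generic categories.
# Per the proposal, a native speaker would prepare this for each language.
CATEGORY_TO_GENERIC = {
    "Dog breeds": "animals",
    "American singer-songwriters": "music",
    "Association football forwards": "sports",
}

# Hypothetical placeholder images, one per generic category.
GENERIC_IMAGE = {
    "animals": "generic-animals.png",
    "music": "generic-music.png",
    "sports": "generic-sports.png",
}

DEFAULT_IMAGE = "generic-article.png"

def generic_image_for(page_categories: list[str]) -> str:
    """Return the placeholder image for the first category with a known mapping."""
    for category in page_categories:
        generic = CATEGORY_TO_GENERIC.get(category)
        if generic is not None:
            return GENERIC_IMAGE[generic]
    return DEFAULT_IMAGE

# English splits into words; the Chinese phrase stays one unsegmented token,
# which is why such languages need a dedicated tokenizer.
english_tokens = whitespace_tokenize("small dog breeds")
chinese_tokens = whitespace_tokenize("小型犬的品种")
```

In this sketch a page with no mapped category falls back to a default image; the actual system may handle that case differently.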


It would be good for Wikipedia to use a visual search front end. A big incentive is that users would be drawn to Wikipedia for this type of search rather than to Google Search or Bing. This matters because those search engines often show Wikipedia content directly for popular entities such as sports stars or tourist attractions, which cuts down on Wikipedia traffic.

You will be able to use the visual search front end I developed without charge (i.e., license free) for the duration of my patent, which is in the works. Here is a simple example, with Wikipedia search on the left (white background) and Scrappy Search on the right (tan background):

Wikipedia vs. Scrappy search

The full example can be found at the following URL:

   http://www.scrappycito.com/wikipedia-vs-scrappy-search-small-dog-breeds-en-wiki-site.png  

See http://www.scrappycito.com for the stable version and http://www.tomasohara.trade:9330 for the work-in-progress version, which adds support for handheld devices and better aesthetics (n.b., used in the examples).

Two other examples illustrate added benefits of this visual search. First, disambiguation is based on images and keywords rather than just snippets of text. See

  http://www.scrappycito.com/wikipedia-vs-scrappy-search-bob-jones-en-wiki-site.png

In addition, alternative pages for the same entity become much more engaging:

  http://www.scrappycito.com/wikipedia-vs-scrappy-search-taylor-swift-en-wiki-site.png

I think this will be extremely popular with the Instagram crowd and younger users in general (e.g., younger than 30). To do similar searches, just add "site:en.wikipedia.org" to the query, as in the following example:

   Lionel Messi site:en.wikipedia.org
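
A site-restricted query like the one above can also be built programmatically. The following is a minimal sketch: the function name is hypothetical, and how the front end accepts a query in a URL is not stated here, so only the query string and its URL-encoded form are shown.

```python
from urllib.parse import quote_plus

def site_restricted_query(terms: str, site: str = "en.wikipedia.org") -> str:
    """Append a site: restriction to the search terms (hypothetical helper)."""
    return f"{terms.strip()} site:{site}"

# Plain query text, as it would be typed into the search box.
query = site_restricted_query("Lionel Messi")

# URL-encoded form, suitable for use as a query-string value.
encoded = quote_plus(query)
```

Changing the `site` argument (for example, to `simple.wikipedia.org`) restricts the search to a different wiki.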

The patent for this visual search will be owned by my company, ScrappyCito, LLC. If the company is acquired, I will require that the acquirer honor the license-free usage of the visual search system by Wikimedia for Wikipedia. (The acquirer will likewise be required to pass along this license-free usage requirement if it is acquired in turn.) You will have access to the current source code for use in Wikipedia and other approved projects.

I am doing this both for exposure and because I want to help keep Wikipedia viable. What I can do is develop a prototype for the Simple English Wikipedia on my server and help with the deployment for the regular English Wikipedia on your servers once approved.

Proposed by

Tom O'Hara (https://meta.wikimedia.org/wiki/User:Tomasohara)

People interested

Potential clash with mission

Some concerns raised:

  • If this is your software, Tom, and your server, just implement it and link to wikipedia just like Google links to it. To propose a commercial sister project sounds strange given the vision and mission of the wikimedia movement--which is free to use commercial and non-commercial. This includes the contents as well as the software. Also, it is ad-free. --ThurnerRupert (talk) 18:31, 7 July 2018 (UTC)

Thanks for the feedback, and sorry for the long delay: I didn't see the notification of the changes made here. Below I offer ways to address both of your concerns. --tomasohara (talk) 04:25, 2 Sept 2018 (UTC)

Rationale

I mentioned on the Wikimedia mailing list that this page is a placeholder (as it is not really a sister project) until a better location is found. I also mentioned that the issue of proprietary software can be addressed by introducing the notion of "Wikimedia-friendly sharing" rather than "unrestricted sharing". Basically, the visual search engine source code can be copied by anyone, and the burden will be on ScrappyCito to track down Wikimedia-unfriendly organizations using the software. An open source variant, modeled on the MySQL/MariaDB relationship, is suggested below.

Clarification (e.g., constraints)

The reason for running the software on a Wikipedia server is to support search from within Wikipedia (i.e., not from an external search engine). For efficient updates, this will require access to the underlying DB. Otherwise, synchronizing the front end with Wikipedia content changes would entail waiting up to a month for the next Wikipedia dump. In addition, there would be no ads: ad support is only an optional feature, and Wikipedia developers would be free to disable it as well as to modify the code as they see fit.

Suggestion

Wikimedia currently uses MySQL, which is not completely open source. Analogous to the MySQL/MariaDB relationship, there could be a variant of the visual search front end that would be open source. For example, the commercial front end might include security features suitable for enterprise search, which would not be available in the open source version. Wikipedias are community edited, so a lack of such security features would not be critical. As an alternative to using different versions of the code base, this could instead incorporate a license model with constraints for commercial usage, such as in the Business Source License (BSL) developed by the creators of MariaDB (i.e., the original MySQL developers).