Many people are rightly concerned about their personal information and privacy being at the mercy of large companies. Many projects aim to let users reclaim ownership of their data, but in some areas of everyday computing it is still difficult to break free from business-controlled products.
Search engines are one area that many privacy-conscious people complain about. YaCy is a project meant to fix the problem of search engine providers using your data for purposes you did not intend. YaCy is a peer-to-peer search engine, meaning that there is no centralized authority or server where your information is stored. It works by connecting to a network of people also running YaCy instances and crawling the web to create a distributed index of sites.
In this guide, we will discuss how to get started with YaCy on an Ubuntu 12.04 VPS instance. You can then use this to either contribute to the global network of search peers, or to create search indexes for your own pages and projects.
YaCy has very few dependencies outside of the package. On a modern Linux distribution, the only requirement should be OpenJDK (the open Java Development Kit), version 6.
We can get this from the default Ubuntu repositories by typing:
sudo apt-get update
sudo apt-get install openjdk-6-jdk
It may take a while to download all of the necessary components.
Once that is complete, you can get the latest version of YaCy from the project’s website. On the right-hand side, right-click or Control-click the link for GNU/Linux and select “Copy Link Location”:
Back on your VPS, change to your user’s home directory and download the program using wget:
cd ~
wget http://yacy.net/release/yacy_v1.68_20140209_9000.tar.gz
Once this has finished downloading, you can extract the files into their own directory:
tar xzvf yacy*
We now have all of the components necessary to run our own search engine.
We are almost ready to start utilizing the YaCy search engine. Before we begin, we need to adjust one parameter.
Change into the YaCy directory. From here, we will be able to make the necessary changes and then start the service:
We need to add an administrator username and password combination to a file so that we can explore the entire interface. With your text editor, open the YaCy default initialization file:
This is a very long configuration file that is well commented. The parameter that we are looking for is called adminAccount.
Search for the adminAccount parameter. You will see that it is currently unset:
adminAccount=
adminAccountBase64MD5=
adminAccountUserName=admin
You need to set an admin account and password in the following format, replacing your_password with a password of your choice:
adminAccount=admin:your_password
adminAccountBase64MD5=
adminAccountUserName=admin
This will allow you to sign into the administrative sections of the web interface once you start the service.
Save and close the file.
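If you would rather make this change without opening an editor, the same edit can be scripted. The following is only a minimal sketch: `set_admin_account` is a hypothetical helper name, not part of YaCy, and it takes the path to the initialization file as an argument. It assumes the file contains a line beginning with `adminAccount=`, as shown above.

```shell
# Hypothetical helper (not part of YaCy): set the adminAccount line in an
# init file while leaving every other line unchanged.
# Usage: set_admin_account <config_file> <password>
set_admin_account() {
    config_file=$1
    tmp_file="${config_file}.tmp"

    # The password is passed through the environment so that awk does not
    # interpret backslashes or other special characters in it.
    ADMIN_PW=$2 awk '
        # Matches "adminAccount=" only, not "adminAccountBase64MD5=" etc.
        /^adminAccount=/ { print "adminAccount=admin:" ENVIRON["ADMIN_PW"]; next }
        { print }
    ' "$config_file" > "$tmp_file" && mv "$tmp_file" "$config_file"
}
```

You would call it as `set_admin_account <path_to_init_file> 'your_password'` before starting the service.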
When you are ready, start the service by typing:
This will start up the YaCy search engine.
We can now access our search engine by navigating to this address in a web browser, replacing server_ip with the IP address of your VPS:
http://server_ip:8090
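Before opening a browser, you may want to confirm from a terminal that the interface is answering. The sketch below assumes `curl` is installed. The function names `yacy_url` and `yacy_status` are hypothetical helpers, and port 8090 is the port used in this guide.

```shell
# Build the address of the YaCy web interface.
# Usage: yacy_url <server_ip> [port]   (port defaults to 8090)
yacy_url() {
    printf 'http://%s:%s\n' "$1" "${2:-8090}"
}

# Print the HTTP status code returned by the interface.
# curl reports 000 when no HTTP response was received at all.
# Usage: yacy_status <server_ip> [port]
yacy_status() {
    curl --silent --output /dev/null --max-time 5 \
        --write-out '%{http_code}\n' "$(yacy_url "$@")"
}
```

Running `yacy_status server_ip` should print an HTTP status code if the service is up. A result of 000 suggests that the service is not running or that the port is blocked.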
You should be presented with the main YaCy search page:
As you can see, this is a pretty conventional search engine page. You can search using the provided search bar without any additional configuration, if you wish.
We will be exploring the administration interface though, because that provides us with a lot more flexibility. Click on the “Administration” link in the upper-left corner of the page:
You will be taken to the basic configuration page:
This will go over some common options that you may wish to set up right away.
First, it asks about the language preferences. Change this if one of the other languages listed is more appropriate for your uses.
The second question decides how you want to use this YaCy instance. The default configuration is to use your computer to join the global search network that crawls and indexes the web. This is how peer-based searching operates to replace traditional search engines.
This lets you join other peers in providing a great search resource and leverage the work that others have already done.
If you don’t want to use YaCy as a traditional search engine, you can instead choose to create a search portal for a single site by selecting the second option, or use it to index the local network by selecting the third option.
For now, we will select the first option.
The third setting is to create a unique peer name for this computer. If you have multiple servers running YaCy, this becomes increasingly important if you want to peer with them exclusively. Either way, select a unique name here.
For the fourth section, deselect “Configure your router for YaCy” since our search engine is installed on a VPS that is not behind a traditional router.
Click on “Set Configuration” when you are finished.
You can now search using the indexes kept on your YaCy peers. The search results will become more and more accurate the more people participate in the system.
We can contribute by crawling sites on our instance of YaCy so that other peers can find the pages we crawled.
To start this process, click on the “Crawler / Harvester” link on the left-hand side under the “Index Production” section.
If you’ve attempted to search for something and did not get the results you were looking for, consider starting to index the pages on a site with your instance. It will make your search more accurate for yourself and your peers.
Type in the URL that you want to index in the “Start URL” section:
This should populate a list of links that YaCy found at the URL in question. You can either select the original URL that you entered or use the list of links found on that page.
Furthermore, you can select whether you would like to index any links within the domain, or whether you would only like to index those that are a sub-path of the given URL.
The difference is that if you typed in http://example.com/about, the first option would also index http://example.com/sites, while the second option would only index pages located below the path you entered.
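To make the distinction concrete, here is a simplified model of the two scope options as a plain prefix match. This is an illustration only, not how YaCy implements its filters. The function name `in_crawl_scope`, the mode names `domain` and `subpath`, and the URL http://example.com/about/team are hypothetical.

```shell
# Simplified model: decide whether a candidate URL is inside the crawl scope.
# Usage: in_crawl_scope <domain|subpath> <start_url> <candidate_url>
# Returns 0 if the candidate is in scope, 1 if not, 2 on an unknown mode.
in_crawl_scope() {
    mode=$1
    start_url=${2%/}       # drop one trailing slash, if present
    candidate=$3

    case $mode in
        domain)
            # Keep scheme and host only:
            # http://example.com/about -> http://example.com
            scheme=${start_url%%://*}
            rest=${start_url#*://}
            prefix="$scheme://${rest%%/*}"
            ;;
        subpath)
            # Keep the full start URL as the required prefix.
            prefix=$start_url
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 2
            ;;
    esac

    case $candidate in
        "$prefix" | "$prefix"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With a start URL of http://example.com/about, the `domain` mode accepts http://example.com/sites, while the `subpath` mode rejects it and accepts only URLs such as http://example.com/about/team.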
You can limit the number of documents that your crawl will index. Click “Start New Crawl” when you are finished to begin crawling the selected site.
Click on the “Creation Monitor” link on the left-hand side to see the progress of the indexing. You should see something like this:
Your server will crawl the URL specified at the rate of 2 requests per second until it has either run out of links to follow or reached the limit you set.
If you then search for something related to your crawl, the pages you indexed should appear among the results.
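The crawl rate gives a rough lower bound on how long a crawl will take. The sketch below assumes one request per document and ignores network latency and processing time, so real crawls will take longer. The function name `crawl_seconds` is hypothetical.

```shell
# Estimate the minimum crawl duration in seconds, rounded up.
# Usage: crawl_seconds <document_limit> [requests_per_second]
crawl_seconds() {
    docs=$1
    rate=${2:-2}   # 2 requests per second, the rate stated in this guide
    # Integer ceiling division: (docs + rate - 1) / rate
    echo $(( (docs + rate - 1) / rate ))
}
```

For example, `crawl_seconds 1000` prints 500, so a crawl limited to 1000 documents needs at least a little over eight minutes at the default rate.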
One thing that YaCy can be used for is to provide search functionality for your website. You can configure your site index to operate as a search engine restricted to your domain.
First, select “Admin Console” under the “Peer Control” section on the left-hand side. In the admin console, go back to the “Basic Configuration” page.
This time, for the second question, choose “Search portal for your own web pages”:
Click “Set Configuration” at the bottom.
Next, you need to crawl your domain to generate the content that will be available through your search tool. Again, click on the “Crawler / Harvester” link under the “Index Production” section on the left-hand side.
Enter your URL in the “Start URL” field. Click “Start New Crawl” when you have selected your options:
Next, click on the “Search Integration into External Sites” link under the “Search Design” section on the left-hand side.
There are two separate ways to configure YaCy searching. We will be using the second one, called “Remote access through selected YaCy Peer”.
You will see that YaCy automatically generates the code that you will need to embed within a web page on your site:
On your site, you need to create a page that has this code inside. You may have to adjust the IP address and port to match the configuration of the server with YaCy installed.
For my site, I created a search.html page in the document root of my server. I made a simple HTML page and included the code generated by YaCy:
You can then save the file and access it from your web browser by going to:
http://your_web_domain/search.html
My page looks like this:
As you type in terms, you should see pages within your domain that are relevant to the query:
You can use YaCy in a great number of ways. If you wish to contribute to the global index in order to create a viable alternative to search engines maintained by corporations, you can easily crawl sites and allow your server to be a peer for other users.
If you need a great search engine for your site, YaCy provides that option as well. YaCy is very flexible and is an interesting solution to the problem of privacy concerns.
By Justin Ellingwood