Taming the Wild Web


Our CTO, Austin, taking a break from technology.

Here at SudoSearch we are faced with a difficult problem: There are about 14.98 billion known webpages on the internet, and we need to deliver you specific, relevant and awesome results. No small task, but what can we say? We’re ambitious.

Although large search engines like Google and Bing do a good job of indexing a huge amount of information, the big guys all operate under the same search paradigm – information only when you actively seek it. SudoSearch finds information both actively and passively. We are searching for you even when you aren’t in front of your computer, phone or tablet.

Another challenge: there is no standard format for information on the web, so finding relevant information can be difficult.

Want to learn more about how we do it? Read on…


Finding anything on the internet without help is like trying to find your favorite novel in the Library of Congress – if every book were placed on a random shelf, most books were missing titles or authors, some books were incomplete, printed backward or blank, and books randomly disappeared and reappeared in different locations.

Luckily, we have the power of technology to help us. When we do find information we want, it is our job to categorize it, store it and make it searchable. For this we use the excellent Apache Solr search platform – it lets us store information in a variety of formats and retrieve entire documents quickly and easily.
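To make the "store it and make it searchable" idea concrete, here is a toy sketch of field-based documents and a naive keyword match. The field names (id, title, body) are illustrative, not our actual schema, and the linear scan stands in for what Solr's real inverted index does far faster:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy stand-in for a Solr index: documents are bags of named fields,
// and a "query" returns the ids of matching documents.
public class TinyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void add(String id, String title, String body) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", id);
        doc.put("title", title);
        doc.put("body", body);
        docs.add(doc);
    }

    // Case-insensitive substring match over title and body.
    public List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String haystack =
                (doc.get("title") + " " + doc.get("body")).toLowerCase(Locale.ROOT);
            if (haystack.contains(needle)) {
                hits.add(doc.get("id"));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("1", "Kittens", "Pictures of kittens on the web");
        index.add("2", "Crawling", "How web crawlers work");
        System.out.println(index.search("kitten")); // prints [1]
    }
}
```

Everything downstream – API results, crawled pages, RSS items – ends up in this one document shape, which is what makes a single search over all of it possible.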


Actually getting the data is a little harder. Luckily, content sources have recently shifted toward providing their information in structured, documented ways – usually through an API, or application programming interface. SudoSearch leverages these sources heavily. To consume data from these APIs we simply point our system at their location, specify which information we are interested in, and store it away in Solr for later use. Since we are mainly a Java shop, we use the Jackson parser for JSON APIs and JAXB for XML APIs. APIs are great, and we are super excited about the possibilities they provide for the web as a whole. SudoSearch will provide its own API when it is more mature.
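As a flavor of what consuming a structured API response looks like, here is a minimal sketch using only the JDK's built-in XML parser. In practice a binding library like JAXB (or Jackson for JSON) maps responses straight onto Java objects; the `<articles>`/`<title>` element names below are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Pull the fields we care about out of a structured XML API response.
public class ApiSketch {
    public static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<articles><article><title>Hello</title></article>"
                   + "<article><title>World</title></article></articles>";
        System.out.println(titles(xml)); // prints [Hello, World]
    }
}
```

The appeal of APIs is exactly this: the structure is agreed upon up front, so extracting the interesting fields is a few lines of code rather than guesswork.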


Not all information we are interested in is as easy to obtain as simply asking an API – in fact, most of it isn’t! So we often have to resort to good ol’ crawling. For this, Crawler4j is our weapon of choice: it is small, easy to configure and really fast. We also use the excellent Goose data extractor when appropriate, especially when we want images or other media from a data source. Once a page is crawled, we treat the resulting information the same as we do data from our API-based sources, and store it in Solr. You, the user, won’t know how we got the information you are looking for, only that it is there.
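At its core, crawling is a loop: fetch a page, extract its outgoing links, queue them for a later visit. Here is a toy sketch of just the link-extraction step; a real crawler like Crawler4j uses a proper HTML parser plus politeness and depth rules, so a regex is only good enough for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy link extractor: find absolute hrefs on a fetched page so they
// can be added to the crawl frontier.
public class LinkSketch {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<p><a href=\"http://example.com/a\">A</a>"
                    + " <a href=\"http://example.com/b\">B</a></p>";
        System.out.println(extractLinks(page));
        // prints [http://example.com/a, http://example.com/b]
    }
}
```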


Lastly, we also support RSS feeds. SudoSearch lets you subscribe to RSS feeds as you would with any other RSS reader and enjoy them from your “feed” page. However, we go one step further than most readers and actually store the data contained in each feed so that our users can find it elsewhere. Looking for your kitten-picture fix? A regular search on SudoSearch may surface results that belong to a kitten RSS feed, and we’ll give you the ability to subscribe to it. We accomplish this with the help of the ROME RSS library for Java, another excellent open source project that helps SudoSearch accomplish its goals so that you can find what you are looking for.
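"Storing the data contained in a feed" just means pulling each `<item>`'s fields out so the entries can be indexed like any other document. ROME handles the many real-world feed dialects for us; this stdlib sketch only reads a well-formed RSS 2.0 snippet, and the example feed is made up:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Extract the title of each <item> in an RSS 2.0 feed so the entries
// can be stored and searched like any other document.
public class RssSketch {
    public static List<String> itemTitles(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            titles.add(item.getElementsByTagName("title").item(0).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String feed = "<rss version=\"2.0\"><channel><title>Kittens</title>"
                    + "<item><title>Fluffy</title><link>http://example.com/1</link></item>"
                    + "</channel></rss>";
        System.out.println(itemTitles(feed)); // prints [Fluffy]
    }
}
```

Once the items are in this shape, they go into Solr alongside API results and crawled pages, which is why a feed's entries can turn up in a regular search.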

All in all, by utilizing the wealth of tools available on the modern web, we’re developing the next generation of search in stride with the rest of the web: customizable, interactive and resourceful.

We can’t wait for you to experience SudoSearch. Sign up for your private beta invite to be one of the first to check it out.