The CDNPAL search engine aims to provide a new way to organize the web. It's a fresh departure from the RankDex and PageRank algorithms.
In the beginning there were low-bit-rate 16-2400 baud telephone modems and dial-in city bulletin boards that were commonly accessed from Commodore 64s and IBM-compatible machines. Eventually a network called the Internet was built, and it connected local networks together worldwide. A method of finding content on these connected machines became necessary. Yahoo!, AOL, and MSN each had different ways of indexing content on remote machines using user-submitted directories, much in the way Kickstarter shows you projects.
Eventually two Stanford students, Sergey Brin and Lawrence Page, developed PageRank, a ranking system based on RankDex that hinged on the HTML anchor tags in web pages. It would recursively sort the web pages indexed by the Googlebot crawler into a logical hierarchy, assigning each page a score based on which pages linked to it and which pages it linked to. The higher the score, the more prominent the position in search results.
WHY CHANGE WEB SEARCH, WHY NOW?
PageRank and RankDex have several problems. First, a link from a high-ranking website is not necessarily a positive reference to the linked page or document. For instance, your high-scoring website can link to a page with the text "I really hate x, y, z", and the linked page will still inherit a higher rank from yours even though that was not the intention. The second problem is UGC, or user-generated content: a website may have a high ranking score based on its popularity, but its authors are random people who arbitrarily join the website, and those people, who have little or no history with the site, instantly inherit its rank. Another stinging problem with the PageRank system is that false or misleading information can rise to the top of search results because the sorting of World Wide Web content is purely automated.
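To make the first problem concrete, here is a minimal sketch of link-based ranking. It is the simplified textbook power-iteration form of PageRank, not Google's production code. The class name SimplePageRank, the five-page graph, and the damping value of 0.85 are assumptions chosen for illustration. The point is that the algorithm only sees that a link exists, never what the link text says.

```java
import java.util.Arrays;

// Simplified PageRank by power iteration. Illustrative only: this is the
// textbook formulation, not any search engine's production ranking code.
final class SimplePageRank {

    // links[i] lists the indexes of the pages that page i links to.
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // Every page receives a small base share.
            Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                if (links[page].length == 0) {
                    // A page with no outgoing links spreads its rank evenly.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * rank[page] / n;
                    }
                } else {
                    // Otherwise its rank is split among the pages it links to,
                    // regardless of what the link text says.
                    for (int target : links[page]) {
                        next[target] += damping * rank[page] / links[page].length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical graph: pages 1 and 2 link to popular page 0.
        // Page 0 links only to page 3, with the text "I really hate this page".
        // Page 4 is a comparable page that nobody links to.
        int[][] links = { {3}, {0}, {0}, {}, {} };
        double[] r = rank(links, 0.85, 50);
        System.out.printf("criticized page 3: %.4f, unlinked page 4: %.4f%n", r[3], r[4]);
    }
}
```

In this sketch the criticized page ends up ranked above the page nobody mentions, which is the unintended endorsement described above.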
HOW CAN IT BE DONE?
CDNPAL search aims to provide a better mechanism for sorting content on the World Wide Web. We have a mechanism which ensures that pages and documents on the web carry the weight that was actually intended. Our system does not limit voting rights to content creators, but gives everyone an opportunity to weigh in on which content is more important, and which is more prominent and relevant as a search result. We do this through plugins for multiple browsers and through iOS and Android applications, where users can opt in to sending their data to the system. That data is used to calculate the importance of web pages.
The CDNPAL engine is also different and special in that it's not simply a visualization engine which just shows users end search results. It's like a Lego set where all the data collected is available via a REST API, and not just as web search results. As a user, you can, for instance, pull the entire WWW structure of the websites in an entire city as JSON, and pull down the indexed hierarchy of their websites or online documents. You can do this because, while it's indexing web content, our crawler collects not only information about web pages and their hierarchies but also all the network properties and geo-locational properties.
Lastly, CDNPAL is different in that it re-indexes web pages as Open Graph objects you can use in social graphs in conjunction with your own social information or to use in any way from presentations to applications.
SO WHAT'S ALREADY THERE?
The project already has some important resources, such as Amazon AWS 3-year EC2 reserved instances, including high-CPU instances in both the Virginia and Northern California regions, and some of the base modules required for a search engine are partially complete.
A modern search engine consists of 3 parts: the crawler, the link map and the index. You can check out some alpha source code for a reduced-functionality, single-process version of the crawler. It is not meant to be run in a production environment, and some basic functionality like robots.txt processing is not included in this download version.
The basic components have already been coded, but we need to wire them together and finish a working build of the entire search engine, plus API clients for mobile devices and the web. We also need to polish the code and make sure that it works properly at a much larger scale than the skeleton framework we have working now.
WHAT WILL THE MONEY BE USED FOR?
The money will be used to pay for costs associated with writing the remaining project application code, and to pay for Amazon Web Services for costs associated with indexing web documents. A small amount will be used towards creating schwag for rewards like hats, posters and picture books.
This project aims to deliver the source code and schwag for the search engine to project contributors and to bridge the divide between the online resources the public uses, and those the public has transparent access to.
I STILL DON'T UNDERSTAND THE CONCEPT :(
What if TV was guided not by Nielsen ratings, but by TV shows mentioning other TV shows? That's Google. We want to introduce Nielsen-rating-style sorting of normalized web documents, with OpenGraph being the normalization factor and Hadoop as our sorting mechanism.
1. We make various documents on the web normal, by formatting their characteristics into OpenGraph objects.
2. We store those objects in a huge, scalable database.
3. We poll users and sort references to those objects based on what users want.
4. We show the most popular for any given type or category to users that are looking for something.
5. The web is a hit TV show.
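Steps 2 through 4 can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class RatingsIndex and its methods are hypothetical names, in-memory maps stand in for the scalable database, and a plain sort stands in for the Hadoop sorting tier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of ratings-style ranking: users vote on normalized
// objects, and results for a type are ordered by vote count.
final class RatingsIndex {
    // Step 2 stand-in: object URL -> OpenGraph type (a map, not a real database).
    private final Map<String, String> typeByUrl = new HashMap<>();
    // Step 3 stand-in: object URL -> number of opt-in user votes.
    private final Map<String, Integer> votesByUrl = new HashMap<>();

    void store(String url, String ogType) {
        typeByUrl.put(url, ogType);
        votesByUrl.putIfAbsent(url, 0);
    }

    void vote(String url) {
        if (!typeByUrl.containsKey(url)) {
            throw new IllegalArgumentException("unknown object: " + url);
        }
        votesByUrl.merge(url, 1, Integer::sum);
    }

    // Step 4: most-voted objects of the given type first; ties broken by URL.
    List<String> top(String ogType, int limit) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> e : typeByUrl.entrySet()) {
            if (e.getValue().equals(ogType)) {
                matches.add(e.getKey());
            }
        }
        matches.sort((a, b) -> {
            int byVotes = Integer.compare(votesByUrl.get(b), votesByUrl.get(a));
            return byVotes != 0 ? byVotes : a.compareTo(b);
        });
        return matches.size() > limit
                ? new ArrayList<>(matches.subList(0, limit))
                : matches;
    }
}
```

Calling store for each crawled object, vote for each opt-in user signal, and top("video.movie", 10) for a search would return the ten most-voted objects of that type. The real system would weigh richer signals than a raw count.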
We noticed that Kickstarter is mainly a community of gamers, so we have added a new project reward at $1. We are avid users of Hype, the HTML5 animation creator for Mac. As an extra reward we will create an animated game in the same theme as the Hype HTML5 movie on cdnpal.com, where you get to shoot our mascot rabbit Duck Hunt style. This is not a huge challenge for us, so we will have this ready by June 1, 2012 for you to play.
So if you don't like search engines, social graphs, or what not, this is a really good reason to contribute to the project.
At this point, we have modified the crawler to only grab Open Graph information, or create it from document data, for later compilation. The crawler also records network properties, such as the location of the remote website's server, and location hints for the OG content, such as a business address found in the contact information. By focusing only on what we want to achieve and leaving traditional search behind, we have a greater chance of giving users something brand new.
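As a rough illustration of "grab Open Graph information, or create it from document data", the sketch below pulls og: meta tags out of a page and falls back to the page title and URL when they are missing. It is not the project's crawler code. The class OpenGraphExtractor is a hypothetical name, the regular expression only handles tags written with property before content, and a real crawler would use a proper HTML parser.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: collect Open Graph properties from HTML, creating
// missing ones from document data, and serialize them as a flat JSON object.
final class OpenGraphExtractor {
    // Matches <meta property="og:NAME" content="VALUE"> with either quote style.
    private static final Pattern OG_META = Pattern.compile(
            "<meta\\s+property=([\"'])og:(.*?)\\1\\s+content=([\"'])(.*?)\\3",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static Map<String, String> extract(String html, String pageUrl) {
        Map<String, String> og = new LinkedHashMap<>();
        Matcher m = OG_META.matcher(html);
        while (m.find()) {
            // Keep the first value seen for each property.
            og.putIfAbsent(m.group(2).toLowerCase(Locale.ROOT), m.group(4));
        }
        // Fallbacks when the page carries no Open Graph markup of its own.
        if (!og.containsKey("title")) {
            Matcher t = TITLE.matcher(html);
            if (t.find()) {
                og.put("title", t.group(1).trim());
            }
        }
        og.putIfAbsent("url", pageUrl);
        og.putIfAbsent("type", "website"); // the Open Graph default for unmarked pages
        return og;
    }

    // Minimal JSON writer: escapes only backslashes and double quotes.
    static String toJson(Map<String, String> og) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : og.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

The JSON string returned by toJson is the kind of value that could be written to a store such as HBase. The network and location properties mentioned above would be added as extra fields.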
Here is a simplified flow chart:
WE NEED DEVELOPERS RIGHT NOW
If, for whatever reason, you don't want to help us financially, there's another way you can help. Please contact us to help us write the code.
One of our large problems is the high cost of educated and/or experienced Java programming labor in Southern California, along with the legal overhead and paperwork of having employees. We do work with some programmers outside the country, but ultimately we need people here that we can do status meetings with every day. We are also Java programmers ourselves and need to make our team bigger.
Show up at our office in sunny Los Angeles, California, and have a seat at a lovely workstation with dual monitors, which we will provide, so you can help us finish our working prototypes. There are no set work hours, and you can do whatever you think is helpful to the project. This would be a good opportunity for a college student on summer vacation.
Your resume will shine the word "impressive" after you get done interning for this project!
As a plus, you will also be in close proximity to Las Vegas and Burning Man and we can help you find accommodations.
Do you have burning questions about this project you want to ask right now?
We've noticed that Kickstarter is largely a community of gamers, so we invite you to contact Chris on the Sony PlayStation Network. His gamer tag is lacoder. Feel free to ask questions, add him or start a chat if you want.
Images, HTML documents, XML documents, Audio files, Video files, and anything else that can be quantitatively described by OpenGraph data and crawled publicly online.
Yes. There will be proper comments before every method in every class, and you will get a full JavaDoc of all source packages, except for mobile.
As an example here is the JavaDoc for the current download build of the sFTP desktop Java app we made a couple years ago called Transfolia: http://www.cdnpal.com/transfolia/doc/
We have a cross-platform sFTP application called Transfolia that we made a couple of years ago. It's also written in Java, and it works reasonably well considering that we were working full time when we made it and only had nights to get it done.
You can register the software with the username: test and password: test
If you don't know how FTP or sFTP works, then you should probably not use the software. It's like a cross-platform version of WinSCP or CuteFTP. It works best with sFTP and has problems with some FTP hosts like GoDaddy. We didn't have the resources to perfect all the protocols, so we focused on sFTP because it's secure. It works super well with EC2 and we use it all the time. Note that you have to use .pem files instead of PuTTY-encoded .ppk files for private keys. For those who don't know sFTP, it's a file transfer protocol that runs over your SSH connection, so it uses your SSH port. Nearly all Linux and BSD servers that run SSH also have an sFTP server available.
I'm not technical, I don't really understand the details of how search works, what does this project want to accomplish?
So what we are really saying is that for the past 15 years the web has been largely dominated by Google's way of organizing what you search for. We have a new way of organizing the World Wide Web that we think will work better.
We have an advertising system, which you can see by viewing the video at http://www.cpcpal.com, and we plan to use it with the finished code. The credits can be redeemed on that system for advertising. The original system used dollars, but we are switching to a credit-based system.
We used to use 3rd party APIs to temporarily provide search results such as Yahoo! BOSS, but those generic results are very similar to Google's and do not really offer anything new to the user.
We don't merely want to show another company's search results, like Yahoo's or Microsoft Bing's, via an API. We don't want to simply mash up or visualize existing search indexes at all. We want to finish our search engine code, and really let people take advantage of the benefits that our system offers. We also want to make sure that the source code is open to anybody that buys the $25 reward so they can start their own search website or mobile app, use it for educational or commercial research, or use it in a project.
The reason you have to pay $25 for the project source code is that the source code is largely what will be produced by the reward donations, so that is largely the product that is being funded here.
Other projects have made the source code free, and mailed donors a CD copy of the code. But let's face it, that's just wasting plastic, and nobody really uses CD-ROMs anymore. We feel that the environment would be better off if we just made the download version the reward instead, so we did.
Yes, it will be totally free to sign up and use just like Google or Yahoo. Advertising and ad credits however will not be free, hence they are valued in rewards.
Yes and no. There will be a free REST API key which is limited to a low number of requests per day. The key you get here in the rewards has no preset limit.
We'll end up finishing the project anyway with private investments. It will take longer, you won't get any of the rewards, and we may not be able to afford to make all the components open source. If you're at all interested in search and moving it forward, please take a good hard look at our rewards and the merit of the project.
We realize that this isn't a fun game that you can play, but we assure you that this project does have entertainment value in the same way that social networking itself is entertaining. The fancy artwork you love to look at will be created by the users of the project, not by the project itself.
If we knew how to put game sprites into Hadoop to motivate backers, we would.
The significance of this code is to show that we are actually working on this. Our previous version of the technology was a hybrid of PHP and C++, and this snapshot was the beginning of the port to Java with the Apache open source version of Bigtable (HBase) and the DataNucleus open source version of Datastore in GWT.
So there are a lot of CS-related problems with the code. We have fixed some of them and are working to fix the rest. Things like multi-threading, robots.txt processing, multiple document types, language processing, etc. are not supported by the demo code. Also, notably, the queuing of the hyperlinks to be crawled is not ideal.
When the crawler logic was originally written in PHP and C++, a lot of the sorting which should have been encapsulated in another process was put into the crawler. So that logic is now performed by the Hadoop code.
We have also gotten rid of the WebPage, WebTag and WebPageTreeNode objects and their factories. Now the crawler is only grabbing, optionally creating, and storing OpenGraph objects as JSON in HBase. So a great deal of the example code from the zip is now gone. Also the inline parsing of URLs has been removed and replaced. The domain queuing has been removed and replaced. A lot of these improvements are due to the fact that multiple instances of the crawler run in parallel on different machines.
The old example code which is linked to here originally picked up the anchor text and passed it on to a linker in a sorted HashMap, to give some initial clues to sort on before user polling was processed. The new model leaves all the complex logic to the distributed sorting tier with Hadoop. So now the crawler simply collects OG objects, creates immutable HDFS data with HBase, and does nothing else. All the sorting and mapping is now done in the mapping phase. The old example code created a hierarchy of WebPage objects with a tree structure. The new code instead simply uses a link map to map OpenGraph objects to their parent using the parent's key, freeing up lots of processing and memory.
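The difference between the old tree and the new link map can be sketched as follows. This is an assumption-laden illustration, not the project's code: the class LinkMap is a hypothetical name, and the keys are assumed to be plain strings such as a row key or URL. Each object stores only its parent's key, so the hierarchy can be rebuilt on demand instead of being held in memory as nested objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a link map: each object key points at its parent's
// key, replacing a tree of in-memory WebPage-style objects.
final class LinkMap {
    private final Map<String, String> parentByKey = new HashMap<>();

    void link(String childKey, String parentKey) {
        parentByKey.put(childKey, parentKey);
    }

    // Follows parent keys upward and returns the path, starting at key.
    // Stops if a key repeats, so a cycle of links cannot loop forever.
    List<String> pathToRoot(String key) {
        List<String> path = new ArrayList<>();
        String current = key;
        while (current != null && !path.contains(current)) {
            path.add(current);
            current = parentByKey.get(current);
        }
        return path;
    }

    // Rebuilds one level of the hierarchy on demand, sorted for stable output.
    List<String> childrenOf(String parentKey) {
        List<String> children = new ArrayList<>();
        for (Map.Entry<String, String> e : parentByKey.entrySet()) {
            if (e.getValue().equals(parentKey)) {
                children.add(e.getKey());
            }
        }
        Collections.sort(children);
        return children;
    }
}
```

In a distributed setting the same child-to-parent pairs could be emitted as key and value records for the sorting tier. The in-memory map here only shows the shape of the data.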
So we want the people who pay for the $25 reward to have the finished and working crawler code as well as the rest of the search engine source code. We posted some early example code of the Java port to show people that it's a work in progress and that we are in fact working on it. The code is more advanced now and there are 3 backend modules including the crawler.
Our crawler will use 2 tiers to determine the location of pages. The first is the business address in the page HTML, with a fallback to the physical location of the machine running the website (* looked up via the licensed DB from http://www.maxmind.com). For picture documents, it will attempt to read the location from Exif information.
We will add our own metadata to the OpenGraph object as optional data to describe the location of the audio, video, image, and HTML documents.
* IP lookups will be disabled in the open source version, as we cannot redistribute the proprietary location data we use.
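The two-tier lookup can be sketched as a simple fallback chain. Everything named here is hypothetical: the class LocationResolver, its two lookup functions, and the use of plain strings for locations are assumptions for illustration, and no real geolocation library is called. The IP tier is passed in as a function so that the open source build can supply one that always returns nothing.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the two-tier location lookup: try the business
// address found in the page first, then fall back to the server's location.
final class LocationResolver {
    private final Function<String, Optional<String>> addressFromHtml;
    private final Function<String, Optional<String>> locationFromIp;

    LocationResolver(Function<String, Optional<String>> addressFromHtml,
                     Function<String, Optional<String>> locationFromIp) {
        this.addressFromHtml = addressFromHtml;
        this.locationFromIp = locationFromIp;
    }

    // A resolver with IP lookups disabled, as in the open source version.
    static LocationResolver withoutIpLookup(
            Function<String, Optional<String>> addressFromHtml) {
        return new LocationResolver(addressFromHtml, ip -> Optional.<String>empty());
    }

    Optional<String> resolve(String html, String serverIp) {
        // Tier 1: a business address in the page HTML.
        Optional<String> address = addressFromHtml.apply(html);
        if (address.isPresent()) {
            return address;
        }
        // Tier 2: the physical location of the machine serving the page.
        return locationFromIp.apply(serverIp);
    }
}
```

The Exif path for picture documents is left out of this sketch. It would be a third function tried before the IP fallback for image content.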
Not at this time because we're not sure if the project will get funded here on Kickstarter or not, but if you like we can run a JavaDoc on it and show you that. Get a hold of Chris on PSN by adding lacoder if that interests you.