The contributors to Libscore, including our own Creative Director Jesse Chase, wanted to offer this post as a thank you for all the support the project has received. Julian Shapiro launched Libscore last month hoping that the developer community would find the tool useful, and he continues to be grateful for all of the positivity and constructive feedback from across the web.
In this post, we'll break down the technology that Libscore leverages and discuss some of the challenges of getting it off the ground. We were also lucky enough to talk with Julian and get some insight into where he sees the project going.
Unlike traditional web crawlers, Libscore loads each website into a headless browser and thoroughly scans its run-time environment. This allows Libscore to monitor the operating environment of each website and detect as many libraries as possible – even those that have been pre-bundled or required as modules. The tradeoff, of course, is that running one million headless browser sessions is much more resource-intensive than performing basic cURL requests and parsing static HTML.
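To give a feel for run-time detection, here is a minimal sketch of a detector function that could be evaluated inside the page context. It is not Libscore's actual detection code – the function and the libraries checked are illustrative – but it shows the general idea: inspect the page's global object for the fingerprints each library leaves behind, rather than grepping static HTML for script tags.

```javascript
// Illustrative sketch of run-time library detection (not Libscore's actual
// code). Given a page's global object, report which known libraries are
// present and their versions.
function detectLibraries(windowObj) {
  // Map of library names to detector functions that return a version
  // string when the library is present, or null otherwise.
  const detectors = {
    jquery: (w) => (w.jQuery && w.jQuery.fn && w.jQuery.fn.jquery) || null,
    modernizr: (w) => (w.Modernizr && w.Modernizr._version) || null,
    underscore: (w) => (w._ && w._.VERSION) || null,
  };

  const found = {};
  for (const name of Object.keys(detectors)) {
    let version = null;
    try {
      version = detectors[name](windowObj);
    } catch (e) {
      // A detector throwing inside an unfamiliar page must never abort the crawl.
    }
    if (version !== null) found[name] = version;
  }
  return found;
}
```

Because the detectors run against the live global object, a library that was concatenated into a bundle still gets detected, as long as it exposes its usual global.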
On the backend, crawls are queued via Redis, with the results stored in MongoDB. Both services are loaded fully into RAM, which allows our RESTful API to serve requests faster than it could by querying the disk. The main bottleneck to crawl concurrency is network bandwidth, but thanks to DigitalOcean, it was a breeze to repeatedly clone instances and run crawls during off-peak times in different regions. Ultimately, using just a few high-RAM DigitalOcean instances, we parse 600 websites per minute and complete the entire crawl in under 36 hours at the end of each month.
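The queue/worker pattern described above can be sketched as follows. This is a simplified stand-in, not the production code: the in-memory array and `Map` here take the place of the real Redis queue and MongoDB store, and the function names are hypothetical. The shape, however, is the same – each worker drains domains off a shared queue, crawls them, and records the result.

```javascript
// Simplified sketch of a crawl worker (hypothetical names; an in-memory
// array and Map stand in for the real Redis queue and MongoDB store).
// Each worker pulls domains off the queue until it is drained, crawls
// each one, and stores the detected libraries.
async function runWorker(queue, store, crawlFn) {
  let domain;
  while ((domain = queue.shift()) !== undefined) {
    const libraries = await crawlFn(domain);
    store.set(domain, { domain, libraries, crawledAt: Date.now() });
  }
}
```

Cloning an instance effectively adds more of these workers pulling from the same queue, which is why scaling out during off-peak hours was straightforward.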
As the crawler runs, raw library usage data for each site is appended to a master JSON file, which we simply read from the file system with Node.js. Once all the raw usage data is collected, we start a process dubbed "ingestion", which is responsible for aggregating the results and making them accessible via the API. We initially attempted to load the entire dataset into RAM to perform our calculations, but quickly ran into a quirky problem: V8 cannot allocate more than approximately 1GB of memory for arrays. For now, we split the raw dump into smaller files to bypass the memory limit, though in the future we might rewrite the project in a language and environment better suited to the task.
While Libscore currently serves as an invaluable tool for surfacing library adoption data, the future is even more exciting. To illustrate, let's jump ahead 6 months – smack in the middle of summer. At this point, Libscore will have crawled the top million sites 6 times (6 million domain crawls!), yielding rich month-over-month trend data on library usage.
A soon-to-be-released time series graph, with the ability to plot multiple libraries over the same period, will give developers new insight into how library usage changes over time. For example, users will be able to see why a library's usage plummeted from one month to the next – potentially due to the increased adoption of a competing library. Soon, this data will be fully visualized.