Tuesday, May 15, 2018

2018-05-15: Archives Unleashed: Toronto Datathon Trip Report

The Archives Unleashed team (pictured below) hosted a two-day datathon, April 26-27, 2018, at the University of Toronto’s Robarts Library. This time around, Shawn Jones and I were selected to represent the Web Science and Digital Libraries (WSDL) research group from Old Dominion University. This event was the first in a series of four planned datathons to give researchers, archivists, computer scientists, and many others the opportunity to get hands-on experience with the Archives Unleashed Toolkit (AUT) and provide valuable feedback to the team. The AUT facilitates analysis and processing of web archives at scale and the datathons are designed to help participants find ways to incorporate these tools into their own workflow. Check out the Archives Unleashed team on Twitter and their website to find other ways to get involved and stay up to date with the work they’re doing.

Archives Unleashed datathon organizers (left to right): Nich Worby, Ryan Deschamps, Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz

Day 1 - April 26, 2018
Ian Milligan kicked off the event by talking about why these datathons are so important to the Archives Unleashed project team. For the project to be a success, the team needs to: build a community, create a common vision for web archiving tool development, avoid black box systems that nobody really understands, and equip the community with these tools to be able to work as a collective.

Many presentations, conversations, and tweets during the event indicated that working with web archives, particularly WARC files, can be messy, intimidating, and really difficult. The AUT tries to help simplify the process by breaking it down into four parts:
  1. Filter - focus on a date range, a single domain, or specific content
  2. Analyze - extract information that might be useful such as links, tags, named entities, etc.
  3. Aggregate - summarize the analysis by counting, finding maximum values, averages, etc.
  4. Visualize - create tables from the results or files for use in external applications, such as Gephi

We were encouraged to use the AUT throughout the event to go through the process of filtering, analyzing, aggregating, and visualizing for ourselves. Multiple datasets were provided to us and preloaded onto powerful virtual machines, provided by Compute Canada, in an effort to maximize the time spent working with the AUT instead of fiddling with settings and data transfers.

Now that we knew the who, what, and why of the datathon, it was time to create our teams and get to work. We wrote down research questions (pink), datasets (green), and technologies/techniques (yellow) we were interested in using on sticky notes and posted them on a whiteboard. Teams started to form naturally from the discussion, but not very quickly, until we got a little help from Jimmy and Ian to keep things moving.

I worked with Jayanthy Chengan, Justin Littman, Shawn Walker, and Russell White. We wanted to use the #neveragain tweet dataset to see if we could filter out spam links and create a list of better quality seed URLs for archiving. Our main goal was to use the AUT without relying on other tools that we may have already been familiar with. Many of us had never even heard of Scala, the language that AUT is written in. We had all worked through the homework leading up to the datathon, but it still took us a few hours to get over the initial jitters and become productive.

Scala was a point of contention among many participants. Why not use Python or another language that more people are familiar with and can easily interface with existing tools? Jimmy had an answer ready, as he did for every question thrown at him over the course of the event.

Around 5pm, it was time for dinner at the Duke of York. My team decided against trying to get everyone up and running on their local machines, opting instead to enjoy dinner and come back fresh for day 2.

Day 2 - April 27, 2018 
Day 2 began with what felt like an epiphany for our team; in reality, it was more of a gradual ramp-up.

Either way, we learned from the hiccups of the first day and began working at a much faster pace. All of the teams worked right up until the deadline to turn in slides, with a few coffee breaks and lightning talks sprinkled throughout. I'll include more information on the lightning talks and team presentations as they become available.

Lightning Talks
  • Jimmy Lin led a brainstorming session about moving the AUT from RDD to DataFrames. Samantha Fritz posted a summary of the feedback received where you can participate in the discussion.
  • Nick Ruest talked about Warclight, a tool that helps with discovery within a WARC collection. He showed off a demo of it after giving us a little background information.
  • Shawn Jones presented the five minute version of a blog post he wrote last year that talks about summarizing web archive collections.
  • Justin Littman presented TweetSets, a service that allows a user to derive their own Twitter dataset from existing ones. You can filter by Tweet attributes such as text, hashtags, mentions, date created, etc.
  • Shawn Walker talked about the idea of using something similar to a credit score to warn users, in realtime, of the risk that content they're viewing may be misinformation.

At 3:30pm, Ian introduced the teams and we began final presentations right on time.

Team Make Tweets Great Again (Shawn Jones' team) used a dataset including tweets sent to @realdonaldtrump between June 2017 and now, along with tweets with #MAGA in them from June - October 2017. A few of the questions they had were:

  • As a Washington insider falls from grace, how quickly do those active in #MAGA and @realDonaldTrump shift allegiance?
  • Did sentiment change towards Bannon before and after he was fired by Trump?

They used positive or negative sentiment (emojis and text-based analysis) as an indicator of shifting allegiance towards a person. There was a decline in the sentiment rating for Steve Bannon when he was fired in August 2017, but the real takeaway is that people really love the 😂 emoji. Shawn worked with Jacqueline Whyte Appleby and Amanda Oliver. Jacqueline decided to focus on Bannon for the analysis, Amanda came up with the idea to use emojis, and Shawn used twarc to gather the information they would need.

Team Pipeline Research used datasets made up of WARC files of pipeline activism and Canadian political party pages, along with tweets (#NoASP, #NoDAPL, #StopKM, #KinderMorgan). From the datasets, they were able to generate word clouds, find the most frequently used image, perform link analysis between pages, and analyze the frequency of hashtags used in the tweets. Through the analysis process, they discovered that some URLs had made it into the collection erroneously.

Team Spam Links (my team) used a dataset including tweets with various hashtags related to the Never Again/March for Our Lives movement. The question we wanted to answer was “What is the best algorithm for extracting quality seed URLs from social media data?”. We created a Top 50 list of URLs tweeted in the unfiltered dataset and coded them as relevant, not relevant, or indeterminate. We then came up with multiple factors to filter the dataset by (users with/without the default Twitter profile picture, with/without bio in profile, user follower counts, including/excluding retweets, etc.) and generated a new Top 50 list each time. The original Top 50 list was then compared to each of the filtered Top 50 lists.

We didn’t find a significant change in the rankings of the spam URLs, but we think that’s because there just weren’t that many in the dataset’s Top 50 to begin with. Running these experiments against other datasets and expanding the Top 50 to maybe the Top 100 or more would likely yield better results. Special thanks to Justin and Shawn Walker for getting us started and doing the heavy lifting, Russell for coding all of the URLs, and Jayanthy for figuring out Scala with me.
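For illustration, the shape of our experiment can be sketched in a few lines of Python. The tweet fields, filter, and URLs below are made up for the example; the real dataset and filters were richer:

```python
from collections import Counter

# Toy tweets: (url, has_default_profile_image, follower_count)
tweets = [
    ("http://news.example/story", False, 5000),
    ("http://news.example/story", False, 120),
    ("http://spam.example/win", True, 3),
    ("http://spam.example/win", True, 2),
    ("http://spam.example/win", True, 1),
    ("http://blog.example/post", False, 800),
]

def top_urls(tweets, n=2):
    """Rank URLs by how many tweets contain them (our 'Top 50' was n=50)."""
    return [url for url, _ in Counter(t[0] for t in tweets).most_common(n)]

# Unfiltered ranking
baseline = top_urls(tweets)

# One candidate filter: drop users with the default Twitter profile picture
filtered = top_urls([t for t in tweets if not t[1]])

# Compare the two rankings by simple overlap
overlap = len(set(baseline) & set(filtered)) / len(baseline)
print(baseline, filtered, overlap)
```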

Team BC Teacher Labour was the final group of the day and they used a dataset from Archive-It about the British Columbia Teachers’ Labour Dispute. While exploring the dataset with the AUT, they created word clouds showing the frequency of words compared between multiple domains, network graphs showing named entities and how they related to each other, and many others. The most interesting visual they created was an animated GIF that quickly showed the primary image from each memento, giving a good overview of the types of images in the collection. 

Team Just Kidding, There’s One More Thing was a team of one: Jimmy Lin. Jimmy was busy listening to feedback about Scala vs. Python and working on his own secret project. He created a new branch of the AUT running in a Python environment, enabling some of the things people were asking for at the beginning of Day 1. Awesome.

After Jimmy’s surprise, the organizers and teams voted for the winning project. All of the projects were great, but there can only be one winner and that was Team Make Tweets Great Again! I’m still convinced there’s a correlation between the number of emojis in their presentation, their team name, and the number of votes they received but 🤷🏻‍♂️. Just kidding 😂, your presentation was 🔥. Congratulations 🎊 to Shawn and his team!

I’m brand new to the world of web archiving and this was my first time attending an event like this, so I had some trepidation leading up to the first day. However, I quickly discovered that the organizers and participants, regardless of skill level or background, were there to learn and willing to share their own knowledge. I would highly encourage anyone, especially if you’re in a situation similar to mine, to apply for the Vancouver datathon that was announced at the end of Day 2 or one of the 2019 datathons taking place in the United States.

Thanks again to the organizers (Ian Milligan, Jimmy Lin, Nick Ruest, Samantha Fritz, Ryan Deschamps, and Nich Worby), their partners, and the University of Toronto for hosting us. Looking forward to the next one!

- Brian Griffin

Friday, May 4, 2018

2018-05-04: An exploration of URL diversity measures

Recently, as part of a research effort to describe a collection of URLs, I was faced with the problem of identifying a quantitative measure that indicates how many different kinds of URLs there are in a collection. In other words, what is the level of diversity in a collection of URLs? Ideally, a diversity measure should produce a normalized value between 0 and 1. A value of 0 means no diversity, for example, a collection of duplicate URLs (Fig. 2, first row, first column). In contrast, a diversity value of 1 indicates maximum diversity, i.e., all different URLs (Fig. 2, first row, last column):
1. http://www.cnn.com/path/to/story?p=v
2. https://www.vox.com/path/to/story
3. https://www.foxnews.com/path/to/story
Surprisingly, I did not find a standard URL diversity measure in the Web Science community, so I introduced the WSDL diversity index (described below). I acknowledge there may be other URL diversity measures in the Web Science community that exist under different names. 
Not surprisingly, biologists (especially conservation biologists) have multiple measures for quantifying biodiversity, called diversity indices. In this blog post, I will briefly describe how some common biodiversity measures, in addition to the WSDL diversity index, can be used to quantify URL diversity. Additionally, I have provided recommendations for choosing a URL diversity measure depending on the problem domain, as well as a simple Python script that reads a text file containing URLs and produces the URL diversity scores for the different measures introduced in this post.
Fig. 2: WSDL URL diversity matrix of examples across multiple policies (URL, hostname, and domain). For all policies, the schemes, URL parameters, and fragments are stripped before calculation. For hostname diversity calculation, only the host is considered, and for domain diversity calculation, only the domain is considered.
I believe the problem of quantifying how many different species there are in a biological community is very similar to the problem of quantifying how many different URLs there are in a collection. Biodiversity measures (or diversity indices) express the degree of variety in a community. Such measures answer questions such as: does a community of mushrooms include only one, two, or three species of mushrooms? Similarly, a URL diversity measure expresses the degree of variety in a collection of URLs and answers questions such as: does a collection of URLs represent only one (e.g., cnn.com), two (cnn.com and foxnews.com), or three (cnn.com, foxnews.com, and nytimes.com) domains? Even though biodiversity indices and URL diversity measures are similar, it is important to note that since the two domains are different, their respective diversity measures reflect those differences. For example, the WSDL diversity index I introduce later does not reward duplicate URLs because duplicate URLs do not increase the informational value of a URL collection.

URL Diversity Measures

Let us consider the WSDL diversity index for quantifying URL diversity, and then apply popular biodiversity indices to the same task.

URL preprocessing:
Since URLs have aliases, the following steps were taken before the URL diversity was calculated.

1. Scheme removal: This transforms a URL such as https://www.cnn.com/path/to/story?p=v#top into www.cnn.com/path/to/story?p=v#top.

2. URL parameter and fragment removal: This transforms www.cnn.com/path/to/story?p=v#top into www.cnn.com/path/to/story.

3. Multi-policy and combined (or unified) policy URL diversity: For the WSDL diversity index (introduced below), URL diversity can be calculated for multiple separate policies such as URL (www.cnn.com/path/to/story), domain (cnn.com), or hostname (www.cnn.com). For the biodiversity measures introduced, URL diversity can also be calculated by combining policies, for example, by combining the hostname (or domain) with URL paths. This involves considering the hostnames (or domains) as the species and the URL paths as the individuals. I call this combined-policy approach of calculating URL diversity unified diversity.
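The preprocessing and policy steps above can be sketched in a few lines of standard-library Python; note that the domain extraction here is deliberately naive (it takes the last two labels and would mishandle suffixes like .co.uk):

```python
from urllib.parse import urlparse

def preprocess(url):
    """Strip the scheme, URL parameters, and fragment (steps 1 and 2)."""
    p = urlparse(url)
    return p.netloc + p.path

def apply_policy(url, policy="url"):
    """Reduce a preprocessed URL according to a policy (URL, hostname, or domain)."""
    stripped = preprocess(url)
    host = stripped.split("/")[0]
    if policy == "hostname":
        return host
    if policy == "domain":
        return ".".join(host.split(".")[-2:])  # naive: last two labels only
    return stripped

print(apply_policy("https://www.cnn.com/path/to/story?p=v#top"))              # www.cnn.com/path/to/story
print(apply_policy("https://www.cnn.com/path/to/story?p=v#top", "hostname"))  # www.cnn.com
print(apply_policy("https://www.cnn.com/path/to/story?p=v#top", "domain"))    # cnn.com
```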

WSDL diversity index:

The WSDL diversity index (Fig. 3) rewards variety, not duplication. It is the ratio of unique items (URLs, domain names, or hostnames) to the total number of items |C|. We subtract 1 from both numerator and denominator in order to normalize the index to the 0 to 1 range. A value of 0 (e.g., Fig. 2, first row, first column) is assigned to a list of duplicate URLs. A value of 1 is assigned to a list of distinct URLs (e.g., Fig. 2, first row, last column).
Fig. 3: The WSDL diversity index (Equation 1) and the explanation of variables. U represents the count of unique URLs (or species - R).  |C| represents the number of URLs (or individuals N).
Unlike the other biodiversity indices introduced next, the WSDL diversity index can be calculated for separate policies: URL, domain, and hostname. This is because the numerator of the formula considers uniqueness, not counts. In other words, the numerator operates over sets of URLs (no duplicates allowed), unlike the biodiversity measures, which operate over lists (duplicates allowed). Since the biodiversity measures introduced below take counts (counts of species) into account, calculating URL diversity across multiple policies results in the same diversity value unless the policies are combined (e.g., hostname combined with URL paths).
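Equation 1 is simple enough to express directly in Python. This sketch assumes the URLs have already been preprocessed and reduced to the desired policy (URL, hostname, or domain):

```python
def wsdl_diversity(items):
    """WSDL diversity index (Equation 1): (U - 1) / (|C| - 1),
    where U is the number of unique items and |C| the total count."""
    if len(items) <= 1:
        return 0.0  # zero or one item: no diversity by definition
    return (len(set(items)) - 1) / (len(items) - 1)

# A list of duplicates scores 0; a list of all-distinct items scores 1
print(wsdl_diversity(["cnn.com/a"] * 3))                            # 0.0
print(wsdl_diversity(["cnn.com/a", "vox.com/b", "foxnews.com/c"]))  # 1.0
```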

The Simpson's diversity index (Fig. 4, equation 2) is a common diversity measure in Ecology that quantifies the degree of biodiversity (variety of species) in a community of organisms. It is also known as the Herfindahl–Hirschman index (HHI) in Economics and the Hunter-Gaston index in Microbiology. The index simultaneously quantifies two quantities: the richness (number of different kinds of organisms) and the evenness (the proportion of each species present) in a bio-community. Additionally, the index produces diversity values ranging between 0 and 1, where 0 means no diversity and 1 means maximum diversity.
Fig. 4: Simpson's diversity index (Equation 2) and Shannon's evenness index (Equation 3) and the explanation of variables (R, n_i (n subscript i), and N) they share.
Applying the Simpson's diversity index to measure URL diversity:
There are multiple variants of the Simpson's diversity index; the variant shown in Fig. 4, equation 2 is applicable to measuring URL diversity in two ways. First, we may consider URLs as the species of biological organisms (Method 1). Second, we may consider the hostnames as the species (Method 2) and the URL paths as the individuals. There are three parameters needed to use Simpson's diversity index (Fig. 4):
Method 1:
  1. R - total number of species (or URLs)
  2. n_i (n subscript i) - number of individuals for a given species, and 
  3. N - total number of individuals
Method 2 (Unified diversity):
  1. R - total number of species where the Hostnames (or Domains) are the species
  2. n_i (n subscript i) - number of individuals (URL paths) for a given species, and
  3. N - total number of individuals
Fig. 5a applies Method 1 to calculate the URL diversity. In Fig. 5a, there are 3 different URLs interpreted as 3 species (R = 3) in the Simpson's diversity index formula (Fig. 4, equation. 2):
1. www.cnn.com/path/to/story1
2. www.cnn.com/path/to/story2
3. www.vox.com/path/to/story1

Fig. 5a: Example showing how the Simpson's diversity index and Shannon's evenness index can be applied to calculate URL diversity by setting three variables: R represents the number of species (URLs); in the example, there are 3 different URLs. n_i (n subscript i) represents the count of each species (n_1 = 3, n_2 = 1, and n_3 = 1). N represents the total number of individuals (URLs). The Simpson's diversity index (Fig. 4, equation 2) is 0.7, and Shannon's evenness index is 0.86.
The first URL has 3 copies, which can be interpreted as 3 individuals for the first species (n_1). The second and third URLs have 1 copy each; similarly, this can be interpreted as 1 individual each for the second (n_2) and third (n_3) species. In total (including duplicates) we have 5 URL individuals (N = 5). With all the parameters of the Simpson's diversity index (Fig. 4, equation 2) set, the diversity index for the example in Fig. 5a is 0.7.
Fig. 5b: Example showing how the Simpson's diversity index and Shannon's diversity index can be applied to calculate unified URL diversity by interpreting hostnames as the species (R) and the URL paths as the individuals (n_i). This method combines the hostname (or domain) with URL paths for URL diversity calculation.
Fig. 5b applies Method 2 to calculate the Unified diversity. In the unified diversity calculation, the policies are combined (Hostname with URL paths). For example, in Fig. 5b the species represent the Hostnames and the URL paths are considered the individuals.
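For the curious, here is a small Python sketch of the Simpson's diversity variant that reproduces the 0.7 from Fig. 5a; Method 2 uses the same formula, only with hostnames (or domains) as the species:

```python
def simpson_diversity(counts):
    """Simpson's diversity index: 1 - sum(n_i * (n_i - 1)) / (N * (N - 1)),
    where counts holds n_i, the number of individuals of each species."""
    N = sum(counts)
    if N <= 1:
        return 0.0
    return 1 - sum(n * (n - 1) for n in counts) / (N * (N - 1))

# Method 1 (Fig. 5a): 3 species (URLs) with counts 3, 1, and 1; N = 5
print(simpson_diversity([3, 1, 1]))  # 0.7
```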

Shannon-Wiener diversity index:

The Shannon-Wiener diversity index (or Shannon's diversity index) comes from information theory, where it is used to quantify the entropy in a string. In Ecology, similar to the Simpson's index, it is applied to quantify the biodiversity in a community. It simultaneously measures the richness (number of species) and the evenness (homogeneity of the species). The Shannon's Evenness Index (SEI) is the Shannon's diversity index divided by the maximum diversity, ln(R), which occurs when each species has the same frequency (maximum evenness).

Applying the SEI to measure URL diversity:
Fig. 6: Example showing how the URL diversity indices differ. For example, the WSDL diversity index rewards URL uniqueness and penalizes URL duplication since the duplication of URLs does not increase informational value, but the Shannon's evenness index rewards balance in the proportion of URLs. It is also important to note that calculation of URL diversity across multiple separate policies (URL, domain, and hostname) is only possible with the WSDL diversity index.
The variables in the SEI are the same as those in the Simpson's diversity index. Fig. 5a evaluates the SEI (Equation 3) for a set of URLs, while Fig. 5b calculates the unified URL diversity by interpreting the hostnames as species.
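A small Python sketch of the SEI that reproduces the 0.86 from Fig. 5a (natural log, as in Equation 3):

```python
import math

def shannon_evenness(counts):
    """Shannon's evenness index: H / ln(R), where H = -sum(p_i * ln(p_i))
    and p_i = n_i / N is the proportion of species i."""
    N = sum(counts)
    R = len(counts)
    if R <= 1:
        return 0.0  # a single species: no evenness to speak of
    H = -sum((n / N) * math.log(n / N) for n in counts if n > 0)
    return H / math.log(R)

# Fig. 5a: counts 3, 1, 1 -> SEI is approximately 0.86
print(round(shannon_evenness([3, 1, 1]), 2))  # 0.86
```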
I recommend using the WSDL diversity index for measuring URL diversity if the inclusion of a duplicate URL should not be rewarded and there is a need to calculate URL diversity across multiple separate policies (URL, domain, and hostname). Both the Simpson's diversity index and Shannon's evenness index strive to simultaneously capture richness and evenness. I believe Shannon's evenness index does a better job capturing evenness, which occurs when the proportion of species is distributed evenly (Fig. 6, first row, second column). I recommend using the Simpson's diversity and Shannon's evenness indices for URL diversity calculation when the definition of diversity is similar to the Ecological meaning of diversity and the presence of duplicate URLs need not penalize the overall diversity score. The source code that implements the URL diversity measures introduced here is publicly available.
-- Nwala (@acnwala)

Monday, April 30, 2018

2018-04-30: A High Fidelity MS Thesis, To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages

It is hard to believe that the time has come for me to write a wrap-up blog post about the adventure that was my Master's degree and the thesis that got me to this point. If you follow this blog with any regularity, you may remember two posts, written by myself, that were the genesis of my thesis topic:

Bonus points if you can guess the general topic of the thesis from the titles of those two blog posts. However, it is ok if you cannot, as I will give an oh-so-brief TL;DR. The replay problems with cnn.com were, sadly, your typical here-today-gone-tomorrow replay issues involving this little thing known as JavaScript. What we also found out, when replaying mementos of cnn.com from the major web archives, was that each web archive has its own unique and subtle variation of this thing called "replay". The next post about the curious case of mendeley.com user pages (A State Of Replay) further confirmed that for us.

We found that not only do variations exist in how web archives perform URL rewriting (URI-Rs to URI-Ms) but also that, depending on the replay scheme employed, web archives are modifying the JavaScript execution environment of the browser and the archived JavaScript code itself, beyond URL rewriting! As you can imagine, this left us asking a number of questions that led to the realization that the web archiving community lacks the terminology required to effectively describe the existing styles of replay and the modifications made to an archived web page and its embedded resources in order to facilitate replay.

Thus my thesis was born and is titled "To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages".

Since I am known around the WS-DL headquarters for my love of deep diving into the secrets of (securely) replaying JavaScript, I will keep the length of this blog post to a minimum. The thesis can be broken down into three parts: Styles Of Replay, Memento Modifications, and Auto-Generating Client-Side Rewriters. For more detailed information about my thesis, I have embedded my defense slides below, and the full text of the thesis has been made available.

Styles Of Replay

The existing styles of replaying mementos from web archives break down into two distinct models, "Wayback" and "Non-Wayback", and each has its own distinct styles. For the sake of simplicity and the length of this blog post, I will only (briefly) cover the replay styles of the "Wayback" model.

Non-Sandboxing Replay

Non-sandboxing replay is the style of replay that does not separate the replayed memento from the archive-controlled portion of replay, namely the banner. This style of replay is considered the OG (original gangster) way of replaying mementos simply because it was, at the time, the only way to replay mementos; it was introduced by the Internet Archive's Wayback Machine. To both clarify and illustrate what we mean by "does not separate the replayed memento from the archive-controlled portion of replay", consider the image below displaying the HTML and frame tree for a http://2016.makemepulse.com memento replayed from the Internet Archive on October 22, 2017.

As you can see from the image above, the archive's banner and the memento exist together on the same domain (web.archive.org). This implies that the replayed memento(s) can tamper with the banner (displayed during replay) and/or interfere with archive control over replay. For non-malicious examples of mementos containing HTML tags that can both tamper with the banner and interfere with archive control over replay, skip to the Replay Preserving Modifications section of this post. Now to address the recent claim that "memento(s) were hacked in the archive" and its correlation to non-sandboxing replay. Additional discussion on this topic can be found in Dr. Michael Nelson's blog post covering the case of blog.reidreport.com and in his presentation for the National Forum on Ethics and Archiving the Web (slides, trip report).

For a memento to be considered (actually) hacked, the web archive the memento is replayed (retrieved) from must have been compromised in a manner that requires the hack to be made within the data stores of the archive and does not involve user-initiated preservation. However, user-initiated preservation can only tamper with a non-hacked memento when it is replayed from an archive. The tampering occurs when an embedded resource, previously un-archived at the memento-datetime of the "hacked" memento, is archived from the future (present datetime relative to memento-datetime); it typically involves the usage of JavaScript. Unlike non-sandboxing replay, the next style of Wayback replay, Sandboxed Replay, directly addresses this issue and the issue of how to securely replay archived JavaScript. P.S. No signs of tampering, JavaScript-based or otherwise, were present in the blog.reidreport.com mementos from the Library of Congress. How do I know? Read my thesis and/or look over my thesis defense slides; I cover in detail what is involved in the mitigation of JavaScript-based memento tampering and what it actually looks like.

Sandboxed Replay

Sandboxed replay is the style of replay that separates the replayed memento from the archive-controlled portion of the page through replay isolation. Replay isolation is the usage of an iframe to sandbox the replayed memento, served from a different domain, away from the archive-controlled portion of replay. Because replay is split across two different domains (illustrated in the image below), one for the replay of the memento and one for the archive-controlled portion of replay (the banner), the memento cannot tamper with the archive's control over replay or the banner, thanks to the security restrictions the browser places on web pages from different origins, known as the Same-Origin Policy. Web archives employing sandboxed replay typically also perform the memento modification style known as Temporal Jailing. This style of replay is currently employed by Webrecorder and all web archives using Pywb (an open source, Python implementation of the Wayback Machine). For more information on the security issues involved in high-fidelity web archiving, see the talk entitled Thinking like a hacker: Security Considerations for High-Fidelity Web Archives given by Ilya Kreymer and Jack Cushman at WAC2017 (trip report), as well as Dr. David Rosenthal's commentary on the talk.

Memento Modifications

The modifications made by web archives to mementos in order to facilitate their replay can be broken down into three categories, the first of which is Archival Linkage.

Archival Linkage Modifications

Archival linkage modifications are made by the archive to a memento and its embedded resources in order to serve (replay) them from the archive. The archival linkage category of modifications is the most fundamental and necessary modification made to mementos by web archives, simply because it prevents the Zombie Apocalypse. You are probably already familiar with this category of memento modifications, as it is more commonly referred to as URL rewriting:

<!-- pre rewritten -->
<link rel="stylesheet" href="/foreverTime.css">
<!-- post rewritten -->
<link rel="stylesheet" href="/20171007035807cs_/foreverTime.css">

URL rewriting (archival linkage modification) ensures that you can relive (replay) mementos, not from the live web, but from the archive. Hence the necessity of this kind of memento modification. However, it is becoming necessary to seemingly damage mementos in order to simply replay them.
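To make the rewrite shown above concrete, here is a toy Python rewriter for stylesheet links. It is a sketch of the idea only; production Wayback-style rewriters handle far more element/attribute pairs and edge cases:

```python
import re

def rewrite_css_links(html, timestamp="20171007035807"):
    """Prefix stylesheet hrefs with an archival timestamp plus a 'cs_'
    modifier, mimicking the rewrite above (a sketch, not a full rewriter)."""
    return re.sub(
        r'(<link[^>]*href=")(/[^"]*\.css)',
        lambda m: m.group(1) + "/" + timestamp + "cs_" + m.group(2),
        html,
    )

print(rewrite_css_links('<link rel="stylesheet" href="/foreverTime.css">'))
# <link rel="stylesheet" href="/20171007035807cs_/foreverTime.css">
```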

Replay Preserving Modifications

Replay preserving modifications are modifications made by web archives to specific HTML element and attribute pairs in order to negate their intended semantics. To illustrate this, let us consider two examples, the first of which was introduced by our fearless leader Dr. Michael Nelson and is known as the zombie-introducing meta refresh tag, shown below.

<meta http-equiv="refresh" content="35;url=?zombie=666"/>

As you may be familiar, the meta refresh tag will, after 35 seconds, refresh the page with "?zombie=666" appended to the original URL. When a page containing this dastardly tag is archived and replayed, the refresh plus the appending of "?zombie=666" to the URI-M causes the browser to navigate to a new URI-M that was never archived. To overcome this, archives must arm themselves with the attribute prefixing shotgun in order to negate the tag and attribute's effects. A successful defense against the zombie invasion using the attribute prefixing shotgun is shown below.

    <meta _http-equiv="refresh" _content="35;url=?zombie=666"/>
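The attribute prefixing shotgun itself is easy to sketch. This toy Python version (a regex over the tag text, not a real HTML parser) prefixes the targeted attributes with an underscore so the browser no longer honors them:

```python
import re

# Attributes to neuter on a meta tag (illustrative, not an exhaustive list)
NEUTERED = ("http-equiv", "content")

def prefix_attributes(tag):
    """Prefix the targeted attributes with '_' so the browser ignores them
    during replay (a sketch of the idea; real rewriters parse the HTML)."""
    for attr in NEUTERED:
        tag = re.sub(r"\b" + attr + "=", "_" + attr + "=", tag)
    return tag

print(prefix_attributes('<meta http-equiv="refresh" content="35;url=?zombie=666"/>'))
# <meta _http-equiv="refresh" _content="35;url=?zombie=666"/>
```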

Now let me introduce to you a new, more insidious tag that does not introduce a zombie into replay but rather a demon: the meta csp tag, shown below.

<meta http-equiv="Content-Security-Policy"
 content="default-src http://notarchive.com; img-src ...."/>

Naturally, web archives do not want web pages to be delivering their own Content-Security-Policies via meta tag because the results are devastating, as shown by the YouTube video below.

Readers, have no fear, this issue is fixed! I fixed the meta csp issue for Pywb and Webrecorder in pull request #274 submitted to Pywb. I also reported it to the Internet Archive, and they promptly fixed it.

Temporal Jailing

The final category of modifications, known as Temporal Jailing, is the emulation of the JavaScript environment as it existed at the original memento-datetime through client-side rewriting. Temporal jailing ensures both the secure replay of JavaScript and that JavaScript cannot tamper with time (introduce zombies) by applying overrides to the JavaScript APIs provided by the browser in order to intercept un-rewritten URLs. Yes, there is more to it, a whole lot more, but because it involves replaying JavaScript and I am attempting to keep this blog post reasonably short(ish), I must refer you to my thesis or thesis defense slides for more specific details. However, for more information about the impact of JavaScript on archivability, and measuring the impact of missing resources, see Dr. Justin Brunelle's Ph.D. wrap-up blog post. The technique for the secure replay of JavaScript known as temporal jailing is currently used by Webrecorder and Pywb.

Auto-Generating Client-Side Rewriters

Have I mentioned yet just how much I love JavaScript? If not, lemme give you a brief overview of how I am auto-generating client-side rewriting libraries, created a new way to replay JavaScript (currently used in production by Webrecorder and Pywb), and increased the replay fidelity of the Internet Archive's Wayback Machine.

First up, let me introduce to you Emu: Easily Maintained Client-Side URL Rewriter (GitHub). Emu allows any web archive to generate its own generic client-side rewriting library, one that conforms to the de facto standard implementation, Pywb's wombat.js, by supplying it the Web IDL definitions for the JavaScript APIs of the browser. Web IDL was created by the W3C to describe interfaces intended to be implemented in web browsers, to allow the behavior of common script objects in the web platform to be specified more readily, and to specify how interfaces described with Web IDL correspond to constructs within ECMAScript execution environments. You may be wondering how I can guarantee that this tool will generate a client-side rewriter providing complete coverage of the JavaScript APIs of the browser, and that we can readily obtain these Web IDL definitions. My answer is simple: consider the following excerpt from the HTML specification:

This specification uses the term document to refer to any use of HTML, ..., as well as to fully-fledged interactive applications. The term is used to refer both to Document objects and their descendant DOM trees, and to serialized byte streams using the HTML syntax or the XML syntax, depending on context ... User agents that support scripting must also be conforming implementations of the IDL fragments in this specification, as described in the Web IDL specification

Pretty cool, right? What is even cooler is that the major browsers/browser engines (Chromium, Firefox, and WebKit) generate and make publicly available the Web IDL definitions representing their conformance to the specification! Next up, a new way to replay JavaScript.
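Before moving on, a toy illustration of what Emu gets to work with (Emu's real parsing is far more complete than this): given an IDL fragment, pulling out every URL-valued attribute, i.e. the properties an archive must override, is mechanical.

```javascript
// Toy illustration of why Web IDL makes generating a client-side
// rewriter mechanical: extract every USVString attribute (the
// URL-valued properties in HTML's IDL) from an IDL fragment.
const idlFragment = `
interface HTMLImageElement : HTMLElement {
  attribute USVString src;
  attribute DOMString alt;
  attribute USVString srcset;
};`;

function urlAttributes(idl) {
  const attrs = [];
  const attrPattern = /attribute\s+USVString\s+(\w+);/g;
  let match;
  while ((match = attrPattern.exec(idl)) !== null) attrs.push(match[1]);
  return attrs;
}

urlAttributes(idlFragment); // → ['src', 'srcset']
```

Each extracted name is a point where a client-side rewriter needs a getter/setter override, and because the browsers publish the IDL themselves, coverage can track the spec rather than hand-maintained lists.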

Remember the curious case of mendely.com user pages (A State Of Replay) and how we found out that Archive-It, in addition to applying archival linkage modifications, was rewriting JavaScript code to substitute a new foreign, archive-controlled version of the JavaScript APIs it was targeting? This is shown in the image below.

Archive-It rewriting embedded JavaScript from the memento for the curious case mendely.com user pages

Hmmmm, looks like Archive-It is rewriting only two out of four instances of the text string location in the example shown above. This JavaScript rewriting was targeting the Location interface, which controls the location of the browser. Ok, so how well would Pywb/Webrecorder do in this situation?? From the image shown below, not as well, and maybe a tad bit worse...

Pywb v0.33 replay of https://reacttraining.com/react-router/web/example/auth-workflow

That's right, folks: JavaScript rewrites showing up in HTML. Why??? See below.

Bundling HTML in JavaScript, https://reacttraining.com/react-router/15-5fae8d6cf7d50c1c6c7a.js

Because the documentation site for React Router was bundling HTML inside of JavaScript containing the text string "location" (shown above), the rewrites were exposed in the documentation's HTML displayed to page viewers (second image above). Combined with how Archive-It was rewriting archived JavaScript in a similar manner, I decided this needed to be fixed. And fix it I did. Let me introduce to you a brand new way of replaying archived JavaScript, shown below.

// window proxy
new window.Proxy({}, {
  get (target, prop) {/*intercept attribute getter calls*/},
  set (target, prop, value) {/*intercept attribute setter calls*/},
  has (target, prop) {/*intercept attribute lookup*/},
  ownKeys (target) {/*intercept own property lookup*/},
  getOwnPropertyDescriptor (target, key) {/*intercept descriptor lookup*/},
  getPrototypeOf (target) {/*intercept prototype retrieval*/},
  setPrototypeOf (target, newProto) {/*intercept prototype changes*/},
  isExtensible (target) {/*intercept is object extendable lookup*/},
  preventExtensions (target) {/*intercept prevent extension calls*/},
  deleteProperty (target, prop) {/*intercept property deletion*/},
  defineProperty (target, prop, desc) {/*intercept new property definition*/}
});

// document proxy
new window.Proxy(window.document, {
  get (target, prop) {/*intercept attribute getter calls*/},
  set (target, prop, value) {/*intercept attribute setter calls*/}
});

The native JavaScript Proxy object allows an archive to perform runtime reflection on the proxied object. Simply put, it allows an archive to define custom or restricted behavior for the proxied object. I have annotated the code snippet above with additional information about the particulars of how archives can use the Proxy object. By using the JavaScript Proxy object in combination with the setup shown below, web archives can guarantee the secure replay of archived JavaScript and do not have to perform the kind of rewriting shown above. Yay! Less archival modification of JavaScript!! This method of replaying archived JavaScript was merged into Pywb on August 4, 2017 (contributed by yours truly) and has been used in production by Webrecorder since August 21, 2017. Now to tell you about how I increased the replay fidelity of the Internet Archive and how you can too.

var __archive$assign$function__ = function(name) {/*return archive override*/};
{
  // archive overrides shadow these interfaces
  let window = __archive$assign$function__("window");
  let self = __archive$assign$function__("self");
  let document = __archive$assign$function__("document");
  let location = __archive$assign$function__("location");
  let top = __archive$assign$function__("top");
  let parent = __archive$assign$function__("parent");
  let frames = __archive$assign$function__("frames");
  let opener = __archive$assign$function__("opener");
  /* archived JavaScript */
}
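To see how the two pieces fit together, here is a runnable sketch (mock objects and a hypothetical archive URL, not Pywb's actual wombat.js code) in which the assign function hands back Proxy-wrapped overrides, and the block-scoped `let` declarations shadow the real globals for any archived JavaScript placed inside the block:

```javascript
// Runnable sketch (mock objects, hypothetical archive URL) of the
// combined setup: the assign function returns archive-controlled
// overrides, and block-scoped `let` declarations shadow the real
// globals for the archived JavaScript inside the block.
const archivedLocation = {
  href: 'http://localhost:8080/replay/20170601/http://example.com/'
};

const windowProxy = new Proxy({}, {
  get (target, prop) {
    // Intercept attribute getter calls: serve the archived location.
    if (prop === 'location') return archivedLocation;
    return target[prop];
  },
  set (target, prop, value) {
    // Intercept attribute setter calls: keep writes off the real window.
    target[prop] = value;
    return true;
  }
});

var __archive$assign$function__ = function (name) {
  return { window: windowProxy, location: archivedLocation }[name];
};

let seenByPage;
{
  // archive overrides shadow these interfaces
  let window = __archive$assign$function__('window');
  let location = __archive$assign$function__('location');
  /* archived JavaScript would run here; it only ever sees the overrides */
  seenByPage = window.location.href === location.href ? location.href : null;
}
seenByPage; // → 'http://localhost:8080/replay/20170601/http://example.com/'
```

Because bare references like `location` resolve to the shadowing `let` bindings, the archive never has to rewrite the text of the script itself, which is exactly what eliminates the over-matching problem shown earlier.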

Ok, so I generated a client-side rewriter for the Internet Archive's Wayback Machine using the code that is now Emu and crawled 577 Internet Archive mementos from the top 700 web pages found in the Alexa top 1 million web site list circa June 2017. The crawler I wrote for this can be found on GitHub. By using the generated client-side rewriter, I was able to increase the cumulative number of requests made by the Internet Archive mementos by 32.8%, an increase of 45,051 requests (graph of this metric shown below). Remember that each additional request corresponds to a resource that previously could not be replayed from the Wayback Machine.

Hey look, I also decreased the number of requests blocked by the content-security policy of the Wayback Machine by 87.5%, un-blocking 5,972 requests (graph of this metric shown below). Remember that each un-blocked request corresponds to a URI-R the Wayback Machine could not rewrite server-side, which requires the use of client-side rewriting (Pywb and Webrecorder are using this technique already).

Now you must be thinking this is impressive to say the least, but how do I know these numbers are not faked or doctored in some way to give client-side rewriting the advantage??? Well, you know what they say: seeing is believing!!! The generated client-side rewriter used in the crawl that produced the numbers shown to you today is available as the Wayback++ Chrome and Firefox browser extension! Source code for it is on GitHub as well. And oh look, a video demonstrating the increase in replay fidelity gained if the Internet Archive were to use client-side rewriting. Oh, I almost forgot to mention that at the 1:47 mark in the video I make mementos of cnn.com replayable again from the Internet Archive. Winning!!

Pretty good for just a master's thesis, wouldn't you agree? Now it's time for the obligatory list of all the things I have created during this research and my time as a master's student:

What is next, you may ask??? Well, I am going to be taking a break before I start down the path known as a Ph.D. Why??????? To become the senior backend developer for Webrecorder, of course! There is so, so much to be learned from actually getting my hands dirty facilitating high-fidelity web archiving, such that when I return, I will have a much better idea of what my research focus should be.

If I have said this once, I have said it a million times: when you use a web browser in the preservation process, there is no such thing as an un-archivable web page! Long live high-fidelity web archiving!

- John Berlin (@johnaberlin, @N0taN3rd)