Wednesday, March 21, 2018

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English

Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages for which the default HTML language setting was expected to be English, but when the archived page is retrieved, its template appears in a foreign language. For example, the tweet content of former US President Barack Obama's archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. Notice that some of the interface text, such as "followers", "following", and "log in", is displayed not in English but in Urdu. A similar observation was made by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it. This problem may appear benign to the casual observer, but it has deep implications when looked at from a digital archivist's perspective.

The problem became more evident when Miranda Smith (a member of the WSDL lab) was finalizing the implementation of a Twitter Follower-History-Count tool. The application uses mementos extracted from the Internet Archive (IA) to find the number of followers that a particular Twitter account had acquired over time. The tool expects the web page retrieved from the IA to be rendered in English in order to scrape out the number of followers a Twitter account had at a particular time. Since it was now evident that Twitter pages were not archived in English only, we had to decide whether to account for all possible language settings or to discard non-English mementos. We asked ourselves: why are some Twitter pages archived in non-English languages when we generally expected them to be in English? Note that we are referring to the interface/template language and not the language of the tweet content.
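For example, a scraper like the follower-count tool needs to detect the template language of a memento before parsing it. Here is a minimal sketch (the function names are ours, not the tool's actual code), assuming the interface language can be read from the lang attribute of the <html> element, as in the mementos we examined:

```python
import re

def template_language(html):
    """Return the interface language declared in the <html> element
    of a memento (e.g. "en", "ur"), or None if it is not declared."""
    match = re.search(r'<html\b[^>]*\blang="([A-Za-z-]+)"', html)
    return match.group(1) if match else None

def scrapable(html):
    # Our hypothetical scraper only understands the English template,
    # so mementos in any other interface language would be discarded.
    return template_language(html) == "en"
```

A guard like this is what forced the decision above: either branch on all 47 possible template languages or drop everything where `scrapable()` is false.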

We later found that this issue is more prevalent than we initially thought. We selected former US President Barack Obama to explore how many languages, and how often, his Twitter page was archived in. We downloaded the TimeMap of his page using MemGator and then downloaded all the mementos in it for analysis. We found that his Twitter page was archived in 47 different languages (all the languages that Twitter currently supports, a subset of which is supported in its widgets) across five different web archives: the Internet Archive (IA), Archive-It (AIT), the Library of Congress (LoC), the UK Web Archive (UKWA), and the Portuguese Web Archive (PT). Our dataset shows that overall only 53% of his pages (out of over 9,000 properly archived mementos) were archived in English. Of the remaining 47%, 22% were archived in Kannada and 25% in the 45 other languages combined. We excluded from our dataset mementos that were not "200 OK" or did not have language information.

Fig. 2 shows that in the UKWA only 5% of Barack Obama's Twitter pages were archived in English. Conversely, in the IA about half of Barack Obama's Twitter pages are archived in English, roughly as many as all the remaining languages combined. It is worth noting that AIT is a subset of the IA. On the one hand, it is good to have more language diversity in archives (the archival record is already more complete for English-language web pages than for others). On the other hand, it is very disconcerting when a page is captured in a language that was not anticipated. We also noted that Twitter pages in the Kannada language are archived almost as often as all other non-English languages combined, although Kannada ranks 32nd globally by number of native speakers, who make up only 0.58% of the global population. We tried to find out why some Twitter pages belonging to accounts that generally tweet in English were archived in non-English languages, and why Kannada is so prevalent among the non-English languages. Our findings follow.

Fig. 2 Barack Obama Twitter Page Language Distribution in Web Archives

We started investigating the reason why web archives sometimes capture pages in non-English languages, and we came up with the following potential reasons:
  • Some JavaScript in the archived page is changing the template text to another language at replay time
  • A cached page on a shared proxy is serving content in other languages
  • "Save Page Now"-like features are utilizing users' browsers' language preferences to capture pages
  • Geo-location-based language setting
  • Crawler jobs are intentionally or unintentionally configured to send a different "Accept-Language" header
The actual reason turned out to have nothing to do with any of these; instead, it was related to cookies. But describing our thought process and how we arrived at the root of the issue offers some important lessons worth sharing.

Evil JavaScript

Since JavaScript is known to cause issues in web archiving (a previous blog post by John Berlin expands on this problem), both at capture and replay time, we first thought this had to do with some client-side localization where a wrong translation file was leaking in at replay time. However, when we looked at the page source in a browser as well as on the terminal using curl (as illustrated below), it was clear that the translated markup was being generated on the server side. Hence, this possibility was struck off.

$ curl --silent | grep "<meta name=\"description\""
  <meta name="description" content="من الأخبار العاجله حتى الترفيه إلى الرياضة والسياسة، احصل على القصه كامله مع التعليق المباشر.">


We thought Twitter might be doing content negotiation using the "Accept-Language" request header, so we changed the language preference in our web browser and opened Twitter in an incognito window, which confirmed our hypothesis: Twitter did indeed consider the language preference sent by the browser and responded with a page in that language. However, when we investigated the HTTP response headers, we found that Twitter does not return the "Vary" header when it should. This behavior can be dangerous because content negotiation is happening on the "Accept-Language" header, but it is not advertised as a factor of content negotiation. This means a proxy can cache a response to a URI in some language and serve it back to someone else who requests the same URI, even with a different language in the "Accept-Language" header. We considered this a potential way for an undesired response to get archived.
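The danger of the missing "Vary" header can be illustrated with a toy model of a shared proxy: seeing no "Vary: Accept-Language", it keys its cache on the URI alone. The origin function below is a stand-in for a localizing server, not Twitter's actual behavior:

```python
cache = {}  # shared proxy cache, keyed on URI only

def origin(uri, accept_language):
    # Stand-in for a server that localizes on Accept-Language but
    # does not advertise it with a "Vary" header.
    return '<html lang="%s">...</html>' % accept_language

def proxy_fetch(uri, accept_language):
    # Without "Vary: Accept-Language" the proxy has no reason to key
    # the cache on that header, so the first response wins.
    if uri not in cache:
        cache[uri] = origin(uri, accept_language)
    return cache[uri]

proxy_fetch("https://twitter.com/", "ar")         # fills the cache in Arabic
page = proxy_fetch("https://twitter.com/", "en")  # served from cache: still Arabic
```

Any crawler (or user) behind such a proxy would then archive the first requester's language, whatever its own preferences.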

On further investigation we found that Twitter tries very hard (sometimes in wrong ways) to make sure its pages are not cached, as can be seen in the response headers illustrated below. The Cache-Control and the obsolete Pragma headers explicitly ask proxies and clients not to cache or store the response by setting the values "no-cache" and "no-store". The Date header (the date/time at which the response was originated) and the Last-Modified header are set to the same value to ensure that the cache (if stored) becomes invalid immediately. Additionally, the Expires header (the date/time after which the response is considered stale) is set to March 31, 1981, a date far in the past, long before Twitter even existed, to further enforce cache invalidation.

$ curl -I https://twitter.com/
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
pragma: no-cache
date: Sun, 18 Mar 2018 17:43:25 GMT
last-modified: Sun, 18 Mar 2018 17:43:25 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
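Those anti-caching signals can be read programmatically. Below is a toy check using the header values shown above; it is a simplified reading of Cache-Control/Pragma/Expires, not a full HTTP/1.1 freshness implementation:

```python
from email.utils import parsedate_to_datetime

def cache_lifetime(headers):
    """Seconds for which a response may be served from a cache,
    per a simplified reading of Cache-Control/Pragma/Expires."""
    cc = headers.get("cache-control", "")
    if "no-store" in cc or "no-cache" in cc:
        return 0
    if headers.get("pragma") == "no-cache":
        return 0
    if "expires" in headers and "date" in headers:
        delta = (parsedate_to_datetime(headers["expires"]) -
                 parsedate_to_datetime(headers["date"]))
        return max(0, delta.total_seconds())
    return 0

twitter_like = {
    "cache-control": "no-cache, no-store, must-revalidate",
    "pragma": "no-cache",
    "date": "Sun, 18 Mar 2018 17:43:25 GMT",
    "expires": "Tue, 31 Mar 1981 05:00:00 GMT",  # decades in the past
}
cache_lifetime(twitter_like)  # 0: every signal forbids caching
```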

Hence, the possibility of a cache returning pages in different languages due to the missing "Vary" header was also not sufficient to justify the number of mementos in non-English languages.


We also thought about the possibility that Twitter identifies a potential language for guest visitors based on their IP address (to guess the geo-location). However, the languages seen in mementos do not align with the places where archival crawlers are located. For example, Kannada, the language dominant in the UK Web Archive's captures, is spoken in the State of Karnataka in India, and it is unlikely that the UK Web Archive is crawling from machines located in Karnataka.

On-demand Archiving

The Internet Archive recently introduced the "Save Page Now" feature, which acts as a proxy, forwarding the user's request headers to the upstream web server rather than its own. This behavior can be observed in a memento that we requested for an HTTP echo service, HTTPBin, from our browser. The service echoes back in the response the data it receives from the client in the request, so by archiving it we expect to see the headers that the service saw from the requesting client. The headers shown below are those of our browser, not of the IA's crawler, especially the "Accept-Language" header (which we customized in our browser) and the "User-Agent" header, which confirms our hypothesis that IA's Save Page Now feature acts as a proxy.

$ curl
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "es",
    "Connection": "close",
    "Host": "",
    "Referer": "",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
  "json": null,
  "method": "GET",
  "origin": "",
  "url": ""

This behavior made us consider that people from different regions of the world, with different language settings in their browsers, would, when using the "Save Page Now" feature, end up preserving Twitter pages in the languages of their preference (since Twitter does honor the "Accept-Language" header in some cases). However, we were unable to replicate this in our browser. Also, not every archive has on-demand archiving, and those that lack it could never replay users' request headers.

We also repeated this experiment in another on-demand web archive. Unlike the IA, it does not replay users' headers like a proxy; instead it sends its own custom request headers. It also does not show the original markup, but modifies the page heavily before serving it, so a curl output would not be very useful. However, the content of our archived HTTP echo service page looks like this:

  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip",
    "Accept-Language": "tt,en;q=0.5",
    "Connection": "close",
    "Host": "",
    "Referer": "",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2704.79 Safari/537.36"
  "json": null,
  "method": "GET",
  "origin": ",",
  "url": ""

Note that it has its own custom "Accept-Language" and "User-Agent" headers (different from those of the browser from which we requested the capture). It also includes a custom "Referer" header. However, unlike the IA, it replayed our IP address as the origin. We then captured a Twitter URL with "?lang=ar" followed by one without any "lang" parameter to see whether the language session sticks across two successive Twitter requests, but that was not the case: the second page was archived in English (not in Arabic). However, this does not necessarily prove that their crawler does not have this issue. It is possible that two different instances of their crawler handled the two requests, or that other Twitter links (with "?lang=en") were archived between our two requests by someone else. We do not have sufficient information to be certain about it.

Misconfigured Crawler

Some of the early mementos in which we observed this behavior were from Archive-It. So we thought that some collection maintainers might have misconfigured their crawl jobs to send a non-default "Accept-Language" header, resulting in such mementos. Since we did not have access to their crawling configuration, there was very little we could do to test this hypothesis. Many of the leading web archives use Heritrix as their crawler, including Archive-It, and we happened to have some WARC files from AIT, so we started looking into those. We searched the request records of those WARC files for any Twitter links to see what "Accept-Language" header was sent. We were quite surprised to see that Heritrix never sent any "Accept-Language" header to any server, so this could not be the reason at all. However, while looking into those WARC files, we saw "Cookie" headers sent to the servers in the request records of Twitter and many other sites. This led us to uncover the actual cause of the issue.

Cookies, the Real Culprit

So far, we had been considering Heritrix to be a stateless crawler, but when we looked into the WARC files of AIT, we observed cookies being sent to servers. This means Heritrix does have cookie management built in (which is often necessary to meaningfully capture some sites). With this discovery, we started investigating Twitter's behavior from a different perspective. The page source of Twitter has a list of alternate links for each language it provides localization for (currently, 47 languages). This list can get added to the frontier queue of the crawler. Although these links have different URIs (i.e., a query parameter "?lang=<lang-code>"), once any of them is loaded, the session is set to that language until the language is explicitly changed or the session expires or is cleared. In the past, Twitter had options in the interface to manually select a language, which would then be set for the session. It is understandable that general-purpose web sites cannot rely completely on "Accept-Language" for localization-related content negotiation, as browsers have made it difficult to customize language preferences, especially on a per-site basis.

We experimented with Twitter's language-related behavior in our web browser by navigating to https://twitter.com/?lang=ar, which yields the page in Arabic. Then navigating to any other Twitter page (without the explicit "lang" query parameter) continues to serve Arabic pages (if no Twitter account is logged in). Here is how Twitter's server behaves for language negotiation:

  • If a "lang" query parameter (with a supported language) is present in any Twitter link, that page is served in the corresponding language.
  • If the user is a guest, the value of the "lang" parameter is set for the session (it is set anew each time an explicit language parameter is passed) and remains sticky until changed or cleared.
  • If the user is logged in (using Twitter's credentials), the default language preference is taken from their profile preferences, so the page will only show in a different language if an explicit "lang" parameter is present in the URI. However, it is worth noting that crawlers generally behave like guests.
  • If the user is a guest and no "lang" parameter is passed, Twitter falls back to the language supplied in the "Accept-Language" header.
  • If the user is a guest, no "lang" parameter is passed, and no "Accept-Language" header is provided, then responses are in English (though, this could be affected by Geo-IP, which we did not test).
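These rules can be condensed into a small decision function. This is our reconstruction of the observed behavior (the names and the abbreviated language set are ours), not Twitter's actual code, and Geo-IP effects are ignored:

```python
SUPPORTED = {"en", "ar", "ur", "kn", "bn", "fr"}  # abbreviated; Twitter supports 47

def page_language(query_lang=None, profile_lang=None,
                  session_lang=None, accept_language=None):
    if query_lang in SUPPORTED:       # explicit ?lang= always wins
        return query_lang
    if profile_lang:                  # logged-in users: profile preference
        return profile_lang
    if session_lang in SUPPORTED:     # guests: sticky "lang" session cookie
        return session_lang
    if accept_language in SUPPORTED:  # guests: browser's Accept-Language
        return accept_language
    return "en"                       # final fallback
```

Note the precedence: the sticky session cookie outranks the browser's "Accept-Language", which is exactly what makes crawlers vulnerable.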

In the example below we illustrate some of this behavior using curl. First, we fetch Twitter's home page in Arabic using the explicit "lang" query parameter and show that the response was indeed in Arabic, as it contains the lang="ar" attribute in the <html> element. We also save any cookies the server sets in the "/tmp/twitter.cookie" file, and show that the cookie file does indeed contain a "lang" cookie with the value "ar" (there are some other cookies in it, but those are not relevant here). Next, we fetch Twitter's home page without any explicit "lang" query parameter and receive a response in the default English. Then we fetch the home page with the "Accept-Language: ur" header and get the response in Urdu. Finally, we fetch the home page again, this time supplying the saved cookies (which include the "lang=ar" cookie), and receive the response in Arabic once more.

$ curl --silent "https://twitter.com/?lang=ar" -c /tmp/twitter.cookie | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">

$ grep lang /tmp/twitter.cookie
twitter.com	FALSE	/	FALSE	0	lang	ar

$ curl --silent https://twitter.com/ | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">

$ curl --silent -H "Accept-Language: ur" https://twitter.com/ | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">

$ curl --silent -b /tmp/twitter.cookie https://twitter.com/ | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">

Twitter Cookies and Heritrix

Now that we understood the reason, we wanted to replicate what happens in a real archival crawler. We used Heritrix to simulate the effect that Twitter cookies have when a Twitter page gets archived in the IA. We seeded the following URLs, in this sequence, in Heritrix's configuration file; the order was carefully chosen to see whether the first link sets the language session to Arabic and the second one then gets captured in Arabic:
  • https://twitter.com/?lang=ar
  • https://twitter.com/phonedude_mln/
We had already established that the first URI, which includes the language identifier for Arabic (lang=ar), would place the language identifier inside a cookie. The question now becomes: what effect will this cookie have on subsequent requests for Twitter pages? Is the language identifier going to stay the same as the one already set in the cookie, or is it going to revert to a default language preference? The naive expectation for our seeded URIs is that the first Twitter page will be archived in Arabic and the second in English, since a request without an explicit language parameter normally defaults to English. However, since we had observed that the Twitter cookies contain the language identifier when this parameter is passed in the URI, it is plausible that, if subsequent requests replay the same cookie, the language identifier will be maintained.
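This expected interaction can be simulated without any network access. Below is a minimal model of a Heritrix-like crawler with a single shared cookie jar, where fetch() merely stands in for Twitter's observed cookie behavior:

```python
from urllib.parse import urlparse, parse_qs

def fetch(url, jar):
    """Stand-in for Twitter: a ?lang= parameter sets the lang cookie,
    and otherwise the cookie (if present) decides the page language."""
    query = parse_qs(urlparse(url).query)
    if "lang" in query:
        jar["lang"] = query["lang"][0]   # Set-Cookie: lang=<code>
    return jar.get("lang", "en")         # language of the captured page

jar = {}  # one cookie jar shared across the whole crawl
first = fetch("https://twitter.com/?lang=ar", jar)
second = fetch("https://twitter.com/phonedude_mln/", jar)
# first == "ar"; second is also "ar", because the cookie stuck
```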

After running the crawling job in Heritrix for the seeded URIs, we inspected the WARC file generated by Heritrix. The results were as we expected. Heritrix was indeed saving and replaying "Cookie" headers, resulting in the second page being captured in Arabic. Relevant portions of the resulting WARC file are shown below:

WARC-Type: request
WARC-Date: 2018-03-16T21:58:44Z
WARC-Concurrent-To: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
WARC-Record-ID: <urn:uuid:473273f6-48fa-4dd3-a5f0-81caf9786e07>
Content-Type: application/http; msgtype=request
Content-Length: 301

GET /?lang=ar HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Cookie: guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC shown above is a request record for the URI https://twitter.com/?lang=ar. The highlighted lines illustrate the GET request made to the host "twitter.com" with the path and query "/?lang=ar". This request yielded a response from Twitter containing a "set-cookie" header with the language identifier from the URI ("lang=ar"), as shown in the portion of the WARC below. The HTML was rendered in Arabic (notice the highlighted <html> element with the lang attribute in the response payload below).

WARC-Type: response
WARC-Date: 2018-03-16T21:58:44Z
WARC-Record-ID: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
Content-Type: application/http; msgtype=response
Content-Length: 151985

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 150665
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:44 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:44 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:34 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: lang=ar; Path=/
set-cookie: ct0=10558ec97ee83fe0f2bc6de552ed4b0e; Expires=Sat, 17 Mar 2018 03:58:44 UTC; Path=/; Domain=.twitter.com; Secure
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: 2a2fc89f51b930202ab24be79b305312
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 100
x-transaction: 001495f800dc517f
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">

The subsequent seed in the Heritrix configuration file (https://twitter.com/phonedude_mln/) generated an additional request record, shown in the WARC portion below. The highlighted lines illustrate the GET request made to the host "twitter.com" with the path "/phonedude_mln/". Notice that a "Cookie" header containing lang=ar, set initially by the first seeded URI, was included in the request.

WARC-Type: request
WARC-Date: 2018-03-16T21:58:48Z
WARC-Concurrent-To: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
WARC-Record-ID: <urn:uuid:eef134ed-f3dc-459b-95e7-624b4d747bc1>
Content-Type: application/http; msgtype=request
Content-Length: 655

GET /phonedude_mln/ HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Cookie: lang=ar; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; ct0=10558ec97ee83fe0f2bc6de552ed4b0e; guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC file shown below illustrates the effect of Heritrix saving and replaying the "Cookie" headers. The highlighted <html> element proves that the HTML language identifier was set to Arabic for the second seeded URI (https://twitter.com/phonedude_mln/), although the URI did not include the language identifier.

WARC-Type: response
WARC-Date: 2018-03-16T21:58:48Z
WARC-Record-ID: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
Content-Type: application/http; msgtype=response
Content-Length: 518086

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 516921
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:48 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:48 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:38 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: ef102c969c74f3abf92966e5ffddb6ba
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 335
x-transaction: 0014986c00687fa3
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">

We used PyWb to replay pages from the captured WARC file. Fig. 3 is the page rendered after retrieving the first seeded URI of our collection (https://twitter.com/?lang=ar). For those not familiar with Arabic, this is indeed Twitter's home page in Arabic.


Fig. 4 is the representation given by PyWb after requesting the second seeded URI (https://twitter.com/phonedude_mln/). The page was rendered with Arabic as the default language, although we did not include this setting in the URI, nor did our browser language settings include Arabic.

Fig. 4 - https://twitter.com/phonedude_mln/ in Arabic

Why is Kannada More Prominent?

As we noted before, Twitter's page source now includes a list of alternate links for 47 supported languages. These links look something like this:

<link rel="alternate" hreflang="fr" href="">
<link rel="alternate" hreflang="en" href="">
<link rel="alternate" hreflang="ar" href="">
<link rel="alternate" hreflang="kn" href="">

The fact that Kannada ("kn") is the last language in the list is why it is so prevalent in web archives. While each language-specific link overwrites the session set by its predecessor, the last one affects many more Twitter links in the frontier queue. Twitter started supporting Kannada, along with three other Indian languages, in July 2015 and placed it at the very end of the language-related alternate links. Since then, it has been captured more often in various archives than any other non-English language. Before these new languages were added, Bengali was the last of the alternate language links for about a year. Our dataset shows dense archival activity for Bengali between July 2014 and July 2015, after which Kannada took over. This supports our hypothesis that the last language-related link sticks the session with that language for a long time, affecting all upcoming links from the same domain in the crawler's frontier queue until another language-specific link overwrites the session.
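The effect can be demonstrated by replaying the alternate links in page order against a sticky session (an abbreviated, illustrative link list; a real crawl frontier is more complex):

```python
import re

# Abbreviated page source; the real list has 47 alternate links, kn last.
page_source = '''
<link rel="alternate" hreflang="fr" href="https://twitter.com/?lang=fr">
<link rel="alternate" hreflang="ar" href="https://twitter.com/?lang=ar">
<link rel="alternate" hreflang="kn" href="https://twitter.com/?lang=kn">
'''

session = {}
for code in re.findall(r'hreflang="([a-z-]+)"', page_source):
    session["lang"] = code   # each visited ?lang= link overwrites the cookie

session["lang"]  # "kn": the last alternate link wins the session
```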

What Should We Do About It?

Disabling cookies does not seem to be a good option for crawlers, as some sites try hard to set a cookie by repeatedly returning redirect responses until their desired "Cookie" headers are included in the request. However, explicitly reducing the cookie expiration duration in crawlers could mitigate the long-lasting impact of such sticky cookies. Garbage collecting any cookie that was set more than a few seconds ago would ensure that no cookie is reused for more than a few successive requests. Sandboxing crawl jobs into many isolated sessions is another potential way to minimize the impact. Alternatively, filtering policies could route URLs that set session cookies into a separate short-lived session, isolating them from the rest of the crawl frontier queue.
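A sketch of the cookie-expiration mitigation (a hypothetical jar of our own design, not Heritrix's actual cookie store):

```python
class ShortLivedCookieJar:
    """Accepts cookies (so sites that insist on them still work) but
    garbage-collects any cookie set more than max_age seconds ago, so
    no cookie outlives more than a few successive requests."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age
        self._cookies = {}  # name -> (value, time the cookie was set)

    def set(self, name, value, now):
        self._cookies[name] = (value, now)

    def header(self, now):
        # Drop anything stale, then serialize the survivors.
        self._cookies = {n: (v, t) for n, (v, t) in self._cookies.items()
                         if now - t <= self.max_age}
        return "; ".join("%s=%s" % (n, v)
                         for n, (v, t) in self._cookies.items())

jar = ShortLivedCookieJar(max_age=5.0)
jar.set("lang", "ar", now=0.0)
jar.header(now=1.0)   # "lang=ar" -- fresh enough to be replayed
jar.header(now=60.0)  # ""        -- the sticky cookie has been dropped
```

With such a jar, a "lang=ar" cookie set by one alternate link would expire long before most subsequent Twitter URLs in the frontier are fetched.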


The problem of portions of Twitter pages unintentionally being archived in non-English languages is quite significant. We found that 47% of the mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case it is disconcerting and counter-intuitive. We found that the root cause is Twitter's sticky language sessions, maintained using cookies that the Heritrix crawler honors.

Because Kannada is the last in the list of language-specific alternate links on Twitter's pages, it overwrites the language cookies set by the URLs listed above it. This causes more Twitter pages in the frontier queue to be archived in Kannada than in any other non-English language. Crawlers are generally considered stateless, but honoring cookies makes them somewhat stateful. This behavior is not necessarily specific to Twitter; other sites that use cookies for content negotiation may have similar consequences in web archives. The issue can potentially be mitigated by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawl of URLs from the same domain across many small sandboxed instances.

Sawood Alam
Plinio Vargas

Thursday, March 15, 2018

2018-03-15: Paywalls in the Internet Archive

Paywall page from The Advertiser

Paywalls have become increasingly notable in the Internet Archive over the past few years. In our recent investigation into news similarity for U.S. news outlets, we chose from a list of websites and then pulled the top stories. We did not initially include subscriber-based sites, such as The Financial Times or The Wall Street Journal, because these sites provide only snippets of an article, after which users are confronted with a "Subscribe Now" prompt to view the remaining content. The New York Times, like some other news sites, also has subscriber-based content, but access is limited only once a user has exceeded a set number of stories. In our study of 30 days of news sites, we found 24 URIs that were deemed to be paywalls, and these are listed below:

Memento Responses

All of these URIs point to the Internet Archive but result in an HTTP status code of 404. We took all of these URI-Ms from the homepages of their respective news sites and examined how the Internet Archive captured them over a period of a month.

The image above shows requests sent to the Internet Archive's memento API, with the initial request being 0 days and then adding 1, 7, and 30 days to the initial request datetime to see whether the URI-M retrieved resolved to something other than 404. The initial requests to these mementos all had a 404 status code. Adding a day to the memento datetime and requesting a new copy from the Internet Archive resulted in some of the URI-Ms resolving with a 200 response code, showing that these articles had become available. Adding 7 days to the initial request datetime shows that by this time the Internet Archive had found copies for all but one URI-M. The same result is repeated when adding 30 days to the initial memento request datetime. The response code "0" indicates no response, caused by an infinite redirect loop. The chart supports the idea that content is released for free once a period of time has passed.
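The probing procedure can be expressed as generating the 14-digit Wayback-style datetimes at which each 404 URI-M is re-requested (the helper name is ours; offsets as in the experiment):

```python
from datetime import datetime, timedelta

def probe_datetimes(initial, day_offsets=(0, 1, 7, 30)):
    """14-digit datetimes at which to re-request a 404 memento."""
    return [(initial + timedelta(days=d)).strftime("%Y%m%d%H%M%S")
            for d in day_offsets]

probe_datetimes(datetime(2018, 3, 1, 12, 0, 0))
# ['20180301120000', '20180302120000', '20180308120000', '20180331120000']
```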

The New York Times articles end up redirecting to a different part of the New York Times website. Although each of the URIs resolves with a 404 status code, an earlier capture shows that it was a login page asking for signup or subscription.

Paywalls in Academia

Paywalls restrict not just news content but also academic content. When users follow a link through a DOI assigned to a paper, they are often redirected to a splash page showing a short description of the paper but not the actual PDF document. One example DOI we examined currently points to the splash page for a published paper, but the content is only available via purchase.

In order to actually access the content a user is first redirected to the splash page, and is then required to purchase the requested content.

If we search for this DOI in the Internet Archive, we find that it ultimately leads to a memento of the same paywall we found on the live web. This shows that both the DOI and the paywall are archived, but the PDF is not ("Buy (PDF) USD 39.95").

Organizations that are willing to pay for a subscription to an association that hosts academic papers will have access to the content. A popular example is the ACM Digital Library. When users visit pages like SpringerLink, they may not get the blue "Download PDF" button but rather a grey button signifying that downloading is disabled for a non-subscribed user.

Van de Sompel et al. investigated 1.6 million URI references from arXiv and PubMed Central and found that over 500,000 of the URIs were locating URIs, indicating the current document location. These URIs can expire over time, which defeats the purpose of DOIs.

Searching for Similarity

When considering hard paywall sites like the Financial Times (FT) and the Wall Street Journal (WSJ), it is intuitive that most of the paywall pages a non-subscribed user sees will look much the same. We experimented with 10 of the top WSJ articles on 11/01/2016, each scraped from the homepage of WSJ. From these 10 articles we did pairwise comparisons by taking the SimHash of each article's HTML representation and then computing the Hamming distance between each unique pair of SimHash bit strings.

We found that pages with completely different representations stood out with a higher Hamming distance of 40+ bits, while articles that had the same styled representation had at most a 3-bit Hamming distance, regardless of whether the article was a snippet or a full-length article. This showed that SimHash was not well suited for discovering differences in content but rather differences in content representation, such as changes in CSS, HTML, or JavaScript. It didn't help our observations that WSJ was including entire font-family data strings inside its HTML at the time. In reference to Maciej Ceglowski's post on "The Website Obesity Crisis," WSJ injecting a CSS font-family data string does not aid in a healthy "web pyramid":
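The comparison pipeline above is simple to sketch. The following is a minimal, illustrative SimHash that hashes whitespace tokens with MD5; it is not the exact fingerprinting we ran over raw HTML, just a sketch of the technique:

```python
import hashlib

def simhash(text, bits=64):
    """Compute a simple 64-bit SimHash over whitespace tokens."""
    weights = [0] * bits
    for token in text.split():
        # Hash each token to a 64-bit integer.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Each output bit takes the sign of its accumulated weight.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of differing bits between two SimHash fingerprints."""
    return bin(a ^ b).count("1")
```

Near-duplicate templates then land within a few bits of each other, while unrelated markup diverges widely.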

From here, I decided to explore the option of using a binary image classifier on thumbnails of news sites, labeling an image as a "paywall_page" or a "content_page." To accomplish this I used TensorFlow and the readily adaptable examples provided by the "TensorFlow for Poets" tutorial. Utilizing the MobileNet model, I trained on 122 paywall images and 119 content page images, mainly news homepages and articles. The images were collected using Google Images and manually classified as content or paywall pages.

I retrained the model on the new images for 4000 iterations, which produced an accuracy of 80-88%. As a result, I built a simple web application named paywall-classify, which can be found on Github. It utilizes Puppeteer to take screenshots of a given list of URIs (maximum 10) at a resolution of 1920x1080 and then uses TensorFlow to classify the resulting images. More instructions on how to use the application can be found in the repository readme.

There are many other techniques that could be considered for image classification of webpages, for example, slicing a full page image of a news website into sections. However, this approach would more than likely bias toward the content classification, as the "subscribe now" message seems to always be at the top of an article, meaning it would appear in only 1/n slices. For this application I also didn't consider the possibility of scrolling down a page to trigger a JavaScript popup of a paywall message.

Other approaches might utilize textual analysis, such as performing Naive Bayes classification on terms collected from a paywall page and then building a classifier from there. 
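As a rough illustration of that idea (not something we implemented), a multinomial Naive Bayes over page terms with Laplace smoothing could be sketched as follows; the function names and toy labels are hypothetical:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (text, label) pairs. Returns per-label token counts,
    per-label token totals, label counts, and the shared vocabulary."""
    counts, totals, labels = {}, Counter(), Counter()
    for text, label in docs:
        labels[label] += 1
        counts.setdefault(label, Counter())
        for tok in text.lower().split():
            counts[label][tok] += 1
            totals[label] += 1
    vocab = {t for c in counts.values() for t in c}
    return counts, totals, labels, vocab

def classify_nb(text, model):
    """Pick the label with the highest log posterior (Laplace-smoothed)."""
    counts, totals, labels, vocab = model
    n = sum(labels.values())
    best, best_lp = None, float("-inf")
    for label in labels:
        lp = math.log(labels[label] / n)
        for tok in text.lower().split():
            lp += math.log((counts[label][tok] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

Trained on a handful of paywall phrases ("subscribe now", "sign in for full access") versus ordinary article text, such a classifier separates the two classes on this toy scale.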

What to take away

It's actually difficult to determine why some of the URI-Ms listed result in 404 responses while other articles from those sites return a 200 response for their first memento. The New York Times has a limit of 10 "free" articles per user, so perhaps at crawl time the Internet Archive hit its quota. Mat Kelly et al., in Impact of URI Canonicalization on Memento Count, discuss "archived 302s," where a live site returns an HTTP 302 redirect at crawl time; these New York Times articles may likewise have been redirecting to a login page when they were crawled.

-Grant Atkins (@grantcatkins)

Wednesday, March 14, 2018

2018-03-14: Twitter Follower Count History via the Internet Archive

The USA Gymnastics team shows significant growth during the years the Olympics are held.

Due to limitations of Twitter's API, we have limited ability to collect historical data about a user's followers. The API does not reveal when one account started following another, and without that information we cannot track an account's popularity and how it grows. Another pitfall is that when an account is deleted, Twitter provides no data about the account after the deletion date; it is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to have been archived, then a follower count for a specific date can be collected.

The previous method to determine followers over time was to plot users, in the order the API returns them, against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the observed account. A follower's account-creation date is only a lower bound for when they could have started following the observed account, so the method's accuracy depends on new accounts immediately following the observed account. The order in which Twitter returns followers is subject to unannounced change, so it can't be depended on long term, and because the API only returns users still following the account, this method cannot show when an account starts losing followers. This tool instead gathers and plots follower counts from mementos, or archived web pages, collected from the Internet Archive to show growth rates, track deleted accounts, and help pinpoint when an account might have bought bots to inflate its follower numbers.

I improved on a Python script, created by Orkun Krand, that collects the follower counts for a specific Twitter username from the mementos found in the Internet Archive. The code can be found on Github. Through the historical pages kept in the Internet Archive, the number of followers can be observed as of each collected memento's date. The script locates the follower count by matching the CSS selectors associated with it in most of the major layouts Twitter has implemented. If a Twitter page isn't popular enough to warrant being archived, or is too new, then no data can be collected for that user.
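The extraction step can be sketched as follows. The real script uses bs4 and handles several historical Twitter layouts; this stdlib-only sketch handles a single layout, and the class names are assumptions in the style of the old "ProfileNav" markup rather than an exhaustive selector list:

```python
import re

# One of several patterns the scraper would need; matches markup like
# <li class="ProfileNav-item--followers">...<span data-count="123456">.
FOLLOWERS_RE = re.compile(
    r'ProfileNav-item--followers.*?data-count="(\d+)"', re.DOTALL)

def follower_count(html):
    """Return the follower count from an archived page, or None if the
    layout is unrecognized (the real script logs these to the error .csv)."""
    m = FOLLOWERS_RE.search(html)
    return int(m.group(1)) if m else None
```

A page whose layout matches none of the known patterns yields None, which is how unparseable mementos end up in the error file.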

This code is especially useful for investigating users that have been deleted from Twitter. The Russian troll account @Ten_GOP, which impersonated the Tennessee GOP, was deleted once discovered. However, with the Internet Archive we can still study its growth rate while it was active and being archived. 
In February 2018, there was an outcry as conservatives lost, mostly temporarily, thousands of followers due to Twitter suspending suspected bot accounts. This script enables investigating which users lost followers, and for how long. It is important to note that the default behavior of collecting one memento per month does not have the granularity to capture behaviors that typically happen on a small time frame; to capture those, the flag [-e] to collect all mementos for an account should be used. The Republican political commentator @mitchellvii lost followers in two recorded incidents: from the 1st to the 4th of January 2017, @mitchellvii lost 1270 followers, and from the 15th to the 17th of April 2017, he lost 1602 followers. Using only the Twitter API to collect follower growth would not reveal this phenomenon.


  • Python 3
  • R* (to create graph)
  • bs4
  • urllib
  • archivenow* (push to archive)
  • datetime* (push to archive)

How to run the script:

$ git clone
$ cd FollowerCountHistory
$ ./ [-h] [-g] [-e] [-p | -P] <twitter-username-without-@> 


The program will create a folder named <twitter-username-without-@>. This folder will contain two .csv files. One, labeled <twitter-username-without-@>.csv, will contain the dates collected, the number of followers for each date, and the URI of the corresponding memento. The other, labeled <twitter-username-without-@>-Error.csv, will contain the dates of the mementos for which the follower count could not be collected, along with the reason why. All files and folders are named after the provided Twitter username, which is first sanitized for filesystem safety.

If the flag [-g] is used, then the script will create an image <twitter-username-without-@>-line.png of the data plotted on a line chart created by the follower_count_linechart.R script. An example of that graph is shown as the heading image for the user @USAGym, the official USA Olympic gymnastics team. The popularity of the page changes with the cycle of the Summer Olympics, evidenced by most of the follower growth occurring in 2012 and 2016.

Example Output:

./ -g -p USAGym
242 archive points found
Not Pushing to Archive. Last Memento Within Current Month.
null device 

cd usagym/; ls
usagym.csv  usagym-Error.csv  usagym-line.png

How it works:

$ ./ --help

usage: [-h] [-g] [-p | -P] [-e] uname

Follower Count History. Given a Twitter username, collect follower counts from
the Internet Archive.

positional arguments:

uname       Twitter username without @

optional arguments:

-h, --help  show this help message and exit
-g          Generate a graph with data points
-p          Push to Internet Archive
-P          Push to all archives available through ArchiveNow
-e          Collect every memento, not just one per month

First, the timemap, the list of all mementos for that URI, is collected. Then, the script extracts the dates from the timemap for each memento. Finally, it dereferences each memento and extracts the follower count if all of the following apply:
    1. A previously created .csv of the name the script would generate does not contain the date.
    2. The memento is not in the same month as a previously collected memento, unless [-e] is used.
    3. The page format can be interpreted to find the follower count.
    4. The follower count number can be converted to an Arabic numeral.
A .csv is created, or appended to, to contain the date, number of followers, and memento URI for each collected data point.
An error .csv is created, or appended to, with the date and memento URI of each data point that was not collected, along with the reason why. This file will accumulate duplicate entries across repeated runs, because old entries are not deleted when new errors are written.
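The first two steps, collecting the timemap and keeping one memento per month, can be sketched as follows. The function and regex names are mine, and the sketch parses an already-downloaded link-format timemap rather than fetching one over the network:

```python
import re
from datetime import datetime

# Matches one memento entry in a link-format timemap (assumes the common
# Internet Archive ordering: URI, then rel, then datetime).
MEMENTO_RE = re.compile(
    r'<([^>]+)>;\s*rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"')

def mementos_per_month(timemap_text, every=False):
    """Return (datetime, URI) pairs, keeping one memento per calendar
    month unless every=True (the [-e] behavior)."""
    seen, out = set(), []
    for uri, dt in MEMENTO_RE.findall(timemap_text):
        when = datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S GMT")
        key = (when.year, when.month)
        if every or key not in seen:
            seen.add(key)
            out.append((when, uri))
    return out
```

Each surviving memento URI would then be dereferenced and scraped for its follower count.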

If the [-g] flag is used, a .png of the line chart will be created "<twitter-username-without-@>-line.png".
If the [-p] flag is used, the URI will be pushed to the Internet Archive to create a new memento if there is no current memento.
If the [-P] flag is used, the URI will be pushed to all archives available through archivenow to create new mementos if there is no current memento in Internet Archive.
If the [-e] flag is used, every memento will be collected instead of collecting just one per month.

As a note for future use, if the Twitter layout undergoes another change, the code will need to be updated to continue successfully collecting data.

Special thanks to Orkun Krand, whose work I am continuing.
--Miranda Smith (@mir_smi)

Monday, March 12, 2018

2018-03-12: NEH ODH Project Directors' Meeting

Michael and I attended the NEH Office of Digital Humanities (ODH) Project Directors' Meeting and the "ODH at Ten" celebration (#ODHatTen) on February 9 in DC.  We were invited because of our recent NEH Digital Humanities Advancement Grant, "Visualizing Webpage Changes Over Time" (described briefly in a previous blog post when the award was first announced), which is joint work with Pamela Graham and Alex Thurman from Columbia University Libraries and Deborah Kempe from the Frick Art Reference Library and NYARC.

The presentations were recorded, so I expect to see a set of videos available in the future, as was done for the 2014 meeting (my 2014 trip report).

The afternoon keynote was given by Kate Zwaard, Chief of National Digital Initiatives at the Library of Congress. She highlighted the great work being done at LC Labs.

After the keynote, each project director was allowed 3 slides and 3 minutes to present an overview of their newly funded work.  There were 45 projects highlighted and short descriptions of each are available through the award announcements (awarded in August 2017, awarded in December 2017).  Remember, video is coming soon for all of the 3-minute lightning talks.

Here are my 3 slides, previewing our grid, animation/slider, and timeline views for visualizing significant webpage changes over time.

Visualizing Webpage Changes Over Time from Michele Weigle

Following the lightning talks, the ODH at Ten celebration began with a keynote by Honorable John Unsworth, NEH National Council Member and University Librarian and Dean of Libraries at the University of Virginia.

I was honored to be invited to participate in the closing panel highlighting the impact that ODH support had on our individual careers and looking ahead to future research directions in digital humanities. 
Panel: Amanda French (George Washington), Jesse Casana (Dartmouth College), Greg Crane (Tufts), Julia Flanders (Northeastern), Dan Cohen (Northeastern),  Michele Weigle (Old Dominion), Matt Kirschenbaum (University of Maryland)

Thanks to the ODH staff, especially ODH Director Brett Bobley and our current Program Officer Jen Serventi, for organizing a great meeting.  It was also great to be able to catch up with our first ODH Program Officer, Perry Collins. We are so appreciative of the support for our research from NEH ODH.

Here are more tweets from our day at ODH:


Sunday, March 4, 2018

2018-03-04: Installing Stanford CoreNLP in a Docker Container

Fig. 1: Example of Text Labeled with the CoreNLP Part-of-Speech, Named-Entity Recognizer and Dependency Annotators.
The Stanford CoreNLP suite provides a wide range of important natural language processing applications such as Part-of-Speech (POS) Tagging and Named-Entity Recognition (NER) Tagging. CoreNLP is written in Java and there is support for other languages. I tested a couple of the latest Python wrappers that provide access to CoreNLP but was unable to get them working due to different environment-related complications. Fortunately, with the help of Sawood Alam, our very able Docker campus ambassador at Old Dominion University, I was able to create a Dockerfile that installs and runs the CoreNLP server (version 3.8.0) in a container. This eliminated the headaches of installing the server and also provided a simple method of accessing CoreNLP services through HTTP requests.
How to run the CoreNLP server on localhost port 9000 from a Docker container
  1. Install Docker if not already available
  2. Pull the image from the repository and run the container:
Using the server
The server can be used either from the browser or the command line or custom scripts:
  1. Browser: To use the CoreNLP server from the browser, open your browser and visit http://localhost:9000/. This presents the user interface (Fig. 1) of the CoreNLP server.
  2. Command line (NER example):
    Fig. 2: Sample request URL sent to the Named Entity Annotator 
    To use the CoreNLP server from the terminal, learn how to send requests to the particular annotator from the CoreNLP usage webpage, or learn from the request URL the browser (1.) sends to the server. For example, this request URL was sent to the server from the browser (Fig. 2), and corresponds to the following command that uses the Named-Entity Recognition system to label the supplied text:
  3. Custom script (NER example): I created a Python function nlpGetEntities() that uses the NER annotator to label a user-supplied text.
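A sketch of that request pattern using only the standard library follows; it is not the exact nlpGetEntities() code, and it assumes the container above is listening on port 9000:

```python
import json
import urllib.parse
import urllib.request

def corenlp_url(annotators="ner", host="http://localhost:9000"):
    """Build the request URL the server expects: a 'properties' query
    parameter carrying URL-encoded JSON."""
    props = json.dumps({"annotators": annotators, "outputFormat": "json"})
    return host + "/?" + urllib.parse.urlencode({"properties": props})

def nlp_get_entities(text, host="http://localhost:9000"):
    """POST text to a running CoreNLP container and return the tokens
    the NER annotator labeled with something other than 'O'."""
    req = urllib.request.Request(corenlp_url("ner", host),
                                 data=text.encode("utf-8"))
    doc = json.loads(urllib.request.urlopen(req).read().decode("utf-8"))
    return [(tok["word"], tok["ner"])
            for sent in doc["sentences"]
            for tok in sent["tokens"] if tok.get("ner", "O") != "O"]
```

Calling nlp_get_entities("Barack Obama was born in Hawaii.") against a running server would return the PERSON and LOCATION tokens as (word, tag) pairs.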
To stop the server, issue the following command: 
The Dockerfile I created targets CoreNLP version 3.8.0 (2017-06-09). There is a newer version of the service (3.9.1), and it should be easy to adapt the Dockerfile to install it by replacing all occurrences of "2017-06-09" with "2018-02-27". However, I have not tested this, since version 3.9.1 is only marginally different from version 3.8.0 for my use case, and I have not run my application benchmark against it. 


Tuesday, February 27, 2018

2018-02-27: Summary of Gathering Alumni Information from a Web Social Network

While researching my dissertation topic (slides 2--28) on social media profile discovery, I encountered a related paper titled Gathering Alumni Information from a Web Social Network, written by Gabriel Resende Gonçalves, Anderson Almeida Ferreira, and Guilherme Tavares de Assis, which was published in the proceedings of the 9th IEEE Latin American Web Congress (LA-WEB). In this paper, the authors detail a semi-automated method for gathering information about alumni of a given undergraduate program at Brazilian higher education institutions. Specifically, they use the Google Custom Search Engine (CSE) to identify candidate LinkedIn pages based on a comparative evaluation against similar pages in their training set. The authors contend their process finds alumni efficiently, facilitated by focused crawling of data the alumni themselves have posted publicly on social networks. The proposed methodology consists of three main modules and two data repositories, which are depicted in Figure 1. Using this functional architecture, the authors constructed a tool that gathers professional data on the alumni of undergraduate programs of interest, then classifies each associated HTML page to determine relevance. A summary of their methodology is presented here.

Functional architecture of the proposed method
Figure 1 - Functional architecture of the proposed method


The first repository, Pages Repository, stores the web pages from the initial set of data samples which are used to start the classification process. This set is comprised of alumni lists obtained from five universities across Brazil. The lists contain the names of students enrolled between 2000 and 2010 in undergraduate programs, namely Computer Science at three institutions, Metallurgical Engineering at one institution, and Chemistry at one institution. The total number of alumni available on all lists is 6,093. For the purpose of validation, a random set of 15 alumni is extracted from each list as training examples during each run of their classifier. The second repository, Final Database, is the database where academic data on each alumnus is stored for further analysis.


The first module, Searcher, determines the candidate pages from a Google result set that might belong to the alumni group. LinkedIn is the social network of choice, from which the authors leverage public pages on the web that have been indexed by a search engine. The search is initiated using a combination of the first, middle, and last names of a given alumnus; then, relevant data concerning the undergraduate program, program degree, and institution are extracted from the candidate pages. The authors chose not to search using LinkedIn's Application Programming Interface (API) due to its inherent limitations. Specifically, the API requires authentication by a registered LinkedIn user, and searches are restricted to the first-degree connections of the user conducting the search. As an alternative, the authors use the Google Custom Search Engine, which provides access to Google's massive repository of indexed pages but is limited to 100 free searches per day, each returning up to 100 results.

We should note that in the years since this paper was published in 2014, LinkedIn has instituted a number of security measures to impede data harvesting of public profiles. It employs a series of automated tools (FUSE, Quicksand, Sentinel, and Org Block) to monitor suspicious activity and block web scraping, and requests are throttled based on the requester's IP address (see hiQ Labs v. LinkedIn Corporation). Anonymous viewing of a large number of public LinkedIn profile pages, even if retrieved using Google's boolean search criteria, is not always possible. After an undisclosed number of public profile views, LinkedIn forces the user to either sign up or log in as a way to thwart scraping by third-party applications (Figure 2).

LinkedIn Anonymous Search Limit Reached
Figure 2 - LinkedIn Anonymous Search Limit Reached
The second module, Filter, determines the significance of the candidate pages provided by the Searcher module via the Pages Repository. The classification process determines the similarity among pages using the academic information on the LinkedIn page as terms which are then separated into categories that describe the undergraduate program, institution, and degree. The authors proceed to use Cosine Similarity to build a relationship between candidate pages from the Searcher module and the initial training set based on term frequency and specify a 30% threshold for the minimum percentage of pages on which a term must appear.
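As an illustration of the Filter module's core computation, cosine similarity over raw term-frequency vectors can be written in a few lines. This is a sketch; the authors' category handling and 30% term-frequency threshold are more involved:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between term-frequency vectors of two texts:
    dot(a, b) / (|a| * |b|), with 0.0 for empty inputs."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Candidate pages whose academic terms score close to the training pages would pass the filter; pages sharing no terms score 0.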

The third module, Extraction, extracts the demographic and academic information from the HTML pages returned by the Filter module using regular expressions as shown in Figure 3. The extracted information is stored in the Final Database for further analysis using the Naive Bayes bag-of-words model to identify specific alumni of the desired undergraduate program.
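The paper's actual regular expressions appear in Figure 3 and are tailored to LinkedIn's markup; as a purely hypothetical illustration of the technique, an expression for pulling degree, program, and institution out of a profile line might look like:

```python
import re

# Hypothetical pattern in the spirit of Figure 3; the paper's real
# expressions target LinkedIn's HTML structure, not free text.
DEGREE_RE = re.compile(
    r"(?P<degree>Bachelor|B\.S\.|Licenciatura)[^,]*,"
    r"\s*(?P<program>[^,]+),\s*(?P<institution>.+)")

def extract_academic(line):
    """Return {degree, program, institution} from one profile line, or None."""
    m = DEGREE_RE.search(line)
    return m.groupdict() if m else None
```

Matched fields would then be stored in the Final Database for the Naive Bayes analysis described above.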

Figure 3 - Regular Expressions Used by Extraction Module

Results and Takeaways

The authors acknowledge that obtaining an initial list of alumni names is not a major obstacle. However, collecting the initial set of sample pages from a social network, such as LinkedIn, may be time consuming and labor intensive even with small data sets. Their evaluation, as shown in Figure 4, indicates satisfactory precision and the methodology proposed in their paper is able to find an average of 7.5% to 12.2% of alumni for undergraduate programs with more than 1,000 alumni.

Pages Retrieved and Precision Results For Proposed Method and Baseline
Figure 4 - Pages Retrieved and Precision Results For Proposed Method and Baseline
Given the highly structured design of LinkedIn HTML pages, we would expect the Filter and Extraction modules to identify and successfully retrieve a higher percentage of alumni, even without applying a machine learning technique. The bulk of this paper's research is predicated upon access to public data on the web. If social media networks choose to erect barriers that impede the collection of this public information, continued research by these authors and others will be significantly impacted. With regard to LinkedIn public profiles, we can only await the outcome of pending litigation, which will determine who controls publicly available data.

--Corren McCoy (@correnmccoy)

Gonçalves, G. R., Ferreira, A. A., & de Assis, G. T. (2014, October). Gathering alumni information from a web social network. In Web Congress (LA-WEB), 2014 9th Latin American (pp. 100-108). IEEE.