This is an introduction to writing new behaviors for the Browsertrix crawler and extension. Behaviors cover a variety of use cases, such as automatically scrolling through a page to load more content, clicking through UI elements that open pop-ups, modals, and sidebars, or performing specific actions on social media sites and other non-conventional websites, particularly Single Page Applications (SPAs).
We're going to write a TikTok video behavior that scrolls through and expands each comment thread.
To create a new behavior, we first create a new file in the src/site/ directory. This file defines a JavaScript module containing a class that extends the Behavior class and implements a Symbol.asyncIterator method, which acts as the "entry" to the behavior's actions.
We'll name our file tiktok.js and add the basic elements needed to define our behavior:
/* src/site/tiktok.js */
import { Behavior } from "../lib/behavior";
export class TikTokVideoBehavior extends Behavior {
constructor() {
super();
}
async* [Symbol.asyncIterator]() {
yield "TikTok Video Behavior Complete";
}
}
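To see why Symbol.asyncIterator works as an entry point, it may help to look at how any object with that method can be consumed. The following is a minimal sketch that runs on its own: DemoBehavior is a hypothetical stand-in that does not extend the real Behavior class, and collect is a helper invented for illustration.

```javascript
// Hypothetical stand-in for a behavior; it does not use the real Behavior base class.
class DemoBehavior {
  async* [Symbol.asyncIterator]() {
    yield "step 1";
    yield "step 2";
    yield "Demo Behavior Complete";
  }
}

// Any object that defines Symbol.asyncIterator can be consumed with for await...of.
async function collect(behavior) {
  const results = [];
  for await (const result of behavior) {
    results.push(result);
  }
  return results;
}

collect(new DemoBehavior()).then((results) => console.log(results));
```

Each yielded value reaches the consumer one at a time, in the order it was yielded.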
The next step is to include our new behavior in the src/site/index.js module, like so:
/* Other video behaviors */
import { TikTokVideoBehavior } from "./tiktok";
const siteBehaviors = [
/* Other video behaviors */
TikTokVideoBehavior,
];
export default siteBehaviors;
Next we define an isMatch method for our TikTokVideoBehavior class. This method checks whether the behavior should run on the current page. Let's use a regular expression to match against the current URL:
class TikTokVideoBehavior extends Behavior {
// ...
static isMatch() {
const pathRegex = /https:\/\/(www\.)?tiktok\.com\/@.+\/video\/\d+/;
return window.location.href.match(pathRegex);
}
}
Note that the isMatch method is static, meaning it is defined on the class itself. When this function returns a truthy value (String.prototype.match returns an array on a match and null otherwise), our code in the Symbol.asyncIterator method will run.
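Before loading anything into a browser, the regular expression can be checked on its own. This sketch reuses the pattern from isMatch but takes the URL as an argument so that it runs outside a page; matchesVideoUrl and the second sample URL are made up for illustration.

```javascript
const pathRegex = /https:\/\/(www\.)?tiktok\.com\/@.+\/video\/\d+/;

// Mirrors the body of isMatch, but takes the URL as a parameter
// instead of reading window.location.href.
function matchesVideoUrl(href) {
  return Boolean(href.match(pathRegex));
}

const samples = [
  "https://www.tiktok.com/@webbstyles/video/7143026261693123882",
  "https://tiktok.com/@example/video/123", // hypothetical URL without "www."
  "https://www.tiktok.com/@webbstyles", // profile page, no /video/ segment
];

for (const href of samples) {
  console.log(matchesVideoUrl(href), href);
}
```

The first two samples match and the profile URL does not, since it lacks the /video/ segment followed by digits.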
We're now ready to run behaviors on TikTok video pages. browsertrix-behaviors
relies on XPath queries to find DOM nodes and interact with them.
After looking through the page's HTML, you'll see that the comment threads all live inside an element whose class name contains its element type followed by CommentListContainer (for example, DivCommentListContainer).
An XPath query that looks for div elements with a matching class looks something like this:
//div[contains(@class, 'CommentListContainer')]
Let's define a constant to reference these queries:
const Q = {
commentListContainer: "//div[contains(@class, 'CommentListContainer')]",
}
In order to find an element via XPath, we import the xpathNode function from the src/lib/utils.js module:
import { xpathNode } from "../lib/utils";
Next, we expand the Symbol.asyncIterator method like so:
class TikTokVideoBehavior extends Behavior {
// ...
async* [Symbol.asyncIterator]() {
const commentList = xpathNode(Q.commentListContainer);
console.log("[LOG] List Container", commentList);
yield "TikTok Video Behavior Complete";
}
}
For now we're just logging out the result returned by the query.
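For intuition about what a helper like xpathNode might do, the sketch below wraps the standard document.evaluate DOM API. This is an assumption made for illustration, not the library's actual implementation: firstMatch is a hypothetical name, and fakeDocument is a stub that lets the sketch run outside a browser.

```javascript
// Numeric value of XPathResult.FIRST_ORDERED_NODE_TYPE in the DOM standard.
const FIRST_ORDERED_NODE_TYPE = 9;

// Hypothetical helper: returns the first node matching `query` under `root`, or null.
function firstMatch(doc, query, root = doc) {
  const result = doc.evaluate(query, root, null, FIRST_ORDERED_NODE_TYPE, null);
  return result.singleNodeValue;
}

// Stub document so the sketch runs outside a browser; it "matches" one known query.
const fakeDocument = {
  evaluate(query) {
    const known = "//div[contains(@class, 'CommentListContainer')]";
    return { singleNodeValue: query === known ? { tagName: "DIV" } : null };
  },
};

console.log(firstMatch(fakeDocument, "//div[contains(@class, 'CommentListContainer')]"));
console.log(firstMatch(fakeDocument, "//div[contains(@class, 'Missing')]"));
```

In a browser console, passing the real document object as the first argument would run the same query against the live page.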
Let's test the pieces we've built so far in the browser. At this point your src/site/tiktok.js module should look something like this:
import { Behavior } from "../lib/behavior";
import { xpathNode } from "../lib/utils";
const Q = {
commentListContainer: "//div[contains(@class, 'CommentListContainer')]"
};
export class TikTokVideoBehavior extends Behavior {
static get name() {
return "TikTokVideo";
}
static isMatch() {
const pathRegex = /https:\/\/(www\.)?tiktok\.com\/@.+\/video\/\d+/;
return window.location.href.match(pathRegex);
}
constructor() {
super();
}
async* [Symbol.asyncIterator]() {
const commentList = xpathNode(Q.commentListContainer);
console.log("[LOG] List Container", commentList);
yield "TikTok Video Behavior Complete";
}
}
We'll build what we have so far via the build-dev script defined in package.json. Make sure you have yarn installed, as well as the project's dependencies, by running:
$ npm install -g yarn
$ yarn install
Next, we build our behaviors in development mode:
$ yarn run build-dev
This will compile and output our behaviors into dist/behaviors.js. We can copy this code and run it in our browser to test behaviors directly. Make sure you're viewing a TikTok video; we can then test our code in Chrome, for example, by following these steps:
- Open the Developer Tools
- Click the Sources tab
- In the left sidebar, open the Snippets tab
- Click the + New Snippet button
- Build the development output via yarn run build-dev
- Copy the output from dist/behaviors.js
  - Running cat dist/behaviors.js | pbcopy on macOS will copy the contents to your clipboard. On Linux, where pbcopy is usually not available, xclip -selection clipboard < dist/behaviors.js is a common equivalent if xclip is installed.
- Paste the contents into the new Snippet
- At the bottom of the window, click the "Play" button or press Ctrl/Cmd+Enter
Our code is now loaded into the browser, and we can interact with it directly.
In order to run our new behavior, open the Console tab in the Developer Tools and run the following code:
self.__bx_behaviors.run({
autofetch: false,
autoplay: false,
autoscroll: false,
siteSpecific: true
});
A few outputs should appear in your console that look like the following:
> {data: 'Starting Site-Specific Behavior: TikTokVideo', type: 'debug'}
> [LOG] List Container <div class="tiktok-...-DivCommentListContainer ...">…</div>
> {data: 'Waiting for behaviors to finish', type: 'debug'}
> {data: 'TikTok Video Behavior Complete', type: 'info'}
> {data: {state: {}, msg: 'done!'}, type: 'info'}
> {data: 'All Behaviors Done for https://www.tiktok.com/@webbstyles/video/7143026261693123882', type: 'debug'}
Since we've identified the element containing the video's comments, we can now write an XPath query that matches against them. This query is very similar to the one we're using for the commentListContainer, with one notable difference:
const Q = {
commentListContainer: "//div[contains(@class, 'CommentListContainer')]",
commentItemContainer: "div[contains(@class, 'CommentItemContainer')]",
};
Note that there's no // prepended to this query. A leading // selects matching descendants anywhere in the document, searching recursively from the root. While this is useful for locating our CommentListContainer, it is less appropriate for a query that needs to target specific nodes relative to another element. The new CommentItemContainer query is intended for a helper function that combines it with other parameters, such as a parent node.
Let's now import the iterChildMatches function from our utils module:
import { iterChildMatches, xpathNode } from "../lib/utils";
// ^---- New import
This function takes a query just as xpathNode does, along with a parent node that specifies where we'd like to search. The result is an iterator that yields each element matching the query in turn and, before exhausting its search, waits for a new match to appear. This allows us to "infinitely scroll," if you will, by matching against both the existing elements and any that may appear as a result of scrolling or other UI behaviors.
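The "wait before giving up" idea can be sketched without a DOM. The example below is not how iterChildMatches is implemented; it is a hypothetical iterWithWait generator over a plain array, with assumed timeout and polling values, that shows the same pattern of yielding existing items and then briefly waiting for new ones.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Yields items from `list` in order. When it reaches the end, it polls for up
// to `timeout` ms for newly appended items before finishing.
async function* iterWithWait(list, timeout = 200, interval = 20) {
  let index = 0;
  while (true) {
    if (index < list.length) {
      yield list[index++];
      continue;
    }
    let waited = 0;
    while (index >= list.length && waited < timeout) {
      await sleep(interval);
      waited += interval;
    }
    if (index >= list.length) return; // nothing new appeared in time
  }
}

async function demo() {
  const comments = ["comment 1", "comment 2"];
  // Simulates content that loads as a result of scrolling.
  setTimeout(() => comments.push("comment 3"), 50);
  const seen = [];
  for await (const comment of iterWithWait(comments)) {
    seen.push(comment);
  }
  return seen;
}

demo().then((seen) => console.log(seen));
```

The third comment is picked up even though it did not exist when iteration began, which is the property that makes this style of iterator suitable for infinitely scrolling lists.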
We can now generate this list and iterate through it using a for await...of loop:
export class TikTokVideoBehavior extends Behavior {
// ...
async* [Symbol.asyncIterator]() {
const commentList = xpathNode(Q.commentListContainer);
const commentItems = iterChildMatches(Q.commentItemContainer, commentList);
for await (const item of commentItems) {
// ... do something with each comment
}
yield "TikTok Video Behavior Complete";
}
}
Now that we're iterating through each comment, we can call the scrollIntoView method to scroll through them. This method takes a set of options that the base Behavior class provides for us:
export class TikTokVideoBehavior extends Behavior {
// ...
async* [Symbol.asyncIterator]() {
const commentList = xpathNode(Q.commentListContainer);
const commentItems = iterChildMatches(Q.commentItemContainer, commentList);
for await (const item of commentItems) {
item.scrollIntoView(this.scrollOpts);
}
yield "TikTok Video Behavior Complete";
}
}
We're now scrolling through our comments, but the asyncIterator we've written for our behavior does not yield any results until the very end, when it has completed. This can cause issues down the line, specifically for the ArchiveWeb.page extension, which relies on the ability to pause and resume site behaviors.
Additionally, our behavior isn't logging any results or accumulating totals as it crawls through the page. We can resolve this by using the getState method defined in the Behavior base class:
// in the async* [Symbol.asyncIterator] method
for await (const item of commentItems) {
item.scrollIntoView(this.scrollOpts);
yield this.getState("View thread", "threads");
}
Our behavior now yields a result each time we scroll to a comment, and the extension can now pause and resume the scrolling behavior.
Since we're able to identify each comment, we can now look for specific parts of the element, like buttons that perform actions on the page. Let's define a new method called expandThread. This method will take a comment item and look for the View more replies button, which we identify with the following query:
const Q = {
// ...
viewMoreReplies: ".//p[contains(@class, 'ReplyActionText')]",
}
Next, our expandThread method:
export class TikTokVideoBehavior extends Behavior {
// ...
async* expandThread(item) {
const viewMore = xpathNode(Q.viewMoreReplies, item);
if (!viewMore) return;
// ... do something with the "View more replies" button
}
// ...
}
As you can see, we've targeted the "View more replies" button with our query using xpathNode; note that it also takes a "parent" argument that specifies where to look. Our new method then checks whether the button exists before continuing. Let's import a handy function that will both scroll the button into view and click it after a provided amount of time:
import { iterChildMatches, scrollAndClick, xpathNode } from "../lib/utils";
// ^---- New import
We can now use scrollAndClick in our new method, as well as this.getState to mark our progress:
export class TikTokVideoBehavior extends Behavior {
// ...
async* expandThread(item) {
const viewMore = xpathNode(Q.viewMoreReplies, item);
if (!viewMore) return;
await scrollAndClick(viewMore, 500);
yield this.getState("Expand thread", "expandedThreads");
}
// ...
}
Finally, let's call our new method from the Symbol.asyncIterator method:
export class TikTokVideoBehavior extends Behavior {
// ...
async* [Symbol.asyncIterator]() {
const commentList = xpathNode(Q.commentListContainer);
const commentItems = iterChildMatches(Q.commentItemContainer, commentList);
for await (const item of commentItems) {
item.scrollIntoView(this.scrollOpts);
yield* this.expandThread(item);
}
yield "TikTok Video Behavior Complete";
}
}
Note that we use a yield* expression in order to yield each result of the expandThread method one at a time. This maintains our fine-grained control over pausing and resuming the behavior.
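The delegation performed by yield* can be seen in isolation with plain strings. In this sketch the generators only mimic the shape of the behavior's methods; the messages and the collect helper are made up for illustration.

```javascript
// Inner generator: stands in for expandThread.
async function* expandThread(item) {
  yield `Expand thread: ${item}`;
  yield `View more replies: ${item}`;
}

// Outer generator: yield* forwards every value the inner generator yields.
async function* run(items) {
  for (const item of items) {
    yield `View thread: ${item}`;
    yield* expandThread(item);
  }
}

async function collect(iterable) {
  const results = [];
  for await (const result of iterable) {
    results.push(result);
  }
  return results;
}

collect(run(["a"])).then((results) => console.log(results));
```

The consumer receives three separate values for the one item. A plain yield expandThread(item) would instead hand the consumer a single generator object and skip the inner steps.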
So far only the initial "View more replies" button is clicked and the resulting content is loaded, but in order to capture a video's entire comment section we'll need to go beyond loading just one set of replies.
One tactic we can use is a recursive method that clicks the View more button that appears when more comments are available to load in each thread after the initial expansion. There are a few things to consider:
- The View more button doesn't exist until we've already clicked View more replies.
- The button may not appear at all if the end of the thread has been reached.
- Clicking the button once "destroys" it, and once a new group of comments has loaded, a new button accompanies them.
The main tool we'll use to alleviate some of these complexities is called waitUntilNode:
import { iterChildMatches, scrollAndClick, waitUntilNode, xpathNode } from "../lib/utils";
// ^---- New import
Similar to our other query helper functions, waitUntilNode takes an XPath query along with a parent node, but it returns a Promise, as it waits for some period of time before giving up on the query. It also accepts a third argument, a node instance that is compared against what the query returns. This is particularly needed for item 3 in the above list.
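The role of that third argument can be sketched without a DOM. The function below is a hypothetical waitUntilDifferent, not the real waitUntilNode; it polls a getter and resolves only once it sees a value that exists and differs from the previous one, with timeout and interval values chosen arbitrarily.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: polls `getNode` until it returns something that is
// not `prev`, or resolves to null once `timeout` ms have elapsed.
async function waitUntilDifferent(getNode, prev = null, timeout = 200, interval = 20) {
  let waited = 0;
  while (waited <= timeout) {
    const node = getNode();
    if (node && node !== prev) return node;
    await sleep(interval);
    waited += interval;
  }
  return null;
}

async function demo() {
  let button = { label: "View more (1)" };
  const first = await waitUntilDifferent(() => button);
  // Clicking "destroys" the button; a replacement appears a moment later.
  button = null;
  setTimeout(() => { button = { label: "View more (2)" }; }, 40);
  const second = await waitUntilDifferent(() => button, first);
  // No replacement this time: the same button lingers, so the wait times out.
  const third = await waitUntilDifferent(() => button, second, 100);
  return [first, second, third];
}

demo().then((results) => console.log(results));
```

Comparing against the previous node is what keeps the caller from treating a stale button as a freshly loaded one.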
Let's add the last query we'll need to crawl comment threads, which targets the View more buttons:
const Q = {
// ...
viewMoreThread: ".//p[starts-with(@data-e2e, 'view-more')]"
};
Now we can define a crawlThread method that utilizes our new query as well as the waitUntilNode function:
export class TikTokVideoBehavior extends Behavior {
// ...
async* crawlThread(parentNode, prev = null) {
const next = await waitUntilNode(Q.viewMoreThread, parentNode, prev);
if (!next) return;
}
// ...
}
Similarly to expandThread, we exit the function if no button is found. We added a prev argument to the method with a default value of null. This will allow us to call the method recursively with our next node if it exists.
Let's complete our new method like so:
export class TikTokVideoBehavior extends Behavior {
// ...
async* crawlThread(parentNode, prev = null) {
const next = await waitUntilNode(Q.viewMoreThread, parentNode, prev);
if (!next) return;
await scrollAndClick(next, 500);
yield this.getState("View more replies", "replies");
yield* this.crawlThread(parentNode, next);
}
// ...
}
Using recursion is the key to this method, as it relies on previous versions of a similar node in order to iterate through the entirety of a comment thread.
There is a slight edge case that we must account for, however. In some instances, an element matching our viewMoreThread query will appear with empty text. For our purposes, this indicates either that some properties of the new element have not finished appearing or that our thread has no more replies to load. We could handle this with an additional logic check, but a more elegant solution lies within our XPath query:
const Q = {
// ...
viewMoreThread: ".//p[starts-with(@data-e2e, 'view-more') and string-length(text()) > 0]"
};
By using string-length(text()), our query can check the length of the target element's inner text. It now ignores the edge case in which a blank button appears on the page.
Lastly, our expandThread method needs to call our new crawling method after it's finished expanding the first round of replies:
export class TikTokVideoBehavior extends Behavior {
// ...
async* expandThread(item) {
const viewMore = xpathNode(Q.viewMoreReplies, item);
if (!viewMore) return;
await scrollAndClick(viewMore, 500);
yield this.getState("Expand thread", "expandedThreads");
yield* this.crawlThread(item, null);
// ^ Begin crawling through additional replies
}
// ...
}
We pass null as our second argument to crawlThread in order to specify that no previous element exists. This allows the first waitUntilNode call to return the first element matching our viewMoreThread query without checking for a previous version of the element.
While passing options to our behavior through the extension isn't currently available, we can both plan for that future functionality as well as allow code that injects these behaviors to use them.
One example of a useful option is breadth, that is, how many times we'd like to expand each thread before moving on to the next. In some cases we may want to see every reply, but for videos with a large number of comments it's often more practical to see only a limited number of top replies.
We'll define the breadth option as one of two types:
- a number representing how many times we want to click the "more replies" button
- a symbol that tells the behavior to look through every reply
Since the latter option is how our behavior has worked all along, we'll define it as the default when no breadth option is provided.
First let's define a symbol:
export const BREADTH_ALL = Symbol("BREADTH_ALL");
Next, we modify our class constructor:
export class TikTokVideoBehavior extends Behavior {
// ...
constructor({ breadth = BREADTH_ALL } = {}) {
super();
this.setOpts({ breadth });
}
// ...
}
As we can see, our behavior expects all options to be passed as an object to the class constructor. We then use the setOpts method included in the base class, which stores our breadth option for later use.
Let's define a breadthComplete method that checks whether a number has reached the number of iterations defined in our behavior's options:
export class TikTokVideoBehavior extends Behavior {
// ...
breadthComplete(iter) {
const breadth = this.getOpt("breadth");
return breadth !== BREADTH_ALL && breadth <= iter;
}
// ...
}
This method uses the corresponding getOpt method defined in the base class.
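The comparison in breadthComplete is easy to check in isolation. This sketch restates it as a standalone function that takes breadth as a parameter rather than reading it from the behavior's options; that signature is an adaptation for illustration only.

```javascript
const BREADTH_ALL = Symbol("BREADTH_ALL");

// Standalone version of breadthComplete: `breadth` is passed in directly
// instead of being read via this.getOpt("breadth").
function breadthComplete(breadth, iter) {
  return breadth !== BREADTH_ALL && breadth <= iter;
}

console.log(breadthComplete(BREADTH_ALL, 100)); // false: never complete
console.log(breadthComplete(0, 0)); // true: do not expand at all
console.log(breadthComplete(3, 2)); // false: one more expansion allowed
console.log(breadthComplete(3, 3)); // true: limit reached
```

The order of the two conditions matters: comparing a Symbol with <= throws a TypeError, so the BREADTH_ALL check has to short-circuit first.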
We can now use our new helper method to check whether we want to expand any threads at all in our Symbol.asyncIterator method:
export class TikTokVideoBehavior extends Behavior {
// ...
async* [Symbol.asyncIterator]() {
const commentList = xpathNode(Q.commentListContainer);
const commentItems = iterChildMatches(Q.commentItemContainer, commentList);
for await (const item of commentItems) {
item.scrollIntoView(this.scrollOpts);
yield this.getState("View thread", "threads");
if (this.breadthComplete(0)) continue;
// ^ Continue without expanding the thread if `breadth` is 0
yield* this.expandThread(item);
}
yield "TikTok Video Behavior Complete";
}
// ...
}
Next, we'll modify our crawlThread method:
export class TikTokVideoBehavior extends Behavior {
// ...
async* crawlThread(parentNode, prev = null, iter = 0) {
const next = await waitUntilNode(Q.viewMoreThread, parentNode, prev);
if (!next || this.breadthComplete(iter)) return;
await scrollAndClick(next, 500);
yield this.getState("View more replies", "replies");
yield* this.crawlThread(parentNode, next, iter + 1);
}
// ...
}
We've added an extra iter parameter in order to track how many times we've loaded new replies. This number is incremented on each recursive call. Additionally, we check whether breadthComplete is true when looking for our next button.
Finally, the expandThread method needs to pass an initial iter argument to crawlThread. Since expandThread itself loads one set of replies, our initial number is 1:
export class TikTokVideoBehavior extends Behavior {
// ...
async* expandThread(item) {
const viewMore = xpathNode(Q.viewMoreReplies, item);
if (!viewMore) return;
await scrollAndClick(viewMore, 500);
yield this.getState("Expand thread", "expandedThreads");
yield* this.crawlThread(item, null, 1);
}
// ...
}
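To check how breadth and the starting iter value of 1 interact, the control flow can be simulated without a browser. The sketch below is a simplified model rather than the behavior itself: ThreadSim, its remaining counter, and countClicks are hypothetical, clicks are just counted, and synchronous generators stand in for the async ones.

```javascript
const BREADTH_ALL = Symbol("BREADTH_ALL");

// Simulated thread: `remaining` is how many more batches of replies can load.
class ThreadSim {
  constructor(breadth, remaining) {
    this.breadth = breadth;
    this.remaining = remaining;
    this.clicks = 0;
  }

  breadthComplete(iter) {
    return this.breadth !== BREADTH_ALL && this.breadth <= iter;
  }

  // Stands in for scrollAndClick: loads one batch of replies.
  click() {
    this.clicks += 1;
    this.remaining -= 1;
  }

  *crawlThread(iter = 0) {
    // "No button found" is modeled as no batches remaining.
    if (this.remaining <= 0 || this.breadthComplete(iter)) return;
    this.click();
    yield "View more replies";
    yield* this.crawlThread(iter + 1);
  }

  *expandThread() {
    if (this.remaining <= 0) return;
    this.click();
    yield "Expand thread";
    yield* this.crawlThread(1); // the expansion above counts as iteration 1
  }

  *run() {
    yield "View thread";
    if (this.breadthComplete(0)) return;
    yield* this.expandThread();
  }
}

function countClicks(breadth, remaining) {
  const sim = new ThreadSim(breadth, remaining);
  for (const _step of sim.run()) {
    // Drain the generator; only the click count matters here.
  }
  return sim.clicks;
}

console.log(countClicks(0, 10)); // 0
console.log(countClicks(1, 10)); // 1
console.log(countClicks(3, 10)); // 3
console.log(countClicks(BREADTH_ALL, 4)); // 4
```

Starting crawlThread at 1 is what makes a breadth of 3 mean three clicks in total, with the initial expansion included in the count.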
Congratulations! We've completed a working TikTok video behavior that iterates through each thread and its replies. The final code looks something like this:
import { Behavior } from "../lib/behavior";
import { iterChildMatches, scrollAndClick, waitUntilNode, xpathNode } from "../lib/utils";
const Q = {
commentListContainer: "//div[contains(@class, 'CommentListContainer')]",
commentItemContainer: "div[contains(@class, 'CommentItemContainer')]",
viewMoreReplies: ".//p[contains(@class, 'ReplyActionText')]",
viewMoreThread: ".//p[starts-with(@data-e2e, 'view-more') and string-length(text()) > 0]"
};
export const BREADTH_ALL = Symbol("BREADTH_ALL");
export class TikTokVideoBehavior extends Behavior {
static get name() {
return "TikTokVideo";
}
static isMatch() {
const pathRegex = /https:\/\/(www\.)?tiktok\.com\/@.+\/video\/\d+/;
return window.location.href.match(pathRegex);
}
constructor({ breadth = BREADTH_ALL } = {}) {
super();
this.setOpts({ breadth });
}
breadthComplete(iter) {
const breadth = this.getOpt("breadth");
return breadth !== BREADTH_ALL && breadth <= iter;
}
async* crawlThread(parentNode, prev = null, iter = 0) {
const next = await waitUntilNode(Q.viewMoreThread, parentNode, prev);
if (!next || this.breadthComplete(iter)) return;
await scrollAndClick(next, 500);
yield this.getState("View more replies", "replies");
yield* this.crawlThread(parentNode, next, iter + 1);
}
async* expandThread(item) {
const viewMore = xpathNode(Q.viewMoreReplies, item);
if (!viewMore) return;
await scrollAndClick(viewMore, 500);
yield this.getState("Expand thread", "expandedThreads");
yield* this.crawlThread(item, null, 1);
}
async* [Symbol.asyncIterator]() {
const commentList = xpathNode(Q.commentListContainer);
const commentItems = iterChildMatches(Q.commentItemContainer, commentList);
for await (const item of commentItems) {
item.scrollIntoView(this.scrollOpts);
yield this.getState("View thread", "threads");
if (this.breadthComplete(0)) continue;
yield* this.expandThread(item);
}
yield "TikTok Video Behavior Complete";
}
}
In our first checkpoint we saw how to run our behavior on a webpage. We can also pass our new breadth option when running it from the console. To do so, we pass an object instead of true to the siteSpecific option. We use the string returned by the static name getter to reference our class:
self.__bx_behaviors.run({
autofetch: false,
autoplay: false,
autoscroll: false,
siteSpecific: {
TikTokVideo: { breadth: 3 }
}
});
Running this code on a video page will now expand each thread only three times before moving on to the next.