Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement: Selenium Grid Capability #1123

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion core/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@
<caffeine.version>3.1.8</caffeine.version>
<xsoup.version>0.3.7</xsoup.version>
<awaitility.version>4.2.0</awaitility.version>
<gson.version>2.8.7</gson.version>
</properties>

<build>
Expand All @@ -51,6 +52,7 @@
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<id>attach-test</id>
Expand Down Expand Up @@ -261,7 +263,11 @@
<version>${selenium.version}</version>
<scope>test</scope>
</dependency>

<dependency>
msghasan marked this conversation as resolved.
Show resolved Hide resolved
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>${gson.version}</version>
</dependency>
</dependencies>
<dependencyManagement>
<dependencies>
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
/**
* Licensed to DigitalPebble Ltd under one or more contributor license agreements. See the NOTICE
* file distributed with this work for additional information regarding copyright ownership.
* DigitalPebble licenses this file to You under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License. You may obtain a copy of the
* License at
*
* <p>http://www.apache.org/licenses/LICENSE-2.0
*
* <p>Unless required by applicable law or agreed to in writing, software distributed under the
* License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
* express or implied. See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.digitalpebble.stormcrawler.protocol.selenium;

import com.digitalpebble.stormcrawler.util.ConfUtils;
import java.net.URL;
import java.time.Duration;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;
import org.apache.storm.Config;
import org.openqa.selenium.Dimension;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

public class SeleniumGridImpl extends SeleniumGridProtocol {

private TimerTask timerTask;
msghasan marked this conversation as resolved.
Show resolved Hide resolved
private Timer updateQueue;
private int noOfWorkers = 0;
private String gridAddress;
msghasan marked this conversation as resolved.
Show resolved Hide resolved
private final DesiredCapabilities capabilities = new DesiredCapabilities();

@Override
public void configure(Config conf) {
super.configure(conf);
noOfWorkers = ConfUtils.getInt(conf, "topology.workers", 2);
gridAddress = super.gridAddress;
msghasan marked this conversation as resolved.
Show resolved Hide resolved
capabilities.setBrowserName(
ConfUtils.getString(conf, "selenium.capabilities.browserName", "chrome"));
capabilities.setCapability("newSessionWaitTimeout", 600);
capabilities.setCapability("browserTimeout", 600);
updateQueue = new Timer();
updateQueueOfBrowsers();
}

protected void updateQueueOfBrowsers() {
timerTask =
new TimerTask() {
@Override
public void run() {
try {
List list = getAllNodesList();
msghasan marked this conversation as resolved.
Show resolved Hide resolved
LOG.info("Blocking Queue size: " + driversQueue.size());
while (getSessionsCount(list) > 0) {
int size = list.size();
// Check if the queue size is more than the actual
// no of browsers allowed per worker
// means crawler services are idle because all drivers are in queue
if (driversQueue.size() >= (size / noOfWorkers)) {
SeleniumGridProtocol.Holder holder = driversQueue.take();
long totalTime = System.currentTimeMillis() - holder.getTime();
// clear the queue if the total time spent in the queue is more
// than 4.5 minutes
// As after 5 mintues the driver would be idle and will throw
msghasan marked this conversation as resolved.
Show resolved Hide resolved
// exception
// while we try to use that driver for fetching
if (totalTime > 1000 * 4.5 * 60) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5min is the default of Selenium Grid. Can we make it configurable?

Copy link
Contributor Author

@msghasan msghasan Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean to say can we configure the default idle time of selenium . I do not know I tried it once to increase it to 10 mins but was not successful. If you can help me with this.

Copy link
Contributor

@rzo1 rzo1 Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can configure the default timeout on the server side of selenium grid, so it would be useful to have a config option in SC to adjust it accordingly if it is changed on the grid side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can configure the default timeout on the server side of selenium grid, so it would be useful to have a config option in SC to adjust it accordingly if it is changed on the grid side.

grid.timeout.minutes this is the configuration I am planning to give

driversQueue.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we clear the full queue? My impression was, that the queue is used to hold Driver instances, so we should just remove that specifc driver or did I understand something wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say when the crawler is in a idle mode and has not been running any crawls, the drivers would be sitting idle and would have lost the session .

To check every driver session id would be a costly affair. So its better if we clear the queue and again add it from the grid which are active and available in the grid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here two sets of list are getting handled one the grid handles and the other queue we handle per worker. So its better to clear out the queue and try adding it again from the available nodes from the grid.

}
}
// so that browsers get equally divided among workers
if (driversQueue.size() <= size / noOfWorkers) {
RemoteWebDriver driver = getDriverFromNode();
if (driver != null) {
driversQueue.put(
new SeleniumGridProtocol.Holder(
driver, System.currentTimeMillis()));
LOG.info(
"Placed driver in blocking queue: "
+ driversQueue.size());
}
}
list = getAllNodesList();
}
} catch (Exception e) {
LOG.error(
"Exception while running task for adding driver to the queue",
e);
}
}
};
// update the queue every 5 minutes
updateQueue.schedule(timerTask, 0 * 60 * 1000, 5 * 60 * 1000);
rzo1 marked this conversation as resolved.
Show resolved Hide resolved
}

protected RemoteWebDriver getDriverFromNode() {
int sessionCount = 0;
msghasan marked this conversation as resolved.
Show resolved Hide resolved
RemoteWebDriver driver = null;

try {
LOG.debug("Adding new driver from " + gridAddress);
driver = new RemoteWebDriver(new URL(gridAddress), capabilities);
WebDriver.Timeouts touts = driver.manage().timeouts();
WebDriver.Window window = driver.manage().window();
touts.implicitlyWait(Duration.ofSeconds(6));
msghasan marked this conversation as resolved.
Show resolved Hide resolved
touts.pageLoadTimeout(Duration.ofSeconds(60));
touts.scriptTimeout(Duration.ofSeconds(30));
window.setSize(new Dimension(1980, 1280));
LOG.debug("Inside getDriverFromGrid to set web drivers" + driver.hashCode());
} catch (Exception e) {
msghasan marked this conversation as resolved.
Show resolved Hide resolved
}
return driver;
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,194 @@
/**
* Licensed to DigitalPebble Ltd under one or more contributor license agreements. See the NOTICE
* file distributed with this work for additional information regarding copyright ownership.
* DigitalPebble licenses this file to You under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License. You may obtain a copy of the
* License at
*
* <p>http://www.apache.org/licenses/LICENSE-2.0
*
* <p>Unless required by applicable law or agreed to in writing, software distributed under the
* License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
* express or implied. See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.digitalpebble.stormcrawler.protocol.selenium;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol;
import com.digitalpebble.stormcrawler.protocol.HttpHeaders;
import com.digitalpebble.stormcrawler.protocol.ProtocolResponse;
import com.digitalpebble.stormcrawler.util.ConfUtils;
import com.google.gson.Gson;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.List;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.storm.Config;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.slf4j.LoggerFactory;

public abstract class SeleniumGridProtocol extends AbstractHttpProtocol {
protected static final org.slf4j.Logger LOG =
LoggerFactory.getLogger(SeleniumGridProtocol.class);
protected static LinkedBlockingQueue<Holder> driversQueue;
private NavigationFilters filters;
private final DesiredCapabilities capabilities = new DesiredCapabilities();
msghasan marked this conversation as resolved.
Show resolved Hide resolved

protected String gridAddress;

@Override
public void configure(Config conf) {
super.configure(conf);
driversQueue = new LinkedBlockingQueue<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is initalized on the same JVM multiple times, we will override the current driver queue. Is this field required to be static at all? I would assume, that the grid can deal with multiple clients?

If it needs to be static, we should prevent multiple initialization of it (but leads to the same thing, if we are working on multiple JVMs).

Copy link
Contributor Author

@msghasan msghasan Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rzo1 please suggest, shall I instantiate the blocking queue using singleton pattern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to be static ?

filters = NavigationFilters.fromConf(conf);
gridAddress =
ConfUtils.getString(conf, "selenium.grid.address", "http://localhost:4444/wd/hub");
}

protected synchronized List<Map<String, Object>> getAllNodesList() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can handle a potential exception from getStatusStream() inside this method instead of relying on an exception-based code flow? For example just return null or an empty list and deal with it on the caller side?`

Map<String, Object> valueMap = null;
boolean ready = false;
while (!ready) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering, if we should make this async and use a latch here to prevent active waiting? In any case, we might want to add some sort of wait here to prevent hammering the grid?

Map<String, Object> map = getStatusStream();
valueMap = (Map<String, Object>) map.get("value");
ready = (boolean) valueMap.get("ready");
if (!ready) {
LOG.warn("Selenium Grid is not ready yet");
msghasan marked this conversation as resolved.
Show resolved Hide resolved
msghasan marked this conversation as resolved.
Show resolved Hide resolved
}
}
LOG.info("Grid Is Ready to Serve");
msghasan marked this conversation as resolved.
Show resolved Hide resolved
return (List<Map<String, Object>>) valueMap.get("nodes");
}

private Map<String, Object> getStatusStream() throws IOException {
Gson gson = new Gson();
msghasan marked this conversation as resolved.
Show resolved Hide resolved
URL url = new URL(gridAddress + "/status");
InputStream stream = url.openStream();
msghasan marked this conversation as resolved.
Show resolved Hide resolved
Reader reader = new InputStreamReader(stream);
Map<String, Object> map = gson.fromJson(reader, Map.class);
stream.close();
return map;
}

protected int getSessionsCount(List<Map<String, Object>> nodes) {
int availableSessions = 0;
for (Map<String, Object> node : nodes) {
List<Map<String, Object>> slots = (List<Map<String, Object>>) node.get("slots");
for (Map<String, Object> slot : slots) {
if (slot.get("session") == null) {
availableSessions++;
}
}
}
return availableSessions;
}

public class Holder {
msghasan marked this conversation as resolved.
Show resolved Hide resolved
public RemoteWebDriver driver;
public Long time;

public void setDriver(RemoteWebDriver driver) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be removed as it is never called and driver is set via constructor.

Copy link
Contributor Author

@msghasan msghasan Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rzo1 Since it is just a pojo class can I move it to some other place, can you please suggest a package where should I move this file. I want to avoid using the new operator inside seleniumgridimpl and directly use the Holder.java as model

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to make a suggestions.

this.driver = driver;
}

public void setTime(Long time) {
this.time = time;
}

public RemoteWebDriver getDriver() {
return this.driver;
}

public Long getTime() {
return this.time;
}

public Holder(RemoteWebDriver driver, Long time) {
this.driver = driver;
this.time = time;
}
}

public ProtocolResponse getProtocolOutput(String url, Metadata metadata) throws Exception {
RemoteWebDriver driver;
while ((driver = getDriver()) == null) {}
try {
// This will block for the page load and any
// associated AJAX requests
driver.get(url);

String u = driver.getCurrentUrl();

// call the filters
ProtocolResponse response = filters.filter(driver, metadata);
if (response != null) {
return response;
}

// if the URL is different then we must have hit a redirection
if (!u.equalsIgnoreCase(url)) {
byte[] content = new byte[] {};
Metadata m = new Metadata();
m.addValue(HttpHeaders.LOCATION, u);
return new ProtocolResponse(content, 307, m);
}

// if no filters got triggered
byte[] content = driver.getPageSource().getBytes();
return new ProtocolResponse(content, 200, new Metadata());

} catch (Exception e) {
if (e.getMessage() != null) {
if ((e.getMessage().contains("ERR_NAME_NOT_RESOLVED")
|| e.getMessage().contains("ERR_CONNECTION_REFUSED")
|| e.getMessage().contains("ERR_CONNECTION_CLOSED")
|| e.getMessage().contains("ERR_SSL_PROTOCOL_ERROR")
|| e.getMessage().contains("ERR_CONNECTION_RESET")
|| e.getMessage().contains("ERR_SSL_VERSION_OR_CIPHER_MISMATCH")
|| e.getMessage().contains("ERR_ADDRESS_UNREACHABLE"))) {
LOG.info(
"Exception is of webpage related hence continuing with the driver and adding it back to queue");
msghasan marked this conversation as resolved.
Show resolved Hide resolved
} else {
LOG.error(
"Exception wile doing operation via driver url {} with driver hashcode {}"
msghasan marked this conversation as resolved.
Show resolved Hide resolved
+ "with excepiton {}",
url,
driver.hashCode(),
e);
closeConnectionGracefully(driver);
driver = null;
}
}
throw new Exception(e);
} finally {
// finished with this driver - return it to the queue
if (driver != null) driversQueue.put(new Holder(driver, System.currentTimeMillis()));
}
}

private final RemoteWebDriver getDriver() {
try {
return driversQueue.take().getDriver();
} catch (Exception e) {
return null;
}
}

private void closeConnectionGracefully(RemoteWebDriver driver) {
try {
LOG.info("Before disposing driver : {}", driver.hashCode());

if (driver.getSessionId() != null) {
msghasan marked this conversation as resolved.
Show resolved Hide resolved
if (driver.getSessionId().toString() != null) driver.quit();
}
} catch (Exception e) {
LOG.info("Error while closing driver", e);
msghasan marked this conversation as resolved.
Show resolved Hide resolved
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,9 @@ config:
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"

# Please define this for selenium grid configuration
selenium.grid.address: http://localhost:4444
msghasan marked this conversation as resolved.
Show resolved Hide resolved

# The maximum number of bytes for returned HTTP response bodies.
# The fetched page will be trimmed to 65KB in this case
# Set -1 to disable the limit.
Expand Down
1 change: 1 addition & 0 deletions external/urlfrontier/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
<plugins>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.2</version>
msghasan marked this conversation as resolved.
Show resolved Hide resolved
<executions>
<execution>
<id>default-test</id>
Expand Down