Scraping from Python results of GBFS-validator #165

Open
iaguerri opened this issue Jan 4, 2024 · 1 comment

iaguerri commented Jan 4, 2024

If you are new to the GBFS Validator, please introduce yourself (name and organization/link to GBFS). It’s helpful to know who we're chatting with!

I'm working on a MaaS application. I need to validate the GBFS feeds that public operators give me.

What is the issue and why is it an issue?

I'm trying to request the result of a validation from Python (https://gbfs-validator.mobilitydata.org/validator?url=https://gbfs.api.ridedott.com/public/v2/brussels/gbfs.json).
I have also tried from Postman.

The problem is that the response is 200 (OK), but the information cannot be extracted (even with scraping) because the body says "We're sorry but my-project doesn't work properly without Javascript enabled. Please enable to continue"

The code used:

import requests
from bs4 import BeautifulSoup
 
url_validator = "https://gbfs-validator.mobilitydata.org/validator"
 
# Test JSON files
json_main_full_brusels = "https://gbfs.api.ridedott.com/public/v2/brussels/gbfs.json"  # Valid JSON
json_main_nolastupdated_brusels = "https://github.com/Almanes/GtfsFiles/raw/main/pruebasBruselasNoLastUpdated.json"  # Invalid JSON (no last updated)
json_main_vehiclyType_nolastupdated = "https://github.com/Almanes/GtfsFiles/raw/main/pruebasBruselasVehiclyTypeCorrupted.json"  # Invalid JSON: VehicleTypes feed without lastUpdated
json_main_nofeed_systeminformation = "https://github.com/Almanes/GtfsFiles/raw/main/pruebasBruselasNoSysteminformationfeed.json"  # Invalid JSON: no SystemInformation feed
 
params = {
    "url": json_main_nolastupdated_brusels
}
 
url_completa = requests.Request('GET', url_validator, params=params).prepare().url
print("Request URL:", url_completa)
 

#APPROACH 1: access from the request
respuesta = requests.get(url_validator, params=params)

if respuesta.status_code == 200:
    datos_respuesta = respuesta.text
    print("Validator response:", datos_respuesta)
else:
    print("Request failed. Status code:", respuesta.status_code)
    print("Response content:", respuesta.text)


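#APPROACH 2: parse the response with BeautifulSoup (Selenium is not actually used here)
soup = BeautifulSoup(respuesta.content, 'html.parser')

# data-v-7c2075bd looks like a Vue scoped-style attribute rather than a CSS class,
# so match divs that carry the attribute instead of using class_
for div_element in soup.find_all('div', attrs={'data-v-7c2075bd': True}):
    # Extract the text content of the div element
    div_text = div_element.get_text(strip=True)

    # Print the extracted text
    print("Value of k is:", div_text)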
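
A note on why both approaches return the same message. The wording "my-project doesn't work properly without Javascript enabled" resembles the default `<noscript>` fallback that Vue CLI templates generate, which would mean the page content is rendered in the browser and `requests` only ever sees the empty shell. That is an assumption, not something confirmed here. A minimal sketch for checking whether a response is just such a shell, using only the standard library; the `sample` HTML, including the `<div id="app">` element, is made up for illustration:

```python
from html.parser import HTMLParser


class NoscriptTextParser(HTMLParser):
    """Collect the text found inside <noscript> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0          # > 0 while we are inside <noscript>
        self.noscript_text = []  # text fragments found inside <noscript>

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "noscript" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self.noscript_text.append(data.strip())


def noscript_fallback(html: str) -> str:
    """Return the <noscript> text of a page, or '' if there is none."""
    parser = NoscriptTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.noscript_text)


# Hypothetical page shell, written by hand to mimic a client-side rendered app.
sample = (
    "<html><body>"
    "<noscript><strong>We're sorry but my-project doesn't work properly "
    "without JavaScript enabled. Please enable it to continue.</strong></noscript>"
    '<div id="app"></div>'
    "</body></html>"
)

print(noscript_fallback(sample))
```

If `noscript_fallback(respuesta.text)` returns a non-empty string and the rest of the body has no report markup, parsing the HTML further is unlikely to help, since the report would only exist after JavaScript has run.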

Please describe some potential solutions you have considered (even if they aren’t related to GBFS).

I don't know why the HTML is not loaded afterwards, but maybe enabling JavaScript would make it possible to get this information.

Thanks!!

@davidgamez
Member

Hi @iaguerri, the GBFS Validator is currently deployed on Netlify. Judging by the error message you are getting, Netlify is detecting and blocking a bot consumer. You can search online for ways to avoid user-agent detection. However, if you want the validation report for specific feeds, I suggest using the undocumented, not-yet-stable API endpoint instead. Unfortunately, we do not offer a stable API endpoint yet. Issue #95 contains information on how to access the API. If you would like to follow the development of the stable API, follow issue #129.
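
For anyone landing here, a rough sketch of what calling such an endpoint from Python could look like. Everything specific in it is an assumption: `API_ENDPOINT` is a placeholder (the real address is described in #95 and is not reproduced here), and the request shape (a POST with a JSON body containing `url`, returning a JSON report) is a guess that should be checked against that issue. The function names are made up for this example.

```python
import json
import urllib.request

# Placeholder only: replace with the endpoint described in issue #95.
API_ENDPOINT = "https://example.invalid/validator-api"


def build_validation_request(feed_url, endpoint=API_ENDPOINT):
    """Build (but do not send) a POST request asking to validate feed_url.

    Assumes the endpoint accepts a JSON body of the form {"url": ...}.
    """
    payload = json.dumps({"url": feed_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_validation_report(feed_url, endpoint=API_ENDPOINT, timeout=30):
    """Send the request and decode the response, assuming it is JSON."""
    request = build_validation_request(feed_url, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_validation_request(
    "https://gbfs.api.ridedott.com/public/v2/brussels/gbfs.json"
)
print(request.get_method(), request.full_url)
```

Since the endpoint is not stable, it is probably worth keeping the URL and payload shape in one place, as above, so they are easy to update if the API changes.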
