Update of Repo from Python2 to Python3 & Implementation of metadata use via CSV file #12

Merged
merged 100 commits into from
Sep 9, 2022

Conversation

robertyoung2
Contributor

Substantial update of the sidewalk-panorama-tools repository. High-level overview and more detailed breakdown will be given below.

1.0 What Has Changed - Two underlying core changes:

  1. Repository upgraded from Python2 to Python3.
  2. Metadata for Google Street View panorama images is now read from a CSV file rather than downloaded directly from the Google Street View data endpoint.

2.0 Why These Changes

2.1 Python2 to Python3

Python2 reached end of life on 01/01/2020 and no longer receives support or improvements, so an upgrade to Python3 was advisable. In addition, many of the modules this repository depends on no longer support Python2, so the upgrade was also required for this reason.

2.2 Data Endpoint Download to CSV File With Metadata

Google removed the data endpoint that allowed the download of image metadata, including the depth data, all of which was required for the cropping tool. A workaround for this is to replace direct metadata download with a CSV file which contains all the required information. For further information on this, see Depth data endpoint removed, need to get metadata elsewhere #8 .
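As a rough illustration of the CSV-based workaround, the sketch below loads panorama metadata into a dictionary keyed by pano id. The column names (`gsv_panorama_id`, `image_width`, `image_height`) and the helper `load_pano_metadata()` are assumptions for the example; the CSV that Project Sidewalk publishes may use a different layout.

```python
import csv
import io

# Hypothetical columns; the real metadata CSV may differ.
SAMPLE_CSV = """gsv_panorama_id,image_width,image_height
example-id-1,16384,8192
example-id-2,13312,6656
"""


def load_pano_metadata(csv_file):
    """Return a dict mapping pano id -> image dimensions (as ints)."""
    metadata = {}
    for row in csv.DictReader(csv_file):
        metadata[row["gsv_panorama_id"]] = {
            "image_width": int(row["image_width"]),
            "image_height": int(row["image_height"]),
        }
    return metadata


# An in-memory file stands in for open("path/to/metadata.csv", newline="").
pano_metadata = load_pano_metadata(io.StringIO(SAMPLE_CSV))
print(pano_metadata["example-id-1"])
```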

3.0 How & Where These Changes Were Implemented

3.1 DownloadRunner.py

3.1.1 Updates to Code to Cover Section 2.2

  • Script no longer uses or attempts to download any of the Google Street View XML metadata.
  • All required metadata (pano ids) are loaded via CSV file versus the Project Sidewalk Database.
  • Unused obsolete/no longer functioning code blocks have either been updated, or commented out. Any code that is commented out with statements such as "# no longer used" requires the Project Sidewalk group to decide if they wish to keep it, or delete it. These code blocks are mostly based upon the XML data, including the depth maps.

3.1.2 Additional New Features

  • Now uses asyncio so threads for I/O tasks can be used (large speed up). This has resulted in a number of new functions:
    • request_session() - sets up requests session for duration of script.
    • get_response() - used to grab responses when doing initial image checks, such as zoom level, if link is broken, or if returned image is black.
    • generate_gsv_urls() - generates a batch of all the urls required to create a single panorama image.
    • download_single_gsv() - uses asyncio to download a single given tile from a GSV image.
    • download_all_gsv_images() - creates the worker tasks to download all the images for a given GSV pano id.
    • new_random_delay() - returns a random delay value to be used if proxies are not in use (avoid hammering the server).
    • random_header() - generates random browser headers to be used for requests.
  • progress_check() - checks a new csv to ascertain if this pano id has been visited before and if download was a success or failure. Provides speed up by avoiding trying to download pano ids with no data endpoint.
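
The general pattern behind the asyncio change (blocking I/O calls handed to worker threads and awaited together) can be sketched as follows. This is not the code in DownloadRunner.py: `fetch_tile()`, `generate_tile_urls()`, the placeholder host and `THREAD_COUNT` are all hypothetical, and the network call is simulated with a short sleep.

```python
import asyncio
import random
import time
from concurrent.futures import ThreadPoolExecutor

THREAD_COUNT = 4  # assumed to come from config.py in the real script


def fetch_tile(url):
    """Stand-in for a blocking HTTP GET (e.g. via a requests session)."""
    time.sleep(random.uniform(0.01, 0.03))  # simulate network latency
    return url, b"tile-bytes"


def generate_tile_urls(pano_id, cols, rows):
    """Hypothetical URL scheme on a placeholder host, not the real GSV endpoint."""
    return [
        f"https://tiles.example.invalid/{pano_id}?x={x}&y={y}"
        for y in range(rows)
        for x in range(cols)
    ]


async def download_all_tiles(pano_id, cols, rows):
    """Run the blocking fetches in worker threads and gather the results in order."""
    loop = asyncio.get_running_loop()
    urls = generate_tile_urls(pano_id, cols, rows)
    with ThreadPoolExecutor(max_workers=THREAD_COUNT) as pool:
        tasks = [loop.run_in_executor(pool, fetch_tile, url) for url in urls]
        return await asyncio.gather(*tasks)


tiles = asyncio.run(download_all_tiles("example-id", cols=4, rows=2))
print(f"downloaded {len(tiles)} tiles")
```

Because `asyncio.gather()` returns results in the order the tasks were submitted, the tiles can be stitched by position without re-sorting.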

3.1.3 Unused/Obsolete Code Requiring Review by Project Sidewalk - Refactor Task

Below are core functions that were left in the script, but were either unused, or commented out. I have not deleted these in case the group wishes to keep them, so the ultimate refactor decision rests with The Project Sidewalk Group.

3.2 CropRunner.py

3.2.1 Updates to Code to Cover Section 2.2

  • No longer make any calls to access XML data. All information is retrieved from the provided csv with metadata.
  • Simplified predict_crop_size() to the core required calculations.
  • Simplified make_single_crop() to the core required calculations.

3.2.2 Additional New Features

No new or additional features were added for this script.

3.2.3 Unused/Obsolete Code Requiring Review by Project Sidewalk - Refactor Task

Below are core functions that were left in the script, but were either unused, or commented out. I have not deleted these in case the group wishes to keep them, so the ultimate refactor decision rests with The Project Sidewalk Group.

3.3 config.py

A new configuration file. It allows the user to set how many threads asyncio uses for I/O tasks and to set proxies if required. It also provides the browser headers passed with the requests.
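
For orientation, a configuration file of this kind might look like the sketch below. The variable names (`thread_count`, `proxies`, `headers_list`), the values and the `random_header()` helper body are illustrative assumptions, not the contents of the actual config.py.

```python
import random

# Hypothetical config layout; names and values are illustrative only.

# Number of worker threads used for I/O-bound download tasks.
thread_count = 8

# Optional proxies, e.g. {"http": "http://host:port", "https": "http://host:port"}.
# An empty dict means requests are sent directly.
proxies = {}

# Browser headers to send with requests; one entry is picked at random.
headers_list = [
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0"},
]


def random_header():
    """Return one of the configured header dicts, chosen at random."""
    return random.choice(headers_list)


print(thread_count, random_header()["User-Agent"])
```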

3.4 README.md

Updated to reflect the changes made to the code base and its operation. There are some placeholders that the Project Sidewalk Group will need to update based on the csv file with metadata they provide to the public, and where it will be accessed from. References in README are:

3.5 Legacy Code/Scripts

The scripts below were also updated, but only to make them Python3 compatible. No new features were added, and the functionality remains the same:

4.0 Testing

  • No test scripts are in the repository, so no prior existing tests were run.
  • No new tests were written.
  • User testing was carried out on DownloadRunner.py and CropRunner.py and these both executed successfully and as expected.
  • No tests or changes were made to the function crop_box_helper() contained in CropRunner.py - this more than likely does not function in its current state.

5.0 Comments

I have provided a sample csv with the required metadata in the repo - sidewalk-panorama-tools/metadata/sample_csv-metadata-seattle.csv . Project Sidewalk group will need to decide what metadata they want to make publicly available with the repo, and where the full file should be stored to enable correct functionality of the codebase.

Switched the httplib (Python2) library to http.client (Python3). Updated all references in DownloadRunner.py to http.client.
Switched import statement from cStringIO (module in Python2) to io.StringIO (Python3 upgrade). All references to cStringIO.StringIO have been replaced with StringIO.
Remove 'urllib2' import statement.
Modified urllib import to contain 'requests' module.
All instances of 'urllib.urlopen' and 'urllib2.urlopen' replaced with 'request.urlopen'
Python3 replaces Python2's 'cStringIO.StringIO' with 'io.BytesIO'. Corrected this mistake now.
download_single_metadata_xml()
download_single_pano()
Added docstrings for the following functions:
* generate_gsv_urls()
* download_single_gsv()
* download_all_gsv_images()
Added comments to lines that are commented out in the code.
These are code blocks that were previously used for the XML metadata
for Google Street View. As this is no longer accessible, we now rely
on the input csv file which contains the metadata and do not need these
lines. They have been kept in for the Project Sidewalk team to decide
if they should be maintained or removed.
Updated print statements for logging to have a consistent format
across the three possible image size settings. Added in the check
to see if the pano id had been visited previously.
Three changes:
* Updated 'warn' (deprecated) to 'warning'.
* Made sample file paths that match DownloadRunner.py.
* Set the marking of the image centre with a red dot to False in run statement.
Function was not used in code, deleted.
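
The Python2 to Python3 import changes mentioned in the commit notes above can be illustrated with a small standalone sketch. It uses only the standard library; the byte string is a placeholder rather than a real image, and nothing here is copied from DownloadRunner.py.

```python
import io
import http.client  # Python 3 name for Python 2's httplib
from urllib import request  # Python 3 home of urlopen (was urllib/urllib2)

# Binary payloads (such as downloaded image tiles) need a bytes buffer.
tile_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a real image
buffer = io.BytesIO(tile_bytes)
round_trip = buffer.read()

# io.StringIO only accepts str, so passing bytes raises TypeError.
try:
    io.StringIO(tile_bytes)
    string_io_accepts_bytes = True
except TypeError:
    string_io_accepts_bytes = False

print(round_trip == tile_bytes, string_io_accepts_bytes)
```

This is why a buffer holding image data generally has to be `io.BytesIO` in Python3, even where the Python2 code used `cStringIO.StringIO`.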
@misaugstad
Member

@ThatOneGoat

> Script no longer uses or attempts to download any of the Google Street View XML metadata

Can you reverse this bit? We are still able to download the XML and depth data right now, so I'd like to keep trying to do this as long as we can.

> Unused obsolete/no longer functioning code blocks have either been updated, or commented out. Any code that is commented out with statements such as "# no longer used" requires the Project Sidewalk group to decide if they wish to keep it, or delete it. These code blocks are mostly based upon the XML data, including the depth maps

Similarly, some of the deleted code is probably for the XML/depth data. We should make sure to retain anything related to those. Feel free to ask me about specific functions as you go along.

@misaugstad
Member

@ThatOneGoat here are the changes I've made to the API:

  1. I've renamed the endpoint from /adminapi/labels/panoid to /adminapi/panos because that's a more accurate name. Both endpoints point to the same thing for now. Once this code is merged with the new endpoint name, I can remove the old one.
  2. I've added image_width and image_height into the API call.
  3. I've simplified the formatting of the API output quite a bit, so you'll need to make some changes to how we read said output.

Here's a full before/after for the output of the API.
Before:

{
    "type": "FeatureCollection",
    "features": [
        {
            "properties": {
                "gsv_panorama_id": "example-id"
            }
        },
        ...
    ]
}

After:

[
    {
        "gsv_panorama_id": "example-id",
        "image_width": 16384,
        "image_height": 8192
    },
    ...
]
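
To make the required reader change concrete, here is a minimal sketch of a parser that accepts either shape. The function name `extract_panos()` is hypothetical, and filling the missing dimensions with `None` for the old format is an assumption made for the example.

```python
import json

OLD_OUTPUT = json.loads("""
{"type": "FeatureCollection",
 "features": [{"properties": {"gsv_panorama_id": "example-id"}}]}
""")

NEW_OUTPUT = json.loads("""
[{"gsv_panorama_id": "example-id", "image_width": 16384, "image_height": 8192}]
""")


def extract_panos(payload):
    """Normalise either API output shape into a list of flat dicts."""
    if isinstance(payload, dict):
        # Old format: ids nested under features[].properties, no dimensions.
        return [
            {
                "gsv_panorama_id": feature["properties"]["gsv_panorama_id"],
                "image_width": None,
                "image_height": None,
            }
            for feature in payload["features"]
        ]
    # New format: already a flat list of objects.
    return [
        {
            "gsv_panorama_id": entry["gsv_panorama_id"],
            "image_width": entry["image_width"],
            "image_height": entry["image_height"],
        }
        for entry in payload
    ]


print(extract_panos(OLD_OUTPUT))
print(extract_panos(NEW_OUTPUT))
```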

@misaugstad misaugstad merged commit 6b77032 into ProjectSidewalk:master Sep 9, 2022