Roadmap: More Enhancements in Development #9

Open
28 of 39 tasks
LeeThompson opened this issue May 19, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@LeeThompson
Contributor

LeeThompson commented May 19, 2023

Status:

June 23rd 2023
Haven't been able to do much work this week due to some unexpected household emergencies, should be back at it next week.

202306161401

  • Added a MIME database. Too many functions were doing different types of lookups, so I consolidated them into a database "object". Works pretty well so far.
  • Added a content buffer for internal processing. The goal is to prevent unnecessary reloading of data.
  • Added yet more switches, mostly for end users who want to see the HTTP warnings (4xx) and errors (5xx). They can be enabled in the ini or with --showhttpwarnings and --showhttperrors. (With the options enabled, they are output as TYPE_WARNING and TYPE_ERROR.)
  • Image size detection doesn't always work, even for a valid image, so until that gets resolved a specified minimum size is more of a 'goal'.
  • Added SVG detect to our own data check
  • Added new logging levels TYPE_OBJECTS, TYPE_TIMERS (full debug logging is now 1023)
  • convertRelativeToAbsolute now has one return path making for easier debugging.
  • Tightened up domain parsing and regex code, it should be a bit better dealing with subdomains.
  • Integrating MIME Database, checkIconAcceptance, and other new things to existing code. Then I'm going to do a battery of tests, after which I'm going to simplify/optimize and continue with adding the remaining new features (check local icon, etc).

202306121848:

  • Added new 'extensions' section to the ini, it's mostly for testing but could be used if something isn't working right. They are all simple boolean values (true or false), the list is: curl, exif, get, put, mbstring, fileinfo, mimetype, gd, imagemagick, gmagick, hrtime. If an extension is listed as true in this section but is not loaded or available, it will change to false. (Please note, GD, ImageMagick and gmagick are not currently used at all.)
  • Image identification fallbacks added to local file loading
  • Been testing/fixing up extension/function fallback code
  • Fixed an issue where the log could be initialized too soon and not honor some settings
  • Added --sites as an alternate to --list
  • Added raw datacheck for most common icon formats
  • Added a "confidence" level, not used other than logging yet
  • This isn't the "big update" yet, as I wanted to test some of the fallbacks before I started going down the bigger rabbit hole and that's probably going to continue this week.

Some notes on this:

Having our own image identification is important should the PHP installation be limited (for whatever reason) and going by file extension is still the last resort.

The method used for this is looking for the "signature" of the image file. Most image formats have a header with signature data to be used by software trying to open it (this is also called a "magic number"). The new code knows the PNG, GIF, JPEG, WEBP, BMP and ICO formats.

Some image formats are easier to identify than others. For example, the PNG format's "magic" is \x89PNG\r\n\x1A\n, which is pretty distinctive. BMP and ICO have very simple identifiers, so false positives are much more likely, which is why I've been adding a "certainty" rating (the "confidence" level mentioned above). Eventually you'll be able to set a minimum acceptable "certainty" and reject possibly invalid files. (You can currently set it but nothing looks at it.)

Here's some sample trace logging showing this in action:

2023-06-12 18:47:21 [TRACE] [grap_favicon(20):listIcons:getMIMETypeFromFile] pathname='icons/whatsapp.png', content_type=image/png, confidence=certain, method=signature
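
A minimal sketch of signature-based detection, assuming PHP 7.1 or later. The function name sniffImageType, the result keys, and the confidence labels other than "certain" are assumptions made for illustration. They are not taken from get-fav.php, whose getMIMETypeFromBinary may work differently.

```php
<?php
// Sketch only: sniffImageType() is a hypothetical helper, not the project's code.
// It looks at the first bytes of a file and returns a MIME type plus a rough
// confidence label, or null when no known signature matches.
function sniffImageType(string $bytes): ?array
{
    // WEBP lives in a RIFF container: "RIFF" at byte 0 and "WEBP" at byte 8.
    if (strncmp($bytes, 'RIFF', 4) === 0 && substr($bytes, 8, 4) === 'WEBP') {
        return ['content_type' => 'image/webp', 'confidence' => 'certain'];
    }

    // [signature, MIME type, confidence]; all of these start at byte 0.
    // Short signatures (BMP, ICO) get a weak label because false positives are likely.
    $signatures = [
        ["\x89PNG\r\n\x1A\n", 'image/png',    'certain'],
        ['GIF87a',            'image/gif',    'certain'],
        ['GIF89a',            'image/gif',    'certain'],
        ["\xFF\xD8\xFF",      'image/jpeg',   'likely'],
        ["\x00\x00\x01\x00",  'image/x-icon', 'weak'],
        ['BM',                'image/bmp',    'weak'],
    ];

    foreach ($signatures as [$magic, $mime, $confidence]) {
        if (strncmp($bytes, $magic, strlen($magic)) === 0) {
            return ['content_type' => $mime, 'confidence' => $confidence];
        }
    }

    return null;
}
```

Only the first dozen or so bytes are needed, so a caller could read just the start of a file rather than loading the whole icon.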

Ideally, if everything is available to get-fav.php the following methods are used, in order:

  1. The content-type returned by the server (remote only)
  2. FileInfo
  3. mime_content_type (local files only)
  4. exif_imagetype (and image_type_to_mime_type if available)
  5. getMIMETypeFromBinary (the new fallback function using "magic")
  6. file extension
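
The ordering above can be sketched as a list of detectors tried in turn. This is an illustration under assumptions, not the script's implementation: firstDetectedType and localDetectors are hypothetical names, only the local-file steps are shown (the server content-type and the signature check are left out), and PHP 7.1 or later is assumed.

```php
<?php
// Sketch only: both function names are hypothetical.
// Each detector takes a pathname and returns a MIME type string, or null if it cannot tell.
function firstDetectedType(array $detectors, string $pathname): ?array
{
    foreach ($detectors as $method => $detect) {
        $type = $detect($pathname);
        if (is_string($type) && $type !== '') {
            // Stop at the first detector that produced an answer.
            return ['content_type' => $type, 'method' => $method];
        }
    }
    return null;
}

// Builds the chain from whatever this PHP installation actually offers.
function localDetectors(): array
{
    $detectors = [];
    if (class_exists('finfo')) {
        $detectors['fileinfo'] = function ($path) {
            $type = (new finfo(FILEINFO_MIME_TYPE))->file($path);
            return $type === false ? null : $type;
        };
    }
    if (function_exists('mime_content_type')) {
        $detectors['mime_content_type'] = function ($path) {
            $type = mime_content_type($path);
            return $type === false ? null : $type;
        };
    }
    if (function_exists('exif_imagetype') && function_exists('image_type_to_mime_type')) {
        $detectors['exif'] = function ($path) {
            $imageType = exif_imagetype($path);
            return $imageType === false ? null : image_type_to_mime_type($imageType);
        };
    }
    // Last resort: trust the file extension.
    $detectors['extension'] = function ($path) {
        $map = ['png' => 'image/png', 'gif' => 'image/gif', 'jpg' => 'image/jpeg',
                'webp' => 'image/webp', 'bmp' => 'image/bmp', 'ico' => 'image/x-icon',
                'svg' => 'image/svg+xml'];
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        return $map[$extension] ?? null;
    };
    return $detectors;
}
```

Keeping the detectors in an ordered array makes it easy to log which method produced the answer, as the trace line above does with method=signature.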

202306071311:

  • Initial work for processing parameters in HTML mode created (completely untested)
  • Added --checklocal / --nochecklocal, --storeifnew (requires --checklocal and --store) ( Not implemented yet. )
  • Added --showconfig / --noshowconfig to show running configuration options
  • Added --showconfigonly (implies --showconfig), shows running configuration and exits.
  • Added --silent (console mode only) (turns off the console completely)
  • Near the top of the script there are two defines, ENABLE_SAME_FOLDER_INI and ENABLE_SAME_FOLDER_API_INI, which default to false. When set to true, get-fav.ini and get-fav-api.ini, respectively, are read and used automatically if they are in the same folder as get-fav.php. --configfile and --apiconfigfile, if specified, are applied afterwards.

It will likely be a few days before I do another git push as the next one is a big one:

  • Path write checking
  • Check local icons against criteria (if required, replacements will be downloaded; if the current icon is ok but there is a different icon online, if storeifnew is enabled it will be replaced)
  • Icons will also be tested for size criteria.
  • Blocklists will be applied.
  • Code will be put in place for storing local icons in sub-folders.
  • Some test HTTP mode variables will be parsed.
  • Documentation will be updated.

202306062230:

  • Refined HTTP Response Parsing (now includes general 'class' of response as part of the data)
  • PHP .ini values are now in defines so if something changes down the road it's easier to update
  • Added more parameter checking
  • Added major/minor to version
  • If cURL is disabled and file_get_contents is not available, check whether allow_url_fopen is disabled in php.ini and, if so, show an error message.
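
One way to picture the cURL and allow_url_fopen check is the sketch below. The function names and the 'curl' / 'file_get_contents' return labels are assumptions for illustration, and the script's own capability checks are more thorough than this.

```php
<?php
// Sketch only: chooseHttpLoader() and probeHttpLoader() are hypothetical names.
// Returns which loader to use, or null when no HTTP loading is possible
// (the case where an error message would be shown).
function chooseHttpLoader(bool $curlUsable, bool $urlFopenUsable): ?string
{
    if ($curlUsable) {
        return 'curl';
    }
    if ($urlFopenUsable) {
        return 'file_get_contents';
    }
    return null;
}

// Probes the running PHP installation and feeds the result to chooseHttpLoader().
function probeHttpLoader(bool $curlEnabledInConfig = true): ?string
{
    $curlUsable = $curlEnabledInConfig && function_exists('curl_init');
    // file_get_contents can only open URLs when allow_url_fopen is on in php.ini.
    $urlFopenUsable = function_exists('file_get_contents')
        && filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    return chooseHttpLoader($curlUsable, $urlFopenUsable);
}
```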

202306042323:

  • Bug fixing.
  • Added --apiconfigfile=PATHNAME to load API Definitions
  • Loading of 'same folder' API and config file can be controlled in the special runtime defines section. Default is OFF. (They can always be overridden with command line switch)
  • API: Updated favicongrabber's built-in definition
  • API: Added iconhorse to built-in definition
  • Added more to the capabilities structure
  • If exif is used content-type will be looked up using the image_type_to_mime_type function
  • Capability checking is more thorough and accurate. ("exif" requires "mbstring" etc)

202306021445:

  • Mostly "under the hood" work today. Mostly internal structures prepping for some of the features still being implemented.
  • Added a HTTP response parser for cleaner coding and better log messaging
  • Made some small changes for PHP 5.6.40 compatibility

202306011529:

  • Today one of the APIs was returning 502 errors which gave me the opportunity to add some error handling.
  • Rewrote the JSON parsing for APIs, this required a change to the .ini file for APIs but it should be more flexible (once all the bugs are fixed)
  • It will now go through more than one icon record (for API's that support it) and return the first that matches criteria (size, format, etc) (I need to do the same for the regex search.)
  • Added another switch pair --allowoctetstream / --disallowoctetstream. The default is false because, if the more accurate content-type detection is not available, most will return application/octet-stream. I may make the default true if mime_content_type and/or finfo_open are available. (.ini file is [global] allow_octet_stream=boolean )
  • If in debugMode (--debug and/or debug/trace/special logging) active settings will be shown.
  • minor changes for PHP 8.2 compatibility
  • This version has only been tested with PHP 8.2.6.

202305312016:

  • Mostly bug fixing and optimization.
  • Debug logging is at about 80% complete.
  • changed more internal structures, probably not done with that (mostly to accommodate new features)
  • added tenacious mode, which tries all enabled APIs until it gets a successful result (default is off)
  • added precision timers for internal use
  • it now warns that results may not be great if, due to the PHP configuration, some functions that identify formats are not available
  • you can now specify what icon types are acceptable (careful) (note: it is not wired in everywhere yet)

202305281757:

  • Added a 4th API (INI file only right now)
  • Rewrote API randomizer
  • Setting up proper debug logging which is about 20% complete
  • Unified output into the new logging function (automatically renders HTML if not in console mode). (It is possible now to have the script not output anything if you disable both file and console outputs.)
  • Added switch for icon size
  • Added switches for console output (timestamps, level, etc)
  • Debug/HTML mode icons should set the correct MIME type for display (not tested)
  • I know it's looking like a lot's been done, and it has, but very little has been tested. If you choose to try my branch out, please keep that in mind.

202305251719:

  • Debug logging added (not implemented much yet).
  • Greatly improved image detection, although it uses fileinfo, which may not be installed everywhere. It will fall back to exif etc.
  • Introduced HTTP Load buffering. If the load function gets a URL that it already loaded it will just return what it got last time. (can be disabled)
  • A lot of new "under the hood" functions, if you choose to play with it from my fork be very careful and please report bugs.
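
The HTTP load buffering idea can be sketched as a small memoizing wrapper. BufferedLoader is a hypothetical name, and the fetch callable stands in for whatever really performs the request. The actual load function in the fork may be structured quite differently.

```php
<?php
// Sketch only: BufferedLoader is a hypothetical class, not part of get-fav.php.
// $fetch stands in for the real HTTP request (cURL, file_get_contents, ...).
final class BufferedLoader
{
    private $fetch;
    private $enabled;
    private $buffer = [];

    public function __construct(callable $fetch, bool $enabled = true)
    {
        $this->fetch = $fetch;
        $this->enabled = $enabled;
    }

    public function load(string $url)
    {
        if ($this->enabled && array_key_exists($url, $this->buffer)) {
            // Already loaded once: return what we got last time.
            return $this->buffer[$url];
        }
        $content = call_user_func($this->fetch, $url);
        if ($this->enabled) {
            $this->buffer[$url] = $content;
        }
        return $content;
    }
}
```

Passing false as the second argument corresponds to turning buffering off, in which case every call goes back to the network.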

202305242106:

  • APIs can be read in from an INI file (get-fav-api.ini)

202305241803:

  • Added remove TLD support (needs a lot of testing)
  • Made load function allow recursion for redirects (needs a lot of testing)
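
A deliberately naive sketch of TLD removal, so that a name like microsoft.com would be stored as microsoft. The function name removeTld is hypothetical, and the sketch simply drops the last dot-separated label, so multi-part suffixes such as co.uk are not handled. The real option may behave differently.

```php
<?php
// Sketch only: removeTld() is a hypothetical helper.
// Drops the final dot-separated label from a host name; single-label hosts are left alone.
function removeTld(string $host): string
{
    $labels = explode('.', $host);
    if (count($labels) < 2) {
        return $host;
    }
    array_pop($labels);
    return implode('.', $labels);
}
```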

202305241420:

  • Bugfix. Now setting timeout for PHP level HTTP and socket operations. (exif_imagetype Issues #13)
  • Bugfix. Now keeps specified protocol active (http, https) (cURL Not Obeying Timeout #12)
  • Preliminary support to keep port and user/password information (not hooked up yet)
  • Added a new direct try, it takes the url and adds favicon.ico to it and sees if it gets anything then falls back to previous behavior.
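
The direct try can be sketched as below, assuming it targets favicon.ico at the site root. directFaviconUrl is a hypothetical name. The sketch keeps the scheme and port as described above but leaves out user/password handling.

```php
<?php
// Sketch only: directFaviconUrl() is a hypothetical helper.
// Builds "<scheme>://<host>[:port]/favicon.ico" from any URL on the site,
// or returns null when the input has no scheme or host.
function directFaviconUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }
    $authority = $parts['host'];
    if (isset($parts['port'])) {
        $authority .= ':' . $parts['port'];
    }
    // Keep the protocol the caller specified (http or https).
    return $parts['scheme'] . '://' . $authority . '/favicon.ico';
}
```

If the request for that URL returns nothing usable, the caller would fall back to the previous behavior.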

202305221634:

  • Reads config files, command line switches will always override any ini setting. (It is using parse_ini_file with INI_SCANNER_RAW, does array_replace_recursive with the existing configuration structure and finally validates boolean/numeric (with range checks).)
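
The parse, merge and validate sequence described above might look roughly like this. It is a sketch under assumptions: loadConfig is a hypothetical name, parse_ini_string is used instead of parse_ini_file so the example needs no file on disk, the 1 to 600 range is invented for illustration, and command line overrides are not shown. PHP 7 or later is assumed.

```php
<?php
// Sketch only: loadConfig() is a hypothetical helper.
// $defaults is a [section => [key => value]] array; its value types drive validation.
function loadConfig(array $defaults, string $iniText): array
{
    // INI_SCANNER_RAW leaves values as strings ("true", "60", ...).
    $parsed = parse_ini_string($iniText, true, INI_SCANNER_RAW);
    if ($parsed === false) {
        return $defaults;
    }

    // Values from the ini replace the defaults, section by section.
    $config = array_replace_recursive($defaults, $parsed);

    foreach ($defaults as $section => $options) {
        foreach ($options as $key => $default) {
            $value = $config[$section][$key];
            if (is_bool($default)) {
                $checked = is_bool($value)
                    ? $value
                    : filter_var($value, FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
                // Invalid booleans fall back to the default.
                $config[$section][$key] = $checked ?? $default;
            } elseif (is_int($default)) {
                // The range here is an assumption made for this sketch.
                $checked = filter_var($value, FILTER_VALIDATE_INT,
                    ['options' => ['min_range' => 1, 'max_range' => 600]]);
                $config[$section][$key] = ($checked === false) ? $default : $checked;
            }
        }
    }
    return $config;
}
```

Because validation is driven by the type of each default, adding a new boolean or numeric option only requires adding it to the defaults array.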

202305231619:

  • Path and other settings are validated
  • Settings are checked against capabilities
  • Updated --help
  • Help menu now shows actual defaults from the defines
  • Help menu now shows available APIs (* by ones that are disabled)
  • Updated copyright notice (year changed to 2019-2023)
  • Individual APIs can be enabled/disabled

Stuff being worked on:

(I'm keeping my github fork up to date as I work on stuff, assuming it's not throwing horrible errors.)

  • New --checkicon --checklocal option will check the icon in the local path first and check online only if missing or otherwise invalid (size, type, blocklist). (in progress)
  • The main design of the script seems to be as a server side script so I plan to add options for it (passed in via query string or form, default will be disabled for security reasons) (Web Mode (Not-Console/CLI) #11) (in progress)
  • Icon validation where it can be checked with generic fallback icons (via md5 hash comparisons in a 'blocklist') (in progress)
  • Updating README.MD to reflect command line switches etc. (in progress)
  • Document functions and config file format (ini file). (in progress)
  • MD5 fragment sub-folder option (Split Download Folder Into Several Sub-Dirs If There Are Lot of Icons #14)
  • Configuration file support (command line switches will still override the config)
  • Added configuration.md for detailed help on options.
  • Redoing configuration throughout the code (to better handle config file overrides) (it's more of an array structure)
  • Add a configuration validation check for paths
  • Moved defaults & constants to defines for easier maintenance.
  • Improved error handling
  • Add code to enable/disable individual apis by name (e.g. --disableapis=google,faviconkit)
  • Option to strip the TLD domain from the filename (e.g. microsoft.com.ico becomes microsoft.ico)
  • Investigate defining APIs in the ini file.
  • Adding more comments to code
  • Log file support with timestamp and append options (mostly for debugging purposes)
  • Final configuration validation check should include capabilities, so if you force enable curl but your PHP doesn't have it, it should use the fallback.
  • Added --version (aka v and ver)
  • Added a version as a define
  • Some bug fixing
  • More command line switches for troubleshooting and for specific situations allowing control over connection, http and dns timeouts.
  • Changed $debug to a bool
  • cURL path now handles http->https redirects.
  • PHP's user agent is now set as well as cURL's (not permanently) if --user-agent is passed in. (Issue With Website Security Checks #7)
  • Allow manual disabling of curl.
  • New structure for APIs (will allow adding APIs in the future). (NOTE: it does not currently fall back if the randomly selected one fails)
  • Ability to enable/disable individual API methods
  • Unifying message output/debug messages (function writeOutput)
  • Update command line help.
  • API definitions should allow for apikey (untested)
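
For the MD5 fragment sub-folder option listed above, one possible layout is sketched here. Everything about it is an assumption: the name shardedIconPath, hashing the filename rather than the icon data, and using the first two hex characters as the folder name.

```php
<?php
// Sketch only: shardedIconPath() is a hypothetical helper.
// Spreads icons over sub-folders named after the first characters of an MD5 hash,
// so a single directory does not end up holding every icon.
function shardedIconPath(string $baseDir, string $filename, int $fragmentLength = 2): string
{
    $fragment = substr(md5($filename), 0, $fragmentLength);
    return rtrim($baseDir, '/') . '/' . $fragment . '/' . $filename;
}
```

With two hex characters there are at most 256 sub-folders. A caller would still need to create the sub-folder before writing the icon.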

Issues:

  • New API system allows for more APIs but currently doesn't allow fallbacks
  • --help output takes more than one standard console screen (| more or | clip need to be used)
  • exif_imagetype fails on some sites for some reason, probably because fopen isn't doing something it likes. May add a 'temporary' download of the potential icon file for analysis instead of a direct open. (exif_imagetype Issues #13) (Partial fix, should be used less.)

Before pull request:

  • Lots of testing
  • HTML mode testing
  • Regression testing with PHP 5, PHP 7 and PHP 8
  • Bug fixes

Other Tasks:

  • "How to use" will need to be updated.

Notes:

  • Most of the internal structure has changed. There are now functions to set (and validate) and get configuration data.
  • The main function now just needs a url, it gets the configuration data when it starts.
  • This will make reading an ini config file and applying it much easier which will be the next step.
  • Almost all constants are now in a define block at the top.
  • The "how to use" notes will need to be updated.
  • I am now testing with PHP 5.6.4, 7.4.33, 8.1.19 and 8.2.6.
@LeeThompson
Contributor Author

LeeThompson commented May 19, 2023

--help output as of 202306011529

Usage: get-fav.php (Switches)

Available APIs: faviconkit, favicongrabber, google, iconhorse (get-fav-api.ini)
Lists can be separated with space, comma or semi-colon.

--configfile=FILE           Pathname to read for configuration.
--list=FILE/LIST            Pathname or a delimited list of URLs to check.
--blocklist=FILE/LIST       Pathname or a delimited list of MD5 hashes to block.
--validtypes=FILE/LIST      Valid icon types (default is gif,webp,png,ico,bmp,svg,jpg)
--logfile=FILE              Pathname for log file (default is get-fav.log)
--path=PATH                 Location to store icons (default is ./)
--size=NUMBER               Try to get icon size (default is 16)

--tryhomepage               Try homepage first, then APIs. (default is true)
--onlyuseapis               Only use APIs.
--disableapis               Don't use APIs.
--enableblocklist           Enable blocklist. (default is true)
--disableblocklist          Disable blocklist.
--store                     Store favicons locally. (default is true)
--nostore                   Do not store favicons locally.
--overwrite                 Overwrite local favicons. (default is false)
--skip                      Skip local favicons.
--removetld                 Remove top level domain from filename. (default is false)
--noremovetld               Don't remove top level domain from filename.
--tenacious                 Try all enabled APIs until success. (default is false)
--notenacious               Try a random API.
--allowoctetstream          Allow MimeType 'application/octet-stream'. (default is false)
--disallowoctetstream       Block MimeType 'application/octet-stream' for icons.
--consolemode               Force console output.
--noconsolemode             Force HTML output.
--debug                     Enable debug mode.
--help                      This listing and exit.
--version                   Show version and exit.

Advanced:
--user-agent=AGENT_STRING   Customize the user agent.
--nocurl                    Disable cURL.
--bufferhttp                Buffer HTTP page loading. (default is true)
--nobufferhttp              Disable HTTP page load buffering.
--curl-verbose              Enable cURL verbose.
--curl-progress             Enable cURL progress bar.
--enableapis=FILE/LIST      Filename or a delimited list of APIs to enable.
--disableapis=FILE/LIST     Filename or a delimited list of APIs to disable.
--http-timeout=SECONDS      Set HTTP timeout. (default is 60).
--connect-timeout=SECONDS   Set HTTP connect timeout. (default is 30).
--dns-timeout=SECONDS       Set DNS lookup timeout. (default is 120).

Logging:
--log                       Enable debug logging. (default is false)
--nolog                     Disable debug logging.
--append                    Append debug log. (default is true)
--noappend                  Always overwrite debug log.
--timestamp                 Enable debug log timestamps. (default is true)
--notimestamp               Do not show timestamps in debug log.
--loglevel=NUMBER           Set debug logging level. (default is 255)

Console:
--level=NUMBER              Set debug logging level. (default is 31)
--showtimestamp             Enable debug log timestamps. (default is false)
--hidetimestamp             Do not show timestamps in debug log.

Notes:

  • Blocklists are not yet implemented.

@gaffling added the enhancement (New feature or request) label May 22, 2023
@gaffling added this to the WORKING ON LIST milestone May 22, 2023
@LeeThompson
Contributor Author

LeeThompson commented May 22, 2023

Configuration Files

Configuration files use the INI file format. Each value is optional. Comments are supported (lines starting with ;). Complex strings need to be quoted (see the useragent entry below).

[files]
overwrite=true
store=true
local_path=./

[http]
try_homepage=true
http_timeout=60
useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0"

[curl]
enabled=true

[global]
debug=true

@LeeThompson
Contributor Author

LeeThompson commented May 22, 2023

Notes on the blocklist concept:

This is already done in get-fav with the google API and the default icon. This simply allows a list of md5 hashes of other icons for the program to ignore.
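
The blocklist check amounts to hashing the downloaded icon and looking the hash up in a list. A minimal sketch follows, with isBlockedIcon as a hypothetical name. The matching is made case-insensitive here, which is an assumption.

```php
<?php
// Sketch only: isBlockedIcon() is a hypothetical helper.
// Returns true when the MD5 hash of the icon data appears in the blocklist.
function isBlockedIcon(string $iconBytes, array $blockedHashes): bool
{
    // Normalize entries so stray whitespace or upper-case hex does not matter.
    $normalized = array_map('strtolower', array_map('trim', $blockedHashes));
    return in_array(md5($iconBytes), $normalized, true);
}
```

An icon that matches would be treated as a generic placeholder and ignored, so the search continues with the next source.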

@LeeThompson
Contributor Author

LeeThompson commented May 25, 2023

get-fav-api.ini format

  • ; can be used for comments
  • Each section is a different API definition
  • <DOMAIN>, <APIKEY> and <SIZE> will be substituted at runtime
  • The section text and name must match exactly

| Field | Description |
| --- | --- |
| name | ID of the definition (used for enable/disable) |
| display | Cosmetic Display Name (defaults to name) |
| url | API URL (if it contains = and certain other characters it needs to be quoted) |
| json | Does API return json format? |
| apikey | Does the API require a key? (not tested) |
| enabled | Is this definition enabled? |
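
The runtime substitution of <DOMAIN>, <SIZE> and <APIKEY> could be done along these lines. buildApiUrl is a hypothetical name, and URL-encoding the substituted values is an assumption of this sketch.

```php
<?php
// Sketch only: buildApiUrl() is a hypothetical helper.
// Replaces the placeholders of an API definition's url field with real values.
function buildApiUrl(string $template, string $domain, int $size, string $apiKey = ''): string
{
    return strtr($template, [
        '<DOMAIN>' => rawurlencode($domain),
        '<SIZE>'   => (string) $size,
        '<APIKEY>' => rawurlencode($apiKey),
    ]);
}
```

strtr replaces all placeholders in one pass, so a value that happens to contain placeholder text is not substituted a second time.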

If a json structure is used, it is defined with json_structure[field] = "item" entries in the section, for example:

json_structure[icons] = "icons"
json_structure[link] = "src"
json_structure[sizeWxH] = "sizes"
json_structure[mime] = "type"
json_structure[error] = "error"

Supported Fields are (so far):

  • icons
  • link
  • size
  • sizeWxH
  • mime
  • error
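
To show how such a mapping might be applied, here is a sketch that pulls icon records out of a JSON response using a json_structure array. The name extractIcons and the shape of the returned records are assumptions, and the real parser supports more than this (the size field, for example, is not handled). PHP 7 or later is assumed.

```php
<?php
// Sketch only: extractIcons() is a hypothetical helper.
// $structure maps the supported field names to the keys used by a particular API.
function extractIcons(string $json, array $structure): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        return [];
    }
    // If the API reported an error under its error key, there is nothing to extract.
    if (isset($structure['error'], $data[$structure['error']])) {
        return [];
    }

    $records = $data[$structure['icons'] ?? 'icons'] ?? [];
    if (!is_array($records)) {
        return [];
    }

    $icons = [];
    foreach ($records as $record) {
        $linkKey = $structure['link'] ?? '';
        if (!is_array($record) || !isset($record[$linkKey])) {
            continue; // A record without a link is of no use.
        }
        $icons[] = [
            'link'    => $record[$linkKey],
            'sizeWxH' => $record[$structure['sizeWxH'] ?? ''] ?? null,
            'mime'    => $record[$structure['mime'] ?? ''] ?? null,
        ];
    }
    return $icons;
}
```

The caller can then walk the returned records and keep the first one that meets the size and format criteria.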

Sample:

;
; PHP-Grab-Favicon
; APIs
;

[faviconkit]
display=FavIconKit
name=faviconkit
url=https://api.faviconkit.com/<DOMAIN>/<SIZE>
json=false
enabled=true

[favicongrabber]
display=FavIconGrabber
name=favicongrabber
url=http://favicongrabber.com/api/grab/<DOMAIN>
json=true
enabled=true
json_structure[icons] = "icons"
json_structure[link] = "src"
json_structure[sizeWxH] = "sizes"
json_structure[mime] = "type"
json_structure[error] = "error"

[google]
display=Google
name=google
url="http://www.google.com/s2/favicons?domain=<DOMAIN>&sz=<SIZE>"
json=false
enabled=true

[iconhorse]
display=Icon Horse
name=iconhorse
url=https://icon.horse/icon/<DOMAIN>
json=false
enabled=true

@LeeThompson
Contributor Author

LeeThompson commented May 26, 2023

Debug Log File Information

| Define | Value | Description |
| --- | --- | --- |
| TYPE_ALL | 1 | Should always be output |
| TYPE_NOTICE | 2 | Important information |
| TYPE_WARNING | 4 | Potential issue |
| TYPE_VERBOSE | 8 | Extra information |
| TYPE_ERROR | 16 | Something has gone wrong |
| TYPE_DEBUGGING | 32 | Debug message, usually tops of functions |
| TYPE_TRACE | 64 | Extra debug messaging, usually sub/helper functions |
| TYPE_SPECIAL | 128 | Special debug messaging, usually sub/helper functions |

The "shipping" default is 31, which is everything except the debug, trace and special levels.
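
Since the values are powers of two, a level is a bitmask and a message is shown when its type bit is set in the mask. A small sketch, using the define names and values listed above; the function name shouldLog is hypothetical.

```php
<?php
// Values as listed for the log types; shouldLog() itself is a hypothetical helper.
define('TYPE_ALL', 1);
define('TYPE_NOTICE', 2);
define('TYPE_WARNING', 4);
define('TYPE_VERBOSE', 8);
define('TYPE_ERROR', 16);
define('TYPE_DEBUGGING', 32);
define('TYPE_TRACE', 64);
define('TYPE_SPECIAL', 128);

// A message is written when its type bit is present in the configured level mask.
function shouldLog(int $levelMask, int $messageType): bool
{
    return ($levelMask & $messageType) !== 0;
}
```

For example, 31 is 1 + 2 + 4 + 8 + 16, so errors pass the check while debugging and trace messages do not, and 255 turns on all eight types.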

The timestamp, by default uses Y-m-d H:i:s which looks like 2023-05-25 17:27:39. There isn't a switch to change it but it can be changed in the .ini file:

The default log separator used if it is appending to an existing log file is 80 *'s. This cannot be changed via a switch but can also be changed in the .ini file.

[logging]
timestampformat="Y-m-d H:i:s"
separator=(whatever)

Switches:

Files:

| Switch | Description |
| --- | --- |
| --loglevel=NUMBER | Log level to use, for everything generally you want 255 |
| --logfile=FILE | Pathname for log file (default is get-fav.log) |
| --log / --nolog | Enable/Disable Log File |
| --append / --noappend | Enable/Disable Appending the Log File |
| --timestamp / --notimestamp | Use Timestamps in Log File or Not |

Console:

| Switch | Description |
| --- | --- |
| --level=NUMBER | Log level to use, for everything generally you want 255 |
| --showtimestamp / --hidetimestamp | Use Timestamps on Console |

Configuration Options:

[logging]
enabled=true/false
append=true/false
level=value
pathname=filename or full path
separator=separator to use when appending
timestamp=true/false
timestampformat="Y-m-d H:i:s"

[console]
enabled=true/false
level=value
timestamp=true/false
timestampformat="Y-m-d H:i:s"

Notes:

  • Timestamps depend on PHP's date.timezone being set correctly in the php.ini file.

@LeeThompson
Contributor Author

LeeThompson commented May 30, 2023

Proposed Web Variables

  • Variables can be passed in via a form or query string.
  • Options that make no sense in an HTTP context will not be implemented.
  • These will only be used if define('ENABLE_WEB_INPUT', true); in the script which is not the default for security reasons.

| Variable | Internal/INI File | Switch | Comments |
| --- | --- | --- | --- |
| GETFAVDEBUG | debug | --debug | Enables special debug mode |
