- newer versions of py modules in requirements.txt, update with "pip install -r requirements.txt"
- new version of killredisconn.py - fixed zombie idle worker connections not getting removed from Redis
- diskover-bot-launcher.sh v.1.6.3
- requests dependency warning about supported urllib3 version in lsio docker hub image
- user_prompt error when asking to overwrite index in Python 2
- SyntaxError: invalid syntax user_prompt function error when starting bots using Python 2
- diskover Storage Agent support with new --storagent cli option (see https://github.com/shirosaidev/diskover-storage-agent)
- log output if one of the tree walk threads is scanning a directory with many files
- optimized tree walk code
- unicode decode path error will now print warning instead of diskover tree walk thread raising an error and stopping
- diskover-bot-launcher.sh version 1.6.2 - added bot start check and check for .py file paths (config settings at top of .sh file)
- select indices page now shows any index still being built status in drop down list
- dir calc issues with newline "\n" characters in paths
- version increase to match diskover-web updates
- ended release candidate (rc) ver
- crawl api to allow diskover to crawl file system apis (see diskover github wiki for usage instructions)
- crawlapi section to diskover.cfg.sample, copy to your diskover.cfg
- --crawlapi flag to diskover.py
- optional usage of json files for storagecost and autotag definitions (see diskover.cfg.sample and wiki for how to) (@mathse)
- cli option -F --forcedropexisting to silently drop existing index (@fake-name)
- user prompt before deleting existing index (@fake-name)
- removed qumulo section from diskover.cfg.sample, remove from your config as it is no longer used
- removed diskover_qumulo.py and all code references in diskover (future will add as addon/plugin to new crawl api)
- indexing a small number of directories would cause dir sizes to not get calculated
- NameError exception when running --crawlbot continuous scanner mode
- faster finddupes
- worker bot warnings output for finddupes for any io/os exceptions
- restoretimes config setting to dupescheck section in diskover.cfg.sample, copy to your config - setting to True will try to restore atime and mtime for any files which get opened from byte check and md5 (useful for cifs which does not work with noatime mount option)
- finddupes now uses threads setting in diskover.cfg dupescheck section, copy from diskover.cfg.sample and adjust for your env, prev. was 4 for threads, default is now 8
- requirements.txt to support newer versions of rq and redis python modules
- bots disappearing from redis rq (rqinfo and rq-dashboard), upgrade to redis 3.0.1 and rq 0.13.0 python modules using pip
- diskover socket server Traceback Exception "TypeError: can't concat JSONDecodeError to bytes" from sending non json data to socket server
- export.json Kibana export missing some visualizations
- UnicodeEncodeError: 'ascii' codec can't encode character when running diskover-gource.sh using python 2
- Traceback errors when running hotdirs or copytags
- multiple es hosts can now be set in diskover.cfg elasticsearch section, see diskover.cfg.sample
- improved worker bot stability
- unix socket setting to redis section in diskover.cfg.sample, copy to your config and set if using redis unix socket, see diskover github wiki for more information about redis optimization for diskover
- switch to using redis connection pools
- removed redis worker ttl, remove from your diskover.cfg redis section
- removed dir calc threads for bots which was causing issues with es number of queued jobs (issue #47)
- dir calcs now use batchsize setting and adaptive batch (if using -a) for sending to worker bots
- running in verbose or debug now prints out directories being processed by treewalk/scandirwalk
- es error with too many queued dir calc jobs (issue #47)
- bug with calculating directory sizes for subdirs in / (root)
- bug with directory excludes and scandirwalk_worker building dir/file lists for excluded directories, which was causing slowdowns when crawling excluded directories that contain a large number of files/dirs
- -m --mtime cli arg for diskover.py now allows for negative numbers to only index files modified in the last n days, example -m -30 would only index files that have been modified in last 30 days
- v1.0.21 tree walk client - changed lsthreaded to pls (parallel ls), ls tree walk methods require GNU ls (set path using -g), improved directory excludes (see -e)
- support for running diskover.py in Windows 10 (cifs mapped drives) and sending to bots running in linux or linux subsystem for windows (bots do not work in Windows)
- --replacepath cli arg to diskover.py for replacing paths sent to bots (windows/linux path translation)
- warning message output for bots for any exceptions getting meta data for files/directories
- multithreading to bot dir size calcs to help speed up dir size calc times
- --twcport for changing port for tree walk client socket server from one set in config
- improved detection if bots are still busy doing jobs (patch from seanbales)
- removed redis socket timeout options from diskover.cfg.sample - causing issues with rq bots disappearing
- removed -n --noreconnect cli arg for worker bot
- improved performance of --dirsonly cli arg
- changed inode mapping data type to keyword (prev was float) to account for very large inode numbers
- bug with tree walk client and directory excludes using ls or pls tree walk methods, requires GNU ls
- bug where dir calcs might start before bots are all finished doing very long crawl jobs and bot disappears from rq (patch from seanbales)
- bug with tree walk client pls tree walk mode and not indexing files in rootdir
- dirs which have no files/subdirs (from excludes) getting indexed
- --dirsonly cli arg to not include files in batch sent to bots, only send dirs, bots scan for files
- maxfiles config setting in adaptivebatch in diskover.cfg.sample for max number of files in batch, copy from diskover.cfg.sample
- redis socket timeout setting to diskover.cfg.sample redis section, copy to your diskover.cfg and edit for your env
- -n --noreconnect cli arg to diskover_worker_bot.py to not reconnect on redis timeout (default is to reconnect)
- -l --loglevel cli arg to diskover_worker_bot.py to set logging level
- v1.0.20 tree walk client - added in ls, lsthreaded tree walk methods, pscandir patches, pscandir (parallel scandir) is now the default tree walk method (prev was scandir)
- v1.6 diskover-bot-launcher.sh - added logging level, log to file
- better handling of checking if worker bots are idle and queues empty in diskover.py
- set socket keep alive and retry on timeout to True for redis connections
- issue where an io/os error such as permission denied caused the tree walk to not finish
- occasional issue where bots/queues incorrectly tested to be idle and empty
- costpergb field to es mapping for storing file and directory costs
- storagecost section in diskover.cfg.sample, copy to your config and edit for your env
- -G --costpergb cli arg for storing cost per gb in file and directory docs
- -S --sizeondisk cli arg for setting file's size on disk (disk usage size from blockcount x blocksize) instead of filesize from stat
- -B --blocksize cli option for setting block size for --sizeondisk (default is 512 bytes)
- tree walk client v1.0.19
- ownersgroups section in diskover.cfg.sample for adjusting how owner (user) and group fields are stored for file and directory docs, copy to your config and edit for your env
- function get_owner_group_names to diskover_bot_module.py for handling uid/gid -> name lookups and caching
- dirlisttime setting in crawlbot section in diskover.cfg.sample, copy to your config and edit for your env
- friendlier error message if missing section from diskover.cfg during startup
- inode field es mapping for file and directory doc types to keyword (string), prev was long
- removed -S flag for --crawlbot
- set fixed version numbers to python dependencies for pip in requirements.txt, check you are using those versions using pip, newer versions may cause issues
- bug with indexing file systems with inode values larger than es long number type
- bugs with --crawlbot crawlbot continuous scanner
- bug with a very high number of files in a single directory causing walk worker threads to exit prematurely causing not everything to get indexed (pr from seanbales)
- increased rc version number
- threaded tree walk
- dirs/sec to crawl progress bar
- updatedirsizes action to socket server for diskover-web
- reduced time to do dir size calcs
- multithreading for qumulo api crawl
- -T --walkthreads to diskover.py cli options for setting num of threads for tree walk (default is cpu cores x 2)
- additional progress bars indicating ETA for crawling and dir size calcs, loaded after tree walk complete and all dir batches enqueued and after all dir size batches enqueued
- rolled back to rc20 way of calculating dir sizes at end of crawl
- tree walk client v1.0.18
- added pscandir (parallel scandir) tree walk method to client, see -h for new cli options in client
- replaced scandir walk with scandir and faster custom scandirwalk function
- redis timeout in diskover.cfg.sample to 3600 sec (rq job timeout), default is 180 sec for rq
- improved scandir.py in treewalk_client, better isilon hacks for faster performance using ctypes
- removed --ls from tree walk client
- issues with dir size calcs
- bug with directories getting walked which are excluded in normal crawl and tree walk client (affected earlier rc23 builds)
- mem issues with long running crawls
- bug when running diskover tree walk client with Python 2 and any os/io error with scandir caused client to crash
- tree walk client v1.0.14
- removed lswalk from diskover and ls, lsthreaded from tree walk client
- memory issues with storing dir sizes and updating dir sizes at end of crawl
- bugs with unicode when client running python2 and server running python3
- much faster dir size updates at end of crawl
- tree walk client v1.0.13 - added cli args, see -h for help
- redis ttl (key/results expiry time) setting to diskover.cfg.sample, copy to your config file and set for your env
- dir size calculations are now done by diskover.py process and using size results returned from rq jobs, no longer enqueueing dir calc jobs to bots
- removed workerbot section from diskover.cfg.sample including bot logging settings, remove from your config
- bug with tree walk client not sending last batch of dirs
- bug with tree walk client not removing trailing slashes from paths, causing traceback in diskover.py when updating dir sizes at end of crawl
- bug with tree walk client and using ls walk method, ls: invalid line width: f
- bug with diskover and using --lswalk, ls: invalid line width: f
- bug with treewalk client and metaspider crawl method
- bug with lswalk and directory excludes
- bug with qumulo tree walk
- rc18 and rc19 had bugs with dir calcs and were calculating incorrect sizes, please update to rc20
- improved socket server
- improved dir calc speeds
- cli arg -L --listentwc to listen for directory listing messages (pickle) from remote python diskover-treewalk-client.py
- diskover-treewalk-client.py - v1.0.8 python client for diskover socket server to run direct on storage servers for faster tree walking (see wiki)
- additional redis config options in diskover.cfg: db, timeout, queues (copy from diskover.cfg.sample into your config)
- additional socket server options in diskover.cfg: maxconnections, twcport (copy from diskover.cfg.sample into your config)
- can now specify different diskover config file using env var DISKOVER_CONFIG
- cli arg --dircalcsonly for calculating sizes and item counts in all directory docs in existing index
- diskover_connections.py
- diskover_bot_module.py
- diskover_lswalk.py
- scrollsize (elasticsearch search scroll size) to diskover.cfg.sample elasticsearch section (copy to your diskover.cfg and adjust for your env)
- --lswalk cli arg which uses custom lswalk generator (faster treewalk) instead of default scandir walk
- updated diskover-bot-launcher.sh to v1.5
- removed -q queue cli arg from diskover-bot-launcher.sh, use queues in diskover.cfg redis section
- removed -q queue cli arg from diskover bots, use queues in diskover.cfg redis section
- any uppercase index names are automatically lowercased (helge000 pr)
- set file mode to 755 for py and sh files (helge000 pr)
- switched to rq SimpleWorker since Worker was opening up new connections to es and redis due to fork for every new job
- diskover-treewalk-client.py v1.0.9 - added lsthreaded tree walk method, threads adjustable at top of client py
- diskover modules import cleanup
- moved elasticsearch and redis connection code into diskover_connections.py
- moved worker bot functions into diskover_bot_module.py
- reduced output logging for worker bots
- removed threads for file meta scraping and es bulk adding in worker bots as did not see any real performance gain
- removed job passing between bots as did not provide any performance gain
- switched to generator for dir calcs to help speed up dir calc processing time
- removed -n --nodelete cli arg, use --reindex or --reindexrecurs to add data to existing index
- file symlinks getting indexed
- directories containing just symlinks (no actual file/subdirs) getting indexed
- elasticsearch error when using index with uppercase letters (helge000 pr)
- Qumulo api crawl
- s3 inventory file importing
- when using -O to optimize index at end of crawl, stack trace could occur if running longer than es timeout, added catch for this event
- reduced crawl times
- reduced number of es bulk updates and optimized frequency of bulk updates
- improved crawl performance over nfs/cifs mounts
- bots will now enqueue paths into redis queue (rq) if other bots are idle to improve crawl efficiency
- threads for es bulk adds and file meta collecting in bots
- removed filethreadtime from diskover.cfg.sample, removed thread code for long running directories
- removed treethreads from diskover.cfg.sample, removed thread code for crawling directories in rootdir since provided no real benefit and was causing slower crawls over nfs and cifs
- use datetime isoformat instead of strftime (faster)
- python error when not using -d rootdir flag with qumulo crawl (--qumulo)
- traceback error output when optimizing index (-O) takes longer than es timeout setting in diskover.cfg
- requires diskover-web >= 1.5.0-rc15
- index sizes are now up to 15% smaller (optimize your indices after crawling for best size reduction)
- -O --optimizeindex cli option to automatically optimize index (reduce size) after crawl and dir size calcs are complete
- removed docs for crawlstat for directories and added crawl_time field to directory docs
- crawlstat doc now has "state" field to indicate running/finished_crawl/finished_dircalc
- better usage help info for optimize_indices.sh
- threaded bulk importing of s3 inventory files
- s3 inventory file importing is handled by python threads instead of rq worker bots
- show progress bar for s3 inventory importing
- s3 inventory import issue causing duplicate bucket/directory docs in es when importing multiple inventory files
- s3 inventory import issue causing multiple buckets in inventory files to not be recognized correctly
- bug with hot dir calculation when directory changed from 0 bytes to > 0 bytes not updating 100% change
- slow importing when using many s3 inventory files
- set exit code to 1 when index named incorrectly
- version change only, no additional updates
- Amazon S3 inventory support is beta, requires diskover-web >= v1.5.0-rc10
- --s3 requires index named diskover_s3-indexname
- changes to diskover.cfg.sample, please copy over to your diskover.cfg and adjust for your env
- Amazon S3 inventory support - you can now import Amazon S3 inventory (CSV gzip format) to diskover ES index using --s3 cli arg and supplying 1 or multiple gzipped csv inventory files (see wiki or -h)
- faster directory size calculations at end of crawl by reducing es update calls and using bulk update
- maxsize in checkdupes section in diskover.cfg.sample - used for setting max file size to check for dupes (copy to diskover.cfg)
- checkbytes in checkdupes section in diskover.cfg.sample - used for setting bytes to check at start and end of file before doing md5 sum check (copy to diskover.cfg)
- new es optimization settings to elasticsearch section in diskover.cfg.sample - new settings for indexrefresh, disablereplicas, translogsize (copy from diskover.cfg.sample to your diskover.cfg)
- autotagging to diskover_qumulo
- progress bar output for dir size calculation jobs
- additional characters to escape_chars function
- --maxdcdepth to cli args - maximum depth to calculate directory sizes/items (default 10)
- autobatch section in diskover.cfg.sample for setting auto batch options (when using -a) (copy to your diskover.cfg)
- separate queues diskover, diskover_crawl, diskover_calcdir
- cachedirtimes setting in diskover.cfg redis section - for enabling/disabling caching directory times in Redis (used for -I index2 cli arg), default is False (don't cache)
- diskover_worker_bot.py cli arg -q --queue for setting queue that the worker listens on and processes jobs for (default all queues)
- v1.4 of diskover-bot-launcher.sh - added -q option for setting which queue worker bots should listen on (default all queues)
- optimized es bulk adding in es_bulk_adder function
- creating indices with --s3 (from Amazon S3 inventory files) now creates fake dir entries for all keys
- progress indicators for hotdirs and finddupes
- filethreadtime to workerbot section in diskover.cfg.sample - threads are started to help scrape file meta if rq job time (path crawl) > seconds (copy to your diskover.cfg)
- multithreading for file md5 checking when running finddupes
- new Kibana dashboards/visualizations (export.json)
- optimize_indices.sh script for optimizing elasticsearch diskover indices (reduces index size, accepts 1 required arg eshost and 2 optional args username password)
- hotdirs to socket server commands
- directory paths are hashed using base64 encode when storing in redis for caching directory times (times are used when crawling with -I)
- moved autotag code after plugin code when setting file/directory doc meta data fields
- set default for shards/replicas to 1/0 in diskover.cfg.sample (most users are just using single es node, if you are, you might want to set these)
- directory size/items calculations at end of crawl are now limited by --maxdcdepth cli arg (default 10), previously was unlimited depth
- improved treewalk and qumulo_treewalk functions
- set default for -b (batchsize) to 50 (prev was 25) (using -a usually results in faster crawl times, overrides -b)
- different job types go into different queues (diskover, diskover_crawl, diskover_calcdir)
- dir times are no longer cached in Redis by default (used by -I index2 cli arg) (settings in diskover.cfg.sample, copy to your diskover.cfg)
- threads for treewalking are now limited by threads setting in diskover.cfg treewalk section (copy from diskover.cfg.sample)
- set checkbytes size to 64 in diskover.cfg.sample to help improve dupes checking (to account for header info data in image/video files)
- diskover s3 indices are required to be named diskover_s3- (changed to better deal with index patterns in Kibana)
- diskover qumulo indices are required to be named diskover_qumulo- (changed to better deal with index patterns in Kibana)
- bugs with autotagging
- crawlbot continuous scanner (-B) stack trace error (logger)
- bugs with dupe finding (--finddupes)
- bugs with Kibana dashboards/visualizations (export.json)
- bugs with reindexing using --reindex or --reindexrecurs
- bugs with directory calculations
- bug with waiting if any worker bots are running
- bug with disk space info path getting set to sub directory when reindexing sub directory
- diskover-bot-launcher.sh has been updated, when updating with git please check that any of your env settings at top of file have not changed, you may need to edit these again
- if using the autotag flag, you may want to add a new custom tag in diskover-web admin page for "autotag" if you are using that as the tag_custom value in autotag patterns
- directory excludes (see diskover.cfg.sample) now includes better wildcard searching including for example tmp or tmp* or *tmp
- socket server to accept use of adaptivebatch or batchsize (see wiki for how to)
- --autotag cli arg to turn on bot auto-tagging
- autotag section to diskover.cfg (see diskover.cfg.sample and copy from there) - can be used to get bots to auto tag files/directories during crawl based on patterns
- v1.3 of diskover-bot-launcher.sh - added restart bot cli arg -r (changed redis worker remove to -R), added -f to force remove redis client connections and cleaned up script
- improved dupe checking
- better killredisconn.py (output of status and -f arg to force remove (ignore idle time))
- bug with dupe md5 check
- bug with regular expression matching for directory excludes
- bug with killredisconn.py not working with Python 3
- requires diskover-web >= v1.5.0-rc6
- new directory doc fields/mappings for change percents (change_percent_filesize, change_percent_items, change_percent_items_files, change_percent_items_subdirs), used by hotdirs
- --hotdirs cli arg for calculating directory change percents between index2 to index (hot directories)
- killredisconn.py script to kill any stale/idle redis rq worker bots (redis clients); this could happen from the worker bots cold shutdown (sigkill) instead of warm (sigint/sigterm)
- v1.2 of diskover-bot-launcher.sh
- various bug fixes
- bug when changing redis host in diskover.cfg
- bug causing worker bots to not start (unable to connect to Redis) when running on a host other than same host as Redis/ES
- requires diskover-web >= v1.5.0-rc5
- items_files and items_subdirs fields (es mappings) for directory doc type for storing total files and subdirs items when calculating directory sizes
- bug causing files not to be indexed when using qumulo crawl and file/directory owner/group is local type
- threaded crawling for each top level subdir when using --qumulo (Qumulo api)
- qumulo_api_listdir function to diskover_qumulo.py module
- qumulo_api_walk function to use qumulo_api_listdir
- issue with diskover_qumulo.py module and urllib.quote with paths with special characters like è (needed to encode utf-8)
- Qumulo api support is beta and supports only Python 2.7.
- Qumulo requires python module qumulo-api, install using pip (no python 3 module)
- no file/dir access times in diskover-qumulo-name indices, not supported in Qumulo api
- Qumulo api support, new --qumulo cli option, Qumulo api will be used instead of scandir, requires index names diskover-qumulo-
- diskover_qumulo.py module
- different ES index mappings for qumulo (removed last_access, added creation_time) (Qumulo api does not have file access time)
- qumulo section to diskover.cfg.sample
- hardlinks, inode fields to directory mappings/docs
- improved screen output logging for worker bots
- $ to escape_chars function
- moved file_excluded function to diskover_worker_bot.py module
- occasional issue where not all directories were getting calculated (added sleep before index refresh and getting directory docs)
- progress bar showing when running in debug or verbose
- unicode decode errors when using -I and paths with special characters
- adaptivebatch_maxsize global variable to control max size (number of directories in batch) sent to Redis (set to 500)
- @, ', " to escape_chars function in diskover.py
- added includes section to diskover.cfg to whitelist dirs/files
- improved -a adaptivebatch algorithm
- adaptivebatch_startsize is now set to 10 (prev was 5)
- unicode issues with sending paths to Redis which contain special characters
- adaptivebatch_startsize global variable
- when using adaptivebatch, batchsize cliarg is updated during crawl to show current batchsize in worker output
- improved speed of using -I flag to get meta data (doc source) from previous index instead of disk when comparing directory times
- using second index (index2) when comparing directory sizes to get meta data from previous index instead of off disk (-I)
- bug fixes for crawlbot continuous scanner (-B)
- tag copying from index2 to index (-C)
- set root path (-d) to unicode if using python2
- requires Redis
- requires rq and redis python modules (pip install)
- requires diskover-web >= v1.5.0-rc1
- this is a release candidate for v1.5.0
- ** crawlbot continuous scanner (-B) is buggy, hoping to have it stable in final release **
- recommended to pip install rq-dashboard (rq-dashboard is used for monitoring rq redis queue)
- mtime + ctime for directories is now stored in Redis to help speed up indexing of directories which don't change from previous index (index2) to new index (when using -I flag). When crawling, directory mtime + ctime are checked and if same as in Redis cache then meta data for directory and all its files is used from index2 instead of off disk.
- -I index2 cli option for setting prev index when doing directory comparison (see above)
- dirtimesttl option in Redis section in diskover.cfg.sample for setting how long directory times are stored in Redis (default 1 week)
- requires Redis
- requires rq and redis python modules (pip install)
- requires diskover-web >= v1.5.0-beta.5
- this is a pre-release beta for v1.5.0
- ** crawlbot continuous scanner (-B) is still buggy, hoping to have it stable in final release **
- recommended to pip install rq-dashboard (rq-dashboard is used for monitoring rq redis queue)
- = (equals sign) to escape_chars function
- when running dupes check, file md5 sums are now checked in chunks against previous file rather than comparing whole md5 sum
- crawl elapsed time now gets set when all crawl jobs are finished (workers done all crawl jobs), before dir sizes are calculated
- directories not getting indexed which had similar name to excluded directory, example Cache in dir excludes was not indexing directories named Caches, if you want you can exclude all similar directories using wildcard such as Cache*
- requires Redis
- requires rq and redis python modules (pip install)
- requires diskover-web >= v1.5.0-beta.5
- this is a pre-release beta for v1.5.0
- ** crawlbot continuous scanner (-B) is still buggy, hoping to have it stable in final release **
- recommended to pip install rq-dashboard (rq-dashboard is used for monitoring rq redis queue)
- removed adding file filesizes to directory doc during crawl (was causing issues with calculating directory sizes)
- bug with directory size/items calculations
- requires Redis
- requires rq and redis python modules (pip install)
- requires diskover-web >= v1.5.0-beta.5
- this is a pre-release beta for v1.5.0
- ** crawlbot continuous scanner (-B) is still buggy, hoping to have it stable in final release **
- recommended to pip install rq-dashboard (rq-dashboard is used for monitoring rq redis queue)
- set default batch size to 5 (adjust using -b n if you find workers being idle (set lower number) or queue too large (set higher number))
- improved adaptivebatch algorithm to try and reduce idle workers
- adaptivebatch applies to directory calculations now too
- requires Redis
- requires rq and redis python modules (pip install)
- requires diskover-web >= v1.5.0-beta.5
- this is a pre-release beta for v1.5.0
- ** crawlbot continuous scanner (-B) is still buggy, hoping to have it stable in final release **
- recommended to pip install rq-dashboard (rq-dashboard is used for monitoring rq redis queue)
- reduced file stat calls/crawl time by checking for excluded file extension before min size, file will not get stat call now if extension in exclude list
- reduced calculating directory size time by adding file sizes and items to directory doc during crawl and then aggregate sum the sub directory docs instead of files
- bug with worker bot failing jobs when using verbose/debug logging
- requires Redis
- requires rq and redis python modules (pip install)
- requires diskover-web >= v1.5.0-beta.5
- this is a pre-release beta for v1.5.0
- ** crawlbot continuous scanner (-B) is still buggy, hoping to have it stable in final release **
- recommended to pip install rq-dashboard (rq-dashboard is used for monitoring rq redis queue)
- threading to crawlbot continuous scanner
- threads setting in diskover.cfg for crawlbot continuous scanner, default is 8, for searching for mtime changes in directories in existing index
- chunksize (es bulk size) in diskover.cfg.sample to 1000 from 500 (help speed up crawl times)
- maxsize (es connection count) in diskover.cfg.sample to 20 from 10 (help speed up crawl times)
- rootdir files not getting indexed
- bugs with reindexing and crawlbot continuous scanner
- various bug fixes
- caching of uid/gid owner group names to help speed up crawl times and reduce lookups on directory services
- improved adaptive batch
- faster crawl times
- workerbot section to diskover.cfg with log settings
- ver 1.1 of diskover-bot-launcher.sh, better log handling (logs will be named diskover_bot_worker___log and default is stored in /tmp, change dir in diskover.cfg)
- separate redis connection for each worker and es connection is loaded only once when worker starts
- renamed diskover.cfg to diskover.cfg.sample to help with updates (copy diskover.cfg.sample to diskover.cfg if you don't have)
- bug where some directories sizes were not getting calculated at end of crawl (index was not being refreshed)
- threading module to diskover.py to parallel tree walk from each directory in rootdir and enqueue those directories into Redis for worker bots to process
- removed RLock from diskover_socket_server.py, using python global threading lock
- diskover-gource.sh to ver 1.1
- bug in diskover_gource.py
- requires Redis
- requires rq and redis python modules (pip install)
- requires diskover-web >= v1.5.0-beta.5
- this is a pre-release beta for v1.5.0
- ** crawlbot continuous scanner (-B) is very buggy, hoping to have it more stable in later releases **
- recommended to pip install rq-dashboard (rq-dashboard is used for monitoring rq redis queue)
- diskover_worker_bot.py - worker bot module for processing Redis queue
- requirement for Redis
- requirement for rq and redis python modules
- new options in diskover.cfg for redis
- -b --batchsize flags to diskover.py for controlling the batch size (num of dirs) to enqueue for each worker bot to process
- -a --adaptivebatch for auto-adjusting batch size during crawl
- config option in diskover.cfg for ES to wait for at least yellow status before bulk uploading (default is False)
- removed python Queue and using Redis for enqueuing jobs
- no longer using threading, switched to using workers (diskover_worker_bot.py), run multiple workers to consume queue jobs
- moved dupes, gource, socket server, crawlbot, redis worker into their own modules diskover_.py
- removed dependency for blessings
- removed diskover-mp.sh and requirement for parallel. diskover_worker_bots.py can be run in parallel to help with the redis queue.
- requires diskover-web >= v1.5.0
- this is a pre-release beta for v1.5.0
- requirements for progressbar2 and blessings (ncurses) python modules, install with pip
- re module for regular expression searches for wildcards in directory excludes (example tmp* or /dir1/tmp* will now work)
- ability to send json data to diskover socket server using curl (see wiki for how to)
- an additional diskspace doc is now added for every reindex of a directory (also for crawlbot)
- better progress bars using progressbar2 and blessings (ncurses) python modules
- removed --progress flag (json output)
- crawlbot bugs and crawlbot using high cpu
- requires diskover-web >= v1.5.0
- this is a pre-release beta for v1.5.0
- you can now specify how many threads for crawling directory meta as well as file meta; this should help users with directories that have tons of files
- -w and -W cli flags for setting crawl worker threads for directories (-w) and files (-W), separated the control of threading for each
- -c --calcrootdir flag to calculate rootdir size after running parallel crawls (used by diskover-mp.sh)
- diskover-mp.sh ver 1.2 - improved parallel crawls
- improved crawl progress bar to show separate dir and file percents
- added function for checking file and extension excludes check_file_excludes
- check for excludes of files/directories is now also done at beginning to reduce parallel crawl times
- crawl_stat mapping for elapsed_time field to type float, issues with long and crawl taking less than 0 seconds
- bug with directory items count not matching exact number of sub dir/file docs
- requires diskover-web >= v1.5.0
- this is a pre-release beta for v1.5.0
- faster crawling and directory size calculations (directory sizes/items are now calculated during crawl)
- faster tag copying from previous index to new index
- crawlbot continuous scanner is now multi-threaded
- diskover-mp.sh (multiproc shell helper script) ver 1.1 - use to run parallel diskover.py processes across top-level directories (settings at top of script); requires GNU parallel command https://www.gnu.org/software/parallel/
- added -c --calcrootdir for calculating rootdir filesize/items after running parallel crawls (run this after all crawl processes finish)
- added -e --indexemptydirs flag to index empty directories (empty directories will show item count as 1 for itself)
- ES index shard size and replica settings in diskover.cfg
- Queue size setting in diskover.cfg
- improved crawlstats
- warning if indexing 0 Byte empty files (-s 0)
- empty directories are not indexed (reduces index size); to index them, use the -e flag
- removed scandir directory iteration and just use scandir.walk in main thread to add tuple of directory and files to queue
- removed path_parent.tree text field from directory and file mappings since was not being used (help reduce index size)
- -S --dirsize flag has been removed since dirsize is calculated during crawl
- crawlstats es mapping and add_crawl_stats function now only uses crawlstat doctype instead of crawlstat_start, crawlstat_stop
- diskover no longer requires being run as root user. It will only output a warning when not run as root.
- moved pythonpath and diskoverpath in config to a new paths section
- combined index_add_files, index_add_dirs into index_bulk_add function
- dupe_md5 field being set to same as filehash instead of md5sum when running tagdupes
- bugs in diskover-mp.sh
- crawl stats not updating in ES when running in -q quiet mode
- crawl stats output at end of crawl, file count was showing total instead of indexed count
- -r reindex option reindexing files in 2nd level subdirs causing duplicate docs in index
- bugs in crawlbot
- calculating directory sizes for / (root) and directories in /
- elapsed time when crawling for more than 24 hours
- not being able to load more than 1 plugin
- removed tag "untagged" from all files and directories; the tag is now just an empty string (helps reduce index size)
- when calculating directory sizes, the count of subdirs is added to items (previously was just the count of files)
- --tagdupes cli arg has been renamed to --finddupes
- --finddupes now updates dupe_md5 to be the md5sum of the file (previously was just boolean)
- changed is_dupe boolean field to dupe_md5 keyword field (default is empty)
- -C --copytags cli flag to copy tags from source index (index2) to destination index (overwrites any existing tags in index)
- plugins will now work with adding additional meta fields and mappings for directories
- worker_setup_copytags function for setting up worker threads for copying tags
- worker_setup_copytags and copytag_worker functions for setting up worker threads and copying tags
- reindexing, single file indexing and crawlbot (continuous scanning) now preserves any existing tags in index
- added check for plugins to see if for file or directory
- renamed index_get_dirs to index_get_docs and added ability to get file or directory docs and also return doc id as well as fullpath and mtime
- diskover project is now accepting donations on PayPal. Please consider supporting if you are using diskover :) https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=CLF223XAS4W72
- bug in calculating directory sizes with similar path names
- bug in finding directories with similar path names when collecting directories for reindex
- improvements to reduce function calls in get_file_meta and get_dir_meta (rapphil)
- improved performance by reducing plugin loading (rapphil)
- diskover project is now accepting donations on Patreon. Please consider supporting if you are using diskover :) https://www.patreon.com/diskover
- faster crawl and reindexing performance
- improved duplicate file finding functionality and performance
- improved -b --breadthfirst crawl algorithm
- improved progress bar output
- escape_chars function to better escape special characters (in paths) when searching in ES
- -S --dirsize cli option to calculate single directory size and item counts or all in existing index and update dir doc filesize, items fields
- -B --crawlbot cli option to start up crawl bot which runs in continuous loop to check index for directories which have changed (mtime) and recrawl those directories
- added crawlbot section and sleeptime option to config file to control how long bot sleeps before scanning next directory in list
- socket server support for python3
- debug output to socket server
- directory tree is now walked using scandir.walk before adding directories to queue
- -s --minsize cli flag is now in Bytes (previously was MB), default is >0 Bytes. You can crawl empty files now by setting -s 0
- set maxsize for Queue to 1000 items
- cli args -v is now for verbose and -V for version
- socket server switched to TCP and allows up to 5 connections with threaded tasks
- progress and progressbar will only update screen when progress has increased
- crawlstat_start and crawlstat_stop doc no longer gets indexed when tagging dupes (--tagdupes)
- --maxdepth crawling files 1 depth past maxdepth (matches find command now)
- fatal error when outputting for Gource
- socket server now works properly with python3
- duplicate file finder progress output
- required by diskover-web v1.4.0
- Elasticsearch/Kibana v5.6.4 support
- scandir python module v1.6 support
- maxsize to config file to adjust the maximum connections open to ES when crawling
- add_diskspace function to add disk space info (path, total, free, available disk space) to elasticsearch
- additional mappings and fields for disk space info (fields: path, total, free, available), new es document type is named 'diskspace'
- additional mappings and fields for directory doc type: filename, path_parent, filesize, user, group, tag, tag_custom
- add_crawl_stats function to add crawl stat info (start/stop/elapsed time) to elasticsearch
- additional mappings and fields for crawlstats doc type: path, start_time, stop_time, elapsed_time
- additional banner and random color for banner and stats
- removed Windows support
- path field in directory doc type to filename (keyword type)
- removed type=str from argparse
- added try condition to import elasticsearch5 (for elasticsearch 5.6.)
- imported Urllib3HttpConnection from Elasticsearch
- empty directories not getting indexed causing diskover-web filetree to not show all subfolders/files
- unicode issues in python2.7
- rootpath is stored as directory name instead of . in ES
- Connection pool is full, discarding connection warning messages in log output when crawling using a high number of threads (new maxsize setting in config file)
- ability to add additional diskover index mappings (file meta data fields) using diskover plugins
- -b or --breadthfirst cli option to crawl breadth-first rather than depth-first (default)
- empty directory meta data is no longer indexed. Previously if a directory was empty, the directory meta data would still get indexed.
- moved file size check above excludes check in get_file_meta function
- renamed function add_file_to_es to get_file_meta
- renamed function index_add_dir to index_add_dirs
- switched to entry.inode() to get inode number. Previously was entry.stat().st_ino
- only thread 0 updates progress bar
- maxretries to config file for changing the amount of retries for ES operations (default is 0)
- chunksize to config file for changing the max amount of documents before ES bulk operation (default is 500)
- added check before ES bulk operations to wait for yellow status of ES health
- added request_timeout to helpers.bulk operations
- code cleanup/refactoring
- dupescheck section in config file to modify readsize for md5 sum file check
- tagDupes function now loads in file x KB at a time when doing md5 sum check, previously loaded whole file into memory
- tagdupes causing Python MemoryError crash when loading large file into memory when doing md5 sum check
- dupesFinder function now searches for the 10000 hashgroups with largest files, 1000 dupe files per hashgroup
- tagdupes causing crash with fatal error "Killed" when searching index with a lot of file hashes
- --listen cli option for opening listen socket for remote commands
- improved progress bar now shows directories per second and eta
- --progress cli option to only output progress in json format
- --reindex (non-recursive) and --reindexrecurs (recursive) cli options to reindex (freshen) existing directory
- caching of owner/group names
- --maxdepth cli option for setting maximum directory depth to crawl
- diskover-mp.sh shell script to help run parallel diskover.py processes
- optimized crawler by not adding empty directories to Queue
- set to bulk load data to ES when file/dir list sizes at 500 (previously was 1000)
- set default threads to 8
- code cleanup
- occasionally at end of crawl remaining files in filelist would not get indexed in ES
- file exists check when indexing single file
- absolute paths in excluded directory list not being skipped in crawl
- crawl sometimes hanging at end when using more than default number of threads
- duplicates count at end of tagdupes showing wrong number of dupes tagged in Elasticsearch
- keyboard interrupt sometimes not working when stopping tagdupes
- elasticsearch timeout setting in config diskover.cfg
- increased timeout to 30 seconds for finding dupes using scroll api, default for Elasticsearch python client is 10 sec which was causing crash searching index containing many duplicate hashes
- bug causing directories to get indexed as file type documents in Elasticsearch and also excludes being ignored (due to changes in v1.2.4)
- combined excludes (dirs/files) into one group "excludes" in config diskover.cfg
- increased timeout from 10 seconds (default) to 30 seconds for Elasticsearch transport class in elasticsearchConnect function
- check if path exists before crawling
- index single file using "-f or --file" cli argument
- no longer using python 3 built in os scandir, requires scandir module same as python 2
- more debug output for file and excludes
- decreased crawl time by creating Queue for subdirs in rootdir and using half the threads to recursively crawl down those paths. Previously only the main thread was used to crawl down tree from rootdir
- reduced cpu usage by removing stdout flush for progress bar
- occasionally at end of crawl few remaining files in Queue would not get bulk added to ES
- unicode issues
- can now set minimum file size using '-s' or '--minsize' for duplicate file finding '--tagdupes'
- '--mtime' cli option for modified time now also checks directory mtime and skips adding to queue
- decreased crawl times by modifying Elasticsearch bulk item size, reducing file stat calls, reducing queue wait sleep time
- filelist and dirlist now get bulk added to ES and emptied when at 1000 or more items, previously dirlist would get bulk added after all directories were crawled and filelist was bulk added after each directory
- reduced file stat calls by storing entry.stat() and os.stat(path) into stat var and using it for different stat
- tagdupes duplicate finder will now search ES for all results and dupe finding is done in php rather than ES aggregate buckets tophits. This allows finding all dupes instead of being limited to 10000 hashgroups.
- excluded_dirs can have absolute paths as well as just directory names
- improved code for duplicate file detection
- default is now >0MB for cli option '--minsize' (min file size)
- default is now 0 for cli option '--mtime' (min days old)
- removed global variables for total file counts and replaced with local for each thread, totals are calculated at end of crawl stats output
- better handling of unicode, unicode was causing Exception errors
- crawl stats not reporting correct file count in python 2
- progress bar for tagdupes now more accurately reflects check progress
- bugs with unicode text causing indexing errors
- path_parent is now multi field, keyword and also path_parent.tree text field
- path_parent.tree uses ES path hierarchy tokenizer
- all directories are now indexed (ES type is directory) with fields path, last_access, last_modified, last_change, indexing_date
- path field is multi field both keyword and text, path.tree text field uses ES path hierarchy tokenizer
- nice cli flag to reduce cpu/disk io
- stats output at end of crawl/dupe check
- removed find command for building directory queue and replaced with python scandir
- set default crawl threads to 4
- tagdupes would occasionally hang if file couldn't be opened for byte check
- files are not marked as duplicate if hardlink count > 1
- better handling of keyboard interrupts and killing threads
- multi-threaded duplicate file checking
- bytes are stored in base64 when doing duplicate file byte comparison
- some duplicate files not being found in ES
- ES connection timeout issue when searching for a lot of duplicate files
- fatal error and crash when searching for duplicate files that no longer existed
- fatal error and crash when duplicate file only 1 byte and running byte check
- improved duplicate file finding using multi-pass detection 1) filehash (mtime/filesize) 2) first and last few bytes 3) md5 sums
- tag_custom field and es keyword mapping
- tagdupes cli flag will now only update existing index and will not overwrite any existing index
- removed path_full field and es mapping. duplicate data in path_parent and filename
- gource visualization support, --gourcert and --gourcemt cli options
- diskover-gource.sh shell script for gource
- gource section in config file
- can now exclude files with no extension using NULLEXT
- quiet cli option to run with no output
- new elasticsearch field 'indexing_thread' (used by gource)
- tested on es/kibana 5.4.2 and es client 5.4.0
- better handling of exclude lists. find command now looks for exact exclude directory name and no longer adds wildcards to name by default
- switched to version output of argparse
- better handling of exceptions and log output for any errors crawling files or directories
- indexing_date field now includes milliseconds
- cleaned up logging code
- -v or --version to display version, --verbose to run in verbose mode
- bug with using wildcards in exclude lists in config file
- support for Windows (requires pywin32 and cygwin)
- support for Python 3
- switched to scandir instead of os.listdir to process files in directory (faster)
- app fatal error if config file had no items in exclude lists
- bug reading config file for aws setting, user, password and indexname
- elasticsearch and requests to requirements.txt for pip install
- is_dupe field to elasticsearch index
- tagDupes function
- indexUpdate function for updating is_dupe field
- kibana saved search FileListIsDuplicate
- tagdupes runs after crawl if cli flag
- printStats function stats_type for different output
- diskover web interface and many features
- dupesindex cli flag to tagdupes
- duplicate files can now be tagged true or false in is_dupe field rather than creating separate index
- kibana dupes dashboard and all dupes visualizations to use is_dupe field
- indexCreateDupes function
- ES_INDEX_DUPES in Constants.php
- tag field to elasticsearch index
- diskover web tag manager
- diskover dark dashboard in Kibana
- bug where directories were getting indexed (not just files) when using -s 0 flag
- debug and version cli arguments
- banner and progressbar color to be easier to see on white terminals
- bug in calling printProgressBar when dircount is empty causing crash
- bug indexing many dupes causing Elasticsearch to hang
- check dircount before calling printProgressBar
- better keyboard interrupt handling
- printStats function
- capture exceptions of crawling directory and indexing files
- reduced size limits for finding dupes
- replaced printLog function with python logging module
- verbose logging uses logging module for debug output
- progressbar colors
- cleaned up various code
- times in ES indices are now stored as utc strings instead of unix time
- ES index mappings for times
- 0 byte empty files are no longer indexed when using flag -s 0
- crawl thread crashing when file/directory gets deleted during crawl
- removed code to rstrip newline characters on filename
- additional comments
- check for file/directory deletion during crawl
- new kibana dashboard
- improved crawlFiles function to speed up crawl times
- bug in finding duplicate files
- http auth for elasticsearch x-pack
- word cloud visualization to dupes dashboard
- check for required config items that are commented out
- added user and password to config file for http auth
- bug in finding duplicate files
- nodelete command line argument to not delete existing index
- dupes command line argument to create a duplicate files index
- replaced optparse module with argparse
- cleaned up parseCLIArgs function
- verbose now requires no integer value
- duplicate file index is now created with optional command line argument rather than at end of crawl
- progress bar not showing 100% when crawl finishes
- check for index name is diskover-
- crash caused from unicode decoding if integer value for owner/group
- check for running as root user
- keyboard interrupt
- index field "filehash": md5 hash of file metadata combining filename+filesize+mtime strings
- diskover_dupes-* index created for duplicate files
- filemeta is now stored in a dictionary instead of a string
- utf-8 decoded all strings stored in filemeta_dict
- file check now checks for symbolic links
- various code cleanup
- progress bar printing on new lines if queue was empty
- getting extension from some files caused unicode decode error
- debug cli option
- inode field type in the Elasticsearch index mapping from type int to long
- strip new line chars and not spaces from end of directory and file names
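
The "filehash" entry above (md5 hash of file metadata combining filename+filesize+mtime strings) and the later change to read files a few KB at a time for the md5 sum check can be illustrated with a minimal sketch. This is not diskover's actual code: the function names `file_hash` and `md5_in_chunks`, the `readsize_kb` parameter and its default, and the exact way the three strings are joined are assumptions made for illustration only.

```python
import hashlib
import os
import tempfile


def file_hash(filename, filesize, mtime):
    """Hypothetical helper: md5 over the filename, filesize and mtime strings joined together."""
    key = "{}{}{}".format(filename, filesize, mtime)
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def md5_in_chunks(path, readsize_kb=64):
    """Hypothetical helper: md5 of file contents, reading readsize_kb KB at a time
    so a large file is never loaded into memory in one piece."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(readsize_kb * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Small demo on a temporary file (assumed data, not from a real crawl)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 200000)
    try:
        st = os.stat(tmp.name)
        print(file_hash(os.path.basename(tmp.name), st.st_size, st.st_mtime))
        print(md5_in_chunks(tmp.name))
    finally:
        os.remove(tmp.name)
```

The metadata hash is cheap because it never opens the file, which is why it works as a first pass. The chunked content hash is the more expensive check, run only on candidates.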