- master
- v0.6.0 (2019-07-13)
- Broadcast, Accumulator and AccumulatorParam by @alexprengere
- support for increasing partition numbers in coalesce and repartition by @tools4origins
- v0.5.0 (2019-05-03)
- fixes for HDFS thanks to @tools4origins
- fix for empty partitions by @tools4origins
- api fixes by @artem0 and @tools4origins
- various updates for streaming submodule
- various updates to lint and test system
- logging: converted some info messages to debug
- ... documentation for some point releases is missing
- v0.4.1 (2017-05-27)
- retries for failed partitions
- improve `pysparkling.streaming.DStream`
- updates to docs
- v0.4.0 (2017-03-11)
- major addition: `pysparkling.streaming`
- updates to `RDD.sample()`
- reorganized `scripts` and `tests`
- added `RDD.partitionBy()`
- minor updates to `pysparkling.fileio`
- v0.3.23 (2016-08-06)
- small improvements to fileio and better documentation
- v0.3.22 (2016-06-18)
- reimplement RDD.groupByKey()
- clean up of docstrings
- v0.3.21 (2016-05-31)
- faster text file reading by using `io.TextIOWrapper` for decoding
- v0.3.20 (2016-05-01)
- Google Storage file system (using `gs://`)
- dependencies: `requests` and `boto` are not optional anymore
- `aggregateByKey()` and `foldByKey()` return RDDs
- Python 3: use `sys.maxsize` instead of `sys.maxint`
- flake8 linting
- v0.3.19 (2016-03-06)
- removed use of `itertools.tee()` and replaced with clear ownership of partitions and partition data
- replace some remaining use of `str()` with `format()`
- bugfix for `RDD.groupByKey()` and `RDD.reduceByKey()` for non-hashable values by @pganssle
- small updates to docs and their build process
- v0.3.18 (2016-02-13)
- bring docs and GitHub releases back in sync
- ... many updates.
- v0.2.28 (2015-07-03)
- implement `RDD.sortBy()` and `RDD.sortByKey()`
- additional unit tests
- v0.2.24 (2015-06-16)
- replace dill with cloudpickle in docs and test
- add tests with pypy and pypy3
- v0.2.23 (2015-06-15)
- added RDD.randomSplit()
- saveAsTextFile() saves single file if there is only one partition (and does not break it out into partitions)
- v0.2.22 (2015-06-12)
- added Context.wholeTextFiles()
- improved RDD.first() and RDD.take(n)
- added fileio.TextFile
- v0.2.21 (2015-06-07)
- added doc strings and created Sphinx documentation
- implemented allowLocal in `Context.runJob()`
- v0.2.19 (2015-06-04)
- new IPython demo notebook at [`docs/demo.ipynb`](https://github.com/svenkreiss/pysparkling/blob/master/docs/demo.ipynb)
- `parallelize()` can take an iterator (used in `zip()` now for lazy loading)
- v0.2.16 (2015-05-31)
- add `values()`, `union()`, `zip()`, `zipWithUniqueId()`, `toLocalIterator()`
- improve `aggregate()` and `fold()`
- add `stats()`, `sampleStdev()`, `sampleVariance()`, `stdev()`, `variance()`
- make `cache()` and `persist()` do something useful
- better partitioning in `parallelize()`
- logo
- fix `foreach()`
- v0.2.10 (2015-05-27)
- fix `fileio.codec` import
- support `http://`
- v0.2.8 (2015-05-26)
- parallelized text file reading (and made it lazy)
- parallelized take() and takeSample() that only compute the required data partitions
- add example: access Human Microbiome Project
- v0.2.6 (2015-05-21)
- factor out `fileio.fs` and `fileio.codec` modules
- merge `WholeFile` into `File`
- improved handling of compressed files (backwards incompatible)
- `fileio` interface changed to `dump()` and `load()` methods. Added `make_public()` for S3.
- factor file related operations into `fileio` submodule
- v0.2.2 (2015-05-18)
- compressions: `.gz`, `.bz2`
- v0.2.0 (2015-05-17)
- proper handling of partitions
- custom serializers, deserializers (for functions and data separately)
- more tests for parallelization options
- execution of distributed jobs is such that a chain of `map()` operations gets executed on workers without sending intermediate results back to the master
- a few more methods for RDDs implemented
- v0.1.1 (2015-05-12)
- implemented a few more RDD methods
- changed handling of context in RDD
- v0.1.0 (2015-05-09)