
Merge branch 'master' into pypy
svenkreiss committed Jun 16, 2015
commit 1117449 (2 parents: 4bd0abe + 2e24e2c)
Showing 6 changed files with 24 additions and 20 deletions.
HISTORY.rst (5 changes: 4 additions & 1 deletion)

@@ -2,7 +2,10 @@
 Changelog
 =========
 
-* `master <https://github.com/svenkreiss/pysparkling/compare/v0.2.22...master>`_
+* `master <https://github.com/svenkreiss/pysparkling/compare/v0.2.23...master>`_
+* `v0.2.23 <https://github.com/svenkreiss/pysparkling/compare/v0.2.22...v0.2.23>`_ (2015-06-15)
+    * added RDD.randomSplit()
+    * saveAsTextFile() saves a single file if there is only one partition (and does not break the output into per-partition files)
 * `v0.2.22 <https://github.com/svenkreiss/pysparkling/compare/v0.2.21...v0.2.22>`_ (2015-06-12)
     * added Context.wholeTextFiles()
     * improved RDD.first() and RDD.take(n)
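For context on the two new v0.2.23 entries, here is a minimal usage sketch. It assumes RDD.randomSplit() mirrors the PySpark signature randomSplit(weights, seed), which the changelog does not spell out:

.. code-block:: python

    from pysparkling import Context

    rdd = Context().parallelize(range(100), 1)  # a single partition

    # split by weight; the seed keyword is an assumed, PySpark-style
    # parameter
    train, test = rdd.randomSplit([0.8, 0.2], seed=42)

    # with one partition, this now writes a single plain file instead
    # of a directory of part-* files
    rdd.saveAsTextFile('out.txt')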
README.rst (11 changes: 5 additions & 6 deletions)

@@ -103,20 +103,19 @@ in the local thread and is never serialized or deserialized.

 If you want to process the data in parallel, you can use the ``multiprocessing``
 module. Given the limitations of the default ``pickle`` serializer, you can
-specify to serialize all methods with ``dill`` instead. For example, a common
-instantiation with ``multiprocessing`` looks like this:
+specify to serialize all methods with ``cloudpickle`` instead. For example,
+a common instantiation with ``multiprocessing`` looks like this:
 
 .. code-block:: python
 
     c = Context(
         multiprocessing.Pool(4),
-        serializer=dill.dumps,
-        deserializer=dill.loads,
+        serializer=cloudpickle.dumps,
+        deserializer=pickle.loads,
     )
 
 This assumes that your data is serializable with ``pickle`` which is generally
-faster than ``dill``. You can also specify a custom serializer/deserializer
-for data.
+faster. You can also specify a custom serializer/deserializer for data.
 
 *API doc*: http://pysparkling.trivial.io/v0.2/api.html#pysparkling.Context

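A self-contained version of the snippet above, mirroring the usage in tests/test_multiprocessing.py (the ``if __name__`` guard is standard ``multiprocessing`` hygiene, not part of the original):

.. code-block:: python

    import pickle
    import multiprocessing

    import cloudpickle
    from pysparkling import Context

    if __name__ == '__main__':
        c = Context(
            multiprocessing.Pool(4),
            serializer=cloudpickle.dumps,
            deserializer=pickle.loads,
        )
        # the lambda is only serializable thanks to cloudpickle
        print(c.parallelize([1, 3, 4]).map(lambda x: x * x).collect())
        # -> [1, 9, 16]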
docs/sphinx/api.rst (11 changes: 5 additions & 6 deletions)

@@ -25,20 +25,19 @@ in the local thread and is never serialized or deserialized.

 If you want to process the data in parallel, you can use the ``multiprocessing``
 module. Given the limitations of the default ``pickle`` serializer, you can
-specify to serialize all methods with ``dill`` instead. For example, a common
-instantiation with ``multiprocessing`` looks like this:
+specify to serialize all methods with ``cloudpickle`` instead. For example,
+a common instantiation with ``multiprocessing`` looks like this:
 
 .. code-block:: python
 
     c = Context(
         multiprocessing.Pool(4),
-        serializer=dill.dumps,
-        deserializer=dill.loads,
+        serializer=cloudpickle.dumps,
+        deserializer=pickle.loads,
     )
 
 This assumes that your data is serializable with ``pickle`` which is generally
-faster than ``dill``. You can also specify a custom serializer/deserializer
-for data.
+faster. You can also specify a custom serializer/deserializer for data.
 
 .. autoclass:: pysparkling.Context
     :members:
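Why pair ``cloudpickle.dumps`` with plain ``pickle.loads``? cloudpickle writes standard pickle streams, so only the serializing side needs its extra support for lambdas and closures, while deserialization can use the faster built-in ``pickle``. A minimal round trip illustrating this, assuming only documented cloudpickle behavior:

.. code-block:: python

    import pickle
    import cloudpickle

    square = lambda x: x * x

    # pickle.dumps(square) would raise PicklingError: the default
    # serializer cannot handle lambdas, hence cloudpickle for methods
    payload = cloudpickle.dumps(square)

    # cloudpickle emits a standard pickle stream, so the built-in
    # pickle can read it back
    restored = pickle.loads(payload)
    assert restored(3) == 9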
pysparkling/__init__.py (2 changes: 1 addition & 1 deletion)
@@ -1,6 +1,6 @@
"""pysparkling module."""

__version__ = '0.2.22'
__version__ = '0.2.23'

from .exceptions import (FileAlreadyExistsException,
ConnectionException)
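A quick way to confirm the version bump after upgrading:

.. code-block:: python

    import pysparkling

    print(pysparkling.__version__)  # '0.2.23'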
setup.py (2 changes: 1 addition & 1 deletion)

@@ -36,7 +36,7 @@
     tests_require=[
         'nose>=1.3.4',
         'futures>=3.0.1',
-        'dill>=0.2.2',
+        'cloudpickle>=0.1.0',
     ],
     test_suite='nose.collector',

tests/test_multiprocessing.py (13 changes: 8 additions & 5 deletions)
@@ -1,14 +1,16 @@
-import dill
 import math
+import pickle
 import logging
+import cloudpickle
 import multiprocessing
 from concurrent import futures
 from pysparkling import Context
 
 
 def test_multiprocessing():
     p = multiprocessing.Pool(4)
-    c = Context(pool=p, serializer=dill.dumps, deserializer=dill.loads)
+    c = Context(pool=p, serializer=cloudpickle.dumps,
+                deserializer=pickle.loads)
     my_rdd = c.parallelize([1, 3, 4])
     r = my_rdd.map(lambda x: x*x).collect()
     print(r)
@@ -25,7 +27,8 @@ def test_concurrent():

 def test_first_mp():
     p = multiprocessing.Pool(4)
-    c = Context(pool=p, serializer=dill.dumps, deserializer=dill.loads)
+    c = Context(pool=p, serializer=cloudpickle.dumps,
+                deserializer=pickle.loads)
     my_rdd = c.parallelize([1, 2, 2, 4, 1, 3, 5, 9], 3)
     print(my_rdd.first())
     assert my_rdd.first() == 1
@@ -65,8 +68,8 @@ def test_lazy_execution_processpool():
     with futures.ProcessPoolExecutor(4) as p:
         r = Context(
             pool=p,
-            serializer=dill.dumps,
-            deserializer=dill.loads,
+            serializer=cloudpickle.dumps,
+            deserializer=pickle.loads,
         ).textFile('tests/test_multiprocessing.py')
         r = r.map(indent_line).cache()
         r.collect()
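For reference, the last hunk as a self-contained script. ``indent_line`` is defined elsewhere in the test file and not shown in this diff, so a stand-in definition is assumed here:

.. code-block:: python

    import pickle
    from concurrent import futures

    import cloudpickle
    from pysparkling import Context

    def indent_line(l):
        # stand-in for the helper defined in the test module
        return '--- ' + l

    if __name__ == '__main__':
        with futures.ProcessPoolExecutor(4) as p:
            r = Context(
                pool=p,
                serializer=cloudpickle.dumps,
                deserializer=pickle.loads,
            ).textFile('tests/test_multiprocessing.py')
            r = r.map(indent_line).cache()  # lazy; nothing runs yet
            r.collect()                     # triggers the computation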
