Python 3.5 Support, Sampler Pipelining, Finer Control of Random State, New Corporate Sponsor

@PetrochukM released this on 04 Nov 05:16 · 59 commits to master since this release · 41fe6cc

Major Updates

  • Updated my README emoji game to be more ambiguous while maintaining a fun and heartwarming vibe. 🐕
  • Support for Python 3.5
  • Extensive rewrite of README to focus on new users and building an NLP pipeline.
  • Support for PyTorch 1.2
  • Added torchnlp.random for finer-grained control of random state, building on PyTorch's fork_rng. This module controls the random state of torch, numpy, and random. For example:
import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
  • Refactored torchnlp.samplers to enable pipelining. For example:
from torchnlp.samplers import DeterministicSampler
from torchnlp.samplers import BalancedSampler

data = ['a', 'b', 'c'] + ['c'] * 100
sampler = BalancedSampler(data, num_samples=3)
sampler = DeterministicSampler(sampler, random_seed=12)
print([data[i] for i in sampler])  # ['c', 'b', 'a']
  • Added torchnlp.samplers.balanced_sampler for balanced sampling, extending PyTorch's WeightedRandomSampler.
  • Added torchnlp.samplers.deterministic_sampler for deterministic sampling based on torchnlp.random.
  • Added torchnlp.samplers.distributed_batch_sampler for distributed batch sampling.
  • Added torchnlp.samplers.oom_batch_sampler to sample the largest batches first so that any out-of-memory error surfaces early.
  • Added torchnlp.utils.lengths_to_mask to help create masks from a batch of sequences (see the sketch after this list).
  • Added torchnlp.utils.get_total_parameters to measure the number of parameters in a model (see the sketch after this list).
  • Added torchnlp.utils.get_tensors to measure the size of an object in the number of tensor elements. This is useful for dynamic batch sizing and for torchnlp.samplers.oom_batch_sampler. For example:
import torch

from torchnlp.utils import get_tensors

random_object_ = tuple([{'t': torch.tensor([1, 2])}, torch.tensor([2, 3])])
tensors = get_tensors(random_object_)
assert len(tensors) == 2
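
A minimal sketch of torchnlp.utils.lengths_to_mask; the example lengths and the boolean output shown in the comments are illustrative assumptions, not verbatim library output:
from torchnlp.utils import lengths_to_mask

mask = lengths_to_mask([1, 2, 3])  # Hypothetical batch of three sequences with lengths 1, 2, and 3.
print(mask)
# Expected (under the assumptions above) to resemble:
# tensor([[ True, False, False],
#         [ True,  True, False],
#         [ True,  True,  True]])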
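
Similarly, a small sketch of torchnlp.utils.get_total_parameters, assuming it accepts any torch.nn.Module and returns an integer parameter count; the linear layer is only an example:
import torch

from torchnlp.utils import get_total_parameters

model = torch.nn.Linear(10, 5)  # Example model: 10 * 5 weights + 5 biases = 55 parameters.
print(get_total_parameters(model))  # Assumed to print 55.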

Minor Updates

  • Fixed snli example (#84)
  • Updated .gitignore to support Python's virtual environments (#84)
  • Removed the requests and pandas dependencies, leaving only two remaining dependencies. This is useful for production environments. (#84)
  • Added LazyLoader to reduce dependency requirements. (4e84780)
  • Removed the unused torchnlp.datasets.Dataset class in favor of plain Python lists of dictionaries and pandas. (#84)
  • Added support for downloading tar.gz files and unpacking them faster. (eb61fee)
  • Renamed itos and stoi to index_to_token and token_to_index, respectively. (#84)
  • Fixed batch_encode, batch_decode, and enforce_reversible for torchnlp.encoders.text (#69)
  • Fixed FastText vector downloads (#72)
  • Fixed documentation for LockedDropout (#73)
  • Fixed a bug in weight_drop (#76)
  • stack_and_pad_tensors now returns a named tuple for readability (#84); see the sketch after this list.
  • Added torchnlp.utils.split_list in favor of torchnlp.utils.resplit_datasets, enabled by the modularity of torchnlp.random (#84); see the sketch after this list.
  • Deprecated torchnlp.utils.datasets_iterator in favor of Python's itertools.chain. (#84)
  • Deprecated torchnlp.utils.shuffle in favor of torchnlp.random. (#84)
  • Added support for encoding larger datasets following the fix for issue #85.
  • Added torchnlp.samplers.repeat_sampler following up on this issue: pytorch/pytorch#15849
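
A hedged sketch of the stack_and_pad_tensors change; the import path and the tensor/lengths fields of the returned named tuple are assumptions based on the description above:
import torch

from torchnlp.encoders.text import stack_and_pad_tensors

batch = stack_and_pad_tensors([torch.tensor([1, 2, 3]), torch.tensor([4, 5])])
print(batch.tensor)   # Assumed field: the stacked, zero-padded tensor of shape (2, 3).
print(batch.lengths)  # Assumed field: the original sequence lengths, e.g. tensor([3, 2]).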
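
And a sketch of torchnlp.utils.split_list; the splits keyword and the ratio-based splitting shown here are assumptions:
from torchnlp.utils import split_list

data = list(range(10))
train, dev, test = split_list(data, splits=(0.6, 0.2, 0.2))  # Assumed signature: a list plus a tuple of ratios.
print(len(train), len(dev), len(test))  # Expected to print roughly 6 2 2 under the assumed split.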