GitHub - Intsights/PyDomainExtractor: A blazingly fast domain extraction library written in Rust

About The Project

PyDomainExtractor is a Python library designed to parse domain names quickly. In order to achieve the highest performance possible, the library was written in Rust.

Built With

Performance

Extract From Domain

Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th 2022).

Library	Function	Time
PyDomainExtractor	pydomainextractor.extract	1.50s
publicsuffix2	publicsuffix2.get_sld	9.92s
tldextract	__call__	29.23s
tld	tld.parse_tld	34.48s

Extract From URL

Tests were run on a file containing 1 million random URLs (Mar. 13th 2022).

Library	Function	Time
PyDomainExtractor	pydomainextractor.extract_from_url	2.24s
publicsuffix2	publicsuffix2.get_sld	10.84s
tldextract	__call__	36.04s
tld	tld.parse_tld	57.87s
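
Timings like the ones above can be approximated with a small wall-clock harness. The following is a minimal sketch, not the project's own benchmark script: `time_calls` is a hypothetical helper, and `str.lower` is only a stand-in workload so the snippet runs without any third-party package installed.

```python
import time
from typing import Callable, Iterable


def time_calls(func: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling func once per input."""
    # Materialize the inputs first so that reading them is not timed.
    items = list(inputs)
    start = time.perf_counter()
    for item in items:
        func(item)
    return time.perf_counter() - start


# Stand-in workload (assumption). To time the library itself, pass
# pydomainextractor.DomainExtractor().extract instead of str.lower.
elapsed = time_calls(str.lower, ['google.com', 'example.org'] * 1000)
print(f'{elapsed:.4f}s')
```

Absolute numbers depend on the machine and on the input file, so only the ratios between libraries are meaningful when comparing runs.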

Installation

pip3 install PyDomainExtractor

Usage

Extraction

import pydomainextractor


# Loads the currently supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract('google.com')
>>> {
>>>     'subdomain': '',
>>>     'domain': 'google',
>>>     'suffix': 'com'
>>> }

# Loads custom suffix list data. The data should follow the PublicSuffixList format.
domain_extractor = pydomainextractor.DomainExtractor(
    'tld\n'
    'custom.tld\n'
)

domain_extractor.extract('google.com')
>>> {
>>>     'subdomain': 'google',
>>>     'domain': 'com',
>>>     'suffix': ''
>>> }

domain_extractor.extract('google.custom.tld')
>>> {
>>>     'subdomain': '',
>>>     'domain': 'google',
>>>     'suffix': 'custom.tld'
>>> }
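
The dictionaries returned above can be joined back into a registrable domain. This is a minimal sketch that works only on the result shape shown in the examples (`subdomain`, `domain`, `suffix` keys); `registered_domain` is a hypothetical helper and not part of the library.

```python
def registered_domain(parts: dict) -> str:
    """Join the 'domain' and 'suffix' fields, skipping empty ones."""
    return '.'.join(p for p in (parts['domain'], parts['suffix']) if p)


# Result shapes copied from the extraction examples above.
default_result = {'subdomain': '', 'domain': 'google', 'suffix': 'com'}
custom_result = {'subdomain': '', 'domain': 'google', 'suffix': 'custom.tld'}
no_suffix_result = {'subdomain': 'google', 'domain': 'com', 'suffix': ''}

print(registered_domain(default_result))    # google.com
print(registered_domain(custom_result))     # google.custom.tld
print(registered_domain(no_suffix_result))  # com
```

Skipping empty fields avoids a trailing dot when the suffix list does not cover the input, as in the `google.com` case under the custom list.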

URL Extraction

import pydomainextractor


# Loads the currently supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.extract_from_url('http://google.com/')
>>> {
>>>     'subdomain': '',
>>>     'domain': 'google',
>>>     'suffix': 'com'
>>> }

Validation

import pydomainextractor


# Loads the currently supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.is_valid_domain('google.com')
>>> True

domain_extractor.is_valid_domain('domain.اتصالات')
>>> True

domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
>>> True

domain_extractor.is_valid_domain('domain-.com')
>>> False

domain_extractor.is_valid_domain('-sub.domain.com')
>>> False

domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
>>> False

TLDs List

import pydomainextractor


# Loads the currently supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()

domain_extractor.get_tld_list()
>>> [
>>>     'bostik',
>>>     'backyards.banzaicloud.io',
>>>     'biz.bb',
>>>     ...
>>> ]
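
As the sample output shows, the list mixes single-label suffixes with multi-label ones. A minimal sketch of grouping such a list by label count follows; `group_by_label_count` is a hypothetical helper, and the input is the three entries shown above rather than a live call to the library.

```python
from collections import defaultdict


def group_by_label_count(suffixes):
    """Map the number of dot-separated labels to the suffixes of that size."""
    groups = defaultdict(list)
    for suffix in suffixes:
        groups[suffix.count('.') + 1].append(suffix)
    return dict(groups)


# Entries copied from the sample output above.
sample = ['bostik', 'backyards.banzaicloud.io', 'biz.bb']
groups = group_by_label_count(sample)
print(groups)
```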

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Gal Ben David - [email protected]

Project Link: https://github.com/Intsights/PyDomainExtractor
