Merge branch 'dev-minor' of github.com:SciPhi-AI/R2R into Nolan/Intellisense
NolanTrem committed Oct 20, 2024
2 parents 7a463fb + 65a3a51 commit 0daab42
Showing 47 changed files with 1,540 additions and 68 deletions.
65 changes: 43 additions & 22 deletions .github/actions/setup-postgres-ext/action.yml
@@ -32,39 +32,60 @@ runs:
- name: Setup PostgreSQL on Windows
if: inputs.os == 'windows-latest'
shell: pwsh
shell: cmd
run: |
choco install postgresql15 --params '/Password:postgres' --force
$env:PATH += ";C:\Program Files\PostgreSQL\15\bin"
$env:PGPASSWORD = 'postgres'
echo Starting PostgreSQL setup and pgvector installation...
echo Installing PostgreSQL...
choco install postgresql15 --params "/Password:postgres" --force
echo Updating PATH and setting PGPASSWORD...
set PATH=%PATH%;C:\Program Files\PostgreSQL\15\bin
set PGPASSWORD=postgres
echo PATH updated and PGPASSWORD set.
echo Altering PostgreSQL user password...
psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
echo PostgreSQL user password altered.
# Install Visual Studio Build Tools
echo Installing Visual Studio Build Tools...
choco install visualstudio2022buildtools --package-parameters "--add Microsoft.VisualStudio.Workload.VCTools --includeRecommended --passive --norestart"
echo Visual Studio Build Tools installed.
# Set up environment for building pgvector
$vcvars64Path = "C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
cmd.exe /c "call `"$vcvars64Path`" && set > %temp%\vcvars.txt"
Get-Content "$env:temp\vcvars.txt" | Foreach-Object {
if ($_ -match "^(.*?)=(.*)$") {
Set-Content "env:\$($matches[1])" $matches[2]
}
}
# Clone and build pgvector
$env:PGROOT = "C:\Program Files\PostgreSQL\15"
Set-Location -Path $env:TEMP
echo Setting up Visual Studio environment...
call "C:\Program Files\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
echo Visual Studio environment set up.
echo Cloning and building pgvector...
set PGROOT=C:\Program Files\PostgreSQL\15
cd /d %TEMP%
git clone --branch v0.7.4 https://github.com/pgvector/pgvector.git
Set-Location -Path "$env:TEMP\pgvector"
cd pgvector
echo pgvector cloned.
echo Creating vector extension...
psql -U postgres -c "CREATE EXTENSION vector;"
echo Vector extension created.
echo Building pgvector...
nmake /F Makefile.win
echo pgvector built.
echo Installing pgvector...
nmake /F Makefile.win install
echo pgvector installed.
psql -U postgres -c "CREATE EXTENSION vector;"
echo Setting max_connections to 1024...
echo max_connections = 1024 >> "C:\Program Files\PostgreSQL\15\data\postgresql.conf"
echo max_connections set.
# Set max_connections to 1024
Add-Content -Path "C:\Program Files\PostgreSQL\15\data\postgresql.conf" -Value "max_connections = 1024"
Restart-Service postgresql-x64-15
echo Restarting PostgreSQL service...
net stop postgresql-x64-15
net start postgresql-x64-15
echo PostgreSQL service restarted.
echo Setup complete!
- name: Setup PostgreSQL on macOS
if: inputs.os == 'macos-latest'
9 changes: 5 additions & 4 deletions .github/actions/setup-python-full/action.yml
@@ -26,8 +26,9 @@ runs:
- name: Install Poetry and dependencies on Windows
if: inputs.os == 'windows-latest'
shell: pwsh
shell: cmd
run: |
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
$env:PATH += ";$env:USERPROFILE\AppData\Roaming\Python\Scripts"
cd py; poetry install -E core -E ingestion-bundle
python -c "import urllib.request; print(urllib.request.urlopen('https://install.python-poetry.org').read().decode())" > install-poetry.py
python install-poetry.py
echo %USERPROFILE%\AppData\Roaming\Python\Scripts >> %GITHUB_PATH%
cd py && poetry install -E core -E ingestion-bundle
9 changes: 5 additions & 4 deletions .github/actions/setup-python-light/action.yml
@@ -21,8 +21,9 @@ runs:
- name: Install Poetry and dependencies on Windows
if: inputs.os == 'windows-latest'
shell: pwsh
shell: cmd
run: |
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
$env:PATH += ";$env:USERPROFILE\AppData\Roaming\Python\Scripts"
cd py; poetry install -E core -E ingestion-bundle
python -c "import urllib.request; print(urllib.request.urlopen('https://install.python-poetry.org').read().decode())" > install-poetry.py
python install-poetry.py
echo %USERPROFILE%\AppData\Roaming\Python\Scripts >> %GITHUB_PATH%
cd py && poetry install -E core -E ingestion-bundle
2 changes: 1 addition & 1 deletion .github/workflows/r2r-light-py-integration-tests.yml
@@ -19,7 +19,7 @@ jobs:

strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
os: [windows-latest]
test_category:
- cli-ingestion
- cli-retrieval
1 change: 1 addition & 0 deletions py/README.md
@@ -4,6 +4,7 @@
<a href="https://github.com/SciPhi-AI"><img src="https://img.shields.io/github/stars/SciPhi-AI/R2R" alt="Github Stars"></a>
<a href="https://github.com/SciPhi-AI/R2R/pulse"><img src="https://img.shields.io/github/commit-activity/w/SciPhi-AI/R2R" alt="Commits-per-week"></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-purple.svg" alt="License: MIT"></a>
<a href="https://gurubase.io/g/r2r"><img src="https://img.shields.io/badge/Gurubase-Ask%20R2R%20Guru-006BFF" alt="Gurubase: R2R Guru"></a>
</p>

<img width="1041" alt="r2r" src="https://github.com/user-attachments/assets/b6ee6a78-5d37-496d-ae10-ce18eee7a1d6">
54 changes: 54 additions & 0 deletions py/cli/commands/kg.py
@@ -61,6 +61,60 @@ async def create_graph(
click.echo(json.dumps(response, indent=2))


@cli.command()
@click.option(
    "--collection-id",
    required=False,
    help="Collection ID to deduplicate entities for.",
)
@click.option(
    "--run",
    is_flag=True,
    help="Run the deduplication process.",
)
@click.option(
    "--force-deduplication",
    is_flag=True,
    help="Force the deduplication process.",
)
@click.option(
    "--deduplication-settings",
    required=False,
    help="Settings for the deduplication process.",
)
@pass_context
def deduplicate_entities(
    ctx, collection_id, run, force_deduplication, deduplication_settings
):
    """
    Deduplicate entities in the knowledge graph.
    """
    client = ctx.obj

    if deduplication_settings:
        try:
            deduplication_settings = json.loads(deduplication_settings)
        except json.JSONDecodeError:
            click.echo(
                "Error: deduplication-settings must be a valid JSON string"
            )
            return
    else:
        deduplication_settings = {}

    # Without --run, the command only estimates the work.
    run_type = "run" if run else "estimate"

    if force_deduplication:
        # Merge the flag into any user-supplied settings instead of replacing them.
        deduplication_settings["force_deduplication"] = True

    with timer():
        response = client.deduplicate_entities(
            collection_id, run_type, deduplication_settings
        )

    click.echo(json.dumps(response, indent=2))


@cli.command()
@click.option(
"--collection-id",
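The argument handling in the new `deduplicate_entities` command can be sketched in isolation. This is a minimal illustration, not the project's code: `build_deduplication_request` is a hypothetical helper name, and the sketch assumes the force flag should be merged into any supplied settings rather than replacing them.

```python
import json


def build_deduplication_request(run, force_deduplication, raw_settings=None):
    """Hypothetical helper mirroring the CLI's option handling.

    Returns the (run_type, settings) pair that would be passed to the client.
    """
    if raw_settings:
        # json.loads raises json.JSONDecodeError on malformed input;
        # the CLI command catches that and prints an error instead.
        settings = json.loads(raw_settings)
    else:
        settings = {}

    # Without --run the command only estimates the work.
    run_type = "run" if run else "estimate"

    if force_deduplication:
        # Assumption: keep user-supplied keys and add the flag alongside them.
        settings["force_deduplication"] = True

    return run_type, settings


if __name__ == "__main__":
    print(build_deduplication_request(True, True, '{"batch_size": 10}'))
```

Here `batch_size` is only a placeholder key for the demonstration; the diff does not list the settings the server accepts.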
3 changes: 2 additions & 1 deletion py/cli/main.py
@@ -31,9 +31,10 @@ def add_command_with_telemetry(command):
add_command_with_telemetry(management.documents_overview)
add_command_with_telemetry(management.document_chunks)

# Restructure
# Knowledge Graph
add_command_with_telemetry(kg.create_graph)
add_command_with_telemetry(kg.enrich_graph)
add_command_with_telemetry(kg.deduplicate_entities)

# Retrieval
add_command_with_telemetry(retrieval.search)
6 changes: 5 additions & 1 deletion py/core/base/abstractions/__init__.py
@@ -31,6 +31,8 @@
from shared.abstractions.kg import (
KGCreationSettings,
KGEnrichmentSettings,
KGEntityDeduplicationSettings,
KGEntityDeduplicationType,
KGRunType,
)
from shared.abstractions.llm import (
@@ -115,9 +117,11 @@
"VectorSearchResult",
"VectorSearchSettings",
"HybridSearchSettings",
# Restructure abstractions
# KG abstractions
"KGCreationSettings",
"KGEnrichmentSettings",
"KGEntityDeduplicationSettings",
"KGEntityDeduplicationType",
"KGRunType",
# User abstractions
"Token",
6 changes: 5 additions & 1 deletion py/core/base/api/models/__init__.py
@@ -16,11 +16,13 @@
from shared.api.models.kg.responses import (
KGCreationResponse,
KGEnrichmentResponse,
KGEntityDeduplicationResponse,
WrappedKGCommunitiesResponse,
WrappedKGCreationResponse,
WrappedKGEnrichmentResponse,
WrappedKGEntitiesResponse,
WrappedKGTriplesResponse,
WrappedKGEntityDeduplicationResponse,
)
from shared.api.models.management.responses import (
AnalyticsResponse,
@@ -78,11 +80,13 @@
"WrappedUpdateResponse",
"CreateVectorIndexResponse",
"WrappedCreateVectorIndexResponse",
# Restructure Responses
# Knowledge Graph Responses
"KGCreationResponse",
"WrappedKGCreationResponse",
"KGEnrichmentResponse",
"WrappedKGEnrichmentResponse",
"KGEntityDeduplicationResponse",
"WrappedKGEntityDeduplicationResponse",
# Management Responses
"PromptResponse",
"ServerStats",
5 changes: 4 additions & 1 deletion py/core/base/providers/ingestion.py
@@ -1,15 +1,18 @@
import logging
from abc import ABC
from enum import Enum

from .base import Provider, ProviderConfig
from shared.abstractions.ingestion import ChunkEnrichmentSettings

logger = logging.getLogger()


class IngestionConfig(ProviderConfig):
provider: str = "r2r"
excluded_parsers: list[str] = ["mp4"]
chunk_enrichment_settings: ChunkEnrichmentSettings = (
ChunkEnrichmentSettings()
)
extra_parsers: dict[str, str] = {}

@property
14 changes: 12 additions & 2 deletions py/core/base/providers/kg.py
@@ -14,6 +14,7 @@
KGSearchSettings,
RelationshipType,
Triple,
KGEntityDeduplicationSettings,
)
from .base import ProviderConfig

@@ -34,6 +35,9 @@ class KGConfig(ProviderConfig):
kg_store_path: Optional[str] = None
kg_enrichment_settings: KGEnrichmentSettings = KGEnrichmentSettings()
kg_creation_settings: KGCreationSettings = KGCreationSettings()
kg_entity_deduplication_settings: KGEntityDeduplicationSettings = (
KGEntityDeduplicationSettings()
)
kg_search_settings: KGSearchSettings = KGSearchSettings()

def validate_config(self) -> None:
@@ -104,9 +108,10 @@ async def get_existing_entity_extraction_ids(
async def get_entities(
self,
collection_id: UUID,
offset: int,
limit: int,
offset: int = 0,
limit: int = -1,
entity_ids: list[str] | None = None,
entity_names: list[str] | None = None,
entity_table_name: str = "entity_embedding",
) -> dict:
"""Abstract method to get entities."""
@@ -259,6 +264,11 @@ async def get_community_count(self, collection_id: UUID) -> int:
"""Abstract method to get the community count."""
pass

@abstractmethod
async def update_entity_descriptions(self, entities: list[Entity]):
"""Abstract method to update entity descriptions."""
pass


def escape_braces(s: str) -> str:
"""
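The new abstract `update_entity_descriptions` method obliges every concrete KG provider to implement it. A minimal sketch of that contract follows; the `Entity` dataclass, `KGProviderSketch`, and `InMemoryKGProvider` are hypothetical stand-ins for illustration, not the project's classes.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for the project's Entity model."""

    name: str
    description: str = ""


class KGProviderSketch(ABC):
    """Hypothetical base class carrying only the newly added abstract method."""

    @abstractmethod
    async def update_entity_descriptions(self, entities: list[Entity]):
        """Abstract method to update entity descriptions."""


class InMemoryKGProvider(KGProviderSketch):
    """Toy implementation that stores descriptions in a dict keyed by name."""

    def __init__(self):
        self.descriptions: dict[str, str] = {}

    async def update_entity_descriptions(self, entities: list[Entity]):
        for entity in entities:
            self.descriptions[entity.name] = entity.description


provider = InMemoryKGProvider()
asyncio.run(
    provider.update_entity_descriptions(
        [Entity("Aristotle", "Ancient Greek philosopher")]
    )
)
```

A subclass that omits the method cannot be instantiated, which is how the abstract declaration forces each provider to supply an implementation.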
13 changes: 11 additions & 2 deletions py/core/configs/full.toml
@@ -6,11 +6,20 @@ new_after_n_chars = 512
max_characters = 1_024
combine_under_n_chars = 128
overlap = 256

[ingestion.extra_parsers]
pdf = "zerox"
pdf = "zerox"

[ingestion.chunk_enrichment_settings]
strategies = ["semantic", "neighborhood"]
forward_chunks = 3
backward_chunks = 3
semantic_neighbors = 10
semantic_similarity_threshold = 0.7
generation_config = { model = "azure/gpt-4o-mini" }

[orchestration]
provider = "hatchet"
kg_creation_concurrency_limit = 32
kg_creation_concurrency_lipmit = 32
ingestion_concurrency_limit = 128
kg_enrichment_concurrency_limit = 8
7 changes: 7 additions & 0 deletions py/core/examples/data_dedup/a1.txt
@@ -0,0 +1,7 @@
Aristotle[A] (Greek: Ἀριστοτέλης Aristotélēs, pronounced [aristotélɛːs]; 384–322 BC) was an Ancient Greek philosopher and polymath. His writings cover a broad range of subjects spanning the natural sciences, philosophy, linguistics, economics, politics, psychology, and the arts. As the founder of the Peripatetic school of philosophy in the Lyceum in Athens, he began the wider Aristotelian tradition that followed, which set the groundwork for the development of modern science.

Little is known about Aristotle's life. He was born in the city of Stagira in northern Greece during the Classical period. His father, Nicomachus, died when Aristotle was a child, and he was brought up by a guardian. At 17 or 18, he joined Plato's Academy in Athens and remained there until the age of 37 (c. 347 BC). Shortly after Plato died, Aristotle left Athens and, at the request of Philip II of Macedon, tutored his son Alexander the Great beginning in 343 BC. He established a library in the Lyceum, which helped him to produce many of his hundreds of books on papyrus scrolls.

Though Aristotle wrote many elegant treatises and dialogues for publication, only around a third of his original output has survived, none of it intended for publication. Aristotle provided a complex synthesis of the various philosophies existing prior to him. His teachings and methods of inquiry have had a significant impact across the world, and remain a subject of contemporary philosophical discussion.

Aristotle's views profoundly shaped medieval scholarship. The influence of his physical science extended from late antiquity and the Early Middle Ages into the Renaissance, and was not replaced systematically until the Enlightenment and theories such as classical mechanics were developed. He influenced Judeo-Islamic philosophies during the Middle Ages, as well as Christian theology, especially the Neoplatonism of the Early Church and the scholastic tradition of the Catholic Church.
31 changes: 31 additions & 0 deletions py/core/examples/data_dedup/a10.txt
@@ -0,0 +1,31 @@
Newton's "forced" motion corresponds to Aristotle's "violent" motion with its external agent, but Aristotle's assumption that the agent's effect stops immediately it stops acting (e.g., the ball leaves the thrower's hand) has awkward consequences: he has to suppose that surrounding fluid helps to push the ball along to make it continue to rise even though the hand is no longer acting on it, resulting in the Medieval theory of impetus.[45]

Four causes
Main article: Four causes

Aristotle argued by analogy with woodwork that a thing takes its form from four causes: in the case of a table, the wood used (material cause), its design (formal cause), the tools and techniques used (efficient cause), and its decorative or practical purpose (final cause).[47]
Aristotle suggested that the reason for anything coming about can be attributed to four different types of simultaneously active factors. His term aitia is traditionally translated as "cause", but it does not always refer to temporal sequence; it might be better translated as "explanation", but the traditional rendering will be employed here.[48][49]

Material cause describes the material out of which something is composed. Thus the material cause of a table is wood. It is not about action. It does not mean that one domino knocks over another domino.[48]
The formal cause is its form, i.e., the arrangement of that matter. It tells one what a thing is, that a thing is determined by the definition, form, pattern, essence, whole, synthesis or archetype. It embraces the account of causes in terms of fundamental principles or general laws, as the whole (i.e., macrostructure) is the cause of its parts, a relationship known as the whole-part causation. Plainly put, the formal cause is the idea in the mind of the sculptor that brings the sculpture into being. A simple example of the formal cause is the mental image or idea that allows an artist, architect, or engineer to create a drawing.[48]
The efficient cause is "the primary source", or that from which the change under consideration proceeds. It identifies 'what makes of what is made and what causes change of what is changed' and so suggests all sorts of agents, non-living or living, acting as the sources of change or movement or rest. Representing the current understanding of causality as the relation of cause and effect, this covers the modern definitions of "cause" as either the agent or agency or particular events or states of affairs. In the case of two dominoes, when the first is knocked over it causes the second also to fall over.[48] In the case of animals, this agency is a combination of how it develops from the egg, and how its body functions.[50]
The final cause (telos) is its purpose, the reason why a thing exists or is done, including both purposeful and instrumental actions and activities. The final cause is the purpose or function that something is supposed to serve. This covers modern ideas of motivating causes, such as volition.[48] In the case of living things, it implies adaptation to a particular way of life.[50]
Optics
Further information: History of optics
Aristotle describes experiments in optics using a camera obscura in Problems, book 15. The apparatus consisted of a dark chamber with a small aperture that let light in. With it, he saw that whatever shape he made the hole, the sun's image always remained circular. He also noted that increasing the distance between the aperture and the image surface magnified the image.[51]

Chance and spontaneity
Further information: Accident (philosophy)
According to Aristotle, spontaneity and chance are causes of some things, distinguishable from other types of cause such as simple necessity. Chance as an incidental cause lies in the realm of accidental things, "from what is spontaneous". There is also a more specific kind of chance, which Aristotle names "luck", that only applies to people's moral choices.[52][53]

Astronomy
Further information: History of astronomy
In astronomy, Aristotle refuted Democritus's claim that the Milky Way was made up of "those stars which are shaded by the earth from the sun's rays," pointing out partly correctly that if "the size of the sun is greater than that of the earth and the distance of the stars from the earth many times greater than that of the sun, then... the sun shines on all the stars and the earth screens none of them."[54] He also wrote descriptions of comets, including the Great Comet of 371 BC.[55]

Geology and natural sciences
Further information: History of geology

Aristotle noted that the ground level of the Aeolian islands changed before a volcanic eruption.
Aristotle was one of the first people to record any geological observations. He stated that geological change was too slow to be observed in one person's lifetime.[56][57] The geologist Charles Lyell noted that Aristotle described such change, including "lakes that had dried up" and "deserts that had become watered by rivers", giving as examples the growth of the Nile delta since the time of Homer, and "the upheaving of one of the Aeolian islands, previous to a volcanic eruption."[58]

Meteorologica lends its name to the modern study of meteorology, but its modern usage diverges from the content of Aristotle's ancient treatise on meteors. The ancient Greeks did use the term for a range of atmospheric phenomena, but also for earthquakes and volcanic eruptions. Aristotle proposed that the cause of earthquakes was a gas or vapor (anathymiaseis) that was trapped inside the earth and trying to escape, following other Greek authors Anaxagoras, Empedocles and Democritus.[59]