Installation
To use the ContentCleaner, you first need to install the purecpp_extract Python package.
Before you begin, ensure your environment meets the following requirements:
- Python 3.9, 3.10, or 3.11: PureCPP supports these Python versions.
- Linux/WSL support: The library is fully compatible with Linux-based systems and Windows Subsystem for Linux (WSL).
- pip: Ensure pip is installed and updated to the latest version.
Default Cleaning Patterns
When you initialize the ContentCleaner without passing any patterns, it uses the following default regex patterns:

| Name | Regex Pattern | Description |
|---|---|---|
| Extra Spaces | \s+ | Replaces each run of whitespace (spaces, tabs, newlines) with a single space. |
| Non-ASCII Characters | [^\x00-\x7F]+ | Removes all non-ASCII characters. |
| Symbols at Line Edges | ^\W+\|\W+$ | Removes symbols at the start/end of lines. |
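
To make the effect of these defaults concrete, the following is a minimal sketch that reproduces the three patterns with Python's standard `re` module. It is not the ContentCleaner implementation: the `clean_text` function and the `DEFAULT_PATTERNS` list are hypothetical names, the order of application (table order) is an assumption, and the third pattern is assumed to be `^\W+|\W+$`.

```python
import re

# Hypothetical stand-ins for the default patterns listed in the table.
# Each entry is (compiled regex, replacement string).
DEFAULT_PATTERNS = [
    (re.compile(r"\s+"), " "),           # Extra Spaces: collapse whitespace runs
    (re.compile(r"[^\x00-\x7F]+"), ""),  # Non-ASCII Characters: remove them
    (re.compile(r"^\W+|\W+$"), ""),      # Symbols at Line Edges: trim both ends
]


def clean_text(text: str) -> str:
    """Apply each pattern in turn (assumed order: as listed in the table)."""
    for pattern, replacement in DEFAULT_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(clean_text("  ***Café   au lait!!!  "))  # prints: Caf au lait
```

Note that the non-ASCII pattern drops accented letters entirely, so "Café" becomes "Caf". For multilingual text, custom patterns are likely a better fit than the defaults.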
Usage
Initialization
You can initialize the ContentCleaner using default patterns or provide custom patterns.
Cleaning a Document
To clean the text of a document, use the ProcessDocument method. To create documents, use the RAGDocument class from purecpp_libs.