A Python utility to convert TXT and CSV files to Parquet format. Developed by Sami Adnan.
Parquet Converter is a command-line tool that allows you to convert text-based data files (TXT and CSV) to the Parquet format. It provides options for batch processing, detailed conversion statistics, and flexible configuration.
This project is part of Sami Adnan’s DPhil research at the Nuffield Department of Primary Care Health Sciences, University of Oxford.
```bash
pip install parquet-converter
```

Alternatively, install from source:

```bash
# Clone the repository
git clone https://github.com/sami5001/parquet-converter
cd parquet-converter

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .
```
For more detailed setup instructions, see README-setup.md.
```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install
```
**Basic Usage**

```bash
# Convert a single file
parquet-converter input.csv -o output_dir/

# Convert a directory of files
parquet-converter input_dir/ -o output_dir/
```
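To sanity-check a conversion, the resulting file can be read back with pandas. A minimal sketch; the `output_dir/input.parquet` path is an assumption about how the converter names its output, so check your output directory:

```python
import pandas as pd

# Read the converted file back and inspect it
# (output file name is assumed to mirror the input file name)
df = pd.read_parquet("output_dir/input.parquet")
print(df.shape)
print(df.dtypes)
```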
**Advanced Usage**

```bash
# Use a configuration file
parquet-converter input.csv -c config.yaml

# Enable verbose logging
parquet-converter input.csv -v

# Save current configuration to a file
parquet-converter input.csv --save-config my_config.yaml

# Convert with custom output directory
parquet-converter input.csv -o /path/to/output/

# Convert multiple file types in a directory
parquet-converter data_dir/ -o output_dir/ -c config.yaml

# Convert with verbose logging and custom config
parquet-converter input.csv -v -c custom_config.yaml -o output_dir/
```
```yaml
# config.yaml
csv:
  delimiter: ","
  encoding: "utf-8"
  header: 0
txt:
  delimiter: "\t"
  encoding: "utf-8"
  header: 0
datetime_formats:
  default: "%Y-%m-%d"
  custom: ["%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]
infer_dtypes: true
compression: "snappy"
log_level: "INFO"
```
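Since the config is plain YAML, it can be inspected or generated programmatically. A minimal sketch using PyYAML, not the converter's actual config loader:

```python
import yaml

# Load the converter configuration from a YAML file
with open("config.yaml") as f:
    config = yaml.safe_load(f)

csv_options = config["csv"]      # {'delimiter': ',', 'encoding': 'utf-8', 'header': 0}
print(csv_options["delimiter"])  # ","
print(config["compression"])     # "snappy"
```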
```yaml
# advanced_config.yaml
csv:
  delimiter: ","
  encoding: "utf-8"
  header: 0
  low_memory: true
  chunk_size: 10000
txt:
  delimiter: "\t"
  encoding: "utf-8"
  header: 0
  low_memory: true
  chunk_size: 10000
datetime_formats:
  default: "%Y-%m-%d"
  custom: ["%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]
infer_dtypes: true
compression:
  type: "snappy"
  level: 1
  block_size: "128MB"
log_level: "DEBUG"
log_file: "conversion.log"
parallel:
  enabled: true
  max_workers: 4
```
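The `chunk_size` and `low_memory` options map naturally onto pandas' chunked reading. A hedged sketch of how a large CSV could be streamed into a single Parquet file with pyarrow; this illustrates the general technique, not the converter's internal code:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
# Read the CSV in 10,000-row chunks and append each chunk to one Parquet file
for chunk in pd.read_csv("large_data.csv", chunksize=10_000, low_memory=True):
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        writer = pq.ParquetWriter("large_data.parquet", table.schema,
                                  compression="snappy")
    writer.write_table(table)
if writer is not None:
    writer.close()
```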
You can configure the converter using environment variables:
```bash
# Set input and output paths
export INPUT_PATH="data.csv"
export OUTPUT_DIR="output/"

# Configure logging
export LOG_LEVEL="DEBUG"
export LOG_FILE="conversion.log"

# Set file-specific options
export DELIMITER=","
export ENCODING="utf-8"
```
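In Python, such variables are read with `os.getenv`. A short sketch of how a tool might pick them up, using the variable names documented below:

```python
import os

# Read converter settings from the environment, with fallback defaults
input_path = os.getenv("INPUT_PATH", "data.csv")
output_dir = os.getenv("OUTPUT_DIR", "output/")
log_level = os.getenv("LOG_LEVEL", "INFO")
delimiter = os.getenv("DELIMITER", ",")
print(f"Converting {input_path} -> {output_dir} (log level {log_level})")
```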
Available environment variables for configuration:

- `INPUT_PATH`: Path to input file/directory
- `OUTPUT_DIR`: Output directory path
- `LOG_LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `LOG_FILE`: Path to log file
- `DELIMITER`: Custom delimiter for text files

The Parquet format offers several advantages: columnar storage that speeds up analytical queries, efficient compression and encoding, and a schema stored alongside the data. For more information on Parquet, please refer to the Apache Parquet documentation at https://parquet.apache.org/.
1. **Single File Conversion**:
```bash
# Convert a single CSV file
parquet-converter data.csv -o output/
```

2. **Batch Directory Conversion**:
```bash
# Convert all files in a directory
parquet-converter data_dir/ -o output/ -c config.yaml
```

3. **Verbose Conversion**:
```bash
# Convert with detailed logging
parquet-converter data.csv -v -o output/
```

4. **Configuration Management**:
```bash
# Save the current configuration, then reuse it
parquet-converter data.csv --save-config my_config.yaml
parquet-converter data.csv -c my_config.yaml -o output/
```

5. **Performance-Optimized Conversion**:
```bash
# Convert large files with memory optimization
parquet-converter large_data.csv -c performance_config.yaml -o output/
```
The converter generates the converted Parquet files together with a JSON conversion report; when `log_file` is configured, a log file is written as well.
Example conversion report:
```json
{
  "timestamp": "2024-03-14T12:00:00",
  "summary": {
    "total_files": 2,
    "successful": 2,
    "failed": 0
  },
  "files": [
    {
      "input_file": "data.csv",
      "output_file": "data.parquet",
      "rows_processed": 1000,
      "rows_converted": 1000,
      "success": true
    }
  ]
}
```
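Because the report is plain JSON, it is easy to post-process. For example, a short sketch that flags failed conversions; the `report.json` file name is an assumption:

```python
import json

# Load a conversion report and list any failed files
with open("report.json") as f:
    report = json.load(f)

summary = report["summary"]
print(f"{summary['successful']}/{summary['total_files']} files converted")
for entry in report["files"]:
    if not entry["success"]:
        print("failed:", entry["input_file"])
```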
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=parquet_converter

# Run specific test file
pytest parquet_converter/tests/test_converter.py
```
The project uses several tools to maintain code quality; these are enforced using pre-commit hooks.
The converter is optimized for performance, including chunked reading (`chunk_size`) and pandas' `low_memory=True` option. To run the performance tests:

```bash
pytest parquet_converter/tests/test_performance.py -v
```
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions to the Parquet Converter project are welcome! Please refer to CONTRIBUTING.md for detailed guidelines on how to contribute.
Briefly, please follow these steps to contribute:

1. Create your feature branch (`git checkout -b feature/amazing-feature`)
2. Run the tests (`pytest`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Troubleshooting tips:

- Set `low_memory=True` in the configuration
- Use `encoding='utf-8-sig'` for files with a BOM (byte order mark)

Common error messages and their solutions:
- `ValueError: Could not infer delimiter`
  Solution: Specify the delimiter in the config file or use the `--delimiter` option
- `MemoryError: Unable to allocate array`
  Solution: Use `low_memory=True` or reduce the chunk size
- `UnicodeDecodeError: 'utf-8' codec can't decode byte`
  Solution: Specify the correct encoding in the config file
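The `UnicodeDecodeError` case can often be handled by retrying with a different encoding. A minimal sketch of that fallback pattern with pandas; the encoding order is just one reasonable choice:

```python
import pandas as pd

def read_with_fallback(path):
    # Try UTF-8 first, then utf-8-sig (handles BOM), then latin-1
    for encoding in ("utf-8", "utf-8-sig", "latin-1"):
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any known encoding")

df = read_with_fallback("data.csv")
```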
You can customize data type inference by modifying the config:

```yaml
data_types:
  integers:
    - int32
    - int64
  floats:
    - float32
    - float64
  dates:
    - date
    - datetime
  booleans:
    - bool
  strings:
    - string
    - category
```
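As an illustration of what inference along these lines does, pandas can downcast numeric columns and convert repetitive strings to categoricals. A sketch of the general technique, not the converter's exact logic:

```python
import pandas as pd

df = pd.read_csv("data.csv")

for col in df.columns:
    # Downcast numeric columns to the smallest safe type (e.g. int64 -> int32)
    if pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="float")
    # Store low-cardinality string columns as categoricals
    elif pd.api.types.is_object_dtype(df[col]) and df[col].nunique() < len(df) // 2:
        df[col] = df[col].astype("category")
```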
For batch processing, you can enable parallel processing:
```yaml
parallel:
  enabled: true
  max_workers: 4
  chunk_size: 10000
```
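Parallel batch conversion along these lines can be built on `concurrent.futures`. A hedged sketch in which `convert_file` is a stand-in for the converter's real per-file logic:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

import pandas as pd

def convert_file(path: Path) -> Path:
    # Illustrative per-file conversion: read the CSV and write Parquet
    # next to it (the real converter does considerably more than this)
    out = path.with_suffix(".parquet")
    pd.read_csv(path).to_parquet(out)
    return out

if __name__ == "__main__":
    files = list(Path("data_dir").glob("*.csv"))
    # max_workers mirrors the `parallel.max_workers` setting above
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(convert_file, f) for f in files]
        for future in as_completed(futures):
            print("wrote", future.result())
```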
Configure compression settings:
```yaml
compression:
  type: snappy       # Options: snappy, gzip, brotli, zstd
  level: 1           # Compression level (1-9)
  block_size: 128MB
```
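These settings correspond to options on pyarrow's Parquet writer. A small sketch using pyarrow directly; note that `compression_level` only applies to codecs that support levels (gzip, brotli, zstd), not snappy:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write with a level-aware codec; snappy would ignore compression_level
pq.write_table(table, "data.parquet", compression="zstd", compression_level=1)
```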
```text
usage: parquet-converter [-h] [-o OUTPUT_DIR] [-c CONFIG] [-v] input_path

positional arguments:
  input_path            Path to input file or directory

optional arguments:
  -h, --help            Show this help message and exit
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory path
  -c CONFIG, --config CONFIG
                        Path to configuration file
  -v, --verbose         Enable verbose logging
```
| Option | Type | Default | Description |
|---|---|---|---|
| `csv.delimiter` | string | `","` | CSV file delimiter |
| `csv.encoding` | string | `"utf-8"` | File encoding |
| `csv.header` | int | `0` | Header row index |
| `txt.delimiter` | string | `"\t"` | TXT file delimiter |
| `datetime_formats` | list | `["%Y-%m-%d"]` | Date format patterns |
| `infer_dtypes` | bool | `true` | Enable type inference |
| `compression` | string | `"snappy"` | Compression type |
| `log_level` | string | `"INFO"` | Logging level |
Planned features and improvements are tracked on the project's GitHub issue tracker; please use it for bug reports and feature requests as well.