Parquet Converter


A Python utility to convert TXT and CSV files to Parquet format. Developed by Sami Adnan.

Overview

Parquet Converter is a command-line tool that converts text-based data files (TXT and CSV) to the Parquet format. It supports batch processing, detailed conversion statistics, and flexible configuration.

This project is part of Sami Adnan’s DPhil research at the Nuffield Department of Primary Care Health Sciences, University of Oxford.
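
Conceptually, the conversion is similar to reading a delimited file with pandas and writing it back out through a Parquet engine. The sketch below is illustrative only, assuming pandas and pyarrow are installed; it is not the converter's internal code, and the file names are placeholders.

# Illustrative sketch only -- not the converter's implementation.
# Requires: pip install pandas pyarrow
import pandas as pd

# Read a delimited text file (CSV here; for TXT input, pass sep="\t" or another delimiter)
df = pd.read_csv("input.csv", sep=",", encoding="utf-8", header=0)

# Write the same data as a compressed Parquet file
df.to_parquet("output_dir/input.parquet", engine="pyarrow", compression="snappy", index=False)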

Features

  • Convert single TXT or CSV files, or whole directories, to Parquet
  • Configurable delimiters, encodings and header handling for CSV and TXT input
  • Automatic data type inference and configurable compression
  • YAML configuration files and environment variable support
  • Conversion statistics and JSON reports, with progress indicators for batch runs
  • Verbose logging and optional parallel processing for large batches

Installation

From PyPI

pip install parquet-converter

From GitHub

# Clone the repository
git clone https://github.com/sami5001/parquet-converter
cd parquet-converter

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .

Development Setup

For more detailed setup instructions, see README-setup.md.

# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

Usage

Command Line Interface

# Basic Usage
# Convert a single file
parquet-converter input.csv -o output_dir/

# Convert a directory of files
parquet-converter input_dir/ -o output_dir/

# Advanced Usage
# Use a configuration file
parquet-converter input.csv -c config.yaml

# Enable verbose logging
parquet-converter input.csv -v

# Save current configuration to a file
parquet-converter input.csv --save-config my_config.yaml

# Convert with custom output directory
parquet-converter input.csv -o /path/to/output/

# Convert multiple file types in a directory
parquet-converter data_dir/ -o output_dir/ -c config.yaml

# Convert with verbose logging and custom config
parquet-converter input.csv -v -c custom_config.yaml -o output_dir/

Example Configuration Files

  1. Basic YAML Configuration:
    # config.yaml
    csv:
      delimiter: ","
      encoding: "utf-8"
      header: 0
    txt:
      delimiter: "\t"
      encoding: "utf-8"
      header: 0
    datetime_formats:
      default: "%Y-%m-%d"
      custom: ["%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]
    infer_dtypes: true
    compression: "snappy"
    log_level: "INFO"
    
  2. Advanced Configuration with Performance Settings:
    # advanced_config.yaml
    csv:
      delimiter: ","
      encoding: "utf-8"
      header: 0
      low_memory: true
      chunk_size: 10000
    txt:
      delimiter: "\t"
      encoding: "utf-8"
      header: 0
      low_memory: true
      chunk_size: 10000
    datetime_formats:
      default: "%Y-%m-%d"
      custom: ["%d/%m/%Y", "%Y-%m-%d %H:%M:%S"]
    infer_dtypes: true
    compression:
      type: "snappy"
      level: 1
      block_size: "128MB"
    log_level: "DEBUG"
    log_file: "conversion.log"
    parallel:
      enabled: true
      max_workers: 4
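
A configuration file like the ones above can be loaded with PyYAML and its values passed through to the reader and writer. The following is a rough sketch of that idea using pandas; it assumes the pandas/pyarrow backends and does not reproduce the converter's own configuration handling.

# Sketch: applying YAML settings to a pandas-based conversion (not the converter's own code).
# Requires: pip install pyyaml pandas pyarrow
import yaml
import pandas as pd

with open("config.yaml", "r", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh)

csv_opts = cfg.get("csv", {})
df = pd.read_csv(
    "input.csv",
    sep=csv_opts.get("delimiter", ","),
    encoding=csv_opts.get("encoding", "utf-8"),
    header=csv_opts.get("header", 0),
)

# The basic example stores compression as a string; the advanced one nests it under "type".
compression = cfg.get("compression", "snappy")
if isinstance(compression, dict):
    compression = compression.get("type", "snappy")

df.to_parquet("output_dir/input.parquet", compression=compression, index=False)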
    

Environment Variables

You can configure the converter using environment variables:

# Set input and output paths
export INPUT_PATH="data.csv"
export OUTPUT_DIR="output/"

# Configure logging
export LOG_LEVEL="DEBUG"
export LOG_FILE="conversion.log"

# Set file-specific options
export DELIMITER=","
export ENCODING="utf-8"
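
A script can pick these variables up with os.getenv; the snippet below is a sketch of the idea rather than the converter's exact behaviour, and the fallback values are only examples.

# Sketch: reading the environment variables listed above.
import os

input_path = os.getenv("INPUT_PATH", "data.csv")
output_dir = os.getenv("OUTPUT_DIR", "output/")
log_level = os.getenv("LOG_LEVEL", "INFO")
delimiter = os.getenv("DELIMITER", ",")
encoding = os.getenv("ENCODING", "utf-8")

print(f"Converting {input_path} -> {output_dir} (delimiter={delimiter!r}, encoding={encoding}, log={log_level})")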

Parquet Format Benefits

The Parquet format offers several advantages over delimited text files:

  • Columnar storage, so analytical queries read only the columns they need
  • Built-in compression and encoding, which typically produces much smaller files than CSV or TXT
  • An embedded schema, so data types survive the round trip instead of being re-inferred on every read
  • Wide ecosystem support, including pandas, Apache Spark and most modern data warehouses

Environment Variables Reference

Available environment variables for configuration:

Variable     Description
INPUT_PATH   Path to the input file or directory
OUTPUT_DIR   Directory where converted Parquet files are written
LOG_LEVEL    Logging level (e.g. INFO, DEBUG)
LOG_FILE     Path to the log file
DELIMITER    Field delimiter for the input files
ENCODING     Character encoding of the input files

Additional Resources

For more information on Parquet, please refer to:

  • The Apache Parquet documentation: https://parquet.apache.org/docs/

Example Workflows

  1. Basic File Conversion:
    # Convert a single CSV file
    parquet-converter data.csv -o output/
    
  2. Batch Processing:
    # Convert all files in a directory
    parquet-converter data_dir/ -o output/ -c config.yaml
    
  3. Debug Mode:
    # Convert with detailed logging
    parquet-converter data.csv -v -o output/
    
  4. Custom Configuration:
    # Save current settings as config
    parquet-converter data.csv --save-config my_config.yaml

    # Use custom config
    parquet-converter data.csv -c my_config.yaml -o output/

  5. Performance-Optimized Conversion:
    # Convert large files with memory optimization
    parquet-converter large_data.csv -c performance_config.yaml -o output/

Output

The converter generates:

  1. Parquet files with inferred data types
  2. Conversion statistics in JSON format
  3. Detailed logs with conversion summary
  4. Progress indicators for batch operations

Example conversion report:

{
  "timestamp": "2024-03-14T12:00:00",
  "summary": {
    "total_files": 2,
    "successful": 2,
    "failed": 0
  },
  "files": [
    {
      "input_file": "data.csv",
      "output_file": "data.parquet",
      "rows_processed": 1000,
      "rows_converted": 1000,
      "success": true
    }
  ]
}
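
Because the report is plain JSON, it is easy to post-process with the standard library. The snippet below assumes the report was saved as conversion_report.json; the actual file name written by the converter may differ.

# Sketch: summarising a conversion report like the example above.
import json

with open("conversion_report.json", "r", encoding="utf-8") as fh:  # hypothetical file name
    report = json.load(fh)

summary = report["summary"]
print(f"{summary['successful']}/{summary['total_files']} files converted, {summary['failed']} failed")

for entry in report["files"]:
    status = "ok" if entry["success"] else "FAILED"
    print(f"{entry['input_file']} -> {entry['output_file']}: {entry['rows_converted']} rows ({status})")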

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=parquet_converter

# Run specific test file
pytest parquet_converter/tests/test_converter.py

Code Quality

The project uses several tools to maintain code quality:

  • black for code formatting
  • mypy for static type checking
  • pytest for the test suite

These are enforced using pre-commit hooks.

Performance Considerations

The converter is optimized for performance with the following features:

  • Chunked reading of large files (chunk_size) with an optional low_memory mode
  • Configurable compression (snappy, gzip, brotli, zstd)
  • Optional parallel processing for batch conversions (parallel.max_workers)
  • Data type inference to reduce memory usage

Best Practices

  1. File Size Optimization
    • For files > 1GB, consider using the low_memory=True option
    • Use appropriate compression (snappy for speed, gzip for size)
    • Process large files in chunks when possible (see the chunked-conversion sketch after this list)
  2. Memory Usage
    • Monitor memory usage during conversion
    • Use appropriate chunk sizes for large files
    • Consider system resources when processing multiple files
  3. Performance Tuning
    • Adjust compression settings based on your needs
    • Use appropriate data types to minimize memory usage
    • Consider using SSD storage for better I/O performance
  4. Batch Processing
    • Use directory conversion for multiple files
    • Monitor system resources during batch operations
    • Consider using parallel processing for large batches
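
As referenced in item 1 above, very large files can be streamed in chunks so the full dataset never has to fit in memory. A minimal sketch with pandas and pyarrow (illustrative only, not the converter's internals):

# Sketch: chunked CSV-to-Parquet conversion that bounds memory use.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for chunk in pd.read_csv("large_data.csv", chunksize=10_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # Open the writer lazily so it adopts the schema of the first chunk.
        # Columns with mixed types may need explicit dtypes to keep the schema stable.
        writer = pq.ParquetWriter("output/large_data.parquet", table.schema, compression="snappy")
    writer.write_table(table)

if writer is not None:
    writer.close()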

To run performance tests:

pytest parquet_converter/tests/test_performance.py -v

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions to the Parquet Converter project are welcome! Please refer to CONTRIBUTING.md for detailed guidelines on how to contribute.

Briefly, please follow these steps to contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run the tests to ensure everything works (pytest)
  5. Commit your changes (git commit -m 'Add some amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Acknowledgements

Troubleshooting

Common Issues

  1. Memory Errors
    • If you encounter memory errors with large files, try:
      • Using low_memory=True in configuration
      • Reducing chunk size
      • Processing files in smaller batches
  2. Encoding Issues
    • If you see encoding errors, try:
      • Specifying the correct encoding in config (e.g., 'utf-8', 'latin1')
      • Using encoding='utf-8-sig' for files with BOM
  3. Performance Issues
    • If conversion is slow:
      • Check if compression settings are appropriate
      • Consider using SSD storage
      • Adjust chunk size based on available memory

Error Messages

Common error messages and their solutions:

  • ValueError: Could not infer delimiter
    Solution: Specify the delimiter in the config file or use the --delimiter option
  • MemoryError: Unable to allocate array
    Solution: Use low_memory=True or reduce the chunk size
  • UnicodeDecodeError: 'utf-8' codec can't decode byte
    Solution: Specify the correct encoding in the config file
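
When the encoding is unknown, one pragmatic approach is to try a short list of candidate encodings and keep the first that decodes cleanly. This is a general-purpose sketch, not a feature of the converter:

# Sketch: trying several encodings until one succeeds.
import pandas as pd

def read_with_fallback(path, encodings=("utf-8", "utf-8-sig", "latin1")):
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error

df = read_with_fallback("data.csv")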

Advanced Usage

Custom Data Type Inference

You can customize data type inference by modifying the config:

data_types:
  integers:
    - int32
    - int64
  floats:
    - float32
    - float64
  dates:
    - date
    - datetime
  booleans:
    - bool
  strings:
    - string
    - category
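
In pandas terms, this kind of inference usually amounts to downcasting numeric columns and converting low-cardinality text to categoricals. The helper below is a sketch under those assumptions and is not the converter's own inference logic.

# Sketch: simple dtype tightening before writing Parquet.
import pandas as pd

def tighten_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif out[col].dtype == object and out[col].nunique() < 0.5 * len(out):
            # Low-cardinality strings compress well as categoricals.
            out[col] = out[col].astype("category")
    return out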

Parallel Processing

For batch processing, you can enable parallel processing:

parallel:
  enabled: true
  max_workers: 4
  chunk_size: 10000
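
With settings like these, a batch run can fan conversions out across processes. The example below illustrates the idea with concurrent.futures and pandas; it is not the converter's own implementation, and the paths are placeholders.

# Sketch: converting several files in parallel with a process pool.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import pandas as pd

def convert_one(path: Path, output_dir: Path) -> Path:
    df = pd.read_csv(path)
    target = output_dir / (path.stem + ".parquet")
    df.to_parquet(target, compression="snappy", index=False)
    return target

if __name__ == "__main__":
    inputs = sorted(Path("data_dir").glob("*.csv"))
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)
    with ProcessPoolExecutor(max_workers=4) as pool:
        for result in pool.map(convert_one, inputs, [output_dir] * len(inputs)):
            print(f"wrote {result}")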

Custom Compression

Configure compression settings:

compression:
  type: snappy  # Options: snappy, gzip, brotli, zstd
  level: 1      # Compression level (1-9)
  block_size: 128MB
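
All four codecs are supported by pyarrow (brotli and zstd availability depends on how pyarrow was built), so the speed/size trade-off can be checked empirically. A small sketch with placeholder file names:

# Sketch: comparing output size across compression codecs.
import os
import pandas as pd

df = pd.read_csv("data.csv")
for codec in ("snappy", "gzip", "brotli", "zstd"):
    target = f"data.{codec}.parquet"
    df.to_parquet(target, compression=codec, index=False)
    print(f"{codec}: {os.path.getsize(target)} bytes")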

API Reference

Command Line Arguments

usage: parquet-converter [-h] [-o OUTPUT_DIR] [-c CONFIG] [-v] input_path

positional arguments:
  input_path            Path to input file or directory

optional arguments:
  -h, --help            Show this help message and exit
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory path
  -c CONFIG, --config CONFIG
                        Path to configuration file
  -v, --verbose         Enable verbose logging

Configuration Options

Option            Type    Default        Description
csv.delimiter     string  ","            CSV file delimiter
csv.encoding      string  "utf-8"        File encoding
csv.header        int     0              Header row index
txt.delimiter     string  "\t"           TXT file delimiter
datetime_formats  list    ["%Y-%m-%d"]   Date format patterns
infer_dtypes      bool    true           Enable type inference
compression       string  "snappy"       Compression type
log_level         string  "INFO"         Logging level

Roadmap

Planned features and improvements:

  1. Short Term
    • Support for more input formats (JSON, Excel)
    • Enhanced data type inference
    • Improved error handling
  2. Medium Term
    • Distributed processing support
    • Web interface for file conversion
    • Real-time conversion monitoring
  3. Long Term
    • Cloud storage integration
    • Advanced data validation
    • Custom transformation rules

Support

For issues and feature requests, please use the GitHub issue tracker.

Getting Help