Enhance data integration and transcript parsing

- Improved date parsing in fetch_and_store_chat_data to support multiple formats and added error logging for unparseable dates.
- Enhanced parse_and_store_transcript_messages to handle empty transcripts and expanded message pattern recognition for both User and Assistant.
- Implemented intelligent splitting of transcripts based on detected patterns and timestamps, with fallback mechanisms for unrecognized formats.
- Updated documentation for Celery and Redis setup, troubleshooting, and project structure.
- Added markdown linting configuration and scripts for code formatting.
- Updated Nginx configuration to change the web server port.
- Added xlsxwriter dependency for Excel file handling in project requirements.
2025-05-18 19:18:31 +00:00
parent 8bbbb109bd
commit f0ae061fa7
24 changed files with 1672 additions and 931 deletions


@ -6,10 +6,10 @@ This document explains how to set up and use Redis and Celery for background tas
The data integration module uses Celery to handle:
- Periodic data fetching from external APIs
- Processing and storing CSV data
- Downloading and parsing transcript files
- Manual data refresh triggered by users
- Periodic data fetching from external APIs
- Processing and storing CSV data
- Downloading and parsing transcript files
- Manual data refresh triggered by users
## Installation
@ -31,32 +31,33 @@ redis-cli ping # Should output PONG
After installation, check if Redis is properly configured:
1. Open Redis configuration file:
1. Open Redis configuration file:
```bash
sudo nano /etc/redis/redis.conf
```
```bash
sudo nano /etc/redis/redis.conf
```
2. Ensure the following settings:
2. Ensure the following settings:
```bash
# For development (localhost only)
bind 127.0.0.1
```bash
# For development (localhost only)
bind 127.0.0.1
# For production (accept connections from specific IP)
# bind 127.0.0.1 your.server.ip.address
# For production (accept connections from specific IP)
# bind 127.0.0.1 your.server.ip.address
# Protected mode (recommended)
protected-mode yes
# Protected mode (recommended)
protected-mode yes
# Port
port 6379
```
# Port
port 6379
```
3. Restart Redis after any changes:
```bash
sudo systemctl restart redis-server
```
3. Restart Redis after any changes:
```bash
sudo systemctl restart redis-server
```
#### macOS
@ -79,7 +80,7 @@ If Redis is not available, the application will automatically fall back to using
Set these environment variables in your `.env` file or deployment environment:
```env
```sh
# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
@ -126,28 +127,29 @@ docker-compose up -d
Development requires multiple terminal windows:
1. **Django Development Server**:
1. **Django Development Server**:
```bash
make run
```
```bash
make run
```
2. **Redis Server** (if needed):
2. **Redis Server** (if needed):
```bash
make run-redis
```
```bash
make run-redis
```
3. **Celery Worker**:
3. **Celery Worker**:
```bash
make celery
```
```bash
make celery
```
4. **Celery Beat** (for scheduled tasks):
```bash
make celery-beat
```
4. **Celery Beat** (for scheduled tasks):
```bash
make celery-beat
```
Or use the combined command:
@ -161,12 +163,12 @@ make run-all
If you see connection errors:
1. Check that Redis is running: `redis-cli ping` should return `PONG`
2. Verify firewall settings are not blocking port 6379
3. Check Redis binding in `/etc/redis/redis.conf` (should be `bind 127.0.0.1` for local dev)
1. Check that Redis is running: `redis-cli ping` should return `PONG`
2. Verify firewall settings are not blocking port 6379
3. Check Redis binding in `/etc/redis/redis.conf` (should be `bind 127.0.0.1` for local dev)
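The first two checks can also be scripted. The following is a minimal sketch, not part of the project: `redis_port_reachable` is a hypothetical helper that only tests whether a TCP connection to the Redis port can be opened. It does not speak the Redis protocol, so `redis-cli ping` remains the authoritative check.

```python
import socket


def redis_port_reachable(host: str = "127.0.0.1", port: int = 6379, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; the defaults mirror the local
    development settings (bind 127.0.0.1, port 6379) described above.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "reachable" if redis_port_reachable() else "unreachable"
    print(f"Redis port 6379 on 127.0.0.1 is {state}")
```

A result of `False` points at Redis not running, a firewall rule, or a `bind` setting that excludes the address being tested.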
### Celery Workers Not Processing Tasks
1. Ensure the worker is running with the correct app name: `celery -A dashboard_project worker`
2. Check the Celery logs for errors
3. Verify broker URL settings in both code and environment variables
1. Ensure the worker is running with the correct app name: `celery -A dashboard_project worker`
2. Check the Celery logs for errors
3. Verify broker URL settings in both code and environment variables


@ -25,39 +25,40 @@ python manage.py test_redis
If this fails, check the following:
1. Redis might not be running. Start it with:
1. Redis might not be running. Start it with:
```bash
sudo systemctl start redis-server
```
```bash
sudo systemctl start redis-server
```
2. Connection credentials may be incorrect. Check your environment variables:
2. Connection credentials may be incorrect. Check your environment variables:
```bash
echo $REDIS_URL
echo $CELERY_BROKER_URL
echo $CELERY_RESULT_BACKEND
```
```bash
echo $REDIS_URL
echo $CELERY_BROKER_URL
echo $CELERY_RESULT_BACKEND
```
3. Redis might be binding only to a specific interface. Check `/etc/redis/redis.conf`:
3. Redis might be binding only to a specific interface. Check `/etc/redis/redis.conf`:
```bash
grep "bind" /etc/redis/redis.conf
```
```bash
grep "bind" /etc/redis/redis.conf
```
4. Firewall rules might be blocking Redis. If you're connecting remotely:
```bash
sudo ufw status # Check if firewall is enabled
sudo ufw allow 6379/tcp # Allow Redis port if needed
```
4. Firewall rules might be blocking Redis. If you're connecting remotely:
```bash
sudo ufw status # Check if firewall is enabled
sudo ufw allow 6379/tcp # Allow Redis port if needed
```
## Fixing CSV Data Processing Issues
If you see the error `zip() argument 2 is shorter than argument 1`, it means the data format doesn't match the expected headers. We've implemented a fix that:
1. Pads shorter rows with empty strings
2. Uses more flexible date format parsing
3. Provides better error handling
1. Pads shorter rows with empty strings
2. Uses more flexible date format parsing
3. Provides better error handling
After these changes, your data should be processed correctly regardless of format variations.
@ -77,15 +78,18 @@ python manage.py test_celery
If the task isn't completing, check:
1. Look for errors in the Celery worker terminal
2. Verify broker URL settings match in both terminals:
```bash
echo $CELERY_BROKER_URL
```
3. Check if Redis is accessible from both terminals:
```bash
redis-cli ping
```
1. Look for errors in the Celery worker terminal
2. Verify broker URL settings match in both terminals:
```bash
echo $CELERY_BROKER_URL
```
3. Check if Redis is accessible from both terminals:
```bash
redis-cli ping
```
## Checking Scheduled Tasks
@ -99,36 +103,36 @@ python manage.py celery inspect scheduled
Common issues with scheduled tasks:
1. **Celery Beat not running**: Start it with:
1. **Celery Beat not running**: Start it with:
```bash
cd dashboard_project
celery -A dashboard_project beat
```
```bash
cd dashboard_project
celery -A dashboard_project beat
```
2. **Task registered but not running**: Check worker logs for any errors
2. **Task registered but not running**: Check worker logs for any errors
3. **Wrong schedule**: Check the interval in settings.py and CELERY_BEAT_SCHEDULE
3. **Wrong schedule**: Check the interval in settings.py and CELERY_BEAT_SCHEDULE
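For reference when checking the schedule, a `CELERY_BEAT_SCHEDULE` entry in `settings.py` generally has the shape below. This is only an illustrative sketch: the entry name, the task path, and the 30-minute interval are hypothetical placeholders, not values taken from this project.

```python
from datetime import timedelta

# Sketch of a beat schedule entry. Celery accepts a timedelta, a number of
# seconds, or a crontab object as the "schedule" value.
CELERY_BEAT_SCHEDULE = {
    "refresh-external-data": {  # hypothetical entry name
        "task": "data_integration.tasks.example_refresh_task",  # hypothetical task path
        "schedule": timedelta(minutes=30),  # hypothetical interval
    },
}

# Print each entry so the configured interval is easy to verify at a glance.
for name, entry in CELERY_BEAT_SCHEDULE.items():
    seconds = entry["schedule"].total_seconds()
    print(f"{name}: {entry['task']} every {seconds:.0f} seconds")
```

If a task runs at the wrong interval, compare the `schedule` value here against what Celery Beat prints in its log on startup.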
## Data Source Configuration
If data sources aren't being processed correctly:
1. Verify active data sources exist:
1. Verify active data sources exist:
```bash
cd dashboard_project
python manage.py shell -c "from data_integration.models import ExternalDataSource; print(ExternalDataSource.objects.filter(is_active=True).count())"
```
```bash
cd dashboard_project
python manage.py shell -c "from data_integration.models import ExternalDataSource; print(ExternalDataSource.objects.filter(is_active=True).count())"
```
2. Create a default data source if needed:
2. Create a default data source if needed:
```bash
cd dashboard_project
python manage.py create_default_datasource
```
```bash
cd dashboard_project
python manage.py create_default_datasource
```
3. Check source URLs and credentials in the admin interface or environment variables.
3. Check source URLs and credentials in the admin interface or environment variables.
## Manually Triggering Data Refresh