Enhance data integration and transcript parsing

- Improved date parsing in fetch_and_store_chat_data to support multiple formats and added error logging for unparseable dates.
- Enhanced parse_and_store_transcript_messages to handle empty transcripts and expanded message pattern recognition for both User and Assistant.
- Implemented intelligent splitting of transcripts based on detected patterns and timestamps, with fallback mechanisms for unrecognized formats.
- Updated documentation for Celery and Redis setup, troubleshooting, and project structure.
- Added markdown linting configuration and scripts for code formatting.
- Updated Nginx configuration to change the web server port.
- Added xlsxwriter dependency for Excel file handling in project requirements.
2025-05-18 19:18:31 +00:00
parent 8bbbb109bd
commit f0ae061fa7
24 changed files with 1672 additions and 931 deletions

@ -25,39 +25,40 @@ python manage.py test_redis
If this fails, check the following:
1. Redis might not be running. Start it with:
```bash
sudo systemctl start redis-server
```
2. Connection credentials may be incorrect. Check your environment variables:
```bash
echo $REDIS_URL
echo $CELERY_BROKER_URL
echo $CELERY_RESULT_BACKEND
```
3. Redis might be binding only to a specific interface. Check `/etc/redis/redis.conf`:
```bash
grep "bind" /etc/redis/redis.conf
```
4. Firewall rules might be blocking Redis. If you're connecting remotely:
```bash
sudo ufw status # Check if firewall is enabled
sudo ufw allow 6379/tcp # Allow Redis port if needed
```
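The connectivity checks above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `redis_endpoint` and `can_connect` are illustrative and not part of the project. It confirms only that a TCP connection can be opened, not that Redis accepts the credentials.

```python
import os
import socket
from urllib.parse import urlparse


def redis_endpoint(url, default_port=6379):
    """Extract (host, port) from a redis:// style URL, defaulting to localhost:6379."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or default_port


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Same variables as the `echo` commands above.
    for name in ("REDIS_URL", "CELERY_BROKER_URL", "CELERY_RESULT_BACKEND"):
        url = os.environ.get(name)
        if not url:
            print(f"{name}: not set")
            continue
        host, port = redis_endpoint(url)
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{name}: {host}:{port} is {status}")
```

If a variable is set but the endpoint is not reachable, the bind address and firewall checks in items 3 and 4 are the likely places to look next.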
## Fixing CSV Data Processing Issues
If you see the error `zip() argument 2 is shorter than argument 1`, a data row has fewer fields than the header row. We've implemented a fix that:
1. Pads shorter rows with empty strings
2. Uses more flexible date format parsing
3. Provides better error handling
After these changes, rows with missing trailing fields and dates in any of the supported formats should be processed without raising this error.
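The padding and date-parsing steps can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the function names `row_to_dict` and `parse_date`, the sample headers, and the list of date formats are all hypothetical.

```python
from datetime import datetime

# Assumed candidate formats; the project's real list may differ.
DATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%d-%m-%Y %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d/%m/%Y %H:%M",
)


def row_to_dict(headers, row):
    """Map a CSV row onto its headers, padding missing trailing fields with ''."""
    padded = list(row) + [""] * (len(headers) - len(row))
    # Non-strict zip: any fields beyond the last header are dropped.
    return dict(zip(headers, padded))


def parse_date(value):
    """Try each known format in turn; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    headers = ["session_id", "start_time", "country"]
    short_row = ["42", "18-05-2025 19:18:31"]  # the country field is missing
    record = row_to_dict(headers, short_row)
    print(record)
    print(parse_date(record["start_time"]))
```

Returning `None` for an unparseable date leaves the decision to the caller, which can log the raw value and skip or keep the row.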
@ -77,15 +78,18 @@ python manage.py test_celery
If the task isn't completing, check:
1. Look for errors in the Celery worker terminal
2. Verify broker URL settings match in both terminals:
```bash
echo $CELERY_BROKER_URL
```
3. Check if Redis is accessible from both terminals:
```bash
redis-cli ping
```
## Checking Scheduled Tasks
@ -99,36 +103,36 @@ python manage.py celery inspect scheduled
Common issues with scheduled tasks:
1. **Celery Beat not running**: Start it with:
```bash
cd dashboard_project
celery -A dashboard_project beat
```
2. **Task registered but not running**: Check worker logs for any errors
3. **Wrong schedule**: Check the interval in settings.py and CELERY_BEAT_SCHEDULE
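For the schedule check in item 3, it helps to know what an entry looks like. Below is a minimal sketch of a `CELERY_BEAT_SCHEDULE` setting, assuming Celery reads Django settings with the `CELERY` namespace; the entry name and task path are hypothetical and will differ in the real settings.py. The small helper only prints what is configured, so it runs without Celery installed.

```python
# settings.py (sketch): the entry name and task path below are hypothetical.
CELERY_BEAT_SCHEDULE = {
    "refresh-external-data": {
        "task": "data_integration.tasks.refresh_all_sources",  # hypothetical task path
        "schedule": 3600.0,  # interval in seconds between runs
    },
}


def describe_schedule(schedule):
    """Return one readable line per beat entry, sorted by entry name."""
    return [
        f"{name}: {entry['task']} every {entry['schedule']} s"
        for name, entry in sorted(schedule.items())
    ]


if __name__ == "__main__":
    for line in describe_schedule(CELERY_BEAT_SCHEDULE):
        print(line)
```

If the printed interval is not what you expect, or the task path does not match a task registered on the worker, that is usually the cause of a task that never fires.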
## Data Source Configuration
If data sources aren't being processed correctly:
1. Verify active data sources exist:
```bash
cd dashboard_project
python manage.py shell -c "from data_integration.models import ExternalDataSource; print(ExternalDataSource.objects.filter(is_active=True).count())"
```
2. Create a default data source if needed:
```bash
cd dashboard_project
python manage.py create_default_datasource
```
3. Check source URLs and credentials in the admin interface or environment variables.
## Manually Triggering Data Refresh