Monday, November 20, 2023

Thousands of Australian UFO newsclippings - downloaded, organised and assimilated into wider collection using AI-generated code [AI assisted ufology]

Last year, I uploaded a few hundred scanned Australian newspaper clippings about UFOs. In the last few weeks, I have explored the (impressive) ability of Artificial Intelligence tools to generate code to find further UFO newspaper clippings from various Australian sources and organise them so that they can be assimilated into my wider collection of UFO material. As a result, I added over 3,500 Australian newspaper clippings to my online UFO collection this weekend. More significantly, now that relevant tools have been developed, it should be relatively easy to add thousands more from Australia (or apply the same technique to combine similar UFO newspaper clipping collections from dozens of other countries with previous scans and share them online).

The current Australian collection is at:
https://files.afu.se/Downloads/?dir=.%2FTopics%2F0%20-%20Countries%2FAustralia%2FCuttings

I've previously done various projects in relation to UFO newspaper clippings. For example, a few years ago I shared a basic list of over 60,000 scanned UFO newspaper articles that were then in my collection (and shared a sample of that collection, from the 1980s). More recently, I uploaded - with permission from Rod Dyke - scans of the "UFO Newsclipping Service" (1969-2011) and - with permission from Ron and Richard Smotek - scans of a similar service offered by the Aerial Phenomenon Clipping Information Center ("APCIC") (1970s-1990).

Artificial Intelligence tools now allow very rapid reorganisation of UFO material from numerous sources, so it is relatively simple to assimilate UFO newspaper clippings from online databases, scrapbooks of UFO researchers, official documents (e.g. the Australian files that I've uploaded as PDFs over the years, working with Keith Basterfield and Paul Dean - see HERE), offline UFO databases, archives of UFO groups/researchers and digitised UFO material.

These UFO newspaper clippings from various sources can all be organised so that they can, in turn, be assimilated into a wider collection of UFO case files, official UFO documents, UFO magazines, PhD dissertations regarding UFOs, UFO databases, discussion forum posts and related emails/correspondence.

As a further little case study, this weekend I uploaded a few thousand further Australian newspaper clippings to my free online UFO archive (kindly hosted by the AFU in Sweden). These are being combined with material from the AFU's offline archive in Sweden plus collections of newspaper clippings from various Australian researchers (including collections put together by Anthony Clarke and Judith Houston McGinness - with their kind permission).

I picked Australia for this little case study due to the existence of a huge, free online database of Australian newspaper stories: Trove.   

Various Australian UFO researchers have previously highlighted the existence of Trove in blog posts, including posts by Keith Basterfield and Paul Dean. The Trove newspaper archive includes a huge number of Australian newspaper stories. Unfortunately, it is not easy to find a comprehensive online collection of the UFO newspaper clippings available on Trove (or any collection supplemented by further UFO newspaper clippings from other sources, such as those mentioned above).

Searching Trove can be slightly frustrating. For example, a search of the content of articles on Trove for "UFO" finds _many_ articles from long before 1947 (i.e. before the modern UFO era, and before the term "UFO" was coined).  Some of those early newspaper articles have been scanned poorly, so the text as recognised by the Trove software is basically a collection of random letters. By chance, those random letters include the letters "UFO" in hundreds of articles (e.g. in a line of text which is recognised by the Trove system [wrongly] as being "adfr AWTA hAWrhyu UFO akaRF jsD AlE").

Very brief search terms such as "UFO" therefore generate hundreds (if not thousands) of false positive results which would have to be weeded out if the collection is to be limited to just articles relating to UFOs.  
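
If the OCR'd text of the search results is downloaded alongside the images, some of that weeding could probably be automated. As a minimal sketch (the folder name, word list and threshold below are purely illustrative assumptions, not part of the script at the end of this post), an article whose recognised text contains almost none of the most common English words is very likely to be one of these badly scanned false positives:

import re
from pathlib import Path

# Hypothetical folder containing the OCR'd text of each search result (one .txt file per article)
OCR_DIR = Path("e:/temp/Trove/ocr_text")

# A handful of very common English words; genuine articles contain plenty of them,
# whereas badly scanned 1800s articles tend to be runs of random letters.
COMMON_WORDS = {"the", "and", "of", "to", "a", "in", "was", "that", "is", "for"}

def gibberish_ratio(text):
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 1.0
    common = sum(1 for word in words if word in COMMON_WORDS)
    return 1.0 - (common / len(words))

for txt_file in OCR_DIR.glob("*.txt"):
    text = txt_file.read_text(encoding="utf-8", errors="ignore")
    if gibberish_ratio(text) > 0.9:  # threshold chosen purely for illustration
        print(f"Probable OCR false positive: {txt_file.name}")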

At the other extreme, it is possible to search Trove for articles which readers have tagged with the label "UFO".  The material found using this search includes far less irrelevant material (with relevant material making up almost 100% of the results) - BUT only a small fraction of the relevant material is found (say 1%). At present, most of the articles in Trove that may be of interest to UFO researchers have not been tagged.

So, the challenge is either:
(1) To find the time to weed out irrelevant results from wider search terms, or
(2) To find search terms/restrictions which result in only (or at least almost entirely) material which is relevant to UFO research.

I don't have the time for (1), so, working on my own, the only real option is (2).

Fortunately, it is possible to come up with search terms and restrictions which greatly reduce the amount of irrelevant material while finding far more relevant material than just the articles currently tagged with a label such as "UFO".  

To help find useful search terms (and to archive material which is found), I found it useful to download all the articles found as a result of a search and then to glance through them offline (which is much faster than reviewing them online). In particular, I used AI software to generate code to download all the search results, then manually reviewed the folder of PDFs for each search, setting the view in Windows Explorer to include a preview pane on the right-hand side of the screen - allowing relatively rapid review of the PDFs to determine whether the results were largely relevant or included a lot of irrelevant material.
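
Incidentally, the rough relevance percentages mentioned below came from exactly this sort of manual skim. One simple way of keeping track while skimming is to drag obviously irrelevant PDFs into an "irrelevant" sub-folder and then count the two piles. A minimal sketch (the folder names are just an assumption, reusing the download folder from the script at the end of this post):

from pathlib import Path

# Hypothetical layout: the PDFs for one search sit in results_dir, with anything judged
# irrelevant during the preview-pane skim dragged into an "irrelevant" sub-folder.
results_dir = Path("e:/temp/Trove/testdown")
irrelevant_dir = results_dir / "irrelevant"

relevant = len(list(results_dir.glob("*.pdf")))
irrelevant = len(list(irrelevant_dir.glob("*.pdf"))) if irrelevant_dir.exists() else 0
total = relevant + irrelevant

if total:
    print(f"{relevant} of {total} results relevant ({100 * relevant / total:.0f}%)")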

I think it would be useful to have a discussion of the pros and cons of different potential search terms, with at least a qualitative discussion of actual experiments with those search terms. I don't recall seeing this done within ufology so far.  For example:

The term "UFO", as indicated above, generated far too many irrelevant results due to random letters produced by the OCR software - particularly in poor scans of earlier newspaper articles (e.g. articles from the 1800s).

The term "UFO" could be combined with another search term or collection of alternative search terms, e.g. searching for "UFO" AND (light OR mysterious OR unidentified OR flying OR sighting OR sighted).  Unfortunately, pre-1952, most of the results were irrelevant (with, say, less than 10% being relevant).  Most were hits for the word "unidentified" in poorly scanned articles from the 1800s with lots of random characters that happen to include the three consecutive letters "UFO".  Post-1952, the percentage of relevant results rises to, say, about 50% - with fewer poor scans containing random characters, but quite a lot of the hits for the keyword "UFO" are in reviews of science fiction books and movies.

Better than "UFO" was a search for "Unidentified" AND "Flying" AND "Object".  I'd estimate that about 95% of results after 1947 were relevant. However, there were surprisingly many results prior to 1947, only about 10% of which were relevant. One possibility would be devising searches that use terms combined with date restrictions, e.g. "Unidentified" AND "Flying" AND "Object" but only in relation to articles dating from, say, 1947 or later.
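
I haven't set out the exact Trove URL parameters for a date restriction here, but much the same restriction can be applied after downloading: the script at the end of this post names each PDF so that it starts with the year of the article, so pre-1947 results can be set aside for separate manual review. A minimal sketch (the "needs manual review" sub-folder name is just an illustrative assumption):

import shutil
from pathlib import Path

# The downloader below names each PDF "YYYY MM DD_...", so the year can be read
# straight off the start of the filename. Folder names here are illustrative only.
results_dir = Path("e:/temp/Trove/testdown")
review_dir = results_dir / "pre-1947 - needs manual review"
review_dir.mkdir(exist_ok=True)

for pdf in results_dir.glob("*.pdf"):
    year = pdf.name[:4]
    if year.isdigit() and int(year) < 1947:
        shutil.move(str(pdf), str(review_dir / pdf.name))
        print(f"Set aside for manual review: {pdf.name}")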

Turning from "UFO", some other terms resulted in some interesting articles being found, but too many irrelevant articles for the search to be relied upon alone (without manual intervention). For example, Trove had about 200 hits for "strange lights in the sky" (i.e. a much smaller number than for UFO or flying saucer).  This found articles containing both "lights" in the plural ("strange lights in the sky") and "light" in the singular ("strange light in the sky"). However, a relatively high percentage were irrelevant or uninteresting.

By a significant margin, the most productive search term was "flying saucer" (the results for which included the plural, "flying saucers", in addition to the singular).  This resulted in about 2,000 hits, mainly from the 1940s-1950s, almost all of which were relevant.  Due to the high rate of relevance, the numerous results of this particular search probably do not require much (if any) manual intervention, and "flying saucer" is the most promising search term to be rolled out and applied to databases of newspaper articles from other English-speaking countries (see the short sketch after the code at the end of this post).

I'll set out below the code I generated using AI tools to download material from Trove (it includes a line stating the URL of the relevant search results to be downloaded, which obviously has to be changed for each different search term used):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import requests
import os
import time
import datetime
import re

MAX_RETRIES = 100  # Number of retries for downloading an article and for loading search result pages

def safe_click(driver, element):
    driver.execute_script("arguments[0].scrollIntoView();", element)
    time.sleep(1)
    try:
        element.click()
    except Exception:
        driver.execute_script("arguments[0].click();", element)

def sanitize_filename(filename):
    invalid_chars = ['<', '>', ':', '"', '/', '\\', '|', '?', '*']
    for char in invalid_chars:
        filename = filename.replace(char, '_')
    return filename

def download_pdf_from_search_result(driver, search_result_url, save_dir, max_retries=MAX_RETRIES):
    retries = 0
    success = False

    while retries < max_retries and not success:
        try:
            driver.get(search_result_url)
            WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, '//*[@id="grp2Download"]/span[1]')))
                
            # Construct the filename dynamically
            filename = construct_filename_dynamic(driver)
            
            # The rest of the download steps
            download_icon = driver.find_element(By.XPATH, '//*[@id="grp2Download"]/span[1]')
            safe_click(driver, download_icon)
            
            pdf_option = driver.find_element(By.XPATH, '//*[@id="articlePdfLink"]')
            safe_click(driver, pdf_option)
            
            change_size_btn = driver.find_element(By.XPATH, '//*[@id="articleImageQualityShow"]')
            safe_click(driver, change_size_btn)
            
            largest_checkbox = driver.find_element(By.XPATH, '//*[@id="inlineRadio5"]')
            safe_click(driver, largest_checkbox)
            
            create_pdf_btn = driver.find_element(By.XPATH, '//*[@id="downloadModal"]/div/div/div[3]/a[10]')
            safe_click(driver, create_pdf_btn)
            
            time.sleep(10)  # Wait for the "View PDF" button to appear/change after the "Create PDF" button is clicked
            
            view_pdf_btn = driver.find_element(By.XPATH, '//*[@id="downloadModal"]/div/div/div[3]/a[10]')
            pdf_url = view_pdf_btn.get_attribute('href')
            
            # Fetch the PDF content
            response = requests.get(pdf_url, stream=True, timeout=60)  # Add a timeout for the request
       
            # Sanitize the filename
            safe_filename = sanitize_filename(filename)
            
            # Download the PDF
            with open(os.path.join(save_dir, safe_filename), 'wb') as pdf_file:
                for chunk in response.iter_content(chunk_size=1024):
                    if chunk:
                        pdf_file.write(chunk)

            success = True
        except (TimeoutException, requests.exceptions.ConnectionError) as e:  # Handle both timeout and connection errors
            retries += 1
            print(f"Attempt {retries} failed due to {str(e)}. Retrying...")
            wait_time = min(5 + retries * 2, 600)
            time.sleep(wait_time)
    
    if not success:
        print(f"Failed to download article from {search_result_url} after {max_retries} attempts.")

def construct_filename_dynamic(driver):
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    # find_element() raises an exception if the element is missing, so use find_elements()
    # here instead - that way the fallback values below can actually be used.
    source_elements = driver.find_elements(By.XPATH, '//*[@id="breadcrumb-c"]/ul/li[4]/a')
    source = source_elements[0].text.strip().split('(')[0].strip() + " (Australia)" if source_elements else "Browse (Australia)"

    title_elements = driver.find_elements(By.XPATH, '//*[@id="breadcrumb-c"]/ul/li[7]/a')
    title = title_elements[0].text.strip().lower().title() if title_elements else "UnknownTitle"

    date_pattern = re.compile(r'(\w+ \d{1,2} \w+ \d{4})')
    date_match = date_pattern.search(page_source)

    if date_match:
        date_info = date_match.group(0).split()
        day, month, year = date_info[1].zfill(2), date_info[2], date_info[3]
    else:
        day, month, year = "00", "UnknownMonth", "0000"

    month_mapping = {
        'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06',
        'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'
    }

    # Use only the first three letters of the month so that both abbreviated ("Jul") and full ("July") month names map correctly
    date_string = f"{year} {month_mapping.get(month[:3].title(), '00')} {day}"

    filename = f"{date_string}_{source} - {title}.pdf"
    return filename

def process_search_results_page(driver, search_results_url, save_dir, start_article=1, end_article=397, articles_per_page=20):
    # Calculate starting page and ending page
    start_page = (start_article - 1) // articles_per_page + 1
    end_page = (end_article - 1) // articles_per_page + 1

    # Generate the URL for the starting page
    start_pos = (start_page - 1) * articles_per_page
    initial_url = f"{search_results_url}&startPos={start_pos}"
    driver.get(initial_url)

    # Add a short delay to let the page load and to see if the popup appears
    time.sleep(5)

    # Try to close the popup
    try:
        close_popup = driver.find_element(By.XPATH, '//*[@id="culturalModal___BV_modal_footer_"]/div/div/div[2]/button/span')
        close_popup.click()
        time.sleep(2)  # Give it a moment to close
    except Exception as e:
        print(f"Error closing popup: {e}")

    article_counter = start_article - 1
    page_count = start_page

    while page_count <= end_page:
        # Extract and process articles on the current page
        print(f"Attempting to extract articles from page {page_count}...")
        article_links_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/newspaper/article/')]")
        article_links = list(set([link_elem.get_attribute('href') for link_elem in article_links_elements]))

        print(f"Found {len(article_links)} articles on the current page.")

        article_num = 0  # Counter to keep track of which article is being processed on the page

        for link in article_links:
            article_num += 1
            print(f"Processing article number {article_num} with URL {link}...")

            try:
                download_pdf_from_search_result(driver, link, save_dir)
            except Exception as e:
                print(f"Error processing article number {article_num} with URL {link}. Error: {e}")

            article_counter += 1
            print(f"Processed {article_counter} articles.")

        print(f"Current article count after processing page: {article_counter}")

        # Check if we need to navigate to the next page
        if article_num % articles_per_page == 0:
            print("Verifying if we need to navigate to the next page...")
            print(f"article_counter: {article_counter}, articles_per_page: {articles_per_page}, Modulus result: {article_counter % articles_per_page}")
            
            for i in range(MAX_RETRIES):
                next_page_url = f"{search_results_url}&startPos={article_counter}"
                driver.get(next_page_url)
                time.sleep(min(5 + i*2, 600))  # Wait time increases with each retry, maxing out at 600 seconds (10 minutes)

                # If articles are found on the page, break out of the retry loop
                article_links_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '/newspaper/article/')]")
                if article_links_elements:
                    break
                else:
                    print(f"Retry {i+1}: No articles found on the new page. Retrying...")

            page_count += 1
        else:
            print(f"No more articles to process or reached the limit.")
            print(f"Processed {page_count - start_page + 1} pages.")
            break



# Script execution starts here
if __name__ == '__main__':
    driver = webdriver.Chrome()
    save_directory = 'e:/temp/Trove/testdown'

    # URL of the search results page
    search_results_url = "https://trove.nla.gov.au/search/advanced/category/newspapers?keyword.phrase=unidentified%20flying%20object"

    # For processing articles 
    process_search_results_page(driver, search_results_url, save_directory, start_article=1, end_article=422)

    time.sleep(20)
    driver.quit()
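
Finally, rolling the most productive searches (such as "flying saucer") out across several phrases is just a matter of looping over the corresponding search URLs, saving each phrase's results into its own folder. A minimal sketch, assuming the script above has been saved as trove_downloader.py (that module name, the phrases and the end_article value are placeholders to be adjusted):

import os
from urllib.parse import quote

from selenium import webdriver

# Assumes the script above has been saved as trove_downloader.py in the same folder
from trove_downloader import process_search_results_page

SEARCH_PHRASES = ["flying saucer", "unidentified flying object"]
BASE_URL = "https://trove.nla.gov.au/search/advanced/category/newspapers?keyword.phrase="
BASE_DIR = "e:/temp/Trove"

driver = webdriver.Chrome()
for phrase in SEARCH_PHRASES:
    save_dir = os.path.join(BASE_DIR, phrase.replace(" ", "_"))
    os.makedirs(save_dir, exist_ok=True)
    search_url = BASE_URL + quote(phrase)
    # end_article should be set to roughly the number of hits Trove reports for the phrase
    process_search_results_page(driver, search_url, save_dir, start_article=1, end_article=2000)
driver.quit()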

