Old 09-25-2006, 09:08 AM   #1
essexboyracer
Member
 
Join Date: Jun 2001
Location: UK
Posts: 51
Extract Transform Load

I currently have a system set up which pipes emails (about 700 a day) into a script that breaks each mail down. I then run some regexps on it to extract some information, insert that into MySQL and do some other stuff.

I am not confident in the regexps I created, and they only get the minimum amount of data I need for post-processing; I struggled to get even that far. I now need to capture all the data into MySQL.

I have seen so-called ETL tools, but haven't found anything that could be deployed in a shared cPanel hosting environment, over which I have no control. Some googling on XML brought up some interesting results on transforming non-XML data into XML, but I couldn't find a clearly written tutorial on how to go about it.

Does anyone know of an alternative way to do this, or has anyone come across any good tutorials?

BTW, the data in the text email looks a bit like this:

THIS IS A SECTION HEADER
First Name: Alfred Bloggs
Last Name: Cesear

THIS IS ANOTHER SECTION HEADER
Type of Plane: Cessna Plane Capacity: 5000cc
Etc, etc

Not everything is on a separate line, as with Plane Capacity above, and this sometimes occurs within the message.
Old 09-27-2006, 04:55 AM   #2
void_function
Uncaught Exception
 
Join Date: Jun 2005
Posts: 266
You can go two ways:

1. Write an elaborate matching function that scans character by character, comparing against the known label sequences. Even this will break if, for instance, there is a plane type called "Capacity".
2. Enforce proper writing of the emails: either enforce XML, or use an INI-type format:

[THIS IS ANOTHER SECTION]
Type= Cessna Plane
Capacity= 5000
Somevar= Somekey

At any rate, some enforcing of email format is required.
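
If the senders could be persuaded to use the INI-style layout above, PHP's built-in parse_ini_string() (PHP 5.3 and later) might be enough to read it. This is only a minimal sketch: the $body variable is a made-up stand-in for the mail text, and unquoted values containing characters such as ; or = would need quoting.

```php
<?php
// Hypothetical mail body using the INI-style layout from the post above.
$body = "[THIS IS ANOTHER SECTION]\n"
      . "Type= Cessna Plane\n"
      . "Capacity= 5000\n"
      . "Somevar= Somekey\n";

// The second argument (true) groups the keys under their [section] name.
$data = parse_ini_string($body, true);

if ($data === false) {
    // The text was not valid INI; the mail would need handling another way.
    $data = array();
}

foreach ($data as $section => $fields) {
    foreach ($fields as $key => $value) {
        echo $section . ' / ' . $key . ' = ' . $value . "\n";
    }
}
```

Each section becomes one nested array, so the values are ready to be bound to a MySQL insert without any regexp.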

Remember that XML is basically a meta-language. It only prescribes the format, not the content, so the conversion into XML depends entirely on the data structure that you have. Remember too that garbage in = garbage out: once you have properly parsed input values, you can produce properly formatted XML, and you can write it by hand:

Code:
<sectionname>
  <plane capacity="$capacity">$type</plane>
</sectionname>
Or

Code:
<sectionname>
  <plane>
    <type capacity="$capacity">$type</type>
  </plane>
</sectionname>
Or

Code:
<sectionname>
  <plane>
      <type>$type</type>
      <capacity>$capacity</capacity>
  </plane>
</sectionname>
As you see, the format of XML is entirely up to you. As a general rule, I put numeric values in tag attributes, and strings in tags.
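
When the XML is written by hand like this, the parsed values should be escaped first, otherwise a stray & or < in a mail breaks the markup. A rough sketch of the third layout, assuming PHP 5.4 or later for ENT_XML1; the function name planeToXml() is made up for illustration:

```php
<?php
/**
 * Builds the <type>/<capacity> layout shown above as a string.
 * htmlspecialchars() with ENT_XML1 escapes &, <, > and quotes so that
 * a parsed value cannot break the surrounding markup.
 */
function planeToXml($type, $capacity) {
    $esc = function ($value) {
        return htmlspecialchars((string) $value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    };

    return "<sectionname>\n"
         . "  <plane>\n"
         . "    <type>" . $esc($type) . "</type>\n"
         . "    <capacity>" . $esc($capacity) . "</capacity>\n"
         . "  </plane>\n"
         . "</sectionname>\n";
}

echo planeToXml('Cessna & Sons', 5000);
```

For anything larger than a few tags, the DOM or XMLWriter extensions do the escaping for you, if the host has them enabled.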

Last edited by void_function; 09-27-2006 at 04:58 AM.
Old 09-28-2006, 10:54 AM   #3
essexboyracer
Member
 
Join Date: Jun 2001
Location: UK
Posts: 51
Thank you, void. I have come up with a somewhat better solution than regexp. I found a function that extracts the text between two different markers. The section's contents then end up in the first element of an array. I str_replace each bit of text I don't need, like "First Name: ", with a pipe, then explode the first array element on the pipe.

The only point where it falls down is when a section repeats itself with different information in it. I have tried looping, but I fear it will require regexp again, and I haven't sussed that out yet. Still, I can get the majority of the info out of the mail. Thank you for the XML intro. I know it would be a better solution if only I could change the format of the emails I get, which I can't.

PHP Code:
/**
 * Returns an array containing each of the sub-strings from text that
 * are between openingMarker and closingMarker. The text from
 * openingMarker and closingMarker are not included in the result.
 * This function does not support nesting of markers.
 */
function returnSubstrings($text, $openingMarker, $closingMarker) {
    $openingMarkerLength = strlen($openingMarker);
    $closingMarkerLength = strlen($closingMarker);

    $result = array();
    $position = 0;
    while (($position = strpos($text, $openingMarker, $position)) !== false) {
        // Step past the opening marker so it is not part of the result.
        $position += $openingMarkerLength;
        if (($closingMarkerPosition = strpos($text, $closingMarker, $position)) !== false) {
            $result[] = substr($text, $position, $closingMarkerPosition - $position);
            // Continue searching after the closing marker.
            $position = $closingMarkerPosition + $closingMarkerLength;
        }
    }
    return $result;
}
I am going to stay with what I've got, and change the way it works when I have more time or when the firm's partners jump up and down. Hopefully Garbage In = The Good Stuff Out (for the moment).
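
For the repeated-section case mentioned above, one possible approach is to walk the mail line by line and start a new entry every time a header appears, so that repeats are kept rather than overwritten. This is only a sketch under some assumptions: the $labels list and the parseSections() name are made up, a header is taken to be a line of capitals and spaces only, every label is followed by a colon, and PHP 5.3 or later is available for the closures.

```php
<?php
// Hypothetical list of field labels; the real mails will have their own.
$labels = array('First Name', 'Last Name', 'Type of Plane', 'Plane Capacity');

/**
 * Splits mail text into sections, each with its label => value pairs.
 * A repeated header produces a separate entry each time it appears.
 */
function parseSections($text, array $labels) {
    // Try longer labels first so a short label cannot shadow a longer one.
    usort($labels, function ($a, $b) { return strlen($b) - strlen($a); });
    $quoted = array_map(function ($label) { return preg_quote($label, '/'); }, $labels);
    $alternation = implode('|', $quoted);

    // A value runs up to the next known label or to the end of the line.
    $pattern = '/(' . $alternation . '):\s*(.*?)\s*(?=(?:' . $alternation . '):|$)/';

    $sections = array();
    $current = -1;
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        // Assumed header shape: capitals and spaces only.
        if (preg_match('/^[A-Z][A-Z ]+$/', $line)) {
            $sections[] = array('header' => $line, 'fields' => array());
            $current = count($sections) - 1;
            continue;
        }
        if ($current < 0) {
            continue; // ignore anything before the first header
        }
        if (preg_match_all($pattern, $line, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                $sections[$current]['fields'][$match[1]] = $match[2];
            }
        }
    }
    return $sections;
}
```

Because a value ends where the next known label begins, a line such as "Type of Plane: Cessna Plane Capacity: 5000cc" is split into two fields. The ambiguity void_function pointed out remains: a value that happens to contain a label followed by a colon would still be split in the wrong place.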

Last edited by essexboyracer; 09-28-2006 at 01:31 PM.