Parsing a resume .doc file

  • 12 years ago
    Does anybody have any ideas on how to parse a resume from a Word document? Basically, the ASP.NET form would require users to upload a Word document; the web server would then use MS Word to open it and save it as text or XML. I would like to figure out a way to parse out the name, address, phone, etc.

    I've tried saving it as MS Word XML but have not had any success figuring out what my next step is. Even an example in JavaScript, C# or VB.NET would help.

    thanks
  • 12 years ago
    You must use the object model of Microsoft Word.
    The only problem you will have is that it differs between Word 97, XP, and 2003, so you must write a universal parser.

  • 12 years ago
    I have no problem opening the Word doc and saving it as Word XML. I just have no idea how to do the parsing, or how to map the XML to some type of XSD template.
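
One possible next step after saving as Word XML is to flatten the file into one string per paragraph, which can then be examined line by line. The sketch below assumes the Word 2003 XML format (paragraphs in w:p elements, text in w:t elements) and a .NET version that includes System.Xml.Linq; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WordXmlText
{
    // Namespace used by the Word 2003 XML format (assumed input format).
    static readonly XNamespace W = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Hypothetical helper: returns the text of each non-empty paragraph, in document order.
    public static List<string> ReadParagraphs(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants(W + "p")
                  // A paragraph's text may be split across several runs; join them.
                  .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)).Trim())
                  .Where(text => text.Length > 0)
                  .ToList();
    }
}
```

Calling WordXmlText.ReadParagraphs(File.ReadAllText(path)) on a saved resume should give a list whose first few entries are the name, address and phone lines, provided the resume starts with them.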

  • 12 years ago

    Can you send an example of a doc file? Is the format of your files constant?

  • 12 years ago

    If your documents use templates, i.e. have a constant format, then exporting them to XML will make them very easy to parse.

  • 12 years ago
    It's basically a standard resume:
    first name last name
    address
    city state zip
    phone

    objective

    education

    experience

    I'm sure most of the resumes that people upload will have this type of layout.
    I'm not really asking for perfection, just maybe being able to pick up the name, address,
    city, state, and zip from the document and fill in the text boxes on a form.

    thanks

  • 12 years ago
    When we speak about Word, we should keep in mind the formatting of the data and the special characters used to mark actions and symbols. On the other hand, we cannot determine the structure from the content itself, i.e. from the name, last name, etc. But suppose the resume is in .txt format (or even in .doc), the first line contains the name and surname separated by space characters, the second the address, and the third the city, state and zip. Then we simply read it line by line and use the space character as the separator.

    Let the resume file resume.txt be

    George Michael
    Wallstreet 11/1A
    London England 11000
    111-111-111

    .....

    the code for reading and parsing the file is:

    ...

    using System.Text.RegularExpressions;

    ...

    private void Parse()
    {
        // The using blocks close the reader and the file even if parsing throws.
        using (FileStream fs = new FileStream("resume.txt", FileMode.Open, FileAccess.Read))
        using (StreamReader sr = new StreamReader(fs))
        {
            int row = 0;

            // Peek() returns -1 at the end of the file.
            while (sr.Peek() >= 0)
            {
                string st = sr.ReadLine();
                row++;

                switch (row)
                {
                    case 1:
                    {
                        // The braces give each case its own scope, so Spliter
                        // can be declared in both case 1 and case 3.
                        Regex Spliter = new Regex(@" ");
                        string[] NameSurname = Spliter.Split(st);
                        txtName.Text = NameSurname[0];
                        txtSurname.Text = NameSurname[1];
                        break;
                    }
                    case 2:
                        txtAddress.Text = st;
                        break;
                    case 3:
                    {
                        Regex Spliter = new Regex(@" ");
                        string[] CityStateZip = Spliter.Split(st);
                        txtCity.Text = CityStateZip[0];
                        txtState.Text = CityStateZip[1];
                        txtZip.Text = CityStateZip[2];
                        break;
                    }
                }
            }
        }
    }

    ...

    Because the names of cities and states can consist of multiple words, you should separate the fields with a comma ",", in our case London, England, 11000. Then, in the case 3 block, change Regex Spliter=new Regex(@" "); to Regex Spliter=new Regex(@","); and trim the spaces left around each part.
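
As a rough illustration of the comma-separated variant, the sketch below splits the third line on commas and trims each piece, so multi-word names such as "New York" stay in one field. The class and method names are hypothetical, and string.Split is used here in place of the Regex from the code above.

```csharp
using System;

public static class CityStateZipParser
{
    // Hypothetical helper: splits "London, England, 11000" into trimmed fields.
    public static string[] Split(string line)
    {
        string[] parts = line.Split(',');
        for (int i = 0; i < parts.Length; i++)
        {
            // Remove the spaces that follow each comma.
            parts[i] = parts[i].Trim();
        }
        return parts;
    }
}
```

In the case 3 block this would be used as string[] CityStateZip = CityStateZipParser.Split(st); it is worth checking that the array has three elements before indexing into it.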
  • 12 years ago
    Oops ... you also have to include

    using System.IO;

  • 12 years ago

    Thanks, that makes a lot of sense. Not sure why I did not think of that. I will give it a try!
    I was looking for something totally different. Thanks for the insight.


  • 12 years ago

    Hi,
    Thanks for your code, but my problem is that I have to parse resumes that come in different formats.
    Can you give me an idea for parsing that type of Word doc file?


    Thanks,
    Haribala.
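
When the layout is not fixed, one possible direction is to stop relying on line positions and search the whole text for fields that have a recognizable shape. The sketch below is only a heuristic under assumed patterns: a phone number written as three digit groups and a US-style 5-digit zip code. The class name, method names and patterns are illustrative, and names or street addresses would need a different approach.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ResumeFieldFinder
{
    // Assumed patterns; real resumes will need broader ones.
    static readonly Regex Phone = new Regex(@"\b\d{3}[-. ]\d{3}[-. ]\d{3,4}\b");
    static readonly Regex Zip = new Regex(@"\b\d{5}(?:-\d{4})?\b");

    // Returns the first phone-like string in the text, or null if none is found.
    public static string FindPhone(string text)
    {
        return FirstMatch(Phone, text);
    }

    // Returns the first zip-like string in the text, or null if none is found.
    public static string FindZip(string text)
    {
        return FirstMatch(Zip, text);
    }

    static string FirstMatch(Regex pattern, string text)
    {
        Match m = pattern.Match(text);
        return m.Success ? m.Value : null;
    }
}
```

The results are best treated as suggestions that pre-fill the form and that the user can correct, since a 5-digit house number, for example, would also match the zip pattern.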






