ChatGPT Data Extraction: A quick demonstration

  • Published: 1 Dec 2024

Comments • 19

  • @ianmatiello
    @ianmatiello 10 months ago +4

    I have no words to thank you.
    I'm a Brazilian lawyer, and for months, maybe even more than a year, I've been looking for ways to automate the boring, repetitive, analogue tasks in my work that only waste my time, with my limited programming knowledge (almost zero).
    You were the first to teach a method that is easily understandable and applicable for those who don't have much programming knowledge, and, most importantly, it is useful.
    I work in a union, and here there is a culture almost against technology, but perhaps that is due more to ignorance of its possibilities than anything else.
    To give you an idea, we calculate how much a union member is owed in a lawsuit.
    The information is contained in a financial statement provided by the City Hall.
    Until now, the calculation has been done "by hand", that is, reading the information from the table and manually entering it into an Excel spreadsheet, which takes around one or two hours per calculation.

  • @Back_at_Bardot
    @Back_at_Bardot 1 year ago

    ❤ Finally! I've been searching for content on this sort of topic.

  • @biraescudero
    @biraescudero 1 year ago

    Great! I usually have this kind of problem and your approach is very good! Thanks for sharing!

  • @mookfaru835
    @mookfaru835 2 months ago

    Have you tried working with ChatGPT through the API? I have heard it does not have an upper limit on the amount of work it can do, just as much as you can pay for.

  • @TheHavyxon
    @TheHavyxon 1 year ago +1

    Do you think the police reports are intentionally written to be this difficult to read, specifically to frustrate data extraction?

    • @bxroberts
      @bxroberts 1 year ago +3

      The documents in question were written over a 20-year time span, between 2000 and 2020, so they were written before most people even knew about automated extraction. So I don't think it's intentional. The records suffer from a few things that trip up ChatGPT: 1) they're really messy and the OCR isn't perfect, 2) many of them are excerpts from long email chains, making context difficult to figure out, and 3) they're a mix of document types (use-of-force reports, reprimands, and memos of termination) that are all written totally differently.

    • @TheHavyxon
      @TheHavyxon 1 year ago

      @@bxroberts well I think the term "machine readable" is kinda old

    • @13statistician13
      @13statistician13 7 months ago

      No. Police reports are public and FOIA-able. An even easier method than doing all this programming in Python and extraction with ChatGPT is to go to the source database. You'd be amazed how much easier you can make your life by simply picking up the phone and contacting the information technology department at your local police department. You can usually ask them to supply you with an electronic copy of their police reports in a machine-readable format, and they will oblige. Typically, you can ask for CSV files or even a copy of their database, but more often than not they will simply provide CSV files rather than the database, since their DB design may be proprietary. You'll want to limit the result set by providing a date range. In many cases, you'll get several tables. In that case, you'll simply need to write some basic SQL code to join them, but that's super easy to learn. You could use R, SAS, or other statistical programming languages to accomplish that as well (see the sketch below this comment).
      In general, the only reason you might not get data in an easy-to-use format is because that particular PD's IT department is incompetent or resource-constrained, not because they are attempting to hide anything.
      One final note: if you do request the information, you might be expected to pay a nominal fee for the service. It's usually significantly cheaper to pay this fee than to spend the time building out often-unreliable data extraction processes.
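
      A minimal sketch of the table-joining step described above, assuming two hypothetical CSV exports (incidents.csv and officers.csv) that share an incident_id column; it uses pandas rather than SQL, since the video's workflow is in Python. File and column names are illustrative only.

      # Join two hypothetical CSV exports from a records request with pandas.
      import pandas as pd

      incidents = pd.read_csv("incidents.csv")   # e.g. one row per incident
      officers = pd.read_csv("officers.csv")     # e.g. one row per officer involved

      # The pandas equivalent of a SQL JOIN on a shared key column.
      merged = incidents.merge(officers, on="incident_id", how="left")
      merged.to_csv("incidents_with_officers.csv", index=False)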

    • @13statistician13
      @13statistician13 7 months ago

      @@TheHavyxon Huh? No. "Machine readable" is a very modern term, used by cloud engineers, data engineering teams, data scientists, and statisticians to this very day.

  • @lorenzoleongutierrez7927
    @lorenzoleongutierrez7927 1 year ago +1

    Thanks for sharing!

  • @andre-le-bone-aparte
    @andre-le-bone-aparte 1 year ago

    Excellent Content - Another sub for you sir!

  • @retrogamingplayback
    @retrogamingplayback 1 year ago

    Don't forget that on long outputs, typing "continue" is your friend. It should resume where it left off.

    • @bxroberts
      @bxroberts 1 year ago +1

      While that does work, for the purposes of data extraction I found the "continue" prompt to cause more problems than it fixed. When outputting code, asking ChatGPT to "continue" usually caused it to completely re-output the entire JSON, often in a different format and ignoring the schema. The further in a context ChatGPT gets from the prompt, the less likely it is to obey it. For long schemas, it would be better to split the schema in two and ask two separate times. Just my experience, though!
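
      A rough sketch of the schema-splitting idea above. ask_chatgpt() is a hypothetical placeholder for however you send a prompt (web UI copy/paste or the API), and the field names are made up; the point is simply to request half the schema at a time and merge the two JSON results.

      import json

      FIELDS_PART_1 = ["officer_name", "incident_date", "location"]
      FIELDS_PART_2 = ["allegation", "disposition", "discipline"]

      def ask_chatgpt(prompt):
          # Placeholder: send `prompt` to ChatGPT and return its text reply.
          raise NotImplementedError

      def extract(document_text, fields):
          # Ask for only a subset of the schema so the reply stays short.
          prompt = (
              "Extract the following fields from the document below and "
              f"return only a JSON object with these keys: {', '.join(fields)}\n\n"
              + document_text
          )
          return json.loads(ask_chatgpt(prompt))

      def extract_record(document_text):
          record = extract(document_text, FIELDS_PART_1)            # first half of the schema
          record.update(extract(document_text, FIELDS_PART_2))      # second half, merged in
          return record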

    • @retrogamingplayback
      @retrogamingplayback 1 year ago

      @@bxroberts Makes perfect sense, appreciate the tip for splitting JSON schema - hadn't thought of that.

  • @sarahbratt5178
    @sarahbratt5178 1 year ago

    Awesome video! Could you link to PDF Plumber?

    • @bxroberts
      @bxroberts 1 year ago

      Sure! It's on GitHub here: github.com/jsvine/pdfplumber
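
      A quick usage example for pdfplumber (pip install pdfplumber); "report.pdf" is just a placeholder file name.

      import pdfplumber

      # Open the PDF and print the extracted text of every page.
      with pdfplumber.open("report.pdf") as pdf:
          for page in pdf.pages:
              print(page.extract_text())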

  • @ain92ru
    @ain92ru 1 year ago

    I guess the occasional mistakes might have been caused by the temperature being set too high (unfortunately, I don't know how to change it in the ChatGPT interface because I don't use it).

    • @bxroberts
      @bxroberts 1 year ago +1

      Hello! Temperature is exposed by the GPT-3 API, but you can't change it in ChatGPT (as of Mar 2023). You can definitely improve the hallucination rate using some of the other APIs and params, but then you also need to invest more time in prompt engineering. Ultimately, even the best-tuned temperature will still exhibit some hallucination, but you're def right that it can be controlled a bit with careful parameter tuning.
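
      For reference, a sketch of setting temperature through the API, using the openai Python package's pre-1.0 interface (roughly what existed when this thread was written); newer versions of the library expose a different client, and the model name and prompt here are only placeholders.

      import openai

      openai.api_key = "YOUR_API_KEY"  # placeholder

      response = openai.Completion.create(
          model="text-davinci-003",    # placeholder model name
          prompt="Extract the officer name and incident date as JSON:\n\n<document text>",
          temperature=0,               # lower temperature -> more deterministic output
          max_tokens=500,
      )
      print(response["choices"][0]["text"])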