After millions of rows, thousands of files, and hundreds of hours, we've learned a thing or two about working with big data sources, especially files in fixed-width or CSV formats. They have been around for a really long time, and they are universally known and supported: almost any data storage solution has dedicated tools for dealing with CSV files. That's all well and good but, believe me, there's a limit. I'm talking hundreds of MB in a single file, millions of rows. Nice and beefy.
What are my options?
As I mentioned before, this is not a new problem, and there are several potential solutions; the classic one is SQL*Loader.
What's better than SQL*Loader?
What if humans could use 100% of their brain capacity*? Well, some do, and they came up with External Tables. All jokes aside, this is just one more tool at your disposal, depending on your specific use case. Sometimes you need a hammer, and sometimes you need a screwdriver.
External Tables are somewhat of a complement to SQL*Loader. They let you access data outside the database and treat it like just another table. Sounds cool, doesn't it? Kind of a mix between SQL*Loader and a pipelined function. Files need to be in a text format, not binary (not entirely true, but for our purposes, it is), and records need to be clearly delimited, either by a line break or by a specific character. If you're dealing with nested formats such as JSON or XML, this is not for you: records in those formats are hard to tell apart without at least basic parsing, and parsing is exactly what makes a loader slow(er). External Tables are super fast; they even support partitioning (since release 12.2.0.1), which gives you the same performance benefits you get when you partition tables stored inside the database.
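To make that partitioning point concrete, here's a minimal sketch modeled on the pattern in Oracle's documentation. Everything in it (table, columns, ranges, file names) is made up for illustration:

-- Illustrative only: names, ranges, and files are invented.
CREATE TABLE SALES_EXT
(
  loc_id NUMBER,
  amount NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MISC_FILES
  ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE FIELDS CSV)
)
REJECT LIMIT UNLIMITED
PARTITION BY RANGE (loc_id)
(
  PARTITION p_west VALUES LESS THAN (1000) LOCATION ('west.csv'),  -- each partition reads its own file
  PARTITION p_east VALUES LESS THAN (2000) LOCATION ('east.csv')
);

A query that filters on loc_id only has to read the files behind the matching partitions, which is where the speedup comes from.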
Just show me the goods already.
Fine! Here it is. When creating an External Table, you need to specify a few parameters: the access driver TYPE, a DEFAULT DIRECTORY, the ACCESS PARAMETERS, and the LOCATION of the file. We'll walk through each one right after the example.
We went for the most generic approach we could think of. Instead of creating an External Table for every format we had, we used a single table with generic columns such as A, B, C... AA, AB, AC, and so on. Think of an Excel spreadsheet.
CREATE TABLE STAGE_GENERIC_CSV_EXT
(
  row_nr NUMBER,
  c_a    VARCHAR2(4000),
  c_b    VARCHAR2(4000),
  c_c    VARCHAR2(4000),
  ...
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MISC_FILES
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    READSIZE 1048576
    FIELDS CSV WITH EMBEDDED RECORD TERMINATORS
    MISSING FIELD VALUES ARE NULL
    NULLIF = '#missing'
    (
      row_nr RECNUM,
      "S_A" CHAR(10000),
      "S_B" CHAR(10000),
      "S_C" CHAR(10000),
      ...
    )
    COLUMN TRANSFORMS
    (
      C_A FROM STARTOF S_A (4000),
      C_B FROM STARTOF S_B (4000),
      C_C FROM STARTOF S_C (4000),
      ...
    )
  )
  LOCATION (MISC_FILES:'generic.csv')
)
REJECT LIMIT 0;
This doesn't make any sense to me.
Let's take it one step at a time. The first group of columns (ROW_NR, C_A, C_B, and so on) defines the columns of the table itself; they're declared like any ordinary table columns and don't depend on the contents of any particular file.
ORGANIZATION EXTERNAL
This is simply telling Oracle you're creating an External Table.
TYPE, DEFAULT DIRECTORY
We discussed these a little earlier: TYPE selects the access driver (ORACLE_LOADER in our case), and DEFAULT DIRECTORY is the directory object Oracle uses to find the files.
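If you've never set one up, a directory object is just a named pointer to a folder on the database server. A minimal sketch, assuming a made-up path and grantee:

-- Run as a user with the CREATE ANY DIRECTORY privilege.
-- The path and grantee are placeholders; adjust to your environment.
CREATE OR REPLACE DIRECTORY MISC_FILES AS '/data/misc_files';
GRANT READ, WRITE ON DIRECTORY MISC_FILES TO etl_user;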
ACCESS PARAMETERS
This is the important section, and it's specific to our use case. You can basically copy what we did, but check the official documentation if your use case doesn't match ours. You start by defining the record properties, then the field properties, and finally the column transforms; the annotated excerpt below goes through each line.
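Here's that block one more time, annotated line by line (the comments are my reading of it, not Oracle's):

RECORDS DELIMITED BY NEWLINE                 -- every line in the file is one record
READSIZE 1048576                             -- 1 MB read buffer; larger buffers mean fewer trips to disk
FIELDS CSV WITH EMBEDDED RECORD TERMINATORS  -- standard CSV rules, including quoted line breaks inside a field
MISSING FIELD VALUES ARE NULL                -- a short row fills its remaining columns with NULL instead of being rejected
NULLIF = '#missing'                          -- the literal string '#missing' loads as NULL
(
  row_nr RECNUM,                             -- RECNUM numbers each record as it is read; it consumes nothing from the file
  "S_A" CHAR(10000),                         -- oversized staging fields, so one long value doesn't reject the whole record
  ...
)
COLUMN TRANSFORMS
(
  C_A FROM STARTOF S_A (4000),               -- keep the first 4000 bytes, the most a standard VARCHAR2 column holds
  ...
)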
The Query
Finally, what wraps everything together is the query:
SELECT c_a, c_b, c_c
FROM STAGE_GENERIC_CSV_EXT
  EXTERNAL MODIFY (
    ACCESS PARAMETERS (NOBADFILE NOLOGFILE NODISCARDFILE)
    LOCATION (MISC_FILES:'CUST_DW_DELTA_RPT.CSV')
  );
Like I mentioned before, you have the option to override some of the settings you defined when you created the table. The most important one here is LOCATION: you can point to a different directory and file. Since our solution is generic, the file changes all the time, so we specify it at query time and Oracle opens that specific file. We also switch off the bad, log, and discard files (NOBADFILE NOLOGFILE NODISCARDFILE) so repeated queries don't litter the directory. The only caveat is that the DIRECTORY object needs to exist by the time you query.
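To round things out, here's a hypothetical end-to-end load (the target table, its columns, and the file name below are all made up): read the file through the External Table and write it into a regular stage table in one statement.

-- stage_customers and the file name are placeholders for illustration.
INSERT /*+ APPEND */ INTO stage_customers (cust_id, cust_name, cust_email)
SELECT c_a, c_b, c_c
FROM STAGE_GENERIC_CSV_EXT
  EXTERNAL MODIFY (
    ACCESS PARAMETERS (NOBADFILE NOLOGFILE NODISCARDFILE)
    LOCATION (MISC_FILES:'customers_20240101.csv')
  );
COMMIT;

The APPEND hint asks for a direct-path insert, which is usually what you want for bulk loads of this size.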
The more we progress as a species, the more data we produce and log. Everyone has a phone nowadays. Devices that used to be simple and analog, like tractors and traffic lights, now have GPS sensors and are network-enabled. All this new data needs to land in a centralized repository before we can analyze it and make decisions on it. CSV is a well-known format that can bring all those different sources together, no matter how massive the dataset is. Now that you know how to load big files, that's one less thing to worry about, and you can spend more time on the activities that bring value to your business. Good luck!
*This is a myth, by the way. We do use our entire brain all the time. Well, some people do. I think. I'm not sure what my numbers are. I wonder how they calculate that. Should I see a doctor? Hmmm.